This document provides comprehensive operational procedures for cloud teams managing Quilt platform deployments.
- Service Down: Jump to Service Recovery
- Storage Full: Jump to Emergency Scaling
- Security Incident: Jump to Security Response
- Health check (5 min) - Instructions
- Monitor alerts - Dashboard Links
- Review logs - Log Locations
- Backup verification (10 min) - Instructions
- Security updates (15 min) - Instructions
- Capacity review (10 min) - Instructions
- Capacity planning (20 min) - Instructions
- Cost review (15 min) - Instructions
- Security audit (30 min) - Instructions
Infrastructure Requirements
- AWS Account with appropriate permissions
- SSL Certificate in AWS Certificate Manager
- Domain name with DNS control
- S3 bucket for Terraform state
- CloudFormation template from Quilt
Team Requirements
- Terraform >= 1.5.0 installed
- AWS CLI configured
- Git repository for configuration
- Access to monitoring systems
- Emergency contact list updated
1. Create Project Structure
# Create dedicated directory
mkdir quilt-${ENVIRONMENT}
cd quilt-${ENVIRONMENT}
# Initialize git
git init
git remote add origin https://github.com/yourorg/quilt-${ENVIRONMENT}.git
# Create initial structure
mkdir -p {scripts,docs,backups}2. Download and Configure Templates
# Download main configuration
curl -o main.tf https://raw.githubusercontent.com/quiltdata/iac/main/examples/main.tf
# Create environment-specific variables
cat > terraform.tfvars << EOF
# Environment: ${ENVIRONMENT}
# Created: $(date)
# Owner: ${TEAM_NAME}
google_client_secret = "${GOOGLE_CLIENT_SECRET}"
okta_client_secret = "${OKTA_CLIENT_SECRET}"
EOF
# Create .gitignore
cat > .gitignore << 'EOF'
.terraform/
*.tfplan
*.tfstate
*.tfstate.backup
terraform.tfvars
.terraform.lock.hcl
*.log
EOF3. Configure main.tf
# Edit main.tf with your values
vim main.tf
# Required updates:
# - allowed_account_ids: ["YOUR_AWS_ACCOUNT_ID"]
# - region: "YOUR_AWS_REGION"
# - bucket: "YOUR_TERRAFORM_STATE_BUCKET"
# - local.name: "quilt-${ENVIRONMENT}"
# - local.quilt_web_host: "data-${ENVIRONMENT}.YOUR-COMPANY.com"
# - CertificateArnELB: "YOUR_CERTIFICATE_ARN"
# - AdminEmail: "admin@YOUR-COMPANY.com"
# - zone_id: "YOUR_ROUTE53_ZONE_ID"1. Initialize Terraform
# Initialize
terraform init
# Validate syntax
terraform validate
# Format code
terraform fmt
# Security scan (if available)
# tfsec . || echo "Security scan not available"2. Plan Deployment
# Create plan
terraform plan -out=tfplan
# Review plan output carefully
# Check for:
# - No unexpected deletions
# - Correct resource counts
# - Proper naming conventions
# - Security group configurations3. Deploy Infrastructure
# Apply configuration
terraform apply tfplan
# Monitor deployment progress
# Typical deployment time: 20-30 minutes
# Watch AWS Console for resource creation status4. Verify Deployment
# Get deployment outputs
terraform output admin_password > admin_password.txt
chmod 600 admin_password.txt
terraform output quilt_url
# Test connectivity
QUILT_URL=$(terraform output -raw quilt_url)
curl -I "${QUILT_URL}" || echo "Service not yet ready"
# Wait for service to be fully ready (up to 10 minutes)
for i in {1..20}; do
if curl -f "${QUILT_URL}/health" >/dev/null 2>&1; then
echo "Service is ready!"
break
fi
echo "Waiting for service... (attempt $i/20)"
sleep 30
done1. Access Quilt Catalog
# Get credentials
ADMIN_EMAIL=$(terraform output -raw admin_email)
ADMIN_PASSWORD=$(cat admin_password.txt)
QUILT_URL=$(terraform output -raw quilt_url)
echo "Access Quilt at: ${QUILT_URL}"
echo "Username: ${ADMIN_EMAIL}"
echo "Password: ${ADMIN_PASSWORD}"2. Initial Setup Tasks
- Login to Quilt catalog
- Change default admin password
- Configure organization settings
- Set up initial users and permissions
- Test basic functionality (upload/download)
3. Documentation
# Create deployment documentation
cat > docs/deployment-info.md << EOF
# Quilt Deployment Information
**Environment**: ${ENVIRONMENT}
**Deployed**: $(date)
**Version**: $(git describe --tags 2>/dev/null || echo "latest")
## Access Information
- **URL**: ${QUILT_URL}
- **Admin Email**: ${ADMIN_EMAIL}
- **AWS Region**: $(terraform output -raw aws_region 2>/dev/null || echo "YOUR-AWS-REGION")
## Infrastructure
- **VPC ID**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .child_modules[] | select(.address=="module.vpc") | .resources[] | select(.type=="aws_vpc") | .values.id' 2>/dev/null || echo "N/A")
- **Database**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .child_modules[] | select(.address=="module.db") | .resources[] | select(.type=="aws_db_instance") | .values.identifier' 2>/dev/null || echo "N/A")
- **ElasticSearch**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .child_modules[] | select(.address=="module.search") | .resources[] | select(.type=="aws_elasticsearch_domain") | .values.domain_name' 2>/dev/null || echo "N/A")
## Next Steps
1. Configure monitoring and alerting
2. Set up backup procedures
3. Configure user access
4. Test disaster recovery procedures
EOFAutomated Health Check Script
#!/bin/bash
# File: scripts/daily-health-check.sh
set -e
QUILT_URL=$(terraform output -raw quilt_url)
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
echo "=== Daily Health Check - $(date) ==="
# 1. Infrastructure Status
echo "Checking infrastructure status..."
terraform refresh >/dev/null
if terraform plan -detailed-exitcode >/dev/null; then
echo "✓ Infrastructure: No drift detected"
else
echo "⚠ Infrastructure: Drift detected - review required"
fi
# 2. Application Health
echo "Checking application health..."
if curl -f "${QUILT_URL}/health" >/dev/null 2>&1; then
echo "✓ Application: Healthy"
else
echo "✗ Application: Health check failed"
fi
# 3. ElasticSearch Health
echo "Checking ElasticSearch health..."
ES_STATUS=$(aws es describe-elasticsearch-domain --domain-name "${STACK_NAME}" --query 'DomainStatus.Processing' --output text 2>/dev/null || echo "unknown")
if [ "$ES_STATUS" = "False" ]; then
echo "✓ ElasticSearch: Ready"
else
echo "⚠ ElasticSearch: Processing or unavailable"
fi
# 4. Database Health
echo "Checking database health..."
DB_STATUS=$(aws rds describe-db-instances --db-instance-identifier "${STACK_NAME}" --query 'DBInstances[0].DBInstanceStatus' --output text 2>/dev/null || echo "unknown")
if [ "$DB_STATUS" = "available" ]; then
echo "✓ Database: Available"
else
echo "⚠ Database: Status is $DB_STATUS"
fi
echo "=== Health Check Complete ==="Usage:
# Make script executable
chmod +x scripts/daily-health-check.sh
# Run daily health check
./scripts/daily-health-check.sh
# Add to crontab for automation
# 0 9 * * * cd /path/to/quilt-prod && ./scripts/daily-health-check.sh >> logs/health-check.log 2>&1Application Logs
# View ECS service logs (log group name matches stack name)
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
aws logs describe-log-groups --log-group-name-prefix "${STACK_NAME}"
# Stream recent logs
aws logs tail "${STACK_NAME}" --follow
# Search for errors
aws logs filter-log-events \
--log-group-name "${STACK_NAME}" \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"Infrastructure Logs
# CloudTrail logs (stored in S3 bucket, not CloudWatch Logs by default)
# To query CloudTrail events, use AWS CLI or Athena queries against the S3 bucket
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
# List CloudTrail events using AWS CLI
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue="${STACK_NAME}" \
--max-results 50
# Alternative: Query CloudTrail S3 bucket directly (requires CloudTrail bucket name)
# aws s3 ls s3://your-cloudtrail-bucket/AWSLogs/<account-id>/CloudTrail/
# Note: VPC Flow Logs are not enabled by default in the CloudFormation template
# To enable, configure VPC Flow Logs manually in the VPC settingsDatabase Backup Check
#!/bin/bash
# File: scripts/verify-backups.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
echo "=== Backup Verification - $(date) ==="
# Check RDS automated backups
echo "Checking RDS automated backups..."
BACKUP_COUNT=$(aws rds describe-db-snapshots \
--db-instance-identifier "${STACK_NAME}" \
--snapshot-type automated \
--query 'length(DBSnapshots)' \
--output text)
if [ "$BACKUP_COUNT" -gt 0 ]; then
echo "✓ RDS: $BACKUP_COUNT automated backups found"
# Show latest backup
LATEST_BACKUP=$(aws rds describe-db-snapshots \
--db-instance-identifier "${STACK_NAME}" \
--snapshot-type automated \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-1].{Time:SnapshotCreateTime,Status:Status}' \
--output table)
echo "$LATEST_BACKUP"
else
echo "⚠ RDS: No automated backups found"
fi
# Check manual snapshots
MANUAL_COUNT=$(aws rds describe-db-snapshots \
--db-instance-identifier "${STACK_NAME}" \
--snapshot-type manual \
--query 'length(DBSnapshots)' \
--output text)
echo "✓ RDS: $MANUAL_COUNT manual snapshots available"
# Verify Terraform state backup
echo "Checking Terraform state backup..."
# Extract backend bucket from terraform configuration (works with Terraform 1.1+)
STATE_BUCKET=$(grep -A 5 'backend "s3"' *.tf 2>/dev/null | grep 'bucket' | sed 's/.*bucket.*=.*"\(.*\)".*/\1/' | head -1 || echo "unknown")
if [ "$STATE_BUCKET" != "unknown" ] && aws s3 ls "s3://${STATE_BUCKET}/quilt/terraform.tfstate" >/dev/null 2>&1; then
echo "✓ Terraform: State file accessible"
else
echo "⚠ Terraform: State file not accessible or backend not configured"
fi
echo "=== Backup Verification Complete ==="Update Check Script
#!/bin/bash
# File: scripts/check-updates.sh
echo "=== Security Update Check - $(date) ==="
# Check Terraform module updates
echo "Checking Terraform module updates..."
CURRENT_VERSION=$(grep 'ref=' main.tf | head -1 | sed 's/.*ref=//' | tr -d '"')
echo "Current version: $CURRENT_VERSION"
# Check latest release (requires gh CLI or manual check)
echo "Latest releases: https://github.com/quiltdata/iac/releases"
# Check AWS security bulletins
echo "Check AWS security bulletins:"
echo "- https://aws.amazon.com/security/security-bulletins/"
echo "- Review RDS, ElasticSearch, and ECS security updates"
# Check for outdated AMIs
echo "Checking ECS optimized AMI updates..."
aws ssm get-parameter \
--name "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended" \
--query 'Parameter.Value' \
--output text | jq '.image_id'
echo "=== Update Check Complete ==="Capacity Analysis Script
#!/bin/bash
# File: scripts/capacity-analysis.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
START_TIME=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S)
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%S)
echo "=== Capacity Analysis - $(date) ==="
echo "Analysis period: $START_TIME to $END_TIME"
# ElasticSearch storage utilization
echo "ElasticSearch Storage Utilization:"
aws cloudwatch get-metric-statistics \
--namespace AWS/ES \
--metric-name StorageUtilization \
--dimensions Name=DomainName,Value="$STACK_NAME" Name=ClientId,Value="$(aws sts get-caller-identity --query Account --output text)" \
--start-time "$START_TIME" \
--end-time "$END_TIME" \
--period 86400 \
--statistics Average,Maximum \
--query 'Datapoints | sort_by(@, &Timestamp) | [-5:]' \
--output table
# RDS CPU utilization
echo "RDS CPU Utilization:"
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value="$STACK_NAME" \
--start-time "$START_TIME" \
--end-time "$END_TIME" \
--period 86400 \
--statistics Average,Maximum \
--query 'Datapoints | sort_by(@, &Timestamp) | [-5:]' \
--output table
# RDS connections
echo "RDS Database Connections:"
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value="$STACK_NAME" \
--start-time "$START_TIME" \
--end-time "$END_TIME" \
--period 86400 \
--statistics Average,Maximum \
--query 'Datapoints | sort_by(@, &Timestamp) | [-5:]' \
--output table
# Generate recommendations
echo "=== Capacity Recommendations ==="
echo "Review the above metrics and consider:"
echo "- ElasticSearch storage > 80%: Plan storage scaling"
echo "- RDS CPU > 70%: Consider instance upgrade"
echo "- RDS connections > 80% of max: Review connection pooling"
echo "=== Capacity Analysis Complete ==="When: Storage utilization > 90% or disk space alerts
Immediate Actions (15 minutes)
#!/bin/bash
# File: scripts/emergency-es-scaling.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
echo "=== EMERGENCY: ElasticSearch Storage Scaling ==="
# 1. Check current status
echo "Current ElasticSearch status:"
aws es describe-elasticsearch-domain --domain-name "$STACK_NAME" \
--query 'DomainStatus.{Status:DomainStatus,Storage:EBSOptions,Processing:Processing}' \
--output table
# 2. Get current storage size
CURRENT_SIZE=$(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="search_volume_size") | .values // empty')
echo "Current storage size: ${CURRENT_SIZE}GB"
# 3. Calculate new size (increase by 50%)
NEW_SIZE=$((CURRENT_SIZE * 3 / 2))
echo "Recommended new size: ${NEW_SIZE}GB"
# 4. Update configuration
echo "Updating main.tf..."
sed -i.backup "s/search_volume_size.*=.*/search_volume_size = $NEW_SIZE/" main.tf
# 5. Plan and apply
echo "Planning scaling operation..."
terraform plan -out=emergency-scaling.tfplan
echo "Ready to apply scaling. This will cause temporary performance impact."
echo "Run: terraform apply emergency-scaling.tfplan"
echo "Monitor progress with: aws es describe-elasticsearch-domain --domain-name $STACK_NAME"When: CPU > 90% or connection limit reached
Immediate Actions (10 minutes)
#!/bin/bash
# File: scripts/emergency-db-scaling.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
echo "=== EMERGENCY: Database Scaling ==="
# 1. Check current status
echo "Current database status:"
aws rds describe-db-instances --db-instance-identifier "$STACK_NAME" \
--query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,CPU:ProcessorFeatures}' \
--output table
# 2. Get current instance class
CURRENT_CLASS=$(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="db_instance_class") | .values // empty')
echo "Current instance class: $CURRENT_CLASS"
# 3. Suggest upgrade path
case $CURRENT_CLASS in
"db.t3.micro") NEW_CLASS="db.t3.small" ;;
"db.t3.small") NEW_CLASS="db.t3.medium" ;;
"db.t3.medium") NEW_CLASS="db.t3.large" ;;
"db.t3.large") NEW_CLASS="db.r5.large" ;;
*) NEW_CLASS="db.r5.xlarge" ;;
esac
echo "Recommended upgrade: $CURRENT_CLASS -> $NEW_CLASS"
# 4. Update configuration
echo "Updating main.tf..."
sed -i.backup "s/db_instance_class.*=.*/db_instance_class = \"$NEW_CLASS\"/" main.tf
echo "Ready to apply scaling. This will cause brief downtime."
echo "Run: terraform plan -out=emergency-db-scaling.tfplan"
echo "Then: terraform apply emergency-db-scaling.tfplan"Capacity Planning Worksheet
#!/bin/bash
# File: scripts/quarterly-capacity-review.sh
echo "=== Quarterly Capacity Review - $(date) ==="
# Data collection period (90 days)
START_TIME=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%S)
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%S)
# Create capacity report
cat > "capacity-report-$(date +%Y%m%d).md" << EOF
# Quarterly Capacity Review
**Period**: $START_TIME to $END_TIME
**Generated**: $(date)
## Current Configuration
### ElasticSearch
- **Instance Count**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="search_instance_count") | .values // "N/A"')
- **Instance Type**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="search_instance_type") | .values // "N/A"')
- **Volume Size**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="search_volume_size") | .values // "N/A"')GB
### Database
- **Instance Class**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="db_instance_class") | .values // "N/A"')
- **Multi-AZ**: $(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.quilt") | .resources[] | select(.name=="db_multi_az") | .values // "N/A"')
## Utilization Metrics
### Storage Utilization Trends
EOF
# Add metrics to report
aws cloudwatch get-metric-statistics \
--namespace AWS/ES \
--metric-name StorageUtilization \
--dimensions Name=DomainName,Value="$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")" \
--start-time "$START_TIME" \
--end-time "$END_TIME" \
--period 604800 \
--statistics Average,Maximum \
--output table >> "capacity-report-$(date +%Y%m%d).md"
echo "Capacity report generated: capacity-report-$(date +%Y%m%d).md"
echo "Review and plan scaling operations based on trends."Recovery Procedure (30-60 minutes)
Step 1: Assess Situation (5 minutes)
# Check overall service status
QUILT_URL=$(terraform output -raw quilt_url)
curl -I "$QUILT_URL" || echo "Service unreachable"
# Check infrastructure components
terraform refresh
terraform plan -detailed-exitcode || echo "Infrastructure drift detected"
# Check AWS service health
aws health describe-events --filter services=EC2,RDS,ES --query 'events[?eventTypeCategory==`issue`]'Step 2: Component-Level Diagnosis (10 minutes)
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
# Check ECS services
aws ecs describe-services --cluster "$STACK_NAME" --services "$STACK_NAME"
# Check database
aws rds describe-db-instances --db-instance-identifier "$STACK_NAME"
# Check ElasticSearch
aws es describe-elasticsearch-domain --domain-name "$STACK_NAME"
# Check load balancer
aws elbv2 describe-load-balancers --names "$STACK_NAME"Step 3: Recovery Actions
# If infrastructure drift detected
terraform apply -auto-approve
# If ECS service issues
aws ecs update-service --cluster "$STACK_NAME" --service "$STACK_NAME" --force-new-deployment
# If database issues
# (Manual intervention required - check RDS console)
# If ElasticSearch issues
# (Manual intervention required - check ES console)Recovery Procedure (45-90 minutes)
Step 1: Identify Recovery Point
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier "$STACK_NAME" \
--query 'DBSnapshots | sort_by(@, &SnapshotCreateTime) | [-10:]' \
--output table
# Select snapshot for recovery
read -p "Enter snapshot identifier: " SNAPSHOT_IDStep 2: Update Configuration
# Backup current configuration
cp main.tf main.tf.backup
# Add snapshot identifier to configuration
sed -i '/db_instance_class/a \ \ db_snapshot_identifier = "'$SNAPSHOT_ID'"' main.tf
# Plan recovery
terraform plan -out=recovery.tfplanStep 3: Execute Recovery
# Apply recovery configuration
terraform apply recovery.tfplan
# Monitor recovery progress
aws rds describe-db-instances --db-instance-identifier "$STACK_NAME" \
--query 'DBInstances[0].{Status:DBInstanceStatus,Progress:StatusInfos}'Step 4: Verify Recovery
# Test database connectivity
QUILT_URL=$(terraform output -raw quilt_url)
curl -f "$QUILT_URL/health" || echo "Service not ready yet"
# Remove snapshot identifier from config
sed -i '/db_snapshot_identifier/d' main.tfImmediate Actions (15 minutes)
Step 1: Isolate and Assess
# Get current security group rules
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
aws ec2 describe-security-groups --filters "Name=group-name,Values=*$STACK_NAME*"
# Check recent access logs (log group name matches stack name)
aws logs filter-log-events \
--log-group-name "$STACK_NAME" \
--start-time $(date -d '2 hours ago' +%s)000 \
--filter-pattern "ERROR"
# Check CloudTrail for suspicious activity (using CloudTrail lookup-events)
aws cloudtrail lookup-events \
--start-time $(date -d '24 hours ago' +%s) \
--max-results 100 \
--query 'Events[?Username!=`null`]'Step 2: Immediate Containment
# Temporarily restrict access (if needed)
# Update security groups to allow only known IPs
# This should be done through Terraform for consistency
# Force password reset for admin account
# (Done through Quilt UI)
# Rotate API keys and secrets
# (Update terraform.tfvars and re-apply)Step 3: Evidence Collection
# Export recent logs (log group name matches stack name)
aws logs create-export-task \
--log-group-name "$STACK_NAME" \
--from $(date -d '7 days ago' +%s)000 \
--to $(date +%s)000 \
--destination "security-incident-$(date +%Y%m%d)" \
--destination-prefix "logs/"
# Snapshot current state
terraform show > "terraform-state-$(date +%Y%m%d-%H%M).txt"
# Create database snapshot
aws rds create-db-snapshot \
--db-instance-identifier "$STACK_NAME" \
--db-snapshot-identifier "$STACK_NAME-incident-$(date +%Y%m%d)"#!/bin/bash
# File: scripts/setup-monitoring.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)
# Create CloudWatch dashboard
cat > dashboard-config.json << EOF
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ES", "StorageUtilization", "DomainName", "$STACK_NAME", "ClientId", "$AWS_ACCOUNT"],
["AWS/ES", "ClusterStatus.yellow", "DomainName", "$STACK_NAME", "ClientId", "$AWS_ACCOUNT"],
["AWS/ES", "ClusterStatus.red", "DomainName", "$STACK_NAME", "ClientId", "$AWS_ACCOUNT"]
],
"period": 300,
"stat": "Average",
"region": "$AWS_REGION",
"title": "ElasticSearch Health"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "$STACK_NAME"],
["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "$STACK_NAME"],
["AWS/RDS", "FreeStorageSpace", "DBInstanceIdentifier", "$STACK_NAME"]
],
"period": 300,
"stat": "Average",
"region": "$AWS_REGION",
"title": "RDS Performance"
}
}
]
}
EOF
# Create dashboard
aws cloudwatch put-dashboard \
--dashboard-name "Quilt-$STACK_NAME" \
--dashboard-body file://dashboard-config.json
echo "Dashboard created: https://console.aws.amazon.com/cloudwatch/home?region=$AWS_REGION#dashboards:name=Quilt-$STACK_NAME"#!/bin/bash
# File: scripts/setup-alerts.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# Create SNS topic for alerts
SNS_TOPIC_ARN=$(aws sns create-topic --name "quilt-$STACK_NAME-alerts" --query 'TopicArn' --output text)
# Subscribe email to topic
read -p "Enter email for alerts: " ALERT_EMAIL
aws sns subscribe --topic-arn "$SNS_TOPIC_ARN" --protocol email --notification-endpoint "$ALERT_EMAIL"
# ElasticSearch storage alert
aws cloudwatch put-metric-alarm \
--alarm-name "Quilt-$STACK_NAME-ES-Storage-High" \
--alarm-description "ElasticSearch storage utilization > 80%" \
--metric-name StorageUtilization \
--namespace AWS/ES \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=DomainName,Value="$STACK_NAME" Name=ClientId,Value="$AWS_ACCOUNT" \
--evaluation-periods 2 \
--alarm-actions "$SNS_TOPIC_ARN"
# RDS CPU alert
aws cloudwatch put-metric-alarm \
--alarm-name "Quilt-$STACK_NAME-RDS-CPU-High" \
--alarm-description "RDS CPU utilization > 80%" \
--metric-name CPUUtilization \
--namespace AWS/RDS \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=DBInstanceIdentifier,Value="$STACK_NAME" \
--evaluation-periods 2 \
--alarm-actions "$SNS_TOPIC_ARN"
echo "Alerts configured. Check email for subscription confirmation."#!/bin/bash
# File: scripts/performance-monitoring.sh
QUILT_URL=$(terraform output -raw quilt_url)
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
echo "=== Performance Monitoring Report - $(date) ==="
# Response time check
echo "Checking response times..."
for endpoint in "/" "/health" "/api/status"; do
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "$QUILT_URL$endpoint" 2>/dev/null || echo "failed")
echo " $endpoint: ${RESPONSE_TIME}s"
done
# ECS service metrics
echo "ECS Service Status:"
aws ecs describe-services \
--cluster "$STACK_NAME" \
--services "$STACK_NAME" \
--query 'services[0].{Running:runningCount,Desired:desiredCount,Status:status}' \
--output table
# Load balancer metrics
echo "Load Balancer Health:"
aws elbv2 describe-target-health \
--target-group-arn "$(aws elbv2 describe-target-groups --names "$STACK_NAME" --query 'TargetGroups[0].TargetGroupArn' --output text)" \
--query 'TargetHealthDescriptions[].{Target:Target.Id,Health:TargetHealth.State}' \
--output table
echo "=== Performance Monitoring Complete ==="#!/bin/bash
# File: scripts/cost-analysis.sh
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
CURRENT_MONTH=$(date +%Y-%m-01)
NEXT_MONTH=$(date -d "next month" +%Y-%m-01)
echo "=== Monthly Cost Analysis - $(date) ==="
# Get costs by service for current month
aws ce get-cost-and-usage \
--time-period Start="$CURRENT_MONTH",End="$NEXT_MONTH" \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--filter file://<(cat << EOF
{
"Dimensions": {
"Key": "RESOURCE_ID",
"Values": ["*$STACK_NAME*"]
}
}
EOF
) \
--query 'ResultsByTime[0].Groups[?Metrics.BlendedCost.Amount>`0`]' \
--output table
# Get daily costs for trend analysis
aws ce get-cost-and-usage \
--time-period Start="$(date -d '30 days ago' +%Y-%m-%d)",End="$(date +%Y-%m-%d)" \
--granularity DAILY \
--metrics BlendedCost \
--filter file://<(cat << EOF
{
"Dimensions": {
"Key": "RESOURCE_ID",
"Values": ["*$STACK_NAME*"]
}
}
EOF
) \
--query 'ResultsByTime[-7:].{Date:TimePeriod.Start,Cost:Total.BlendedCost.Amount}' \
--output table
echo "=== Cost Analysis Complete ==="#!/bin/bash
# File: scripts/cost-optimization.sh
echo "=== Cost Optimization Recommendations ==="
# Check for unused EBS volumes
echo "Checking for unused EBS volumes..."
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[].{VolumeId:VolumeId,Size:Size,Type:VolumeType,CreateTime:CreateTime}' \
--output table
# Check for unattached Elastic IPs
echo "Checking for unattached Elastic IPs..."
aws ec2 describe-addresses \
--query 'Addresses[?!InstanceId].{IP:PublicIp,AllocationId:AllocationId}' \
--output table
# RDS instance recommendations
echo "RDS Instance Analysis:"
STACK_NAME=$(terraform output -raw stack_name 2>/dev/null || echo "quilt-prod")
aws rds describe-db-instances \
--db-instance-identifier "$STACK_NAME" \
--query 'DBInstances[0].{Class:DBInstanceClass,MultiAZ:MultiAZ,Storage:AllocatedStorage,IOPS:Iops}' \
--output table
echo "Optimization Recommendations:"
echo "1. Consider Reserved Instances for long-running resources"
echo "2. Review ElasticSearch instance types for right-sizing"
echo "3. Implement S3 lifecycle policies for log retention"
echo "4. Use Spot Instances for non-critical workloads"
echo "=== Cost Optimization Complete ==="- AWS IAM user created with appropriate permissions
- Terraform access configured
- Git repository access granted
- Monitoring dashboard access provided
- Emergency contact list updated
- Training on procedures completed
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:*",
"rds:*",
"es:*",
"ecs:*",
"elbv2:*",
"cloudwatch:*",
"logs:*",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": "*",
"Condition": {
"StringLike": {
"aws:RequestedRegion": ["YOUR-AWS-REGION", "YOUR-BACKUP-REGION"]
}
}
}
]
}-
Planning Phase
- Document proposed changes
- Assess impact and risks
- Schedule maintenance window
- Notify stakeholders
-
Testing Phase
- Test changes in development environment
- Validate rollback procedures
- Review with team lead
-
Implementation Phase
- Execute during maintenance window
- Monitor for issues
- Validate successful deployment
- Update documentation
-
Post-Implementation
- Confirm system stability
- Update runbooks if needed
- Conduct lessons learned review
-
Immediate Assessment (5 minutes)
- Identify scope and urgency
- Notify team lead and stakeholders
- Document issue and proposed solution
-
Rapid Implementation (15-30 minutes)
- Execute emergency change
- Monitor system stability
- Document actions taken
-
Post-Emergency Review (within 24 hours)
- Conduct root cause analysis
- Update procedures to prevent recurrence
- Schedule formal change if needed
This operations guide provides comprehensive procedures for cloud teams to successfully deploy, maintain, and scale the Quilt platform. Regular review and updates of these procedures ensure continued operational excellence.