rinafcode · RUKAYAT-CODER · May 30, 2026 · May 29, 2026 · May 29, 2026
diff --git a/.gitignore b/.gitignore
@@ -80,3 +80,21 @@ tsconfig.build.tsbuildinfo
 
 # Backups
 /backups
+
+# AI assistant / tooling artifacts
+.claude/
+.claude/**
+CLAUDE.md
+.cursor/
+.cursorrules
+.cursorignore
+.aider*
+.windsurfrules
+.github/copilot-instructions.md
+.continue/
+.codeium/
+.gemini/
+.specstory/
+.roo/
+.kilocode/
+.augment/
diff --git a/dr/README.md b/dr/README.md
@@ -7,12 +7,17 @@ Welcome to the TeachLink Disaster Recovery (DR) documentation. This directory co
 ### Planning & Strategy
 - **[RTO/RPO Definitions](./procedures/RTO-RPO.md)** — Recovery Time and Point Objectives, alert thresholds
 - **[Failover Plan](./procedures/failover-plan.md)** — Failover procedures, failback strategy, infrastructure requirements
+- **[Data Replication Strategy](./procedures/data-replication.md)** — Cross-region RDS replica, S3 CRR, cache/state handling
 
 ### Incident Response
 - **[Database Failure Runbook](./runbooks/database-failure.md)** — PostgreSQL failures, connection issues, data integrity problems
 - **[Region Outage Runbook](./runbooks/region-outage.md)** — AWS region unavailability, cross-region failover procedures
+- **[Multi-Region Deployment Runbook](./runbooks/multi-region-deployment.md)** — Deploy, drill, fail over and fail back the two-region topology
 - **[Data Corruption Runbook](./runbooks/data-corruption.md)** — Data inconsistency, corruption detection, point-in-time recovery
 
+### Infrastructure as Code
+- **[Multi-Region Terraform](../tf/multi-region/README.md)** — Active/standby deployment across two regions (issue #620)
+
 ## 🎯 Recovery Objectives at a Glance
 
 | Objective | Target | Notes |

diff --git a/dr/procedures/data-replication.md b/dr/procedures/data-replication.md
@@ -0,0 +1,112 @@
+# Data Replication Strategy
+
+This document describes how TeachLink data is replicated across regions to meet
+the recovery objectives in [RTO-RPO.md](./RTO-RPO.md) and enable the failover
+flow in [failover-plan.md](./failover-plan.md).
+
+Implemented by [`tf/multi-region`](../../tf/multi-region) — issue **#620**.
+
+---
+
+## Summary
+
+| Data store | Mechanism | Direction | RPO | On failover |
+| ---------- | --------- | --------- | --- | ----------- |
+| PostgreSQL (RDS) | Cross-region **read replica** | primary → secondary (continuous) | seconds | Promote replica to standalone primary |
+| Object storage (S3) | **Cross-Region Replication (CRR)** | primary → secondary (async) | seconds–minutes | Already present in secondary bucket |
+| Redis (ElastiCache) | Independent standby (no replication) | n/a | n/a (cache) | Warm standby; repopulates from DB |
+| Terraform state | S3 versioning + DynamoDB lock | n/a | n/a | Restore from versioned state |
+
+---
+
+## 1. Database: RDS cross-region read replica
+
+The primary PostgreSQL instance lives in the primary region. A **cross-region
+read replica** is provisioned in the secondary region by
+[`tf/modules/database-replica`](../../tf/modules/database-replica).
+
+- **How it works**: RDS streams the primary's write-ahead log to the replica
+  asynchronously, typically keeping it within a few seconds of the primary
+  (`ReplicaLag` metric). This gives an effective **RPO of seconds**, a large
+  improvement over the backup-only RPO of up to 7 days.
+- **Encryption**: the source is encrypted, so the replica uses a dedicated
+  KMS key created in the secondary region (cross-region requirement).
+- **Backups**: the replica keeps a 7-day backup retention so it can be promoted
+  and itself replicated after a failover.
+- **On failover**: `infra/scripts/failover.sh activate` runs
+  `aws rds promote-read-replica`, converting the replica into a standalone
+  read/write primary. Promotion is irreversible — failback requires re-seeding
+  the original primary (see failover-plan.md).
+
+### Monitoring replica lag
+
+```bash
+aws cloudwatch get-metric-statistics \
+  --namespace AWS/RDS --metric-name ReplicaLag \
+  --dimensions Name=DBInstanceIdentifier,Value=teachlink-prod-db-replica \
+  --statistics Average --period 60 \
+  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
+  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
+  --region us-west-2
+```
+
+Alert when `ReplicaLag > 5s` (RPO target). The monthly drill
+(`infra/scripts/failover-drill.sh`) asserts this automatically.
+
+---
+
+## 2. Object storage: S3 Cross-Region Replication
+
+Both the `uploads` and `backups` buckets replicate from the primary region to
+their secondary-region counterparts via
+[`tf/modules/replication`](../../tf/modules/replication).
+
+- **Prerequisite**: versioning is enabled on source and destination (the
+  storage module already does this).
+- **IAM**: a scoped replication role grants S3 permission to read source object
+  versions and write them to the destination.
+- **Scope**: all objects (`prefix = ""`), including delete markers, so deletes
+  propagate.
+- **Storage class**: replicated objects land in `STANDARD_IA` to reduce cost.
+- **Latency**: replication is asynchronous (usually seconds to minutes). Objects
+  written immediately before a regional failure may not have replicated — this
+  is the S3 RPO and is acceptable for uploads/backups.
+
+> Note: CRR only replicates objects written **after** the rule is enabled. For a
+> brand-new secondary bucket, run a one-time `aws s3 sync` (or S3 Batch
+> Replication) to backfill existing objects.
+
+---
+
+## 3. Cache: Redis standby
+
+ElastiCache holds ephemeral cache data, so it is **not** replicated across
+regions. The secondary region runs an independent standby Redis cluster that is
+empty until failover, after which it repopulates naturally from the promoted
+database. This avoids the cost and complexity of a Global Datastore for
+non-durable data.
+
+---
+
+## 4. Terraform state
+
+State is stored in a versioned S3 bucket with a DynamoDB lock table (see
+[`tf/README.md`](../../tf/README.md)). Versioning provides point-in-time recovery
+of the state file itself. Use a **distinct state key** for the multi-region
+configuration to avoid clobbering the single-region state.
+
+---
+
+## Verification
+
+| Check | Command / Tool |
+| ----- | -------------- |
+| Replica exists & lag OK | `infra/scripts/failover-drill.sh` |
+| CRR enabled | `aws s3api get-bucket-replication --bucket <bucket>` |
+| End-to-end failover | Quarterly drill (see [dr/README.md](../README.md)) |
+
+---
+
+**Document Version**: 1.0
+**Owner**: Platform Engineering
+**Related**: [failover-plan.md](./failover-plan.md), [RTO-RPO.md](./RTO-RPO.md), [region-outage runbook](../runbooks/region-outage.md)
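
Note on the replica-lag check in `dr/procedures/data-replication.md`: the `date -d` form in the monitoring command is GNU-specific, and the `ReplicaLag > 5s` threshold is a decimal comparison that the shell cannot do natively. A minimal sketch of both pieces follows, assuming two hypothetical helpers, `iso_utc_minutes_ago` and `lag_within_rpo`, that are not part of the repository:

```bash
# Hypothetical helpers (names are illustrative, not part of the repository).
# Written to run under bash or a POSIX sh.

# Print an ISO-8601 UTC timestamp N minutes in the past.
# Tries GNU date first, then falls back to BSD/macOS date syntax.
iso_utc_minutes_ago() {
  date -u -d "$1 minutes ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v "-${1}M" +%Y-%m-%dT%H:%M:%SZ
}

# Succeed only when $1 is a non-negative number no greater than $2.
# Non-numeric input (for example "None" when no datapoints exist) fails.
lag_within_rpo() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$' || return 1
  awk -v lag="$1" -v max="$2" 'BEGIN { exit !(lag <= max) }'
}
```

A caller could pass `"$(iso_utc_minutes_ago 60)"` as `--start-time` and then run `lag_within_rpo "$lag" 5 || echo "ReplicaLag above RPO target"`. Rejecting non-numeric input first means a missing metric reads as a failure rather than a pass.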
diff --git a/dr/runbooks/multi-region-deployment.md b/dr/runbooks/multi-region-deployment.md
@@ -0,0 +1,110 @@
+# Runbook: Multi-Region Deployment & Failover Operations
+
+Operational runbook for the active/standby multi-region topology defined in
+[`tf/multi-region`](../../tf/multi-region). Covers initial deployment, routine
+verification, manual failover, and failback.
+
+> Related: [Region Outage runbook](./region-outage.md) · [Failover Plan](../procedures/failover-plan.md) · [Data Replication Strategy](../procedures/data-replication.md)
+
+---
+
+## 1. Initial deployment
+
+**Prerequisites**
+- Terraform >= 1.5.0; AWS credentials valid in both regions.
+- ACM certificates in **each** region (for HTTPS).
+- A registered domain (provide `hosted_zone_id` or let Terraform create a zone).
+
+**Steps**
+```bash
+cd tf/multi-region
+cp terraform.tfvars.example terraform.tfvars   # edit values
+terraform init
+terraform plan  -var-file=terraform.tfvars
+terraform apply -var-file=terraform.tfvars
+```
+
+**Post-deploy validation**
+```bash
+# Static checks (no creds needed); run from the repository root
+infra/scripts/validate-multiregion.sh
+
+# Live readiness checks
+export PRIMARY_ALB_URL="https://<primary-alb-dns>"
+export SECONDARY_ALB_URL="https://<secondary-alb-dns>"
+infra/scripts/failover-drill.sh
+```
+If `hosted_zone_id` was left empty, delegate your domain to the name servers in
+the `hosted_zone_name_servers` output; DNS will not resolve until you do.
+
+---
+
+## 2. Routine verification (monthly drill)
+
+Run on the **third Tuesday, 02:00 UTC** (see [dr/README.md](../README.md)).
+
+```bash
+infra/scripts/failover-drill.sh   # non-destructive; exits non-zero on failure
+```
+Confirms: both ALBs healthy, replica lag within RPO, S3 CRR enabled.
+
+Record results in the DR drill log. Investigate any ❌ before relying on failover.
+
+---
+
+## 3. Manual failover (primary region lost)
+
+> Route 53 shifts **traffic** automatically when the primary health check goes
+> red (~90s). These steps promote the **data tier**, which is not automatic.
+
+1. **Confirm** the outage is regional (AWS Health Dashboard, `failover.sh status`).
+2. **Activate**:
+   ```bash
+   export PRIMARY_REGION=us-east-1 SECONDARY_REGION=us-west-2 ENVIRONMENT=prod
+   infra/scripts/failover.sh activate --dry-run   # review
+   infra/scripts/failover.sh activate             # execute
+   ```
+   This promotes the read replica and scales up the secondary ECS service.
+3. **Repoint the app**: update `DB_HOST`, `REDIS_HOST`, `AWS_REGION` to the
+   secondary endpoints (Terraform outputs `replica_db_endpoint`, etc.) and
+   redeploy / restart tasks so writes hit the promoted database.
+4. **Verify**: `failover.sh status`; `curl https://api.<domain>/health` → 200;
+   confirm DNS resolves to the secondary ALB (`dig +short api.<domain>`).
+5. **Communicate** per the [Failover Plan](../procedures/failover-plan.md) comms timeline.
+
+**Target RTO: ≤ 15 min.** Escalate to Platform Lead if exceeded.
+
+---
+
+## 4. Failback (primary region recovered)
+
+Start failback only after the primary is fully restored and **data has been re-synced** to it.
+
+1. Re-seed the primary database from the promoted secondary (`pg_dump`/restore
+   or a new replica in the reverse direction), then verify data integrity.
+2. Re-apply Terraform to restore the primary RDS as a fresh primary and the
+   secondary as a read replica again.
+3. Scale the secondary back to standby:
+   ```bash
+   infra/scripts/failover.sh failback
+   ```
+4. Route 53 returns traffic to the primary once its health check is green.
+5. Validate, then document in the post-incident review.
+
+---
+
+## 5. Troubleshooting
+
+| Symptom | Likely cause | Action |
+| ------- | ------------ | ------ |
+| Traffic not failing over | Health check too lenient / DNS TTL cached | Check `primary_health_check_id` status; wait for interval × threshold |
+| Replica promotion fails | Replica mid-update | `aws rds wait db-instance-available`; retry |
+| App errors after failover | Still pointing at dead primary | Update `DB_HOST`/`REDIS_HOST`, redeploy |
+| S3 objects missing in secondary | Written before CRR enabled, or async lag | One-time `aws s3 sync`; check replication metrics |
+| `terraform apply` global-name clash | Reused single-region state | Use a separate state key for `tf/multi-region` |
+
+---
+
+**Document Version**: 1.0
+**Owner**: Platform Engineering
+**Review**: After each failover drill
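
Note on step 2 of the manual failover in `dr/runbooks/multi-region-deployment.md`: the activation depends on `PRIMARY_REGION`, `SECONDARY_REGION` and `ENVIRONMENT` being exported correctly. A pre-flight guard could be sketched as below. The helpers `require_env` and `regions_differ` are hypothetical and are not part of `failover.sh`, whose own validation is not shown in this diff.

```bash
# Hypothetical pre-flight helpers; not part of infra/scripts/failover.sh.

# Report each named variable that is unset or empty.
# Succeeds only when every variable has a value.
# The names must be trusted literals, because they are expanded with eval.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeed only when both regions are non-empty and different.
regions_differ() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

For example, `require_env PRIMARY_REGION SECONDARY_REGION ENVIRONMENT && regions_differ "$PRIMARY_REGION" "$SECONDARY_REGION"` could run before `failover.sh activate --dry-run`, so a typo in the exports surfaces before anything is promoted.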
diff --git a/infra/scripts/failover-drill.sh b/infra/scripts/failover-drill.sh
@@ -0,0 +1,109 @@
+#!/usr/bin/env bash
+#
+# failover-drill.sh — Non-destructive validation of the failover setup.
+#
+# Run this monthly (see dr/README.md testing schedule) to confirm the
+# multi-region deployment is failover-ready WITHOUT promoting the replica or
+# shifting production traffic. It checks:
+#   1. Both region ALBs answer the health-check path.
+#   2. The RDS read replica exists and replication lag is within RPO.
+#   3. S3 cross-region replication is enabled on the source buckets.
+#   (Route 53 health-check status is not queried by this script.)
+#
+# Exit code is non-zero if any check fails, so it can gate CI / a scheduled job.
+#
+# Usage: ./failover-drill.sh
+# Requires: awscli v2, curl, awk.
+set -uo pipefail
+
+PRIMARY_REGION="${PRIMARY_REGION:-us-east-1}"
+SECONDARY_REGION="${SECONDARY_REGION:-us-west-2}"
+ENVIRONMENT="${ENVIRONMENT:-prod}"
+REPLICA_DB_ID="${REPLICA_DB_ID:-teachlink-${ENVIRONMENT}-db-replica}"
+HEALTH_PATH="${HEALTH_PATH:-/health}"
+MAX_REPLICA_LAG_SECONDS="${MAX_REPLICA_LAG_SECONDS:-5}"
+PRIMARY_ALB_URL="${PRIMARY_ALB_URL:-}"
+SECONDARY_ALB_URL="${SECONDARY_ALB_URL:-}"
+
+PASS=0
+FAIL=0
+ok() {
+  printf '  ✅ %s\n' "$*"
+  PASS=$((PASS + 1))
+}
+bad() {
+  printf '  ❌ %s\n' "$*"
+  FAIL=$((FAIL + 1))
+}
+section() { printf '\n=== %s ===\n' "$*"; }
+
+check_endpoint() {
+  local name="$1" url="$2"
+  [[ -z "$url" ]] && {
+    bad "$name endpoint URL not set (export ${name}_ALB_URL)"
+    return
+  }
+  local code
+  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${url}${HEALTH_PATH}")" || code="${code:-000}"
+  if [[ "$code" == "200" ]]; then
+    ok "$name health endpoint returned 200"
+  else
+    bad "$name health endpoint returned ${code}"
+  fi
+}
+
+section "1. Region health endpoints"
+check_endpoint "PRIMARY" "$PRIMARY_ALB_URL"
+check_endpoint "SECONDARY" "$SECONDARY_ALB_URL"
+
+section "2. RDS cross-region read replica"
+if command -v aws >/dev/null 2>&1; then
+  replica_json="$(aws rds describe-db-instances --db-instance-identifier "$REPLICA_DB_ID" \
+    --region "$SECONDARY_REGION" --output json 2>/dev/null || echo '')"
+  if [[ -n "$replica_json" ]]; then
+    ok "Read replica '${REPLICA_DB_ID}' exists in ${SECONDARY_REGION}"
+    lag="$(aws cloudwatch get-metric-statistics \
+      --namespace AWS/RDS --metric-name ReplicaLag \
+      --dimensions Name=DBInstanceIdentifier,Value="$REPLICA_DB_ID" \
+      --statistics Average --period 60 \
+      --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-5M +%Y-%m-%dT%H:%M:%SZ)" \
+      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
+      --region "$SECONDARY_REGION" \
+      --query 'sort_by(Datapoints,&Timestamp)[-1].Average' --output text 2>/dev/null || echo None)"
+    if [[ "$lag" == "None" || -z "$lag" ]]; then
+      bad "Could not read ReplicaLag metric (no recent datapoints)"
+    elif awk "BEGIN{exit !(${lag} <= ${MAX_REPLICA_LAG_SECONDS})}"; then
+      ok "Replica lag ${lag}s is within RPO target (${MAX_REPLICA_LAG_SECONDS}s)"
+    else
+      bad "Replica lag ${lag}s exceeds RPO target (${MAX_REPLICA_LAG_SECONDS}s)"
+    fi
+  else
+    bad "Read replica '${REPLICA_DB_ID}' not found in ${SECONDARY_REGION}"
+  fi
+else
+  bad "awscli not available — skipping RDS checks"
+fi
+
+section "3. S3 cross-region replication"
+if command -v aws >/dev/null 2>&1; then
+  for kind in uploads backups; do
+    bucket="teachlink-${PRIMARY_REGION//-/}-${ENVIRONMENT}-${kind}"
+    status="$(aws s3api get-bucket-replication --bucket "$bucket" \
+      --query 'ReplicationConfiguration.Rules[0].Status' --output text 2>/dev/null || echo None)"
+    if [[ "$status" == "Enabled" ]]; then
+      ok "Replication enabled on ${bucket}"
+    else
+      bad "Replication not enabled on ${bucket} (status=${status})"
+    fi
+  done
+else
+  bad "awscli not available — skipping S3 checks"
+fi
+
+section "Drill summary"
+printf 'Passed: %d  Failed: %d\n' "$PASS" "$FAIL"
+[[ "$FAIL" -eq 0 ]] || {
+  echo "Drill FAILED — investigate before relying on failover."
+  exit 1
+}
+echo "Drill PASSED — deployment is failover-ready."
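
Note on `infra/scripts/failover-drill.sh`: the script body has sections for endpoints, the RDS replica and S3 replication, but none that queries Route 53. A Route 53 check could be sketched as below. `majority_healthy` and `HEALTH_CHECK_ID` are hypothetical names, and the simple-majority rule is chosen for illustration only; it is not Route 53's own evaluation rule.

```bash
# Hypothetical Route 53 check; not part of failover-drill.sh.

# Read checker observations on stdin, one status line each, and succeed
# when more than half of the non-empty lines start with "Success".
majority_healthy() {
  awk '
    NF         { total++ }
    /^Success/ { healthy++ }
    END        { exit !(total > 0 && healthy * 2 > total) }
  '
}

# Live usage (requires AWS credentials; HEALTH_CHECK_ID is an assumed variable):
#   aws route53 get-health-check-status --health-check-id "$HEALTH_CHECK_ID" \
#     --query 'HealthCheckObservations[].StatusReport.Status' --output text |
#     tr '\t' '\n' | majority_healthy
```

Keeping the decision in a small filter lets it be exercised with canned input, and the live call could feed the same `ok`/`bad` counters as the other checks.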