18 changes: 18 additions & 0 deletions .gitignore
@@ -80,3 +80,21 @@ tsconfig.build.tsbuildinfo

# Backups
/backups

# AI assistant / tooling artifacts
.claude/
.claude/**
CLAUDE.md
.cursor/
.cursorrules
.cursorignore
.aider*
.windsurfrules
.github/copilot-instructions.md
.continue/
.codeium/
.gemini/
.specstory/
.roo/
.kilocode/
.augment/
5 changes: 5 additions & 0 deletions dr/README.md
@@ -7,12 +7,17 @@ Welcome to the TeachLink Disaster Recovery (DR) documentation. This directory co
### Planning & Strategy
- **[RTO/RPO Definitions](./procedures/RTO-RPO.md)** — Recovery Time and Point Objectives, alert thresholds
- **[Failover Plan](./procedures/failover-plan.md)** — Failover procedures, failback strategy, infrastructure requirements
- **[Data Replication Strategy](./procedures/data-replication.md)** — Cross-region RDS replica, S3 CRR, cache/state handling

### Incident Response
- **[Database Failure Runbook](./runbooks/database-failure.md)** — PostgreSQL failures, connection issues, data integrity problems
- **[Region Outage Runbook](./runbooks/region-outage.md)** — AWS region unavailability, cross-region failover procedures
- **[Multi-Region Deployment Runbook](./runbooks/multi-region-deployment.md)** — Deploy, drill, fail over and fail back the two-region topology
- **[Data Corruption Runbook](./runbooks/data-corruption.md)** — Data inconsistency, corruption detection, point-in-time recovery

### Infrastructure as Code
- **[Multi-Region Terraform](../tf/multi-region/README.md)** — Active/standby deployment across two regions (issue #620)

## 🎯 Recovery Objectives at a Glance

| Objective | Target | Notes |
112 changes: 112 additions & 0 deletions dr/procedures/data-replication.md
@@ -0,0 +1,112 @@
# Data Replication Strategy

This document describes how TeachLink data is replicated across regions to meet
the recovery objectives in [RTO-RPO.md](./RTO-RPO.md) and enable the failover
flow in [failover-plan.md](./failover-plan.md).

Implemented by [`tf/multi-region`](../../tf/multi-region) — issue **#620**.

---

## Summary

| Data store | Mechanism | Direction | RPO | On failover |
| ---------- | --------- | --------- | --- | ----------- |
| PostgreSQL (RDS) | Cross-region **read replica** | primary → secondary (continuous) | seconds | Promote replica to standalone primary |
| Object storage (S3) | **Cross-Region Replication (CRR)** | primary → secondary (async) | seconds–minutes | Already present in secondary bucket |
| Redis (ElastiCache) | Independent standby (no replication) | n/a | n/a (cache) | Warm standby; repopulates from DB |
| Terraform state | S3 versioning + DynamoDB lock | n/a | n/a | Restore from versioned state |

---

## 1. Database: RDS cross-region read replica

The primary PostgreSQL instance lives in the primary region. A **cross-region
read replica** is provisioned in the secondary region by
[`tf/modules/database-replica`](../../tf/modules/database-replica).

- **How it works**: RDS streams the primary's write-ahead log to the replica
asynchronously, typically keeping it within a few seconds of the primary
(`ReplicaLag` metric). This gives an effective **RPO of seconds**, a large
improvement over the backup-only RPO of up to 7 days.
- **Encryption**: the source is encrypted, so the replica uses a dedicated
KMS key created in the secondary region (cross-region requirement).
- **Backups**: the replica keeps a 7-day backup retention so it can be promoted
and itself replicated after a failover.
- **On failover**: `infra/scripts/failover.sh activate` runs
`aws rds promote-read-replica`, converting the replica into a standalone
read/write primary. Promotion is irreversible — failback requires re-seeding
the original primary (see failover-plan.md).
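
The promotion step can be sketched as follows. This is a minimal illustration, not the contents of `failover.sh`: the helper name `promote_replica` and the `DRY_RUN` switch are assumptions made here, while the two `aws rds` subcommands are the standard CLI calls for promoting a replica and waiting for it to become available.

```bash
# Sketch only: promote_replica and DRY_RUN are hypothetical names, not taken
# from infra/scripts/failover.sh.
promote_replica() {
  local replica_id="$1" region="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY RUN: aws rds promote-read-replica --db-instance-identifier ${replica_id} --region ${region}"
    return 0
  fi
  # Promotion is irreversible: the replica becomes a standalone read/write instance.
  aws rds promote-read-replica \
    --db-instance-identifier "$replica_id" --region "$region" >/dev/null || return 1
  # Block until RDS reports the promoted instance as available again.
  aws rds wait db-instance-available \
    --db-instance-identifier "$replica_id" --region "$region"
}
```

Running it with `DRY_RUN=1 promote_replica teachlink-prod-db-replica us-west-2` prints the command without calling AWS, which mirrors the `--dry-run` review step in the runbook.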

### Monitoring replica lag

```bash
# Uses GNU date (-d); on macOS/BSD, build the start time with: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS --metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=teachlink-prod-db-replica \
--statistics Average --period 60 \
--start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--region us-west-2
```

Alert when `ReplicaLag > 5s` (RPO target). The monthly drill
(`infra/scripts/failover-drill.sh`) asserts this automatically.
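
One way to turn the 5-second threshold into a standing alert is a CloudWatch alarm on the same metric. The sketch below is an assumption about how that could look, not something defined in this change: the helper name, the alarm name suffix, the three-period evaluation window and the SNS topic argument are all illustrative.

```bash
# Sketch only: create_replica_lag_alarm, the alarm name and the SNS topic ARN
# are hypothetical; only the metric and threshold come from this document.
create_replica_lag_alarm() {
  local replica_id="$1" region="$2" threshold="$3" sns_topic_arn="$4"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${replica_id}-lag-high" \
    --namespace AWS/RDS --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${replica_id}" \
    --statistic Average --period 60 --evaluation-periods 3 \
    --threshold "$threshold" --comparison-operator GreaterThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$sns_topic_arn" \
    --region "$region"
}
```

Treating missing data as breaching is a design choice: if the replica stops reporting `ReplicaLag` altogether, the alarm fires instead of staying silent.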

---

## 2. Object storage: S3 Cross-Region Replication

Both the `uploads` and `backups` buckets replicate from the primary region to
their secondary-region counterparts via
[`tf/modules/replication`](../../tf/modules/replication).

- **Prerequisite**: versioning is enabled on source and destination (the
storage module already does this).
- **IAM**: a scoped replication role grants S3 permission to read source object
versions and write them to the destination.
- **Scope**: all objects (`prefix = ""`), including delete markers, so deletes
propagate.
- **Storage class**: replicated objects land in `STANDARD_IA` to reduce cost.
- **Latency**: replication is asynchronous (usually seconds to minutes). Objects
written immediately before a regional failure may not have replicated — this
is the S3 RPO and is acceptable for uploads/backups.

> Note: CRR only replicates objects written **after** the rule is enabled. For a
> brand-new secondary bucket, run a one-time `aws s3 sync` (or S3 Batch
> Replication) to backfill existing objects.
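
The backfill mentioned in the note could be run roughly like this. The helper names and arguments are hypothetical; the bucket names must come from the storage module outputs. Keep in mind that `aws s3 sync` copies current objects as new writes, so it does not carry over version history the way S3 Batch Replication does.

```bash
# Sketch only: helper names are hypothetical and bucket names are supplied by
# the caller; nothing here is read from the Terraform modules.
backfill_bucket() {
  local src_bucket="$1" dst_bucket="$2" src_region="$3" dst_region="$4"
  # --source-region names the source bucket's region; --region applies to the destination.
  aws s3 sync "s3://${src_bucket}" "s3://${dst_bucket}" \
    --source-region "$src_region" --region "$dst_region" \
    --storage-class STANDARD_IA
}

# Print the replication status S3 reports for one source object
# (for example PENDING, COMPLETED or FAILED).
object_replication_status() {
  local bucket="$1" key="$2"
  aws s3api head-object --bucket "$bucket" --key "$key" \
    --query ReplicationStatus --output text
}
```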

---

## 3. Cache: Redis standby

ElastiCache holds ephemeral cache data, so it is **not** replicated across
regions. The secondary region runs an independent standby Redis cluster that is
empty until failover, after which it repopulates naturally from the promoted
database. This avoids the cost and complexity of a Global Datastore for
non-durable data.

---

## 4. Terraform state

State is stored in a versioned S3 bucket with a DynamoDB lock table (see
[`tf/README.md`](../../tf/README.md)). Versioning provides point-in-time recovery
of the state file itself. Use a **distinct state key** for the multi-region
configuration to avoid clobbering the single-region state.
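
A distinct key can be supplied at init time through Terraform's partial backend configuration. The sketch below assumes the backend block in `tf/multi-region` leaves `key` to be set this way; the helper name and the key `multi-region/terraform.tfstate` are illustrative, not values from the repository.

```bash
# Sketch only: the helper name and default state key are hypothetical examples.
# Run from tf/multi-region.
init_multi_region_state() {
  local state_key="${1:-multi-region/terraform.tfstate}"
  # -backend-config sets a single backend attribute during `terraform init`.
  terraform init -backend-config="key=${state_key}"
}
```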

---

## Verification

| Check | Command / Tool |
| ----- | -------------- |
| Replica exists & lag OK | `infra/scripts/failover-drill.sh` |
| CRR enabled | `aws s3api get-bucket-replication --bucket <bucket>` |
| End-to-end failover | Quarterly drill (see [dr/README.md](../README.md)) |

---

**Document Version**: 1.0
**Owner**: Platform Engineering
**Related**: [failover-plan.md](./failover-plan.md), [RTO-RPO.md](./RTO-RPO.md), [region-outage runbook](../runbooks/region-outage.md)
110 changes: 110 additions & 0 deletions dr/runbooks/multi-region-deployment.md
@@ -0,0 +1,110 @@
# Runbook: Multi-Region Deployment & Failover Operations

Operational runbook for the active/standby multi-region topology defined in
[`tf/multi-region`](../../tf/multi-region). Covers initial deployment, routine
verification, manual failover, and failback.

> Related: [Region Outage runbook](./region-outage.md) · [Failover Plan](../procedures/failover-plan.md) · [Data Replication Strategy](../procedures/data-replication.md)

---

## 1. Initial deployment

**Prerequisites**
- Terraform >= 1.5.0; AWS credentials valid in both regions.
- ACM certificates in **each** region (for HTTPS).
- A registered domain (provide `hosted_zone_id` or let Terraform create a zone).

**Steps**
```bash
cd tf/multi-region
cp terraform.tfvars.example terraform.tfvars # edit values
terraform init
terraform plan -var-file=terraform.tfvars
terraform apply -var-file=terraform.tfvars
```

**Post-deploy validation**
```bash
# Static checks (no creds needed)
infra/scripts/validate-multiregion.sh

# Live readiness checks
export PRIMARY_ALB_URL="https://<primary-alb-dns>"
export SECONDARY_ALB_URL="https://<secondary-alb-dns>"
infra/scripts/failover-drill.sh
```
If `hosted_zone_id` was empty, delegate your domain to the
`hosted_zone_name_servers` output before traffic will resolve.
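
To confirm the delegation took effect, the name servers Terraform reports can be compared with what public DNS returns. This is a sketch under assumptions: `ns_delegated` and `normalize_ns` are hypothetical helpers, and the usage line assumes `hosted_zone_name_servers` is a list output.

```bash
# Sketch only: both helpers are hypothetical.
# Normalize a whitespace-separated name-server list: lowercase, drop trailing dots, sort.
normalize_ns() {
  # $1 is left unquoted on purpose so each name server becomes its own line.
  printf '%s\n' $1 | tr 'A-Z' 'a-z' | sed 's/\.$//' | sort -u
}

# Succeed only when both lists are non-empty and contain the same servers.
ns_delegated() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$(normalize_ns "$1")" = "$(normalize_ns "$2")" ]
}
```

A possible invocation, with `<domain>` as a placeholder: `ns_delegated "$(terraform output -json hosted_zone_name_servers | jq -r '.[]')" "$(dig +short NS <domain>)"`.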

---

## 2. Routine verification (monthly drill)

Run on the **third Tuesday, 02:00 UTC** (see [dr/README.md](../README.md)).

```bash
infra/scripts/failover-drill.sh # non-destructive; exits non-zero on failure
```
Confirms: both ALBs healthy, replica lag within RPO, S3 CRR enabled.

Record results in the DR drill log. Investigate any ❌ before relying on failover.
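
Cron cannot express "third Tuesday" directly, so a scheduled job may need a date guard in front of the drill. The helper below is a sketch with a hypothetical name; it relies on GNU `date` and on the fact that the third Tuesday of a month always falls on day 15 through 21.

```bash
# Sketch only: is_third_tuesday is a hypothetical helper; requires GNU date.
is_third_tuesday() {
  local day="$1" dow dom
  dow="$(date -u -d "$day" +%u)" || return 2   # 1 = Monday ... 7 = Sunday
  dom="$(date -u -d "$day" +%-d)" || return 2  # day of month without zero padding
  [ "$dow" -eq 2 ] && [ "$dom" -ge 15 ] && [ "$dom" -le 21 ]
}
```

A weekly Tuesday 02:00 UTC schedule could then call `is_third_tuesday "$(date -u +%F)" && infra/scripts/failover-drill.sh`.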

---

## 3. Manual failover (primary region lost)

> Route 53 shifts **traffic** automatically when the primary health check goes
> red (~90s). These steps promote the **data tier**, which is not automatic.

1. **Confirm** the outage is regional (AWS Health Dashboard, `failover.sh status`).
2. **Activate**:
   ```bash
   export PRIMARY_REGION=us-east-1 SECONDARY_REGION=us-west-2 ENVIRONMENT=prod
   infra/scripts/failover.sh activate --dry-run # review
   infra/scripts/failover.sh activate # execute
   ```
   This promotes the read replica and scales up the secondary ECS service.
3. **Repoint the app**: update `DB_HOST`, `REDIS_HOST`, `AWS_REGION` to the
secondary endpoints (Terraform outputs `replica_db_endpoint`, etc.) and
redeploy / restart tasks so writes hit the promoted database.
4. **Verify**: `failover.sh status`; `curl https://api.<domain>/health` → 200;
confirm DNS resolves to the secondary ALB (`dig +short api.<domain>`).
5. **Communicate** per the [Failover Plan](../procedures/failover-plan.md) comms timeline.
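
The verification in step 4 can be scripted along these lines. This is a sketch, not part of `failover.sh`: the function name and arguments are hypothetical, and comparing resolved addresses is only an approximation because ALB addresses rotate and DNS may return a subset of them.

```bash
# Sketch only: verify_failover and its arguments are hypothetical.
verify_failover() {
  local api_host="$1" secondary_alb_dns="$2" code api_ips alb_ips ip
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${api_host}/health")" || code="000"
  [ "$code" = "200" ] || { echo "health check returned ${code}"; return 1; }
  api_ips="$(dig +short "$api_host" A)"
  alb_ips="$(dig +short "$secondary_alb_dns" A)"
  # Treat DNS as switched when the API name shares at least one address with the secondary ALB.
  for ip in $api_ips; do
    case " $(echo $alb_ips) " in
      *" $ip "*) echo "ok: ${api_host} resolves to the secondary ALB"; return 0 ;;
    esac
  done
  echo "DNS for ${api_host} does not match ${secondary_alb_dns}"
  return 1
}
```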

**Target RTO: ≤ 15 min.** Escalate to Platform Lead if exceeded.

---

## 4. Failback (primary region recovered)

Only after the primary is fully restored and **data re-synced** to it.

1. Re-seed the primary database from the promoted secondary (`pg_dump`/restore
or a new replica in the reverse direction), verify integrity.
2. Re-apply Terraform to restore the primary RDS as a fresh primary and the
secondary as a read replica again.
3. Scale the secondary back to standby:
   ```bash
   infra/scripts/failover.sh failback
   ```
4. Route 53 returns traffic to the primary once its health check is green.
5. Validate, then document in the post-incident review.
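
For the `pg_dump`/restore option in step 1, a re-seed could look like the sketch below. Everything about it is an assumption: the helper name, the use of a temporary dump file (which needs enough local disk for the whole database), and credentials supplied through `PGPASSWORD` or `~/.pgpass`. The target database must already exist on the restored primary.

```bash
# Sketch only: reseed_primary is a hypothetical helper; hosts, database and
# user are supplied by the caller.
reseed_primary() {
  local src_host="$1" dst_host="$2" db="$3" user="$4" dump rc
  dump="$(mktemp)" || return 1
  # Dump the promoted secondary in custom format, then restore it into the primary.
  pg_dump --format=custom --host "$src_host" --username "$user" --file "$dump" "$db" &&
    pg_restore --host "$dst_host" --username "$user" --dbname "$db" \
      --clean --if-exists --no-owner "$dump"
  rc=$?
  rm -f "$dump"
  return "$rc"
}
```

Writing to a file first, rather than piping, keeps a failed dump from being masked by a successful-looking restore.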

---

## 5. Troubleshooting

| Symptom | Likely cause | Action |
| ------- | ------------ | ------ |
| Traffic not failing over | Health check too lenient / DNS TTL cached | Check `primary_health_check_id` status; wait for interval × threshold |
| Replica promotion fails | Replica mid-update | `aws rds wait db-instance-available`; retry |
| App errors after failover | Still pointing at dead primary | Update `DB_HOST`/`REDIS_HOST`, redeploy |
| S3 objects missing in secondary | Written before CRR enabled, or async lag | One-time `aws s3 sync`; check replication metrics |
| `terraform apply` global-name clash | Reused single-region state | Use a separate state key for `tf/multi-region` |

---

**Document Version**: 1.0
**Owner**: Platform Engineering
**Review**: After each failover drill
109 changes: 109 additions & 0 deletions infra/scripts/failover-drill.sh
@@ -0,0 +1,109 @@
#!/usr/bin/env bash
#
# failover-drill.sh — Non-destructive validation of the failover setup.
#
# Run this monthly (see dr/README.md testing schedule) to confirm the
# multi-region deployment is failover-ready WITHOUT promoting the replica or
# shifting production traffic. It checks:
#   1. Both region ALBs answer the health-check path.
#   2. The RDS read replica exists and replication lag is within RPO.
#   3. S3 cross-region replication is enabled on the source buckets.
# Route 53 health-check status is not verified by this script.
#
# Exit code is non-zero if any check fails, so it can gate CI / a scheduled job.
#
# Usage: PRIMARY_ALB_URL=... SECONDARY_ALB_URL=... ./failover-drill.sh
# Requires: awscli v2, curl, awk, GNU date.
set -uo pipefail

PRIMARY_REGION="${PRIMARY_REGION:-us-east-1}"
SECONDARY_REGION="${SECONDARY_REGION:-us-west-2}"
ENVIRONMENT="${ENVIRONMENT:-prod}"
REPLICA_DB_ID="${REPLICA_DB_ID:-teachlink-${ENVIRONMENT}-db-replica}"
HEALTH_PATH="${HEALTH_PATH:-/health}"
MAX_REPLICA_LAG_SECONDS="${MAX_REPLICA_LAG_SECONDS:-5}"
PRIMARY_ALB_URL="${PRIMARY_ALB_URL:-}"
SECONDARY_ALB_URL="${SECONDARY_ALB_URL:-}"

PASS=0
FAIL=0
ok() {
  printf ' ✅ %s\n' "$*"
  PASS=$((PASS + 1))
}
bad() {
  printf ' ❌ %s\n' "$*"
  FAIL=$((FAIL + 1))
}
section() { printf '\n=== %s ===\n' "$*"; }

check_endpoint() {
  local name="$1" url="$2"
  [[ -z "$url" ]] && {
    bad "$name endpoint URL not set (export ${name}_ALB_URL)"
    return
  }
  local code
  # curl's -w already prints 000 when the request fails, so overwrite the value
  # on a non-zero exit instead of appending a second "000" to it.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${url}${HEALTH_PATH}")" || code="000"
  if [[ "$code" == "200" ]]; then
    ok "$name health endpoint returned 200"
  else
    bad "$name health endpoint returned ${code}"
  fi
}

section "1. Region health endpoints"
check_endpoint "PRIMARY" "$PRIMARY_ALB_URL"
check_endpoint "SECONDARY" "$SECONDARY_ALB_URL"

section "2. RDS cross-region read replica"
if command -v aws >/dev/null 2>&1; then
  replica_json="$(aws rds describe-db-instances --db-instance-identifier "$REPLICA_DB_ID" \
    --region "$SECONDARY_REGION" --output json 2>/dev/null || echo '')"
  if [[ -n "$replica_json" ]]; then
    ok "Read replica '${REPLICA_DB_ID}' exists in ${SECONDARY_REGION}"
    # Latest one-minute average of ReplicaLag over the last five minutes.
    lag="$(aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS --metric-name ReplicaLag \
      --dimensions Name=DBInstanceIdentifier,Value="$REPLICA_DB_ID" \
      --statistics Average --period 60 \
      --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || echo '')" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --region "$SECONDARY_REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].Average' --output text 2>/dev/null || echo None)"
    if [[ "$lag" == "None" || -z "$lag" ]]; then
      bad "Could not read ReplicaLag metric (no recent datapoints)"
    # Pass the values with -v so they are compared as numbers, not spliced into the awk program.
    elif awk -v lag="$lag" -v max="$MAX_REPLICA_LAG_SECONDS" 'BEGIN { exit !(lag + 0 <= max + 0) }'; then
      ok "Replica lag ${lag}s is within RPO target (${MAX_REPLICA_LAG_SECONDS}s)"
    else
      bad "Replica lag ${lag}s exceeds RPO target (${MAX_REPLICA_LAG_SECONDS}s)"
    fi
  else
    bad "Read replica '${REPLICA_DB_ID}' not found in ${SECONDARY_REGION}"
  fi
else
  bad "awscli not available; skipping RDS checks"
fi

section "3. S3 cross-region replication"
if command -v aws >/dev/null 2>&1; then
  for kind in uploads backups; do
    bucket="teachlink-${PRIMARY_REGION//-/}-${ENVIRONMENT}-${kind}"
    status="$(aws s3api get-bucket-replication --bucket "$bucket" \
      --query 'ReplicationConfiguration.Rules[0].Status' --output text 2>/dev/null || echo None)"
    if [[ "$status" == "Enabled" ]]; then
      ok "Replication enabled on ${bucket}"
    else
      bad "Replication not enabled on ${bucket} (status=${status})"
    fi
  done
else
  bad "awscli not available; skipping S3 checks"
fi

section "Drill summary"
printf 'Passed: %d Failed: %d\n' "$PASS" "$FAIL"
[[ "$FAIL" -eq 0 ]] || {
  echo "Drill FAILED: investigate before relying on failover."
  exit 1
}
echo "Drill PASSED — deployment is failover-ready."