A production-grade, event-driven order processing pipeline built entirely on AWS serverless services. Designed to handle 10,000+ orders/min using the Saga pattern for distributed transactions with automatic compensation on failure.
Demonstrates distributed systems design, serverless architecture, Infrastructure as Code, and event-driven patterns.
The system follows an event-driven microservices architecture where each component is decoupled through queues and events. A client submits an order via REST API, and the system asynchronously orchestrates inventory reservation, payment processing, order confirmation, and domain event publishing β all with automatic rollback if anything fails.
flowchart TD
subgraph Client["π Client"]
A["π± POST /orders"]
end
subgraph API["π‘οΈ API Gateway"]
B["π JSON Schema<br/>Validation"]
B2["π¦ Throttling<br/>200 burst / 100 sustained"]
end
subgraph Create["β‘ createOrder Lambda"]
C1["π Idempotency Check<br/>(Powertools body hash)"]
C2["πΎ Write to DynamoDB<br/>(status: PENDING)"]
C3["π¨ Enqueue to SQS"]
end
subgraph Queue["π¬ SQS"]
D1["π₯ Order Queue"]
D2["β οΈ Dead Letter Queue<br/>(after 3 failures)"]
end
subgraph Process["β‘ processOrder Lambda"]
E["π Start Saga<br/>(orderId as execution name)"]
end
subgraph Saga["π Step Functions Express β Order Saga"]
direction TB
F1["π¦ Reserve Inventory"]
F2["π³ Process Payment"]
F3["β
Confirm Order"]
F4["π‘ Emit Event"]
end
subgraph Events["π‘ EventBridge"]
G1["π Custom Event Bus"]
G2["π OrderPlaced Rule"]
G3["π CloudWatch Logs"]
end
A --> B --> B2 --> C1 --> C2 --> C3
C3 --> D1
D1 --> E
D1 -. "3 failures" .-> D2
E --> F1 --> F2 --> F3 --> F4
F4 --> G1 --> G2 --> G3
style Client fill:#1a1a2e,stroke:#e94560,color:#fff
style API fill:#16213e,stroke:#0f3460,color:#fff
style Create fill:#1a1a2e,stroke:#e94560,color:#fff
style Queue fill:#16213e,stroke:#0f3460,color:#fff
style Process fill:#1a1a2e,stroke:#e94560,color:#fff
style Saga fill:#0f3460,stroke:#53a8b6,color:#fff
style Events fill:#16213e,stroke:#0f3460,color:#fff
When a step fails, the saga doesn't just stop β it undoes everything that succeeded before it. This guarantees data consistency across services without distributed locks.
flowchart LR
subgraph Happy["β
Happy Path"]
direction LR
H1["π¦ Reserve<br/>Inventory"] --> H2["π³ Process<br/>Payment"] --> H3["β
Confirm<br/>Order"] --> H4["π‘ Emit<br/>Event"]
end
subgraph Fail1["β Payment Fails"]
direction LR
X1["π¦ Reserve<br/>Inventory β"] --> X2["π³ Payment<br/>β FAILS"] --> X3["π¦ Release<br/>Inventory β©οΈ"] --> X4["β Fail<br/>Order"]
end
subgraph Fail2["β Confirm Fails"]
direction LR
Y1["π¦ Reserve β<br/>π³ Payment β"] --> Y2["β
Confirm<br/>β FAILS"] --> Y3["π³ Refund<br/>Payment β©οΈ"] --> Y4["π¦ Release<br/>Inventory β©οΈ"] --> Y5["β Fail<br/>Order"]
end
style Happy fill:#0d7377,stroke:#14ffec,color:#fff
style Fail1 fill:#6b0f1a,stroke:#e94560,color:#fff
style Fail2 fill:#6b0f1a,stroke:#e94560,color:#fff
Each state has 3 retries with exponential backoff (2s β 4s β 8s) before triggering compensation.
The REST API exposes two endpoints:
| Method | Path | Lambda | Purpose |
|---|---|---|---|
POST |
/orders |
createOrder | Submit a new order |
GET |
/orders/{orderId} |
getOrder | Check order status |
Request validation happens before Lambda even runs. A JSON Schema model enforces:
userIdβ non-empty stringitemsβ non-empty array, each withproductId(string) andqty(integer β₯ 1)totalAmountβ number β₯ 0.01
Invalid requests get a 400 response at the API Gateway level β zero Lambda invocations, zero cost. Rate limiting (200 burst / 100 sustained) protects downstream services from traffic spikes.
This Lambda is the API entry point and the most critical piece for data integrity. It prevents duplicate orders at two independent layers:
Layer 1: Powertools Idempotency
ββ Hashes the request body β stores in Idempotency table (1hr TTL)
ββ Duplicate body within 1 hour? β returns cached response, handler never runs
Layer 2: DynamoDB Conditional Write
ββ ConditionExpression: attribute_not_exists(orderId)
ββ UUID collision (near-impossible)? β caught and rejected with 409
Flow: Generate UUID β write order to DynamoDB (status: PENDING) β enqueue to SQS β return 201 { orderId, status: "PENDING" }. The client gets an instant response; processing is fully asynchronous.
The queue sits between the API and the saga. This isn't just a "nice to have" β it's essential for resilience:
| Feature | Configuration | Why |
|---|---|---|
| π Retry | 3 attempts | Failed messages get reprocessed before giving up |
| β οΈ Dead Letter Queue | 14-day retention | Poisons messages are quarantined, not lost |
| β±οΈ Long polling | 10 seconds | Reduces empty receives and API costs |
| ποΈ Visibility timeout | 60 seconds | Prevents concurrent processing of the same message |
| π§© Partial batch failures | ReportBatchItemFailures |
One bad message doesn't block the whole batch of 10 |
Receives up to 10 messages per batch, parses each, and starts a Step Functions Express execution. The orderId is used as the execution name β built-in dedup at the AWS level (duplicate execution names are rejected).
Returns { batchItemFailures } so only failed messages retry while successful ones are deleted by SQS.
stateDiagram-v2
[*] --> ReserveInventory
ReserveInventory --> ProcessPayment: β
Stock reserved
ReserveInventory --> FailOrder: β Insufficient stock
ProcessPayment --> ConfirmOrder: β
Payment OK
ProcessPayment --> ReleaseInventory: β Payment declined
ConfirmOrder --> EmitOrderPlaced: β
Status β CONFIRMED
ConfirmOrder --> RefundPayment: β Update failed
EmitOrderPlaced --> [*]: π‘ Event published
ReleaseInventory --> FailOrder: β©οΈ Stock restored
RefundPayment --> CompensateRelease: β©οΈ Payment refunded
CompensateRelease --> FailOrder: β©οΈ Stock restored
FailOrder --> [*]: π Status β FAILED
Why Express over Standard? Express state machines run synchronously, cost less (priced per execution, not per state transition), and are designed for sub-5-minute workflows. Perfect for order processing.
Atomic conditional update per item:
UpdateExpression: 'SET stock = stock - :qty'
ConditionExpression: 'attribute_exists(productId) AND stock >= :qty'DynamoDB guarantees atomicity β two concurrent orders for the last item can't both succeed. If any item fails, already-reserved items in the same batch are rolled back within the Lambda before the saga-level compensation even kicks in.
Simulates payment with a configurable failure rate (FAIL_PAYMENT_PERCENT, default 20%). This is intentional β it triggers compensation paths during demos so you can observe the saga rollback in X-Ray traces.
Updates the order status from PENDING to CONFIRMED in DynamoDB. Uses ExpressionAttributeNames: { '#status': 'status' } because status is a DynamoDB reserved word.
Publishes an OrderPlaced domain event to EventBridge with:
{
"Source": "ordering-system",
"DetailType": "OrderPlaced",
"Detail": { "orderId", "userId", "totalAmount", "itemCount", "timestamp" }
}This step has no compensation catch β if it fails after 3 retries, the saga ends but the order is already confirmed in DynamoDB. Event emission is "best effort" β the database is the source of truth.
A custom event bus (dev-ser-ord-sys-events) receives OrderPlaced events. An event rule pattern-matches on source and detail-type, routing every matched event to a CloudWatch Log Group for visibility and debugging.
flowchart LR
A["β‘ emitEvent<br/>Lambda"] -->|PutEvents| B["π Custom<br/>Event Bus"]
B --> C{"π order-placed-rule<br/>source: ordering-system<br/>detail-type: OrderPlaced"}
C --> D["π CloudWatch<br/>Logs"]
C -. "future" .-> E["π§ SNS<br/>Notification"]
C -. "future" .-> F["β‘ Analytics<br/>Lambda"]
style A fill:#1a1a2e,stroke:#e94560,color:#fff
style B fill:#0f3460,stroke:#53a8b6,color:#fff
style C fill:#16213e,stroke:#0f3460,color:#fff
style D fill:#0d7377,stroke:#14ffec,color:#fff
style E fill:#2d2d2d,stroke:#666,color:#999,stroke-dasharray: 5 5
style F fill:#2d2d2d,stroke:#666,color:#999,stroke-dasharray: 5 5
The bus is extensible β future consumers (notifications, analytics, audit) can subscribe to the same events without modifying the producer.
erDiagram
ORDERS {
string orderId PK "Partition Key (UUID)"
string userId "Customer ID"
list items "Array of productId + qty"
number totalAmount "Order total"
string status "PENDING | CONFIRMED | FAILED"
string failureReason "null or error message"
string createdAt "ISO 8601"
string updatedAt "ISO 8601"
}
INVENTORY {
string productId PK "Partition Key"
string name "Product name"
number price "Unit price"
number stock "Available quantity"
string category "Product category"
}
IDEMPOTENCY {
string id PK "Request body hash"
string status "INPROGRESS | COMPLETED"
string data "Cached Lambda response"
number expiration "TTL epoch (1hr)"
}
ORDERS ||--o{ INVENTORY : "items reference"
| Table | PK | GSI | TTL | PITR |
|---|---|---|---|---|
| π Orders | orderId |
userId-createdAt-index |
β | β |
| π¦ Inventory | productId |
β | β | β |
| π Idempotency | id |
β | expiration (1hr) |
β |
| Layer | Technology | Purpose |
|---|---|---|
| ποΈ IaC | Terraform | Custom reusable modules for every AWS resource |
| β‘ Compute | Lambda (Node.js 20.x ESM) | 11 functions, each with least-privilege IAM |
| π¦ Shared Code | Lambda Layer | Powertools + AWS SDK v3 + utility modules |
| π Orchestration | Step Functions Express | Saga pattern with compensation flows |
| πΎ Storage | DynamoDB (on-demand) | 3 tables with PITR, GSI, TTL |
| π API | API Gateway REST | JSON Schema validation, throttling, X-Ray |
| π¬ Queue | SQS + DLQ | Buffering, retry, partial batch failures |
| π‘ Events | EventBridge | Custom bus, pattern-matched rules |
| π Observability | CloudWatch + X-Ray | Structured logs, traces, custom metrics |
π Two-Layer Idempotency β Why one layer isn't enough
Layer 1 (Powertools): Hashes the request body and caches the response. Protects against network-level retries where the client re-sends the exact same payload.
Layer 2 (Conditional Write): attribute_not_exists(orderId) on the DynamoDB PutItem. Protects against the edge case where the idempotency cache TTL has expired but the order still exists.
Neither layer alone covers all scenarios β together they provide bulletproof deduplication.
π¦ Partial Batch Failure Reporting β Processing 10 messages without poisoning the batch
SQS delivers up to 10 messages per Lambda invocation. Without ReportBatchItemFailures, a single failing message would cause all 10 to retry β including the 9 that succeeded (which would then create duplicates).
By returning { batchItemFailures: [{ itemIdentifier: failedMessageId }] }, only the specific failed messages retry. The successful ones are deleted from the queue.
π Express vs Standard Step Functions β The right tool for the job
| Feature | Express | Standard |
|---|---|---|
| Max duration | 5 minutes | 1 year |
| Pricing | Per execution | Per state transition |
| Execution mode | Synchronous | Asynchronous |
| Dedup via name | β | β |
Order processing completes in seconds. Express is cheaper, synchronous (processOrder waits for the result), and the execution-name-based dedup prevents reprocessing the same order from SQS retries.
π¦ Atomic Inventory Reservation β Preventing overselling without locks
ConditionExpression: 'attribute_exists(productId) AND stock >= :qty'
UpdateExpression: 'SET stock = stock - :qty'DynamoDB evaluates the condition and applies the update atomically in a single operation. Two concurrent orders each requesting the last unit cannot both succeed β one will get a ConditionalCheckFailedException. No distributed locks, no race conditions.
π‘ Best-Effort Event Emission β Why EmitOrderPlaced has no compensation
The EmitOrderPlaced saga step has retries but no Catch block. If EventBridge publishing fails after 3 attempts, the saga ends β but the order is already CONFIRMED in DynamoDB. The database is the authoritative source of truth, not the event. Downstream consumers are designed for eventual consistency.
π infrastructure/
βββ π main.tf # AWS provider, region, default tags
βββ π variables.tf # Input variables (region, env, project name)
βββ π var.tfvars # Variable values
βββ π dynamodb.tf # 3 DynamoDB tables
βββ π sqs.tf # Order queue + dead letter queue
βββ π lambda.tf # 11 Lambda functions + SQS event mapping
βββ π lambda_layer.tf # Shared dependencies layer
βββ π api_gateway.tf # REST API, routes, JSON Schema validation
βββ π step_functions.tf # Express state machine (saga)
βββ π eventbridge.tf # Custom event bus + rules + log target
βββ οΏ½ cloudwatch.tf # Dashboard, alarms, SNS topic
βββ οΏ½π asl/
β βββ π order_saga.asl.json # Amazon States Language definition
βββ π iam/
β βββ π policies/ # Per-Lambda IAM policy modules
βββ π outputs.tf # Terraform outputs
π backend/
βββ π layers/shared-deps/
β βββ π package.json # Powertools, AWS SDK v3, @middy/core
β βββ π build_layer.sh # Build + zip script
β βββ π nodejs/lib/
β βββ π dynamodb.mjs # DynamoDB DocumentClient (X-Ray traced)
β βββ π sqs.mjs # SQS client (X-Ray traced)
β βββ π sfn.mjs # Step Functions client (X-Ray traced)
β βββ π eventbridge.mjs # EventBridge client (X-Ray traced)
β βββ π response.mjs # HTTP response helpers
βββ π lambdas/orders/
βββ π createOrder/ # API β validate, write, enqueue
βββ π getOrder/ # API β read order by ID
βββ π processOrder/ # SQS β start saga execution
βββ π replayDlq/ # Ops β drain DLQ to main queue
βββ π reserveInventory/ # Saga β atomic stock decrement
βββ π releaseInventory/ # Saga β compensation: restore stock
βββ π processPayment/ # Saga β simulated payment
βββ π refundPayment/ # Saga β compensation: log refund
βββ π confirmOrder/ # Saga β status β CONFIRMED
βββ π failOrder/ # Saga β status β FAILED
βββ π emitEvent/ # Saga β publish OrderPlaced event
π scripts/
βββ π seed_inventory.sh # Seeds 10 sample products
- AWS CLI configured with valid credentials
- Terraform β₯ 1.5.0
- Node.js 20.x
- jq (for seed script)
# 1οΈβ£ Build the Lambda Layer
cd backend/layers/shared-deps && ./build_layer.sh
# 2οΈβ£ Initialize and deploy infrastructure
cd ../../../infrastructure
terraform init
terraform plan -var-file=var.tfvars
terraform apply -var-file=var.tfvars
# 3οΈβ£ Seed inventory data (10 products)
cd .. && ./scripts/seed_inventory.sh# Create an order
curl -X POST https://<api-id>.execute-api.eu-west-1.amazonaws.com/dev/orders \
-H "Content-Type: application/json" \
-d '{
"userId": "user-123",
"items": [
{ "productId": "PROD-001", "qty": 2 },
{ "productId": "PROD-003", "qty": 1 }
],
"totalAmount": 109.97
}'
# Check order status
curl https://<api-id>.execute-api.eu-west-1.amazonaws.com/dev/orders/<orderId>Every Lambda is instrumented with AWS Lambda Powertools:
| Tool | What It Provides |
|---|---|
| π Logger | Structured JSON logs with orderId correlation across all functions |
| π Tracer | X-Ray tracing β every AWS SDK call appears in the service map |
| π Metrics | Custom CloudWatch metrics: OrderCreated, PaymentFailed, InventoryReserved, etc. |
All AWS SDK clients in the shared layer are wrapped with tracer.captureAWSv3Client(), so the full request journey β from API Gateway through Lambda, DynamoDB, SQS, Step Functions, and EventBridge β is visible as a single distributed trace in X-Ray.
A unified dashboard (dev-ser-ord-sys-dashboard) provides real-time visibility across 6 widget rows:
| Row | Widgets |
|---|---|
| 1 | API Gateway β request count, latency percentiles (P50/P95/P99), 4xx/5xx errors |
| 2 | Step Functions β saga success vs failure, execution duration, throttles |
| 3 | SQS β messages sent/received/deleted, DLQ depth, oldest message age |
| 4 | Lambda duration P95 β API functions and saga step functions |
| 5 | Lambda errors and concurrent executions across all functions |
| 6 | Custom Powertools metrics β order lifecycle and payment/inventory events |
| Alarm | Condition | Action |
|---|---|---|
| DLQ Depth | > 10 messages visible | SNS notification |
| Saga Failure Rate | > 30% over 5 minutes | SNS notification |
| API 5xx Rate | > 5% over 5 minutes | SNS notification |
All alarms send to the dev-ser-ord-sys-alarms SNS topic. Subscribe an email or Slack webhook to receive alerts.
| Document | Description |
|---|---|
| π Build Order | Phased implementation plan with dependency matrix |
| π Project Details | Full architecture documentation |
Built with β€οΈ on AWS Serverless
Terraform β’ Lambda β’ Step Functions β’ DynamoDB β’ SQS β’ EventBridge β’ API Gateway
