Add Error Handling & Retry Patterns category with 7 new patterns #5
joshmsmith wants to merge 3 commits into
- Add overview page with pattern grid and decision tree
- Add Fixed Count of Retries pattern
- Add Fixed Wall-Time Retries pattern
- Add Non-Retryable Errors pattern
- Add Delayed Retry pattern (per-error nextRetryDelay)
- Add Fast/Slow Retries pattern
- Add Retry Alerting via Metrics pattern
- Add Resumable Activity pattern
- Move Resumable Activity from External Interaction to Error Handling
- Add pattern-page-icon to all new pattern pages
- Update sidebar in config.mts
…dbox samples
- delayed retry
- fast-slow
- fixed count
- fixed wall time
- non-retryable
- resumable
- custom metrics at high error count

## added TLDR for these and the QoS patterns
- per @lainecsmith's feedback
| except ActivityError as e: | ||
| # All retries exhausted — handle the failure here. | ||
| # Options: alert on-call, trigger a compensation activity, or escalate to a human. | ||
| workflow.logger.error(f"Payment failed after 3 attempts: {e}") |
Maybe worth demonstrating how to narrow down to the specific error type: https://python.temporal.io/temporalio.exceptions.RetryState.html#MAXIMUM_ATTEMPTS_REACHED
Ooh nice idea, will do.
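A minimal sketch of the narrowing suggested above. The `RetryState` enum here is a local stand-in that mirrors a few member names of `temporalio.exceptions.RetryState` so the block runs without the SDK; `describe_failure` is a hypothetical helper, and it assumes the Workflow reads the state from the caught `ActivityError` (for example via a `retry_state` attribute).

```python
from enum import Enum, auto
from typing import Optional


class RetryState(Enum):
    """Local stand-in for a subset of temporalio.exceptions.RetryState (assumption)."""

    MAXIMUM_ATTEMPTS_REACHED = auto()
    TIMEOUT = auto()
    NON_RETRYABLE_FAILURE = auto()


def describe_failure(retry_state: Optional[RetryState]) -> str:
    """Pick a handling message based on why the Activity stopped retrying."""
    if retry_state is RetryState.MAXIMUM_ATTEMPTS_REACHED:
        return "retries exhausted: alert on-call or compensate"
    if retry_state is RetryState.TIMEOUT:
        return "timed out before retries were exhausted"
    if retry_state is RetryState.NON_RETRYABLE_FAILURE:
        return "non-retryable failure: retrying would not help"
    # Unknown or missing state: keep the generic handling path.
    return "unclassified failure"
```

Inside the `except ActivityError as e:` block, the Workflow would pass the state from `e` to this helper and log the result, instead of assuming every failure means "after 3 attempts".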
| ), | ||
| ) | ||
| except ActivityError: | ||
| workflow.logger.error( |
|
|
||
| ## Best practices | ||
|
|
||
| - **Set both timeouts for clarity.** Use `ScheduleToCloseTimeout` as the total SLA and `StartToCloseTimeout` as a per-attempt safety valve. Omitting `StartToCloseTimeout` means a single slow response can consume the entire budget. |
If we recommend this practice, we should suggest checking the specific error type when handling exceptions too.
| - **Handle `ActivityError` explicitly.** When the SLA expires, Temporal delivers an error to the Workflow. Catch it to send an alert, trigger a compensation, or record a breach in an audit log. | ||
| - **Distinguish SLA breaches from transient errors.** Inspect the error cause — a `ScheduleToCloseTimeout` breach has a specific error type that differs from an Activity application failure. | ||
|
|
||
| ## Common pitfalls |
Maybe we can mention the impact of queue depth: a delayed schedule-to-start eats into the schedule-to-close timeout. This means the workers need to be provisioned sufficiently to handle spiky traffic, or an auto-scaling strategy needs to be in place.
Great idea, will do
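The queue-depth effect can be shown with plain arithmetic. This is a sketch under simplifying assumptions: only the first attempt waits in the queue, every attempt takes its worst-case duration, and the retry interval is fixed. The function names are hypothetical and not part of any SDK.

```python
from datetime import timedelta


def execution_budget(schedule_to_close: timedelta, queue_wait: timedelta) -> timedelta:
    """Time left for attempts and backoff once the task leaves the queue."""
    remaining = schedule_to_close - queue_wait
    return max(remaining, timedelta(0))


def attempts_that_fit(
    budget: timedelta, per_attempt: timedelta, retry_interval: timedelta
) -> int:
    """Count worst-case attempts that complete inside the budget."""
    if per_attempt <= timedelta(0):
        raise ValueError("per_attempt must be positive")
    if budget < per_attempt:
        return 0
    # The first attempt costs per_attempt; each later one also pays the interval.
    return 1 + (budget - per_attempt) // (per_attempt + retry_interval)
```

With a 60-second schedule-to-close budget, 10-second attempts, and a 5-second interval, four attempts fit when the queue is empty, but only two fit after a 25-second wait in the queue.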
|
|
||
| @activity.defn | ||
| async def process_order(order_id: str) -> str: | ||
| order = await db.get_order(order_id) |
I'm wondering if we could simplify this example by covering a more common scenario: a single API call, e.g. checking for a 422 Unprocessable Entity validation error returned from upstream.
Oooh, good idea, working on it.
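One way the single-API-call version could decide what to raise is sketched below. The 404 and 422 mappings follow the error names used in this PR (`OrderNotFoundError`, `ValidationError`); `FailureSpec`, `classify_status`, and the `UpstreamError` name are hypothetical and only illustrate the decision, not the Activity code itself.

```python
from typing import NamedTuple, Optional


class FailureSpec(NamedTuple):
    """What the Activity should raise for a given HTTP status (hypothetical type)."""

    error_type: str
    non_retryable: bool


def classify_status(status: int) -> Optional[FailureSpec]:
    """Return None on success, otherwise the failure type and its retryability."""
    if 200 <= status < 300:
        return None
    if status == 404:
        # The order does not exist; retrying cannot change that.
        return FailureSpec("OrderNotFoundError", True)
    if status == 422:
        # Upstream rejected the payload as invalid; the same input fails again.
        return FailureSpec("ValidationError", True)
    # Anything else (5xx, 429, ...) is treated as transient in this sketch.
    return FailureSpec("UpstreamError", False)
```

In the Activity, the returned spec would feed the type name and non-retryable flag of the application failure that gets raised; a `None` result means the call succeeded.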
|
|
||
| ### Declaring non-retryable types in the RetryPolicy | ||
|
|
||
| Alternatively, list error type names in the `RetryPolicy` at the Workflow call site. |
Maybe worth showing the code at the throw site in this approach so readers can compare it with ApplicationFailure.nonRetryable?
----
* fixed-count-retries: Import RetryState and check MAXIMUM_ATTEMPTS_REACHED in workflow catch blocks (Python, Go, Java, TypeScript) and sandbox samples
* fixed-wall-time-retries: Import TimeoutError/TimeoutType and check SCHEDULE_TO_CLOSE in workflow catch blocks (all 4 languages) and sandbox samples; update "Distinguish SLA breaches" best practice to reference the specific enum; add new pitfall for ScheduleToStart queue-depth eating into the ScheduleToClose budget
* non-retryable-errors: Simplify activity examples from DB-lookup/payment-service to a single HTTP call (404 -> OrderNotFoundError, 422 -> ValidationError); add Activity throw-site code to the "Declaring non-retryable types in the RetryPolicy" section to contrast plain ApplicationError vs nonRetryable
|
Love the feedback! Great suggestions. I think I implemented it all, but let me know if I missed anything.
| The Delayed Retry pattern overrides the next retry interval for a specific failure by throwing an `ApplicationFailure` with a `nextRetryDelay` field set from inside the Activity. | ||
| Use it when a particular error carries information about how long to wait before retrying — such as a rate-limit response with a `Retry-After` header, or a known maintenance window with a fixed end time. | ||
|
|
||
| > **SDK availability:** `nextRetryDelay` is supported in the **Java**, **TypeScript**, and **Rust** SDKs. |
Pretty sure this is supported in all SDKs, e.g.
https://pkg.go.dev/go.temporal.io/sdk@v1.42.0/internal#ApplicationError.NextRetryDelay
| - A downstream system returns an error message saying "maintenance until 02:00 UTC" — a precise, known delay. | ||
| - A database error includes a lock timeout duration that indicates when the resource will be available. | ||
|
|
||
| With a global `RetryPolicy`, you have two bad options: set a short interval and retry too early (wasting quota and adding load), or set a long interval and wait longer than necessary. |
Maybe change "bad" to less negative language?
| const response = await fetch(endpoint); | ||
|
|
||
| if (response.status === 429) { | ||
| const retryAfter = parseInt(response.headers.get('Retry-After') ?? '60', 10); |
Maybe only use retryAfter if the response header is set, and fall back to the retry policy if not, instead of a blanket 60s?
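A sketch of the header-only behaviour, written in Python with the standard library. `retry_after_delay` is a hypothetical helper: it returns a delay only when the `Retry-After` header is present and parseable (either delta-seconds or an HTTP-date), and returns `None` otherwise so the caller can leave the next retry delay unset and let the RetryPolicy decide.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_delay(
    header_value: Optional[str], now: Optional[datetime] = None
) -> Optional[timedelta]:
    """Parse Retry-After; None means 'fall back to the RetryPolicy'."""
    if header_value is None:
        return None
    value = header_value.strip()
    # Form 1: delta-seconds, e.g. "120".
    if value.isascii() and value.isdigit():
        return timedelta(seconds=int(value))
    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # Unparseable header: do not guess a delay.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(when - current, timedelta(0))
```

The Activity would then set the per-error delay only when this returns a value, and raise a plain retryable failure when it returns `None`.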
|
|
||
| - **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically. | ||
| - **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally. | ||
| - **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals. |
The schedule-to-close timeout is governed by the Activity timeout settings, not the RetryPolicy. Also, maybe we should drop the fallback interval since it will be overridden by the retry delay.
| - **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically. | ||
| - **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally. | ||
| - **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals. | ||
| - **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time. |
It's not clear how to show the delay value in the Workflow history. Maybe we can suggest adding the Retry-After value to the ApplicationFailure's message for clarity.
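One possible shape for such a message, as a sketch: `rate_limit_message` is a hypothetical helper, and the wording of the message is an assumption rather than anything the pattern page prescribes. The failure message is recorded with the failed attempt, so putting the delay and its source there makes the unusual wait self-explanatory.

```python
from datetime import timedelta


def rate_limit_message(
    endpoint: str, delay: timedelta, source: str = "Retry-After header"
) -> str:
    """Build a failure message that records the delay and where it came from."""
    seconds = int(delay.total_seconds())
    return f"Rate limited by {endpoint}; retrying in {seconds}s (from {source})"
```

The resulting string would be passed as the message of the application failure that also carries the next retry delay.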
| - **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally. | ||
| - **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals. | ||
| - **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time. | ||
| - **Set `startToCloseTimeout` to be more than the downstream call typically takes.** You want the downstream call to fail before the Activity times out. For example, for an HTTP request that times out after 30s, have the `startToCloseTimeout` be 45 or 60 seconds. |
This is a good practice for startToCloseTimeout and the Activity's execution time, but it is less relevant here than scheduleToCloseTimeout, which is affected by the prolonged retryAfter backoff.
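The interaction with the schedule-to-close budget can be checked with simple arithmetic. This sketch assumes the Activity knows roughly how much of the budget has elapsed; `time_left_after_delay` and `retry_can_run` are hypothetical helper names, not SDK functions.

```python
from datetime import timedelta


def time_left_after_delay(
    schedule_to_close: timedelta, elapsed: timedelta, next_retry_delay: timedelta
) -> timedelta:
    """Budget remaining once the overridden retry delay has been served."""
    return schedule_to_close - elapsed - next_retry_delay


def retry_can_run(
    schedule_to_close: timedelta,
    elapsed: timedelta,
    next_retry_delay: timedelta,
    expected_attempt: timedelta = timedelta(0),
) -> bool:
    """True if the delayed retry still has room for one attempt of the expected length."""
    left = time_left_after_delay(schedule_to_close, elapsed, next_retry_delay)
    return left > expected_attempt
```

For example, with a 5-minute schedule-to-close timeout and 1 minute already elapsed, a 120-second delay leaves 2 minutes for the next attempt, while a 300-second delay leaves nothing and the Activity would time out before the retry starts.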
| - **Assuming `nextRetryDelay` persists across all retries.** It only applies to the immediate next retry. If the following attempt also fails without setting `nextRetryDelay`, the RetryPolicy interval resumes. | ||
| - **Setting `nextRetryDelay` longer than `ScheduleToCloseTimeout`.** If the override delay exceeds the remaining `ScheduleToCloseTimeout` budget, the retry will never execute — Temporal will expire the Activity before the delay elapses. | ||
| - **Using `nextRetryDelay` in Go or Python.** This feature is not available in those SDKs. For Go and Python, use `BackoffCoefficient=1.0` with a fixed `InitialInterval` to approximate a fixed delay. | ||
| - **Setting `startToCloseTimeout` to be less than the downstream call typically takes.** If you set the `startToCloseTimeout` to be less than your call timeout, the Activity will be failed (and usually retried) even though the downstream call may have succeeded at the last moment. |
| ## Best practices | ||
|
|
||
| - **Use a bounded `MaximumAttempts` before parking.** Allow a few automatic retries to recover from transient failures. Parking immediately on the first failure forces operators to intervene for problems that would have resolved on their own. | ||
| - **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections. |
If the goal is to avoid hitting the history limit, maybe we can suggest Continue-As-New (CAN) instead of forcing an arbitrary correction counter?
| while True: | ||
| self._status = "TRANSFERRING" | ||
| try: | ||
| result = await workflow.execute_activity( |
Shall we show the implementation of executeTransfer to highlight the need to throw a non-retryable error as part of this pattern?
| - **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections. | ||
| - **Expose status via a Query method.** The `getStatus` Query gives operations tooling visibility into where the Workflow is parked without requiring access to the Workflow history. | ||
| - **Validate the correction in the Signal handler.** Check that the corrected account is non-empty and matches the expected format before setting the state. An invalid correction just parks the Workflow again, but a clear error message helps operators. | ||
| - **Log at every state transition.** The `AWAITING_CORRECTION` and `AWAITING_APPROVAL` states can last hours or days. Structured log lines at each transition make the audit trail clear. |
For better visibility, we could suggest using search attributes to capture the state so operators can query on it.
Summary
Adds a new Error Handling & Retry Patterns category to the catalog with 7 new patterns and an overview page, along with sandbox runner implementations for each.
New Patterns
nextRetryDelay on ApplicationFailure
Sandbox Runner Implementations
Each pattern includes sandbox code (TypeScript + Python; Java for Delayed Retry and Resumable Activity):
- sandbox-runner/patterns/delayed-retry/: TypeScript + Java
- sandbox-runner/patterns/fast-slow-retries/: TypeScript + Python
- sandbox-runner/patterns/fixed-count-retries/: TypeScript + Python
- sandbox-runner/patterns/fixed-wall-time-retries/: TypeScript + Python
- sandbox-runner/patterns/non-retryable-errors/: TypeScript + Python
- sandbox-runner/patterns/resumable-activity/: TypeScript + Python + Java
- sandbox-runner/patterns/retry-metrics/: TypeScript + Python
Other Changes
- docs/index.md: Added pattern tiles for all new patterns and missing overview tiles for existing categories
- docs/.vitepress/config.mts: Added Error Handling & Retry Patterns sidebar section