Add Error Handling & Retry Patterns category with 7 new patterns by joshmsmith · Pull Request #5 · temporal-sa/temporal-design-patterns

joshmsmith · 2026-05-13T19:20:10Z

Summary

Adds a new Error Handling & Retry Patterns category to the catalog with 7 new patterns and an overview page, along with sandbox runner implementations for each.

New Patterns

| Pattern | Description |
| --- | --- |
| Error Handling & Retry Patterns Overview | Category overview with decision tree for choosing the right strategy |
| Fixed Count of Retries | Cap retry attempts to control cost for paid/limited resources |
| Fixed Wall-Time Retries | Bound total elapsed retry time to enforce business SLAs |
| Non-Retryable Errors | Fail fast on errors that will never succeed (validation, missing records) |
| Delayed Retry | Override next retry interval using `nextRetryDelay` on `ApplicationFailure` |
| Fast/Slow Retries | Aggressive short-interval retries first, then shift to long-interval |
| Retry Alerting via Metrics | Emit custom metrics when attempt count crosses a threshold |
| Resumable Activity | Park workflow after retries exhausted, resume after human signal |

Sandbox Runner Implementations

Each pattern includes sandbox code in TypeScript and Python, except Delayed Retry, which ships TypeScript and Java only; Resumable Activity also includes Java:

sandbox-runner/patterns/delayed-retry/ — TypeScript + Java
sandbox-runner/patterns/fast-slow-retries/ — TypeScript + Python
sandbox-runner/patterns/fixed-count-retries/ — TypeScript + Python
sandbox-runner/patterns/fixed-wall-time-retries/ — TypeScript + Python
sandbox-runner/patterns/non-retryable-errors/ — TypeScript + Python
sandbox-runner/patterns/resumable-activity/ — TypeScript + Python + Java
sandbox-runner/patterns/retry-metrics/ — TypeScript + Python

Other Changes

docs/index.md — Added pattern tiles for all new patterns and missing overview tiles for existing categories
docs/.vitepress/config.mts — Added Error Handling & Retry Patterns sidebar section

- Add overview page with pattern grid and decision tree
- Add Fixed Count of Retries pattern
- Add Fixed Wall-Time Retries pattern
- Add Non-Retryable Errors pattern
- Add Delayed Retry pattern (per-error nextRetryDelay)
- Add Fast/Slow Retries pattern
- Add Retry Alerting via Metrics pattern
- Add Resumable Activity pattern
- Move Resumable Activity from External Interaction to Error Handling
- Add pattern-page-icon to all new pattern pages
- Update sidebar in config.mts

@lainecsmith

…dbox samples
- delayed retry
- fast-slow
- fixed count
- fixed wall time
- non-retryable
- resumable
- custom metrics at high error count

## added TLDR for these and the QoS patterns
- per @lainecsmith's feedback

taonic · 2026-05-14T21:59:32Z

+        except ActivityError as e:
+            # All retries exhausted — handle the failure here.
+            # Options: alert on-call, trigger a compensation activity, or escalate to a human.
+            workflow.logger.error(f"Payment failed after 3 attempts: {e}")


Maybe worth demonstrating how to narrow down to the specific error type: https://python.temporal.io/temporalio.exceptions.RetryState.html#MAXIMUM_ATTEMPTS_REACHED

Ooh nice idea, will do.

taonic · 2026-05-14T22:00:08Z

+                ),
+            )
+        except ActivityError:
+            workflow.logger.error(


Ditto: https://python.temporal.io/temporalio.exceptions.TimeoutType.html

taonic · 2026-05-14T22:02:58Z

+
+## Best practices
+
+- **Set both timeouts for clarity.** Use `ScheduleToCloseTimeout` as the total SLA and `StartToCloseTimeout` as a per-attempt safety valve. Omitting `StartToCloseTimeout` means a single slow response can consume the entire budget.


If we recommend this practice, we should suggest checking the specific error type when handling exceptions too.

Sounds great.

taonic · 2026-05-14T22:06:07Z

+- **Handle `ActivityError` explicitly.** When the SLA expires, Temporal delivers an error to the Workflow. Catch it to send an alert, trigger a compensation, or record a breach in an audit log.
+- **Distinguish SLA breaches from transient errors.** Inspect the error cause — a `ScheduleToCloseTimeout` breach has a specific error type that differs from an Activity application failure.
+
+## Common pitfalls


Maybe we can mention the impact of queue depth: a delayed schedule-to-start counts against the schedule-to-close timeout. This means the workers need to be provisioned sufficiently to handle spiky traffic, or use auto-scaling strategies.

Great idea, will do
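To make the queue-depth point concrete, here is a rough back-of-the-envelope sketch in plain Python (no Temporal SDK calls). The function names are hypothetical, and it assumes a fixed retry interval, a fixed per-attempt duration, and queue wait only before the first attempt; real retries can also wait in the queue, so treat the result as an upper bound.

```python
from datetime import timedelta


def remaining_retry_budget(schedule_to_close: timedelta,
                           schedule_to_start_delay: timedelta) -> timedelta:
    """Time left for attempts after the task has waited in the task queue."""
    remaining = schedule_to_close - schedule_to_start_delay
    # The budget cannot go negative: the Activity has simply timed out.
    return max(remaining, timedelta(0))


def attempts_that_fit(budget: timedelta,
                      attempt_duration: timedelta,
                      retry_interval: timedelta) -> int:
    """Count attempts that can finish inside the budget (simplified model)."""
    if attempt_duration <= timedelta(0):
        raise ValueError("attempt_duration must be positive")
    if retry_interval < timedelta(0):
        raise ValueError("retry_interval must not be negative")
    if budget < attempt_duration:
        return 0
    # The first attempt costs one duration; each retry costs interval + duration.
    extra = (budget - attempt_duration) // (attempt_duration + retry_interval)
    return 1 + extra
```

Under these assumptions, a 60-second schedule-to-close with 5-second attempts and a 10-second interval leaves room for four attempts on an idle queue, but only two if the task first waits 40 seconds for a worker.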

taonic · 2026-05-14T22:15:18Z

+
+@activity.defn
+async def process_order(order_id: str) -> str:
+    order = await db.get_order(order_id)


I'm wondering if we could simplify this example by covering a more common scenario: a single API call, e.g. checking for a 422 Unprocessable Entity validation error returned from upstream.

Oooh, good idea, working on it.

taonic · 2026-05-14T23:16:50Z

+
+### Declaring non-retryable types in the RetryPolicy
+
+Alternatively, list error type names in the `RetryPolicy` at the Workflow call site.


Maybe worth showing the code at the throw site in this approach so readers can compare it with ApplicationFailure.nonRetryable?
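For comparing the two approaches, the decision can be modeled in a few lines of plain Python. This is a simplified sketch, not SDK code: `Failure`, `Policy`, and `will_retry` are hypothetical stand-ins, and the model only covers the non-retryable flag, the policy's list of non-retryable type names, and the attempt cap.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Failure:
    """Stand-in for an application failure raised by an Activity."""
    type: str
    non_retryable: bool = False  # set at the throw site


@dataclass(frozen=True)
class Policy:
    """Stand-in for the retry policy set at the Workflow call site."""
    maximum_attempts: int = 0  # 0 is treated as "no cap" in this model
    non_retryable_error_types: FrozenSet[str] = frozenset()


def will_retry(failure: Failure, policy: Policy, attempt: int) -> bool:
    """Decide whether another attempt follows the given (1-based) attempt."""
    if failure.non_retryable:
        return False  # the Activity itself said "do not retry"
    if failure.type in policy.non_retryable_error_types:
        return False  # the caller listed this type as non-retryable
    if policy.maximum_attempts and attempt >= policy.maximum_attempts:
        return False  # attempt cap reached
    return True
```

Either route stops the retries; the difference is who owns the decision. The flag keeps it with the Activity author, while the type list lets each Workflow call site decide for itself.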

----
* fixed-count-retries: Import RetryState and check MAXIMUM_ATTEMPTS_REACHED in workflow catch blocks (Python, Go, Java, TypeScript) and sandbox samples
* fixed-wall-time-retries: Import TimeoutError/TimeoutType and check SCHEDULE_TO_CLOSE in workflow catch blocks (all 4 languages) and sandbox samples; update "Distinguish SLA breaches" best practice to reference the specific enum; add new pitfall for ScheduleToStart queue-depth eating into the ScheduleToClose budget
* non-retryable-errors: Simplify activity examples from DB-lookup/payment-service to a single HTTP call (404 -> OrderNotFoundError, 422 -> ValidationError); add Activity throw-site code to the "Declaring non-retryable types in the RetryPolicy" section to contrast plain ApplicationError vs nonRetryable

joshmsmith · 2026-05-15T21:15:21Z

Love the feedback! Great suggestions. I think I implemented it all, but let me know if I missed anything.
Really appreciate the review.

taonic · 2026-05-17T01:05:39Z

+The Delayed Retry pattern overrides the next retry interval for a specific failure by throwing an `ApplicationFailure` with a `nextRetryDelay` field set from inside the Activity.
+Use it when a particular error carries information about how long to wait before retrying — such as a rate-limit response with a `Retry-After` header, or a known maintenance window with a fixed end time.
+
+> **SDK availability:** `nextRetryDelay` is supported in the **Java**, **TypeScript**, and **Rust** SDKs. 


Pretty sure this is supported in all SDKs, e.g.
https://pkg.go.dev/go.temporal.io/sdk@v1.42.0/internal#ApplicationError.NextRetryDelay

taonic · 2026-05-17T01:50:25Z

+- A downstream system returns an error message saying "maintenance until 02:00 UTC" — a precise, known delay.
+- A database error includes a lock timeout duration that indicates when the resource will be available.
+
+With a global `RetryPolicy`, you have two bad options: set a short interval and retry too early (wasting quota and adding load), or set a long interval and wait longer than necessary.


Maybe change "bad" to less negative language?

taonic · 2026-05-17T01:53:39Z

+    const response = await fetch(endpoint);
+
+    if (response.status === 429) {
+        const retryAfter = parseInt(response.headers.get('Retry-After') ?? '60', 10);


Maybe only use retryAfter if the response header is set, and fall back to the retry policy if not, instead of a blanket 60s default?
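One way to express that suggestion is a helper that returns a delay only when the header is present and parseable, and `None` otherwise. The sketch below is plain Python with a hypothetical function name; it accepts both the delay-in-seconds and HTTP-date forms of `Retry-After` and makes no Temporal SDK calls.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_delay(header_value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[timedelta]:
    """Return the server-requested delay, or None if there is no usable hint."""
    if header_value is None:
        return None
    value = header_value.strip()
    if not value:
        return None
    if value.isascii() and value.isdigit():
        # Delay given as a whole number of seconds.
        return timedelta(seconds=int(value))
    try:
        # Delay given as an HTTP-date.
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable hint: let the retry policy decide
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative delay.
    return max(when - now, timedelta(0))
```

When the helper returns `None`, the Activity would raise its failure without a next-retry-delay override, so the interval from the RetryPolicy applies as usual.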

taonic · 2026-05-17T01:58:11Z

+
+- **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically.
+- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
+- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.


The schedule-to-close timeout is governed by the timeout settings, not the RetryPolicy. Also, maybe we should drop the fallback interval since it will be overridden by the retry delay.

taonic · 2026-05-17T02:02:59Z

+- **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically.
+- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
+- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.
+- **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time.


It's not clear how to show the delay value in the Workflow history. Maybe we can suggest including the Retry-After value in the ApplicationFailure's message for clarity.

taonic · 2026-05-17T02:05:17Z

+- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
+- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.
+- **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time.
+- **Set `startToCloseTimeout` to be more than the downstream call typically takes.** You want the downstream call to fail before the Activity times out. For example, for an HTTP request that times out after 30s, have the `startToCloseTimeout` be 45 or 60 seconds.


This is a good practice for startToCloseTimeout and the activity's execution time, but it is less relevant here than scheduleToCloseTimeout, which will be impacted by the prolonged retryAfter backoff.
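The interaction with scheduleToCloseTimeout can be sketched as a simple budget check before an override is applied. This is plain Python with a hypothetical helper name; it assumes the Activity can estimate how much of the schedule-to-close window has elapsed and how long one attempt usually takes.

```python
from datetime import timedelta
from typing import Optional


def usable_retry_delay(requested: timedelta,
                       elapsed: timedelta,
                       schedule_to_close: timedelta,
                       expected_attempt: timedelta) -> Optional[timedelta]:
    """Return the requested delay if a retry still fits in the budget, else None."""
    remaining = schedule_to_close - elapsed
    # The retry must wait out the delay and then still have time to run.
    if requested + expected_attempt > remaining:
        return None
    return requested
```

A `None` result is a signal that waiting is pointless: the Activity could fail as non-retryable or surface the condition to the Workflow rather than schedule a retry that would expire first.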

taonic · 2026-05-17T02:06:06Z

+- **Assuming `nextRetryDelay` persists across all retries.** It only applies to the immediate next retry. If the following attempt also fails without setting `nextRetryDelay`, the RetryPolicy interval resumes.
+- **Setting `nextRetryDelay` longer than `ScheduleToCloseTimeout`.** If the override delay exceeds the remaining `ScheduleToCloseTimeout` budget, the retry will never execute — Temporal will expire the Activity before the delay elapses.
+- **Using `nextRetryDelay` in Go or Python.** This feature is not available in those SDKs. For Go and Python, use `BackoffCoefficient=1.0` with a fixed `InitialInterval` to approximate a fixed delay.
+- **Setting `startToCloseTimeout` to be less than the downstream call typically takes.** If you set the `startToCloseTimeout` to be less than your call timeout, the Activity will be failed (and usually retried) even though the downstream call may have succeeded at the last moment.


Suggest removing this.

taonic · 2026-05-17T02:17:47Z

+## Best practices
+
+- **Use a bounded `MaximumAttempts` before parking.** Allow a few automatic retries to recover from transient failures. Parking immediately on the first failure forces operators to intervene for problems that would have resolved on their own.
+- **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections.


If the goal is to avoid hitting the history limit, maybe we can suggest Continue-As-New (CAN) instead of forcing an arbitrary correction counter?

taonic · 2026-05-17T02:19:39Z

+        while True:
+            self._status = "TRANSFERRING"
+            try:
+                result = await workflow.execute_activity(


Shall we show the implementation of executeTransfer to highlight the need to throw a non-retryable error as part of this pattern?

taonic · 2026-05-17T02:21:47Z

+- **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections.
+- **Expose status via a Query method.** The `getStatus` Query gives operations tooling visibility into where the Workflow is parked without requiring access to the Workflow history.
+- **Validate the correction in the Signal handler.** Check that the corrected account is non-empty and matches the expected format before setting the state. An invalid correction just parks the Workflow again, but a clear error message helps operators.
+- **Log at every state transition.** The `AWAITING_CORRECTION` and `AWAITING_APPROVAL` states can last hours or days. Structured log lines at each transition make the audit trail clear.


For better visibility, we could suggest using search attributes to capture the state for operators to query from.

joshmsmith added 2 commits May 11, 2026 17:09

taonic reviewed May 14, 2026


taonic reviewed May 17, 2026
