Add Error Handling & Retry Patterns category with 7 new patterns #5

Open
joshmsmith wants to merge 3 commits into temporal-sa:main from joshmsmith:error-handling
@joshmsmith
Contributor

Summary

Adds a new Error Handling & Retry Patterns category to the catalog with 7 new patterns and an overview page, along with sandbox runner implementations for each.

New Patterns

| Pattern | Description |
| --- | --- |
| Error Handling & Retry Patterns Overview | Category overview with decision tree for choosing the right strategy |
| Fixed Count of Retries | Cap retry attempts to control cost for paid/limited resources |
| Fixed Wall-Time Retries | Bound total elapsed retry time to enforce business SLAs |
| Non-Retryable Errors | Fail fast on errors that will never succeed (validation, missing records) |
| Delayed Retry | Override next retry interval using nextRetryDelay on ApplicationFailure |
| Fast/Slow Retries | Aggressive short-interval retries first, then shift to long-interval |
| Retry Alerting via Metrics | Emit custom metrics when attempt count crosses a threshold |
| Resumable Activity | Park workflow after retries exhausted, resume after human signal |

Sandbox Runner Implementations

Each pattern includes TypeScript sandbox code. All except Delayed Retry also include Python, and Delayed Retry and Resumable Activity include Java:

  • sandbox-runner/patterns/delayed-retry/ — TypeScript + Java
  • sandbox-runner/patterns/fast-slow-retries/ — TypeScript + Python
  • sandbox-runner/patterns/fixed-count-retries/ — TypeScript + Python
  • sandbox-runner/patterns/fixed-wall-time-retries/ — TypeScript + Python
  • sandbox-runner/patterns/non-retryable-errors/ — TypeScript + Python
  • sandbox-runner/patterns/resumable-activity/ — TypeScript + Python + Java
  • sandbox-runner/patterns/retry-metrics/ — TypeScript + Python

Other Changes

  • docs/index.md — Added pattern tiles for all new patterns and missing overview tiles for existing categories
  • docs/.vitepress/config.mts — Added Error Handling & Retry Patterns sidebar section

- Add overview page with pattern grid and decision tree
- Add Fixed Count of Retries pattern
- Add Fixed Wall-Time Retries pattern
- Add Non-Retryable Errors pattern
- Add Delayed Retry pattern (per-error nextRetryDelay)
- Add Fast/Slow Retries pattern
- Add Retry Alerting via Metrics pattern
- Add Resumable Activity pattern
- Move Resumable Activity from External Interaction to Error Handling
- Add pattern-page-icon to all new pattern pages
- Update sidebar in config.mts
…dbox samples

- delayed retry
- fast-slow
- fixed count
- fixed wall time
- non-retryable
- resumable
- custom metrics at high error count

## added TLDR for these and the QoS patterns
- per @lainecsmith's feedback
Comment thread docs/fixed-count-retries.md Outdated
except ActivityError as e:
    # All retries exhausted; handle the failure here.
    # Options: alert on-call, trigger a compensation activity, or escalate to a human.
    workflow.logger.error(f"Payment failed after 3 attempts: {e}")
Collaborator

Maybe worth demonstrating how to narrow down to the specific error type: https://python.temporal.io/temporalio.exceptions.RetryState.html#MAXIMUM_ATTEMPTS_REACHED

Contributor Author

Ooh nice idea, will do.
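
One way to picture the narrowing suggested in this thread: branch on the retry state carried by the error instead of treating every `ActivityError` alike. The sketch below uses small stand-in classes so it runs without the SDK. In a real Workflow, `ActivityError` and `RetryState` would be imported from `temporalio.exceptions`, where the error exposes `retry_state`. The helper name `failure_reason` and its return labels are hypothetical.

```python
from enum import Enum, auto
from typing import Optional


class RetryState(Enum):
    """Stand-in for temporalio.exceptions.RetryState (subset, arbitrary values)."""
    NON_RETRYABLE_FAILURE = auto()
    TIMEOUT = auto()
    MAXIMUM_ATTEMPTS_REACHED = auto()


class ActivityError(Exception):
    """Stand-in for temporalio.exceptions.ActivityError."""

    def __init__(self, message: str, retry_state: Optional[RetryState] = None):
        super().__init__(message)
        self.retry_state = retry_state


def failure_reason(err: ActivityError) -> str:
    """Hypothetical helper: map the retry state to a handling decision."""
    if err.retry_state == RetryState.MAXIMUM_ATTEMPTS_REACHED:
        return "retries-exhausted"  # alert, compensate, or escalate
    if err.retry_state == RetryState.NON_RETRYABLE_FAILURE:
        return "non-retryable"      # failed fast; more attempts would not help
    return "other"                  # timeout, cancellation, and so on
```

In the quoted `except ActivityError as e:` block, the "Payment failed after 3 attempts" log line would then only be emitted on the `retries-exhausted` branch.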

Comment thread docs/fixed-wall-time-retries.md Outdated
        ),
    )
except ActivityError:
    workflow.logger.error(
Collaborator

Contributor Author

+1 will do


## Best practices

- **Set both timeouts for clarity.** Use `ScheduleToCloseTimeout` as the total SLA and `StartToCloseTimeout` as a per-attempt safety valve. Omitting `StartToCloseTimeout` means a single slow response can consume the entire budget.
Collaborator

If we recommend this practice, we should suggest checking the specific error type when handling exceptions too.

Contributor Author

Sounds great.

- **Handle `ActivityError` explicitly.** When the SLA expires, Temporal delivers an error to the Workflow. Catch it to send an alert, trigger a compensation, or record a breach in an audit log.
- **Distinguish SLA breaches from transient errors.** Inspect the error cause — a `ScheduleToCloseTimeout` breach has a specific error type that differs from an Activity application failure.
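
A minimal sketch of the "inspect the error cause" check, assuming the Python SDK shape in which the timeout arrives as the cause of the `ActivityError` and carries a timeout type. The classes here are stand-ins so the snippet runs without the SDK; `TemporalTimeoutError` stands in for `temporalio.exceptions.TimeoutError`, and `is_sla_breach` is a hypothetical helper name.

```python
from enum import Enum, auto
from typing import Optional


class TimeoutType(Enum):
    """Stand-in for temporalio.exceptions.TimeoutType (arbitrary values)."""
    START_TO_CLOSE = auto()
    SCHEDULE_TO_START = auto()
    SCHEDULE_TO_CLOSE = auto()
    HEARTBEAT = auto()


class TemporalTimeoutError(Exception):
    """Stand-in for temporalio.exceptions.TimeoutError."""

    def __init__(self, message: str, type: TimeoutType):
        super().__init__(message)
        self.type = type


class ActivityError(Exception):
    """Stand-in for temporalio.exceptions.ActivityError."""

    @property
    def cause(self) -> Optional[BaseException]:
        return self.__cause__


def is_sla_breach(err: ActivityError) -> bool:
    """True only when the total ScheduleToClose budget ran out."""
    cause = err.cause
    return (
        isinstance(cause, TemporalTimeoutError)
        and cause.type == TimeoutType.SCHEDULE_TO_CLOSE
    )
```

A per-attempt `START_TO_CLOSE` timeout or an application failure returns `False` here, so only a genuine SLA breach would reach the alerting or audit-log path.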

## Common pitfalls
Collaborator

Maybe we can mention how queue depth delays schedule-to-start, and that this delay counts against the schedule-to-close timeout. This means the workers need to be provisioned sufficiently to handle spiky traffic, or employ auto-scaling strategies.

Contributor Author

Great idea, will do
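
To make the queue-depth point concrete, here is a rough arithmetic sketch under a deliberately simplified model: every attempt waits the same time on the task queue, runs for the same duration, and retries are spaced by a fixed interval. The function name and all numbers are illustrative assumptions, not measurements.

```python
def attempts_within_budget(
    schedule_to_close_s: int,
    queue_wait_s: int,
    attempt_s: int,
    retry_interval_s: int,
) -> int:
    """Count whole attempts that fit inside the ScheduleToClose budget.

    Assumed model: each attempt waits queue_wait_s for a worker, then runs
    for attempt_s; retries use a fixed retry_interval_s (no backoff growth).
    """
    per_attempt = queue_wait_s + attempt_s
    if per_attempt <= 0 or retry_interval_s < 0:
        raise ValueError("durations must be positive")
    # n attempts cost n * per_attempt + (n - 1) * retry_interval_s seconds,
    # so solve that inequality for the largest whole n.
    return (schedule_to_close_s + retry_interval_s) // (per_attempt + retry_interval_s)
```

With a 60-second budget, 5-second attempts, and 5-second intervals, this model allows 6 attempts on an idle queue but only 3 once each task waits 10 seconds for a worker.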

Comment thread docs/non-retryable-errors.md Outdated

@activity.defn
async def process_order(order_id: str) -> str:
    order = await db.get_order(order_id)
Collaborator

I'm wondering if we could simplify this example by covering a more common scenario: a single API call, e.g. checking for a 422 Unprocessable Entity validation error returned from upstream.

Contributor Author

Oooh, good idea, working on it.
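
A small sketch of the single-API-call scenario, showing only the decision step: which upstream status codes should fail fast. The 404 and 422 mappings follow the error names mentioned in this PR's follow-up commit; `UpstreamError`, `FailureDecision`, and `classify_status` are hypothetical names. In an Activity, the result would presumably feed the `type` and `non_retryable` arguments of the SDK's `ApplicationError`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureDecision:
    error_type: str      # error type name to report
    non_retryable: bool  # True means fail fast, skip further attempts


def classify_status(status: int) -> Optional[FailureDecision]:
    """Return None on success, otherwise how the failure should be raised."""
    if 200 <= status < 300:
        return None
    if status == 404:
        # The record does not exist; another attempt will not create it.
        return FailureDecision("OrderNotFoundError", non_retryable=True)
    if status == 422:
        # Upstream rejected the payload; the same input fails again.
        return FailureDecision("ValidationError", non_retryable=True)
    # Everything else (5xx, 429, and so on) is left to the RetryPolicy.
    return FailureDecision("UpstreamError", non_retryable=False)
```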


### Declaring non-retryable types in the RetryPolicy

Alternatively, list error type names in the `RetryPolicy` at the Workflow call site.
Collaborator

@taonic, May 14, 2026

Maybe worth showing the code on the throw site in this approach so they can compare with ApplicationFailure.nonRetryable?

Contributor Author

Love it.

----
* fixed-count-retries: Import RetryState and check MAXIMUM_ATTEMPTS_REACHED in workflow catch blocks (Python, Go, Java, TypeScript) and sandbox samples
* fixed-wall-time-retries: Import TimeoutError/TimeoutType and check SCHEDULE_TO_CLOSE in workflow catch blocks (all 4 languages) and sandbox samples; update "Distinguish SLA breaches" best practice to reference the specific enum; add new pitfall for ScheduleToStart queue-depth eating into the ScheduleToClose budget

* non-retryable-errors: Simplify activity examples from DB-lookup/payment-service to a single HTTP call (404 -> OrderNotFoundError, 422 -> ValidationError); add Activity throw-site code to the "Declaring non-retryable types in the RetryPolicy" section to contrast plain ApplicationError vs nonRetryable
@joshmsmith
Contributor Author

Love the feedback! Great suggestions. I think I implemented it all, but let me know if I missed anything.
Really appreciate the review.

Comment thread docs/delayed-retry.md
The Delayed Retry pattern overrides the next retry interval for a specific failure by throwing an `ApplicationFailure` with a `nextRetryDelay` field set from inside the Activity.
Use it when a particular error carries information about how long to wait before retrying — such as a rate-limit response with a `Retry-After` header, or a known maintenance window with a fixed end time.

> **SDK availability:** `nextRetryDelay` is supported in the **Java**, **TypeScript**, and **Rust** SDKs.
Collaborator

Comment thread docs/delayed-retry.md
- A downstream system returns an error message saying "maintenance until 02:00 UTC" — a precise, known delay.
- A database error includes a lock timeout duration that indicates when the resource will be available.

With a global `RetryPolicy`, you have two bad options: set a short interval and retry too early (wasting quota and adding load), or set a long interval and wait longer than necessary.
Collaborator

Maybe change "bad" to less negative language?

Comment thread docs/delayed-retry.md
const response = await fetch(endpoint);

if (response.status === 429) {
  const retryAfter = parseInt(response.headers.get('Retry-After') ?? '60', 10);
Collaborator

Maybe only use retryAfter if the response header is set, and fall back to the retry policy if not, instead of a blanket 60s default?
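
A sketch of that fallback behaviour, written in Python for illustration: the helper returns `None` when the header is missing or unparseable, so the caller can skip the delay override and let the RetryPolicy interval apply. It accepts both forms that HTTP allows for `Retry-After` (a number of seconds or an HTTP date). The function name is hypothetical, and the lookup assumes a plain mapping with the exact key `Retry-After`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Seconds to wait per Retry-After, or None to fall back to the RetryPolicy."""
    raw = headers.get("Retry-After")
    if raw is None:
        return None
    raw = raw.strip()
    if raw.isascii() and raw.isdigit():
        return float(raw)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(raw)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable value: do not guess a delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

Only a non-`None` result would be passed on as the next retry delay; every other failure keeps the policy's normal backoff.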

Comment thread docs/delayed-retry.md

- **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically.
- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.
Collaborator

The schedule-to-close timeout is governed by the timeout settings, not the RetryPolicy. Also, maybe we should drop the fallback interval since it will be overridden by the retry delay.

Comment thread docs/delayed-retry.md
- **Use the error's own delay information when available.** HTTP 429 `Retry-After`, database lock timeouts, and API-provided backoff hints are more accurate than any value you could configure statically.
- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.
- **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time.
Collaborator

It's not clear how to show the delay value in the Workflow history. Maybe we can suggest adding the retry-after value to the ApplicationFailure's message for clarity.

Comment thread docs/delayed-retry.md
- **Fall back to the RetryPolicy for unknown errors.** Only set `nextRetryDelay` for error types where you have reliable delay information. Let the RetryPolicy handle all other failures normally.
- **Still set a meaningful RetryPolicy.** `nextRetryDelay` overrides a single retry interval; the RetryPolicy governs everything else — maximum attempts, schedule-to-close timeout, and fallback intervals.
- **Log the override.** When setting a non-standard delay, log the delay value and its source (e.g., the `Retry-After` header value) so the Workflow history and your logging backend show why the Activity waited an unusual amount of time.
- **Set `startToCloseTimeout` to be more than the downstream call typically takes.** You want the downstream call to fail before the Activity times out. For example, for an HTTP request that times out after 30s, have the `startToCloseTimeout` be 45 or 60 seconds.
Collaborator

This is a good practice for startToCloseTimeout and the activity's execution time, but it is less relevant here than scheduleToCloseTimeout, which will be impacted by the prolonged retryAfter backoff.
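
The interaction between a long retry delay and the schedule-to-close budget can be sketched as a simple check. This is an illustration only: the function name and the optional `attempt_allowance` parameter are assumptions, and how the elapsed time is obtained is left to the caller.

```python
from datetime import timedelta


def delay_fits_budget(
    elapsed: timedelta,
    schedule_to_close: timedelta,
    next_retry_delay: timedelta,
    attempt_allowance: timedelta = timedelta(0),
) -> bool:
    """True if the delayed retry can still start (and, optionally, finish)
    before the ScheduleToClose budget runs out."""
    remaining = schedule_to_close - elapsed
    return next_retry_delay + attempt_allowance < remaining
```

For example, four minutes into a five-minute budget, a 30-second delay still fits, while a 90-second delay would let the Activity expire before the retry runs.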

Comment thread docs/delayed-retry.md
- **Assuming `nextRetryDelay` persists across all retries.** It only applies to the immediate next retry. If the following attempt also fails without setting `nextRetryDelay`, the RetryPolicy interval resumes.
- **Setting `nextRetryDelay` longer than `ScheduleToCloseTimeout`.** If the override delay exceeds the remaining `ScheduleToCloseTimeout` budget, the retry will never execute — Temporal will expire the Activity before the delay elapses.
- **Using `nextRetryDelay` in Go or Python.** This feature is not available in those SDKs. For Go and Python, use `BackoffCoefficient=1.0` with a fixed `InitialInterval` to approximate a fixed delay.
- **Setting `startToCloseTimeout` to be less than the downstream call typically takes.** If you set the `startToCloseTimeout` to be less than your call timeout, the Activity will be failed (and usually retried) even though the downstream call may have succeeded at the last moment.
Collaborator

Suggest removing this.

## Best practices

- **Use a bounded `MaximumAttempts` before parking.** Allow a few automatic retries to recover from transient failures. Parking immediately on the first failure forces operators to intervene for problems that would have resolved on their own.
- **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections.
Collaborator

If the goal is to prevent hitting the history limit, maybe we can suggest Continue-As-New (CAN) instead of forcing an arbitrary correction counter?

while True:
    self._status = "TRANSFERRING"
    try:
        result = await workflow.execute_activity(
Collaborator

Shall we show the implementation of executeTransfer to highlight the need to throw a non-retryable error as part of this pattern?

- **Cap the correction loop.** Each correction cycle — park, receive signal, re-execute Activity — adds events to the Workflow history (signal received, state transitions, Activity scheduled/completed). Bound the loop to a small number of attempts (the examples use 5) and fail the Workflow explicitly if exceeded, rather than allowing unbounded growth from a stream of bad corrections.
- **Expose status via a Query method.** The `getStatus` Query gives operations tooling visibility into where the Workflow is parked without requiring access to the Workflow history.
- **Validate the correction in the Signal handler.** Check that the corrected account is non-empty and matches the expected format before setting the state. An invalid correction just parks the Workflow again, but a clear error message helps operators.
- **Log at every state transition.** The `AWAITING_CORRECTION` and `AWAITING_APPROVAL` states can last hours or days. Structured log lines at each transition make the audit trail clear.
Collaborator

For better visibility, we could suggest using search attributes to capture the state for operators to query from.
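
For the "Validate the correction in the Signal handler" practice quoted above, a minimal sketch of the check itself, kept independent of the SDK. The account format (two uppercase letters followed by ten digits) is purely hypothetical, as are the names; a real handler would apply whatever format the transfer system expects.

```python
import re
from typing import Optional

# Hypothetical account format, for illustration only.
ACCOUNT_PATTERN = re.compile(r"[A-Z]{2}[0-9]{10}")


def validate_corrected_account(account: str) -> Optional[str]:
    """Return an operator-facing error message, or None if the value is usable."""
    candidate = account.strip()
    if not candidate:
        return "corrected account is empty"
    if not ACCOUNT_PATTERN.fullmatch(candidate):
        return f"corrected account {candidate!r} does not match the expected format"
    return None
```

A Signal handler could call this before updating Workflow state and surface the message to the operator when it is not `None`.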

2 participants