@@ -167,9 +167,9 @@ traffic_shaping:

## Outbound Options

The following options (`allow_only_http2`, `dedupe_enabled`, `pool_idle_timeout`, `request_timeout`,
and `tls`) can be set globally for all subgraphs or overridden on a per-subgraph basis by nesting
them under the subgraph's name within the `traffic_shaping` map.
The following options (`allow_only_http2`, `circuit_breaker`, `dedupe_enabled`, `pool_idle_timeout`,
`request_timeout`, and `tls`) can be set globally for all subgraphs or overridden on a per-subgraph
basis by nesting them under the subgraph's name within the `traffic_shaping` map.

The following example shows how to set global defaults and override them for a specific
subgraph named `products`:
@@ -204,6 +204,93 @@ traffic_shaping:
allow_only_http2: true
```

### `circuit_breaker`

- **Type:** `object | null`
- **Default:** `null` (disabled)

Enables the circuit breaker pattern for subgraph requests. When the error rate of requests to a
subgraph exceeds the configured threshold, the circuit breaker opens and subsequent requests are
immediately rejected with an error — instead of waiting for the subgraph to respond. After the
`reset_timeout` elapses, the circuit enters a half-open state and allows a probe request through. If
that request succeeds, the circuit closes again.

For a detailed explanation and tuning guidance, see the
[Circuit Breaker section](/docs/router/guides/performance-tuning#circuit-breaker) of the Performance
Tuning guide.

#### `circuit_breaker.enabled`

- **Type:** `boolean | null`
- **Default:** `false`

Enables or disables the circuit breaker for the subgraph(s). When omitted in a per-subgraph
`circuit_breaker` block, the value inherits from the global `traffic_shaping.all.circuit_breaker`
configuration.

#### `circuit_breaker.error_threshold`

- **Type:** `string`
- **Default:** `50%`

The error rate (as a percentage string) at or above which the circuit breaker opens. For example, `50%`
means the circuit trips when 50% or more of requests in the evaluation window fail. When omitted in
a per-subgraph override, the value falls back to the global configuration.

#### `circuit_breaker.volume_threshold`

- **Type:** `integer | null`
- **Default:** `5`

The minimum number of requests that must be observed before the circuit breaker starts evaluating
the error rate. This prevents the circuit from tripping due to a small number of failures during
low-traffic periods. When omitted in a per-subgraph override, the value falls back to the global
configuration.
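
To make the interaction between `volume_threshold` and `error_threshold` concrete, here is a minimal Python sketch of the trip condition. It is an illustration only, not the router's implementation: the function names `parse_percentage` and `should_open` are hypothetical, and the sketch assumes the percentage string is a number followed by `%` and that the comparison is inclusive ("50% or more").

```python
def parse_percentage(value: str) -> float:
    """Convert a percentage string such as "50%" into a fraction (0.5)."""
    text = value.strip()
    if not text.endswith("%"):
        raise ValueError(f"expected a percentage string like '50%', got {value!r}")
    return float(text[:-1]) / 100.0


def should_open(total: int, failures: int,
                volume_threshold: int = 5,
                error_threshold: str = "50%") -> bool:
    """Return True only when both trip conditions hold (hypothetical helper)."""
    # Condition 1: enough requests have been observed to trust the error rate.
    if total == 0 or total < volume_threshold:
        return False
    # Condition 2: the observed error rate reaches the configured threshold.
    return failures / total >= parse_percentage(error_threshold)
```

With the defaults, three failures out of three requests do not trip the breaker because fewer than five requests have been observed, while three failures out of five do.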

#### `circuit_breaker.reset_timeout`

- **Type:** `string`
- **Default:** `30s`

The duration the circuit breaker stays open before transitioning to a half-open state to probe
whether the subgraph has recovered. Accepts a duration string (e.g. `30s`, `1m`). When omitted in
a per-subgraph override, the value falls back to the global configuration.

#### `circuit_breaker.error_status_codes`

- **Type:** `integer[] | null`
- **Default:** `[503]`

The list of HTTP status codes returned by the subgraph that should be counted as failures by the
circuit breaker. Only responses whose status code appears in this list are recorded as failures —
responses with any other status code (including other 5xx codes) are treated as successes from the
circuit breaker's perspective. When omitted in a per-subgraph override, the value falls back to the
global configuration.

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: true
      error_status_codes: [500, 502, 503, 504]
```
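
The effect of `error_status_codes` on how a response is classified can be sketched in a few lines of Python. This is a hedged illustration rather than the router's code: `counts_as_failure` and `DEFAULT_ERROR_STATUS_CODES` are hypothetical names, and the sketch covers only HTTP status classification as described above.

```python
# Assumed default, taken from the documented default of `[503]`.
DEFAULT_ERROR_STATUS_CODES = frozenset({503})


def counts_as_failure(status_code: int,
                      error_status_codes=DEFAULT_ERROR_STATUS_CODES) -> bool:
    """Return True if the circuit breaker would record this response as a failure."""
    # Only codes listed explicitly count; every other code is a success
    # from the circuit breaker's perspective, including other 5xx codes.
    return status_code in error_status_codes
```

Under the default list, a `500` response is not counted as a failure; with the configuration shown above, it is.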

**Example — global circuit breaker with a per-subgraph override:**

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: true
      error_threshold: 50%
      volume_threshold: 5
      reset_timeout: 30s
  subgraphs:
    payments:
      circuit_breaker:
        volume_threshold: 3 # more sensitive for the payments subgraph; other settings inherit from global
```

### `dedupe_enabled`

- **Type:** `boolean`
@@ -251,3 +251,143 @@ many users run simultaneously).
- Your queries are always unique (heavily personalized)
- You're debugging and want to see every request
- You have very low traffic where deduplication doesn't help

## Circuit Breaker

The circuit breaker pattern prevents the router from continuously sending requests to a subgraph
that is failing or unresponsive. When the subgraph's error rate rises above a configurable
threshold, the circuit "opens" and subsequent requests are immediately rejected with a
`SUBGRAPH_CIRCUIT_BREAKER_REJECTED` error — instead of waiting for the subgraph to time out. This
frees up router resources and gives the subgraph time to recover.

- **Default:** `circuit_breaker: null` (disabled). Any per-field defaults listed below apply only
when a `circuit_breaker` object is provided.

### How It Works

The circuit breaker has three states:

| State | Behaviour |
| ------------- | ---------------------------------------------------------------------------------------------------- |
| **Closed** | Normal operation — all requests pass through. |
| **Open** | Requests are immediately rejected. No traffic reaches the subgraph. |
| **Half-open** | After `reset_timeout`, one probe request is allowed. Success closes the circuit; failure reopens it. |

The circuit transitions from **closed → open** once both conditions are met:

1. At least `volume_threshold` requests have been observed.
2. The fraction of those requests that errored is ≥ `error_threshold`.
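
The three states and the two trip conditions can be sketched as a small state machine. The Python below is a simplified illustration, not the router's implementation: the `CircuitBreaker` class and its method names are hypothetical, the evaluation window is modelled as plain counters that reset on every state change (an assumption, since the window is not specified here), and the clock is injectable so the behaviour can be exercised without waiting.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Hypothetical sketch of the closed / open / half-open cycle."""

    def __init__(self, error_threshold=0.5, volume_threshold=5,
                 reset_timeout=30.0, clock=time.monotonic):
        self.error_threshold = error_threshold    # fraction, e.g. 0.5 for "50%"
        self.volume_threshold = volume_threshold  # minimum observed requests
        self.reset_timeout = reset_timeout        # seconds to stay open
        self.clock = clock
        self.state = State.CLOSED
        self.total = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Decide whether a request may be sent to the subgraph."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                return False                      # still open: reject immediately
            self.state = State.HALF_OPEN          # timeout elapsed: allow one probe
            return True
        if self.state is State.HALF_OPEN:
            return False                          # a probe is already in flight
        return True                               # closed: normal operation

    def record(self, success: bool) -> None:
        """Record the outcome of a request that was allowed through."""
        if self.state is State.OPEN:
            return                                # ignore late results while open
        if self.state is State.HALF_OPEN:
            # The probe result alone decides the next state.
            self._close() if success else self._open()
            return
        self.total += 1
        if not success:
            self.failures += 1
        if (self.total >= self.volume_threshold
                and self.failures / self.total >= self.error_threshold):
            self._open()

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.total = self.failures = 0

    def _close(self) -> None:
        self.state = State.CLOSED
        self.total = self.failures = 0
```

A caller would check `allow_request()` before contacting the subgraph and call `record(...)` once the outcome is known. A production implementation would typically use a sliding or time-bucketed window rather than the cumulative counters used here.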

### Configuration

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: true
      error_threshold: 50% # open when ≥ 50% of requests fail
      volume_threshold: 5 # evaluate after at least 5 requests
      reset_timeout: 30s # retry after 30s
```

| Option | Type | Default | Description |
| -------------------- | ----------------- | ------- | -------------------------------------------------------------------------------------- |
| `enabled` | `boolean \| null` | `false` | Enable or disable the circuit breaker. At subgraph level, omit to inherit from global. |
| `error_threshold` | `string` | `50%` | Error rate percentage that triggers the breaker (e.g. `50%`). |
| `volume_threshold` | `integer \| null` | `5` | Minimum request count before the error rate is evaluated. |
| `reset_timeout` | `string` | `30s` | Duration the circuit stays open before allowing a probe request. |
| `error_status_codes` | `integer[]` | `[503]` | HTTP status codes that count as failures (default: only 503). |

### Global vs Per-Subgraph Configuration

Circuit breaker settings can be applied globally to all subgraphs under `traffic_shaping.all`, and
selectively overridden for individual subgraphs under `traffic_shaping.subgraphs.<name>`. All
per-subgraph fields are merged with the global configuration — omit any field to inherit the global
value.

<Tabs items={["Global only", "Per-subgraph only", "Mixed"]}>

<Tabs.Tab>

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: true
      error_threshold: 50%
      volume_threshold: 5
      reset_timeout: 30s
```

</Tabs.Tab>

<Tabs.Tab>

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: false # disabled globally
  subgraphs:
    accounts:
      circuit_breaker:
        enabled: true
        error_threshold: 60%
        volume_threshold: 3
        reset_timeout: 10s
    products:
      circuit_breaker:
        enabled: true
        error_threshold: 70%
        volume_threshold: 4
        reset_timeout: 15s
```

</Tabs.Tab>

<Tabs.Tab>

```yaml title="router.config.yaml"
traffic_shaping:
  all:
    circuit_breaker:
      enabled: true
      error_threshold: 50%
      volume_threshold: 10
      reset_timeout: 30s
  subgraphs:
    accounts:
      circuit_breaker:
        volume_threshold: 3 # override only volume_threshold; enabled and other settings inherit from global
```

</Tabs.Tab>

</Tabs>
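
The field-by-field inheritance used in the "Mixed" configuration can be sketched as a layered lookup: the per-subgraph value wins, then the global value, then the documented default. The Python below is only an illustration of that merge rule under stated assumptions. `CircuitBreakerConfig`, `DEFAULTS`, and `resolve` are hypothetical names, and the sketch always applies the per-field defaults, whereas the router applies them only when a `circuit_breaker` object is provided.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class CircuitBreakerConfig:
    """Hypothetical mirror of the documented options; None means "not set"."""
    enabled: Optional[bool] = None
    error_threshold: Optional[str] = None
    volume_threshold: Optional[int] = None
    reset_timeout: Optional[str] = None
    error_status_codes: Optional[Tuple[int, ...]] = None


# Per-field defaults as listed in the options table.
DEFAULTS = CircuitBreakerConfig(
    enabled=False,
    error_threshold="50%",
    volume_threshold=5,
    reset_timeout="30s",
    error_status_codes=(503,),
)


def resolve(subgraph: Optional[CircuitBreakerConfig],
            global_cfg: Optional[CircuitBreakerConfig],
            defaults: CircuitBreakerConfig = DEFAULTS) -> CircuitBreakerConfig:
    """Merge per-subgraph, global, and default settings field by field."""
    resolved = {}
    for field in fields(CircuitBreakerConfig):
        # The first layer that sets the field wins: subgraph, then global, then default.
        for layer in (subgraph, global_cfg, defaults):
            value = getattr(layer, field.name) if layer is not None else None
            if value is not None:
                resolved[field.name] = value
                break
    return CircuitBreakerConfig(**resolved)
```

Applied to the "Mixed" example, the `accounts` subgraph resolves to `volume_threshold: 3` while `enabled`, `error_threshold`, and `reset_timeout` come from the global block.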

### Metrics

When the circuit breaker rejects a request, the router increments the
`hive.router.circuit_breaker.rejected_requests_total` counter. The counter carries a `subgraph.name`
label so you can track rejection rates per subgraph in your metrics backend.

### Tuning Guidelines

- **`error_threshold`** — Lower values make the breaker more sensitive. Start at the default
(`50%`) and tighten it only if you need to protect fragile subgraphs more aggressively.
- **`volume_threshold`** — Keep this high enough to avoid false positives during low-traffic
periods. A value of `5`–`10` is reasonable for most deployments.
- **`reset_timeout`** — Should be long enough to give the subgraph time to recover, but short
enough that you notice quickly when it does. `30s`–`60s` is a sensible starting point.

**When to enable the circuit breaker:**

- You have subgraphs that are occasionally slow or unavailable, and you want to fail fast rather
than accumulate timeout latency
- You want to give a struggling subgraph breathing room to recover instead of being overwhelmed by
retried requests

**When you might leave it disabled:**

- All subgraphs are highly reliable and well within their resource limits
- You prefer to surface subgraph errors directly to clients rather than short-circuiting them