You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The SDK retries fetch-level errors (timeouts, transport failures, connection resets) for all HTTP methods including POST. For an idempotent method like GET or DELETE this is correct, but for POST it creates a real risk of duplicate item creation when the first request completed server-side before the client errored out.
The blast radius is worst for `batch_create`, which can create up to 100 items in a single POST. A mid-flight network error that succeeds on retry could result in up to 200 items where the caller expected 100.
Root cause
In `src/layerv_qurl/client.py` around line 641-651 (and the async mirror in `async_client.py`), the retry loop handles fetch-level errors with a blanket `continue` regardless of HTTP method:
```python
except httpx.TimeoutException as exc:
logger.debug("%s %s timed out", method, url)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLTimeoutError(str(exc), cause=exc) from exc
except httpx.TransportError as exc:
logger.debug("%s %s transport error: %s", method, url, exc)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLNetworkError(str(exc), cause=exc) from exc
```
The `RETRYABLE_STATUS_POST = {429}` check further down (which correctly restricts HTTP-status-based retries to 429 only for POST) is only reached when a response actually comes back. Network errors never reach that check.
The sync/async parity guard is correctly maintained — both clients exhibit the same bug.
Repro (hypothetical)
Client issues `batch_create([...100 items...])`
Server receives the request, creates all 100 items, starts writing the response
Connection drops mid-write (or client timeout fires just after the server commits)
Client catches `TransportError` (or `TimeoutException`), sees `attempt < self._max_retries`, retries
Server receives the second request and creates another 100 items
POST retry tests are already in place for the HTTP-status path (`test_post_does_not_retry_on_503`, `test_post_still_retries_on_429`) — those lock in the correct behavior for responses with a status code, but don't cover the network-error path.
Fix options
A. Don't retry network errors for mutating methods.
Simplest fix — mirror the `RETRYABLE_STATUS_POST` logic for the network-error path: POST only retries on 429 (HTTP-level), never on network errors. Matches the HTTP-status retry posture. Downside: transient network blips on create operations always fail, even when a retry would clearly succeed and not duplicate.
B. Require idempotency keys on POST retries.
Each POST retry sends an `Idempotency-Key` header derived from a caller-provided or SDK-generated key. The server deduplicates based on the key. Industry-standard (Stripe, Square, etc.) but requires server-side support — worth verifying with qurl-service whether the API supports `Idempotency-Key`.
C. Add a `retry_network_errors_on_mutating: bool` option (default `False`).
Explicit opt-in. Callers who know the failure mode can opt in; the default is safe. Lowest-impact change for existing behavior but requires a minor API addition.
Recommendation
Option B if qurl-service supports `Idempotency-Key` — it's the most robust solution and would fix duplicate-creation risk across all create flows (not just `batch_create`). Option A as a fallback if server support isn't available.
Context
Flagged during review of PR #8 (`feat/align-api-v2`). Reviewer explicitly called it out as "pre-existing behavior, not introduced by this PR, but worth flagging since `batch_create` amplifies the blast radius" and "Consider documenting this or adding a `retry_network_errors` option in a future PR."
Problem
The SDK retries fetch-level errors (timeouts, transport failures, connection resets) for all HTTP methods including POST. For an idempotent method like GET or DELETE this is correct, but for POST it creates a real risk of duplicate item creation when the first request completed server-side before the client errored out.
The blast radius is worst for `batch_create`, which can create up to 100 items in a single POST. A mid-flight network error that succeeds on retry could result in up to 200 items where the caller expected 100.
Root cause
In `src/layerv_qurl/client.py` around line 641-651 (and the async mirror in `async_client.py`), the retry loop handles fetch-level errors with a blanket `continue` regardless of HTTP method:
```python
except httpx.TimeoutException as exc:
logger.debug("%s %s timed out", method, url)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLTimeoutError(str(exc), cause=exc) from exc
except httpx.TransportError as exc:
logger.debug("%s %s transport error: %s", method, url, exc)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLNetworkError(str(exc), cause=exc) from exc
```
The `RETRYABLE_STATUS_POST = {429}` check further down (which correctly restricts HTTP-status-based retries to 429 only for POST) is only reached when a response actually comes back. Network errors never reach that check.
The sync/async parity guard is correctly maintained — both clients exhibit the same bug.
Repro (hypothetical)
Scope
Fix options
A. Don't retry network errors for mutating methods.
Simplest fix — mirror the `RETRYABLE_STATUS_POST` logic for the network-error path: POST only retries on 429 (HTTP-level), never on network errors. Matches the HTTP-status retry posture. Downside: transient network blips on create operations always fail, even when a retry would clearly succeed and not duplicate.
B. Require idempotency keys on POST retries.
Each POST retry sends an `Idempotency-Key` header derived from a caller-provided or SDK-generated key. The server deduplicates based on the key. Industry-standard (Stripe, Square, etc.) but requires server-side support — worth verifying with qurl-service whether the API supports `Idempotency-Key`.
C. Add a `retry_network_errors_on_mutating: bool` option (default `False`).
Explicit opt-in. Callers who know the failure mode can opt in; the default is safe. Lowest-impact change for existing behavior but requires a minor API addition.
Recommendation
Option B if qurl-service supports `Idempotency-Key` — it's the most robust solution and would fix duplicate-creation risk across all create flows (not just `batch_create`). Option A as a fallback if server support isn't available.
Context
Flagged during review of PR #8 (`feat/align-api-v2`). Reviewer explicitly called it out as "pre-existing behavior, not introduced by this PR, but worth flagging since `batch_create` amplifies the blast radius" and "Consider documenting this or adding a `retry_network_errors` option in a future PR."