Skip to content

POST retries on network errors can duplicate items (batch_create especially) #19

@justin-layerv

Description

@justin-layerv

Problem

The SDK retries fetch-level errors (timeouts, transport failures, connection resets) for all HTTP methods including POST. For an idempotent method like GET or DELETE this is correct, but for POST it creates a real risk of duplicate item creation when the first request completed server-side before the client errored out.

The blast radius is worst for `batch_create`, which can create up to 100 items in a single POST. A mid-flight network error that succeeds on retry could result in up to 200 items where the caller expected 100.

Root cause

In `src/layerv_qurl/client.py` around line 641-651 (and the async mirror in `async_client.py`), the retry loop handles fetch-level errors with a blanket `continue` regardless of HTTP method:

```python
except httpx.TimeoutException as exc:
logger.debug("%s %s timed out", method, url)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLTimeoutError(str(exc), cause=exc) from exc
except httpx.TransportError as exc:
logger.debug("%s %s transport error: %s", method, url, exc)
if attempt < self._max_retries:
last_error = exc
continue # ← retries even for POST / PATCH
raise QURLNetworkError(str(exc), cause=exc) from exc
```

The `RETRYABLE_STATUS_POST = {429}` check further down (which correctly restricts HTTP-status-based retries to 429 only for POST) is only reached when a response actually comes back. Network errors never reach that check.

The sync/async parity guard is correctly maintained — both clients exhibit the same bug.

Repro (hypothetical)

  1. Client issues `batch_create([...100 items...])`
  2. Server receives the request, creates all 100 items, starts writing the response
  3. Connection drops mid-write (or client timeout fires just after the server commits)
  4. Client catches `TransportError` (or `TimeoutException`), sees `attempt < self._max_retries`, retries
  5. Server receives the second request and creates another 100 items
  6. Caller ends up with 200 items, no error surfaced

Scope

  • Pre-existing: this behavior was not introduced by the `feat/align-api-v2` PR (feat!: align types and client with latest API spec #8). The PR description and migration notes don't mention it.
  • Cross-SDK: the same issue exists in layervai/qurl-typescript#22. Whatever fix ships should be mirrored.
  • POST retry tests are already in place for the HTTP-status path (`test_post_does_not_retry_on_503`, `test_post_still_retries_on_429`) — those lock in the correct behavior for responses with a status code, but don't cover the network-error path.

Fix options

A. Don't retry network errors for mutating methods.
Simplest fix — mirror the `RETRYABLE_STATUS_POST` logic for the network-error path: POST only retries on 429 (HTTP-level), never on network errors. Matches the HTTP-status retry posture. Downside: transient network blips on create operations always fail, even when a retry would clearly succeed and not duplicate.

B. Require idempotency keys on POST retries.
Each POST retry sends an `Idempotency-Key` header derived from a caller-provided or SDK-generated key. The server deduplicates based on the key. Industry-standard (Stripe, Square, etc.) but requires server-side support — worth verifying with qurl-service whether the API supports `Idempotency-Key`.

C. Add a `retry_network_errors_on_mutating: bool` option (default `False`).
Explicit opt-in. Callers who know the failure mode can opt in; the default is safe. Lowest-impact change for existing behavior but requires a minor API addition.

Recommendation

Option B if qurl-service supports `Idempotency-Key` — it's the most robust solution and would fix duplicate-creation risk across all create flows (not just `batch_create`). Option A as a fallback if server support isn't available.

Context

Flagged during review of PR #8 (`feat/align-api-v2`). Reviewer explicitly called it out as "pre-existing behavior, not introduced by this PR, but worth flagging since `batch_create` amplifies the blast radius" and "Consider documenting this or adding a `retry_network_errors` option in a future PR."

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingpriority: highImportant bugs, security issues

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions