
fix(aa): fail over fast + quietly when primary bundler/paymaster is slow #26

Merged
DamirAGI merged 1 commit into main from fix/aa-failover-quiet-fast
Jun 9, 2026

@DamirAGI DamirAGI commented Jun 9, 2026

Collaborator

Root cause (investigated per request)

The "error: This operation was aborted" seen mid-mint is caused by intermittent CDP latency, not a deterministic bug:

  • CDP basic methods (eth_chainId, eth_supportedEntryPoints) respond in ~0.3s — CDP isn't down.
  • The mint succeeds on CDP most of the time (3 fresh mints: 1 failed over, 2 didn't) — so it's not "CDP doesn't sponsor MockUSDC".
  • Occasionally a heavier AA call (pm_getPaymasterData / eth_sendUserOperation) exceeds the timeout → AbortSignal → failover to Pimlico → succeeds.
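
To illustrate the timeout → AbortSignal path described above, here is a minimal sketch, not the SDK's actual code. `withTimeout` and `slowRpc` are hypothetical names introduced for illustration; the sketch assumes the client arms a timer and aborts the in-flight call through an `AbortController`, which in Node.js rejects with an error named `AbortError`.

```typescript
// Hypothetical helper (not from the SDK): runs `call` and aborts it
// if it has not settled within `timeoutMs`.
async function withTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await call(controller.signal);
  } finally {
    clearTimeout(timer); // always disarm the timer, on success or failure
  }
}

// Hypothetical stand-in for a heavier AA call (e.g. pm_getPaymasterData)
// that takes `delayMs` to answer and honours the abort signal.
function slowRpc(delayMs: number): (signal: AbortSignal) => Promise<string> {
  return (signal) =>
    new Promise<string>((resolve, reject) => {
      const t = setTimeout(() => resolve("ok"), delayMs);
      signal.addEventListener(
        "abort",
        () => {
          clearTimeout(t);
          reject(signal.reason); // an AbortError when abort() has no reason
        },
        { once: true },
      );
    });
}
```

A call that answers inside the budget resolves normally; one that exceeds it rejects with the abort error, which is the condition that triggers failover.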

Two things made a recovered failover look like a failure:

  1. Slow: the bundler treated a timeout as transient and retried the same hung provider up to maxRetries before failing over, stalling for up to ~90s on an unresponsive CDP call. isNonTransient now flags timeouts, so callWithFallback switches to the backup immediately. Bundler timeout reduced from 30s to 20s (paymaster was already 15s).
  2. Noisy: failover/retry was logged at warn, which surfaced "error: This operation was aborted" during a successful mint. These logs are now downgraded to debug (Bundler + Paymaster). A total failure (both providers fail) still throws a clear error.
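
The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the names `isNonTransient` and `callWithFallback` come from this PR, but their signatures, the `Provider` type, the `log` callback, and the reading of `maxRetries` as "maximum attempts per provider" are all assumptions made for the example.

```typescript
// Assumed shape of a provider; the real clients are BundlerClient/PaymasterClient.
type Provider<T> = { name: string; call: () => Promise<T> };

// Assumption: a timeout surfaces as an error named AbortError or TimeoutError.
function isNonTransient(err: unknown): boolean {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  return name === "AbortError" || name === "TimeoutError";
}

// Tries each provider in order. Transient errors are retried on the same
// provider; a timeout skips the remaining retries and fails over at once.
async function callWithFallback<T>(
  providers: Provider<T>[],
  maxRetries: number, // assumed to mean max attempts per provider
  log: (msg: string) => void = () => {}, // debug-level sink, silent by default
): Promise<T> {
  const failures: string[] = [];
  for (const provider of providers) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await provider.call();
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        failures.push(`${provider.name}: ${reason}`);
        log(`${provider.name} attempt ${attempt} failed: ${reason}`);
        if (isNonTransient(err)) break; // hung provider: do not retry it
      }
    }
  }
  // Only a total failure is surfaced to the caller.
  throw new Error(`All providers failed (${failures.join("; ")})`);
}
```

With this shape, a primary that times out costs one timeout period before the backup is tried, instead of one timeout period per retry, and nothing is printed unless every provider fails.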

Net: occasional CDP slowness now fails over to Pimlico fast + silently; the mint/pay just works (wow UX preserved).

Full SDK suite green (2315); AA tests 53/53.

🤖 Generated with Claude Code

…is slow

Root cause of the "error: This operation was aborted" seen mid-mint: an
INTERMITTENT CDP latency spike. The CDP endpoint is reachable + fast for basic
methods, but the heavier AA calls occasionally exceed the timeout. Two problems
made a recovered failover look like a failure:

1. The bundler treated a timeout (AbortError) as transient and RETRIED the same
   hung provider up to maxRetries before failing over — up to ~90s of stalling
   on a dead CDP. Now `isNonTransient` flags timeouts so callWithFallback flips
   to the backup immediately. Bundler timeout 30s → 20s (paymaster already 15s).
2. The failover/retry messages were logged at `warn` and surfaced to the user —
   "error: This operation was aborted" during a *successful* mint. Downgraded to
   `debug` in BundlerClient + PaymasterClient. A total failure (both providers
   fail) still throws a clear, surfaced error.

Net: occasional CDP slowness now fails over to Pimlico quickly and silently;
the mint/pay just works. Full SDK suite green (2315); AA tests 53/53.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DamirAGI DamirAGI requested a review from roosch269 as a code owner June 9, 2026 11:43
@DamirAGI DamirAGI merged commit ebcb94e into main Jun 9, 2026
7 of 9 checks passed
DamirAGI added a commit that referenced this pull request Jun 9, 2026
Intermittent CDP latency now fails over to the backup bundler/paymaster
quickly (timeout treated as non-transient; bundler 30s→20s) and silently
(failover/retry logs warn→debug), so an occasional slow primary no longer
stalls or prints a scary "aborted" mid-mint. Source merged in #26.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>