Summary
Tracker for follow-up work on the async LLM path introduced in #3284 (feat: add async LLM completion support across the SDK, merged). The agent server now drives real LocalConversations through arun() by default (event_service.py:285-294), so these affect production server runs, not just opt-in async callers.
Findings below were cross-checked against merged main by two independent reviews; the two High-severity items were reproduced.
Checklist
Correctness
Maintainability
B1 — Async error path bypasses fallback / mapping / telemetry (High)
In acompletion (llm.py:996-1054) and aresponses (llm.py:1338-1460), the retry loop is outside the try/except that calls _ahandle_error:
async for attempt in self.async_retry(...): # reraise=True
with attempt:
resp = await self._atransport_call(...) # provider errors raised HERE
try: # only wraps response parsing
message = Message.from_llm_chat_message(...)
return LLMResponse(...)
except Exception as e:
return await self._ahandle_error(e, ...)
The sync completion puts _one_attempt() inside the try (llm.py:900). map_provider_exception is called only in _handle_error/_ahandle_error (llm.py:718, 751). So on the common failure mode (retry exhaustion / non-retryable provider error), async:
- never invokes
fallback_strategy (backup LLM silently skipped),
- never maps litellm exceptions to SDK types,
- never calls
telemetry.on_error.
Most damaging consequence: astep recovers from context overflow via except LLMContextWindowExceedError → CondensationRequest. That SDK type is produced only by map_provider_exception (mapping.py:35). In async the raw litellm.ContextWindowExceededError escapes unmapped, the handler misses it, and the conversation errors out instead of auto-condensing. Same for LLMMalformedConversationHistoryError.
Repro: primary litellm_acompletion → APIConnectionError, fallback LLM configured; acompletion() re-raised APIConnectionError and the fallback call count stayed 0.
Fix: wrap the async retry loop in the same try/except scope as sync, for both acompletion and aresponses.
B2 — Async retry loses temperature→1.0 on LLMNoResponseError (High)
before_sleep bumps temperature on LLMNoResponseError by mutating retry_state.kwargs (retry_mixin.py:50-59). Sync reads it back via _one_attempt(**retry_kwargs) → final_kwargs = {**call_kwargs, **retry_kwargs} (llm.py:870). The async async for attempt: with attempt: idiom has no retried function and uses a call_kwargs local that is never updated, so the bump is lost.
Repro (pinned tenacity 9.1.2): temperatures per attempt — sync [0, 1.0, 1.0], async [0, 0, 0]. With temperature=0 the model tends to repeat the empty response, so all retries fail identically (and then compound with B1).
Fix: thread the retry-adjusted kwargs into the per-attempt call in the async loop (cleanest if combined with B1 via a shared per-attempt setup helper). Affects acompletion and aresponses.
B3 — cancel_token cleared while interrupted tool threads run (Medium)
arun()'s finally sets self._cancel_token = None (local_conversation.py:1067). run_in_executor worker threads can't be force-cancelled and may outlive arun(). A long-running tool that polls the documented conversation.cancel_token pattern after arun() exits sees None and loses the cancellation signal. (Terminal tools also get executor.interrupt(), so they're usually fine; cooperative-polling tools are not.)
Fix: keep the cancelled token reachable until the interrupted batch drains, or pass the token directly into tool execution rather than reading it back from the conversation.
B4 — RemoteConversation.interrupt() degrades to pause() (Medium)
The PR adds POST /conversations/{id}/interrupt and LocalConversation.interrupt(), but RemoteConversation (remote_conversation.py) doesn't override interrupt(), so it inherits BaseConversation.interrupt() → self.pause(). SDK callers using a remote conversation get pause semantics from a method named interrupt().
Fix: add a RemoteConversation.interrupt() that POSTs to the /interrupt endpoint; add a test asserting it hits /interrupt, not /pause.
B5 — Sync-only/custom agents routed through threaded astep (Medium)
EventService.run() selects arun() whenever the conversation overrides BaseConversation.arun (event_service.py:285-294) — always true for LocalConversation. But only Agent overrides astep; ACPAgent and other AgentBase subclasses inherit the default astep (base.py:613) that runs sync step() via run_in_executor. So step() runs in a worker thread while arun() holds the FIFOLock on the event-loop thread.
Verified this is not a deadlock (step() never acquires the state lock), but it changes the threading/ordering model that ACP and custom agents were written against (callbacks now fire from a worker thread, not the lock-owning thread).
Fix: gate the async path on the agent's capability, e.g. type(conversation.agent).astep is not AgentBase.astep; otherwise run sync run() in the executor.
Q1 — Dead code
Tool.acall (tool.py:392) — zero callers; also a false affordance, since aexecute_batch/_arun_safe always wrap the sync __call__ in a thread (parallel_executor.py:761) and never call acall.
agenerate_title_with_llm / agenerate_title_from_message (title_utils.py, ~78 lines) — only call each other; server still uses sync generate_title.
_return_metrics — never read in any of the 4 LLM methods; propagated into the new async ones.
Q2 — Metrics.get_snapshot() not reused
MetricsSnapshot(...) is hand-built at llm.py:907, 1032, 1256, 1450; Metrics.get_snapshot() (metrics.py:136) already does this and additionally deepcopys accumulated_token_usage. The hand-built copies alias the live object, so they aren't true snapshots — reusing the helper removes ~20 lines and fixes the aliasing.
Q3 — Parallel-tool throughput regression
Sync execute_batch uses a dedicated ThreadPoolExecutor(max_workers) per batch; async aexecute_batch uses the shared default run_in_executor(None, …) pool gated by Semaphore(max_workers) (parallel_executor.py:729-737). Under concurrency or when max_workers exceeds the default pool size, previously-parallel tool calls can serialize. Correctness intact (order + locking preserved).
Q4 — FIFOLock held across await
arun() holds the thread-reentrant FIFOLock across await astep (local_conversation.py:953-1007). Safe only because the server routes every state mutator through run_in_executor worker threads. Any future state-mutating coroutine on the event-loop thread would silently re-enter the lock and corrupt history. Worth documenting or releasing the lock across the await.
Q5 — Sync/async duplication
After normalizing await/async, the twins are ~90%+ identical (astep/step 96/126 lines; arun/run loop 75/80; plus acompletion/aresponses/_atransport_call/_ahandle_error/condenser). ~600 of the ~1,740 added production lines are reclaimable via shared color-agnostic helpers (_prepare_chat_request/_finalize_chat_response, _classify_step_error, _map_and_raise). This duplication is the root cause of B1/B2 diverging from the sync path.
Suggested test additions
- async fallback success; all-fallbacks-fail → mapped SDK exception; responses-API async fallback;
telemetry.on_error fires on async failure (covers B1)
LLMNoResponseError → temperature escalates on async retry (covers B2)
- tool still polling
cancel_token after arun() enters finally still sees is_cancelled (covers B3)
RemoteConversation.interrupt() posts to /interrupt (covers B4)
- server conversation with a sync-only
AgentBase subclass runs via run() in executor (covers B5)
Summary
Tracker for follow-up work on the async LLM path introduced in #3284 (
feat: add async LLM completion support across the SDK, merged). The agent server now drives realLocalConversations througharun()by default (event_service.py:285-294), so these affect production server runs, not just opt-in async callers.Findings below were cross-checked against merged
mainby two independent reviews; the two High-severity items were reproduced.Checklist
Correctness
telemetry.on_error(High) Fix async error path in acompletion/aresponses to invoke fallback/mapping/telemetry #3355temperature→1.0escalation onLLMNoResponseError(High) Fix async retry losing temperature→1.0 on LLMNoResponseError #3356cancel_tokencleared inarun()'sfinallywhile interrupted tool threads may still be running (Medium) fix(sdk): keep cancelled token observable after arun() interrupt #3395RemoteConversation.interrupt()not overridden → silently degrades topause()(Medium) fix(sdk): override RemoteConversation.interrupt() to POST /interrupt (B4 of #3341) #3398astepoverride) run syncstep()in a worker thread whilearun()holds the state lock on the event-loop thread (Medium) fix(agent-server): route sync-only agents through sync run() instead of threaded astep (B5) #3404Maintainability
Tool.acall,agenerate_title_with_llm/agenerate_title_from_message, dead_return_metricsparamMetrics.get_snapshot()not reused (4 hand-built copies + token-usage aliasing) fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341) #3397FIFOLockheld acrossawait— undocumented "mutators must run off the event-loop thread" invariant docs(sdk): document FIFOLock held across arun()'s await (Q4 of #3341) #3400B1 — Async error path bypasses fallback / mapping / telemetry (High)
In
acompletion(llm.py:996-1054) andaresponses(llm.py:1338-1460), the retry loop is outside thetry/exceptthat calls_ahandle_error:The sync
completionputs_one_attempt()inside thetry(llm.py:900).map_provider_exceptionis called only in_handle_error/_ahandle_error(llm.py:718, 751). So on the common failure mode (retry exhaustion / non-retryable provider error), async:fallback_strategy(backup LLM silently skipped),telemetry.on_error.Most damaging consequence:
asteprecovers from context overflow viaexcept LLMContextWindowExceedError → CondensationRequest. That SDK type is produced only bymap_provider_exception(mapping.py:35). In async the rawlitellm.ContextWindowExceededErrorescapes unmapped, the handler misses it, and the conversation errors out instead of auto-condensing. Same forLLMMalformedConversationHistoryError.Repro: primary
litellm_acompletion→APIConnectionError, fallback LLM configured;acompletion()re-raisedAPIConnectionErrorand the fallback call count stayed0.Fix: wrap the async retry loop in the same
try/exceptscope as sync, for bothacompletionandaresponses.B2 — Async retry loses
temperature→1.0onLLMNoResponseError(High)before_sleepbumps temperature onLLMNoResponseErrorby mutatingretry_state.kwargs(retry_mixin.py:50-59). Sync reads it back via_one_attempt(**retry_kwargs)→final_kwargs = {**call_kwargs, **retry_kwargs}(llm.py:870). The asyncasync for attempt: with attempt:idiom has no retried function and uses acall_kwargslocal that is never updated, so the bump is lost.Repro (pinned tenacity 9.1.2): temperatures per attempt — sync
[0, 1.0, 1.0], async[0, 0, 0]. Withtemperature=0the model tends to repeat the empty response, so all retries fail identically (and then compound with B1).Fix: thread the retry-adjusted kwargs into the per-attempt call in the async loop (cleanest if combined with B1 via a shared per-attempt setup helper). Affects
acompletionandaresponses.B3 —
cancel_tokencleared while interrupted tool threads run (Medium)arun()'sfinallysetsself._cancel_token = None(local_conversation.py:1067).run_in_executorworker threads can't be force-cancelled and may outlivearun(). A long-running tool that polls the documentedconversation.cancel_tokenpattern afterarun()exits seesNoneand loses the cancellation signal. (Terminal tools also getexecutor.interrupt(), so they're usually fine; cooperative-polling tools are not.)Fix: keep the cancelled token reachable until the interrupted batch drains, or pass the token directly into tool execution rather than reading it back from the conversation.
B4 —
RemoteConversation.interrupt()degrades topause()(Medium)The PR adds
POST /conversations/{id}/interruptandLocalConversation.interrupt(), butRemoteConversation(remote_conversation.py) doesn't overrideinterrupt(), so it inheritsBaseConversation.interrupt()→self.pause(). SDK callers using a remote conversation get pause semantics from a method namedinterrupt().Fix: add a
RemoteConversation.interrupt()that POSTs to the/interruptendpoint; add a test asserting it hits/interrupt, not/pause.B5 — Sync-only/custom agents routed through threaded
astep(Medium)EventService.run()selectsarun()whenever the conversation overridesBaseConversation.arun(event_service.py:285-294) — always true forLocalConversation. But onlyAgentoverridesastep;ACPAgentand otherAgentBasesubclasses inherit the defaultastep(base.py:613) that runs syncstep()viarun_in_executor. Sostep()runs in a worker thread whilearun()holds theFIFOLockon the event-loop thread.Verified this is not a deadlock (
step()never acquires the state lock), but it changes the threading/ordering model that ACP and custom agents were written against (callbacks now fire from a worker thread, not the lock-owning thread).Fix: gate the async path on the agent's capability, e.g.
type(conversation.agent).astep is not AgentBase.astep; otherwise run syncrun()in the executor.Q1 — Dead code
Tool.acall(tool.py:392) — zero callers; also a false affordance, sinceaexecute_batch/_arun_safealways wrap the sync__call__in a thread (parallel_executor.py:761) and never callacall.agenerate_title_with_llm/agenerate_title_from_message(title_utils.py, ~78 lines) — only call each other; server still uses syncgenerate_title._return_metrics— never read in any of the 4 LLM methods; propagated into the new async ones.Q2 —
Metrics.get_snapshot()not reusedMetricsSnapshot(...)is hand-built atllm.py:907, 1032, 1256, 1450;Metrics.get_snapshot()(metrics.py:136) already does this and additionallydeepcopysaccumulated_token_usage. The hand-built copies alias the live object, so they aren't true snapshots — reusing the helper removes ~20 lines and fixes the aliasing.Q3 — Parallel-tool throughput regression
Sync
execute_batchuses a dedicatedThreadPoolExecutor(max_workers)per batch; asyncaexecute_batchuses the shared defaultrun_in_executor(None, …)pool gated bySemaphore(max_workers)(parallel_executor.py:729-737). Under concurrency or whenmax_workersexceeds the default pool size, previously-parallel tool calls can serialize. Correctness intact (order + locking preserved).Q4 —
FIFOLockheld acrossawaitarun()holds the thread-reentrantFIFOLockacrossawait astep(local_conversation.py:953-1007). Safe only because the server routes every state mutator throughrun_in_executorworker threads. Any future state-mutating coroutine on the event-loop thread would silently re-enter the lock and corrupt history. Worth documenting or releasing the lock across theawait.Q5 — Sync/async duplication
After normalizing
await/async, the twins are ~90%+ identical (astep/step96/126 lines;arun/runloop 75/80; plusacompletion/aresponses/_atransport_call/_ahandle_error/condenser). ~600 of the ~1,740 added production lines are reclaimable via shared color-agnostic helpers (_prepare_chat_request/_finalize_chat_response,_classify_step_error,_map_and_raise). This duplication is the root cause of B1/B2 diverging from the sync path.Suggested test additions
telemetry.on_errorfires on async failure (covers B1)LLMNoResponseError→ temperature escalates on async retry (covers B2)cancel_tokenafterarun()entersfinallystill seesis_cancelled(covers B3)RemoteConversation.interrupt()posts to/interrupt(covers B4)AgentBasesubclass runs viarun()in executor (covers B5)