fix(sdk): keep cancelled token observable after arun() interrupt#3395
fix(sdk): keep cancelled token observable after arun() interrupt#3395VascoSch92 wants to merge 1 commit into
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste — the fix is small and preserves the intended cooperative-cancellation contract. I didn’t find any blocking code issues in the diff; focused interrupt tests pass locally (uv run pytest tests/sdk/conversation/test_interrupt.py -q).
Because this touches agent interruption/tool execution behavior, I’m leaving a COMMENT rather than approval per repo eval-risk guidance; a human maintainer should decide whether lightweight eval coverage is needed.
Automated review generated by OpenHands on behalf of the requester.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — small localized cancellation lifecycle change, but it affects agent run/interruption semantics and tool execution observability.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26497199123
Coverage Report •
|
||||||||||||||||||||
arun()'s finally cleared self._cancel_token unconditionally. Tool calls dispatched via run_in_executor run in worker threads that can't be force-cancelled and may outlive arun(), so a tool polling conversation.cancel_token after an interrupt saw None and lost the cancellation signal. Retain a cancelled token; the next run() replaces it.
d495eee to
0998a62
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the SDK async interrupt flow with a real Conversation.arun() call: the PR keeps the cancelled token observable after interrupt and replaces it with a fresh token on the next run.
Does this PR achieve its stated goal?
Yes. The PR's goal was to prevent arun() cleanup from clearing a cancelled token before late-running worker/thread code can observe cooperative cancellation. On origin/main, the same SDK script showed after_interrupt_token_present=False and late_worker_token_present=False; on the PR commit, both became True with is_cancelled=True, and the next run received a fresh uncancelled token.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; dependencies installed and build marked ready. |
| CI Status | 🟡 gh pr checks reported 0 failing, 19 successful, 7 pending, 6 skipped at review time. |
| Functional Verification | ✅ Reproduced the base bug and verified the PR fixes token observability plus fresh-token behavior. |
Functional Verification
Test 1: Interrupted async run keeps cancellation observable for late worker-style polling
Step 1 — Reproduce / establish baseline (without the fix):
Checked out base origin/main (58771eeb) and ran a standalone SDK script that creates an Agent with a hanging async LLM, starts Conversation.arun(), calls conversation.interrupt(), then has a separate worker-style thread poll conversation.cancel_token after arun() has unwound.
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_cancel_token.py 2>&1 | grep -E '^(during|after|late|next)':
58771eeb (HEAD, origin/main) fix(sdk): dedupe LLMs by usage_id in _ensure_agent_ready registration loop (#3392)
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=False
after_interrupt_token_cancelled=None
late_worker_token_present=False
late_worker_token_cancelled=None
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False
This confirms the bug exists on base: while the token exists during the run, after interrupt() and arun() cleanup a late poll sees None, so cooperative cancellation is no longer observable.
Step 2 — Apply the PR's changes:
Checked out fix/cancel-token-after-interrupt at d495eeeaab6631dbe182af8e8fbfa1e96f7965f7.
Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:
d495eeea (HEAD -> fix/cancel-token-after-interrupt, origin/fix/cancel-token-after-interrupt) fix(sdk): keep cancelled token observable after arun() interrupt
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=True
after_interrupt_token_cancelled=True
late_worker_token_present=True
late_worker_token_cancelled=True
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False
This confirms the fix works: after interrupt, both the direct public conversation.cancel_token poll and the late worker-style poll observe the retained cancelled token. The follow-up run also creates a fresh uncancelled token, matching the PR's safety claim.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
| # A cancelled token must stay observable: interrupted tool calls run | ||
| # in worker threads that can outlive arun() and still poll it. A | ||
| # fresh token is created on the next run(). | ||
| if self._cancel_token is not None and not self._cancel_token.is_cancelled: | ||
| self._cancel_token = None |
There was a problem hiding this comment.
I think we should pass a cancel token to each tool call. If arun can be called without a tool call finishing, interrupts should stop everything that's currently in flight
wdyt?
Why
Addresses B3 of #3341.
arun()'sfinallyclearedself._cancel_tokenunconditionally. Oninterrupt(), the in-flightasyncio.Taskis cancelled (breaking theawaitinastep), but tool calls dispatched viarun_in_executorrun in worker threads that can't be force-cancelled and may outlivearun(). A tool pollingconversation.cancel_token— the documented cooperative-cancellation pattern — after thefinallyran sawNoneand lost the cancellation signal, the opposite of whatinterrupt()intended.What
_cancel_tokeninarun()'sfinallyonly when it was not cancelled. A cancelled token stays observable; the nextrun()/arun()creates a fresh one, so leaving it is safe.test_interrupt_sets_cancel_token(it encoded the old behaviour) and addedtest_cancel_token_stays_observable_after_interrupt.The sync
run()path is not affected:interrupt()falls back topause()there andstep()runs inline, so thefinallyonly fires after tools return.Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:0998a62-pythonRun
All tags pushed for this build
About Multi-Architecture Support
0998a62-python) is a multi-arch manifest supporting both amd64 and arm640998a62-python-amd64) are also available if needed