Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation by constkolesnyak · Pull Request #128 · ClickHouse/nerve

constkolesnyak · 2026-06-20T22:20:15Z

Problem

anyio 4.13.0's CancelScope._deliver_cancellation re-queues itself via call_soon() on every event-loop tick when the cancel can't actually land. Two shapes hit this:

Live current task in the scope. The loop sets should_retry = True for every task in self._tasks then reschedules — but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever.
Done tasks left in the scope (_must_cancel). A child task that already finished still sits in _tasks with _must_cancel=True and yields nothing to cancel. The loop again sets should_retry=True and reschedules forever.
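To make the failure mode concrete, here is a minimal toy model of a callback that unconditionally re-queues itself with call_soon(). It is only a sketch: `ToyScope`, `deliver`, and `count_spins` are hypothetical names for illustration and are not anyio's real implementation. It shows that such a callback runs once per event-loop iteration even though no work is being done.

```python
import asyncio


class ToyScope:
    """Hypothetical stand-in for a cancel scope; NOT anyio's real class."""

    def __init__(self, loop):
        self.loop = loop
        self.deliveries = 0
        self.handle = None

    def deliver(self):
        self.deliveries += 1
        # Models the unconditional retry described above: nothing was
        # cancelled, yet the callback schedules itself again.
        should_retry = True
        if should_retry:
            self.handle = self.loop.call_soon(self.deliver)


async def count_spins(duration=0.05):
    loop = asyncio.get_running_loop()
    scope = ToyScope(loop)
    scope.deliver()                 # first delivery attempt
    await asyncio.sleep(duration)   # the loop spins in the meantime
    scope.handle.cancel()           # stop the spin so the loop can exit
    return scope.deliveries


spins = asyncio.run(count_spins())
print(f"callback ran {spins} times while the program was otherwise idle")
```

Because the ready queue is never empty, the event loop polls with a zero timeout on every iteration, which is consistent with the high epoll_pwait rate reported above.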

Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k epoll_pwait syscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing _safe_disconnect() workaround in nerve/agent/engine.py only clears the scope during client.disconnect(), so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.

Fix

Monkeypatch anyio.CancelScope._deliver_cancellation (applied from nerve/__init__.py so any import path picks it up). Semantics match upstream byte-for-byte except:

Skip the current task entirely — it cannot cancel itself.
Skip tasks that are done (task.done()).
Only set should_retry=True when we actually delivered a cancel or the task is still waiting pickup (_must_cancel is False but the task is not done).
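The three rules above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the patched anyio code: `FakeTask`, its `must_cancel` field, and `deliver_once` are hypothetical, and treating a pickup-pending task as a reason to retry is one reading of the third rule.

```python
from dataclasses import dataclass


@dataclass
class FakeTask:
    """Hypothetical stub; a real asyncio.Task carries far more state."""
    name: str
    is_done: bool = False
    must_cancel: bool = False   # models a cancel that is requested but not yet picked up
    cancel_calls: int = 0

    def done(self):
        return self.is_done

    def cancel(self):
        self.cancel_calls += 1


def deliver_once(tasks, current):
    """Return True if another delivery pass should be scheduled."""
    should_retry = False
    for task in tasks:
        if task is current:
            continue            # rule 1: a task cannot cancel itself
        if task.done():
            continue            # rule 2: nothing left to cancel
        if task.must_cancel:
            should_retry = True  # rule 3 (assumed reading): pickup still pending
            continue
        task.cancel()           # a cancel was actually delivered
        should_retry = True
    return should_retry


current = FakeTask("current")
finished = FakeTask("finished", is_done=True)
live = FakeTask("live")

print(deliver_once([current, finished], current))  # no retry expected
print(deliver_once([current, live], current))      # retry expected
```

With only the current task and a finished task in the scope, the function returns False, so nothing would be rescheduled; a live sibling task still gets its cancel and a retry.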

Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).

Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream anyio change is one line in _deliver_cancellation and (b) we want it to disappear automatically once anyio ships a fix.
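For readers unfamiliar with the pattern, a monkeypatch that is safe to apply more than once and keeps the original signature might look roughly like the following. This is a generic sketch, not the contents of nerve/_anyio_patch.py: `apply_patch`, `_PATCH_MARKER`, and the `Dummy` class are hypothetical, and a dummy class is used instead of anyio so the example stands alone.

```python
import functools
import inspect

_PATCH_MARKER = "_patched_by_example"  # hypothetical marker attribute


def apply_patch(cls, method_name, replacement):
    """Install `replacement` on `cls`; return False if already installed."""
    original = getattr(cls, method_name)
    if getattr(original, _PATCH_MARKER, False):
        return False  # idempotent: a second call changes nothing

    @functools.wraps(original)  # keeps name, docstring, and reported signature
    def patched(self, *args, **kwargs):
        return replacement(self, original, *args, **kwargs)

    setattr(patched, _PATCH_MARKER, True)
    setattr(cls, method_name, patched)
    return True


class Dummy:
    def deliver(self, origin):
        return "upstream"


def replacement(self, original, origin):
    # A real patch could fall back to original(self, origin) where appropriate.
    return "patched"


first = apply_patch(Dummy, "deliver", replacement)
second = apply_patch(Dummy, "deliver", replacement)
print(first, second, Dummy().deliver(None), inspect.signature(Dummy.deliver))
```

Marking the installed function rather than keeping module-level state means the check still works if the module is imported through more than one path.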

Tests

tests/test_anyio_patch.py (8 tests, all passing on this branch):

test_patch_applied — patch is installed at import time
test_does_not_reschedule_when_only_task_is_current — the original 100% CPU shape
test_skips_done_tasks — settled-tasks shape (the second 100% CPU bug)
test_must_cancel_pickup_pending_still_retries — semantics preservation: legit pickup-pending cancels still retry
test_delivered_cancel_still_retries — semantics preservation: real cancellations retry until landed
test_idempotent_apply — applying twice is a no-op
test_original_signature_preserved — no API drift
test_no_effect_when_scope_finished — scope without pending cancellation is untouched

Full suite: 1252 passed, 2 skipped, 2 failed — the 2 failures are pre-existing in tests/test_cron.py::TestMaybeRotateContext (unrelated rotate_at branch), present on main too.

Files

nerve/__init__.py — apply patch at import
nerve/_anyio_patch.py — the patched _deliver_cancellation (new)
tests/test_anyio_patch.py — regression suite (new)

Three commits kept as iterative narrative; squash on merge if preferred.

anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True unconditionally for every task in self._tasks, then reschedules itself via call_soon(). When every task in the scope is the *current* task, nothing gets cancelled but the callback re-queues on every event-loop tick, pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec. Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work done).

The existing _safe_disconnect() workaround in agent.engine only clears the stuck scope during client.disconnect(), so spins triggered by Telegram polling / cron / live SDK requests weren't covered.

The patch sets should_retry=True only when we actually delivered a cancel or the task is still waiting pickup (_must_cancel). Semantics otherwise match upstream byte-for-byte. Applied via nerve/__init__.py so any import path picks it up.

Includes a regression test that exercises the exact pathological shape (scope whose only task is the current task) and asserts the scope stops rescheduling itself.

The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where should_retry was set unconditionally, but left a narrower version of the same bug: the `_must_cancel` branch still set should_retry=True. When the current task itself sits with _must_cancel=True while running the cancel callback (observed in production today, nerve/SDK path), this re-queues _deliver_cancellation on every event-loop tick:

before fix: 20% CPU, ~61k epoll_pwait/sec
after fix: should be idle (~5% CPU range)

Changes:
- Skip the current task entirely: it cannot cancel itself from inside the callback it is running.
- Drop should_retry=True in the _must_cancel branch: asyncio's Task.__step raises CancelledError when the task resumes, no retry needed from us.
- should_retry is now True only when we actually called task.cancel() in this pass.
- Add regression test that poisons the _must_cancel branch with a fake task. The existing "current-task-only" test passed without reproducing this variant because it didn't set _must_cancel.

Third iteration of the anyio _deliver_cancellation spin. This time a CancelScope retained already-finished tasks in self._tasks (anyio doesn't always prune them before the cancel callback fires). For a done task, _must_cancel=False (cleared on final step), _task_started is True, and _fut_waiter is None, so the previous patch fell into the "waiter not done → cancel()" branch. task.cancel() is a no-op on done tasks, but should_retry was flagged anyway, so call_soon kept re-queuing forever.

Observed live: three zombie-scopes in one process producing ~55k epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C. Confirmed via py-spy (stack parked on lines 91/95/97/98 of the patch) and a gc-scan dump of CancelScope objects (three active _cancel_handle scopes, each with a single done=True task).

Fix: add `if task.done(): continue` at the top of the loop. Also add regression test that reproduces the zombie-scope shape with a stubbed done task and asserts no cancel() call, no retry, and no pending _cancel_handle.
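A gc-scan of the kind mentioned above could be approximated as follows. This is a hedged sketch rather than the script that was actually used: `find_zombie_scopes`, `StubTask`, and `StubScope` are hypothetical, and the private attribute names `_cancel_handle` and `_tasks` are taken from the commit message and may not match every anyio version. Stub classes stand in for anyio so the example runs on its own.

```python
import gc


def find_zombie_scopes(scope_cls):
    """Return live `scope_cls` instances that have a pending cancel handle
    while every task they hold is already done."""
    zombies = []
    for obj in gc.get_objects():  # only GC-tracked objects are visible here
        if not isinstance(obj, scope_cls):
            continue
        handle = getattr(obj, "_cancel_handle", None)
        tasks = getattr(obj, "_tasks", ())
        if handle is not None and tasks and all(t.done() for t in tasks):
            zombies.append(obj)
    return zombies


class StubTask:
    """Hypothetical stand-in for a finished asyncio task."""

    def done(self):
        return True


class StubScope:
    """Hypothetical stand-in for a cancel scope."""

    def __init__(self, handle, tasks):
        self._cancel_handle = handle
        self._tasks = tasks


zombie = StubScope(handle=object(), tasks={StubTask()})
healthy = StubScope(handle=None, tasks={StubTask()})

found = find_zombie_scopes(StubScope)
print(len(found), found[0] is zombie)
```

In a live process the same function would be called with the real scope class from an attached debugger or diagnostic hook.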

constkolesnyak added 3 commits June 21, 2026 00:19

constkolesnyak wants to merge 3 commits into ClickHouse:main from constkolesnyak:upstream/anyio-cpu-fix