Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation #128

Open
constkolesnyak wants to merge 3 commits into ClickHouse:main from constkolesnyak:upstream/anyio-cpu-fix
@constkolesnyak

Contributor

Problem

anyio 4.13.0's CancelScope._deliver_cancellation re-queues itself via call_soon() on every event-loop tick when the cancel can't actually land. Two shapes hit this:

  1. Live current task in the scope. The loop sets should_retry = True for every task in self._tasks then reschedules — but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever.
  2. Done tasks left in the scope (_must_cancel). A child task that already finished still sits in _tasks with _must_cancel=True and yields nothing to cancel. The loop again sets should_retry=True and reschedules forever.

Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k epoll_pwait syscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing _safe_disconnect() workaround in nerve/agent/engine.py only clears the scope during client.disconnect(), so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.
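
The spin mechanism itself can be reproduced without anyio. The following is a minimal sketch, not the anyio code path: `count_spins` and `retry_forever` are hypothetical names, and the callback simply re-queues itself via call_soon() the way a callback that always decides to retry would.

```python
import asyncio


async def count_spins(duration: float = 0.05) -> int:
    """Count how often a self-rescheduling callback runs within `duration` seconds."""
    loop = asyncio.get_running_loop()
    spins = 0
    stop = False

    def retry_forever() -> None:
        # Stand-in for a callback that always concludes it must retry.
        nonlocal spins
        if stop:
            return
        spins += 1
        loop.call_soon(retry_forever)  # re-queue for the very next loop iteration

    loop.call_soon(retry_forever)
    # While a callback is always ready, the loop polls with a zero timeout
    # instead of blocking, so this sleep still finishes but the CPU stays busy.
    await asyncio.sleep(duration)
    stop = True
    return spins
```

Calling `asyncio.run(count_spins())` typically reports thousands of iterations for a 50 ms window; the exact count depends on the machine.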

Fix

Monkeypatch anyio.CancelScope._deliver_cancellation (applied from nerve/__init__.py so any import path picks it up). Semantics match upstream byte-for-byte except:

  • Skip the current task entirely — it cannot cancel itself.
  • Skip tasks that are done (task.done()).
  • Only set should_retry=True when we actually delivered a cancel or the task is still waiting for pickup (_must_cancel is False but the task is not done).

Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).
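
The difference between the two retry decisions can be sketched as standalone functions. This is a simplified model under stated assumptions, not the contents of nerve/_anyio_patch.py: `retry_unconditional`, `retry_filtered` and `demo` are hypothetical names, only the current-task and done-task rules are modelled, and child scopes and the actual task.cancel() call are left out.

```python
import asyncio
from typing import Iterable, Optional


def retry_unconditional(tasks: Iterable[asyncio.Task]) -> bool:
    """Shape described under Problem: any tracked task forces a retry."""
    should_retry = False
    for _task in tasks:
        should_retry = True
    return should_retry


def retry_filtered(tasks: Iterable[asyncio.Task],
                   current: Optional[asyncio.Task]) -> bool:
    """Model of the first two rules: ignore the current task and done tasks."""
    should_retry = False
    for task in tasks:
        if task is current:
            continue  # a task cannot cancel itself from its own callback
        if task.done():
            continue  # nothing left to cancel
        should_retry = True  # a live, foreign task may still need delivery
    return should_retry


async def demo() -> tuple:
    current = asyncio.current_task()
    finished = asyncio.create_task(asyncio.sleep(0))
    await finished  # the task is now done but still "tracked" below
    tasks = [current, finished]
    return retry_unconditional(tasks), retry_filtered(tasks, current)
```

With a scope-like collection holding only the current task and a finished task, `asyncio.run(demo())` returns `(True, False)`: the unconditional shape asks for another pass, the filtered one does not.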

Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream anyio change is one line in _deliver_cancellation and (b) we want it to disappear automatically once anyio ships a fix.
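
An idempotent, signature-preserving monkeypatch can be sketched as follows. Everything here is illustrative: `StandInScope`, `apply_patch` and the `_nerve_patched` marker are hypothetical names, the stand-in class replaces the real scope class so the sketch runs without anyio installed, and the replacement body is a placeholder rather than the corrected logic.

```python
import functools

_PATCH_FLAG = "_nerve_patched"  # hypothetical marker attribute


class StandInScope:
    """Stand-in for the class being patched; not anyio's CancelScope."""

    def _deliver_cancellation(self, origin: object) -> bool:
        return True  # pretend the original always asks for a retry


def apply_patch(cls: type) -> bool:
    """Install the replacement once; return True only if this call installed it."""
    original = cls._deliver_cancellation
    if getattr(original, _PATCH_FLAG, False):
        return False  # already patched, so a second call is a no-op

    @functools.wraps(original)  # keeps __name__, __doc__ and the reported signature
    def patched(self, origin: object) -> bool:
        return False  # placeholder for the corrected retry logic

    setattr(patched, _PATCH_FLAG, True)
    cls._deliver_cancellation = patched
    return True
```

Marking the replacement function rather than the class keeps the check local to the attribute being replaced, so calling `apply_patch` from several import paths stays harmless.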

Tests

tests/test_anyio_patch.py (8 tests, all passing on this branch):

  • test_patch_applied — patch is installed at import time
  • test_does_not_reschedule_when_only_task_is_current — the original 100% CPU shape
  • test_skips_done_tasks — settled-tasks shape (the second 100% CPU bug)
  • test_must_cancel_pickup_pending_still_retries — semantics preservation: legit pickup-pending cancels still retry
  • test_delivered_cancel_still_retries — semantics preservation: real cancellations retry until landed
  • test_idempotent_apply — applying twice is a no-op
  • test_original_signature_preserved — no API drift
  • test_no_effect_when_scope_finished — scope without pending cancellation is untouched

Full suite: 1252 passed, 2 skipped, 2 failed — the 2 failures are pre-existing in tests/test_cron.py::TestMaybeRotateContext (unrelated rotate_at branch), present on main too.

Files

  • nerve/__init__.py — apply patch at import
  • nerve/_anyio_patch.py — the patched _deliver_cancellation (new)
  • tests/test_anyio_patch.py — regression suite (new)

Three commits kept as iterative narrative; squash on merge if preferred.

anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True
unconditionally for every task in self._tasks, then reschedules itself
via call_soon(). When every task in the scope is the *current* task,
nothing gets cancelled but the callback re-queues on every event-loop
tick — pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec.

Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work
done). The existing _safe_disconnect() workaround in agent.engine only
clears the stuck scope during client.disconnect(), so spins triggered
by telegram polling / cron / live SDK requests weren't covered.

The patch sets should_retry=True only when we actually delivered a
cancel or the task is still waiting pickup (_must_cancel). Semantics
otherwise match upstream byte-for-byte. Applied via nerve/__init__.py
so any import path picks it up.

Includes a regression test that exercises the exact pathological shape
(scope whose only task is the current task) and asserts the scope
stops rescheduling itself.

The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where
should_retry was set unconditionally, but left a narrower version of
the same bug: the `_must_cancel` branch still set should_retry=True.

When the current task itself sits with _must_cancel=True while running
the cancel callback (observed in production today, nerve/SDK path),
this re-queues _deliver_cancellation on every event-loop tick:

    before fix:  20% CPU, ~61k epoll_pwait/sec
    after fix:   should be idle (~5% CPU range)

Changes:
- Skip the current task entirely — it cannot cancel itself from inside
  the callback it is running.
- Drop should_retry=True in the _must_cancel branch — asyncio's
  Task.__step raises CancelledError when the task resumes, no retry
  needed from us.
- should_retry is now True only when we actually called task.cancel()
  in this pass.
- Add regression test that poisons the _must_cancel branch with a
  fake task. The existing "current-task-only" test passed without
  reproducing this variant because it didn't set _must_cancel.
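
The claim that asyncio delivers a recorded cancellation on its own can be checked with plain asyncio. A minimal sketch, assuming a task that is cancelled before its first step (`cancel_before_first_step` and `worker` are hypothetical names); it does not touch the private `_must_cancel` attribute directly.

```python
import asyncio


async def cancel_before_first_step() -> tuple:
    ran = False

    async def worker() -> None:
        nonlocal ran
        ran = True
        await asyncio.sleep(3600)

    task = asyncio.create_task(worker())
    # The task has not run yet and waits on nothing, so asyncio only records
    # the request; it is raised inside the task when the task is next stepped.
    requested = task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return requested, task.cancelled(), ran
```

`asyncio.run(cancel_before_first_step())` returns `(True, True, False)`: one cancel() call was accepted, the task ended cancelled, and the worker body never ran, all without any re-delivery.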

Third iteration of the anyio _deliver_cancellation spin. This time
a CancelScope retained already-finished tasks in self._tasks (anyio
doesn't always prune them before the cancel callback fires). For a
done task _must_cancel=False (cleared on final step), _task_started
is True, _fut_waiter is None — so the previous patch fell into the
"waiter not done → cancel()" branch. task.cancel() is a no-op on
done tasks, but should_retry was flagged anyway, so call_soon kept
re-queuing forever.
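
That task.cancel() does nothing on a finished task is standard asyncio behaviour and easy to confirm. A minimal sketch (`cancel_finished_task` is a hypothetical name):

```python
import asyncio


async def cancel_finished_task() -> tuple:
    task = asyncio.create_task(asyncio.sleep(0))
    await task                 # let the task run to completion
    requested = task.cancel()  # returns False: nothing left to cancel
    return requested, task.cancelled()
```

`asyncio.run(cancel_finished_task())` returns `(False, False)`. The return value of cancel() is therefore a usable signal that no cancellation was delivered.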

Observed live: three zombie-scopes in one process producing ~55k
epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C.
Confirmed via py-spy (stack parked on lines 91/95/97/98 of the
patch) and a gc-scan dump of CancelScope objects (three active
_cancel_handle scopes, each with a single done=True task).

Fix: add `if task.done(): continue` at the top of the loop.

Also add regression test that reproduces the zombie-scope shape
with a stubbed done task and asserts no cancel() call, no retry,
and no pending _cancel_handle.
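
The shape of such a test can be sketched with a stubbed task. This is an illustration, not the test in tests/test_anyio_patch.py: `deliver_once` is a hypothetical stand-in for one pass of the patched callback, and the `_cancel_handle` assertion is omitted because the stand-in has no scope object.

```python
from unittest.mock import Mock


def deliver_once(tasks, current) -> bool:
    """Hypothetical stand-in for one pass of the patched callback."""
    should_retry = False
    for task in tasks:
        if task is current or task.done():
            continue  # skip the running task and anything already finished
        task.cancel()
        should_retry = True
    return should_retry


def test_done_task_is_skipped() -> None:
    done_task = Mock()
    done_task.done.return_value = True  # stubbed zombie task
    assert deliver_once([done_task], current=None) is False
    done_task.cancel.assert_not_called()


def test_live_task_is_cancelled_and_retried() -> None:
    live_task = Mock()
    live_task.done.return_value = False
    assert deliver_once([live_task], current=None) is True
    live_task.cancel.assert_called_once()
```

The second test guards the other direction, so that skipping done tasks does not silently suppress retries for tasks that are still running.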