Subint era tooling#462
Open
goodboy wants to merge 107 commits into
Pull request overview
This PR adds “subint-era” hardening and diagnostics across tractor’s spawn/cancel/IPC teardown paths, plus a small framework for Trio monkey-patches and associated regression tests. It also bumps the supported Python range and expands dev/test tooling to better debug hangs and cleanup leaked resources (UDS sock-files, SHM segments, zombie subactors).
Changes:
- Introduces `tractor.trionics.patches` (catalog + contract) and a first defensive Trio patch for the `WakeupSocketpair.drain()` EOF busy-loop, with regression tests.
- Improves teardown robustness: cancel escalation (`ActorTooSlowError`), bounded peer waits, parent-channel shield break, per-endpoint close isolation, and post-SIGKILL UDS sock cleanup.
- Adds developer tooling: per-actor proc titles, hang-dump and resource-delta helpers, a `tractor-reap` cleanup script, and logging-spec parsing/apply support.
Reviewed changes
Copilot reviewed 78 out of 81 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tractor/trionics/patches/README.md | Documents patch scope + per-module API/contract. |
| tractor/trionics/patches/__init__.py | Registers patch modules and provides apply_all(). |
| tractor/trionics/patches/_wakeup_socketpair.py | Monkey-patch for Trio wakeup socketpair EOF busy-loop + repro. |
| tractor/_child.py | Applies patches early in child boot + sets per-actor proctitle. |
| tractor/devx/_proctitle.py | Optional setproctitle integration for actor processes. |
| tests/trionics/test_patches.py | Regression tests for Trio patches + apply_all() idempotency. |
| tractor/spawn/_trio.py | Passes bind addr/subactor context into hard_kill() for cleanup. |
| tractor/spawn/_spawn.py | Adds post-mortem UDS sock unlinking hook to hard_kill(). |
| tractor/spawn/_reap.py | New parent-side “reap” helpers (UDS sock cleanup). |
| tractor/runtime/_portal.py | Adds raise_on_timeout option + ActorTooSlowError on cancel timeout. |
| tractor/runtime/_supervise.py | Cancel-then-escalate helper used by ActorNursery.cancel(). |
| tractor/_exceptions.py | Adds ActorTooSlowError for cancel-ack timeout escalation. |
| tractor/runtime/_runtime.py | Teardown deadlock guards (shield-break, bounded peer waits, stackscope enable). |
| tractor/runtime/_state.py | Adds runtime-vars defaults + set_runtime_vars() API. |
| tractor/_root.py | Env-var overrides (loglevel/spawn method) + transport/registry mismatch fail-fast. |
| tractor/ipc/_server.py | Ensures endpoint-close failures don’t deadlock shutdown signaling. |
| tractor/ipc/_uds.py | Makes UDS unlink tolerant to concurrent unlink races. |
| tractor/ipc/_shm.py | Makes SHM unlink best-effort and tolerant to prior deallocation. |
| tractor/ipc/_mp_bs.py | Simplifies resource_tracker disabling + always uses track=False. |
| tractor/ipc/_linux.py | Adds import-time ImportError note for cffi support expectations. |
| tractor/log.py | Fixes logger-name collapsing edge cases + adds logspec parse/apply. |
| tractor/devx/_debug_hangs.py | Adds hang dump + resource delta tracking utilities. |
| tractor/devx/__init__.py | Exposes new devx hang helpers. |
| tractor/devx/debug/_tty_lock.py | Adjusts formatting of lock repr/log output. |
| tractor/_testing/addr.py | Improves per-process random TCP port selection to reduce collisions. |
| tests/test_spawning.py | Extends forkserver capture skip rationale. |
| tests/test_shm.py | Skips SHM tests on subint backend with rationale. |
| tests/test_ringbuf.py | Adds cffi import-or-skip note for py3.14 constraints; keeps test skipped. |
| tests/test_pubsub.py | Skips pubsub on subint pending hang issues. |
| tests/test_multi_program.py | Removes old multi-program tests (moved under discovery). |
| tests/discovery/test_multi_program.py | New home for multi-program discovery tests + cancellation regression. |
| tests/discovery/conftest.py | Adds daemon fixture + active readiness polling (replaces blind sleeps). |
| tests/discovery/test_registrar.py | Adjusts registrar tests; adds hang dump guard; fixture usage refinements. |
| tests/test_log_sys.py | Updates expectations for explicit logger naming behavior. |
| tests/test_local.py | Updates no-runtime test to use trio.run() and NoRuntime. |
| tests/test_legacy_one_way_streaming.py | Backend-aware timeouts + cancellation semantics tweaks. |
| tests/test_inter_peer_cancellation.py | Skips known subint hang classes + adds per-test UDS orphan tracking. |
| tests/test_context_stream_semantics.py | Makes tests pass registry_addrs; adjusts timeouts for fork-based spawners. |
| tests/test_clustering.py | Adjusts fail_after budget for fork-based spawners. |
| tests/test_advanced_streaming.py | Adds internal trio-based timeouts + richer hang diagnostics and cancellation matching. |
| tests/msg/test_pldrx_limiting.py | Backend-aware timeout default selection for msg validation tests. |
| tests/msg/test_ext_types_msgspec.py | Adds fork-aware capture fixture usage + wraps in fail_after. |
| tests/ipc/test_multi_tpt.py | Asserts new enable_transports vs registry_addrs mismatch fail-fast behavior. |
| tests/devx/conftest.py | Extends pexpect harness to drive spawn backend/loglevel via env vars. |
| tests/devx/test_tooling.py | Updates expectations and harness typing; adjusts behavior under fork spawners. |
| tests/devx/test_proctitle.py | Adds tests for proctitle + intrinsic subactor detection (Linux). |
| tests/devx/test_pause_from_non_trio.py | Adds greenback gating for sync pause tests; adjusts expected output. |
| tests/devx/test_debugger.py | Makes debugger tests more fork-aware; adjusts loglevels/pattern matching. |
| tests/conftest.py | Moves common logging/daemon fixtures into plugin/sub-conftest; updates notes. |
| scripts/tractor-reap | Adds CLI utility to reap zombie subactors + sweep shm/UDS leaks. |
| pyproject.toml | Bumps Python range; adds deps/groups (setproctitle, pytest-timeout, psutil, etc.). |
| .github/workflows/ci.yml | Adjusts pytest invocation formatting + explicit capture mode. |
| .gitignore | Updates ignore patterns (AI/worktree/tooling related). |
| examples/debugging/sync_bp.py | Adds env-driven prompt-color disabling + loglevel adjustments for harness. |
| examples/debugging/subactor_error.py | Minor debugger example cleanup. |
| examples/debugging/subactor_bp_in_ctx.py | Notes future env-var parametrization for transport selection. |
| examples/debugging/shield_hang_in_sub.py | Updates actor nursery child lookup to use aid.uid. |
| examples/debugging/root_timeout_while_child_crashed.py | Cleans docstring/commentary and prints child aid. |
| examples/debugging/root_cancelled_but_child_is_in_tty_lock.py | Reorders/annotates debug loglevel/transport args. |
| examples/debugging/multi_nested_subactors_error_up_through_nurseries.py | Sets loglevel for test pattern matching. |
| examples/debugging/multi_daemon_subactors.py | Minor nursery variable rename/cleanup. |
| ai/tooling-todos/logspec_leaf_module_granularity_route_b.md | Adds design notes for future logger granularity work. |
| ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md | Adds concurrency analysis + repro for Trio busy-loop issue. |
| ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md | Adds analysis note for Trio version cascade slowdown. |
| ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md | Adds analysis note for daemon readiness race and mitigation. |
| ai/conc-anal/spawn_time_boot_death_dup_name_issue.md | Adds analysis note for duplicate-name boot death race. |
| ai/conc-anal/fork_thread_semantics_execution_vs_memory.md | Adds reference doc on fork semantics in multi-threaded procs. |
| .claude/skills/conc-anal/SKILL.md | Updates conc-anal guidance (hang patterns, teardown waits). |
| .claude/settings.local.json | Updates local agent permissions/config. |
| .claude/notes/rt_vars_lift_plan.md | Adds draft plan for runtime-vars env var lift. |
Comment on lines +317 to +321
    cancel_timeout: float = (
        timeout
        or
        self.cancel_timeout
    )
Comment on lines +339 to 343
    raise ActorTooSlowError(
        f'Peer {peer_id} did not ack `Actor.cancel()`'
        f'-RPC within bounded wait of '
        f'{cancel_timeout!r}s'
    )
Comment on lines +24 to +32
    try:
        import cffi
    except ImportError as ie:
        if sys.version_info < (3, 14):
            ie.add_note(
                f'The `cffi` pkg has no 3.14 support yet.\n'
            )
        raise ie
Comment on lines +12 to +17
    # disable `pbdp` prompt colors
    # for prompt matching in test.
    def disable_pdbp_color():
        if os.environ['PYTHON_COLORS'] == '0':
            from tractor.devx.debug import _repl
            _repl.TractorConfig.use_pygments = False
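Indexing `os.environ['PYTHON_COLORS']` raises `KeyError` when the variable is unset, which matters if the example is ever run outside the test harness. A small sketch of a tolerant check follows; `colors_disabled` is a hypothetical helper name, and the mapping parameter exists only so the behavior can be shown without touching the real environment.

```python
import os
from collections.abc import Mapping


def colors_disabled(
    environ: Mapping[str, str] = os.environ,
) -> bool:
    '''
    Hypothetical helper: True only when `PYTHON_COLORS` is
    explicitly set to '0'.

    '''
    # `.get()` returns None for an unset var instead of raising
    return environ.get('PYTHON_COLORS') == '0'
```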
| # itself relies on... but `repro()` runs OUTSIDE | ||
| # a trio.run, so it's plain stdlib semantics here | ||
| # — alarm WILL fire during `recv` syscall). | ||
| signal.alarm(2) |
Comment on lines +57 to +62
    # Apply the patch.
    applied: bool = wsp.apply()
    # First call MUST return True; idempotent guard
    # prevents False on subsequent calls within the
    # same process.
    assert applied is True or applied is False  # idempotent
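The quoted assertion accepts both `True` and `False`, so it cannot fail for any boolean result. The sketch below shows one possible apply-once contract that a test could pin down instead. It assumes `apply()` returns `True` only on the call that installs the patch and `False` afterwards; that reading, the `Target` class, and the module-level flag are all hypothetical and not taken from the PR's `_wakeup_socketpair` module.

```python
class Target:
    '''
    Hypothetical class standing in for the patched internals.

    '''
    def drain(self) -> str:
        return 'original'


# module-level guard so repeated calls do not re-patch
_applied: bool = False


def apply() -> bool:
    '''
    Install the patch once; return True only on the call
    that actually patched.

    '''
    global _applied
    if _applied:
        return False

    def patched_drain(self) -> str:
        return 'patched'

    Target.drain = patched_drain
    _applied = True
    return True
```

In a shared test process the first call may already have happened elsewhere, so a stricter test would either reset the guard in a fixture or assert on the post-condition (the patched behavior) rather than on the return value alone.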
f937cc9 to
6b0cb17
Compare
Prep for a future sub-interpreter (PEP 734 `concurrent.interpreters`)
spawn backend per issue test-harness error-gating; the backend itself
comes later.
Deats,
- bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and list
  the `3.14` classifier; the new stdlib `concurrent.interpreters`
  module only ships on 3.14
- `_testing.pytest.pytest_configure` wraps `try_set_start_method()` in
  a `pytest.UsageError` handler so an unsupported `--spawn-backend` on
  the running py-version prints a clean banner instead of a traceback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d318f1f)
(factored: kept only the pyproject + `_testing/pytest.py` parts of "Add `'subint'` spawn backend scaffold (#379)"; dropped tractor/spawn/_spawn.py + tractor/spawn/_subint.py)
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
wheeled yet
* on nixos the sdist build was failing due to lack of `g++` which
i don't care to figure out rn since we don't need `.devx` stuff
immediately for this subints prototype.
* [ ] we still need to adjust any dependent suites to skip.
Adjust `test_ringbuf` to skip on import failure.
Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
(cherry picked from commit d2ea8aa)
Pull the `_child.py` `__main__` block body out into a callable
`_actor_child_main()` so alternate spawn backends can bootstrap a
subactor without going through the CLI entrypoint.
Deats,
- new `_actor_child_main(uid, loglevel, parent_addr, infect_asyncio,
  spawn_method='trio')` holds the full child-side runtime startup
  previously inlined under `if __name__ == '__main__':`
- `__main__` block reduces to arg-parsing + a call into the new func
- add `"subint"` to the `_runtime.py` spawn-method check so a child
  accepts `SpawnSpec` from that (future) backend; inert str-compare
  w/o it
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b8f243e)
(factored: kept only the `_child.py`/`_runtime.py` entry-extraction parts of "Impl min-viable `subint` spawn backend (B.2)"; dropped tractor/spawn/_subint.py + subint prompt-io logs)
Bottle up the diagnostic primitives that actually cracked the silent
mid-suite hangs in the `subint` spawn-backend bringup (issue there"
session has them on the shelf instead of reinventing from scratch.
Deats,
- `dump_on_hang(seconds, *, path)`: context manager wrapping
  `faulthandler.dump_traceback_later()`. Critical gotcha baked in:
  dumps go to a *file*, not `sys.stderr`, bc pytest's stderr capture
  silently eats the output and you can spend an hour convinced you're
  looking at the wrong thing
- `track_resource_deltas(label, *, writer)`: context manager logging
  per-block `(threading.active_count(),
  len(_interpreters.list_all()))` deltas; quickly rules out
  leak-accumulation theories when a suite progressively worsens (if
  counts don't grow, it's not a leak, look for a race on shared
  cleanup instead)
- `resource_delta_fixture(*, autouse, writer)`: factory returning a
  `pytest` fixture wrapping `track_resource_deltas` per-test; opt in
  by importing into a `conftest.py`. Kept as a factory (not a bare
  fixture) so callers own `autouse` / `writer` wiring
Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
  commit `26fb820`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 09466a1)
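The `dump_on_hang` idea from the commit message above can be sketched with only the stdlib. This is a minimal illustration of the technique (arming `faulthandler.dump_traceback_later()` against a real file and disarming it on exit), not the PR's implementation; the body, the yielded value, and the error handling are assumptions.

```python
import faulthandler
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def dump_on_hang(seconds: float, *, path: str):
    '''
    Sketch only: dump all thread tracebacks to `path` if the
    wrapped block is still running after `seconds`.

    '''
    # write to a real file, since a captured `sys.stderr`
    # may swallow the dump
    with open(path, 'w') as f:
        faulthandler.dump_traceback_later(
            seconds,
            file=f,
            exit=False,  # only report, never kill the process
        )
        try:
            yield Path(path)
        finally:
            # disarm the watchdog once the block finishes
            faulthandler.cancel_dump_traceback_later()
```

Usage would look like `with dump_on_hang(20, path='/tmp/hang.txt'): trio.run(main)`, where the path is an arbitrary example. If the block finishes in time the file stays empty.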
Wrap the test's `trio.run(main)` in `dump_on_hang(seconds=20)` so any
future hang regression captures a stack dump for triage instead of
wedging CI silently; under the default backends it's a no-op safety
net.
Includes a "KNOWN ISSUE" comment block documenting the (future)
`subint` backend hang classes observed against this test during Phase
B bringup (#379).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4a32545)
(factored: kept only the tests/discovery/test_registrar.py part of "Doc `subint` backend hang classes + arm `dump_on_hang`"; dropped subint conc-anal docs + tests/test_subint_cancellation.py)
Add a hard process-level wall-clock bound on a test known to wedge
un-Ctrl-C-ably under an in-dev spawn backend, so an unattended suite
run can't hang indefinitely.
Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
  `@pytest.mark.timeout(3, method='thread')`. The `method='thread'`
  choice is deliberate: `method='signal'` routes via `SIGALRM` which
  can be starved by the same GIL-hostage path that drops `SIGINT`, so
  it'd never actually fire in the starvation case.
At timeout, `pytest-timeout` hard-kills the pytest process itself;
that's the intended behavior here, the alternative is the suite never
returning.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 189f4e3f72e9f1eda5d24bcbab5743f7e35bd913)
(factored: kept pyproject + tests/discovery/test_registrar.py parts of "Wall-cap `subint` audit tests via `pytest-timeout`"; dropped tests/test_subint_cancellation.py)
(cherry picked from commit 985ea76)
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.
Deats,
- `pytest_configure()` registers the marker via
`addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
each collected item with `item.iter_markers(
name='skipon_spawn_backend')`, checks whether the
active `--spawn-backend` appears in `mark.args`, and
if so injects a concrete `pytest.mark.skip(
reason=...)`. `iter_markers()` makes the decorator
work at function, class, or module (`pytestmark =
[...]`) scope transparently.
- First matching mark wins; default reason is
`f'Borked on --spawn-backend={backend!r}'` if the
caller doesn't supply one.
Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3b26b59)
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b52) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.
Deats,
- Module-level `pytestmark` on full-file-hanging suites:
- `tests/test_cancellation.py`
- `tests/test_inter_peer_cancellation.py`
- `tests/test_pubsub.py`
- `tests/test_shm.py`
- Per-test decorator where only one test in the file
hangs:
- `tests/discovery/test_registrar.py
::test_stale_entry_is_deleted` — replaces the
inline `if start_method == 'subint': pytest.skip`
branch with a declarative skip.
- `tests/test_subint_cancellation.py
::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
place as breadcrumbs for later finer-grained unskips.
Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
(`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
`tests/conftest.py`, `tests/devx/conftest.py`, and
`tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
`@tractor_test(...)` on `test_cancel_infinite_streamer`
and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
`test_cancel_via_SIGINT`; use `start_method` in the
`'mp' in ...` check instead.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4b2a088)
(factored: dropped spawn-backend-only path: tests/test_subint_cancellation.py)
Resetting `_runtime_vars` post-(forking-)spawn was previously only
possible via direct mutation of `_state._runtime_vars` from an
external module + an inline default dict duplicating the
`_state.py`-internal defaults. Split the access surface into a pure
getter + explicit setter so such a reset call site becomes a one-liner
composition: `set_runtime_vars(get_runtime_vars(clear_values=True))`.
Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the live
  `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False` kwarg.
  When True, returns a fresh copy of `_RUNTIME_VARS_DEFAULTS` instead
  of the live dict; still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)`: atomic
  replacement of the live dict's contents via `.clear()` +
  `.update()`, so existing references to the same dict object remain
  valid. Accepts either the historical dict form or the `RuntimeVars`
  struct
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7804a9fe57693dd5e15bee6a08e7d2fa14b6a98a)
(factored: kept only the tractor/runtime/_state.py part; dropped tractor/spawn/_subint_forkserver.py call-site rewire)
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key off the same tuple; keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8bcbe73)
Stopgap companion to d012196 (`subint_forkserver` test-cancellation
leak doc): five tests in `tests/test_cancellation.py` were running
against the default `:1616` registry, so any leaked `subint-forkserv`
descendant from a prior test holds the port and blows up every
subsequent run with `TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run picks its own
port; zombies can no longer poison other tests (they'll only
cross-contaminate whatever happens to share their port, which is now
nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the two
  `open_nursery()` calls that previously had no kwargs at all (in
  `test_cancel_via_SIGINT` and `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')` to
  `test_nested_multierrors` so a hung run doesn't wedge the whole
  session
Still doesn't close the real leak: the `subint_forkserver` backend's
`_ForkedProc.kill()` is PID-scoped not tree-scoped, so grandchildren
survive teardown regardless of registry port. This commit is just
blast-radius containment until that fix lands. See
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1af2121)
Completes the nested-cancel deadlock fix started in 0cd0b63
(fork-child FD scrub) and fe540d0 (pidfd-cancellable wait). The
remaining piece: the parent-channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill it prematurely),
and relies on EOF arriving when the parent closes the socket to exit
naturally.
Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is
reliable: parent's teardown closes the handler-task socket
deterministically. But fork-based backends like `subint_forkserver`
share enough process-image state that EOF delivery becomes racy: the
loop parks waiting for an EOF that only arrives after the parent
finishes its own teardown, but the parent is itself blocked on
`os.waitpid()` for THIS actor's exit. Mutual wait → deadlock.
Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan `process_messages` task
  onto the actor as `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` +
  `wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield; no more waiting for EOF delivery,
  unwinding proceeds deterministically regardless of backend
- inline comments on both sites explain the mutual-wait deadlock + why
  the explicit cancel is backend-agnostic rather than a
  forkserver-specific workaround
With this + the prior two fixes, the `subint_forkserver` nested-cancel
cascade unwinds cleanly end-to-end.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8ac3dfe)
Skip-mark the still-hanging
`test_nested_multierrors[subint_forkserver]` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test matrix
while the remaining bug is being chased. The mark is
an inert no-op until that (in-dev) backend lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 506617c)
(factored: kept only the tests/test_cancellation.py skip-mark; dropped
the subint_forkserver conc-anal doc update)
Fifth diagnostic pass pinpointed the hang to `async_main`'s finally
block: every stuck actor reaches `FINALLY ENTER` but never
`RETURNING`. Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler is stuck: the
`_no_more_peers` Event is set only when `server._peers` empties, and
stuck handlers keep their channels registered.
Wrap the call in `trio.move_on_after(3.0)` + a warning-log on timeout
that records the still-connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're in bug territory and
need to proceed with local teardown so the parent's
`_ForkedProc.wait()` can unblock. Defensive-in-depth regardless of the
underlying bug: a local finally shouldn't block on remote cooperation
forever.
Verified: with this fix, ALL 15 actors reach `async_main: RETURNING`
(up from 10/15 before). Test still hangs past 45s though; there's at
least one MORE unbounded wait downstream of `async_main`. Candidates
enumerated in the doc update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks / `_serve_ipc_eps`
finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.
Also updates `subint_forkserver_test_cancellation_leak_issue.md` with
the new pinpoint + summary of the 6-item investigation win list:
1. FD hygiene fix (`_close_inherited_fds`): orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait`: cancellable
3. `_parent_chan_cs` wiring: shielded parent-chan loop now breakable
4. `wait_for_no_more_peers` bound: THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck socket recv,
   capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded wait in the teardown
   cascade above `async_main`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit e312a68)
(factored: dropped subint_forkserver conc-anal doc update)
Refresh the `test_nested_multierrors` skip-mark reason to the final
diagnosis: the hang is pytest's default `--capture=fd` pipe filling
from high-volume subactor traceback output inherited via fds 1,2 in
fork children; `pytest -s` passes cleanly. Records the fix direction
(redirect child stdio to `/dev/null` in the fork-child prelude) for
whoever lands the backend.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit eceed29)
(factored: kept only the tests/test_cancellation.py skip-reason update of "Pin forkserver hang to pytest `--capture=fd`"; dropped the subint conc-anal doc + tests/spawn/test_subint_forkserver.py)
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).
Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
`os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
`sys.stderr`).
- KEPT: `pytest -s` debugging behavior.
This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.
Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
rationale comment cross-ref'ing the post-mortem doc
- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
from `test_nested_multierrors`, no longer needed.
* file-level `pytestmark` covers any residual.
- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
mark loosened from `strict=True` to `strict=False` + reason rewritten.
* it passes in isolation but is session-env-pollution sensitive
(leftover subactor PIDs competing for ports / inheriting harness
FDs).
* tolerate both outcomes until suite isolation improves.
- `test_shm`: extend the existing
`skipon_spawn_backend('subint', ...)` to also skip
`'subint_forkserver'`.
* Different root cause from the cancel-cascade class:
`multiprocessing.SharedMemory`'s `resource_tracker` + internals
assume fresh-process state, don't survive fork-without-exec cleanly
- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
(unrelated to forkserver; just a flaky-under-load bump).
- `tractor.spawn._subint_forkserver`: inline comment-only future-work
marker right before `_actor_child_main()` describing the planned
conditional stdout/stderr-to-`/dev/null` redirect for cases where
`--capture=sys` isn't enough (no code change — the redirect logic
itself is deferred).
EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum-invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 4c133ab)
(factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies)
Continues the hygiene pattern from de60167 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus
racing on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.
Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()` calls in suite
  where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method` fixture params
  - `delay` now reads the `debug_mode` param directly instead of
    calling `tractor.debug_mode()` (fires slightly earlier in the test
    lifecycle)
  - sanity assert `if debug_mode: assert tractor.debug_mode()` after
    nursery open
  - new print showing SIGINT target (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around `ctx.wait_for_result()` and
    conditionally `pytest.xfail` when `send_sigint_to == 'child' and
    start_method == 'subint_forkserver'`, the known orphan-SIGINT
    limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b350aa0)
(cherry picked from commit 2ca0f41)
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.
Deats,
- new `tractor/_testing/_reap.py` (+230 LOC): Linux-only reaper using
`/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
- `find_descendants(parent_pid)` for the in-session case
(PPid-direct-match while pytest is still alive).
- `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1`
reparented to init + `cwd` filter to repo root + `python` cmdline
filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
all, poll up to `grace` for exit, SIGKILL any survivors. Returns
`(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
`tractor/_testing/pytest.py` — after `yield`, runs
`find_descendants(os.getpid())` + `reap(...)` so each pytest session
leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
commit) for the pytest-died-mid-session case where the in-session
fixture didn't get to run.
Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
module-top in `pytest.py` (was inline-imported inside
`pytest_generate_tests`), and reuse it in
`pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
mark arg is a valid spawn-method literal — catches typos at collection
time.
- inline `# ?TODO` flags running these through the `try_set_backend`
checker for stronger validation.
Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit eae478f)
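The signal ladder described for `reap()` can be sketched roughly as follows. This is a minimal illustration, not the code in `tractor/_testing/_reap.py`: the `kill=` and `alive=` parameters and the `pid_alive()` helper are hypothetical additions so the ladder can be exercised without real processes.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    # Signal 0 only checks for existence; nothing is delivered.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def reap(pids, *, grace=3.0, poll=0.25, kill=os.kill, alive=pid_alive):
    # Step 1: polite SIGINT to every target that still exists.
    signalled = []
    for pid in pids:
        try:
            kill(pid, signal.SIGINT)
            signalled.append(pid)
        except ProcessLookupError:
            pass

    # Step 2: poll for exit until the grace window closes.
    deadline = time.monotonic() + grace
    survivors = [pid for pid in signalled if alive(pid)]
    while survivors and time.monotonic() < deadline:
        time.sleep(poll)
        survivors = [pid for pid in survivors if alive(pid)]

    # Step 3: SIGKILL only what outlived the grace window.
    killed = []
    for pid in survivors:
        try:
            kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass
    return signalled, killed
```

A caller can compare the two returned lists to report how many procs needed the hard kill.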
New `scripts/tractor-reap` CLI wraps the `_testing._reap` mod for
manual zombie-subactor cleanup after crashed pytest sessions.
Two modes:
- orphan-mode (default): finds PPid==1 procs with cwd matching repo
root + `python` in cmdline.
- descendant-mode (`--parent <pid>`): scoped sweep under a still-live
supervisor.
SC-polite: SIGINT with bounded grace window (default 3s) before
escalating to SIGKILL. Exit code signals whether escalation was needed
(useful for CI health-checks).
Also, document both the auto-reap fixture and the CLI in `/run-tests`
SKILL.md (section 10).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 6d76b60)
Outer `signal.alarm` cap that fires even when trio's `fail_after` is
blocked by a shielded-await deadlock (the bug-class-3 hang under MTF
backends). Only armed for fork-based spawners where the bug lives.
Deats,
- `_DIAG_CAP_S = fail_after_s + 5`, slightly larger than the
trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).
Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
TODO re `pytest.raises()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 10db117)
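For illustration, an out-of-band `signal.alarm` cap of the kind described above might look like the sketch below. The `alarm_cap()` context manager and `AlarmCapTimeout` exception are hypothetical names, not the test's actual code; it assumes a Unix host and that it runs on the main thread, since that is where Python installs signal handlers.

```python
import signal
import time
from contextlib import contextmanager


class AlarmCapTimeout(Exception):
    """Raised from the SIGALRM handler (hypothetical name)."""


@contextmanager
def alarm_cap(seconds: int):
    def _on_alarm(signum, frame):
        # Runs in the main thread regardless of what the event
        # loop is (or is not) doing at that moment.
        raise AlarmCapTimeout(f"outer cap of {seconds}s fired")

    prev_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # disarm any pending alarm
        signal.signal(signal.SIGALRM, prev_handler)


# The cap is sized to lose the race whenever the in-band guard works.
fail_after_s = 4
diag_cap_s = fail_after_s + 5
```

Because the cap is larger than the in-band timeout, it only fires when the in-band path is itself stuck.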
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
`trio.fail_after(2)` + commented `capfd.disabled()` investigation
(pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on
fork-spawner hang prevention.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8aa07a7)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 75d5b4c)
- `skipon_spawn_backend('subint')`: expand reason with specific
analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
blame-attribute UDS sock-file orphans left by SIGKILL cancel
cascades.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7d0a53d)
For forking spawner backends that is.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit b10011a)
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner` param,
wrap `main()` in `fa_main()` with per-backend `trio.fail_after` (4s
fork / 1s trio) to cap cancel-cascade teardown that compounds under
forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param for
`is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so KBI
firing before `open_context` body doesn't `UnboundLocalError`, add
`pytest.fail` guard for the spawn-time IPC race case, arm
`signal.alarm` AFK-safety cap (10s) under fork backends.
Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
`detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
place).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7ee0dc2)
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable from
both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) with a proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot archiver,
`AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM (trio
`fail_after` + auto-snapshot on `TooSlowError`), `afk_alarm_w_trace()`
sync CM (`signal.alarm` + snapshot on `SIGALRM`), plus pytest fixture
wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via
`from .trace import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace`; drops ~627 lines of inline impl. Add `acli.dump_all`
alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 7509e31)
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for `tractor._child`
procs NOT in the walk's `seen` set; surfaces ghost subactor trees from
prior test runs (cross-test launchpad contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification into
`_classify_walk()` closure, walk stray roots as additional trees, emit
stray-root summary in header. Also: `tractor._child` procs reparented
to init are now always classified as orphans regardless of
cgroup-slice (leaked subactor ≠ desktop-launched app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the real
terminal.
- `fail_after_w_trace()`: capture diag snapshot on non-`TooSlowError`
exceptions when the `fail_after` scope's cancel had already fired
(e.g. nursery wraps `Cancelled` into a `BaseExceptionGroup` that
escapes before `TooSlowError` can be raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3a243a1)
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session-scoped list
populated by `_do_capture_snapshot()` on each successful dump; add
TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode.
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that prints
all captured snapshot dirs at end-of-session so paths don't get buried
in scrollback.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit fb87c36)
`reap(include_descendants=True)` now expands each orphan-root pid into
its full psutil subtree before delivering SIGINT, so a multi-level
leaked actor-tree gets torn down in a single pass instead of requiring
repeated calls (each pass kills the current `ppid==1` level, the level
below becomes init-adopted, etc.). Falls back to the original flat
`pids` list when `psutil` is unavailable. Emits a log line when
expansion adds descendant pids.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 8de684f)
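To show why a single pass suffices once roots are expanded, here is a small sketch of the subtree expansion. It swaps `psutil` for a plain `{pid: ppid}` mapping so it stays self-contained; `expand_descendants()` is a hypothetical name and not the helper used in `_reap.py`.

```python
from collections import defaultdict


def expand_descendants(roots, ppid_of):
    """Return each root followed by its whole subtree, parents first.

    `ppid_of` maps pid -> parent pid (a stand-in for a process-table
    snapshot).
    """
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)

    ordered, seen = [], set()
    stack = list(reversed(roots))
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        ordered.append(pid)
        # Push children so they are visited right after their parent.
        stack.extend(reversed(sorted(children[pid])))
    return ordered
```

Signalling the expanded list covers every level at once, so no level is left to be re-adopted by init and picked up by a later call.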
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs that
appeared during the test: leaked subactors whose mid-tier parent died
during cascade teardown, reparenting them to init. Pre-yield snapshot
of existing orphans scopes reap to THIS test's leaks only, avoiding
reap of unrelated tractor uses (piker, etc.) on the box.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 01ce285)
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
`track_orphaned_uds_per_test`,
`detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
MTF `xfail(strict=False)` with detailed race-window comment
explaining the BEG shape mismatch, wrap body in
`fail_after_w_trace` with per-backend timeout budget, bump
`@tractor_test(timeout=10)`, drop old multiprocessing depth
special-casing
- `test_multierror_fast_nursery`: wrap in
`fail_after_w_trace(30.0)`, accept `TooSlowError` in
`pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
`spawn_backend` param for `is_forking_spawner`, widen
`fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
`test_cancel_infinite_streamer`, `test_some_cancels_all`: add
`set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
covered by module-level `pytestmark`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 32955db)
Replace inline `trio.fail_after` + manual `signal.alarm` guard with
the `_testing.trace` CM helpers that auto-capture a full
ptree/wchan/py-spy diag snapshot to disk on timeout.
Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync CM,
captures on `SIGALRM`), only armed under fork backends. Extracts
`_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes while
diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
`afk_alarm_w_trace`).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit bd07a95)
Only flag `tractor._child` procs as cross-test ghosts of THIS run if
`ppid==1` (init-adopted real leak) or `ppid` is in the walk's `seen`
set (descendant we missed via race). Previously, procs whose `ppid`
points to some OTHER live non-`pytest` process (in the use of
`acli.ptree pytest`) were being falsely flagged as cross-test ghosts,
even though they belong to a different tractor app (`piker`, another
`pytest` shell, a long-running tractor daemon).
Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`, short-circuit on
`None` (proc died in-flight).
- expand module docstring to spell out the ownership filter rule + its
rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit a6d4ac3)
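The ownership rule above reduces to a small predicate. The sketch below assumes the caller has already matched the `tractor._child` cmdline and looked up the parent pid; `is_cross_test_ghost()` is a hypothetical name used only for illustration.

```python
def is_cross_test_ghost(ppid, seen):
    """Decide whether a `tractor._child` proc belongs to THIS run.

    `ppid` is the proc's parent pid, or None if it died mid-lookup;
    `seen` is the set of pids already visited by the tree walk.
    """
    if ppid is None:
        return False  # proc vanished in-flight, nothing to flag
    # Init-adopted leak, or a descendant the walk missed via a race.
    return ppid == 1 or ppid in seen
```

Any other live parent means the proc is owned by some unrelated tractor app and is left alone.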
(cherry picked from commit f617c8c)
Per-terminal optimized `watch`-like xonsh alias that runs an arbitrary
callable alias in a loop inside the alt-screen buffer with
flicker-free repaint. Supersedes the inline `acli.ptree` polling .xsh
snippet (removed from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).
Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
(`\033[?25l/h`) wrapped in try/finally so Ctrl-C always returns to a
pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line EL (`\033[K`
before each `\n`) + post-draw erase-down (`\033[J`) → stale tail chars
from a longer prior frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next frame does a full
clear (`\033[H\033[2J`) instead of the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler` so
`KeyboardInterrupt` lands cleanly; prior handler restored on exit.
- Output capture: redirect the alias's stdout to `StringIO` per frame
so we can post-process the EL fix. Aliases writing directly to
`sys.stdout.buffer` / `os.write(1)` bypass capture; EL-fix won't apply
but loop still works.
- Alias unwrap: xonsh stores callables as either a bare callable OR
`[fn, *preset_args]`. Both forms handled; subprocess-style aliases
rejected w/ a friendly err msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest of argv
forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch` entry.
- Top-level imports: `os`, `sys`, `signal`, `time`, `typing.Callable`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit bb239e8)
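The per-frame repaint described above can be sketched as a pure string transform. `render_frame()` is a hypothetical helper, not the alias's actual code; it only shows how the escape sequences named in the commit compose.

```python
CURSOR_HOME = "\033[H"       # move cursor to the top-left
ERASE_LINE = "\033[K"        # EL: erase from cursor to end of line
ERASE_DOWN = "\033[J"        # erase from cursor to end of screen
FULL_CLEAR = "\033[H\033[2J"  # home + clear everything


def render_frame(text: str, *, resized: bool = False) -> str:
    # After a resize the cheap path can leave garbage behind, so pay
    # for one full clear; otherwise just overwrite in place.
    head = FULL_CLEAR if resized else CURSOR_HOME
    # EL before every newline wipes the tail of a longer prior line.
    body = "".join(line + ERASE_LINE + "\n" for line in text.splitlines())
    # Erase-down removes leftover rows from a taller prior frame.
    return head + body + ERASE_DOWN
```

Writing the returned string in a single `sys.stdout.write()` call keeps the repaint flicker-free since the screen is never blanked between frames.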
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone tests so
on-timeout we get a fresh `ptree`/`wchan`/`py-spy` diag snapshot on
disk instead of opaque pytest timeout-kills. Same shape as bd07a95 for
`test_dynamic_pub_sub`.
Deats,
- `test_echoserver_detailed_mechanics`:
  * inner `trio.fail_after` → `fail_after_w_trace`. Adds
  `fail_after_w_trace: FailAfterWTraceFactory` fixture param.
  * mv per-backend `timeout` calc to top of test body (was interleaved
  w/ helper defs).
  * factor deep `open_nursery`/`open_context`/`open_stream` body into
  `_body()` so the wrapping `main()` stays a 2-liner; keeps the
  nested-CM block at its natural indent level instead of pushing it
  under yet another `async with`.
  * drop `with_timeout: bool` knob + `fa_main()` helper (knob was
  hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
  * outer `signal.alarm`/`try`/`finally` → single
  `afk_alarm_w_trace(10)` CM. Adds
  `afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture param.
  * drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
  * explanatory comment refreshed to mention `AFKAlarmTimeout` + the
  disk-snapshot side effect.
Other,
- Drop debug `return 1e3` short-circuit from `delay()` fixture; snuck
in as a scratch line, was clobbering the proper `debug_mode`-branched
return.
- Top-level import: `FailAfterWTraceFactory`, `AfkAlarmWTraceFactory`
from `tractor._testing.trace`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 1cafaec)
SIGUSR1 task-tree dumps via `stackscope` should work in plain (non-pdb)
runs too, esp. in infected-`asyncio` processes where the
kernel-default SIGUSR1 disposition is `Term` (proc dies on
`kill -USR1` w/o an installed handler). Ungate the install path from
`_debug_mode` in both root and sub-actor init; the `use_stackscope`
rt-var + `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as the
actual opt-in (e.g. via `--enable-stackscope`).
Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...` conjunction
around the `enable_stack_on_sig()` call; now gated only on the
`enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
`use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out of the
`if rvs['_debug_mode']:` block to peer-level. The `use_greenback`
branch stays inside `_debug_mode` (pdb-specific).
- Refresh inline comments on both sites to call out the
infected-`asyncio` "default SIGUSR1 = terminate proc" rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 3d9c75b)
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
`trio_err.__cause__` back onto its own derivative relay.
`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
`aio_task.set_exception()` which on py3.13+ ALWAYS raises
`RuntimeError("Task does not support set_exception")` (dead code as
a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
flip `wait_on_aio_task` when delivery failed (avoids hanging on
`_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
checkpoint between capturing `_fut_waiter` and invoking the helper.
`Task._wakeup` clears `_fut_waiter = None` so re-reading
post-checkpoint loses the ref even though the exc is still in-flight
on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
a "trio_err -> caused -> relay" chain through `set_exception()`
→ `Task._wakeup` → coro raise → `wait_on_coro_final_result`
→ `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
narrow case where `_fut_waiter is None` AND task is runnable (sitting
in asyncio's ready queue, not parked on a poke-able future). NEVER
cancels when `_fut_waiter` carries an in-flight exc — that would race
+ mask the real terminating exc.
`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
/ `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
`trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
— tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
exactly one terminal `raise X [from Y]` or early `return`. Covers the
sibling `signal_trio_when_done()` resolution + the relay-echo
INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
(`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
trio_err`, raise the bare `trio_err` instead of `trio_err from
aio_err` (which would CYCLE the cause chain since the relay was itself
caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
EXCEPT or this WILL HANG" warning — the helper handles the failure
path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
shows the real root.
`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
+ `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
`cause=` — non-`None` only when the `try` body raised (graceful exit
→ None).
Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
trio_done_fute` next to the shielded version — exploratory note for
the longstanding "why is `.shield()` needed?" TODO.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit acd1cbe)
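The delivery mechanism can be demonstrated with plain `asyncio`. The sketch below is a much-reduced stand-in for `maybe_signal_aio_task()`: `maybe_signal_task()` is a hypothetical name, it omits the `pre_captured_fut=` and `allow_cancel_fallback=` options, and it relies on `Task._fut_waiter`, a private CPython attribute.

```python
import asyncio


def maybe_signal_task(task, exc, *, cause=None):
    """Poke `exc` into the future a parked task is waiting on."""
    fut = getattr(task, "_fut_waiter", None)  # private CPython detail
    if fut is None or fut.done():
        # Nothing to poke: the task is runnable or already woken.
        return False, "no pending future to poke"
    if cause is not None:
        exc.__cause__ = cause  # keep the real root in the traceback
    fut.set_exception(exc)
    return True, "delivered via _fut_waiter"


async def main():
    loop = asyncio.get_running_loop()

    async def parked():
        await loop.create_future()  # never resolved by anyone else

    task = asyncio.create_task(parked())
    await asyncio.sleep(0)  # let the task start and park

    root = ValueError("root cause")
    delivered, _report = maybe_signal_task(
        task, RuntimeError("relay"), cause=root
    )
    try:
        await task
    except RuntimeError as err:
        # The relay surfaces in the task with its cause chain intact.
        return delivered, err.__cause__ is root
```

A `False` first return value is the cue for the caller to stop waiting on the task instead of hanging on a wakeup that will never come.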
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:
Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
fixture, and `test_log` fixture out of the test-suite-
local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
`--ll`: `--ll` is the consuming-project's OWN app
loglevel (scoped to its pkg-hierarchy), `--tl` is the
`tractor.*` runtime loglevel. `--tl` falls back to
`--ll` when unset (preserves current `tractor`-suite
behavior).
- add `testing_pkg_name` session fixture (default
`'tractor'`) — downstream projects override to e.g.
`'modden'` so `--ll` scopes to their own hierarchy
instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
tractor-runtime level (passed to
`open_root_actor(loglevel=<.>)` by `@tractor_test`)
AND separately applies `--ll` to the
`testing_pkg_name` hierarchy when that isn't
`tractor`. `test_log` scopes the per-test logger to
`testing_pkg_name`.
`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
- `True` → enable `pkg_name` root at `default_level`
(fallback `'cancel'`).
- `False` → no-op.
- bare level eg. `'info'` → root-logger at that level.
- `'sub:info,x:cancel'` → per-sub-logger filter-spec;
each `<name>` is RELATIVE to `pkg_name` (must NOT
include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
`None` key = root-logger. Mixed bare-level + filters
in one spec is rejected w/ a helpful err msg; so is
embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
parses then enables a `colorlog` stderr handler per
named (sub)logger. Authoritative sub-logger filters
get `propagate=False` so they don't double-emit
through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
sub-pkg granularity, not leaf-module — so `devx.debug`
collapses to the same `tractor.devx` logger as a bare
`devx`, and top-level lib modules (eg.
`tractor.to_asyncio`) emit under the *root* logger
rather than a phantom `to_asyncio` child. Documented
inline on `LogSpec`.
Other,
- `tests/conftest.py` keeps a NOTE pointing to the
plugin for future-debugging clarity (don't remove
silently — the lift is the relevant signal).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 19a7770)
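As a rough illustration of the accepted spec forms, a parser along these lines would produce the described `{sublog|None: level}` mapping. This is a sketch written from the bullet list above, not the `tractor.log` implementation; in particular the exact check used to reject an embedded `pkg_name` token is an assumption.

```python
def parse_logspec(spec, *, pkg_name="tractor", default_level="cancel"):
    """Map a logging-spec to `{sub_logger_name | None: level}`.

    A `None` key stands for the package's root logger.
    """
    if spec is False:
        return {}
    if spec is True:
        return {None: default_level}

    parts = [p.strip() for p in spec.split(",") if p.strip()]
    bare = [p for p in parts if ":" not in p]
    filters = [p for p in parts if ":" in p]
    if bare and filters:
        raise ValueError(
            f"cannot mix a bare level with sub-logger filters: {spec!r}"
        )
    if bare:
        if len(bare) > 1:
            raise ValueError(f"only one bare level allowed: {spec!r}")
        return {None: bare[0]}

    out = {}
    for part in filters:
        name, _, level = part.partition(":")
        name, level = name.strip(), level.strip()
        if not name or not level:
            raise ValueError(f"malformed filter {part!r}")
        # Sub-names are relative to the package (assumed check).
        if name == pkg_name or name.startswith(pkg_name + "."):
            raise ValueError(
                f"drop the {pkg_name!r} prefix from {name!r}"
            )
        out[name] = level
    return out
```

For example, `parse_logspec('sub:info,x:cancel')` yields one entry per named sub-logger, while a bare `'info'` targets only the root.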
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.
- detect a true leaf-mod by comparing the caller's `__name__`
vs `__package__` (a pkg `__init__` has them equal -> its
trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
from a bare `devx` -> `tractor.devx`; the old unconditional
`pkg_path = subpkg_path` collapsed both to `tractor.devx` and
silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
the leaf-mod is in the `{filename}` header field).
Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
addressable at ANY depth; leaf *modules* intentionally aren't
(they're the `{filename}`); top-level mods (eg. `to_asyncio`)
still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
new literal explicit-`name` contract (no leaf-collapse).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit 9c36363)
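The `__name__` vs `__package__` comparison can be shown in isolation. `logger_name_for()` below is a hypothetical helper covering only the `get_logger(__name__)` path and ignoring the explicit `name=` contract; the dotted names in the usage note are illustrative module paths.

```python
def logger_name_for(mod_name: str, mod_package: str) -> str:
    """Pick the logger name for a caller module.

    `mod_name` / `mod_package` are the caller's `__name__` and
    `__package__`.
    """
    is_leaf_module = mod_name != mod_package
    if is_leaf_module:
        # The leaf already shows up in the `{filename}` header field,
        # so log under the enclosing (sub-)package instead.
        return mod_name.rpartition(".")[0] or mod_name
    # A package `__init__`: its trailing token is a real sub-package.
    return mod_name
```

Under this rule a module nested in `tractor.devx.debug` logs as `tractor.devx.debug`, while a top-level module such as `tractor.to_asyncio` resolves to the root `tractor` logger.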
trio 0.29 → 0.33 lock bump (c7741bb) slowed the depth=3 cancel-cascade
in `test_nested_multierrors` from <6s to ~7-8s; the 6s deadline was
firing and its `Cancelled(source='deadline')` (trio 0.33's new
cancel-reason metadata) collapsed a BEG branch, breaking the
`RemoteActorError` assertion downstream.
- Split the `('trio', _)` case-match into per-depth arms: `('trio', 1)`
keeps 6s (still finishes in ~3s); `('trio', 3)` → 12s.
- Updated inline NOTE explains the version pivot + links the tracking
issue `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
- Existing MTF/`subint_forkserver` budgets unchanged.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit ea67f1b)
(cherry picked from commit 57b3ea5)
Same trio 0.29 → 0.33 cancel-cascade slowdown that hit
`test_nested_multierrors` (ea67f1b); bumps the `trio`-backend
(non-debug, non-forking) budget in `test_echoserver_detailed_mechanics`
from 1s → 4s.
- The 1s budget raced the ~1s teardown deadline. On a deadline-fire
trio 0.33 injects `Cancelled(source='deadline')` (cancel-reason
metadata) that wraps the mid-stream KBI in a `BaseExceptionGroup`,
breaking the bare `pytest.raises(KeyboardInterrupt)` below.
- Bump matches the forking-spawner branch (4s).
- Inline NOTE references the tracking issue
`ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(cherry picked from commit d7da502)
(cherry picked from commit d0144e5)
`open_root_actor()` writes `_enable_tpts` (and friends) into the
process-global `_state._runtime_vars` dict but nothing resets it on
actor teardown. Under the in-proc `pytest` launchpad a uds-using test
leaks `_enable_tpts=['uds']` into a sibling tcp test, tripping the
`registry_addrs`×`enable_transports` proto-guard in
`open_root_actor()` with a `ValueError`.
New `_reset_runtime_vars` fixture snapshots + restores the dict around
every test so no runtime-var state crosses a test boundary.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Regenerate the lockfile so it's consistent with the post-rebase
`pyproject.toml`, which now carries both #461's landed tooling
(`pytest>=9.0.3`, …) and this branch's tractor deps (`setproctitle`,
`pytest-timeout`, `psutil`),
- `uv lock` resolves the merged dep set against the landed `main`
baseline.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
5f52284 to
41b5371
Compare
Left-over debug trap from the `_runtime_vars` pure get/set refactor: it
fired on *every* struct-form rt-var write (e.g. via `.update()`),
hanging any non-tty / CI / forked actor on `pdb` stdin. Surfaced by a
`/code-review high` pass on #462.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two fixes to the hang-debug SIGUSR1 task-tree dump path, surfaced by
`/code-review high` on #462,
- re-add `_debug_mode` to the sub-actor handler-install gate in
`_runtime.py`. Dropping it (rel. `3a386ba5`/`3d9c75b6` "Drop debug_mode
gate", from the `custom_log_levels_api` follow-up) was meant to *also*
enable non-pdb runs, but nothing sets `use_stackscope` from
`debug_mode`, so debug-mode subs were left with NO handler, and the
default SIGUSR1 disposition then *kills* them. Now additive:
`_debug_mode OR use_stackscope OR env`.
- pass `write_file=True` at both `dump_task_tree()` SIGUSR1 call sites
so the advertised `/tmp/tractor-stackscope-<pid>` `.log` tee is
actually written (was dead under `--capture=fd`). Matches `1b1ef10a`
"Re-enable writing `stackscope` to file by default"; param from
`0df90500`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two defensive fixes around the `Portal.cancel_actor()` +
`_try_cancel_then_kill()` escalation from `34f333a0` "Escalate
cancel-ack timeouts to `proc.terminate()`" (the
`trionics.start_or_cancel` follow-up); surfaced by `/code-review high`
on #462,
- guard `proc.terminate()` for backends whose `proc` slot isn't a
`Process`: the future `subint` backend stores an `int` interp-id, so
escalation would `AttributeError` instead of hard-killing; now it logs
+ no-ops.
- swap `assert cs.cancelled_caught` for an
`if cs.cancelled_caught and raise_on_timeout:` guard so an unexpected
shielded-scope exit returns a soft `False` rather than crashing
`cancel_actor()` mid-teardown.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `TRACTOR_LOGLEVEL`/`TRACTOR_SPAWN_METHOD` override-notice branches
were unreachable: `loglevel`/`start_method` were reassigned to the env
value BEFORE the `!=` compare, so the "OVERRIDES caller-passed" message
never fired. Capture the caller value first, then compare.
Rel. `208e7c09`/`d4eac06d` "Honor env-vars"
(`trionics.start_or_cancel`); surfaced by `/code-review high` on #462.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
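The ordering bug is easiest to see in a small sketch. `resolve_with_env()` is a hypothetical helper, not the code in `open_root_actor()`; it shows the fixed order, where the caller's value is still available when the comparison runs.

```python
import os


def resolve_with_env(caller_value, env_var, environ=None):
    """Return `(effective_value, notice_or_None)`."""
    environ = os.environ if environ is None else environ
    env_value = environ.get(env_var)
    if env_value is None:
        return caller_value, None

    # Compare BEFORE overwriting: once the caller value has been
    # reassigned to the env value, `!=` can never be true.
    notice = None
    if caller_value is not None and env_value != caller_value:
        notice = (
            f"{env_var}={env_value!r} OVERRIDES "
            f"caller-passed {caller_value!r}"
        )
    return env_value, notice
```

Passing `environ=` explicitly keeps the helper testable without touching the real process environment.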
Harden
tractorruntime + test-infra (subint era)Motivation
This is the
tractor-library counterpart to #461. While building themain_thread_forkserver(MTF) / sub-interpreter (subint) spawnbackend, a large body of runtime + test-infra hardening accreted that
— unlike #461's pure dev-tooling slice — does touch
tractor/library code. It's all the load-bearing groundwork the new backend
needs (py3.14 gating, orphan-reaping, capture/diag, runtime-var
plumbing, cancel-cascade timeouts) plus a pile of flake fixes that
fell out along the way.
This PR lands all of that except the actual backend impl, which
comes last (it's blocked on
msgspec/PEP-684 sub-interpretersupport). It's stacked on #461 — the ~19 shared dev-tooling commits
in the current diff auto-shed once this rebases onto a merged #461
(see TODOs).
Summary of changes
- `main_thread_forkserver` backend surface: open the py-version range + harness gate for py3.14 backends ('Trying out sub-interpreters (subints), maybe `fork()` can be hacked now?' #379), treat py3.14+ incompats as test-skips, add a `skipon_spawn_backend` pytest marker + mark the known subint-hanging tests, enable `debug_mode` for the forkserver backend, and extract `_actor_child_main()` as a shared child entrypoint. Sweep the `subint_forkserver` → `main_thread_forkserver` naming across code.
- `tractor._testing._reap` + an auto-reap fixture, a `tractor-reap` CLI (`scripts/`), an opt-in `reap_subactors_per_test` fixture, UDS + `--shm` orphan sweeps, a per-test runaway-subactor CPU detector, and `tractor.spawn._reap.unlink_uds_bind_addrs()`.
- `tractor._testing.trace` capture-snapshot impl, fork-aware capture fixtures in `_testing.pytest`, `pytest_load_initial_conftests()` for honoring `--capture=`, a CI-aware `--capture` guard (warn locally), and a `maybe_override_capture` fix for invalid `capX` fixture names.
- `tractor_diag`(nosis) xontrib: diagnostics aliases (`pytree`, `bindspace_scan`, `acli.reap`), a `--tree` flag + cross-bucket parent annotations, cgroup/ppid-aware process-tree liveness buckets, and an `_is_tractor_subactor()` helper.
- `_runtime_vars` into a pure get/set API and lift config to env-vars: honor `TRACTOR_LOGLEVEL`/`TRACTOR_SPAWN_METHOD`, allow per-call `start_method`/`loglevel` overrides, add an `enable_transports`/`registry_addrs` proto guard, and a `use_stackscope` runtime var.
- `ActorTooSlowError` for cancel-cascade timeouts, escalate cancel-ack timeouts to `proc.terminate()`, make `fail_after`/expect-raises timeouts backend-aware, break the parent-chan shield + bound the peer-clear wait in `async_main`'s teardown, and add a dup-name cancel-cascade escalation test.
- `stackscope` SIGUSR1 tree-dumps: an `--enable-stackscope` plugin flag, route SIGUSR1 onto the trio loop, fix dump ordering, and a `use_stackscope` subactor-init var.
- `greenback`/`.to_asyncio` sync-pause: import-or-skip `greenback`, move it to a `sync_pause` dev group, rework `to_asyncio`, and update the `sync_bp`/`asyncio_bp` examples while hardening `test_debugger`/`test_pause_from_sync`/`test_shield_pause`.
- `SharedMemory` under the forkserver backend (+ document the incompat) and add a matching `--shm` orphan sweep to reap.
- `daemon`/`test_multi_program` under `tests/discovery/`, replace the `daemon`-fixture sleep with an active poll, fix `_testing.addr.get_rando_addr` cross-process collisions, widen the `test_register_duplicate_name` boot-race `xfail` to `n_dups=8`, fix a UDS unlink-race shutdown deadlock, drop global `_PROC_SPAWN_WAIT` mutation, and harden `test_registrar` with reap fixtures + timeouts.
- `tractor.trionics.patches` subpkg (+ first upstream-trio patch) and per-actor `setproctitle` via `devx._proctitle`.
- `tractor.log` with custom log-level API additions (+ `test_log_sys`) and a `logspec` leaf-module Route-B granularity plan.
- `ai/conc-anal` writeups for the trio-0.33 cancel-cascade depth-3 slowdown, the `test_register_duplicate_name` boot-race, and the `RuntimeVars` env-var-lift design.
TODOs before landing
- Rebase onto `main` after #461 merges: the ~19 shared dev-tooling commits auto-shed (patch-identical), but ~13 replayed commits touch `pyproject.toml`/`uv.lock`/`ci.yml` and will conflict with the final versions from #461 ('Tooling skills n config from mtf dev'). Prefer main's tooling state, then run `uv lock` to regenerate the lockfile.
Future follow up
- Land the actual `main_thread_forkserver`/`subint` backend: this PR is everything except the backend impl; it's blocked on `msgspec`/PEP-684 sub-interpreter support and lands last.
- Run the `test_debugger` suite on the forkserver spawner: a `# TODO` was added noting the suite isn't yet exercised under the forkserver backend.
- Implement the `RuntimeVars` env-var lift: `.claude/notes/rt_vars_lift_plan.md` is design-only so far.
Links
- 'Trying out sub-interpreters (subints), maybe `fork()` can be hacked now?' #379: py-version range + harness gate for py3.14 backends.
- `claude-code` helpers tracking in 'claude-code helpers follow up' #417.
(this pr content was generated in some part by `claude-code`)
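
The summary mentions escalating cancel-ack timeouts to `proc.terminate()`. As a rough illustration of that general pattern only, here is a stdlib-only sketch. It is not tractor's implementation (which is trio-based and cancels actors over IPC); `cancel_with_escalation`, `request_cancel`, the timeout values, and the final `proc.kill()` step are all hypothetical choices made for this example.

```python
import subprocess
import sys


def cancel_with_escalation(
    proc: subprocess.Popen,
    request_cancel,             # hypothetical hook: sends the graceful cancel request
    ack_timeout: float = 0.5,   # illustrative values, not tractor defaults
    term_timeout: float = 2.0,
) -> str:
    """Cancel `proc`, escalating when it does not exit in time."""
    # Step 1: ask nicely, then wait a bounded time for the child to exit.
    request_cancel(proc)
    try:
        proc.wait(timeout=ack_timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        pass

    # Step 2: no timely exit, so escalate to terminate() (SIGTERM on POSIX).
    proc.terminate()
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass

    # Step 3 (assumed last resort): hard-kill the child and reap it.
    proc.kill()
    proc.wait()
    return "killed"


if __name__ == "__main__":
    # A child that never reacts to the (no-op) cancel request.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print(cancel_with_escalation(child, request_cancel=lambda p: None))
```

Every wait is bounded, so a slow or wedged child cannot stall the caller's teardown indefinitely, and the return value records how far the escalation had to go.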