
Subint era tooling #462

Open
goodboy wants to merge 107 commits into main from subint_era_tooling

@goodboy goodboy commented Jun 17, 2026

Owner

Harden tractor runtime + test-infra (subint era)

Motivation

This is the tractor-library counterpart to #461. While building the
main_thread_forkserver (MTF) / sub-interpreter (subint) spawn
backend, a large body of runtime + test-infra hardening accreted that
— unlike #461's pure dev-tooling slice — does touch tractor/
library code. It's all the load-bearing groundwork the new backend
needs (py3.14 gating, orphan-reaping, capture/diag, runtime-var
plumbing, cancel-cascade timeouts) plus a pile of flake fixes that
fell out along the way.

This PR lands all of that except the actual backend impl, which
comes last (it's blocked on msgspec/PEP-684 sub-interpreter
support). It's stacked on #461 — the ~19 shared dev-tooling commits
in the current diff auto-shed once this rebases onto a merged #461
(see TODOs).

Summary of changes

  • Gate the py3.14 + main_thread_forkserver backend surface: open
    the py-version range + harness gate for py3.14 backends (#379, "Trying out sub-interpreters (subints), maybe fork() can be hacked now?"),
    treat py3.14+ incompats as test-skips, add a skipon_spawn_backend
    pytest marker + mark the known subint-hanging tests, enable
    debug_mode for the forkserver backend, and extract
    _actor_child_main() as a shared child entrypoint. Sweep the
    subint_forkserver → main_thread_forkserver naming across code.
  • Add orphan-reaping test infra: tractor._testing._reap + an
    auto-reap fixture, a tractor-reap CLI (scripts/), an opt-in
    reap_subactors_per_test fixture, UDS + --shm orphan sweeps, a
    per-test runaway-subactor CPU detector, and
    tractor.spawn._reap.unlink_uds_bind_addrs().
  • Build fork-aware capture machinery: a big tractor._testing.trace
    capture-snapshot impl, fork-aware capture fixtures in
    _testing.pytest, pytest_load_initial_conftests() for honoring
    --capture=, a CI-aware --capture guard (warn locally), and a
    maybe_override_capture fix for invalid capX fixture names.
  • Add the tractor_diag(nosis) xontrib: diagnostics aliases
    (pytree, bindspace_scan, acli.reap), a --tree flag +
    cross-bucket parent annotations, cgroup/ppid-aware process-tree
    liveness buckets, and an _is_tractor_subactor() helper.
  • Refactor _runtime_vars into a pure get/set API and lift config to
    env-vars: honor TRACTOR_LOGLEVEL/TRACTOR_SPAWN_METHOD, allow
    per-call start_method/loglevel overrides, add an
    enable_transports/registry_addrs proto guard, and a
    use_stackscope runtime var.
  • Harden trio-0.33 cancel-cascade behavior: add ActorTooSlowError
    for cancel-cascade timeouts, escalate cancel-ack timeouts to
    proc.terminate(), make fail_after/expect-raises timeouts
    backend-aware, break the parent-chan shield + bound the peer-clear
    wait in async_main's teardown, and add a dup-name cancel-cascade
    escalation test.
  • Wire stackscope SIGUSR1 tree-dumps: an --enable-stackscope
    plugin flag, route SIGUSR1 onto the trio loop, fix dump ordering,
    and a use_stackscope subactor-init var.
  • Formalize greenback/.to_asyncio sync-pause: import-or-skip
    greenback, move it to a sync_pause dev group, rework
    to_asyncio, and update the sync_bp/asyncio_bp examples while
    hardening
    test_debugger/test_pause_from_sync/test_shield_pause.
  • Fix SharedMemory under the forkserver backend (+ document the
    incompat) and add a matching --shm orphan sweep to reap.
  • Reorg discovery tests + daemon readiness: move
    daemon/test_multi_program under tests/discovery/, replace the
    daemon-fixture sleep with an active poll, fix
    _testing.addr.get_rando_addr cross-process collisions, widen the
    test_register_duplicate_name boot-race xfail to n_dups=8, fix
    a UDS unlink-race shutdown deadlock, drop global _PROC_SPAWN_WAIT
    mutation, and harden test_registrar with reap fixtures +
    timeouts.
  • Add a tractor.trionics.patches subpkg (+ first upstream-trio
    patch) and per-actor setproctitle via devx._proctitle.
  • Extend tractor.log with custom log-level API additions (+
    test_log_sys) and a logspec leaf-module Route-B granularity
    plan.
  • Land ai/conc-anal writeups for the trio-0.33 cancel-cascade
    depth-3 slowdown, the test_register_duplicate_name boot-race, and
    the RuntimeVars env-var-lift design.

TODOs before landing

Future follow up

  • Land the actual main_thread_forkserver/subint backend: this
    PR is everything except the backend impl; it's blocked on
    msgspec/PEP-684 sub-interpreter support and lands last.

  • Run the test_debugger suite on the forkserver spawner: a
    # TODO was added noting the suite isn't yet exercised under the
    forkserver backend.

  • Implement the RuntimeVars env-var lift:
    .claude/notes/rt_vars_lift_plan.md is design-only so far.

Links

(this pr content was generated in some part by claude-code)

Copilot AI review requested due to automatic review settings June 17, 2026 15:53
@goodboy goodboy added the testing, devx-tooling, and integration labels Jun 17, 2026

Copilot AI left a comment


Pull request overview

This PR adds “subint-era” hardening and diagnostics across tractor’s spawn/cancel/IPC teardown paths, plus a small framework for Trio monkey-patches and associated regression tests. It also bumps the supported Python range and expands dev/test tooling to better debug hangs and cleanup leaked resources (UDS sock-files, SHM segments, zombie subactors).

Changes:

  • Introduces tractor.trionics.patches (catalog + contract) and a first defensive Trio patch for WakeupSocketpair.drain() EOF busy-loop, with regression tests.
  • Improves teardown robustness: cancel escalation (ActorTooSlowError), bounded peer waits, parent-channel shield break, per-endpoint close isolation, and post-SIGKILL UDS sock cleanup.
  • Adds developer tooling: per-actor proc titles, hang dumps/resource deltas helpers, a tractor-reap cleanup script, and logging-spec parsing/apply support.

Reviewed changes

Copilot reviewed 78 out of 81 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `tractor/trionics/patches/README.md` | Documents patch scope + per-module API/contract. |
| `tractor/trionics/patches/__init__.py` | Registers patch modules and provides apply_all(). |
| `tractor/trionics/patches/_wakeup_socketpair.py` | Monkey-patch for Trio wakeup socketpair EOF busy-loop + repro. |
| `tractor/_child.py` | Applies patches early in child boot + sets per-actor proctitle. |
| `tractor/devx/_proctitle.py` | Optional setproctitle integration for actor processes. |
| `tests/trionics/test_patches.py` | Regression tests for Trio patches + apply_all() idempotency. |
| `tractor/spawn/_trio.py` | Passes bind addr/subactor context into hard_kill() for cleanup. |
| `tractor/spawn/_spawn.py` | Adds post-mortem UDS sock unlinking hook to hard_kill(). |
| `tractor/spawn/_reap.py` | New parent-side “reap” helpers (UDS sock cleanup). |
| `tractor/runtime/_portal.py` | Adds raise_on_timeout option + ActorTooSlowError on cancel timeout. |
| `tractor/runtime/_supervise.py` | Cancel-then-escalate helper used by ActorNursery.cancel(). |
| `tractor/_exceptions.py` | Adds ActorTooSlowError for cancel-ack timeout escalation. |
| `tractor/runtime/_runtime.py` | Teardown deadlock guards (shield-break, bounded peer waits, stackscope enable). |
| `tractor/runtime/_state.py` | Adds runtime-vars defaults + set_runtime_vars() API. |
| `tractor/_root.py` | Env-var overrides (loglevel/spawn method) + transport/registry mismatch fail-fast. |
| `tractor/ipc/_server.py` | Ensures endpoint-close failures don’t deadlock shutdown signaling. |
| `tractor/ipc/_uds.py` | Makes UDS unlink tolerant to concurrent unlink races. |
| `tractor/ipc/_shm.py` | Makes SHM unlink best-effort and tolerant to prior deallocation. |
| `tractor/ipc/_mp_bs.py` | Simplifies resource_tracker disabling + always uses track=False. |
| `tractor/ipc/_linux.py` | Adds import-time ImportError note for cffi support expectations. |
| `tractor/log.py` | Fixes logger-name collapsing edge cases + adds logspec parse/apply. |
| `tractor/devx/_debug_hangs.py` | Adds hang dump + resource delta tracking utilities. |
| `tractor/devx/__init__.py` | Exposes new devx hang helpers. |
| `tractor/devx/debug/_tty_lock.py` | Adjusts formatting of lock repr/log output. |
| `tractor/_testing/addr.py` | Improves per-process random TCP port selection to reduce collisions. |
| `tests/test_spawning.py` | Extends forkserver capture skip rationale. |
| `tests/test_shm.py` | Skips SHM tests on subint backend with rationale. |
| `tests/test_ringbuf.py` | Adds cffi import-or-skip note for py3.14 constraints; keeps test skipped. |
| `tests/test_pubsub.py` | Skips pubsub on subint pending hang issues. |
| `tests/test_multi_program.py` | Removes old multi-program tests (moved under discovery). |
| `tests/discovery/test_multi_program.py` | New home for multi-program discovery tests + cancellation regression. |
| `tests/discovery/conftest.py` | Adds daemon fixture + active readiness polling (replaces blind sleeps). |
| `tests/discovery/test_registrar.py` | Adjusts registrar tests; adds hang dump guard; fixture usage refinements. |
| `tests/test_log_sys.py` | Updates expectations for explicit logger naming behavior. |
| `tests/test_local.py` | Updates no-runtime test to use trio.run() and NoRuntime. |
| `tests/test_legacy_one_way_streaming.py` | Backend-aware timeouts + cancellation semantics tweaks. |
| `tests/test_inter_peer_cancellation.py` | Skips known subint hang classes + adds per-test UDS orphan tracking. |
| `tests/test_context_stream_semantics.py` | Makes tests pass registry_addrs; adjusts timeouts for fork-based spawners. |
| `tests/test_clustering.py` | Adjusts fail_after budget for fork-based spawners. |
| `tests/test_advanced_streaming.py` | Adds internal trio-based timeouts + richer hang diagnostics and cancellation matching. |
| `tests/msg/test_pldrx_limiting.py` | Backend-aware timeout default selection for msg validation tests. |
| `tests/msg/test_ext_types_msgspec.py` | Adds fork-aware capture fixture usage + wraps in fail_after. |
| `tests/ipc/test_multi_tpt.py` | Asserts new enable_transports vs registry_addrs mismatch fail-fast behavior. |
| `tests/devx/conftest.py` | Extends pexpect harness to drive spawn backend/loglevel via env vars. |
| `tests/devx/test_tooling.py` | Updates expectations and harness typing; adjusts behavior under fork spawners. |
| `tests/devx/test_proctitle.py` | Adds tests for proctitle + intrinsic subactor detection (Linux). |
| `tests/devx/test_pause_from_non_trio.py` | Adds greenback gating for sync pause tests; adjusts expected output. |
| `tests/devx/test_debugger.py` | Makes debugger tests more fork-aware; adjusts loglevels/pattern matching. |
| `tests/conftest.py` | Moves common logging/daemon fixtures into plugin/sub-conftest; updates notes. |
| `scripts/tractor-reap` | Adds CLI utility to reap zombie subactors + sweep shm/UDS leaks. |
| `pyproject.toml` | Bumps Python range; adds deps/groups (setproctitle, pytest-timeout, psutil, etc.). |
| `.github/workflows/ci.yml` | Adjusts pytest invocation formatting + explicit capture mode. |
| `.gitignore` | Updates ignore patterns (AI/worktree/tooling related). |
| `examples/debugging/sync_bp.py` | Adds env-driven prompt-color disabling + loglevel adjustments for harness. |
| `examples/debugging/subactor_error.py` | Minor debugger example cleanup. |
| `examples/debugging/subactor_bp_in_ctx.py` | Notes future env-var parametrization for transport selection. |
| `examples/debugging/shield_hang_in_sub.py` | Updates actor nursery child lookup to use aid.uid. |
| `examples/debugging/root_timeout_while_child_crashed.py` | Cleans docstring/commentary and prints child aid. |
| `examples/debugging/root_cancelled_but_child_is_in_tty_lock.py` | Reorders/annotates debug loglevel/transport args. |
| `examples/debugging/multi_nested_subactors_error_up_through_nurseries.py` | Sets loglevel for test pattern matching. |
| `examples/debugging/multi_daemon_subactors.py` | Minor nursery variable rename/cleanup. |
| `ai/tooling-todos/logspec_leaf_module_granularity_route_b.md` | Adds design notes for future logger granularity work. |
| `ai/conc-anal/trio_wakeup_socketpair_busy_loop_under_fork_issue.md` | Adds concurrency analysis + repro for Trio busy-loop issue. |
| `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md` | Adds analysis note for Trio version cascade slowdown. |
| `ai/conc-anal/test_register_duplicate_name_daemon_connect_race_issue.md` | Adds analysis note for daemon readiness race and mitigation. |
| `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md` | Adds analysis note for duplicate-name boot death race. |
| `ai/conc-anal/fork_thread_semantics_execution_vs_memory.md` | Adds reference doc on fork semantics in multi-threaded procs. |
| `.claude/skills/conc-anal/SKILL.md` | Updates conc-anal guidance (hang patterns, teardown waits). |
| `.claude/settings.local.json` | Updates local agent permissions/config. |
| `.claude/notes/rt_vars_lift_plan.md` | Adds draft plan for runtime-vars env var lift. |


Comment on lines +317 to +321
cancel_timeout: float = (
    timeout
    or
    self.cancel_timeout
)
Comment on lines +339 to 343
raise ActorTooSlowError(
    f'Peer {peer_id} did not ack `Actor.cancel()`'
    f'-RPC within bounded wait of '
    f'{cancel_timeout!r}s'
)
Comment thread tractor/ipc/_linux.py
Comment on lines +24 to +32
try:
    import cffi
except ImportError as ie:
    if sys.version_info < (3, 14):
        ie.add_note(
            f'The `cffi` pkg has no 3.14 support yet.\n'
        )

    raise ie
Comment on lines +12 to +17
# disable `pbdp` prompt colors
# for prompt matching in test.
def disable_pdbp_color():
    if os.environ['PYTHON_COLORS'] == '0':
        from tractor.devx.debug import _repl
        _repl.TractorConfig.use_pygments = False
# itself relies on... but `repro()` runs OUTSIDE
# a trio.run, so it's plain stdlib semantics here
# — alarm WILL fire during `recv` syscall).
signal.alarm(2)
Comment on lines +57 to +62
# Apply the patch.
applied: bool = wsp.apply()
# First call MUST return True; idempotent guard
# prevents False on subsequent calls within the
# same process.
assert applied is True or applied is False # idempotent
@goodboy goodboy force-pushed the tooling_skills_n_config_from_mtf_dev branch 2 times, most recently from f937cc9 to 6b0cb17 Compare June 17, 2026 19:53
Base automatically changed from tooling_skills_n_config_from_mtf_dev to main June 17, 2026 21:33
goodboy added 21 commits June 17, 2026 17:39
Prep for a future sub-interpreter (PEP 734
`concurrent.interpreters`) spawn backend per issue #379:
py-version and test-harness error-gating only; the backend
itself comes later.

Deats,
- bump `pyproject.toml` `requires-python` to
  `>=3.12, <3.15` and list the `3.14` classifier —
  the new stdlib `concurrent.interpreters` module
  only ships on 3.14
- `_testing.pytest.pytest_configure` wraps
  `try_set_start_method()` in a `pytest.UsageError`
  handler so an unsupported `--spawn-backend` on the
  running py-version prints a clean banner instead
  of a traceback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d318f1f)
(factored: kept only the pyproject + `_testing/pytest.py` parts of
 "Add `'subint'` spawn backend scaffold (#379)"; dropped
 tractor/spawn/_spawn.py + tractor/spawn/_subint.py)
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
  stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
  wheeled yet
  * on nixos the sdist build was failing due to lack of `g++` which
    i don't care to figure out rn since we don't need `.devx` stuff
    immediately for this subints prototype.
  * [ ] we still need to adjust any dependent suites to skip.

Adjust `test_ringbuf` to skip on import failure.

Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.

(cherry picked from commit d2ea8aa)
Pull the `_child.py` `__main__` block body out into
a callable `_actor_child_main()` so alternate spawn
backends can bootstrap a subactor without going
through the CLI entrypoint.

Deats,
- new `_actor_child_main(uid, loglevel, parent_addr,
  infect_asyncio, spawn_method='trio')` holds the
  full child-side runtime startup previously inlined
  under `if __name__ == '__main__':`
- `__main__` block reduces to arg-parsing + a call
  into the new func
- add `"subint"` to the `_runtime.py` spawn-method
  check so a child accepts `SpawnSpec` from that
  (future) backend; inert str-compare w/o it

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b8f243e)
(factored: kept only the `_child.py`/`_runtime.py` entry-extraction parts of
 "Impl min-viable `subint` spawn backend (B.2)"; dropped
 tractor/spawn/_subint.py + subint prompt-io logs)
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
#379), so that a later debugging session has them on the shelf
instead of reinventing them from scratch.

Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
  `faulthandler.dump_traceback_later()`. Critical gotcha baked in:
  dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
  capture silently eats the output and you can spend an hour
  convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
  logging per-block `(threading.active_count(),
  len(_interpreters.list_all()))` deltas; quickly rules out
  leak-accumulation theories when a suite progressively worsens (if
  counts don't grow, it's not a leak, look for a race on shared
  cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
  a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
  in by importing into a `conftest.py`. Kept as a factory (not a
  bare fixture) so callers own `autouse` / `writer` wiring

Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
  commit `26fb820`)
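
A rough sketch of what the first two helpers could look like, using only the stdlib. The signatures follow the bullets above; the default dump path is a hypothetical choice, and the sub-interpreter count that the real helper also tracks is left out here.

```python
from __future__ import annotations

import faulthandler
import os
import tempfile
import threading
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical default; the real helper's default path is not shown above.
DEFAULT_DUMP_PATH = os.path.join(tempfile.gettempdir(), 'hang_dump.txt')


@contextmanager
def dump_on_hang(
    seconds: float,
    *,
    path: str = DEFAULT_DUMP_PATH,
) -> Iterator[str]:
    # Dump to a real file rather than `sys.stderr` so that pytest's
    # stderr capture cannot swallow the traceback.
    with open(path, 'w') as f:
        faulthandler.dump_traceback_later(seconds, file=f)
        try:
            yield path
        finally:
            # The block finished in time: disarm the pending dump.
            faulthandler.cancel_dump_traceback_later()


@contextmanager
def track_resource_deltas(
    label: str,
    *,
    writer: Callable[[str], None] = print,
) -> Iterator[None]:
    # Thread count only in this sketch.
    before = threading.active_count()
    try:
        yield
    finally:
        after = threading.active_count()
        writer(
            f'{label}: threads {before} -> {after} '
            f'(delta {after - before:+d})'
        )
```

If the wrapped block returns before `seconds` elapse, the dump file stays empty; a non-empty file after a run is the signal that something wedged.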

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 09466a1)
Wrap the test's `trio.run(main)` in
`dump_on_hang(seconds=20)` so any future hang
regression captures a stack dump for triage instead
of wedging CI silently; under the default backends
it's a no-op safety net.

Includes a "KNOWN ISSUE" comment block documenting
the (future) `subint` backend hang classes observed
against this test during Phase B bringup (#379).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4a32545)
(factored: kept only the tests/discovery/test_registrar.py part of
 "Doc `subint` backend hang classes + arm `dump_on_hang`"; dropped
 subint conc-anal docs + tests/test_subint_cancellation.py)
Add a hard process-level wall-clock bound on a test
known to wedge un-Ctrl-C-ably under an in-dev spawn
backend, so an unattended suite run can't hang
indefinitely.

Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
  `@pytest.mark.timeout(3, method='thread')`. The
  `method='thread'` choice is deliberate —
  `method='signal'` routes via `SIGALRM` which can be
  starved by the same GIL-hostage path that drops
  `SIGINT`, so it'd never actually fire in the
  starvation case.

At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.
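
To illustrate the thread-vs-signal distinction, here is a stdlib-only sketch of a thread-based watchdog. It is not how `pytest-timeout` is implemented; `thread_watchdog` and its callback parameter are hypothetical names. The point is that the handler runs in its own timer thread instead of waiting for the main thread to get around to running a `SIGALRM` handler.

```python
import threading
from contextlib import contextmanager


@contextmanager
def thread_watchdog(seconds, on_timeout):
    # `on_timeout` is invoked from the timer thread, so it does not
    # depend on a signal handler being run by the main thread.
    timer = threading.Timer(seconds, on_timeout)
    timer.daemon = True
    timer.start()
    try:
        yield timer
    finally:
        # Block finished (or raised): make sure the timer never fires late.
        timer.cancel()
```

A hard-kill variant would pass something like `lambda: os._exit(1)` as `on_timeout`, which matches the "kill the whole process" behavior described above.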

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 189f4e3f72e9f1eda5d24bcbab5743f7e35bd913)
(factored: kept pyproject + tests/discovery/test_registrar.py parts of
 "Wall-cap `subint` audit tests via `pytest-timeout`"; dropped
 tests/test_subint_cancellation.py)
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.

Deats,
- `pytest_configure()` registers the marker via
  `addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
  each collected item with `item.iter_markers(
  name='skipon_spawn_backend')`, checks whether the
  active `--spawn-backend` appears in `mark.args`, and
  if so injects a concrete `pytest.mark.skip(
  reason=...)`. `iter_markers()` makes the decorator
  work at function, class, or module (`pytestmark =
  [...]`) scope transparently.
- First matching mark wins; default reason is
  `f'Borked on --spawn-backend={backend!r}'` if the
  caller doesn't supply one.

Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.
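
The hook logic described above can be sketched roughly as follows. The pure helper `skip_reason_for` is a hypothetical name, and the option name passed to `config.getoption()` is an assumption about how the active backend is stored.

```python
from __future__ import annotations

from types import SimpleNamespace  # handy for faking marks outside pytest


def skip_reason_for(marks, backend: str) -> str | None:
    # `marks` are objects exposing `.args` / `.kwargs`, as pytest marks do.
    for mark in marks:
        if backend in mark.args:
            # First matching mark wins.
            return mark.kwargs.get(
                'reason',
                f'Borked on --spawn-backend={backend!r}',
            )
    return None


def pytest_collection_modifyitems(config, items):
    # Imported lazily so the helper above stays usable without pytest.
    import pytest

    # ASSUMPTION: the active backend lives under this option name.
    backend = config.getoption('spawn_backend')
    for item in items:
        reason = skip_reason_for(
            item.iter_markers(name='skipon_spawn_backend'),
            backend,
        )
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Keeping the match logic in a plain function makes it checkable with fake mark objects, without spinning up a pytest session.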

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3b26b59)
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b52) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4b2a088)
(factored: dropped spawn-backend-only path: tests/test_subint_cancellation.py)
Resetting `_runtime_vars` post-(forking-)spawn was
previously only possible via direct mutation of
`_state._runtime_vars` from an external module + an
inline default dict duplicating the
`_state.py`-internal defaults. Split the access
surface into a pure getter + explicit setter so such
a reset call site becomes a one-liner composition:
`set_runtime_vars(get_runtime_vars(clear_values=True))`.

Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
  live `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
  kwarg. When True, returns a fresh copy of
  `_RUNTIME_VARS_DEFAULTS` instead of the live dict —
  still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
  atomic replacement of the live dict's contents via
  `.clear()` + `.update()`, so existing references to the
  same dict object remain valid. Accepts either the
  historical dict form or the `RuntimeVars` struct

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7804a9fe57693dd5e15bee6a08e7d2fa14b6a98a)
(factored: kept only the tractor/runtime/_state.py part; dropped
 tractor/spawn/_subint_forkserver.py call-site rewire)
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key off the same tuple; keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.
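
The gate described in the first two bullets might look something like this sketch. The tuple name and its contents come from the commit text; the helper name `check_debug_mode_backend` and the exact error wording are hypothetical.

```python
_DEBUG_COMPATIBLE_BACKENDS: tuple = (
    'trio',
    'subint_forkserver',
)


def check_debug_mode_backend(spawn_method: str) -> None:
    # Report both the full compatible set and the rejected method,
    # instead of a hard-coded "only `trio`" message.
    if spawn_method not in _DEBUG_COMPATIBLE_BACKENDS:
        raise RuntimeError(
            f'`debug_mode` is only supported on the '
            f'{_DEBUG_COMPATIBLE_BACKENDS!r} spawn backends, '
            f'got {spawn_method!r}'
        )
```

Because both the enable path and the cleanup path would consult the same tuple, adding a backend is a one-line change.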

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8bcbe73)
Stopgap companion to d012196 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.
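
For readers unfamiliar with the "unique port per run" idea, one common stdlib way to get an unused TCP address is to bind to port 0 and let the OS choose. This is an illustrative stand-in, not the implementation of tractor's `reg_addr` fixture; `pick_free_tcp_addr` is a hypothetical name.

```python
import socket


def pick_free_tcp_addr(host: str = '127.0.0.1') -> tuple:
    # Binding to port 0 asks the OS for a currently unused port.
    # Note the port is released again on close, so another process
    # could in principle grab it before the caller rebinds.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()
```

Passing such an address as `registry_addrs=[addr]` keeps a run off the shared default port, which is the containment effect this commit is after.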

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1af2121)
Completes the nested-cancel deadlock fix started in
0cd0b63 (fork-child FD scrub) and fe540d0 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8ac3dfe)
Skip-mark the still-hanging
`test_nested_multierrors[subint_forkserver]` via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test matrix
while the remaining bug is being chased. The mark is
an inert no-op until that (in-dev) backend lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 506617c)
(factored: kept only the tests/test_cancellation.py skip-mark; dropped
 the subint_forkserver conc-anal doc update)
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_peers()`
never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit e312a68)
(factored: dropped subint_forkserver conc-anal doc update)
Refresh the `test_nested_multierrors` skip-mark
reason to the final diagnosis: the hang is pytest's
default `--capture=fd` pipe filling from high-volume
subactor traceback output inherited via fds 1,2 in
fork children — `pytest -s` passes cleanly. Records
the fix direction (redirect child stdio to
`/dev/null` in the fork-child prelude) for whoever
lands the backend.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eceed29)
(factored: kept only the tests/test_cancellation.py skip-reason update of
 "Pin forkserver hang to pytest `--capture=fd`"; dropped the subint
 conc-anal doc + tests/spawn/test_subint_forkserver.py)
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).

Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
  `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
  capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
  `sys.stderr`).
- KEPT: `pytest -s` debugging behavior.

This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.

Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
  rationale comment cross-ref'ing the post-mortem doc

- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_multierrors` (no longer needed).
  * file-level `pytestmark` covers any residual.

- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
  mark loosened from `strict=True` to `strict=False` + reason rewritten.
  * it passes in isolation but is session-env-pollution sensitive
    (leftover subactor PIDs competing for ports / inheriting harness
    FDs).
  * tolerate both outcomes until suite isolation improves.

- `test_shm`: extend the existing
  `skipon_spawn_backend('subint', ...)` to also skip
  `'subint_forkserver'`.
  * Different root cause from the cancel-cascade class:
    `multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh-process state and don't survive fork-without-exec
    cleanly.

- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
  (unrelated to forkserver; just a flaky-under-load bump).

- `tractor.spawn._subint_forkserver`: inline comment-only future-work
  marker right before `_actor_child_main()` describing the planned
  conditional stdout/stderr-to-`/dev/null` redirect for cases where
  `--capture=sys` isn't enough (no code change — the redirect logic
  itself is deferred).

EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum-invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 4c133ab)
(factored: dropped spawn-backend-only paths: tests/spawn/test_subint_forkserver.py + tractor/spawn/_subint_forkserver.py; the xfail-loosening bullet above no longer applies)
Continues the hygiene pattern from de60167 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.

Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
  calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method`
    fixture params
  - `delay` now reads the `debug_mode` param directly
    instead of calling `tractor.debug_mode()` (fires
    slightly earlier in the test lifecycle)
  - sanity assert `if debug_mode: assert
    tractor.debug_mode()` after nursery open
  - new print showing SIGINT target
    (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around
    `ctx.wait_for_result()` and conditionally
    `pytest.xfail` when `send_sigint_to == 'child'
    and start_method == 'subint_forkserver'` — the
    known orphan-SIGINT limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b350aa0)
Zombie-subactor cleanup for the test suite, SC-polite discipline
(`SIGINT` first, bounded grace, `SIGKILL` only on survivors). Two parts:
a shared reaper module + an autouse session-end fixture that runs it.

Deats,
- new `tractor/_testing/_reap.py` (+230 LOC): a Linux-only reaper using
  `/proc/<pid>/{status,cwd,cmdline}` inspection. Two detection modes:
  - `find_descendants(parent_pid)` for the in-session case
    (PPid-direct-match while pytest is still alive).
  - `find_orphans(repo_root)` for the CLI / post-mortem case (`PPid==1`
    reparented to init + `cwd` filter to repo root + `python` cmdline
    filter).
- `reap(pids, *, grace=3.0, poll=0.25)` does the signal ladder: SIGINT
  all, poll up to `grace` for exit, SIGKILL any survivors. Returns
  `(signalled, killed)` for caller-side reporting.
- new `_reap_orphaned_subactors` session-scoped autouse fixture in
  `tractor/_testing/pytest.py` — after `yield`, runs
  `find_descendants(os.getpid())` + `reap(...)` so each pytest session
  leaves no surviving forks.
- companion CLI scaffolding lives at `scripts/tractor-reap` (separate
  commit) for the pytest-died-mid-session case where the in-session
  fixture didn't get to run.
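
The signal ladder above can be sketched roughly as follows. This is an
illustration under assumptions, not the contents of `_reap.py`: the
`kill=` and `alive=` parameters and the `pid_alive()` helper are
hypothetical additions that make the ladder testable without real
processes.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    '''Probe liveness with the null signal (POSIX).'''
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # the pid exists but belongs to another user
        return True
    return True


def reap(
    pids: list[int],
    *,
    grace: float = 3.0,
    poll: float = 0.25,
    # hypothetical injection points, handy for tests
    kill: Callable[[int, int], None] = os.kill,
    alive: Callable[[int], bool] = pid_alive,
) -> tuple[list[int], list[int]]:
    # 1. polite request: SIGINT everything first
    signalled: list[int] = []
    for pid in pids:
        try:
            kill(pid, signal.SIGINT)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # already gone

    # 2. bounded grace: poll until all exit or the deadline hits
    deadline = time.monotonic() + grace
    survivors = [p for p in signalled if alive(p)]
    while survivors and time.monotonic() < deadline:
        time.sleep(poll)
        survivors = [p for p in survivors if alive(p)]

    # 3. escalate: SIGKILL only what is still running
    killed: list[int] = []
    for pid in survivors:
        try:
            kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass
    return signalled, killed
```

A caller can treat a non-empty `killed` list as a health signal, since
it means at least one process ignored the graceful request.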

Also,
- promote `from tractor.spawn._spawn import SpawnMethodKey` to
  module-top in `pytest.py` (was inline-imported inside
  `pytest_generate_tests`), and reuse it in
  `pytest_collection_modifyitems` to assert each `skipon_spawn_backend`
  mark arg is a valid spawn-method literal — catches typos at collection
  time.
- inline `# ?TODO` flags running these through the `try_set_backend`
  checker for stronger validation.

Cross-refs `feedback_sc_graceful_cancel_first.md` for the
SIGINT-before-SIGKILL discipline rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit eae478f)
New `scripts/tractor-reap` CLI wraps the
`_testing._reap` mod for manual zombie-subactor
cleanup after crashed pytest sessions. Two modes:

- orphan-mode (default): finds PPid==1 procs
  with cwd matching repo root + `python` in
  cmdline.
- descendant-mode (`--parent <pid>`): scoped
  sweep under a still-live supervisor.

SC-polite: SIGINT with bounded grace window
(default 3s) before escalating to SIGKILL.
Exit code signals whether escalation was needed
(useful for CI health-checks).

Also, document both the auto-reap fixture and
the CLI in `/run-tests` SKILL.md (section 10).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 6d76b60)
goodboy added 25 commits June 17, 2026 17:39
Outer `signal.alarm` cap that fires even when trio's
`fail_after` is blocked by a shielded-await deadlock
(the bug-class-3 hang under MTF backends). Only armed
for fork-based spawners where the bug lives.

Deats,
- `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the
  trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
  last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
  breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).
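
A rough sketch of the outer-cap idea, assuming a POSIX main thread. The
`alarm_cap()` context manager and `AlarmCapTimeout` exception are
hypothetical names; only `signal.alarm` and the `fail_after_s + 5`
relation come from the notes above.

```python
import signal
import time
from contextlib import contextmanager


class AlarmCapTimeout(Exception):
    '''Hypothetical exc raised when the outer cap fires.'''


@contextmanager
def alarm_cap(seconds: int):
    '''
    Arm a process-level `SIGALRM` for `seconds`. Works only in
    the main thread on POSIX since it installs a signal handler.

    '''
    def _on_alarm(signum, frame):
        raise AlarmCapTimeout(f'outer cap of {seconds}s fired')

    prior = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # disarm any pending alarm
        signal.signal(signal.SIGALRM, prior)


fail_after_s = 4  # in-band budget for fork-based spawners
# the outer cap is larger, so it only wins when the in-band
# guard is itself wedged
diag_cap_s = fail_after_s + 5
```

Because the alarm is delivered by the kernel, it fires even when the
event loop never gets a chance to run its own deadline check.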

Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
  TODO re `pytest.raises()`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 10db117)
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
  `trio.fail_after(2)` + commented `capfd.disabled()` investigation
  (pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on fork-spawner
  hang prevention.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8aa07a7)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 75d5b4c)
- `skipon_spawn_backend('subint')`: expand reason with specific
  analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
  blame-attribute UDS sock-file orphans left by SIGKILL cancel
  cascades.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7d0a53d)
For forking spawner backends that is.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit b10011a)
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner`
  param, wrap `main()` in `fa_main()` with per-backend
  `trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade
  teardown that compounds under forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param
  for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so
  KBI firing before `open_context` body doesn't `UnboundLocalError`,
  add `pytest.fail` guard for the spawn-time IPC race case, arm
  `signal.alarm` AFK-safety cap (10s) under fork backends

Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
  `detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
  place).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7ee0dc2)
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 7509e31)
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3a243a1)
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session-scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit fb87c36)
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.
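
A small sketch of the subtree expansion, assuming a plain `pid -> ppid`
mapping in place of `psutil` so it runs stdlib-only.
`expand_with_descendants()` is a hypothetical helper name, not the
function in `_reap.py`.

```python
def expand_with_descendants(
    roots: list[int],
    ppid_of: dict[int, int],
) -> list[int]:
    '''
    Expand each root pid into its full subtree (depth-first,
    root before children) using a pid -> ppid table.

    '''
    # invert the table: parent pid -> child pids
    children: dict[int, list[int]] = {}
    for pid, ppid in ppid_of.items():
        children.setdefault(ppid, []).append(pid)

    ordered: list[int] = []
    seen: set[int] = set()
    stack = list(reversed(roots))
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        ordered.append(pid)
        # push in reverse so the lowest child pid pops first
        stack.extend(reversed(sorted(children.get(pid, []))))
    return ordered
```

Signalling the whole expanded list in one pass avoids the repeated
calls needed when each level only becomes init-adopted after its
parent dies.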

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 8de684f)
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 01ce285)
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
  conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
  `track_orphaned_uds_per_test`,
  `detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
  MTF `xfail(strict=False)` with detailed race-window comment
  explaining the BEG shape mismatch, wrap body in
  `fail_after_w_trace` with per-backend timeout budget, bump
  `@tractor_test(timeout=10)`, drop old multiprocessing depth
  special-casing
- `test_multierror_fast_nursery`: wrap in
  `fail_after_w_trace(30.0)`, accept `TooSlowError` in
  `pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
  `spawn_backend` param for `is_forking_spawner`, widen
  `fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
  `test_cancel_infinite_streamer`, `test_some_cancels_all`: add
  `set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
  covered by module-level `pytestmark`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 32955db)
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.

Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
  captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync
  CM, captures on `SIGALRM`), only armed under fork backends.
  Extracts `_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes
  while diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
  `afk_alarm_w_trace`).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit bd07a95)
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).

Previously, procs whose `ppid` points to some OTHER live non-`pytest`
process (when run as `acli.ptree pytest`) were being falsely flagged as
cross-test ghosts, even though they belong to a different tractor app
(`piker`, another `pytest` shell, a long-running tractor daemon).

Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
  short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
  filter rule + its rationale.
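
The ownership rule can be sketched as below. Both function names are
hypothetical, and `parse_ppid()` only approximates what a
`_ppid_from_proc()` style helper might do with the text of
`/proc/<pid>/status`.

```python
from typing import Optional


def parse_ppid(status_text: str) -> Optional[int]:
    '''
    Pull the `PPid:` field out of `/proc/<pid>/status` content;
    return `None` when the field is missing.

    '''
    for line in status_text.splitlines():
        if line.startswith('PPid:'):
            return int(line.split(':', 1)[1].strip())
    return None


def is_ghost_of_this_run(
    ppid: Optional[int],
    seen: set[int],
) -> bool:
    if ppid is None:
        # proc died in-flight, nothing to flag
        return False
    # init-adopted leak, or a descendant the walk raced past
    return ppid == 1 or ppid in seen
```

Anything whose parent is some other live process falls through to
`False`, which is what keeps unrelated tractor apps out of the report.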

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit a6d4ac3)
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).

Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
  (`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
  returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
  EL (`\033[K` before each `\n`) + post-draw erase-down
  (`\033[J`) → stale tail chars from a longer prior
  frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
  frame does a full clear (`\033[H\033[2J`) instead of
  the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
  so `KeyboardInterrupt` lands cleanly; prior handler
  restored on exit.
- Output capture: redirect the alias's stdout to
  `StringIO` per frame so we can post-process the EL
  fix. Aliases writing directly to `sys.stdout.buffer`
  / `os.write(1)` bypass capture — EL-fix won't apply
  but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
  callable OR `[fn, *preset_args]`. Both forms handled;
  subprocess-style aliases rejected w/ a friendly err
  msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
  of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
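
For illustration, the per-frame repaint described above might look like
this. `render_frame()` is a hypothetical helper, not the xontrib's
code; only the escape sequences are taken from the notes.

```python
CURSOR_HOME = '\033[H'
ERASE_LINE = '\033[K'  # EL: clear from cursor to end of line
ERASE_DOWN = '\033[J'  # ED: clear from cursor to end of screen
FULL_CLEAR = '\033[H\033[2J'


def render_frame(text: str, *, resized: bool = False) -> str:
    '''
    Build one repaint: home the cursor (or fully clear after a
    resize), erase each line's stale tail before its newline,
    then erase whatever the prior frame left below.

    '''
    head = FULL_CLEAR if resized else CURSOR_HOME
    body = ''.join(
        line + ERASE_LINE + '\n'
        for line in text.splitlines()
    )
    return head + body + ERASE_DOWN
```

Writing the whole frame as one string keeps each repaint to a single
`write()`, which helps avoid visible partial draws.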

Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
  entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
  `typing.Callable`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit bb239e8)
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone
tests so on-timeout we get a fresh
`ptree`/`wchan`/`py-spy` diag snapshot on disk instead of
opaque pytest timeout-kills. Same shape as bd07a95 for
`test_dynamic_pub_sub`.

Deats,
- `test_echoserver_detailed_mechanics`:
  * inner `trio.fail_after` → `fail_after_w_trace`. Adds
    `fail_after_w_trace: FailAfterWTraceFactory` fixture
    param.
  * mv per-backend `timeout` calc to top of test body (was
    interleaved w/ helper defs).
  * factor deep
    `open_nursery`/`open_context`/`open_stream` body into
    `_body()` so the wrapping `main()` stays a 2-liner —
    keeps the nested-CM block at its natural indent level
    instead of pushing it under yet another `async with`.
  * drop `with_timeout: bool` knob + `fa_main()` helper
    (knob was hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
  * outer `signal.alarm`/`try`/`finally` → single
    `afk_alarm_w_trace(10)` CM. Adds
    `afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture
    param.
  * drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
  * explanatory comment refreshed to mention
    `AFKAlarmTimeout` + the disk-snapshot side effect.

Other,
- Drop debug `return 1e3` short-circuit from `delay()`
  fixture — snuck in as a scratch line, was clobbering the
  proper `debug_mode`-branched return.
- Top-level import: `FailAfterWTraceFactory`,
  `AfkAlarmWTraceFactory` from `tractor._testing.trace`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 1cafaec)
SIGUSR1 task-tree dumps via `stackscope` should work in
plain (non-pdb) runs too — esp. in infected-`asyncio`
processes where the kernel-default SIGUSR1 disposition is
`Term` (proc dies on `kill -USR1` w/o an installed
handler). Ungate the install path from `_debug_mode` in
both root and sub-actor init; the `use_stackscope` rt-var
+ `TRACTOR_ENABLE_STACKSCOPE` env-var checks remain as
the actual opt-in (e.g. via `--enable-stackscope`).

Deats,
- `_root.open_root_actor`: drop the `debug_mode and ...`
  conjunction around the `enable_stack_on_sig()` call;
  now gated only on the `enable_stack_on_sig` arg itself.
- `_runtime.Actor` sub-actor init: lift the
  `use_stackscope`/`TRACTOR_ENABLE_STACKSCOPE` branch out
  of the `if rvs['_debug_mode']:` block to peer-level.
  The `use_greenback` branch stays inside `_debug_mode`
  (pdb-specific).
- Refresh inline comments on both sites to call out the
  infected-`asyncio` "default SIGUSR1 = terminate proc"
  rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 3d9c75b)
Factor the "deliver an exc to a running aio task" pattern out of
`translate_aio_errors()` + `open_channel_from()` into a shared
`maybe_signal_aio_task()` helper. Add a cause-chain matrix comment
+ relay-echo guard so the final-raise block can't cycle
`trio_err.__cause__` back onto its own derivative relay.

`maybe_signal_aio_task()`,
- Delivers `exc` via `aio_task._fut_waiter.set_exception()` — NOT
  `aio_task.set_exception()` which on py3.13+ ALWAYS raises
  `RuntimeError("Task does not support set_exception")` (dead code as
  a relay mechanism).
- Returns `(delivered: bool, report: str)`. Caller uses `delivered` to
  flip `wait_on_aio_task` when delivery failed (avoids hanging on
  `_aio_task_complete.wait()`).
- `pre_captured_fut=`: required when the caller crosses a trio
  checkpoint between capturing `_fut_waiter` and invoking the helper.
  `Task._wakeup` clears `_fut_waiter = None` so re-reading
  post-checkpoint loses the ref even though the exc is still in-flight
  on the (now-`done()`) original fut.
- `cause=`: sets `exc.__cause__ = cause` so the relay carries
  a "trio_err -> caused -> relay" chain through `set_exception()`
  → `Task._wakeup` → coro raise → `wait_on_coro_final_result`
  → `signal_trio_when_done` → `task.result()`-raise.
- `allow_cancel_fallback=True`: opt-in `aio_task.cancel()` for the
  narrow case where `_fut_waiter is None` AND task is runnable (sitting
  in asyncio's ready queue, not parked on a poke-able future). NEVER
  cancels when `_fut_waiter` carries an in-flight exc — that would race
  + mask the real terminating exc.
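
A stdlib-only sketch of the delivery mechanism, not the helper itself.
It shows that `asyncio.Task.set_exception()` is refused while setting
the exception on the private `_fut_waiter` future reaches the parked
coroutine. `relay_exc_to_parked_task()` is a hypothetical name.

```python
import asyncio


async def relay_exc_to_parked_task() -> tuple[str, str]:
    loop = asyncio.get_running_loop()
    parked = loop.create_future()

    async def child() -> None:
        # the task parks here, so `_fut_waiter` is `parked`
        await parked

    task = asyncio.create_task(child())
    await asyncio.sleep(0)  # let `child()` run up to its await

    # direct delivery is refused by `asyncio.Task`
    try:
        task.set_exception(ValueError('direct'))
        direct = 'accepted'
    except RuntimeError:
        direct = 'refused'

    # private attr: the future the task is currently awaiting
    waiter = task._fut_waiter
    waiter.set_exception(ValueError('relayed'))

    try:
        await task
        outcome = 'no error'
    except ValueError as err:
        outcome = str(err)
    return direct, outcome
```

Note that `_fut_waiter` is an implementation detail of `asyncio.Task`,
which is why the helper has to handle the `None` case explicitly.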

`translate_aio_errors()`,
- Replace the two ad-hoc `_fut_waiter.set_exception()`
  / `aio_task.set_exception()` call sites w/ the helper.
- Capture `pre_cp_fut = aio_task._fut_waiter` BEFORE the post-shutdown
  `trio.lowlevel.checkpoint()` (critical: `_wakeup` clears the ref).
- New "cross-loop cause-chain matrix" comment block on the final-raise
  — tabulates every `(trio_err, aio_err, trio_to_raise)` combo into
  exactly one terminal `raise X [from Y]` or early `return`. Covers the
  sibling `signal_trio_when_done()` resolution + the relay-echo
  INVARIANT.
- New relay-echo guard: if `aio_err` is one of OUR OWN signals
  (`TrioTaskExited`/`TrioCancelled`) AND `aio_err.__cause__ is
  trio_err`, raise the bare `trio_err` instead of `trio_err from
  aio_err` (which would CYCLE the cause chain since the relay was itself
  caused-by `trio_err`).
- Drop the stale "the `task.set_exception(aio_taskc)` call MUST NOT
  EXCEPT or this WILL HANG" warning — the helper handles the failure
  path explicitly via `delivered=False` → `wait_on_aio_task = False`.
- Carry `cause=trio_err` on both the cancel-relay (`TrioCancelled`) and
  the graceful-exit relay (`TrioTaskExited`) so the aio-side traceback
  shows the real root.
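
The relay-echo guard can be illustrated with a tiny decision function.
This covers only that one rule, not the full cause-chain matrix; the
two exception classes are local stand-ins and `pick_final_raise()` is a
hypothetical name.

```python
from typing import Optional


class TrioTaskExited(Exception):
    '''Local stand-in for the graceful-exit relay signal.'''


class TrioCancelled(Exception):
    '''Local stand-in for the cancel relay signal.'''


def pick_final_raise(
    trio_err: BaseException,
    aio_err: BaseException,
) -> tuple[BaseException, Optional[BaseException]]:
    '''
    Return `(exc, cause)` for the terminal raise, where a
    `cause` of `None` means a bare `raise exc`.

    '''
    is_own_relay = isinstance(
        aio_err,
        (TrioTaskExited, TrioCancelled),
    )
    if is_own_relay and aio_err.__cause__ is trio_err:
        # the relay was itself caused by `trio_err`, so chaining
        # it back on would make the cause chain cyclic
        return trio_err, None
    return trio_err, aio_err
```

Without the guard, `raise trio_err from aio_err` would leave
`trio_err.__cause__.__cause__ is trio_err`, which traceback tooling
handles poorly.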

`open_channel_from()`,
- Adopt the same helper; drop the dead "SHOULD NEVER GET HERE !?!?"
  + `tractor.pause(shield=True)` panic branch.
- Capture in-flight trio-side exc via `sys.exc_info()[1]` and pass as
  `cause=` — non-`None` only when the `try` body raised (graceful exit
  → None).

Other,
- Top-level import: `sys` (for `sys.exc_info()`).
- `run_as_asyncio_guest()`: add commented-out alt `out: Outcome = await
  trio_done_fute` next to the shielded version — exploratory note for
  the longstanding "why is `.shield()` needed?" TODO.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit acd1cbe)
Two coupled changes that let downstream projects (eg. `modden`) inherit
the test-harness loglevel plumbing for free via
`tractor._testing.pytest`:

Plugin lift (`tests/conftest.py` → `_testing/pytest.py`),
- mv `pytest_addoption(--ll)`, the `loglevel` autouse
  fixture, and `test_log` fixture out of the test-suite-
  local conftest into the reusable plugin.
- add `--tl`/`--tractor-loglevel` as a DISTINCT flag from
  `--ll`: `--ll` is the consuming-project's OWN app
  loglevel (scoped to its pkg-hierarchy), `--tl` is the
  `tractor.*` runtime loglevel. `--tl` falls back to
  `--ll` when unset (preserves current `tractor`-suite
  behavior).
- add `testing_pkg_name` session fixture (default
  `'tractor'`) — downstream projects override to e.g.
  `'modden'` so `--ll` scopes to their own hierarchy
  instead of `tractor.*`.
- `loglevel` fixture now yields the resolved
  tractor-runtime level (passed to
  `open_root_actor(loglevel=<.>)` by `@tractor_test`)
  AND separately applies `--ll` to the
  `testing_pkg_name` hierarchy when that isn't
  `tractor`. `test_log` scopes the per-test logger to
  `testing_pkg_name`.

`tractor.log` "logging-spec" mini-DSL,
- `LogSpec = str|bool`. Accepted forms:
  - `True` → enable `pkg_name` root at `default_level`
    (fallback `'cancel'`).
  - `False` → no-op.
  - bare level eg. `'info'` → root-logger at that level.
  - `'sub:info,x:cancel'` → per-sub-logger filter-spec;
    each `<name>` is RELATIVE to `pkg_name` (must NOT
    include the pkg-token).
- `parse_logspec()` → `{sublog|None: level}` mapping.
  `None` key = root-logger. Mixed bare-level + filters
  in one spec is rejected w/ a helpful err msg; so is
  embedding the `pkg_name` token in a sub-name.
- `apply_logspec()` → `(primary_level, {name: log})`:
  parses then enables a `colorlog` stderr handler per
  named (sub)logger. Authoritative sub-logger filters
  get `propagate=False` so they don't double-emit
  through a parallel root-level handler.
- !GRANULARITY CAVEAT! sub-logger names match at
  sub-pkg granularity, not leaf-module — so `devx.debug`
  collapses to the same `tractor.devx` logger as a bare
  `devx`, and top-level lib modules (eg.
  `tractor.to_asyncio`) emit under the *root* logger
  rather than a phantom `to_asyncio` child. Documented
  inline on `LogSpec`.
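
A hedged sketch of how such a spec parser could behave, written from
the rules listed above. It is not `tractor.log`'s implementation: the
signature, the error wording, and the exact pkg-token check are
assumptions.

```python
from typing import Optional, Union

LogSpec = Union[str, bool]


def parse_logspec(
    spec: LogSpec,
    pkg_name: str = 'tractor',
    default_level: str = 'cancel',
) -> dict[Optional[str], str]:
    '''
    Map a spec to `{sublog|None: level}`, where the `None` key
    stands for the root logger.

    '''
    if spec is True:
        return {None: default_level}
    if spec is False:
        return {}

    parsed: dict[Optional[str], str] = {}
    for token in spec.split(','):
        token = token.strip()
        if not token:
            continue
        name, sep, level = token.partition(':')
        if not sep:
            # bare level, applies to the root logger
            parsed[None] = name
            continue
        # assumed check: sub-names are relative to `pkg_name`
        if name == pkg_name or name.startswith(pkg_name + '.'):
            raise ValueError(
                f'Sub-logger {name!r} must be relative to '
                f'{pkg_name!r}, drop the pkg prefix'
            )
        parsed[name] = level

    if None in parsed and len(parsed) > 1:
        raise ValueError(
            'Cannot mix a bare level with per-sub-logger '
            f'filters in one spec: {spec!r}'
        )
    return parsed
```

For example `parse_logspec('sub:info,x:cancel')` yields one entry per
sub-logger, while a bare `'info'` yields a single root entry.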

Other,
- `tests/conftest.py` keeps a NOTE pointing to the
  plugin for future-debugging clarity (don't remove
  silently — the lift is the relevant signal).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 19a7770)
Strip the trailing `pkg_path` token ONLY when it duplicates the
caller's leaf-*module* name (which the console header already
shows via `{filename}`), instead of blindly dropping the last
token. This keeps genuine, possibly-*nested* sub-PACKAGE parts
addressable as their own sub-loggers.

- detect a true leaf-mod by comparing the caller's `__name__`
  vs `__package__` (a pkg `__init__` has them equal -> its
  trailing token is a real sub-pkg, NOT a leaf to strip).
- `name='devx.debug'` now -> `tractor.devx.debug`, DISTINCT
  from a bare `devx` -> `tractor.devx`; the old unconditional
  `pkg_path = subpkg_path` collapsed both to `tractor.devx` and
  silently broke per-sub-pkg level control via the logging-spec.
- `get_logger(__name__)` leaf-strip still works (cosmetic, bc
  the leaf-mod is in the `{filename}` header field).
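
The leaf-module detection might be sketched like so, assuming the
caller's `__name__` and `__package__` are passed in directly.
`logger_name_for()` and the `some_leaf` module are hypothetical and
only illustrate the comparison rule.

```python
def logger_name_for(mod_name: str, package: str) -> str:
    '''
    Derive a logger name from a caller's `__name__` and
    `__package__` values.

    '''
    if mod_name == package:
        # a pkg `__init__`: the trailing token is a real
        # sub-pkg, so keep the name whole
        return mod_name
    # a leaf module: drop its own (trailing) name since the
    # `{filename}` header field already shows it
    return mod_name.rpartition('.')[0] or mod_name
```

Under this rule a top-level module such as `tractor.to_asyncio`
resolves to the root `tractor` logger, matching the caveat above.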

Also,
- update the `LogSpec` caveat: sub-PACKAGE granularity now
  addressable at ANY depth; leaf *modules* intentionally aren't
  (they're the `{filename}`); top-level mods (eg. `to_asyncio`)
  still emit on the root logger.
- adjust `test_root_pkg_not_duplicated_in_logger_name` to the
  new literal explicit-`name` contract (no leaf-collapse).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit 9c36363)
trio 0.29 → 0.33 lock bump (c7741bb) slowed the
depth=3 cancel-cascade in `test_nested_multierrors`
from <6s to ~7-8s; the 6s deadline was firing and its
`Cancelled(source='deadline')` (trio 0.33's new
cancel-reason metadata) collapsed a BEG branch,
breaking the `RemoteActorError` assertion downstream.

- Split the `('trio', _)` case-match into per-depth
  arms: `('trio', 1)` keeps 6s (still finishes in
  ~3s); `('trio', 3)` → 12s.
- Updated inline NOTE explains the version pivot +
  links the tracking issue
  `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.
- Existing MTF/`subint_forkserver` budgets unchanged.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit ea67f1b)
(cherry picked from commit 57b3ea5)
Same trio 0.29 → 0.33 cancel-cascade slowdown that hit
`test_nested_multierrors` (ea67f1b) — bumps the
`trio`-backend (non-debug, non-forking) budget in
`test_echoserver_detailed_mechanics` from 1s → 4s.

- The 1s budget raced the ~1s teardown deadline. On a
  deadline-fire trio 0.33 injects
  `Cancelled(source='deadline')` (cancel-reason
  metadata) that wraps the mid-stream KBI in a
  `BaseExceptionGroup`, breaking the bare
  `pytest.raises(KeyboardInterrupt)` below.
- Bump matches the forking-spawner branch (4s).
- Inline NOTE references the tracking issue
  `ai/conc-anal/trio_033_cancel_cascade_slowdown_depth3_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

(cherry picked from commit d7da502)
(cherry picked from commit d0144e5)
`open_root_actor()` writes `_enable_tpts` (and friends) into
the process-global `_state._runtime_vars` dict but nothing
resets it on actor teardown. Under the in-proc `pytest`
launchpad a uds-using test leaks `_enable_tpts=['uds']` into
a sibling tcp test, tripping the
`registry_addrs`×`enable_transports` proto-guard in
`open_root_actor()` with a `ValueError`.

New `_reset_runtime_vars` fixture snapshots + restores the
dict around every test so no runtime-var state crosses a
test boundary.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Regenerate the lockfile so it's consistent with the
post-rebase `pyproject.toml` — which now carries both #461's
landed tooling (`pytest>=9.0.3`, …) and this branch's
tractor deps (`setproctitle`, `pytest-timeout`, `psutil`).

- `uv lock` resolves the merged dep set against the landed
  `main` baseline.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy force-pushed the subint_era_tooling branch from 5f52284 to 41b5371 Compare June 17, 2026 21:40
goodboy added 4 commits June 17, 2026 19:44
Left-over debug trap from the `_runtime_vars` pure get/set
refactor — it fired on *every* struct-form rt-var write (e.g.
via `.update()`), hanging any non-tty / CI / forked actor on
`pdb` stdin.

Surfaced by a `/code-review high` pass on #462.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two fixes to the hang-debug SIGUSR1 task-tree dump path,
surfaced by `/code-review high` on #462:

- re-add `_debug_mode` to the sub-actor handler-install gate
  in `_runtime.py`. Dropping it (rel. `3a386ba5`/`3d9c75b6`
  "Drop debug_mode gate", from the `custom_log_levels_api`
  follow-up) was meant to *also* enable non-pdb runs, but
  nothing sets `use_stackscope` from `debug_mode`, so
  debug-mode subs were left with NO handler — and the default
  SIGUSR1 disposition then *kills* them. Now additive:
  `_debug_mode OR use_stackscope OR env`.
- pass `write_file=True` at both `dump_task_tree()` SIGUSR1
  call sites so the advertised
  `/tmp/tractor-stackscope-<pid>.log` tee is actually written
  (was dead under `--capture=fd`). Matches `1b1ef10a`
  "Re-enable writing `stackscope` to file by default"; param
  from `0df90500`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two defensive fixes around the `Portal.cancel_actor()` +
`_try_cancel_then_kill()` escalation from `34f333a0`
"Escalate cancel-ack timeouts to `proc.terminate()`" (the
`trionics.start_or_cancel` follow-up); surfaced by
`/code-review high` on #462:

- guard `proc.terminate()` for backends whose `proc` slot
  isn't a `Process` — the future `subint` backend stores an
  `int` interp-id, so escalation would `AttributeError`
  instead of hard-killing; now it logs + no-ops.
- swap `assert cs.cancelled_caught` for an
  `if cs.cancelled_caught and raise_on_timeout:` guard so an
  unexpected shielded-scope exit returns a soft `False`
  rather than crashing `cancel_actor()` mid-teardown.
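
One way the first guard could look, as a sketch only: it duck-types on a callable `.terminate` rather than checking for a specific `Process` class, which is an assumption about the approach. `hard_kill` and `FakeProc` are hypothetical names.

```python
import logging

log = logging.getLogger(__name__)


class FakeProc:
    '''
    Stand-in for a process handle exposing `.terminate()`.

    '''
    def __init__(self) -> None:
        self.terminated: bool = False

    def terminate(self) -> None:
        self.terminated = True


def hard_kill(proc: object) -> bool:
    '''
    Escalate to `.terminate()` when the handle supports it;
    otherwise log and no-op. Returns whether a kill was sent.

    '''
    terminate = getattr(proc, 'terminate', None)
    if not callable(terminate):
        # e.g. an `int` interp-id which has no process to signal
        log.warning(
            'Cannot terminate non-process handle: %r', proc
        )
        return False

    terminate()
    return True


proc = FakeProc()
killed_proc = hard_kill(proc)
killed_interp_id = hard_kill(7)
```

Returning a flag instead of raising keeps the escalation path usable during teardown, where a secondary `AttributeError` would mask the original cancel-ack timeout.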

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `TRACTOR_LOGLEVEL`/`TRACTOR_SPAWN_METHOD` override-notice
branches were unreachable: `loglevel`/`start_method` were
reassigned to the env value BEFORE the `!=` compare, so the
"OVERRIDES caller-passed" message never fired. Capture the
caller value first, then compare. Rel. `208e7c09`/`d4eac06d`
"Honor env-vars" (`trionics.start_or_cancel`); surfaced by
`/code-review high` on #462.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code

Labels

- devx-tooling: "developer experience" improvements as provided `tractor.devx` for runtime dependents.
- integration: Optional/loose support for 3rd party libs/apps/projects
- testing
