debug: hang flake instrumentation (do not merge) #5

Closed
fdatoo wants to merge 3 commits into main from debug/hang-flake

@fdatoo fdatoo commented Apr 30, 2026

Temporary diagnostic branch — do not merge.

fdatoo added 3 commits April 30, 2026 05:44
DO NOT MERGE — temporary diagnostics on a debug branch.

Logs CANCEL, CANCEL_KILLED, TRANSITION_TO_WAITING, WAIT_BEGIN,
WAIT_RESULT events with run id, cid, kill response and exit code to
/tmp/fbi-debug.log; CI uploads it as the fbi-debug-log artifact on
every run. Drops retries to 0 so the flake surfaces.
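
To illustrate the kind of event line described above, here is a minimal Python sketch. The line format, the `log_event` helper, and the field names are assumptions made for illustration; only the event names and the log path come from the commit message, and the branch's actual logging code may look quite different.

```python
import time

# Path and event names are taken from the commit message.
DEBUG_LOG = "/tmp/fbi-debug.log"
EVENTS = {
    "CANCEL",
    "CANCEL_KILLED",
    "TRANSITION_TO_WAITING",
    "WAIT_BEGIN",
    "WAIT_RESULT",
}


def log_event(event, run_id, path=DEBUG_LOG, **fields):
    """Append one space-separated diagnostic line (hypothetical format).

    Extra fields such as cid, kill_response or exit_code are written as
    key=value pairs in sorted key order so lines are easy to grep.
    """
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    parts = [f"{time.time():.3f}", event, f"run={run_id}"]
    parts += [f"{key}={value}" for key, value in sorted(fields.items())]
    # Append mode so successive events accumulate in a single file.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(" ".join(parts) + "\n")
```

A call such as `log_event("WAIT_RESULT", 3, cid="abc123", exit_code=1)` would append one line to the log, which CI could then upload as an artifact.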

Once we have the data and a fix, revert this branch.

Adds /fbi-state/quantico-debug.log marker for which branch quantico
took (Exited/SleepingForever/Err) and uploads per-run state dirs so
we can correlate quantico's outcome with the actor's
TRANSITION_TO_WAITING decision.

Root cause of the hang-test 'succeeded' flake: SQLite reuses run rowids
when a prior run is deleted (no AUTOINCREMENT). When test N+1 lands on
an id whose container from test N is still alive (Waiting phase, listener
still connected), the container's bind mount on runs_dir/<id>/state is
still active. setup_run_dir's del_dir_r then fails with EBUSY, the
directory survives, and result.json from the prior run leaks into the
new run. read_outcome reads the stale exit_code; if the prior scenario
exited 0, the new run is reported 'succeeded' regardless of what
actually happened in its container.
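
The rowid reuse at the heart of this root cause can be reproduced in isolation. The sketch below uses Python's standard `sqlite3` module with a hypothetical `runs` table and illustrative column and scenario names; it is not the project's schema. It shows that a plain `INTEGER PRIMARY KEY` hands a deleted id back to the next insert, while `AUTOINCREMENT` does not.

```python
import sqlite3

# Hypothetical schemas for illustration; the real table layout is not shown
# in this PR.
PLAIN = "CREATE TABLE runs (id INTEGER PRIMARY KEY, scenario TEXT)"
AUTOINC = "CREATE TABLE runs (id INTEGER PRIMARY KEY AUTOINCREMENT, scenario TEXT)"


def id_is_reused(create_table_sql):
    """Insert a run, delete it, insert another; report whether the id repeats."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_table_sql)
        first = conn.execute(
            "INSERT INTO runs (scenario) VALUES ('crash-fast')"
        ).lastrowid
        conn.execute("DELETE FROM runs WHERE id = ?", (first,))
        second = conn.execute(
            "INSERT INTO runs (scenario) VALUES ('hang')"
        ).lastrowid
        return first == second
    finally:
        conn.close()


if __name__ == "__main__":
    print("plain INTEGER PRIMARY KEY reuses id:", id_is_reused(PLAIN))
    print("AUTOINCREMENT reuses id:", id_is_reused(AUTOINC))
```

Without `AUTOINCREMENT`, SQLite picks one more than the largest rowid currently in the table, so deleting the newest run frees its id for the next one. Anything keyed on that id outside the database, such as a per-run directory, then has to be cleaned reliably.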

Two-part fix:
  1. Reorder do_mock_launch / do_real_launch to force-remove the prior
     container BEFORE setup_run_dir, so the bind mount is released and
     del_dir_r can clean the directory.
  2. As a defensive layer, explicitly delete the individual state signal
     files (result.json, agent-status, session-id, ready) by path. unlink
     works on individual files inside a bind-mounted directory even when
     the directory itself can't be removed — so even if a future bug
     re-introduces the ordering issue, read_outcome won't see stale data.
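
The two parts of the fix can be sketched together as follows. This is a Python illustration under stated assumptions, not the code from the commit: `prepare_state_dir`, `clear_state_signals`, and the injectable `force_remove_container` and `remove_tree` callables are hypothetical names, and only the four signal file names and the ordering come from the description above.

```python
import errno
import shutil
from pathlib import Path

# Signal file names as listed in the commit message.
SIGNAL_FILES = ("result.json", "agent-status", "session-id", "ready")


def clear_state_signals(state_dir):
    """Unlink each signal file by path; return the names actually removed."""
    removed = []
    for name in SIGNAL_FILES:
        try:
            (state_dir / name).unlink()
            removed.append(name)
        except FileNotFoundError:
            pass  # nothing stale to clear for this file
    return removed


def prepare_state_dir(state_dir, force_remove_container, remove_tree=shutil.rmtree):
    """Prepare a fresh state directory for a run (illustrative sketch)."""
    # Part 1: remove the prior container first so its bind mount is released.
    force_remove_container()
    try:
        remove_tree(state_dir)
    except FileNotFoundError:
        pass  # first use of this run id
    except OSError as exc:
        # Tolerate only "busy" here; the per-file cleanup below covers it.
        if exc.errno != errno.EBUSY:
            raise
    state_dir.mkdir(parents=True, exist_ok=True)
    # Part 2: defensive per-file cleanup in case the directory survived.
    clear_state_signals(state_dir)
    return state_dir
```

Passing the directory remover as a parameter is only a convenience for exercising the busy path without a real bind mount; in the sketch, a remover that raises `EBUSY` still leaves the directory free of stale signal files.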

Verified via /tmp/fbi-runs-state artifacts on debug/hang-flake CI run
25166517481: run-3's quantico-debug.log said OUTCOME=SleepingForever
(correctly blocked) yet result.json had exit_code:1 — leftover from
the crash-fast run that previously held that id.

@fdatoo fdatoo closed this Apr 30, 2026
@fdatoo fdatoo deleted the debug/hang-flake branch April 30, 2026 13:11