fix: hang test 'succeeded' flake — stale state dir from reused run id #6
Root cause of the hang-test 'succeeded' flake: SQLite reuses run rowids
when a prior run is deleted (no AUTOINCREMENT). When test N+1 lands on
an id whose container from test N is still alive (Waiting phase, listener
still connected), the container's bind mount on runs_dir/<id>/state is
still active. setup_run_dir's del_dir_r then fails with EBUSY, the
directory survives, and result.json from the prior run leaks into the
new run. read_outcome reads the stale exit_code; if the prior scenario
exited 0, the new run is reported 'succeeded' regardless of what
actually happened in its container.
Two-part fix:
1. Reorder do_mock_launch / do_real_launch to force-remove the prior
container BEFORE setup_run_dir, so the bind mount is released and
del_dir_r can clean the directory.
2. As a defensive layer, explicitly delete the individual state signal
files (result.json, agent-status, session-id, ready) by path. unlink
works on individual files inside a bind-mounted directory even when
the directory itself can't be removed — so even if a future bug
re-introduces the ordering issue, read_outcome won't see stale data.
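The defensive layer in part 2 relies on a POSIX property: `unlink` operates on a directory *entry*, so it succeeds on files inside a bind-mounted directory even when removing the directory itself fails with EBUSY. The actual implementation lives in the project's Gleam code; the sketch below is a hypothetical Python illustration of the same pattern (the helper name `clear_stale_signals` is invented for this example, the file names come from the commit message):

```python
import os

# Signal files named in the fix; stale copies of these are what
# read_outcome would otherwise pick up from a prior run.
SIGNAL_FILES = ["result.json", "agent-status", "session-id", "ready"]

def clear_stale_signals(state_dir: str) -> list[str]:
    """Unlink stale signal files by path, returning the names removed.

    Deleting file-by-file works even when the enclosing directory is
    still bind-mounted and rmdir/recursive-delete would fail with EBUSY.
    """
    removed = []
    for name in SIGNAL_FILES:
        path = os.path.join(state_dir, name)
        try:
            os.unlink(path)
            removed.append(name)
        except FileNotFoundError:
            pass  # nothing stale for this signal — already clean
    return removed
```

Because each file is handled independently, a missing file is not an error; the function is safe to call unconditionally before every launch.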
Verified via /tmp/fbi-runs-state artifacts on debug/hang-flake CI run
25166517481: run-3's quantico-debug.log said OUTCOME=SleepingForever
(correctly blocked) yet result.json had exit_code:1 — leftover from
the crash-fast run that previously held that id.
Three handler files left unformatted by recent merges; bring them in line so this PR's CI passes the format check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two-actor-on-one-rowid bug — even with the worker.gleam state-dir fix, a deleted run's actor sits in Waiting phase (5-min listener-keepalive timeout) and still holds state.run_id = N. When SQLite reuses N for the next test's run, the old container_monitor eventually reports ContainerExited; the old actor processes it and calls mark_finished(db, N, prior_outcome). DB row N is now the *new* run, so the new run's state gets overwritten with the prior run's exit code — same 'succeeded'-flake symptom we hit in the e2e hang test.

AUTOINCREMENT makes ids strictly grow. The old actor's UPDATE WHERE id = N matches no row (the new run has id N+1 or higher), so the leak becomes a harmless no-op without changing anything else about the actor lifecycle.

SQLite doesn't support ALTER TABLE … ADD AUTOINCREMENT, so this migration rebuilds the table. Manual sqlite_sequence DELETE+INSERT seeds it to MAX(id) (sqlite_sequence has no UNIQUE on name, so INSERT OR REPLACE doesn't dedupe). Verified locally: insert id=1, id=2, delete 2, insert → id=3 (was 2 with the old schema).

Belt-and-suspenders alongside worker.gleam's state-dir cleanup; both fixes target the same root cause (id reuse + lingering old-run state) from different angles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
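The rowid-reuse behavior and the AUTOINCREMENT fix can be reproduced in a few lines. This is a minimal Python/sqlite3 sketch of the mechanism, not the project's actual schema — the table and column names are invented for the demonstration, and the sqlite_sequence seeding at the end mirrors the migration step from the commit message:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Without AUTOINCREMENT, SQLite picks max(rowid)+1, so deleting the
# highest row frees its id for reuse — the bug described above.
con.execute("CREATE TABLE runs_plain (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO runs_plain (name) VALUES ('a')")  # id 1
con.execute("INSERT INTO runs_plain (name) VALUES ('b')")  # id 2
con.execute("DELETE FROM runs_plain WHERE id = 2")
reused = con.execute("INSERT INTO runs_plain (name) VALUES ('c')").lastrowid
# reused == 2: the deleted id comes straight back

# With AUTOINCREMENT, the high-water mark lives in sqlite_sequence and
# ids strictly grow; a deleted id is never reissued.
con.execute(
    "CREATE TABLE runs_ai (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
con.execute("INSERT INTO runs_ai (name) VALUES ('a')")  # id 1
con.execute("INSERT INTO runs_ai (name) VALUES ('b')")  # id 2
con.execute("DELETE FROM runs_ai WHERE id = 2")
fresh = con.execute("INSERT INTO runs_ai (name) VALUES ('c')").lastrowid
# fresh == 3: id 2 stays retired

# Migration-style seeding of sqlite_sequence to MAX(id), done as
# DELETE + INSERT because the table has no UNIQUE constraint on name.
con.execute("DELETE FROM sqlite_sequence WHERE name = 'runs_ai'")
con.execute(
    "INSERT INTO sqlite_sequence (name, seq) "
    "SELECT 'runs_ai', COALESCE(MAX(id), 0) FROM runs_ai"
)
```

This reproduces the local verification quoted above: insert id=1, id=2, delete 2, insert → id=3 with AUTOINCREMENT, versus id=2 with the plain schema.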
Summary
- Reorder `do_mock_launch`/`do_real_launch` to force-remove the prior container before `setup_run_dir`
- Defensively delete the individual state signal files (`result.json`, `agent-status`, `session-id`, `ready`) by path

Root cause
SQLite without `AUTOINCREMENT` reuses run rowids when prior runs are deleted. Test cleanup (`run.destroy()`) does exactly this, so the next test can land on an id whose `runs_dir/<id>/state` is still bind-mounted by the prior container (which sits in `Waiting` phase for 5 minutes after exit, since the test page is still subscribed via WS).
`setup_run_dir` runs before `remove_container`. While that bind mount is still alive, `del_dir_r` of the state dir fails with EBUSY — silently, because the result is `let _ = …`. The stale `result.json` survives. The new container starts, runs the hang scenario, blocks on `SleepingForever`, gets SIGKILL'd → exits 137. `read_outcome` then reads the prior run's `result.json` and reports its `exit_code`. If the prior run was `crash-fast` you get `exit 1 → state="failed"` (the test passes by luck). If the prior exited 0 (default, env-echo, etc.) you get `exit 0 → state="succeeded"` and the hang test fails on `/cancelled|stopped|failed|errored/`.
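The masking effect — a stale `result.json` taking precedence over the exit code Docker actually reported — can be sketched in a few lines. This is a hypothetical Python model of the read path, not the project's Gleam `read_outcome`; it only illustrates why a leftover file wins over the real exit status:

```python
import json
import os

def read_outcome(state_dir: str, docker_exit_code: int) -> int:
    """Model of the buggy precedence: result.json, when present,
    overrides the exit code observed by wait_container."""
    path = os.path.join(state_dir, "result.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["exit_code"]  # stale data wins here
    return docker_exit_code
```

With a stale `{"exit_code": 0}` on disk, a container that was SIGKILL'd (exit 137) is still reported as exit 0 — exactly the 'succeeded' flake.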
Evidence
Captured on `debug/hang-flake` CI run `25166061230`. From run-3's state dir artifact:
```
quantico-debug.log: OUTCOME=SleepingForever ← current container (hang) blocked correctly
result.json: {"exit_code":1, ...} ← stale, from prior crash-fast container
```
The actor's `TRANSITION_TO_WAITING` saw `exit_code=1, err="agent exit 1"` despite `wait_container` returning `137` — confirming the result.json path was overriding the docker exit code.
Test plan
🤖 Generated with Claude Code