
fix: hang test 'succeeded' flake — stale state dir from reused run id #6

Open

fdatoo wants to merge 3 commits into `main` from `fix/hang-state-dir-bind-race`

Conversation


fdatoo (Owner) commented Apr 30, 2026

Summary

  • Reorder do_mock_launch / do_real_launch to force-remove the prior container before setup_run_dir
  • Belt-and-suspenders: explicitly delete individual state signal files (result.json, agent-status, session-id, ready) by path

Root cause

SQLite without `AUTOINCREMENT` reuses run rowids when prior runs are deleted. Test cleanup (`run.destroy()`) does exactly this, so the next test can land on an id whose `runs_dir/<id>/state` is still bind-mounted by the prior container (which sits in the `Waiting` phase for 5 minutes after exit, since the test page is still subscribed via WS).
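The rowid-reuse behavior is easy to reproduce with Python's sqlite3 module (the `runs` table here is a hypothetical stand-in for the real schema):

```python
import sqlite3

# Without AUTOINCREMENT, SQLite picks max(rowid)+1 for new rows, so
# deleting the highest-id row hands its id to the next insert.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO runs (name) VALUES ('run-a')")  # id = 1
db.execute("INSERT INTO runs (name) VALUES ('run-b')")  # id = 2
db.execute("DELETE FROM runs WHERE id = 2")             # test cleanup (run.destroy())
cur = db.execute("INSERT INTO runs (name) VALUES ('run-c')")
print(cur.lastrowid)  # 2 — the deleted run's id is reused
```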

`setup_run_dir` runs before `remove_container`. While that bind mount is still alive, `del_dir_r` of the state dir fails with EBUSY — silently, because the result is discarded with `let _ = …`. The stale `result.json` survives. The new container starts, runs the hang scenario, blocks on `SleepingForever`, gets SIGKILL'd → exits 137. `read_outcome` then reads the prior run's `result.json` and reports its `exit_code`. If the prior run was `crash-fast` you get `exit 1 → state="failed"` (the test passes by luck). If the prior run exited 0 (default, env-echo, etc.) you get `exit 0 → state="succeeded"` and the hang test fails on `/cancelled|stopped|failed|errored/`.
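The belt-and-suspenders part of the fix relies on the fact that individual files inside a bind-mounted directory can still be unlinked even when the directory itself can't be removed. A Python sketch of that defensive layer (the actual fix lives in `worker.gleam`; `clear_state_signals` is an illustrative name):

```python
import os
import tempfile

# The state signal files named in this PR's summary.
SIGNAL_FILES = ["result.json", "agent-status", "session-id", "ready"]

def clear_state_signals(state_dir: str) -> None:
    """Delete each signal file by path. unlink works on files inside a
    bind-mounted directory even when rmdir on the directory fails with
    EBUSY, so stale results can't leak into the next run."""
    for name in SIGNAL_FILES:
        try:
            os.unlink(os.path.join(state_dir, name))
        except FileNotFoundError:
            pass  # nothing stale to clear

# Demo with a plain temp dir (a real bind mount isn't needed to show the API):
state_dir = tempfile.mkdtemp()
open(os.path.join(state_dir, "result.json"), "w").close()
clear_state_signals(state_dir)
print(os.listdir(state_dir))  # []
```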

Evidence

Captured on `debug/hang-flake` CI run `25166061230`. From run-3's state dir artifact:

```
quantico-debug.log: OUTCOME=SleepingForever ← current container (hang) blocked correctly
result.json: {"exit_code":1, ...} ← stale, from prior crash-fast container
```

The actor's `TRANSITION_TO_WAITING` saw `exit_code=1, err="agent exit 1"` despite `wait_container` returning `137` — confirming the result.json path was overriding the docker exit code.

Test plan

  • CI green on this branch (E2E with retries=2; should not flake)
  • Once merged, can drop `retries: 2` in playwright.config.ts back to 1 or 0 (separate PR)
  • Note: this also explains the much rarer `auto-scroll` flake — stale terminal state from a prior run that reused the id can leave the controller's pause-detection out of sync. Worth re-running auto-scroll a few times after merging to confirm it stabilizes.

🤖 Generated with Claude Code

fdatoo and others added 3 commits April 30, 2026 06:10
Root cause of the hang-test 'succeeded' flake: SQLite reuses run rowids
when a prior run is deleted (no AUTOINCREMENT). When test N+1 lands on
an id whose container from test N is still alive (Waiting phase, listener
still connected), the container's bind mount on runs_dir/<id>/state is
still active. setup_run_dir's del_dir_r then fails with EBUSY, the
directory survives, and result.json from the prior run leaks into the
new run. read_outcome reads the stale exit_code; if the prior scenario
exited 0, the new run is reported 'succeeded' regardless of what
actually happened in its container.

Two-part fix:
  1. Reorder do_mock_launch / do_real_launch to force-remove the prior
     container BEFORE setup_run_dir, so the bind mount is released and
     del_dir_r can clean the directory.
  2. As a defensive layer, explicitly delete the individual state signal
     files (result.json, agent-status, session-id, ready) by path. unlink
     works on individual files inside a bind-mounted directory even when
     the directory itself can't be removed — so even if a future bug
     re-introduces the ordering issue, read_outcome won't see stale data.

Verified via /tmp/fbi-runs-state artifacts on debug/hang-flake CI run
25166517481: run-3's quantico-debug.log said OUTCOME=SleepingForever
(correctly blocked) yet result.json had exit_code:1 — leftover from
the crash-fast run that previously held that id.

Three handler files left unformatted by recent merges; bring them in
line so this PR's CI passes the format check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two-actor-on-one-rowid bug — even with the worker.gleam state-dir fix,
a deleted run's actor sits in Waiting phase (5-min listener-keepalive
timeout) and still holds state.run_id = N. When SQLite reuses N for
the next test's run, the old container_monitor eventually reports
ContainerExited; the old actor processes it and calls
mark_finished(db, N, prior_outcome). DB row N is now the *new* run,
so the new run's state gets overwritten with the prior run's exit code
— same 'succeeded'-flake symptom we hit in the e2e hang test.

AUTOINCREMENT makes ids strictly grow. The old actor's UPDATE WHERE
id = N matches no row (the new run has id N+1+), so the leak becomes
a harmless no-op without changing anything else about the actor
lifecycle.
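
A minimal sketch of why the stale actor's write becomes a no-op once ids grow strictly (hypothetical schema and states):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY AUTOINCREMENT, state TEXT)")
db.execute("INSERT INTO runs (state) VALUES ('running')")  # id = 1
db.execute("DELETE FROM runs WHERE id = 1")                # run destroyed mid-test
db.execute("INSERT INTO runs (state) VALUES ('running')")  # new run gets id = 2, not 1

# The old actor's mark_finished targets id = 1, which no longer exists:
cur = db.execute("UPDATE runs SET state = 'failed' WHERE id = 1")
print(cur.rowcount)  # 0 — harmless no-op; the new run (id 2) is untouched
```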

SQLite doesn't support ALTER TABLE … ADD AUTOINCREMENT, so this
migration rebuilds the table. Manual sqlite_sequence DELETE+INSERT
seeds it to MAX(id) (sqlite_sequence has no UNIQUE on name, so
INSERT OR REPLACE doesn't dedupe).

Verified locally: insert id=1, id=2, delete 2, insert → id=3 (was 2
with the old schema).
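
The table rebuild plus the local verification above can be sketched with Python's sqlite3 (the `runs` table and its columns are illustrative, not the real schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Old schema (no AUTOINCREMENT) with existing rows to migrate.
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO runs (name) VALUES (?)", [("a",), ("b",)])

# SQLite has no ALTER TABLE ... ADD AUTOINCREMENT, so rebuild the table,
# then seed sqlite_sequence to MAX(id) with DELETE+INSERT (sqlite_sequence
# has no UNIQUE on name, so INSERT OR REPLACE would not dedupe).
db.executescript("""
    CREATE TABLE runs_new (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    INSERT INTO runs_new SELECT * FROM runs;
    DROP TABLE runs;
    ALTER TABLE runs_new RENAME TO runs;
    DELETE FROM sqlite_sequence WHERE name = 'runs';
    INSERT INTO sqlite_sequence (name, seq) SELECT 'runs', MAX(id) FROM runs;
""")

# Verification from the commit message: insert 1, 2, delete 2, insert -> 3.
db.execute("DELETE FROM runs WHERE id = 2")
cur = db.execute("INSERT INTO runs (name) VALUES ('c')")
print(cur.lastrowid)  # 3 (was 2 with the old schema)
```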

Belt-and-suspenders alongside worker.gleam's state-dir cleanup; both
fixes target the same root cause (id reuse + lingering old-run state)
from different angles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
