
fix: hang test 'succeeded' flake — stale state dir from reused run id #6

Open

fdatoo wants to merge 3 commits into `main` from `fix/hang-state-dir-bind-race`

Conversation


fdatoo (Owner) commented Apr 30, 2026

Summary

  • Reorder do_mock_launch / do_real_launch to force-remove the prior container before setup_run_dir
  • Belt-and-suspenders: explicitly delete individual state signal files (result.json, agent-status, session-id, ready) by path

Root cause

SQLite without `AUTOINCREMENT` reuses run rowids when prior runs are deleted. Test cleanup (`run.destroy()`) does exactly this, so the next test can land on an id whose `runs_dir/<id>/state` is still bind-mounted by the prior container (which sits in the `Waiting` phase for 5 minutes after exit, since the test page is still subscribed via WS).
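The rowid-reuse behavior is easy to reproduce with Python's sqlite3 module (the `runs` table here is a hypothetical stand-in for the real schema):

```python
import sqlite3

# Without AUTOINCREMENT, SQLite picks max(rowid)+1 for new rows, so
# deleting the highest-id row hands its id to the next insert.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO runs (name) VALUES ('run-a')")  # id = 1
db.execute("INSERT INTO runs (name) VALUES ('run-b')")  # id = 2
db.execute("DELETE FROM runs WHERE id = 2")             # test cleanup (run.destroy())
cur = db.execute("INSERT INTO runs (name) VALUES ('run-c')")
print(cur.lastrowid)  # 2 — the deleted run's id is reused
```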

`setup_run_dir` runs before `remove_container`. While that bind mount is still alive, `del_dir_r` of the state dir fails with EBUSY — silently, because the result is discarded with `let _ = …`. The stale `result.json` survives. The new container starts, runs the hang scenario, blocks on `SleepingForever`, gets SIGKILL'd → exits 137. `read_outcome` then reads the prior run's `result.json` and reports its `exit_code`. If the prior run was `crash-fast` you get `exit 1 → state="failed"` (the test passes by luck). If the prior run exited 0 (default, env-echo, etc.) you get `exit 0 → state="succeeded"` and the hang test fails on `/cancelled|stopped|failed|errored/`.
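The belt-and-suspenders part of the fix relies on the fact that individual files inside a bind-mounted directory can still be unlinked even when the directory itself can't be removed. A Python sketch of that defensive layer (the actual fix lives in `worker.gleam`; `clear_state_signals` is an illustrative name):

```python
import os
import tempfile

# The state signal files named in this PR's summary.
SIGNAL_FILES = ["result.json", "agent-status", "session-id", "ready"]

def clear_state_signals(state_dir: str) -> None:
    """Delete each signal file by path. unlink works on files inside a
    bind-mounted directory even when rmdir on the directory fails with
    EBUSY, so stale results can't leak into the next run."""
    for name in SIGNAL_FILES:
        try:
            os.unlink(os.path.join(state_dir, name))
        except FileNotFoundError:
            pass  # nothing stale to clear

# Demo with a plain temp dir (a real bind mount isn't needed to show the API):
state_dir = tempfile.mkdtemp()
open(os.path.join(state_dir, "result.json"), "w").close()
clear_state_signals(state_dir)
print(os.listdir(state_dir))  # []
```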

Evidence

Captured on `debug/hang-flake` CI run `25166061230`. From run-3's state dir artifact:

```
quantico-debug.log: OUTCOME=SleepingForever ← current container (hang) blocked correctly
result.json: {"exit_code":1, ...} ← stale, from prior crash-fast container
```

The actor's `TRANSITION_TO_WAITING` saw `exit_code=1, err="agent exit 1"` despite `wait_container` returning `137` — confirming the result.json path was overriding the docker exit code.

Test plan

  • CI green on this branch (E2E with retries=2; should not flake)
  • Once merged, can drop `retries: 2` in playwright.config.ts back to 1 or 0 (separate PR)
  • Note: this also explains the much rarer `auto-scroll` flake — stale terminal state from a prior run that reused the id can leave the controller's pause-detection out of sync. Worth re-running auto-scroll a few times after merging to confirm it stabilizes.

🤖 Generated with Claude Code

fdatoo and others added 3 commits April 30, 2026 06:10
Root cause of the hang-test 'succeeded' flake: SQLite reuses run rowids
when a prior run is deleted (no AUTOINCREMENT). When test N+1 lands on
an id whose container from test N is still alive (Waiting phase, listener
still connected), the container's bind mount on runs_dir/<id>/state is
still active. setup_run_dir's del_dir_r then fails with EBUSY, the
directory survives, and result.json from the prior run leaks into the
new run. read_outcome reads the stale exit_code; if the prior scenario
exited 0, the new run is reported 'succeeded' regardless of what
actually happened in its container.

Two-part fix:
  1. Reorder do_mock_launch / do_real_launch to force-remove the prior
     container BEFORE setup_run_dir, so the bind mount is released and
     del_dir_r can clean the directory.
  2. As a defensive layer, explicitly delete the individual state signal
     files (result.json, agent-status, session-id, ready) by path. unlink
     works on individual files inside a bind-mounted directory even when
     the directory itself can't be removed — so even if a future bug
     re-introduces the ordering issue, read_outcome won't see stale data.

Verified via /tmp/fbi-runs-state artifacts on debug/hang-flake CI run
25166517481: run-3's quantico-debug.log said OUTCOME=SleepingForever
(correctly blocked) yet result.json had exit_code:1 — leftover from
the crash-fast run that previously held that id.

Three handler files left unformatted by recent merges; bring them in
line so this PR's CI passes the format check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two-actor-on-one-rowid bug — even with the worker.gleam state-dir fix,
a deleted run's actor sits in Waiting phase (5-min listener-keepalive
timeout) and still holds state.run_id = N. When SQLite reuses N for
the next test's run, the old container_monitor eventually reports
ContainerExited; the old actor processes it and calls
mark_finished(db, N, prior_outcome). DB row N is now the *new* run,
so the new run's state gets overwritten with the prior run's exit code
— same 'succeeded'-flake symptom we hit in the e2e hang test.

AUTOINCREMENT makes ids strictly grow. The old actor's UPDATE WHERE
id = N matches no row (the new run has id N+1+), so the leak becomes
a harmless no-op without changing anything else about the actor
lifecycle.
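
A minimal sketch of why the stale actor's write becomes a no-op once ids grow strictly (hypothetical schema and states):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY AUTOINCREMENT, state TEXT)")
db.execute("INSERT INTO runs (state) VALUES ('running')")  # id = 1
db.execute("DELETE FROM runs WHERE id = 1")                # run destroyed mid-test
db.execute("INSERT INTO runs (state) VALUES ('running')")  # new run gets id = 2, not 1

# The old actor's mark_finished targets id = 1, which no longer exists:
cur = db.execute("UPDATE runs SET state = 'failed' WHERE id = 1")
print(cur.rowcount)  # 0 — harmless no-op; the new run (id 2) is untouched
```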

SQLite doesn't support ALTER TABLE … ADD AUTOINCREMENT, so this
migration rebuilds the table. Manual sqlite_sequence DELETE+INSERT
seeds it to MAX(id) (sqlite_sequence has no UNIQUE on name, so
INSERT OR REPLACE doesn't dedupe).

Verified locally: insert id=1, id=2, delete 2, insert → id=3 (was 2
with the old schema).
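
The table rebuild plus the local verification above can be sketched with Python's sqlite3 (the `runs` table and its columns are illustrative, not the real schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Old schema (no AUTOINCREMENT) with existing rows to migrate.
db.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO runs (name) VALUES (?)", [("a",), ("b",)])

# SQLite has no ALTER TABLE ... ADD AUTOINCREMENT, so rebuild the table,
# then seed sqlite_sequence to MAX(id) with DELETE+INSERT (sqlite_sequence
# has no UNIQUE on name, so INSERT OR REPLACE would not dedupe).
db.executescript("""
    CREATE TABLE runs_new (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    INSERT INTO runs_new SELECT * FROM runs;
    DROP TABLE runs;
    ALTER TABLE runs_new RENAME TO runs;
    DELETE FROM sqlite_sequence WHERE name = 'runs';
    INSERT INTO sqlite_sequence (name, seq) SELECT 'runs', MAX(id) FROM runs;
""")

# Verification from the commit message: insert 1, 2, delete 2, insert -> 3.
db.execute("DELETE FROM runs WHERE id = 2")
cur = db.execute("INSERT INTO runs (name) VALUES ('c')")
print(cur.lastrowid)  # 3 (was 2 with the old schema)
```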

Belt-and-suspenders alongside worker.gleam's state-dir cleanup; both
fixes target the same root cause (id reuse + lingering old-run state)
from different angles.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
