test(gateway): de-flake integration tests via shared helper #407

Merged
POPPIN-FUMI merged 2 commits into main from fix/gateway-test-flakiness on Jun 6, 2026

@POPPIN-FUMI (Contributor)

Problem

The cli/test/integration/gateway_*.test.ts suite intermittently fails CI with:

gateway did not become healthy on :<port> within 10000ms (last: ... Connection refused)

This is the flake that has been forcing reruns on essentially every PR. Two root causes:

  1. Random port collisions — each test picked 30000 + random*10000 with no free-port check; an in-use port → the subprocess fails to bind → the test polls a dead process for the full 10s.
  2. Cold subprocess start — each test spawns deno run -A --no-check src/index.ts gateway run, which recompiles the whole CLI module graph; under the parallel load of all 5 files a single cold start can exceed the 10s deadline on loaded CI runners.

Fix

Extracted the duplicated spawn/health/port/cleanup plumbing (copy-pasted across 5 files) into cli/test/integration/_gateway_helpers.ts, and fixed both root causes there with four changes:

  • pickPort() binds 127.0.0.1:0 and returns the OS-assigned free port — no more collisions.
  • waitForHealthz watches child.status and fails fast (surfacing stderr) if the subprocess exits before /healthz responds, instead of polling a dead process until the deadline.
  • Timeout 10s → 30s to absorb cold-start variance.
  • Warm the Deno module cache once per run (deno cache <entry>) so every spawn boots from a warm graph — the decisive fix under parallel load.
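
As a rough illustration of the first bullet, here is a minimal sketch of an OS-assigned port picker. It is not the code from _gateway_helpers.ts, which this description does not show. As an assumption, it is written against node:net (which Deno also supports) so that it runs under either runtime; the real helper may well use Deno's own listener API instead.

```typescript
import { createServer } from "node:net";
import type { AddressInfo } from "node:net";

// Ask the OS for a free loopback port by binding port 0, then release it.
// Sketch only: the body is an assumption, not the PR's implementation.
function pickPort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once("error", reject);
    // Port 0 tells the OS to choose any currently unused port.
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as AddressInfo;
      // Close before resolving so the gateway subprocess can bind the port.
      server.close((err) => (err ? reject(err) : resolve(port)));
    });
  });
}
```

The port is only known to be free at the moment of the probe. Another process could in principle claim it before the gateway binds, though that window is far narrower than with a blind random pick.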

The 5 test files now import the helper and drop their local duplicates. No test assertions or scenarios were changed — only the plumbing.
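
The fail-fast wait from the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the helper's actual code: the probe and exited parameters, the ExitInfo shape, and the 100 ms poll interval are all hypothetical, chosen so the loop stays runtime-agnostic. In Deno, exited could plausibly be derived from child.status plus the collected stderr, and probe could wrap a fetch of /healthz.

```typescript
// Outcome of the subprocess once it has exited (hypothetical shape).
interface ExitInfo {
  code: number;
  stderr: string;
}

interface WaitOptions {
  probe: () => Promise<boolean>; // resolves true once /healthz answers OK
  exited: Promise<ExitInfo>; // settles when the subprocess exits
  timeoutMs?: number; // overall deadline (30s, matching the PR)
  intervalMs?: number; // delay between probes (assumed value)
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForHealthz(opts: WaitOptions): Promise<void> {
  const { probe, exited, timeoutMs = 30000, intervalMs = 100 } = opts;
  const deadline = Date.now() + timeoutMs;
  let lastError = "no probe attempted";

  while (Date.now() < deadline) {
    // Non-blocking check: `exited` wins the race only if it has already
    // settled; otherwise the resolved `undefined` wins immediately.
    const early = await Promise.race([exited, Promise.resolve(undefined)]);
    if (early) {
      throw new Error(
        `gateway exited with code ${early.code} before /healthz: ${early.stderr}`,
      );
    }
    try {
      if (await probe()) return;
      lastError = "unhealthy response";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
    // Wake early if the subprocess dies while we are waiting.
    await Promise.race([sleep(intervalMs), exited]);
  }
  throw new Error(
    `gateway did not become healthy within ${timeoutMs}ms (last: ${lastError})`,
  );
}
```

With this shape, a bind failure surfaces the subprocess's stderr on the next loop iteration rather than after the full deadline, which is the behaviour the waitForHealthz bullet describes.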

Testing

cd cli && deno test -A --config=deno.json test/integration/gateway_*.test.ts

18 passed / 0 failed, run multiple times consecutively, all green. deno fmt --check and deno check are clean on all 6 files (the 5 test files plus the helper).

🤖 Generated with Claude Code

POPPIN-FUMI and others added 2 commits June 6, 2026 21:50
The gateway integration tests spawned `slv gateway run` as a subprocess on a
RANDOM port and polled /healthz for 10s. Two flake sources in CI:
- a random port could already be in use → bind fail → the test polled a dead
  process for the full 10s;
- a cold `deno run --no-check src/index.ts` recompiles the whole CLI graph, and
  under the parallel load of all 5 files a single start could exceed 10s.

Extract the duplicated plumbing into `cli/test/integration/_gateway_helpers.ts`
and fix both with four changes:
- `pickPort()` returns an OS-assigned free loopback port (bind :0) — no collisions;
- `waitForHealthz` watches `child.status` and fails fast with stderr if the
  subprocess exits before healthz, instead of polling a dead process;
- default healthz timeout 10s → 30s;
- warm the Deno module cache once per run (`deno cache <entry>`) so every spawn
  boots from a warm graph.

No test assertions/scenarios changed. Local run: 18 passed / 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Review follow-up: homePrefix was exposed on spawnGateway/startGateway but never
passed by any caller. Drop it; temp dirs use a single 'slv-gw-it-' prefix (each
makeTempDir is already unique).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@POPPIN-FUMI POPPIN-FUMI merged commit 6fcee0f into main Jun 6, 2026
3 checks passed
@POPPIN-FUMI POPPIN-FUMI deleted the fix/gateway-test-flakiness branch June 6, 2026 13:02