port allocation race in network::pick_unused_port caused intermittent test-spawn flakes #253

@zancas

Symptom

Under concurrent test execution (cargo nextest's per-test-process model, plus the cluster of 10+ zebrad/zcashd spawn tests), individual launches occasionally failed with:

```
zebrad launch_default: RpcReadinessTimeout {
process_name: "zebrad",
address: 127.0.0.1:17656,
timeout: 30s,
last_error: "reqwest::Error { ... ConnectionRefused ... }"
}
```

The launcher's stdout-marker scan returned successfully (the child printed an `Opened RPC endpoint at` line), but the subsequent RPC probe got `ConnectionRefused` for the full 30s readiness budget. Re-running the failing test alone always passed.

Root cause

`zcash_local_net::network::pick_unused_port` delegated to `portpicker::pick_unused_port`, which:

  1. Randomly samples a port in `15000..25000` (a 10 000-slot range).
  2. Binds a `TcpListener` to `UNSPECIFIED:port`, then immediately drops the listener.
  3. Returns the integer.
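The three steps above can be sketched with the standard library alone. This is an illustration of the probe-then-release pattern, not `portpicker`'s actual source: `random_u64` and `probe_unused_port` are hypothetical names, and the randomness is borrowed from `RandomState` because std has no RNG API.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::net::{Ipv4Addr, TcpListener};

/// Hypothetical stand-in for a random source: each `RandomState` is created
/// with fresh random keys, so hashing nothing still yields a varying value.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Probe-style picker (illustrative only): sample, bind, drop, return.
fn probe_unused_port() -> Option<u16> {
    for _ in 0..10 {
        // Step 1: sample a port in 15000..25000.
        let port = 15000 + (random_u64() % 10_000) as u16;
        // Step 2: the bind succeeds only if nothing holds the port right now.
        if let Ok(listener) = TcpListener::bind((Ipv4Addr::UNSPECIFIED, port)) {
            // The "reservation" ends here; from this point any other caller
            // or process may bind the same port.
            drop(listener);
            // Step 3: only the integer is returned.
            return Some(port);
        }
    }
    None
}
```

Binding the returned port a second time from the same process normally succeeds, which is exactly the problem: nothing ties the returned integer to its caller.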

There are two TOCTOU windows:

  • Within a process (cargo test threaded model): bind+drop returns, then another concurrent caller can receive the same just-freed port.

  • Across processes (nextest's per-test process model): up to ~40 independent random picks happen in parallel (10 zebrad tests × 4 ports each). Birthday-paradox collision rate in a 10 000-slot range:

    P(collision) ≈ 1 − exp(−40² / (2·10 000)) = 1 − exp(−0.08) ≈ 7.7%
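The estimate can be checked numerically. The sketch below assumes the same model as the estimate (independent, uniformly distributed picks); both function names are illustrative.

```rust
/// Birthday-paradox approximation: 1 - exp(-picks^2 / (2 * slots)).
fn collision_probability_approx(picks: u32, slots: u32) -> f64 {
    let n = f64::from(picks);
    1.0 - (-(n * n) / (2.0 * f64::from(slots))).exp()
}

/// Exact value under the same model: 1 - prod_{i=0}^{picks-1} (1 - i/slots).
fn collision_probability_exact(picks: u32, slots: u32) -> f64 {
    let no_collision: f64 = (0..picks)
        .map(|i| 1.0 - f64::from(i) / f64::from(slots))
        .product();
    1.0 - no_collision
}
```

For 40 picks over 10 000 slots the approximation gives about 0.077 and the exact product about 0.075, so roughly one full parallel run in thirteen would be expected to contain at least one collision.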

The racing loser's child process fails to bind, exits or stays silent, and the launcher's RPC probe times out — surfacing as `RpcReadinessTimeout` rather than the underlying bind failure.

Fix

`network::pick_unused_port` rewritten in commit `c2fb7a4` (branch `speed_zcash_local_net_up`) to use:

  1. Kernel-assigned ephemeral allocation: `TcpListener::bind("127.0.0.1:0")`. Linux's port allocator issues sequential ports with TIME_WAIT cooldown; even cross-process collisions become vanishingly rare.
  2. Process-local registry: `LazyLock<Mutex<HashSet>>` of every port handed out, guaranteeing no two concurrent in-process callers ever receive the same port. Lock poisoning is tolerated (the inner state is consistent at every panic point).

Public signature is unchanged: `pick_unused_port(Option) -> u16`.

The `portpicker` workspace dependency was dropped.
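A minimal sketch of the two-part approach, for readers who do not have the commit at hand. It is not the code from `c2fb7a4`: the name `pick_unused_port_sketch`, the `Option<u16>` parameter type, and the panic message are assumptions made for illustration.

```rust
use std::collections::HashSet;
use std::net::TcpListener;
use std::sync::{LazyLock, Mutex, PoisonError};

/// Every port this process has handed out so far.
static HANDED_OUT: LazyLock<Mutex<HashSet<u16>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

/// Hypothetical sketch; `Option<u16>` is an assumed parameter type.
fn pick_unused_port_sketch(fixed: Option<u16>) -> u16 {
    // Tolerate poisoning: the set is only changed by single `insert` calls,
    // so it is consistent even if a previous holder of the lock panicked.
    let mut handed_out = HANDED_OUT
        .lock()
        .unwrap_or_else(PoisonError::into_inner);

    if let Some(port) = fixed {
        // A fixed port may be reserved once; a second request panics.
        assert!(handed_out.insert(port), "port {port} already reserved");
        return port;
    }

    loop {
        // Port 0 asks the kernel for a currently free ephemeral port.
        let listener = TcpListener::bind("127.0.0.1:0").expect("bind ephemeral port");
        let port = listener.local_addr().expect("read local address").port();
        // `insert` returns false if this process already handed the port out,
        // in which case the listener is dropped and the kernel is asked again.
        if handed_out.insert(port) {
            return port;
        }
    }
}
```

The registry in this sketch only ever grows, which is acceptable for a short-lived test process; a long-running caller would need a way to release ports.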

Regression tests (in `zcash_local_net::network::tests`)

| Test | Coverage |
|---|---|
| `pick_unused_port_returns_unique_ports_under_concurrency` | 32 threads × 8 picks: every returned port is unique |
| `pick_unused_port_no_collisions_repeated` | 32 trials × 8 threads × 4 picks: zero duplicates anywhere |
| `returned_ports_are_bindable` | Each returned port can be successfully `TcpListener::bind`'d |
| `simulated_concurrent_spawns_have_unique_bindable_ports` | Mirrors zebrad shape: 12 simulated spawns × 4 ports each, bound concurrently |
| `fixed_port_double_reservation_panics` | Registry rejects double-reservation of a fixed port |
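The shape of the concurrency tests can be sketched as follows. This is not the test code from the repository: `pick_port` is a compact stand-in for the picker under test, and `collect_ports_concurrently` and `all_unique` are hypothetical helpers.

```rust
use std::collections::HashSet;
use std::net::TcpListener;
use std::sync::{LazyLock, Mutex, PoisonError};
use std::thread;

static HANDED_OUT: LazyLock<Mutex<HashSet<u16>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

/// Compact stand-in for the picker under test (illustrative only).
fn pick_port() -> u16 {
    let mut seen = HANDED_OUT.lock().unwrap_or_else(PoisonError::into_inner);
    loop {
        let port = TcpListener::bind("127.0.0.1:0")
            .and_then(|listener| listener.local_addr())
            .expect("ephemeral bind")
            .port();
        if seen.insert(port) {
            return port;
        }
    }
}

/// Spawns `threads` workers that each take `picks_per_thread` ports and
/// returns every port handed out.
fn collect_ports_concurrently(threads: usize, picks_per_thread: usize) -> Vec<u16> {
    thread::scope(|scope| {
        // Spawn all workers before joining any, so the picks really overlap.
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                scope.spawn(move || {
                    (0..picks_per_thread).map(|_| pick_port()).collect::<Vec<u16>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}

/// True if no port appears twice.
fn all_unique(ports: &[u16]) -> bool {
    let mut seen = HashSet::new();
    ports.iter().all(|port| seen.insert(*port))
}
```

A test in the style of the first row would call `collect_ports_concurrently(32, 8)` and assert that `all_unique` holds for the 256 returned ports.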

Residual cross-process race

A small race window still exists: between `bind(0); drop(listener)` returning and the child process binding, another test process could in theory grab the just-freed port. Linux's TIME_WAIT cooldown makes this vanishingly rare in practice but not impossible. Closing it fully would require either:

  • Holding a `SO_REUSEADDR`/`SO_REUSEPORT`-bound listener across the spawn (children would need matching support — zebrad/zcashd do not set these by default).
  • A file-locked cross-process registry under e.g. `$TMPDIR/zcash_local_net.ports/`.

Neither is necessary at the current test scale; recording for future reference.
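For the record, the second option could look roughly like the sketch below. It swaps in a simpler technique than file locking: each port is claimed by atomically creating a marker file with `create_new`, which fails if the file already exists. `PortClaim` and `try_claim_port` are hypothetical names, the directory is supplied by the caller, and stale markers left by a killed process are not handled.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical claim guard: dropping it removes the marker file and
/// thereby releases the port for other processes.
struct PortClaim {
    path: PathBuf,
}

impl Drop for PortClaim {
    fn drop(&mut self) {
        // Best effort; a missing marker is not an error worth surfacing.
        let _ = fs::remove_file(&self.path);
    }
}

/// Tries to claim `port` by atomically creating `<dir>/<port>`.
/// Returns `Ok(None)` when another claimant already holds the marker.
fn try_claim_port(dir: &Path, port: u16) -> io::Result<Option<PortClaim>> {
    fs::create_dir_all(dir)?;
    let path = dir.join(port.to_string());
    // `create_new` fails with `AlreadyExists` instead of opening an
    // existing file, so at most one claimant can succeed per marker.
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(_file) => Ok(Some(PortClaim { path })),
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(None),
        Err(e) => Err(e),
    }
}
```

A picker built on this would retry with a fresh kernel-assigned port whenever `try_claim_port` returns `Ok(None)`, and keep the `PortClaim` alive until the child process has bound the port.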
