Symptom
Under concurrent test execution (cargo nextest's per-test-process model, plus the cluster of 10+ zebrad/zcashd spawn tests), individual launches occasionally failed with:
```
zebrad launch_default: RpcReadinessTimeout {
    process_name: "zebrad",
    address: 127.0.0.1:17656,
    timeout: 30s,
    last_error: "reqwest::Error { ... ConnectionRefused ... }"
}
```
The launcher's stdout-marker scan returned successfully (the child printed an `Opened RPC endpoint at` line), but the subsequent RPC probe got `ConnectionRefused` for the full 30s readiness budget. Re-running the failing test alone always passed.
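To illustrate how a refused connection turns into a timeout, here is a minimal sketch of a readiness probe with a fixed budget. It is not the launcher's code: `wait_for_tcp_ready` is a hypothetical helper, and it probes with a plain TCP connect rather than an RPC call through `reqwest`.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the launcher's RPC probe: retry a plain TCP
/// connect until it succeeds or `budget` elapses, keeping the last error.
fn wait_for_tcp_ready(addr: SocketAddr, budget: Duration) -> Result<(), String> {
    let deadline = Instant::now() + budget;
    let mut last_error = String::from("no attempt made");
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_millis(200)) {
            Ok(_) => return Ok(()),
            // Remember why the most recent attempt failed (e.g. ConnectionRefused).
            Err(e) => last_error = format!("{:?}", e.kind()),
        }
        if Instant::now() >= deadline {
            // Only the last error survives, so the report reads as a timeout.
            return Err(last_error);
        }
        sleep(Duration::from_millis(50));
    }
}
```

If nothing ever listens on `addr`, every attempt is refused and the caller only learns about it once the whole budget is spent, which matches the shape of the report above.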
Root cause
`zcash_local_net::network::pick_unused_port` delegated to `portpicker::pick_unused_port`, which:
- Randomly samples a port in `15000..25000` (a 10 000-slot range).
- Binds a `TcpListener` to `UNSPECIFIED:port`, then immediately drops the listener.
- Returns the integer.
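The check-then-release pattern in these steps can be sketched as follows. This is an illustrative stand-in, not `portpicker`'s source; `looks_free` and `count_free_verdicts` are hypothetical names.

```rust
use std::net::{Ipv4Addr, TcpListener};

/// Check-then-release probe: bind, drop the listener, report the verdict.
fn looks_free(port: u16) -> bool {
    // The listener is a temporary, so the port is released again
    // before this function returns.
    TcpListener::bind((Ipv4Addr::UNSPECIFIED, port)).is_ok()
}

/// Returns how many of `callers` back-to-back probes judged `port` free.
fn count_free_verdicts(port: u16, callers: usize) -> usize {
    (0..callers).filter(|_| looks_free(port)).count()
}
```

Because nothing stays bound after the probe, several callers asking about the same port will normally all be told it is free. The verdict describes a moment that has already passed by the time the integer is used.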
There are two TOCTOU windows:
- Within a process (cargo test threaded model): after bind+drop returns, another concurrent caller can receive the same just-freed port.
- Across processes (nextest's per-test process model): up to ~40 independent random picks happen in parallel (10 zebrad tests × 4 ports each). Birthday-paradox collision rate in a 10 000-slot range:
  P(collision) ≈ 1 − exp(−40² / (2·10 000)) ≈ 7.7%
The racing loser's child process fails to bind, exits or stays silent, and the launcher's RPC probe times out — surfacing as `RpcReadinessTimeout` rather than the underlying bind failure.
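The birthday estimate used above is easy to re-evaluate for other test counts. A small sketch (the function name is hypothetical):

```rust
/// Birthday-bound approximation: P ≈ 1 − exp(−n² / (2·N)),
/// for `picks` independent uniform draws from `slots` values.
fn collision_probability(picks: u32, slots: u32) -> f64 {
    let n = f64::from(picks);
    let slots = f64::from(slots);
    1.0 - (-(n * n) / (2.0 * slots)).exp()
}
```

`collision_probability(40, 10_000)` evaluates to about 0.077, and doubling the load to 80 picks raises it to about 0.27, so the failure rate grows quickly as spawn tests are added.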
Fix
`network::pick_unused_port` was rewritten in commit `c2fb7a4` (branch `speed_zcash_local_net_up`) to use:
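- Kernel-assigned ephemeral allocation: `TcpListener::bind("127.0.0.1:0")`. The kernel chooses a port it considers free at bind time from its ephemeral range, instead of the caller guessing one in a fixed 10 000-slot window. Note that `TIME_WAIT` applies to closed connections, not to a listener dropped without accepting one, so the freed port gets no cooldown; cross-process collisions are limited to the short window described under "Residual cross-process race".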
- Process-local registry: `LazyLock<Mutex<HashSet>>` of every port handed out, guaranteeing no two concurrent in-process callers ever receive the same port. Lock poisoning is tolerated (the inner state is consistent at every panic point).
Public signature is unchanged: `pick_unused_port(Option) -> u16`.
The `portpicker` workspace dependency was dropped.
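A minimal sketch of the two mechanisms combined is shown below. It is not the code from `c2fb7a4`: the `Option<u16>` parameter type, the registry name `RESERVED`, and the panic on a repeated fixed port are assumptions inferred from the description and the test names.

```rust
use std::collections::HashSet;
use std::net::TcpListener;
use std::sync::{LazyLock, Mutex, PoisonError};

/// Every port this process has handed out so far (name is hypothetical).
static RESERVED: LazyLock<Mutex<HashSet<u16>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

/// Sketch only: the parameter type and panic behaviour are assumptions.
fn pick_unused_port(fixed: Option<u16>) -> u16 {
    // Tolerate poisoning: the set is only changed by single `insert` calls,
    // so it is consistent even if a holder of the lock panicked.
    let mut reserved = RESERVED.lock().unwrap_or_else(PoisonError::into_inner);
    if let Some(port) = fixed {
        if !reserved.insert(port) {
            panic!("port {port} is already reserved in this process");
        }
        return port;
    }
    loop {
        // Port 0 asks the kernel to choose a free ephemeral port.
        let listener = TcpListener::bind("127.0.0.1:0").expect("bind 127.0.0.1:0");
        let port = listener.local_addr().expect("local_addr").port();
        // `insert` returns false if this process already handed the port out;
        // in that case the listener is dropped and the kernel is asked again.
        if reserved.insert(port) {
            return port;
        }
    }
}
```

Holding the lock across the bind keeps the sketch simple at the cost of serializing callers. The registry never releases ports, which is acceptable for a short-lived test process but would need a release path in a long-running one.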
Regression tests (in `zcash_local_net::network::tests`)
| Test | Coverage |
| --- | --- |
| `pick_unused_port_returns_unique_ports_under_concurrency` | 32 threads × 8 picks: every returned port is unique |
| `pick_unused_port_no_collisions_repeated` | 32 trials × 8 threads × 4 picks: zero duplicates anywhere |
| `returned_ports_are_bindable` | Each returned port can be successfully `TcpListener::bind`'d |
| `simulated_concurrent_spawns_have_unique_bindable_ports` | Mirrors zebrad shape: 12 simulated spawns × 4 ports each, bound concurrently |
| `fixed_port_double_reservation_panics` | Registry rejects double-reservation of a fixed port |
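The uniqueness checks above share one shape: spawn threads, collect every pick, and compare the total against the number of distinct values. A sketch of that shape follows; it is not the crate's test code, and `counter_picker` is a hypothetical stand-in so the example compiles on its own.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU16, Ordering};
use std::thread;

/// Runs `threads` threads that each call `pick` `picks_per_thread` times and
/// returns (total picks, distinct ports seen).
fn picks_and_distinct(
    pick: fn() -> u16,
    threads: usize,
    picks_per_thread: usize,
) -> (usize, usize) {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || (0..picks_per_thread).map(|_| pick()).collect::<Vec<u16>>())
        })
        .collect();
    let all: Vec<u16> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("picker thread panicked"))
        .collect();
    let distinct: HashSet<u16> = all.iter().copied().collect();
    (all.len(), distinct.len())
}

/// Hypothetical stand-in picker; a real test would pass the crate's picker.
fn counter_picker() -> u16 {
    static NEXT: AtomicU16 = AtomicU16::new(20_000);
    NEXT.fetch_add(1, Ordering::Relaxed)
}
```

A test in this style asserts that both numbers are equal, for example 256 and 256 for 32 threads × 8 picks. Any duplicate makes the distinct count fall short of the total.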
Residual cross-process race
A small race window still exists: between `bind(0); drop(listener)` returning and the child process binding, another test process could in theory grab the just-freed port. The window is short, so this should be rare in practice, but nothing prevents it: a listener dropped without accepting a connection never enters `TIME_WAIT`, so the port is immediately reusable. Closing it fully would require either:
- Holding a `SO_REUSEADDR`/`SO_REUSEPORT`-bound listener across the spawn (children would need matching support — zebrad/zcashd do not set these by default).
- A file-locked cross-process registry under e.g. `$TMPDIR/zcash_local_net.ports/`.
Neither is necessary at the current test scale; both are recorded here for future reference.
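For the second option, one possible shape is sketched below. It swaps advisory file locking for exclusive file creation (`create_new`), which the filesystem performs atomically; the function names and the one-file-per-port layout are assumptions, and stale claims left behind by crashed processes are not handled.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::Path;

/// Tries to claim `port` by atomically creating `<registry_dir>/<port>`.
/// Returns Ok(true) if this call won the claim, Ok(false) if it was taken.
fn try_claim_port(registry_dir: &Path, port: u16) -> io::Result<bool> {
    fs::create_dir_all(registry_dir)?;
    match OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists if the file is already there
        .open(registry_dir.join(port.to_string()))
    {
        Ok(_) => Ok(true),
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}

/// Releases a claim; a missing file is treated as already released.
fn release_port(registry_dir: &Path, port: u16) -> io::Result<()> {
    match fs::remove_file(registry_dir.join(port.to_string())) {
        Err(e) if e.kind() != io::ErrorKind::NotFound => Err(e),
        _ => Ok(()),
    }
}
```

A picker built on this would retry with a fresh kernel-assigned port whenever `try_claim_port` returns `Ok(false)`, and release the claim once the child process has exited.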