UDP send optimizations #20

Open

endel wants to merge 5 commits into main from udp-send-optimizations

Conversation


@endel endel commented Apr 15, 2026

endel added 5 commits April 15, 2026 15:55
Three coordinated send-path optimizations following Cloudflare's "Accelerating
UDP packet transmission for QUIC" post, each gated by an env var so we can
bisect regressions without rebuilding:

- `SendBatch.flush` on Linux now uses `sendmmsg` to batch packets grouped by
  ECN mark: ~60× fewer syscalls for bulk transfer (see the C sketch after
  this list). On a partial send, the remaining packets are dropped silently
  with a rate-limited warning. macOS/Windows stay on the per-packet
  `sendmsg` loop. Kill switch: `QUIC_ZIG_NO_SENDMMSG=1`.
- Layered on top, UDP Generic Segmentation Offload (`UDP_SEGMENT`) coalesces
  same-peer same-size contiguous packets into one GSO super-buffer per
  mmsghdr entry. Default **off** because zig-client → neqo-server transfer
  regresses over ns-3 veth (26/27 combos green; see SPEC/interop-results.md).
  Opt-in: `QUIC_ZIG_ENABLE_GSO=1`.
- The pacer is already gated inside `conn.send` and folded into
  `nextTimeoutNs`; this commit just adds a symmetric kill switch for
  bisection parity: `QUIC_ZIG_NO_PACING=1`.
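
A minimal C sketch of the run-batching idea, for readers unfamiliar with
`sendmmsg(2)` (the helper and struct names are illustrative, not from the
Zig source, and it's IPv4-only for brevity): packets are grouped into runs
sharing an ECN mark, the mark is applied once per run via `IP_TOS`, and the
whole run goes out in one syscall.

```c
#define _GNU_SOURCE             /* sendmmsg, struct mmsghdr */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* One queued packet; `ecn` is the 2-bit ECN codepoint (low bits of TOS). */
struct queued_pkt {
    struct iovec payload;
    struct sockaddr_in peer;
    int ecn;
};

/* Flush pkts[0..n) with one sendmmsg(2) per run of equal ECN marks,
 * instead of one sendmsg(2) per packet. */
static int flush_batch(int fd, struct queued_pkt *pkts, size_t n)
{
    struct mmsghdr msgs[64];
    size_t i = 0;

    while (i < n) {
        /* Extend the run while the ECN mark stays the same. */
        size_t run = 1;
        while (i + run < n && run < 64 && pkts[i + run].ecn == pkts[i].ecn)
            run++;

        /* Apply the mark once for the whole run. */
        if (setsockopt(fd, IPPROTO_IP, IP_TOS, &pkts[i].ecn,
                       sizeof pkts[i].ecn) < 0)
            return -1;

        for (size_t j = 0; j < run; j++) {
            memset(&msgs[j], 0, sizeof msgs[j]);
            msgs[j].msg_hdr.msg_name    = &pkts[i + j].peer;
            msgs[j].msg_hdr.msg_namelen = sizeof pkts[i + j].peer;
            msgs[j].msg_hdr.msg_iov     = &pkts[i + j].payload;
            msgs[j].msg_hdr.msg_iovlen  = 1;
        }

        int sent = sendmmsg(fd, msgs, (unsigned int)run, 0);
        if (sent < 0)
            return -1;
        if ((size_t)sent < run) /* partial send: drop the rest, warn */
            fprintf(stderr, "sendmmsg: sent %d of %zu\n", sent, run);
        i += run;
    }
    return 0;
}
```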

Validated via quic-interop-runner against quic-go and neqo in both directions
(handshake, transfer, chacha20, multiplexing, longrtt, http3, keyupdate):
14/14 pass with defaults, matching the 2026-03-24 baseline.

Prerequisite for Plan 4b (SO_TXTIME kernel pacing): the SCM_TXTIME cmsg
requires a monotonic timestamp, so `Pacer.last_sent_time` now lives on
CLOCK_MONOTONIC instead of CLOCK_REALTIME. Other subsystems (loss detection,
PTO, idle timeout) stay on the existing clock since they only consume
durations and are insensitive to the clock choice.

- New `clock.monoNanos()` wraps `clock_gettime(CLOCK_MONOTONIC)` with a
  Windows fallback to `nanoTimestamp()` (see the C sketch after this list).
- `conn.send()` reads both clocks; the three pacer call sites
  (`timeUntilSend`, `onPacketSent`, and the `nextTimeoutNs` pacer branch)
  now consume the monotonic value.
- `nextTimeoutNs` computes the pacer delay on the monotonic clock but
  returns a REALTIME-based deadline so it remains comparable to the loss
  and idle deadlines the event loop also collects.
- Pacer doc comment spells out the clock contract.
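
For reference, a minimal C rendering of the two clock reads described above
(the counterpart of the Zig-side `clock.monoNanos()` /
`std.time.nanoTimestamp()` pair; this sketch skips the Windows fallback):

```c
#include <stdint.h>
#include <time.h>

/* Monotonic nanoseconds: immune to NTP steps; what the pacer now consumes. */
static int64_t mono_nanos(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Wall-clock (REALTIME) nanoseconds: what loss detection, PTO, and idle
 * timeout keep using, since they only consume durations between readings. */
static int64_t real_nanos(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}
```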

Validated via quic-interop-runner: 27/28 combos pass (quic-go + neqo both
directions); the single remaining failure is zig-client → neqo-server
chacha20, which is pre-existing in the 2026-03-24 baseline.

Plan 4b: when `QUIC_ZIG_ENABLE_TXTIME=1`, `conn.send()` keeps producing
packets where the user-space pacer would have blocked, stamping each one
with a CLOCK_MONOTONIC target transmission time. SendBatch attaches the
timestamp as an SCM_TXTIME cmsg so the kernel's fq qdisc releases the packet
at the right time, moving pacing from user-space sleeps to a single syscall.

Wiring:
- `ecn_socket.zig`: SO_TXTIME / SCM_TXTIME constants, `probeTxtimeSupport`
  at `SendBatch.init`, new `addTxtime(...)` API, `txtimes[]` per-packet
  array, combined cmsg layout (UDP_SEGMENT + SCM_TXTIME stacked, 48 B per
  entry; sketched in C after this list), GSO grouping respects timestamp
  identity within a super-buffer.
- `connection.zig`: `last_target_txtime` field set when the pacer would
  block and kernel pacing is enabled; `isKernelPacingEnabled()` env cache
  mirrors the pattern used by `isPacingDisabled`.
- `event_loop.zig`: send-loop sites pass `conn.last_target_txtime` to
  `batch.addTxtime` (default 0 = "send now").
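
A minimal C sketch of the kernel API this wiring targets, assuming 64-bit
Linux (helper names are illustrative): opt the socket into SO_TXTIME once,
then stack UDP_SEGMENT and SCM_TXTIME in one control buffer, which is where
the 48 B per entry comes from (CMSG_SPACE(2) + CMSG_SPACE(8) = 24 + 24).

```c
#include <linux/net_tstamp.h>   /* struct sock_txtime */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61            /* asm-generic value */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103         /* linux/udp.h */
#endif

/* Opt in once per socket: timestamps are CLOCK_MONOTONIC, matching what the
 * pacer stamps. Roughly what a probe like probeTxtimeSupport would attempt. */
static int enable_txtime(int fd)
{
    struct sock_txtime cfg = { .clockid = CLOCK_MONOTONIC, .flags = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof cfg);
}

/* Stack UDP_SEGMENT (u16 GSO size) + SCM_TXTIME (u64 ns) into one control
 * buffer; cbuf_len must be at least 48. txtime_ns == 0 means "send now". */
static void attach_cmsgs(struct msghdr *msg, void *cbuf, size_t cbuf_len,
                         uint16_t gso_size, uint64_t txtime_ns)
{
    msg->msg_control = cbuf;
    msg->msg_controllen = cbuf_len;

    struct cmsghdr *cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type  = UDP_SEGMENT;
    cm->cmsg_len   = CMSG_LEN(sizeof gso_size);
    memcpy(CMSG_DATA(cm), &gso_size, sizeof gso_size);

    cm = CMSG_NXTHDR(msg, cm);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_TXTIME;
    cm->cmsg_len   = CMSG_LEN(sizeof txtime_ns);
    memcpy(CMSG_DATA(cm), &txtime_ns, sizeof txtime_ns);

    msg->msg_controllen = CMSG_SPACE(sizeof gso_size)
                        + CMSG_SPACE(sizeof txtime_ns);
}
```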

On non-fq egress paths (including ns-3 in the interop runner) the kernel
accepts the cmsg but ignores the timestamp — same observable behavior as
without TXTIME. Validated in both modes against quic-go and neqo, both
directions: no regressions in either default-off or opt-in mode. The
real-hardware throughput benefit only kicks in with
`tc qdisc add dev <iface> root fq`.

Plans 2 and 4b were prototyped end-to-end (commits 4a0ee92 + 3ae98a0) and
proved correct in the interop matrix, but they target a workload we don't
have: bulk CDN-style throughput. Our direction is real-time WebTransport
(low-latency datagrams, browser interop), where:

- GSO almost never groups (small variable-size datagrams), and the one
  failing combo (zig-client → neqo-server transfer over ns-3 veth) was a
  permanent maintenance tax behind an opt-in nobody would enable.
- SO_TXTIME moves the pacer wait from user space to the kernel — useful
  when the kernel actually paces (fq qdisc on production hosts), but the
  user-space pacer already produces correct behavior, and "release at time
  T" is the opposite of what real-time datagram workloads want.

Keeping:
- `sendmmsg` batching (default-on, Linux): one syscall per ECN-mark run.
  Latency-critical paths still use `sendDirect` (single-packet, bypasses
  the batch entirely), so this only helps bulk WT streams without
  penalizing real-time traffic.
- `QUIC_ZIG_NO_PACING=1` kill switch + send-loop doc comments from Plan 3
  (the cached-getenv pattern behind these switches is sketched below).
- `clock.monoNanos()` and the Pacer migration to CLOCK_MONOTONIC from
  Plan 4a, a standalone NTP-skew-resilience win, kept on its own merit.
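
All the env-var switches share one pattern (the `isKernelPacingEnabled()` /
`isPacingDisabled` cache noted in Plan 4b's wiring). A minimal C equivalent,
with an illustrative name, assuming the switch only honors the literal
value `1`:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Probe QUIC_ZIG_NO_PACING once and cache the verdict, so the kill switch
 * costs one getenv() per process instead of one per send-loop iteration. */
static bool is_pacing_disabled(void)
{
    static int cached = -1;     /* -1 = not yet probed */
    if (cached < 0) {
        const char *v = getenv("QUIC_ZIG_NO_PACING");
        cached = (v != NULL && strcmp(v, "1") == 0);
    }
    return cached == 1;
}
```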

353-line net deletion. `ecn_socket.zig`: 863 → 502 lines. Validated 28/28
across quic-go + neqo, both directions.

Pacer state lives on CLOCK_MONOTONIC (NTP-skew resilience); everything else
uses `std.time.nanoTimestamp()` (REALTIME). The boundary is crossed in
exactly one place, `Connection.nextTimeoutNs`, and that's the only function
readers need to understand to avoid future cross-clock comparison bugs (a C
sketch follows below).

Spells out who uses which clock and why, the three rules for adding new
clock-touching code, and why we chose not to migrate everything to MONOTONIC.
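
A sketch of that single crossing in C, reusing the illustrative clock
helpers from the Plan 4a notes above (the pacer stub and its constants are
invented for the example): the pacer delay is a MONOTONIC duration, and the
returned deadline is a REALTIME instant.

```c
#include <stdint.h>

/* Stub pacer state: the monotonic instant of the last send, plus the gap
 * the pacer enforces between sends (both illustrative). */
static int64_t last_sent_mono_ns;
static int64_t inter_send_gap_ns = 1250000; /* e.g. 1.25 ms */

/* Duration until the pacer allows the next send, computed purely on the
 * monotonic clock. Durations are clock-agnostic; instants are not. */
static int64_t pacer_time_until_send(int64_t now_mono_ns)
{
    int64_t ready_at = last_sent_mono_ns + inter_send_gap_ns;
    return ready_at > now_mono_ns ? ready_at - now_mono_ns : 0;
}

/* The one place the clocks meet: the delay is computed on CLOCK_MONOTONIC,
 * then added to a REALTIME "now" so the resulting deadline stays comparable
 * to the loss and idle deadlines the event loop also collects. Never
 * compare the two clocks' instants directly. */
static int64_t next_timeout_ns(int64_t now_real_ns, int64_t now_mono_ns)
{
    return now_real_ns + pacer_time_until_send(now_mono_ns);
}
```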