feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor)#261
Conversation
Slice a of tsr-doz6: the OS-free core that recovers a direct path on a link change instead of silently degrading to DERP until the periodic timers eventually re-probe. The embedder-driven Device::rebind path is left unchanged; this adds an internal supervisor as an opt-in parallel path. New ts_netmon crate (geiserx_ts_netmon, tokio + tracing only, no netlink/route deps): a LinkMonitor trait, a detail-free LinkChangeEvent, a LinkMonitorHandle that aborts its watcher task on drop, an OS-agnostic 250ms trailing-settle debouncer, plus ManualLinkMonitor (synthetic source for tests/this slice) and NoopLinkMonitor. ts_runtime gains a NetmonSupervisor actor behind a new network-monitor feature: on each coalesced link change it asks DirectManager the new RebindAndReprobe message and republishes MeasureNow. RebindAndReprobe folds all three kicks atomically inside the manager that owns the socket, peer db, and multiderp ref: rebind the underlay socket, re-ping all candidates on it, and fire an immediate gated STUN sweep. DerpLatencyMeasurer now caches the last derp map and re-measures on a MeasureNow bus message, so the supervisor needs no extra actor ref. The monitor is gated by Config::network_monitor (default off, pure engine posture) and the network-monitor feature. With the flag off it is a complete no-op. Setting the flag without the feature is a hard error at spawn rather than a silent no-op. The synthetic and no-op backends prove the full pipeline; the Linux/macOS backends are later slices.
…nterval rebinds Two storm-handling correctness fixes in the tsr-doz6 slice-a network monitor, found by the review panel. Both only bite once real OS event sources feed bursty or continuous events, but the flawed code already ships in the OS-agnostic core. ts_netmon debouncer: a continuous raw stream arriving faster than the 250ms trailing settle re-armed the settle timer forever, so the emit arm never fired and zero LinkChangeEvents were produced for the whole storm. Add a DEBOUNCE_MAX_WAIT ceiling of 1s: track when the current pending burst first became pending and arm the timer at min of now plus settle and first-pending plus max-wait, so a never-quiet stream still emits periodically. The marker is cleared on emit so the next raw event starts a fresh burst. The common quiet-terminated burst is unaffected; the cap only kicks in for a pathological never-quiet stream. Flush-on-raw-close is preserved. ts_runtime supervisor: an event arriving within MIN_REBIND_INTERVAL of the last completed rebind was continued, dropping it permanently. A genuinely distinct second link change within 1s was lost, leaving the node rebound for the stale network until the periodic pinger or STUN caught up. Coalesce and defer instead: set a pending flag, arm a timer to the interval boundary, and run exactly one rebind when it elapses, folding any number of within-interval events into that single deferred rebind. This bounds worst-case recovery to MIN_REBIND_INTERVAL. Serial processing, the RebindAndReprobe plus MeasureNow reaction, and clean shutdown precedence over a pending deferred rebind are all preserved. Tests: a paused-clock debouncer test feeds events every half-settle past the cap and asserts at least one emission during the never-quiet stream; the supervisor min-interval test is updated from drop to defer semantics and a second test proves multiple within-interval events coalesce to one deferred rebind. Also reword two stale comments the reviewer flagged.
📝 WalkthroughWalkthroughAdds a new ChangesNetwork-link monitor feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ts_netmon/src/lib.rs`:
- Around line 135-145: The debouncer loop's shutdown handling has two issues: it
ignores the Result returned by shutdown.changed(), and lacks an upfront check
for an already-true shutdown signal. First, add a check before the loop that
breaks immediately if shutdown is already true. Second, in the
shutdown.changed() branch of the tokio::select!, properly handle the error case
returned by shutdown.changed() — if it returns an error (indicating the watch
sender is dropped), break from the loop instead of ignoring it. If it succeeds
and the borrowed shutdown value is true, also break. This prevents hot-spinning
when the channel is closed and ensures prompt exit when shutdown is already
signaled.
In `@ts_runtime/src/lib.rs`:
- Around line 410-418: The network-monitor feature/flag mismatch check in the
`Runtime::spawn` function currently executes after substantial startup work,
which can leave partial startup side effects if it fails. Move the entire guard
block (the `#[cfg(not(feature = "network-monitor"))]` conditional with the check
for `env.network_monitor` that returns the `NetworkMonitorUnavailable` error) to
the very beginning of the `Runtime::spawn` function, before any actor spawning
occurs. This ensures the check fails fast without leaving any partial state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0191186b-6504-4a8a-9b99-59a5a7c2e1ea
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/Cargo.lock
📒 Files selected for processing (17)
Cargo.tomlsrc/config.rssrc/error.rsts_control/src/config.rsts_netmon/Cargo.tomlts_netmon/README.mdts_netmon/src/lib.rsts_runtime/Cargo.tomlts_runtime/src/derp_latency.rsts_runtime/src/direct.rsts_runtime/src/env.rsts_runtime/src/error.rsts_runtime/src/forwarder_actor.rsts_runtime/src/lib.rsts_runtime/src/netmon.rsts_runtime/src/peer_tracker/mod.rsts_runtime/src/tun_actor.rs
| loop { | ||
| tokio::select! { | ||
| // Bias toward shutdown so a flip during a pending settle stops promptly. | ||
| biased; | ||
|
|
||
| _ = shutdown.changed() => { | ||
| if *shutdown.borrow() { | ||
| break; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Handle closed/already-true shutdown correctly in the debouncer loop.
At Line 140, shutdown.changed()’s result is ignored. If the watch sender is dropped while the value is still false, this branch becomes immediately-ready forever; with biased, the loop can spin hot and starve other branches. Also add an upfront check so an already-true shutdown exits immediately.
Proposed fix
loop {
+ if *shutdown.borrow() {
+ break;
+ }
tokio::select! {
// Bias toward shutdown so a flip during a pending settle stops promptly.
biased;
- _ = shutdown.changed() => {
- if *shutdown.borrow() {
+ changed = shutdown.changed() => {
+ if changed.is_err() || *shutdown.borrow() {
break;
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| loop { | |
| tokio::select! { | |
| // Bias toward shutdown so a flip during a pending settle stops promptly. | |
| biased; | |
| _ = shutdown.changed() => { | |
| if *shutdown.borrow() { | |
| break; | |
| } | |
| } | |
| loop { | |
| if *shutdown.borrow() { | |
| break; | |
| } | |
| tokio::select! { | |
| // Bias toward shutdown so a flip during a pending settle stops promptly. | |
| biased; | |
| changed = shutdown.changed() => { | |
| if changed.is_err() || *shutdown.borrow() { | |
| break; | |
| } | |
| } | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ts_netmon/src/lib.rs` around lines 135 - 145, The debouncer loop's shutdown
handling has two issues: it ignores the Result returned by shutdown.changed(),
and lacks an upfront check for an already-true shutdown signal. First, add a
check before the loop that breaks immediately if shutdown is already true.
Second, in the shutdown.changed() branch of the tokio::select!, properly handle
the error case returned by shutdown.changed() — if it returns an error
(indicating the watch sender is dropped), break from the loop instead of
ignoring it. If it succeeds and the borrowed shutdown value is true, also break.
This prevents hot-spinning when the channel is closed and ensures prompt exit
when shutdown is already signaled.
| #[cfg(not(feature = "network-monitor"))] | ||
| if env.network_monitor { | ||
| // The flag is set but this build cannot honor it. Fail loudly (never a silent no-op). | ||
| return Err(Error { | ||
| kind: ErrorKind::NetworkMonitorUnavailable, | ||
| target_actor: None, | ||
| message_ty: None, | ||
| }); | ||
| } |
There was a problem hiding this comment.
Fail fast on unsupported network_monitor before spawning runtime actors.
Line 410 performs the network-monitor feature/flag mismatch check after substantial startup work. Returning NetworkMonitorUnavailable this late can leave partial startup side effects before constructor failure. Move this guard to the top of Runtime::spawn (before any actor spawns).
Suggested fix
pub async fn spawn(
config: ts_control::Config,
auth_key: Option<String>,
keys: ts_keys::NodeState,
) -> Result<Self, Error> {
+ #[cfg(not(feature = "network-monitor"))]
+ if config.network_monitor {
+ return Err(Error {
+ kind: ErrorKind::NetworkMonitorUnavailable,
+ target_actor: None,
+ message_ty: None,
+ });
+ }
+
let (shutdown_tx, shutdown_rx) = watch::channel(false);
@@
- #[cfg(not(feature = "network-monitor"))]
- if env.network_monitor {
- // The flag is set but this build cannot honor it. Fail loudly (never a silent no-op).
- return Err(Error {
- kind: ErrorKind::NetworkMonitorUnavailable,
- target_actor: None,
- message_ty: None,
- });
- }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ts_runtime/src/lib.rs` around lines 410 - 418, The network-monitor
feature/flag mismatch check in the `Runtime::spawn` function currently executes
after substantial startup work, which can leave partial startup side effects if
it fails. Move the entire guard block (the `#[cfg(not(feature =
"network-monitor"))]` conditional with the check for `env.network_monitor` that
returns the `NetworkMonitorUnavailable` error) to the very beginning of the
`Runtime::spawn` function, before any actor spawning occurs. This ensures the
check fails fast without leaving any partial state.
Problem
Device::rebind()is 100% embedder-driven (there is no built-in network monitor). On a Wi-Fi switch / sleep-wake / default-route change, an embedder that never callsrebind()silently degrades to DERP until something forces a re-probe — a realtailscaledrecovers automatically. Worse, even a manualrebind()today only swaps the socket and then waits out the periodic timers (≤2s ping, ≤26s STUN) and never re-runs the DERP-latency netcheck.This is the highest-leverage remaining tsnet-parity gap. It splits cleanly: the rebind mechanism (
MagicSock::rebind) already exists; the missing halves are an OS detector and a supervisor that fires the complete reaction.What this lands — slice (a): the OS-agnostic core
This is the first of three slices. It ships the entire reaction path + opt-in wiring, OS-free (the Linux netlink and macOS PF_ROUTE detectors are slices b/c). It is fully testable today via a synthetic event source, and is also usable by an embedder that already has its own network monitor.
ts_netmoncrate (deps:tokio+tracingonly — deliberately no netlink/route crate, to keep the default ring-only supply-chain graph unchanged): aLinkMonitortrait, a coalescing debouncer (trailing 250ms settle, with a 1s max-wait cap so a never-quiet event storm still emits), aManualLinkMonitor(synthetic/embedder-driven source), and aNoopLinkMonitor.NetmonSupervisoractor (ts_runtime): on each coalesced link-change event,direct.ask(RebindAndReprobe)thenenv.publish(MeasureNow), with a 1s minimum-interval backstop that coalesces-and-defers (never drops) a within-interval event, serial processing, and clean shutdown.RebindAndReprobe(folded intoDirectManager): rebind + immediate re-ping-all + a gated immediate STUN sweep, atomic in the socket-owning actor. The bareRebind/Device::rebind()manual path is unchanged.MeasureNow(onDerpLatencyMeasurer, via the existing pub/sub bus): on-demand DERP-latency re-measure against a newly-cached derp map → control re-selects the home region (idempotent). Closes the no-re-netcheck gap.Opt-in / zero-cost when off
Config::network_monitor: bool(default false) + anetwork-monitorcargo feature (default off). When off: the module, field, andts_netmondep compile out — byte-for-byte today's behavior, no new thread/socket. Feature-off + flag-on is a loud hard error at construction (mirrorsTransportMode::Tunwithout thetunfeature), never a silent no-op.Safety
rebind_socket,stun_servers_v4,send_stun_request); the slice-(a) core opens no OS socket.Tests
ts_netmon: debouncer burst-collapse, never-quiet stream emits via the max-wait cap, two-bursts, flush-on-close, handle-drop aborts.ts_runtime: an end-to-end test driving the realDirectManager+ aManualLinkMonitor(asserts the underlay socket port changes = rebind ran, and aMeasureNowreached the bus), plus within-interval event is deferred (not dropped) and multiple deferred events coalesce into one rebind.Reviewed
Independent correctness and safety+simplicity review passes; both found two storm-handling gaps (debouncer starvation on a never-quiet stream; supervisor dropping a within-interval event) — both fixed (the max-wait cap + coalesce-and-defer) with tests, before this PR.
Gates (local, all green)
cargo test -p geiserx_ts_netmon -p geiserx_ts_runtime --features network-monitor(352) · clippy-D warnings(default +network-monitor+ the existingtunlane) · default build (feature off) ·cargo fmt --check·cargo doc(broken_intra_doc_links=deny, incl.--features network-monitor) ·cargo run -p checks·cargo deny --no-default-features(ring-only graph unchanged) ·cargo tree(ts_netmon = tokio + tracing only).Follow-ups (tracked): slice (b) Linux netlink backend, slice (c) macOS PF_ROUTE backend, and the magicsock
our_node_keyrefresh after rotation.Signed-off-by: Sergio sergio@geiser.cloud
Created using Claude Code (Opus 4.8)
Summary by CodeRabbit
New Features
Documentation