feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor) by GeiserX · Pull Request #261 · GeiserX/tailscale-rs

GeiserX · 2026-06-15T16:43:00Z

Problem

Device::rebind() is 100% embedder-driven (there is no built-in network monitor). On a Wi-Fi switch / sleep-wake / default-route change, an embedder that never calls rebind() silently degrades to DERP until something forces a re-probe — a real tailscaled recovers automatically. Worse, even a manual rebind() today only swaps the socket and then waits out the periodic timers (≤2s ping, ≤26s STUN) and never re-runs the DERP-latency netcheck.

This is the highest-leverage remaining tsnet-parity gap. It splits cleanly: the rebind mechanism (MagicSock::rebind) already exists; the missing halves are an OS detector and a supervisor that fires the complete reaction.

What this lands — slice (a): the OS-agnostic core

This is the first of three slices. It ships the entire reaction path + opt-in wiring, OS-free (the Linux netlink and macOS PF_ROUTE detectors are slices b/c). It is fully testable today via a synthetic event source, and is also usable by an embedder that already has its own network monitor.

New ts_netmon crate (deps: tokio + tracing only — deliberately no netlink/route crate, to keep the default ring-only supply-chain graph unchanged): a LinkMonitor trait, a coalescing debouncer (trailing 250ms settle, with a 1s max-wait cap so a never-quiet event storm still emits), a ManualLinkMonitor (synthetic/embedder-driven source), and a NoopLinkMonitor.
NetmonSupervisor actor (ts_runtime): on each coalesced link-change event, direct.ask(RebindAndReprobe) then env.publish(MeasureNow), with a 1s minimum-interval backstop that coalesces-and-defers (never drops) a within-interval event, serial processing, and clean shutdown.
RebindAndReprobe (folded into DirectManager): rebind + immediate re-ping-all + a gated immediate STUN sweep, atomic in the socket-owning actor. The bare Rebind / Device::rebind() manual path is unchanged.
MeasureNow (on DerpLatencyMeasurer, via the existing pub/sub bus): on-demand DERP-latency re-measure against a newly-cached derp map → control re-selects the home region (idempotent). Closes the no-re-netcheck gap.

Opt-in / zero-cost when off

Config::network_monitor: bool (default false) + a network-monitor cargo feature (default off). When off: the module, field, and ts_netmon dep compile out — byte-for-byte today's behavior, no new thread/socket. Feature-off + flag-on is a loud hard error at construction (mirrors TransportMode::Tun without the tun feature), never a silent no-op.

Safety

Never knocks a session offline (SACRED): a rebind invalidates each peer's best path (→ DERP, fail-closed) but keeps candidates and leaves WireGuard sessions + DERP home untouched; the immediate re-ping re-confirms a direct path in ~1 RTT, and the 1s min-rebind interval far exceeds recovery time, so even a sustained flap costs ~1 RTT of DERP per second, never a permanent DERP pin.
Anti-leak: every new egress funnels through the existing IPv4-only / fail-closed chokepoints (rebind_socket, stun_servers_v4, send_stun_request); the slice-(a) core opens no OS socket.

Tests

ts_netmon: debouncer burst-collapse, never-quiet stream emits via the max-wait cap, two-bursts, flush-on-close, handle-drop aborts. ts_runtime: an end-to-end test driving the real DirectManager + a ManualLinkMonitor (asserts the underlay socket port changes = rebind ran, and a MeasureNow reached the bus), plus within-interval event is deferred (not dropped) and multiple deferred events coalesce into one rebind.

Reviewed

Independent correctness and safety+simplicity review passes; both found two storm-handling gaps (debouncer starvation on a never-quiet stream; supervisor dropping a within-interval event) — both fixed (the max-wait cap + coalesce-and-defer) with tests, before this PR.

Gates (local, all green)

cargo test -p geiserx_ts_netmon -p geiserx_ts_runtime --features network-monitor (352) · clippy -D warnings (default + network-monitor + the existing tun lane) · default build (feature off) · cargo fmt --check · cargo doc (broken_intra_doc_links=deny, incl. --features network-monitor) · cargo run -p checks · cargo deny --no-default-features (ring-only graph unchanged) · cargo tree (ts_netmon = tokio + tracing only).

Follow-ups (tracked): slice (b) Linux netlink backend, slice (c) macOS PF_ROUTE backend, and the magicsock our_node_key refresh after rotation.

Signed-off-by: Sergio sergio@geiser.cloud

Created using Claude Code (Opus 4.8)

Summary by CodeRabbit

New Features
- Added network link change detection and monitoring capability that automatically responds to network connectivity events.
- Introduced configuration option to enable network link monitoring.
- Added on-demand DERP latency re-measurement triggering.
Documentation
- Added network monitor documentation describing link change event detection.

Slice a of tsr-doz6: the OS-free core that recovers a direct path on a link change instead of silently degrading to DERP until the periodic timers eventually re-probe. The embedder-driven Device::rebind path is left unchanged; this adds an internal supervisor as an opt-in parallel path. New ts_netmon crate (geiserx_ts_netmon, tokio + tracing only, no netlink/route deps): a LinkMonitor trait, a detail-free LinkChangeEvent, a LinkMonitorHandle that aborts its watcher task on drop, an OS-agnostic 250ms trailing-settle debouncer, plus ManualLinkMonitor (synthetic source for tests/this slice) and NoopLinkMonitor. ts_runtime gains a NetmonSupervisor actor behind a new network-monitor feature: on each coalesced link change it asks DirectManager the new RebindAndReprobe message and republishes MeasureNow. RebindAndReprobe folds all three kicks atomically inside the manager that owns the socket, peer db, and multiderp ref: rebind the underlay socket, re-ping all candidates on it, and fire an immediate gated STUN sweep. DerpLatencyMeasurer now caches the last derp map and re-measures on a MeasureNow bus message, so the supervisor needs no extra actor ref. The monitor is gated by Config::network_monitor (default off, pure engine posture) and the network-monitor feature. With the flag off it is a complete no-op. Setting the flag without the feature is a hard error at spawn rather than a silent no-op. The synthetic and no-op backends prove the full pipeline; the Linux/macOS backends are later slices.

…nterval rebinds Two storm-handling correctness fixes in the tsr-doz6 slice-a network monitor, found by the review panel. Both only bite once real OS event sources feed bursty or continuous events, but the flawed code already ships in the OS-agnostic core. ts_netmon debouncer: a continuous raw stream arriving faster than the 250ms trailing settle re-armed the settle timer forever, so the emit arm never fired and zero LinkChangeEvents were produced for the whole storm. Add a DEBOUNCE_MAX_WAIT ceiling of 1s: track when the current pending burst first became pending and arm the timer at min of now plus settle and first-pending plus max-wait, so a never-quiet stream still emits periodically. The marker is cleared on emit so the next raw event starts a fresh burst. The common quiet-terminated burst is unaffected; the cap only kicks in for a pathological never-quiet stream. Flush-on-raw-close is preserved. ts_runtime supervisor: an event arriving within MIN_REBIND_INTERVAL of the last completed rebind was continued, dropping it permanently. A genuinely distinct second link change within 1s was lost, leaving the node rebound for the stale network until the periodic pinger or STUN caught up. Coalesce and defer instead: set a pending flag, arm a timer to the interval boundary, and run exactly one rebind when it elapses, folding any number of within-interval events into that single deferred rebind. This bounds worst-case recovery to MIN_REBIND_INTERVAL. Serial processing, the RebindAndReprobe plus MeasureNow reaction, and clean shutdown precedence over a pending deferred rebind are all preserved. Tests: a paused-clock debouncer test feeds events every half-settle past the cap and asserts at least one emission during the never-quiet stream; the supervisor min-interval test is updated from drop to defer semantics and a second test proves multiple within-interval events coalesce to one deferred rebind. Also reword two stale comments the reviewer flagged.

coderabbitai · 2026-06-15T16:43:16Z

📝 Walkthrough

Walkthrough

Adds a new ts_netmon crate implementing a debounced OS link-change coalescing pipeline. Introduces a NetmonSupervisor actor in ts_runtime that reacts to coalesced events by triggering DirectManager.rebind_and_reprobe and publishing MeasureNow to DerpLatencyMeasurer. The feature is gated by a network-monitor Cargo feature and a Config::network_monitor boolean propagated through all config layers.

Changes

Network-link monitor feature

Layer / File(s)	Summary
Cargo workspace and feature wiring `Cargo.toml`, `ts_netmon/Cargo.toml`, `ts_runtime/Cargo.toml`	Registers `ts_netmon` as a workspace member and dependency, adds it as an optional dep in `ts_runtime`, and introduces the `network-monitor` Cargo feature at both workspace and runtime crate level.
Config fields and error variants `ts_control/src/config.rs`, `src/config.rs`, `src/error.rs`, `ts_runtime/src/error.rs`, `ts_runtime/src/env.rs`	Propagates `network_monitor: bool` through `ts_control::Config` → `ForwarderConfig` → `Env`; adds `NetworkMonitorUnavailable` error variants in `ts_runtime::ErrorKind`, `InternalErrorKind`, and the cross-layer `From` conversion.
ts_netmon debounce library `ts_netmon/README.md`, `ts_netmon/src/lib.rs`	Defines `LinkMonitor` trait, `LinkMonitorHandle` (abort-on-drop), `debounce` async coalescer with settle/max-wait/shutdown/flush logic, `ManualLinkMonitor`/`NoopLinkMonitor` backends, and a full test suite using paused tokio time.
DirectManager rebind_and_reprobe handler `ts_runtime/src/direct.rs`	Stores `ActorRef<Multiderp>` on `DirectManager`; new `rebind_and_reprobe` handler rebinds `MagicSock`, re-pings endpoints, and runs a best-effort v4 STUN sweep; wired in both startup branches.
DerpLatencyMeasurer MeasureNow message `ts_runtime/src/derp_latency.rs`	Adds `MeasureNow` bus message; extends `DerpLatencyMeasurer` with a `last_derp_map` cache; updates `StateUpdate` handler to store the map; adds a `MeasureNow` handler that re-measures from cache or no-ops; extends tests.
NetmonSupervisor actor and Runtime wiring `ts_runtime/src/netmon.rs`, `ts_runtime/src/lib.rs`	Introduces `NetmonSupervisor` with `on_start` (noop fallback on monitor failure), `run` loop with `MIN_REBIND_INTERVAL` throttling and shutdown; wires into `Runtime::spawn` behind `#[cfg(feature = "network-monitor")]`; full unit and end-to-end tests.
Test fixture updates `ts_runtime/src/forwarder_actor.rs`, `ts_runtime/src/peer_tracker/mod.rs`, `ts_runtime/src/tun_actor.rs`	Adds `network_monitor: false` to existing `ForwarderConfig` struct literals in test helpers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: introducing an opt-in network monitor with automatic rebind on link changes, including the note that this is slice (a) covering core and supervisor components.
Description check	✅ Passed	The description comprehensively covers the problem, solution, safety considerations, testing, and review findings. It addresses all critical information but is missing explicit links to related issues or changelog entry.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/netmon-supervisor-slice-a

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts_netmon/src/lib.rs`:
- Around line 135-145: The debouncer loop's shutdown handling has two issues: it
ignores the Result returned by shutdown.changed(), and lacks an upfront check
for an already-true shutdown signal. First, add a check before the loop that
breaks immediately if shutdown is already true. Second, in the
shutdown.changed() branch of the tokio::select!, properly handle the error case
returned by shutdown.changed() — if it returns an error (indicating the watch
sender is dropped), break from the loop instead of ignoring it. If it succeeds
and the borrowed shutdown value is true, also break. This prevents hot-spinning
when the channel is closed and ensures prompt exit when shutdown is already
signaled.

In `@ts_runtime/src/lib.rs`:
- Around line 410-418: The network-monitor feature/flag mismatch check in the
`Runtime::spawn` function currently executes after substantial startup work,
which can leave partial startup side effects if it fails. Move the entire guard
block (the `#[cfg(not(feature = "network-monitor"))]` conditional with the check
for `env.network_monitor` that returns the `NetworkMonitorUnavailable` error) to
the very beginning of the `Runtime::spawn` function, before any actor spawning
occurs. This ensures the check fails fast without leaving any partial state.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0191186b-6504-4a8a-9b99-59a5a7c2e1ea

📥 Commits

Reviewing files that changed from the base of the PR and between fea03ab and 4346753.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock

📒 Files selected for processing (17)

Cargo.toml
src/config.rs
src/error.rs
ts_control/src/config.rs
ts_netmon/Cargo.toml
ts_netmon/README.md
ts_netmon/src/lib.rs
ts_runtime/Cargo.toml
ts_runtime/src/derp_latency.rs
ts_runtime/src/direct.rs
ts_runtime/src/env.rs
ts_runtime/src/error.rs
ts_runtime/src/forwarder_actor.rs
ts_runtime/src/lib.rs
ts_runtime/src/netmon.rs
ts_runtime/src/peer_tracker/mod.rs
ts_runtime/src/tun_actor.rs

coderabbitai · 2026-06-15T16:51:26Z

+    loop {
+        tokio::select! {
+            // Bias toward shutdown so a flip during a pending settle stops promptly.
+            biased;
+
+            _ = shutdown.changed() => {
+                if *shutdown.borrow() {
+                    break;
+                }
+            }
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle closed/already-true shutdown correctly in the debouncer loop.

At Line 140, shutdown.changed()’s result is ignored. If the watch sender is dropped while the value is still false, this branch becomes immediately-ready forever; with biased, the loop can spin hot and starve other branches. Also add an upfront check so an already-true shutdown exits immediately.

Proposed fix

loop { + if *shutdown.borrow() { + break; + } tokio::select! { // Bias toward shutdown so a flip during a pending settle stops promptly. biased; - _ = shutdown.changed() => { - if *shutdown.borrow() { + changed = shutdown.changed() => { + if changed.is_err() || *shutdown.borrow() { break; } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

loop {

tokio::select! {

// Bias toward shutdown so a flip during a pending settle stops promptly.

biased;

_ = shutdown.changed() => {

if *shutdown.borrow() {

break;

}

}

loop {

if *shutdown.borrow() {

break;

}

tokio::select! {

// Bias toward shutdown so a flip during a pending settle stops promptly.

biased;

changed = shutdown.changed() => {

if changed.is_err() || *shutdown.borrow() {

break;

}

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ts_netmon/src/lib.rs` around lines 135 - 145, The debouncer loop's shutdown handling has two issues: it ignores the Result returned by shutdown.changed(), and lacks an upfront check for an already-true shutdown signal. First, add a check before the loop that breaks immediately if shutdown is already true. Second, in the shutdown.changed() branch of the tokio::select!, properly handle the error case returned by shutdown.changed() — if it returns an error (indicating the watch sender is dropped), break from the loop instead of ignoring it. If it succeeds and the borrowed shutdown value is true, also break. This prevents hot-spinning when the channel is closed and ensures prompt exit when shutdown is already signaled.

coderabbitai · 2026-06-15T16:51:26Z

+        #[cfg(not(feature = "network-monitor"))]
+        if env.network_monitor {
+            // The flag is set but this build cannot honor it. Fail loudly (never a silent no-op).
+            return Err(Error {
+                kind: ErrorKind::NetworkMonitorUnavailable,
+                target_actor: None,
+                message_ty: None,
+            });
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on unsupported network_monitor before spawning runtime actors.

Line 410 performs the network-monitor feature/flag mismatch check after substantial startup work. Returning NetworkMonitorUnavailable this late can leave partial startup side effects before constructor failure. Move this guard to the top of Runtime::spawn (before any actor spawns).

Suggested fix

pub async fn spawn( config: ts_control::Config, auth_key: Option<String>, keys: ts_keys::NodeState, ) -> Result<Self, Error> { + #[cfg(not(feature = "network-monitor"))] + if config.network_monitor { + return Err(Error { + kind: ErrorKind::NetworkMonitorUnavailable, + target_actor: None, + message_ty: None, + }); + } + let (shutdown_tx, shutdown_rx) = watch::channel(false); @@ - #[cfg(not(feature = "network-monitor"))] - if env.network_monitor { - // The flag is set but this build cannot honor it. Fail loudly (never a silent no-op). - return Err(Error { - kind: ErrorKind::NetworkMonitorUnavailable, - target_actor: None, - message_ty: None, - }); - }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ts_runtime/src/lib.rs` around lines 410 - 418, The network-monitor feature/flag mismatch check in the `Runtime::spawn` function currently executes after substantial startup work, which can leave partial startup side effects if it fails. Move the entire guard block (the `#[cfg(not(feature = "network-monitor"))]` conditional with the check for `env.network_monitor` that returns the `NetworkMonitorUnavailable` error) to the very beginning of the `Runtime::spawn` function, before any actor spawning occurs. This ensures the check fails fast without leaving any partial state.

GeiserX added 2 commits June 15, 2026 18:08

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

GeiserX merged commit 14d885a into main Jun 15, 2026
16 checks passed

github-actions Bot mentioned this pull request Jun 15, 2026

chore(main): release 0.40.0 #256

Merged

GeiserX mentioned this pull request Jun 15, 2026

fix: publish ts_netmon crate so the workspace release completes #264

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor)#261

feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor)#261
GeiserX merged 2 commits into
mainfrom
feat/netmon-supervisor-slice-a

GeiserX commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

GeiserX commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

What this lands — slice (a): the OS-agnostic core

Opt-in / zero-cost when off

Safety

Tests

Reviewed

Gates (local, all green)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

GeiserX commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading