Skip to content

feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor)#261

Merged
GeiserX merged 2 commits into
mainfrom
feat/netmon-supervisor-slice-a
Jun 15, 2026
Merged

feat: opt-in network monitor with auto-rebind on link change (slice a: core + supervisor)#261
GeiserX merged 2 commits into
mainfrom
feat/netmon-supervisor-slice-a

Conversation

@GeiserX

@GeiserX GeiserX commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Problem

Device::rebind() is 100% embedder-driven (there is no built-in network monitor). On a Wi-Fi switch / sleep-wake / default-route change, an embedder that never calls rebind() silently degrades to DERP until something forces a re-probe — a real tailscaled recovers automatically. Worse, even a manual rebind() today only swaps the socket and then waits out the periodic timers (≤2s ping, ≤26s STUN) and never re-runs the DERP-latency netcheck.

This is the highest-leverage remaining tsnet-parity gap. It splits cleanly: the rebind mechanism (MagicSock::rebind) already exists; the missing halves are an OS detector and a supervisor that fires the complete reaction.

What this lands — slice (a): the OS-agnostic core

This is the first of three slices. It ships the entire reaction path + opt-in wiring, OS-free (the Linux netlink and macOS PF_ROUTE detectors are slices b/c). It is fully testable today via a synthetic event source, and is also usable by an embedder that already has its own network monitor.

  • New ts_netmon crate (deps: tokio + tracing only — deliberately no netlink/route crate, to keep the default ring-only supply-chain graph unchanged): a LinkMonitor trait, a coalescing debouncer (trailing 250ms settle, with a 1s max-wait cap so a never-quiet event storm still emits), a ManualLinkMonitor (synthetic/embedder-driven source), and a NoopLinkMonitor.
  • NetmonSupervisor actor (ts_runtime): on each coalesced link-change event, direct.ask(RebindAndReprobe) then env.publish(MeasureNow), with a 1s minimum-interval backstop that coalesces-and-defers (never drops) a within-interval event, serial processing, and clean shutdown.
  • RebindAndReprobe (folded into DirectManager): rebind + immediate re-ping-all + a gated immediate STUN sweep, atomic in the socket-owning actor. The bare Rebind / Device::rebind() manual path is unchanged.
  • MeasureNow (on DerpLatencyMeasurer, via the existing pub/sub bus): on-demand DERP-latency re-measure against a newly-cached derp map → control re-selects the home region (idempotent). Closes the no-re-netcheck gap.

Opt-in / zero-cost when off

  • Config::network_monitor: bool (default false) + a network-monitor cargo feature (default off). When off: the module, field, and ts_netmon dep compile out — byte-for-byte today's behavior, no new thread/socket. Feature-off + flag-on is a loud hard error at construction (mirrors TransportMode::Tun without the tun feature), never a silent no-op.

Safety

  • Never knocks a session offline (SACRED): a rebind invalidates each peer's best path (→ DERP, fail-closed) but keeps candidates and leaves WireGuard sessions + DERP home untouched; the immediate re-ping re-confirms a direct path in ~1 RTT, and the 1s min-rebind interval far exceeds recovery time, so even a sustained flap costs ~1 RTT of DERP per second, never a permanent DERP pin.
  • Anti-leak: every new egress funnels through the existing IPv4-only / fail-closed chokepoints (rebind_socket, stun_servers_v4, send_stun_request); the slice-(a) core opens no OS socket.

Tests

ts_netmon: debouncer burst-collapse, never-quiet stream emits via the max-wait cap, two-bursts, flush-on-close, handle-drop aborts. ts_runtime: an end-to-end test driving the real DirectManager + a ManualLinkMonitor (asserts the underlay socket port changes = rebind ran, and a MeasureNow reached the bus), plus within-interval event is deferred (not dropped) and multiple deferred events coalesce into one rebind.

Reviewed

Independent correctness and safety+simplicity review passes; both found two storm-handling gaps (debouncer starvation on a never-quiet stream; supervisor dropping a within-interval event) — both fixed (the max-wait cap + coalesce-and-defer) with tests, before this PR.

Gates (local, all green)

cargo test -p geiserx_ts_netmon -p geiserx_ts_runtime --features network-monitor (352) · clippy -D warnings (default + network-monitor + the existing tun lane) · default build (feature off) · cargo fmt --check · cargo doc (broken_intra_doc_links=deny, incl. --features network-monitor) · cargo run -p checks · cargo deny --no-default-features (ring-only graph unchanged) · cargo tree (ts_netmon = tokio + tracing only).

Follow-ups (tracked): slice (b) Linux netlink backend, slice (c) macOS PF_ROUTE backend, and the magicsock our_node_key refresh after rotation.

Signed-off-by: Sergio sergio@geiser.cloud

Created using Claude Code (Opus 4.8)

Summary by CodeRabbit

  • New Features

    • Added network link change detection and monitoring capability that automatically responds to network connectivity events.
    • Introduced configuration option to enable network link monitoring.
    • Added on-demand DERP latency re-measurement triggering.
  • Documentation

    • Added network monitor documentation describing link change event detection.

GeiserX added 2 commits June 15, 2026 18:08
Slice a of tsr-doz6: the OS-free core that recovers a direct path on a
link change instead of silently degrading to DERP until the periodic
timers eventually re-probe. The embedder-driven Device::rebind path is
left unchanged; this adds an internal supervisor as an opt-in parallel
path.

New ts_netmon crate (geiserx_ts_netmon, tokio + tracing only, no
netlink/route deps): a LinkMonitor trait, a detail-free LinkChangeEvent,
a LinkMonitorHandle that aborts its watcher task on drop, an
OS-agnostic 250ms trailing-settle debouncer, plus ManualLinkMonitor
(synthetic source for tests/this slice) and NoopLinkMonitor.

ts_runtime gains a NetmonSupervisor actor behind a new network-monitor
feature: on each coalesced link change it asks DirectManager the new
RebindAndReprobe message and republishes MeasureNow. RebindAndReprobe
folds all three kicks atomically inside the manager that owns the
socket, peer db, and multiderp ref: rebind the underlay socket, re-ping
all candidates on it, and fire an immediate gated STUN sweep.
DerpLatencyMeasurer now caches the last derp map and re-measures on a
MeasureNow bus message, so the supervisor needs no extra actor ref.

The monitor is gated by Config::network_monitor (default off, pure
engine posture) and the network-monitor feature. With the flag off it
is a complete no-op. Setting the flag without the feature is a hard
error at spawn rather than a silent no-op. The synthetic and no-op
backends prove the full pipeline; the Linux/macOS backends are later
slices.
…nterval rebinds

Two storm-handling correctness fixes in the tsr-doz6 slice-a network
monitor, found by the review panel. Both only bite once real OS event
sources feed bursty or continuous events, but the flawed code already
ships in the OS-agnostic core.

ts_netmon debouncer: a continuous raw stream arriving faster than the
250ms trailing settle re-armed the settle timer forever, so the emit arm
never fired and zero LinkChangeEvents were produced for the whole storm.
Add a DEBOUNCE_MAX_WAIT ceiling of 1s: track when the current pending
burst first became pending and arm the timer at min of now plus settle
and first-pending plus max-wait, so a never-quiet stream still emits
periodically. The marker is cleared on emit so the next raw event starts
a fresh burst. The common quiet-terminated burst is unaffected; the cap
only kicks in for a pathological never-quiet stream. Flush-on-raw-close
is preserved.

ts_runtime supervisor: an event arriving within MIN_REBIND_INTERVAL of
the last completed rebind was continued, dropping it permanently. A
genuinely distinct second link change within 1s was lost, leaving the
node rebound for the stale network until the periodic pinger or STUN
caught up. Coalesce and defer instead: set a pending flag, arm a timer
to the interval boundary, and run exactly one rebind when it elapses,
folding any number of within-interval events into that single deferred
rebind. This bounds worst-case recovery to MIN_REBIND_INTERVAL. Serial
processing, the RebindAndReprobe plus MeasureNow reaction, and clean
shutdown precedence over a pending deferred rebind are all preserved.

Tests: a paused-clock debouncer test feeds events every half-settle past
the cap and asserts at least one emission during the never-quiet stream;
the supervisor min-interval test is updated from drop to defer semantics
and a second test proves multiple within-interval events coalesce to one
deferred rebind. Also reword two stale comments the reviewer flagged.
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new ts_netmon crate implementing a debounced OS link-change coalescing pipeline. Introduces a NetmonSupervisor actor in ts_runtime that reacts to coalesced events by triggering DirectManager.rebind_and_reprobe and publishing MeasureNow to DerpLatencyMeasurer. The feature is gated by a network-monitor Cargo feature and a Config::network_monitor boolean propagated through all config layers.

Changes

Network-link monitor feature

Layer / File(s) Summary
Cargo workspace and feature wiring
Cargo.toml, ts_netmon/Cargo.toml, ts_runtime/Cargo.toml
Registers ts_netmon as a workspace member and dependency, adds it as an optional dep in ts_runtime, and introduces the network-monitor Cargo feature at both workspace and runtime crate level.
Config fields and error variants
ts_control/src/config.rs, src/config.rs, src/error.rs, ts_runtime/src/error.rs, ts_runtime/src/env.rs
Propagates network_monitor: bool through ts_control::ConfigForwarderConfigEnv; adds NetworkMonitorUnavailable error variants in ts_runtime::ErrorKind, InternalErrorKind, and the cross-layer From conversion.
ts_netmon debounce library
ts_netmon/README.md, ts_netmon/src/lib.rs
Defines LinkMonitor trait, LinkMonitorHandle (abort-on-drop), debounce async coalescer with settle/max-wait/shutdown/flush logic, ManualLinkMonitor/NoopLinkMonitor backends, and a full test suite using paused tokio time.
DirectManager rebind_and_reprobe handler
ts_runtime/src/direct.rs
Stores ActorRef<Multiderp> on DirectManager; new rebind_and_reprobe handler rebinds MagicSock, re-pings endpoints, and runs a best-effort v4 STUN sweep; wired in both startup branches.
DerpLatencyMeasurer MeasureNow message
ts_runtime/src/derp_latency.rs
Adds MeasureNow bus message; extends DerpLatencyMeasurer with a last_derp_map cache; updates StateUpdate handler to store the map; adds a MeasureNow handler that re-measures from cache or no-ops; extends tests.
NetmonSupervisor actor and Runtime wiring
ts_runtime/src/netmon.rs, ts_runtime/src/lib.rs
Introduces NetmonSupervisor with on_start (noop fallback on monitor failure), run loop with MIN_REBIND_INTERVAL throttling and shutdown; wires into Runtime::spawn behind #[cfg(feature = "network-monitor")]; full unit and end-to-end tests.
Test fixture updates
ts_runtime/src/forwarder_actor.rs, ts_runtime/src/peer_tracker/mod.rs, ts_runtime/src/tun_actor.rs
Adds network_monitor: false to existing ForwarderConfig struct literals in test helpers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: introducing an opt-in network monitor with automatic rebind on link changes, including the note that this is slice (a) covering core and supervisor components.
Description check ✅ Passed The description comprehensively covers the problem, solution, safety considerations, testing, and review findings. It addresses all critical information but is missing explicit links to related issues or changelog entry.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/netmon-supervisor-slice-a

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts_netmon/src/lib.rs`:
- Around line 135-145: The debouncer loop's shutdown handling has two issues: it
ignores the Result returned by shutdown.changed(), and lacks an upfront check
for an already-true shutdown signal. First, add a check before the loop that
breaks immediately if shutdown is already true. Second, in the
shutdown.changed() branch of the tokio::select!, properly handle the error case
returned by shutdown.changed() — if it returns an error (indicating the watch
sender is dropped), break from the loop instead of ignoring it. If it succeeds
and the borrowed shutdown value is true, also break. This prevents hot-spinning
when the channel is closed and ensures prompt exit when shutdown is already
signaled.

In `@ts_runtime/src/lib.rs`:
- Around line 410-418: The network-monitor feature/flag mismatch check in the
`Runtime::spawn` function currently executes after substantial startup work,
which can leave partial startup side effects if it fails. Move the entire guard
block (the `#[cfg(not(feature = "network-monitor"))]` conditional with the check
for `env.network_monitor` that returns the `NetworkMonitorUnavailable` error) to
the very beginning of the `Runtime::spawn` function, before any actor spawning
occurs. This ensures the check fails fast without leaving any partial state.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0191186b-6504-4a8a-9b99-59a5a7c2e1ea

📥 Commits

Reviewing files that changed from the base of the PR and between fea03ab and 4346753.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock
📒 Files selected for processing (17)
  • Cargo.toml
  • src/config.rs
  • src/error.rs
  • ts_control/src/config.rs
  • ts_netmon/Cargo.toml
  • ts_netmon/README.md
  • ts_netmon/src/lib.rs
  • ts_runtime/Cargo.toml
  • ts_runtime/src/derp_latency.rs
  • ts_runtime/src/direct.rs
  • ts_runtime/src/env.rs
  • ts_runtime/src/error.rs
  • ts_runtime/src/forwarder_actor.rs
  • ts_runtime/src/lib.rs
  • ts_runtime/src/netmon.rs
  • ts_runtime/src/peer_tracker/mod.rs
  • ts_runtime/src/tun_actor.rs

Comment thread ts_netmon/src/lib.rs
Comment on lines +135 to +145
loop {
tokio::select! {
// Bias toward shutdown so a flip during a pending settle stops promptly.
biased;

_ = shutdown.changed() => {
if *shutdown.borrow() {
break;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle closed/already-true shutdown correctly in the debouncer loop.

At Line 140, shutdown.changed()’s result is ignored. If the watch sender is dropped while the value is still false, this branch becomes immediately-ready forever; with biased, the loop can spin hot and starve other branches. Also add an upfront check so an already-true shutdown exits immediately.

Proposed fix
 loop {
+    if *shutdown.borrow() {
+        break;
+    }
     tokio::select! {
         // Bias toward shutdown so a flip during a pending settle stops promptly.
         biased;
 
-        _ = shutdown.changed() => {
-            if *shutdown.borrow() {
+        changed = shutdown.changed() => {
+            if changed.is_err() || *shutdown.borrow() {
                 break;
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
loop {
tokio::select! {
// Bias toward shutdown so a flip during a pending settle stops promptly.
biased;
_ = shutdown.changed() => {
if *shutdown.borrow() {
break;
}
}
loop {
if *shutdown.borrow() {
break;
}
tokio::select! {
// Bias toward shutdown so a flip during a pending settle stops promptly.
biased;
changed = shutdown.changed() => {
if changed.is_err() || *shutdown.borrow() {
break;
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ts_netmon/src/lib.rs` around lines 135 - 145, The debouncer loop's shutdown
handling has two issues: it ignores the Result returned by shutdown.changed(),
and lacks an upfront check for an already-true shutdown signal. First, add a
check before the loop that breaks immediately if shutdown is already true.
Second, in the shutdown.changed() branch of the tokio::select!, properly handle
the error case returned by shutdown.changed() — if it returns an error
(indicating the watch sender is dropped), break from the loop instead of
ignoring it. If it succeeds and the borrowed shutdown value is true, also break.
This prevents hot-spinning when the channel is closed and ensures prompt exit
when shutdown is already signaled.

Comment thread ts_runtime/src/lib.rs
Comment on lines +410 to +418
#[cfg(not(feature = "network-monitor"))]
if env.network_monitor {
// The flag is set but this build cannot honor it. Fail loudly (never a silent no-op).
return Err(Error {
kind: ErrorKind::NetworkMonitorUnavailable,
target_actor: None,
message_ty: None,
});
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on unsupported network_monitor before spawning runtime actors.

Line 410 performs the network-monitor feature/flag mismatch check after substantial startup work. Returning NetworkMonitorUnavailable this late can leave partial startup side effects before constructor failure. Move this guard to the top of Runtime::spawn (before any actor spawns).

Suggested fix
 pub async fn spawn(
     config: ts_control::Config,
     auth_key: Option<String>,
     keys: ts_keys::NodeState,
 ) -> Result<Self, Error> {
+    #[cfg(not(feature = "network-monitor"))]
+    if config.network_monitor {
+        return Err(Error {
+            kind: ErrorKind::NetworkMonitorUnavailable,
+            target_actor: None,
+            message_ty: None,
+        });
+    }
+
     let (shutdown_tx, shutdown_rx) = watch::channel(false);
@@
-        #[cfg(not(feature = "network-monitor"))]
-        if env.network_monitor {
-            // The flag is set but this build cannot honor it. Fail loudly (never a silent no-op).
-            return Err(Error {
-                kind: ErrorKind::NetworkMonitorUnavailable,
-                target_actor: None,
-                message_ty: None,
-            });
-        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ts_runtime/src/lib.rs` around lines 410 - 418, The network-monitor
feature/flag mismatch check in the `Runtime::spawn` function currently executes
after substantial startup work, which can leave partial startup side effects if
it fails. Move the entire guard block (the `#[cfg(not(feature =
"network-monitor"))]` conditional with the check for `env.network_monitor` that
returns the `NetworkMonitorUnavailable` error) to the very beginning of the
`Runtime::spawn` function, before any actor spawning occurs. This ensures the
check fails fast without leaving any partial state.

@GeiserX GeiserX merged commit 14d885a into main Jun 15, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant