feat: auto re-auth on node-key expiry instead of going permanently offline#260
Conversation
An auth-key-registered node went permanently offline at node-key expiry: the control runner flipped DeviceState::Expired terminally on a self-node whose key expiry had passed and never re-registered. Go's doLogin instead rotates the node key and re-registers with the stored auth key, recovering automatically. The node-key rotation primitive existed but nothing in the runtime drove it. Wire the recovery end to end: ts_control gains Command::Reauth on the live map-poll command channel and an AsyncControlClient::reauth method. A node-key rotation cannot run inside run_once which only borrows the key state, so the Reauth arm breaks the poll loop and signals run via a reauth_requested out-param threaded like received_frame. run owns the NodeState, calls rotate_node_key, and reconnects immediately skipping the backoff so the next run_once re-registers with the rotated key plus OldNodeKey. Only the node key rotates; disco and machine keys are untouched so established tunnels do not flap. ts_runtime adds the transient DeviceState::Reauthenticating, treated like Connecting by the waiters and carrying no browse_to_url since recovery is non-interactive. The unconditional expired-to-Expired branch is replaced by a pure expiry_action helper deciding Running, Reauthenticate, or Expired from key_expired, has_auth_key, reauth_enabled, and tka_active. On Reauthenticate the handler sets the state and sends Command::Reauth once on the transition in; recovery to Running is automatic on the next good self-node at the same handler. The Tailnet Lock gate is a hard safety constraint: rotating on a locked tailnet would install an unsigned key and lock the node out of locked peers, so an active lock vetoes auto-reauth and falls through to Expired. Auto-reauth also requires a retained auth key and the reauth_on_expiry config opt-in default true, threaded through the facade like ephemeral. A one-shot auth key degrades to the terminal state, no regression. The magicsock our_node_key refresh after rotation is deferred with a TODO at the rotation site: the stale key only causes a brief disco-ping rediscovery hiccup that self-heals on the next netmap, not a tunnel teardown, and wiring it is a cross-actor change not worth a fragile shim for a self-healing transient. Signed-off-by: GeiserX <9169332+GeiserX@users.noreply.github.com>
…eKey anchor Address the review panel's interlocking defects on the tsr-ajvm auto-reauth path, all faces of one missing piece — a bounded reauth sub-state-machine. - Circuit breaker: ControlRunner now counts consecutive reauths that have not recovered the node (reauth_attempts, cap MAX_REAUTH_ATTEMPTS=3). The decision is the pure reauth_circuit_step: fire the one-shot Command::Reauth only on the transition INTO Reauthenticating, count each later expired self-node while still reauthenticating, and trip to terminal Expired at the cap. Resets to 0 on a good (non-expired) self-node. A one-shot/consumed auth key now settles instead of sitting in Reauthenticating forever. - Preserve the original anchor: NodeState::rotate_node_key only captures the current key as old_node_key when none is held, so a second rotation before a successful register keeps the ORIGINAL pre-expiry key control needs to link the rotation. New NodeState::clear_old_node_key releases the anchor; the client run loop calls it on a successful re-register so the next genuine rotation re-anchors on the now-current key. - Bridge no-clobber: bridge_reauth_url_to_state no longer downgrades an in-flight Reauthenticating to NeedsLogin, which both avoided a misleading browse_to_url on a headless node and re-armed a second rotation. - DeviceState is now #[non_exhaustive] (amortizes the Reauthenticating addition). - Polish: reword the Reauthenticating 'transient' prose off the nonexistent DeviceState::is_permanent, and downgrade reauth()'s closed-channel log to debug (a benign teardown race). Signed-off-by: GeiserX <9169332+GeiserX@users.noreply.github.com>
|
Warning Review limit reached
More reviews will be available in 47 minutes and 32 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Problem
An auth-key-registered node went permanently offline at node-key expiry. When control delivered a self-node whose node key had expired, the runtime flipped to a terminal
DeviceState::Expiredand never re-registered. Go'sipnlocalinstead runsdoLogin: it generates a fresh node key, records the prior key asOldNodeKey, and re-registers with the stored auth key, so the node renews itself. This was the highest-value parity gap — a silent permanent failure for a long-lived node (an exit node renews on a schedule; this fork just died).What this does
On an expired self node-key, if an auth key is retained, auto-reauth is enabled, and Tailnet Lock is not enforcing, the runtime now:
DeviceState::Reauthenticating(non-interactive — nobrowse_to_url),NodeState::rotate_node_key— fresh node key, prior key recorded asOldNodeKey; disco / machine / network-lock keys are left untouched),OldNodeKeyset, so control links the new key to the existing node identity (same node ID / tailnet IP).Recovery to
Runningis automatic — the next good self-node flips it back at the same handler.No tunnel flap (the SACRED invariant)
Only the node key rotates. The data plane keys WireGuard/disco sessions on the disco, machine, and per-peer keys — never the self node key — so a rotation does not re-key or tear down an established tunnel. The control connection re-registers in place (the netmap stream task is reused, not rebuilt). Verified in review:
our_node_keyis used only as the claimed-sender identity inside disco pings (sealed with the disco key), so a stale value after rotation is at worst a brief disco-ping re-discovery hiccup that self-heals on the next netmap — never a session teardown. (Refreshing magicsock's copy is a tracked follow-upTODO.)Safety
Expired(the counter resets on a successful re-register / good self-node). The reauth fires once per transition intoReauthenticating, never per netmap frame.tka_authoritycell the peer-trust chokepoint enforces against (single source of truth). TKA re-sign is a separate follow-up.MachineNotAuthorized -> NeedsLoginbridge no longer overwrites an in-flight auto-reauth.reauth_on_expiryconfig (defaulttrue) lets an embedder opt out.DeviceStateis now#[non_exhaustive]to future-proof embedder matches.Tests
Pure
expiry_actiondecision matrix (incl. the TKA veto), thereauth_circuit_stepbreaker (fire-once, count, trip-to-Expired-at-cap, reset-on-recovery, full one-shot-key episode), the OldNodeKey anchor lifecycle (two rotations preserve the original; rotation after a clear re-anchors), the bridge no-clobber guard, andReauthenticatingrecovery inwait_for_running.Gates (local, all green)
cargo test -p geiserx_ts_keys -p geiserx_ts_control -p geiserx_ts_runtime(345) · clippy-D warnings(default +--features tun) ·cargo fmt --check·cargo doc(broken_intra_doc_links=deny) ·cargo run -p checks. Reviewed by an independent correctness+parity and an architecture+safety pass; all findings fixed.Signed-off-by: Sergio sergio@geiser.cloud
Created using Claude Code (Opus 4.8)