Skip to content

feat: auto re-auth on node-key expiry instead of going permanently offline#260

Merged
GeiserX merged 2 commits into
mainfrom
feat/tka-reauth-on-key-expiry
Jun 15, 2026
Merged

feat: auto re-auth on node-key expiry instead of going permanently offline#260
GeiserX merged 2 commits into
mainfrom
feat/tka-reauth-on-key-expiry

Conversation

@GeiserX

@GeiserX GeiserX commented Jun 15, 2026

Copy link
Copy Markdown
Owner

Problem

An auth-key-registered node went permanently offline at node-key expiry. When control delivered a self-node whose node key had expired, the runtime flipped to a terminal DeviceState::Expired and never re-registered. Go's ipnlocal instead runs doLogin: it generates a fresh node key, records the prior key as OldNodeKey, and re-registers with the stored auth key, so the node renews itself. This was the highest-value parity gap — a silent permanent failure for a long-lived node (an exit node renews on a schedule; this fork just died).

What this does

On an expired self node-key, if an auth key is retained, auto-reauth is enabled, and Tailnet Lock is not enforcing, the runtime now:

  1. sets the new transient DeviceState::Reauthenticating (non-interactive — no browse_to_url),
  2. rotates the node key (NodeState::rotate_node_key — fresh node key, prior key recorded as OldNodeKey; disco / machine / network-lock keys are left untouched),
  3. re-registers immediately over the live connection with OldNodeKey set, so control links the new key to the existing node identity (same node ID / tailnet IP).

Recovery to Running is automatic — the next good self-node flips it back at the same handler.

No tunnel flap (the SACRED invariant)

Only the node key rotates. The data plane keys WireGuard/disco sessions on the disco, machine, and per-peer keys — never the self node key — so a rotation does not re-key or tear down an established tunnel. The control connection re-registers in place (the netmap stream task is reused, not rebuilt). Verified in review: our_node_key is used only as the claimed-sender identity inside disco pings (sealed with the disco key), so a stale value after rotation is at worst a brief disco-ping re-discovery hiccup that self-heals on the next netmap — never a session teardown. (Refreshing magicsock's copy is a tracked follow-up TODO.)

Safety

  • Circuit breaker: a one-shot or rejected auth key can't re-register. Rather than loop forever, after 3 consecutive failed reauth attempts the node settles on terminal Expired (the counter resets on a successful re-register / good self-node). The reauth fires once per transition into Reauthenticating, never per netmap frame.
  • OldNodeKey anchor preserved: a second rotation before a register succeeds keeps the original pre-expiry key as the linkage anchor (capture-once); the anchor is cleared on a successful register so the next genuine cycle re-anchors fresh.
  • Tailnet Lock gate (mandatory): rotation is gated OFF while lock enforcement is active — an unsigned new key would lock the node out of locked peers. Gated on the same tka_authority cell the peer-trust chokepoint enforces against (single source of truth). TKA re-sign is a separate follow-up.
  • No interactive-prompt clobber: the existing MachineNotAuthorized -> NeedsLogin bridge no longer overwrites an in-flight auto-reauth.
  • reauth_on_expiry config (default true) lets an embedder opt out. DeviceState is now #[non_exhaustive] to future-proof embedder matches.

Tests

Pure expiry_action decision matrix (incl. the TKA veto), the reauth_circuit_step breaker (fire-once, count, trip-to-Expired-at-cap, reset-on-recovery, full one-shot-key episode), the OldNodeKey anchor lifecycle (two rotations preserve the original; rotation after a clear re-anchors), the bridge no-clobber guard, and Reauthenticating recovery in wait_for_running.

Gates (local, all green)

cargo test -p geiserx_ts_keys -p geiserx_ts_control -p geiserx_ts_runtime (345) · clippy -D warnings (default + --features tun) · cargo fmt --check · cargo doc (broken_intra_doc_links=deny) · cargo run -p checks. Reviewed by an independent correctness+parity and an architecture+safety pass; all findings fixed.

Signed-off-by: Sergio sergio@geiser.cloud

Created using Claude Code (Opus 4.8)

GeiserX added 2 commits June 15, 2026 16:05
An auth-key-registered node went permanently offline at node-key expiry: the
control runner flipped DeviceState::Expired terminally on a self-node whose key
expiry had passed and never re-registered. Go's doLogin instead rotates the node
key and re-registers with the stored auth key, recovering automatically. The
node-key rotation primitive existed but nothing in the runtime drove it.

Wire the recovery end to end:

ts_control gains Command::Reauth on the live map-poll command channel and an
AsyncControlClient::reauth method. A node-key rotation cannot run inside run_once
which only borrows the key state, so the Reauth arm breaks the poll loop and
signals run via a reauth_requested out-param threaded like received_frame. run
owns the NodeState, calls rotate_node_key, and reconnects immediately skipping
the backoff so the next run_once re-registers with the rotated key plus
OldNodeKey. Only the node key rotates; disco and machine keys are untouched so
established tunnels do not flap.

ts_runtime adds the transient DeviceState::Reauthenticating, treated like
Connecting by the waiters and carrying no browse_to_url since recovery is
non-interactive. The unconditional expired-to-Expired branch is replaced by a
pure expiry_action helper deciding Running, Reauthenticate, or Expired from
key_expired, has_auth_key, reauth_enabled, and tka_active. On Reauthenticate the
handler sets the state and sends Command::Reauth once on the transition in;
recovery to Running is automatic on the next good self-node at the same handler.

The Tailnet Lock gate is a hard safety constraint: rotating on a locked tailnet
would install an unsigned key and lock the node out of locked peers, so an active
lock vetoes auto-reauth and falls through to Expired. Auto-reauth also requires a
retained auth key and the reauth_on_expiry config opt-in default true, threaded
through the facade like ephemeral. A one-shot auth key degrades to the terminal
state, no regression.

The magicsock our_node_key refresh after rotation is deferred with a TODO at the
rotation site: the stale key only causes a brief disco-ping rediscovery hiccup
that self-heals on the next netmap, not a tunnel teardown, and wiring it is a
cross-actor change not worth a fragile shim for a self-healing transient.

Signed-off-by: GeiserX <9169332+GeiserX@users.noreply.github.com>
…eKey anchor

Address the review panel's interlocking defects on the tsr-ajvm auto-reauth
path, all faces of one missing piece — a bounded reauth sub-state-machine.

- Circuit breaker: ControlRunner now counts consecutive reauths that have not
  recovered the node (reauth_attempts, cap MAX_REAUTH_ATTEMPTS=3). The decision
  is the pure reauth_circuit_step: fire the one-shot Command::Reauth only on the
  transition INTO Reauthenticating, count each later expired self-node while
  still reauthenticating, and trip to terminal Expired at the cap. Resets to 0
  on a good (non-expired) self-node. A one-shot/consumed auth key now settles
  instead of sitting in Reauthenticating forever.
- Preserve the original anchor: NodeState::rotate_node_key only captures the
  current key as old_node_key when none is held, so a second rotation before a
  successful register keeps the ORIGINAL pre-expiry key control needs to link
  the rotation. New NodeState::clear_old_node_key releases the anchor; the
  client run loop calls it on a successful re-register so the next genuine
  rotation re-anchors on the now-current key.
- Bridge no-clobber: bridge_reauth_url_to_state no longer downgrades an
  in-flight Reauthenticating to NeedsLogin, which both avoided a misleading
  browse_to_url on a headless node and re-armed a second rotation.
- DeviceState is now #[non_exhaustive] (amortizes the Reauthenticating addition).
- Polish: reword the Reauthenticating 'transient' prose off the nonexistent
  DeviceState::is_permanent, and downgrade reauth()'s closed-channel log to debug
  (a benign teardown race).

Signed-off-by: GeiserX <9169332+GeiserX@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@GeiserX, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 47 minutes and 32 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d9595e48-26a3-4927-bb1a-53a2f45112f8

📥 Commits

Reviewing files that changed from the base of the PR and between ffa14a5 and 65b5a82.

📒 Files selected for processing (6)
  • src/config.rs
  • ts_control/src/config.rs
  • ts_control/src/tokio/client.rs
  • ts_keys/src/keystate.rs
  • ts_runtime/src/control_runner.rs
  • ts_runtime/src/device_state.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tka-reauth-on-key-expiry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@GeiserX GeiserX merged commit fea03ab into main Jun 15, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant