Skip to content

Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074

Merged
dwoz merged 3 commits into
saltstack:3006.xfrom
co-cy:security-get-tok-distinguish-exceptions
May 21, 2026
Merged

Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074
dwoz merged 3 commits into
saltstack:3006.xfrom
co-cy:security-get-tok-distinguish-exceptions

Conversation

@co-cy
Copy link
Copy Markdown

@co-cy co-cy commented May 6, 2026

What does this PR do?

Splits the single except in salt.auth.LoadAuth.get_tok into two cases so that a transient backend error no longer looks the same as a corrupt token blob.

Before, when the eauth token backend raised, the exception either propagated up the stack or was treated as "this token is bad". For sites with a network-backed token store (Redis, NFS-shared localfs), a single backend hiccup — Redis container restart, NFS export blip, full disk — would therefore log every authenticated user out, because callers that recover by calling rm_token (or wrap the call in except Exception and delete on any failure) treated a single failed read as "drop this token from the store".

After this PR:

  • SaltDeserializationError — the stored blob is corrupt and cannot be parsed. The token is permanently unusable, so removing it is correct: leaving it around would make every subsequent get_tok for the same id keep failing. (Existing behaviour, kept.)
  • OSError (covers IOError) — transient backend error. The token itself is fine; do not delete it. Return {} so the caller treats this request as not-authenticated, and the next request retries against the backend.
  • Expired (deserialized successfully but past expire) — remove. (Existing behaviour, kept.)

What issues does this PR fix or reference?

Fixes #69073

Previous Behavior

get_tok either propagated OSError to the caller or, where callers recovered by calling rm_token, deleted a perfectly valid token because the backend briefly couldn't be read. A 5-second Redis reconnect logged the entire fleet out.

New Behavior

A backend read failure is classified by cause:

  • corrupt blob → remove the token (it can never authenticate again)
  • transient I/O error → keep the token, return empty for this request, let the next request retry

Merge requirements satisfied?

Tests written?

Three behavioural tests in tests/pytests/unit/auth/test_auth.py:

  • test_get_tok_returns_empty_and_keeps_token_on_io_error — backend raises OSError. get_tok returns {} and rm_token is not called. This test fails on the previous implementation and is the regression this PR exists to prevent.
  • test_get_tok_removes_token_on_deserialization_error — corrupt blob → rm_token is called.
  • test_get_tok_removes_expired_token — expired token → rm_token is called.

Anything else reviewers should know?

  • Behaviour for already-correct paths (corrupt blob, expired token) is preserved unchanged.
  • No public API change. Same signature, same return type, same caller contract for "this token is not currently usable" ({}).

@co-cy co-cy requested a review from a team as a code owner May 6, 2026 19:02
When the eauth token backend raised any exception, the previous code
either propagated it or treated it as "remove the token". For Redis
or other network-backed token stores, a transient backend error
(connection drop, NFS hang, full disk) would therefore log every
authenticated user out -- ``rm_token`` deletes valid tokens that
just briefly could not be read.

Split the ``except`` into two cases:

* ``SaltDeserializationError`` -- the on-disk blob is corrupt and
  cannot be parsed. The token is permanently unusable, so removing
  it is correct: leaving it around would make every subsequent
  ``get_tok`` for the same id keep failing.
* ``OSError`` (covers ``IOError``) -- transient backend error. The
  token itself is fine; do NOT delete it. Return ``{}`` so the
  caller treats this request as not-authenticated, and the next
  request retries against the backend.

Three behavioural tests guard each branch and the existing expired
path. The IOError test fails on the previous implementation, which
is the regression this change exists to prevent.

Refs: saltstack#69073
Copy link
Copy Markdown
Contributor

@twangboy twangboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the Lint failures

Comment thread salt/auth/__init__.py Outdated
CI saltpylint flagged `salt/auth/__init__.py:253` with
`W1404(implicit-str-concat)`. The two adjacent string literals on
that line are the result of black collapsing what was originally a
two-line string onto a single line, which saltpylint reads as a
missing comma. Merge them into a single string literal; the message
remains under 88 columns so black leaves it alone.

Refs: saltstack#69073
Address review feedback from @bdrx312 on PR saltstack#69074: the original
`log.warning` calls in the SaltDeserializationError and OSError
handlers in `LoadAuth.get_tok` did not capture the exception at all,
making troubleshooting harder. Two combined changes:

1. Add the exception class+message inline via `%r` so each
   occurrence is one greppable line that names which subclass
   (e.g. `ConnectionResetError`) and its message produced the
   warning -- enough to triage without a full traceback.
2. Add a companion `log.debug(..., exc_info=True)` per @bdrx312's
   suggestion. Operators who need the full traceback for a specific
   intermittent failure can raise the level to DEBUG and see it.
   Costs nothing at the default WARNING level because logging skips
   exc_info formatting when the level is disabled.

This avoids GBs/hour of stack frames during a flapping Redis or NFS
outage (the original `exc_info=True` approach considered earlier)
while keeping full diagnostic depth one log-level change away.

Refs: saltstack#69073
@dwoz dwoz merged commit f780a7b into saltstack:3006.x May 21, 2026
812 of 817 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: LoadAuth.get_tok deletes valid tokens on transient backend errors, logging users out on every Redis hiccup

4 participants