Skip to content

fix(scout): skip TPM2_Clear on hosts with no TPM device#2703

Open
s3rj1k wants to merge 1 commit into
NVIDIA:mainfrom
s3rj1k:tpm
Open

fix(scout): skip TPM2_Clear on hosts with no TPM device#2703
s3rj1k wants to merge 1 commit into
NVIDIA:mainfrom
s3rj1k:tpm

Conversation

@s3rj1k

@s3rj1k s3rj1k commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

clear_tpm ran TPM2_Clear unconditionally during discovery deprovision (run_no_api -> clear_tpm), which hard-fails on a host with no TPM (/dev/tpmrm0 absent): the scout logs it non-fatally and re-polls, so discovery never reports DiscoverMachine and the machine loops in WaitingForDiscovery. register::run already tolerates no-TPM hosts via is_dpu (absent EK cert gates all attestation), so clear_tpm was the only unconditional TPM call blocking discovery. Map the TCTI to its device file and skip the clear when that device is absent; a present TPM that fails to clear stays a hard error.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

clear_tpm ran TPM2_Clear unconditionally during discovery deprovision (run_no_api -> clear_tpm), which hard-fails on a host with no TPM (/dev/tpmrm0 absent): the scout logs it non-fatally and re-polls, so discovery never reports DiscoverMachine and the machine loops in WaitingForDiscovery. register::run already tolerates no-TPM hosts via is_dpu (absent EK cert gates all attestation), so clear_tpm was the only unconditional TPM call blocking discovery. Map the TCTI to its device file and skip the clear when that device is absent; a present TPM that fails to clear stays a hard error.

Signed-off-by: s3rj1k <evasive.gyron@gmail.com>
@s3rj1k s3rj1k requested a review from a team as a code owner June 19, 2026 15:10
@copy-pr-bot

copy-pr-bot Bot commented Jun 19, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • Bug Fixes
    • Improved TPM clearing logic to gracefully handle systems without TPM device files. The operation now skips clearing on systems lacking a TPM device and logs a warning, preventing unnecessary errors.

Walkthrough

crates/scout/src/tpm.rs gains a new internal helper tpm_device_path that parses a device:/... TCTI string into a filesystem path, returning None for socket-based TCTIs. clear_tpm is updated to use this helper and skip the TPM2_Clear operation with an Ok(()) early-return when the device file is not present on the host.

Changes

TPM Device Existence Guard

Layer / File(s) Summary
tpm_device_path helper and clear_tpm early-return guard
crates/scout/src/tpm.rs
Adds tpm_device_path to extract a filesystem path from device: TCTI strings. Updates clear_tpm to check for the device file's existence; logs a warning and returns Ok(()) immediately when the file is absent, bypassing set_tpm_max_auth_fail and TPM2_Clear. Existing execution path is preserved when the device is present or when the TCTI is socket-based.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the primary change—skipping TPM2_Clear on hosts without a TPM device—directly aligned with the changeset.
Description check ✅ Passed The description is well-structured, clearly relating the problem (unconditional TPM2_Clear failure), the solution (conditional device check), and context (alignment with existing register::run behavior).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/scout/src/tpm.rs (2)

74-74: ⚡ Quick win

Use structured tracing fields in the warning log.

Prefer key/value fields (e.g., device=%dev) instead of interpolation to keep logs logfmt-friendly and queryable.

As per coding guidelines, "All services should emit logs in 'logfmt' syntax" and "prefer placing common fields as attributes passed to tracing functions instead of using string interpolation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` at line 74, The tracing::warn! call in the clear_tpm
function is using string interpolation to embed the dev variable into the log
message, which is not logfmt-friendly. Replace the string interpolation with
structured tracing fields by moving the dev variable outside the message string
and passing it as a key/value field argument to tracing::warn! using the format
device=%dev, keeping only descriptive text in the message string itself.

Source: Coding guidelines


60-65: ⚡ Quick win

Add table-driven tests for TCTI parsing and skip/clear branching.

tpm_device_path and the new early-return guard are input-mapping logic and should be covered with table-driven cases (device:/dev/tpmrm0, plain /dev/tpmrm0, socket TCTIs, malformed strings, absent/present device outcomes).

As per coding guidelines, "Prefer table-driven tests for any function that maps inputs to outputs, errors, or other observable results" and "Reach for a table whenever two or more tests call the same operation with different inputs."

Also applies to: 67-76

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 60 - 65, The tpm_device_path function
lacks comprehensive test coverage for its input-to-output mapping logic. Add
table-driven tests that cover multiple input scenarios including device TCTI
paths like device:/dev/tpmrm0, plain /dev/tpmrm0 paths, socket TCTIs that should
return None, malformed strings, and cases with absent/present device outcomes.
Apply the same table-driven testing approach to the early-return guard logic
(lines 67-76) as these are also input-mapping functions that benefit from
systematic case coverage per coding guidelines.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/scout/src/tpm.rs`:
- Around line 71-75: In the TPM device check within the function around line
71-75, replace the `Path::exists()` call with `try_exists()` to properly
distinguish between a missing device (Ok(false)) and actual IO/permission errors
(Err), then propagate any errors instead of silently returning Ok. Additionally,
refactor the tracing::warn! call to use structured logging conventions by moving
the device path from string interpolation {dev} into a tracing field using the
format device=%dev. Finally, add comprehensive table-driven test coverage for
the tpm_device_path() function to validate its input-output mapping behavior
according to the project's STYLE_GUIDE.md testing conventions.

---

Nitpick comments:
In `@crates/scout/src/tpm.rs`:
- Line 74: The tracing::warn! call in the clear_tpm function is using string
interpolation to embed the dev variable into the log message, which is not
logfmt-friendly. Replace the string interpolation with structured tracing fields
by moving the dev variable outside the message string and passing it as a
key/value field argument to tracing::warn! using the format device=%dev, keeping
only descriptive text in the message string itself.
- Around line 60-65: The tpm_device_path function lacks comprehensive test
coverage for its input-to-output mapping logic. Add table-driven tests that
cover multiple input scenarios including device TCTI paths like
device:/dev/tpmrm0, plain /dev/tpmrm0 paths, socket TCTIs that should return
None, malformed strings, and cases with absent/present device outcomes. Apply
the same table-driven testing approach to the early-return guard logic (lines
67-76) as these are also input-mapping functions that benefit from systematic
case coverage per coding guidelines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15fc4475-46a0-41d5-b440-70605589fa0c

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and a13ff6c.

📒 Files selected for processing (1)
  • crates/scout/src/tpm.rs

Comment thread crates/scout/src/tpm.rs
Comment on lines +71 to +75
if let Some(dev) = tpm_device_path(tpm_path)
&& !std::path::Path::new(dev).exists()
{
tracing::warn!("clear_tpm: TPM device {dev} absent; skipping TPM2_Clear (host has no TPM)");
return Ok(());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "tpm.rs" -type f 2>/dev/null | head -20

Repository: NVIDIA/infra-controller

Length of output: 49


🏁 Script executed:

# Locate the file in the repository
find . -path "*/crates/scout/src/tpm.rs" -type f 2>/dev/null

Repository: NVIDIA/infra-controller

Length of output: 93


🏁 Script executed:

# Check repository structure
ls -la crates/ 2>/dev/null | head -20

Repository: NVIDIA/infra-controller

Length of output: 1245


🏁 Script executed:

# Find the scout crate
fd -i scout --type d

Repository: NVIDIA/infra-controller

Length of output: 323


🏁 Script executed:

#!/bin/bash
# Search for the scout crate and tpm.rs file
fd -i "tpm\.rs" -type f

Repository: NVIDIA/infra-controller

Length of output: 238


🏁 Script executed:

#!/bin/bash
# List crates directory
ls -1 crates/ 2>/dev/null || echo "crates dir not found"

Repository: NVIDIA/infra-controller

Length of output: 1316


🏁 Script executed:

#!/bin/bash
# Search for all Rust files containing tpm_device_path or clear_tpm
rg "tpm_device_path|fn clear_tpm" -l --type rust

Repository: NVIDIA/infra-controller

Length of output: 137


🏁 Script executed:

cat -n crates/scout/src/tpm.rs | head -100

Repository: NVIDIA/infra-controller

Length of output: 4662


🏁 Script executed:

# Get more context around the function
cat -n crates/scout/src/tpm.rs | sed -n '50,90p'

Repository: NVIDIA/infra-controller

Length of output: 2199


🏁 Script executed:

# Check function signature and error types
rg "fn clear_tpm|type.*Error|enum.*Error" crates/scout/src/tpm.rs -A 3

Repository: NVIDIA/infra-controller

Length of output: 245


🏁 Script executed:

# Check what Rust version is pinned
cat rust-toolchain.toml

Repository: NVIDIA/infra-controller

Length of output: 860


🏁 Script executed:

# Check if try_exists is used elsewhere in the codebase
rg "try_exists" -l

Repository: NVIDIA/infra-controller

Length of output: 160


🏁 Script executed:

# Check if Path::exists() vs try_exists() is documented or discussed
rg "Path::exists|try_exists" crates/scout/src/ -B 2 -A 2

Repository: NVIDIA/infra-controller

Length of output: 49


🏁 Script executed:

# Look for tests in the tpm.rs file
rg "#\[test\]|#\[cfg\(test\)\]" crates/scout/src/tpm.rs

Repository: NVIDIA/infra-controller

Length of output: 49


🏁 Script executed:

# Check if there are tests for tpm_device_path elsewhere
rg "tpm_device_path" --type rust

Repository: NVIDIA/infra-controller

Length of output: 217


🏁 Script executed:

# Check the logging guidelines again from STYLE_GUIDE.md
cat STYLE_GUIDE.md | grep -A 20 "## Logging"

Repository: NVIDIA/infra-controller

Length of output: 749


🏁 Script executed:

# Check if there's a specific guideline on structured fields
rg "structured.*logfmt|logfmt.*structured" STYLE_GUIDE.md -B 2 -A 2

Repository: NVIDIA/infra-controller

Length of output: 251


🏁 Script executed:

# Look at how other warn logs are structured in this file
rg "tracing::warn" crates/scout/src/tpm.rs -B 1 -A 1

Repository: NVIDIA/infra-controller

Length of output: 197


🏁 Script executed:

# Examine other scout files for logging patterns
rg "tracing::(warn|error|info)" crates/scout/src/ -A 1 | head -40

Repository: NVIDIA/infra-controller

Length of output: 2059


Address filesystem metadata error handling and structured logging in TPM device check.

Path::exists() conflates "device not present" with IO/permission errors; on transient access failures, it returns false, causing TPM2_Clear to skip silently in deprovision. Use try_exists() to distinguish Ok(false) (absent) from Err(_) (inaccessible), and propagate errors.

Additionally, line 74 violates structured logging conventions; move the device path to a tracing field (device=%dev) rather than string interpolation for proper logfmt formatting.

The tpm_device_path() function is a pure input–output mapper (parser) and should have comprehensive table-driven test coverage per STYLE_GUIDE.md guidance.

Suggested fix
-    if let Some(dev) = tpm_device_path(tpm_path)
-        && !std::path::Path::new(dev).exists()
-    {
-        tracing::warn!("clear_tpm: TPM device {dev} absent; skipping TPM2_Clear (host has no TPM)");
-        return Ok(());
-    }
+    if let Some(dev) = tpm_device_path(tpm_path) {
+        match std::path::Path::new(dev).try_exists() {
+            Ok(false) => {
+                tracing::warn!(device=%dev, "clear_tpm: TPM device absent; skipping TPM2_Clear");
+                return Ok(());
+            }
+            Ok(true) => {}
+            Err(e) => {
+                return Err(CarbideClientError::TpmError(format!(
+                    "Failed to check TPM device path {dev}: {e}"
+                )));
+            }
+        }
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scout/src/tpm.rs` around lines 71 - 75, In the TPM device check within
the function around line 71-75, replace the `Path::exists()` call with
`try_exists()` to properly distinguish between a missing device (Ok(false)) and
actual IO/permission errors (Err), then propagate any errors instead of silently
returning Ok. Additionally, refactor the tracing::warn! call to use structured
logging conventions by moving the device path from string interpolation {dev}
into a tracing field using the format device=%dev. Finally, add comprehensive
table-driven test coverage for the tpm_device_path() function to validate its
input-output mapping behavior according to the project's STYLE_GUIDE.md testing
conventions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant