Skip to content

feat(site-explorer): fall back to ACPowercycle when a vendor refuses PowerCycle#2718

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2635
Open

feat(site-explorer): fall back to ACPowercycle when a vendor refuses PowerCycle#2718
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2635

Conversation

@chet

@chet chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

A queued BlueField NIC-mode change only applies after a real power cycle, and site-explorer drives one for every vendor once it sees a DPU in the wrong mode. But PowerCycle (Redfish ComputerSystem.Reset) is implemented only by Dell and the DPU BMCs -- other host vendors, and Vikings, refuse it -- so on those hosts the reset never landed and the flip parked on ManualPowerCycleRequired.

redfish_powercycle now falls back to a cold ACPowercycle (implemented by the HPE/Lenovo/Supermicro/GBx00 wrappers) when PowerCycle is refused, so the change applies without an operator. If both resets are refused the error still surfaces as ManualPowerCycleRequired, unchanged.

The mock endpoint explorer gains per-action failure injection, and a new test drives a non-Dell host whose PowerCycle is refused and asserts the ACPowercycle fallback fires (after PowerCycle is tried first).

Supports #2635

@chet

chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@copy-pr-bot

copy-pr-bot Bot commented Jun 20, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dd451e19-eec7-4ea6-8967-38d1e3dfa3de

📥 Commits

Reviewing files that changed from the base of the PR and between acb8b70 and 83a2f4f.

📒 Files selected for processing (4)
  • crates/api-core/src/tests/common/api_fixtures/mod.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/src/test_support/mock_endpoint_explorer.rs
  • crates/site-explorer/tests/site_explorer.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs
  • crates/site-explorer/src/test_support/mock_endpoint_explorer.rs

Summary by CodeRabbit

Release Notes

  • Improvements

    • Enhanced NIC-mode power cycle handling with a resilient two-step flow: tries vendor-specific PowerCycle first, then falls back to ACPowercycle if the initial command is refused.
  • Tests

    • Added an integration test to confirm fallback behavior when power-control commands are rejected, including verification of call order (PowerCycle before ACPowercycle).

Walkthrough

SiteExplorer::redfish_powercycle is changed from a single PowerCycle call to a two-step strategy: attempt PowerCycle, and on error fall back to ACPowercycle. Surrounding comments are updated to reflect this escalation. MockEndpointExplorer gains a power_control_failures list and fail_power_control helper to simulate vendor refusal, and a new integration test validates the fallback ordering end-to-end.

Changes

PowerCycle → ACPowercycle Fallback

Layer / File(s) Summary
redfish_powercycle fallback logic and updated comments
crates/site-explorer/src/lib.rs
redfish_powercycle now attempts PowerCycle first; on error it logs a warning and retries with ACPowercycle, propagating failure only if the fallback also fails. Adjacent comments in identify_managed_hosts updated to reflect the escalation path and mid-flight/stuck detection semantics.
Mock power-control rejection infrastructure
crates/site-explorer/src/test_support/mock_endpoint_explorer.rs, crates/api-core/src/tests/common/api_fixtures/mod.rs
Adds power_control_failures: Arc<Mutex<Vec<SystemPowerControl>>> field to MockEndpointExplorer, a fail_power_control helper to register rejected actions, and rejection logic inside redfish_power_control returning EndpointExplorationError::Unreachable for configured actions. Test fixture wiring updated to initialize the new field.
Integration test: fallback ordering
crates/site-explorer/tests/site_explorer.rs
New #[sqlx_test] provisions a NicMode-declared host with a DPU, refuses PowerCycle via fail_power_control, runs two site-explorer iterations to reach the queued reset path, and asserts PowerCycle precedes ACPowercycle in recorded power-control calls.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the primary change: implementing a fallback from PowerCycle to ACPowercycle when vendors refuse the initial reset action.
Description check ✅ Passed The description provides clear context on the business problem, technical solution, and testing approach, all directly aligned with the changeset modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
crates/site-explorer/tests/site_explorer.rs (1)

2614-2715: ⚡ Quick win

Prefer a table-driven test for the PowerCycle/ACPowercycle matrix

This test and test_site_explorer_power_cycles_non_dell_host_to_apply_nic_mode exercise the same operation with variant outcomes. Consolidating into one table-driven case set would reduce duplicated setup and make future vendor/reset variants easier to add consistently.

As per coding guidelines, “Reach for a table whenever two or more tests call the same operation with different inputs. Do not force genuinely-distinct tests into a table.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/tests/site_explorer.rs` around lines 2614 - 2715, The
test `test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused`
and the related test
`test_site_explorer_power_cycles_non_dell_host_to_apply_nic_mode` share
significant duplicated setup and test logic for exercising the same operation
with different vendor/reset outcomes. Refactor these into a single table-driven
test by extracting the varying parameters (such as BMC vendor type, which power
control method is attempted, and which fallback is expected) into a test case
struct or tuple, then iterate through the test cases to reduce code duplication
and make it easier to add new vendor/reset variants in the future.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 2560-2573: The current code executes the ACPowercycle fallback for
any error returned by the redfish_power_control call with
SystemPowerControl::PowerCycle, but the log message indicates this should only
happen when PowerCycle is explicitly refused or unsupported. Modify the error
handling to inspect the power_cycle_err and only proceed with the ACPowercycle
fallback if the error specifically indicates action refusal or unsupported
operations; otherwise, return the original error to avoid masking transient,
transport, or authentication failures with an unintended cold cycle.

In `@crates/site-explorer/tests/site_explorer.rs`:
- Around line 2689-2712: The assertions computing powercycle_pos and
acpowercycle_pos are iterating over all power_calls without filtering by a
specific host_bmc_ip, which can cause false positives if other endpoints issue
power actions during the test. Filter the power_calls iterator to only include
calls for the target host_bmc_ip before computing the positions using the
position() method. This filtering should be applied to both the powercycle_pos
and acpowercycle_pos iterator chains to ensure the assertions only validate the
fallback order for the specific endpoint being tested.

---

Nitpick comments:
In `@crates/site-explorer/tests/site_explorer.rs`:
- Around line 2614-2715: The test
`test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused` and the
related test `test_site_explorer_power_cycles_non_dell_host_to_apply_nic_mode`
share significant duplicated setup and test logic for exercising the same
operation with different vendor/reset outcomes. Refactor these into a single
table-driven test by extracting the varying parameters (such as BMC vendor type,
which power control method is attempted, and which fallback is expected) into a
test case struct or tuple, then iterate through the test cases to reduce code
duplication and make it easier to add new vendor/reset variants in the future.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1ab6f2c4-37d5-4e4f-b541-fedaf3f8d0b8

📥 Commits

Reviewing files that changed from the base of the PR and between 79e54c0 and acb8b70.

📒 Files selected for processing (3)
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/src/test_support/mock_endpoint_explorer.rs
  • crates/site-explorer/tests/site_explorer.rs

Comment on lines +2560 to 2573
if let Err(power_cycle_err) = self
.redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle)
.await
{
tracing::warn!(
%bmc_ip_address,
error = %power_cycle_err,
"PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change",
);
self.redfish_power_control(
bmc_ip_address,
libredfish::SystemPowerControl::ACPowercycle,
)
.await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Limit fallback to explicit “PowerCycle refused/unsupported” failures

At Line 2560, fallback currently runs for any PowerCycle error. That includes transport/auth/transient failures, but the log at Line 2567 labels it as refusal and then executes a cold cycle anyway. Return the original error unless the failure is explicitly an action-refusal signal.

Proposed fix
 async fn redfish_powercycle(&self, bmc_ip_address: IpAddr) -> SiteExplorerResult<()> {
-    if let Err(power_cycle_err) = self
-        .redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle)
-        .await
-    {
-        tracing::warn!(
-            %bmc_ip_address,
-            error = %power_cycle_err,
-            "PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change",
-        );
-        self.redfish_power_control(
-            bmc_ip_address,
-            libredfish::SystemPowerControl::ACPowercycle,
-        )
-        .await?;
-    }
+    match self
+        .redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle)
+        .await
+    {
+        Ok(()) => {}
+        Err(power_cycle_err) if is_powercycle_refused(&power_cycle_err) => {
+            tracing::warn!(
+                %bmc_ip_address,
+                error = %power_cycle_err,
+                "PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change",
+            );
+            self.redfish_power_control(
+                bmc_ip_address,
+                libredfish::SystemPowerControl::ACPowercycle,
+            )
+            .await?;
+        }
+        Err(err) => return Err(err),
+    }

     let mut txn = self.txn_begin().await?;
     db::explored_endpoints::set_last_redfish_powercycle(bmc_ip_address, &mut txn).await?;
     Ok(txn.commit().await?)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 2560 - 2573, The current code
executes the ACPowercycle fallback for any error returned by the
redfish_power_control call with SystemPowerControl::PowerCycle, but the log
message indicates this should only happen when PowerCycle is explicitly refused
or unsupported. Modify the error handling to inspect the power_cycle_err and
only proceed with the ACPowercycle fallback if the error specifically indicates
action refusal or unsupported operations; otherwise, return the original error
to avoid masking transient, transport, or authentication failures with an
unintended cold cycle.

Comment on lines +2689 to +2712
let power_calls = explorer
.endpoint_explorer()
.redfish_power_control_calls
.lock()
.unwrap();
let powercycle_pos = power_calls
.iter()
.position(|(_, action)| matches!(action, libredfish::SystemPowerControl::PowerCycle));
let acpowercycle_pos = power_calls
.iter()
.position(|(_, action)| matches!(action, libredfish::SystemPowerControl::ACPowercycle));
assert!(
powercycle_pos.is_some(),
"expected `PowerCycle` to be attempted; power calls so far: {power_calls:?}"
);
assert!(
acpowercycle_pos.is_some(),
"expected the `ACPowercycle` fallback after `PowerCycle` was refused; power calls so far: {power_calls:?}"
);
// A fallback is only correct if `PowerCycle` is the one tried first.
assert!(
powercycle_pos < acpowercycle_pos,
"expected `PowerCycle` (at {powercycle_pos:?}) before the `ACPowercycle` fallback (at {acpowercycle_pos:?}); power calls: {power_calls:?}"
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Scope fallback-order assertions to the target host IP

At Line 2694 and Line 2697, positions are computed across all power-control calls. If other endpoints start issuing power actions in this flow, this can produce false positives/negatives. Filter by host_bmc_ip before computing positions.

Proposed fix
     let power_calls = explorer
         .endpoint_explorer()
         .redfish_power_control_calls
         .lock()
         .unwrap();
-    let powercycle_pos = power_calls
-        .iter()
+    let host_power_calls = power_calls
+        .iter()
+        .filter(|(addr, _)| addr.ip() == host_bmc_ip)
+        .collect::<Vec<_>>();
+    let powercycle_pos = host_power_calls
+        .iter()
         .position(|(_, action)| matches!(action, libredfish::SystemPowerControl::PowerCycle));
-    let acpowercycle_pos = power_calls
-        .iter()
+    let acpowercycle_pos = host_power_calls
+        .iter()
         .position(|(_, action)| matches!(action, libredfish::SystemPowerControl::ACPowercycle));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/tests/site_explorer.rs` around lines 2689 - 2712, The
assertions computing powercycle_pos and acpowercycle_pos are iterating over all
power_calls without filtering by a specific host_bmc_ip, which can cause false
positives if other endpoints issue power actions during the test. Filter the
power_calls iterator to only include calls for the target host_bmc_ip before
computing the positions using the position() method. This filtering should be
applied to both the powercycle_pos and acpowercycle_pos iterator chains to
ensure the assertions only validate the fallback order for the specific endpoint
being tested.

@chet chet marked this pull request as ready for review June 21, 2026 04:23
@chet chet requested a review from a team as a code owner June 21, 2026 04:23
…PowerCycle

A queued BlueField NIC-mode change only applies after a real power cycle, and
site-explorer drives one for every vendor once it sees a DPU in the wrong mode.
But `PowerCycle` (Redfish `ComputerSystem.Reset`) is implemented only by Dell
and the DPU BMCs -- other host vendors, and Vikings, refuse it -- so on those
hosts the reset never landed and the flip parked on `ManualPowerCycleRequired`.

`redfish_powercycle` now falls back to a cold `ACPowercycle` (implemented by the
HPE/Lenovo/Supermicro/GBx00 wrappers) when `PowerCycle` is refused, so the
change applies without an operator. If both resets are refused the error still
surfaces as `ManualPowerCycleRequired`, unchanged.

The mock endpoint explorer gains per-action failure injection, and a new test
drives a non-Dell host whose `PowerCycle` is refused and asserts the
`ACPowercycle` fallback fires.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@github-actions

github-actions Bot commented Jun 21, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 704 34 183 258 35 194
machine_validation 704 34 183 258 35 194
nvmetal-carbide 704 34 183 258 35 194
TOTAL 2382 108 572 879 111 712

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant