feat(site-explorer): fall back to ACPowercycle when a vendor refuses PowerCycle#2718
feat(site-explorer): fall back to ACPowercycle when a vendor refuses PowerCycle#2718chet wants to merge 1 commit into
Conversation
|
@coderabbitai review |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)
Summary by CodeRabbitRelease Notes
Walkthrough
ChangesPowerCycle → ACPowercycle Fallback
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
crates/site-explorer/tests/site_explorer.rs (1)
2614-2715: ⚡ Quick winPrefer a table-driven test for the PowerCycle/ACPowercycle matrix
This test and
test_site_explorer_power_cycles_non_dell_host_to_apply_nic_modeexercise the same operation with variant outcomes. Consolidating into one table-driven case set would reduce duplicated setup and make future vendor/reset variants easier to add consistently.As per coding guidelines, “Reach for a table whenever two or more tests call the same operation with different inputs. Do not force genuinely-distinct tests into a table.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/tests/site_explorer.rs` around lines 2614 - 2715, The test `test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused` and the related test `test_site_explorer_power_cycles_non_dell_host_to_apply_nic_mode` share significant duplicated setup and test logic for exercising the same operation with different vendor/reset outcomes. Refactor these into a single table-driven test by extracting the varying parameters (such as BMC vendor type, which power control method is attempted, and which fallback is expected) into a test case struct or tuple, then iterate through the test cases to reduce code duplication and make it easier to add new vendor/reset variants in the future.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 2560-2573: The current code executes the ACPowercycle fallback for
any error returned by the redfish_power_control call with
SystemPowerControl::PowerCycle, but the log message indicates this should only
happen when PowerCycle is explicitly refused or unsupported. Modify the error
handling to inspect the power_cycle_err and only proceed with the ACPowercycle
fallback if the error specifically indicates action refusal or unsupported
operations; otherwise, return the original error to avoid masking transient,
transport, or authentication failures with an unintended cold cycle.
In `@crates/site-explorer/tests/site_explorer.rs`:
- Around line 2689-2712: The assertions computing powercycle_pos and
acpowercycle_pos are iterating over all power_calls without filtering by a
specific host_bmc_ip, which can cause false positives if other endpoints issue
power actions during the test. Filter the power_calls iterator to only include
calls for the target host_bmc_ip before computing the positions using the
position() method. This filtering should be applied to both the powercycle_pos
and acpowercycle_pos iterator chains to ensure the assertions only validate the
fallback order for the specific endpoint being tested.
---
Nitpick comments:
In `@crates/site-explorer/tests/site_explorer.rs`:
- Around line 2614-2715: The test
`test_site_explorer_falls_back_to_ac_powercycle_when_powercycle_refused` and the
related test `test_site_explorer_power_cycles_non_dell_host_to_apply_nic_mode`
share significant duplicated setup and test logic for exercising the same
operation with different vendor/reset outcomes. Refactor these into a single
table-driven test by extracting the varying parameters (such as BMC vendor type,
which power control method is attempted, and which fallback is expected) into a
test case struct or tuple, then iterate through the test cases to reduce code
duplication and make it easier to add new vendor/reset variants in the future.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1ab6f2c4-37d5-4e4f-b541-fedaf3f8d0b8
📒 Files selected for processing (3)
crates/site-explorer/src/lib.rscrates/site-explorer/src/test_support/mock_endpoint_explorer.rscrates/site-explorer/tests/site_explorer.rs
| if let Err(power_cycle_err) = self | ||
| .redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle) | ||
| .await | ||
| { | ||
| tracing::warn!( | ||
| %bmc_ip_address, | ||
| error = %power_cycle_err, | ||
| "PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change", | ||
| ); | ||
| self.redfish_power_control( | ||
| bmc_ip_address, | ||
| libredfish::SystemPowerControl::ACPowercycle, | ||
| ) | ||
| .await?; |
There was a problem hiding this comment.
Limit fallback to explicit “PowerCycle refused/unsupported” failures
At Line 2560, fallback currently runs for any PowerCycle error. That includes transport/auth/transient failures, but the log at Line 2567 labels it as refusal and then executes a cold cycle anyway. Return the original error unless the failure is explicitly an action-refusal signal.
Proposed fix
async fn redfish_powercycle(&self, bmc_ip_address: IpAddr) -> SiteExplorerResult<()> {
- if let Err(power_cycle_err) = self
- .redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle)
- .await
- {
- tracing::warn!(
- %bmc_ip_address,
- error = %power_cycle_err,
- "PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change",
- );
- self.redfish_power_control(
- bmc_ip_address,
- libredfish::SystemPowerControl::ACPowercycle,
- )
- .await?;
- }
+ match self
+ .redfish_power_control(bmc_ip_address, libredfish::SystemPowerControl::PowerCycle)
+ .await
+ {
+ Ok(()) => {}
+ Err(power_cycle_err) if is_powercycle_refused(&power_cycle_err) => {
+ tracing::warn!(
+ %bmc_ip_address,
+ error = %power_cycle_err,
+ "PowerCycle refused; falling back to ACPowercycle to apply the queued NIC mode change",
+ );
+ self.redfish_power_control(
+ bmc_ip_address,
+ libredfish::SystemPowerControl::ACPowercycle,
+ )
+ .await?;
+ }
+ Err(err) => return Err(err),
+ }
let mut txn = self.txn_begin().await?;
db::explored_endpoints::set_last_redfish_powercycle(bmc_ip_address, &mut txn).await?;
Ok(txn.commit().await?)
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/site-explorer/src/lib.rs` around lines 2560 - 2573, The current code
executes the ACPowercycle fallback for any error returned by the
redfish_power_control call with SystemPowerControl::PowerCycle, but the log
message indicates this should only happen when PowerCycle is explicitly refused
or unsupported. Modify the error handling to inspect the power_cycle_err and
only proceed with the ACPowercycle fallback if the error specifically indicates
action refusal or unsupported operations; otherwise, return the original error
to avoid masking transient, transport, or authentication failures with an
unintended cold cycle.
| let power_calls = explorer | ||
| .endpoint_explorer() | ||
| .redfish_power_control_calls | ||
| .lock() | ||
| .unwrap(); | ||
| let powercycle_pos = power_calls | ||
| .iter() | ||
| .position(|(_, action)| matches!(action, libredfish::SystemPowerControl::PowerCycle)); | ||
| let acpowercycle_pos = power_calls | ||
| .iter() | ||
| .position(|(_, action)| matches!(action, libredfish::SystemPowerControl::ACPowercycle)); | ||
| assert!( | ||
| powercycle_pos.is_some(), | ||
| "expected `PowerCycle` to be attempted; power calls so far: {power_calls:?}" | ||
| ); | ||
| assert!( | ||
| acpowercycle_pos.is_some(), | ||
| "expected the `ACPowercycle` fallback after `PowerCycle` was refused; power calls so far: {power_calls:?}" | ||
| ); | ||
| // A fallback is only correct if `PowerCycle` is the one tried first. | ||
| assert!( | ||
| powercycle_pos < acpowercycle_pos, | ||
| "expected `PowerCycle` (at {powercycle_pos:?}) before the `ACPowercycle` fallback (at {acpowercycle_pos:?}); power calls: {power_calls:?}" | ||
| ); |
There was a problem hiding this comment.
Scope fallback-order assertions to the target host IP
At Line 2694 and Line 2697, positions are computed across all power-control calls. If other endpoints start issuing power actions in this flow, this can produce false positives/negatives. Filter by host_bmc_ip before computing positions.
Proposed fix
let power_calls = explorer
.endpoint_explorer()
.redfish_power_control_calls
.lock()
.unwrap();
- let powercycle_pos = power_calls
- .iter()
+ let host_power_calls = power_calls
+ .iter()
+ .filter(|(addr, _)| addr.ip() == host_bmc_ip)
+ .collect::<Vec<_>>();
+ let powercycle_pos = host_power_calls
+ .iter()
.position(|(_, action)| matches!(action, libredfish::SystemPowerControl::PowerCycle));
- let acpowercycle_pos = power_calls
- .iter()
+ let acpowercycle_pos = host_power_calls
+ .iter()
.position(|(_, action)| matches!(action, libredfish::SystemPowerControl::ACPowercycle));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/site-explorer/tests/site_explorer.rs` around lines 2689 - 2712, The
assertions computing powercycle_pos and acpowercycle_pos are iterating over all
power_calls without filtering by a specific host_bmc_ip, which can cause false
positives if other endpoints issue power actions during the test. Filter the
power_calls iterator to only include calls for the target host_bmc_ip before
computing the positions using the position() method. This filtering should be
applied to both the powercycle_pos and acpowercycle_pos iterator chains to
ensure the assertions only validate the fallback order for the specific endpoint
being tested.
…PowerCycle A queued BlueField NIC-mode change only applies after a real power cycle, and site-explorer drives one for every vendor once it sees a DPU in the wrong mode. But `PowerCycle` (Redfish `ComputerSystem.Reset`) is implemented only by Dell and the DPU BMCs -- other host vendors, and Vikings, refuse it -- so on those hosts the reset never landed and the flip parked on `ManualPowerCycleRequired`. `redfish_powercycle` now falls back to a cold `ACPowercycle` (implemented by the HPE/Lenovo/Supermicro/GBx00 wrappers) when `PowerCycle` is refused, so the change applies without an operator. If both resets are refused the error still surfaces as `ManualPowerCycleRequired`, unchanged. The mock endpoint explorer gains per-action failure injection, and a new test drives a non-Dell host whose `PowerCycle` is refused and asserts the `ACPowercycle` fallback fires. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
A queued BlueField NIC-mode change only applies after a real power cycle, and site-explorer drives one for every vendor once it sees a DPU in the wrong mode. But
PowerCycle(RedfishComputerSystem.Reset) is implemented only by Dell and the DPU BMCs -- other host vendors, and Vikings, refuse it -- so on those hosts the reset never landed and the flip parked onManualPowerCycleRequired.redfish_powercyclenow falls back to a coldACPowercycle(implemented by the HPE/Lenovo/Supermicro/GBx00 wrappers) whenPowerCycleis refused, so the change applies without an operator. If both resets are refused the error still surfaces asManualPowerCycleRequired, unchanged.The mock endpoint explorer gains per-action failure injection, and a new test drives a non-Dell host whose
PowerCycleis refused and asserts theACPowercyclefallback fires (afterPowerCycleis tried first).Supports #2635