feat: resolve a host's boot interface from predictions before its first lease#2712
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (5)
Summary by CodeRabbitNew Features
WalkthroughIntroduces ChangesPrediction-based boot interface resolution
Sequence DiagramsequenceDiagram
participant Handler as handler.rs
participant DB as db
participant BootInterface as boot_interface.rs
participant PickPred as pick_boot_prediction
Handler->>DB: load_boot_predictions(machine_id)
DB-->>Handler: Vec<PredictedMachineInterface>
Handler->>BootInterface: boot_interface_target(snapshot, predictions)
BootInterface->>BootInterface: boot_interface_target_from(row_pair, row_mac, predictions)
BootInterface->>PickPred: pick_boot_prediction(predictions)
PickPred-->>BootInterface: Option<&PredictedMachineInterface>
BootInterface-->>Handler: Option<BootInterfaceTarget>
Handler->>BootInterface: resolve_boot_interface(snapshot, predictions)
BootInterface-->>Handler: BootInterfaceResolution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai PTAL, thank you! |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="NVIDIA/infra-controller"
PR=2712
echo "=== PR Info ==="
gh pr view $PR --repo $REPO --json title,body,files,additions,deletions,commits | jq '{title, additions, deletions, files: [.files[].path]}'
echo ""
echo "=== PR Diff ==="
gh pr diff $PR --repo $REPOLength of output: 28218
Medium — Stale comment in
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)
10972-10975:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the SetBootOrder comment to reflect prediction fallback.
Lines 10972-10975 still describe “no boot interface to resolve” before first lease, but the current logic now resolves from predictions first and waits only when both row and usable prediction are absent.
📝 Suggested comment update
- // Resolve the boot NIC MAC the same way `CheckHostConfig` does, - // supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host - // whose boot NIC has not taken its first HostInband lease yet has no - // boot interface to resolve -- wait for it rather than failing. + // Resolve the boot NIC the same way `CheckHostConfig` does, + // supporting hosts with DPU(s) and zero DPUs alike. Before the first + // HostInband lease creates a real row, zero-DPU/NIC-mode hosts can + // resolve via predictions; wait only when neither a real row nor a + // usable prediction exists.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 10972 - 10975, The comment block describing the boot NIC MAC resolution logic in the SetBootOrder handler (around lines 10972-10975) is outdated and does not reflect the current implementation. Update this comment to accurately describe the resolution strategy: first attempt to resolve from the database row, then fall back to using usable predictions if the row is unavailable, and only wait if both the row and usable predictions are absent. This ensures the comment accurately reflects the prediction fallback behavior in the current code.
🧹 Nitpick comments (4)
crates/machine-controller/src/handler.rs (1)
3516-3524: ⚡ Quick winUse a pooled read connection for
load_boot_predictionsinstead of a transaction.Line 3520 starts a transaction for a read-only
find_by_machine_idcall, which adds avoidable transaction/rollback overhead on a frequently-invoked path.💡 Suggested change
async fn load_boot_predictions( ctx: &mut StateHandlerContext<'_, MachineStateHandlerContextObjects>, machine_id: &MachineId, ) -> Result<Vec<PredictedMachineInterface>, StateHandlerError> { - let mut txn = ctx.services.db_pool.begin().await?; - let predictions = - db::predicted_machine_interface::find_by_machine_id(&mut txn, machine_id).await?; + let mut conn = ctx.services.db_pool.acquire().await?; + let predictions = + db::predicted_machine_interface::find_by_machine_id(&mut conn, machine_id).await?; Ok(predictions) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 3516 - 3524, The load_boot_predictions function currently begins a transaction via ctx.services.db_pool.begin() for a read-only database operation (find_by_machine_id), which introduces unnecessary transaction overhead. Replace the transaction initialization with a pooled read connection acquired from ctx.services.db_pool, and pass that connection to the find_by_machine_id call instead. This will eliminate the unnecessary transaction/rollback overhead on this frequently-invoked code path while maintaining the same functionality.crates/api-model/src/machine/mod.rs (1)
3507-3556: ⚡ Quick winUse table-driven cases for
pick_boot_predictiontests.From Line 3507 onward, this is a single mapping operation exercised with multiple input variants; please convert these into
value_scenarios!(orcheck_values) to keep extension/maintenance one-row-per-case.As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use
carbide_test_support::value_scenarios!macro for total operations."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-model/src/machine/mod.rs` around lines 3507 - 3556, The four test functions pick_boot_prediction_returns_the_declared_primary, pick_boot_prediction_returns_the_sole_non_underlay_prediction, pick_boot_prediction_refuses_multiple_non_primary_predictions, and pick_boot_prediction_ignores_underlay_only_predictions all test the same pick_boot_prediction function with different input variants. Convert these individual test functions into a single table-driven test using the value_scenarios! macro from carbide_test_support, where each row represents one test case with its input predictions and expected output, eliminating the repetitive boilerplate and making the test easier to maintain and extend with new cases.Source: Coding guidelines
crates/machine-controller/src/boot_interface.rs (1)
143-193: ⚡ Quick winPrefer table-driven tests for
boot_interface_target_frominput variants.Lines 143-193 test one operation across multiple cases; convert to
value_scenarios!/check_valuesfor consistency and easier case expansion.As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use
carbide_test_support::value_scenarios!macro for total operations."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/boot_interface.rs` around lines 143 - 193, Consolidate the four separate test functions (boot_target_prefers_the_owned_row_over_predictions, boot_target_falls_back_to_the_prediction_before_the_first_lease, boot_target_prediction_without_id_is_mac_only, and boot_target_is_none_without_row_or_prediction) into a single table-driven test. Replace these individual tests with a value_scenarios! macro that defines a table of test cases, each with descriptive names and input/expected output pairs, then use check_values to verify each scenario against the boot_interface_target_from function call. This consolidates the multiple invocations of the same operation into a single parameterized test structure for better maintainability and easier expansion.Source: Coding guidelines
crates/api-core/src/handlers/bmc_endpoint_explorer.rs (1)
1357-1379: ⚡ Quick winAdopt table-driven scenarios for prediction-disambiguation tests
This new case is solid, but this test area now has multiple input/output variants for the same operation (
resolve_admin_boot_interface_target). Converting these prediction-selection cases to a table-driven style will make additions safer and less repetitive.Refactor sketch
+use carbide_test_support::check_values; + +#[test] +fn no_mac_prediction_disambiguation_cases() { + struct Case { + name: &'static str, + candidates: BootInterfaceCandidates, + expected: Option<BootInterfaceTarget>, + } + + let cases = vec![ + Case { + name: "multiple non-primary predictions refuse", + candidates: BootInterfaceCandidates { + interfaces: vec![], + predicted: vec![ + predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")), + predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1")), + ], + }, + expected: None, + }, + Case { + name: "declared primary prediction wins", + candidates: BootInterfaceCandidates { + interfaces: vec![], + predicted: vec![ + predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")), + PredictedMachineInterface { + primary_interface: true, + ..predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1")) + }, + ], + }, + expected: Some(BootInterfaceTarget::Pair(pair( + "00:00:5e:00:53:01", + "NIC.Embedded.1-1-1", + ))), + }, + ]; + + check_values(cases, |c| { + resolve_admin_boot_interface_target(None, Some(&c.candidates), None) + }, |c| c.expected.clone()); +}As per coding guidelines, “Prefer table-driven tests … Reach for a table whenever two or more tests call the same operation with different inputs.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs` around lines 1357 - 1379, The test `no_mac_declared_primary_prediction_wins_over_other_predictions` is one of multiple tests that call the same function `resolve_admin_boot_interface_target` with different inputs and expected outputs. Refactor these related prediction-disambiguation tests into a table-driven test format by creating a struct or tuple that defines test cases with their respective inputs (such as the different predicted interface configurations and expected boot interface targets), then replace the individual test functions with a single test function that iterates through this table of cases. This approach will make it easier to add new prediction-selection scenarios and reduce repetition across similar tests.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 10972-10975: The comment block describing the boot NIC MAC
resolution logic in the SetBootOrder handler (around lines 10972-10975) is
outdated and does not reflect the current implementation. Update this comment to
accurately describe the resolution strategy: first attempt to resolve from the
database row, then fall back to using usable predictions if the row is
unavailable, and only wait if both the row and usable predictions are absent.
This ensures the comment accurately reflects the prediction fallback behavior in
the current code.
---
Nitpick comments:
In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs`:
- Around line 1357-1379: The test
`no_mac_declared_primary_prediction_wins_over_other_predictions` is one of
multiple tests that call the same function `resolve_admin_boot_interface_target`
with different inputs and expected outputs. Refactor these related
prediction-disambiguation tests into a table-driven test format by creating a
struct or tuple that defines test cases with their respective inputs (such as
the different predicted interface configurations and expected boot interface
targets), then replace the individual test functions with a single test function
that iterates through this table of cases. This approach will make it easier to
add new prediction-selection scenarios and reduce repetition across similar
tests.
In `@crates/api-model/src/machine/mod.rs`:
- Around line 3507-3556: The four test functions
pick_boot_prediction_returns_the_declared_primary,
pick_boot_prediction_returns_the_sole_non_underlay_prediction,
pick_boot_prediction_refuses_multiple_non_primary_predictions, and
pick_boot_prediction_ignores_underlay_only_predictions all test the same
pick_boot_prediction function with different input variants. Convert these
individual test functions into a single table-driven test using the
value_scenarios! macro from carbide_test_support, where each row represents one
test case with its input predictions and expected output, eliminating the
repetitive boilerplate and making the test easier to maintain and extend with
new cases.
In `@crates/machine-controller/src/boot_interface.rs`:
- Around line 143-193: Consolidate the four separate test functions
(boot_target_prefers_the_owned_row_over_predictions,
boot_target_falls_back_to_the_prediction_before_the_first_lease,
boot_target_prediction_without_id_is_mac_only, and
boot_target_is_none_without_row_or_prediction) into a single table-driven test.
Replace these individual tests with a value_scenarios! macro that defines a
table of test cases, each with descriptive names and input/expected output
pairs, then use check_values to verify each scenario against the
boot_interface_target_from function call. This consolidates the multiple
invocations of the same operation into a single parameterized test structure for
better maintainability and easier expansion.
In `@crates/machine-controller/src/handler.rs`:
- Around line 3516-3524: The load_boot_predictions function currently begins a
transaction via ctx.services.db_pool.begin() for a read-only database operation
(find_by_machine_id), which introduces unnecessary transaction overhead. Replace
the transaction initialization with a pooled read connection acquired from
ctx.services.db_pool, and pass that connection to the find_by_machine_id call
instead. This will eliminate the unnecessary transaction/rollback overhead on
this frequently-invoked code path while maintaining the same functionality.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 47327c9c-4b05-4d0d-bd90-a3000bd4e2db
📒 Files selected for processing (5)
crates/api-core/src/handlers/bmc_endpoint_explorer.rscrates/api-model/src/machine/mod.rscrates/machine-controller/src/boot_interface.rscrates/machine-controller/src/handler.rscrates/machine-controller/src/handler/bios_config.rs
…st lease The admin boot actions learned to resolve a host still awaiting its first DHCP lease in NVIDIA#2528 -- from its predicted boot interface -- but the machine-controller never did: it read only `machine_interfaces` rows, so a zero-DPU or NIC-mode host (which gets no primary row at ingestion) had nothing to boot-target and its boot states waited. This gives the controller the same fallback: before the first lease creates a real row, the host's predicted boot interface answers, and the real row wins the moment it exists. The selection is now one shared `pick_boot_prediction` -- the declared `ExpectedHostNic.primary` (recorded on the prediction in NVIDIA#2657), else the sole non-underlay prediction, else refuse to guess. Both the machine-controller and the admin resolver route through it, so a multi-NIC host with a declared primary resolves instead of refusing, and the two paths can't drift. - Add `pick_boot_prediction` (api-model) and give the controller's `boot_interface_target`/`resolve_boot_interface` a prediction fallback, loaded per resolve via `load_boot_predictions`. `AwaitingNic` now fires only when there is no row and no usable prediction. - Route admin `resolve_admin_boot_interface_target` through the same selector, so it honors the declared-primary flag for multi-NIC hosts (it used to refuse). - The real `machine_interfaces` row always wins the moment it exists; a host with one expected boot NIC resolves from its sole prediction. Tests cover `pick_boot_prediction`'s precedence, the controller's row-wins / prediction-fallback / MAC-only / wait composition (split out so it unit-tests without a full snapshot), and the admin declared-primary case. Part of NVIDIA#2658 (epic NVIDIA#2660). Signed-off-by: Chet Nichols III <chetn@nvidia.com>
401b808 to
649c6ff
Compare
|
@coderabbitai Thanks for the eyes. Updated -- refreshed the stale |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="NVIDIA/infra-controller"
PR=2712
# List review comments by coderabbitai[bot]
gh api "/repos/$REPO/pulls/$PR/comments" --paginate \
| jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:80]}]'Length of output: 301 🏁 Script executed: #!/bin/bash
set -euo pipefail
REPO="NVIDIA/infra-controller"
PR=2712
# List issue/review-level comments by coderabbitai[bot]
gh api "/repos/$REPO/issues/$PR/comments" --paginate \
| jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:120]}]'Length of output: 819
(≧◡≦) 🐇 [resolve] |
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
What
Gives the machine-controller the same predicted-boot-interface fallback the admin path got in #2528, so a zero-DPU / NIC-mode host can be boot-targeted before its first DHCP lease instead of waiting.
A zero-DPU or NIC-mode host gets no primary
machine_interfacesrow at ingestion, so until its NIC leases the controller's boot states had nothing to target.boot_interface_target/resolve_boot_interfacenow fall back to the host's predicted boot interface; the real row still wins the moment it exists, andAwaitingNic(the #2688 wait) now fires only when there's no row and no usable prediction.The selection rule lives in one shared
pick_boot_prediction(api-model): the declaredExpectedHostNic.primary(recorded on the prediction in #2657) wins, else the sole non-underlay prediction, else it refuses to guess. The adminresolve_admin_boot_interface_targetis routed through the same selector, so it now honors the declared-primary flag for multi-NIC hosts (previously it refused) and the two paths can't drift.Notes
boot_interface_candidates), via aload_boot_predictionshelper -- not added toManagedHostStateSnapshot(a fixed SQL-template view loaded broadly). Themachine_idindex added in Honor a host's declared primary interface when picking its boot device #2657 keeps that query cheap.boot_interface_target_from, the same "unit-test without a full snapshot" split api-model uses forpick_boot_interface_mac/_pair.Test plan
cargo clippy --all-features --testsclean over the touched crates;cargo +nightly fmt.pick_boot_predictionprecedence (primary / sole non-underlay / refuse-multiple / underlay-filtered); controllerboot_interface_target_from(row wins over predictions, prediction fallback as pair, MAC-only without id, none → wait); admin declared-primary disambiguation. Existingboot_interface_resolution+classify_*tests still pass.Part of #2658 (epic #2660).