feat: resolve a host's boot interface from predictions before its first lease by chet · Pull Request #2712 · NVIDIA/infra-controller

chet · 2026-06-20T15:45:39Z

What

Gives the machine-controller the same predicted-boot-interface fallback the admin path got in #2528, so a zero-DPU / NIC-mode host can be boot-targeted before its first DHCP lease instead of waiting.

A zero-DPU or NIC-mode host gets no primary machine_interfaces row at ingestion, so until its NIC leases the controller's boot states had nothing to target. boot_interface_target / resolve_boot_interface now fall back to the host's predicted boot interface; the real row still wins the moment it exists, and AwaitingNic (the #2688 wait) now fires only when there's no row and no usable prediction.

The selection rule lives in one shared pick_boot_prediction (api-model): the declared ExpectedHostNic.primary (recorded on the prediction in #2657) wins, else the sole non-underlay prediction, else it refuses to guess. The admin resolve_admin_boot_interface_target is routed through the same selector, so it now honors the declared-primary flag for multi-NIC hosts (previously it refused) and the two paths can't drift.

Notes

Predictions are loaded on demand at each resolve site (mirroring the admin path's boot_interface_candidates), via a load_boot_predictions helper -- not added to ManagedHostStateSnapshot (a fixed SQL-template view loaded broadly). The machine_id index added in Honor a host's declared primary interface when picking its boot device #2657 keeps that query cheap.
The boot-target composition is split into a pure boot_interface_target_from, the same "unit-test without a full snapshot" split api-model uses for pick_boot_interface_mac/_pair.

Test plan

cargo clippy --all-features --tests clean over the touched crates; cargo +nightly fmt.
New tests: pick_boot_prediction precedence (primary / sole non-underlay / refuse-multiple / underlay-filtered); controller boot_interface_target_from (row wins over predictions, prediction fallback as pair, MAC-only without id, none → wait); admin declared-primary disambiguation. Existing boot_interface_resolution + classify_* tests still pass.

Part of #2658 (epic #2660).

copy-pr-bot · 2026-06-20T15:45:43Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-20T15:45:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 33408363-bedd-4b6a-b411-623e313864fa

📥 Commits

Reviewing files that changed from the base of the PR and between 401b808 and 649c6ff.

📒 Files selected for processing (5)

crates/api-core/src/handlers/bmc_endpoint_explorer.rs
crates/api-model/src/machine/mod.rs
crates/machine-controller/src/boot_interface.rs
crates/machine-controller/src/handler.rs
crates/machine-controller/src/handler/bios_config.rs

🚧 Files skipped from review as they are similar to previous changes (5)

crates/machine-controller/src/handler/bios_config.rs
crates/machine-controller/src/boot_interface.rs
crates/api-model/src/machine/mod.rs
crates/api-core/src/handlers/bmc_endpoint_explorer.rs
crates/machine-controller/src/handler.rs

Summary by CodeRabbit

New Features

Predicted boot interface support: Enhanced boot interface selection during early host provisioning phases to better handle scenarios where physical network interface data isn't yet available, improving reliability for zero-DPU deployments during initial DHCP provisioning.
Improved boot interface resolution: Updated boot repair and configuration workflows to leverage predicted interface information, ensuring consistent boot interface selection across BIOS setup, boot order remediation, and host provisioning operations.

Walkthrough

Introduces pick_boot_prediction in api-model to select a PredictedMachineInterface using declared-primary precedence with sole-non-underlay fallback. This function is integrated into boot_interface_target and resolve_boot_interface (both now accept a predictions slice), propagated through BIOS-config and handler call sites via a new load_boot_predictions helper, and adopted in the admin BMC endpoint explorer to replace prior inline filtering.

Changes

Prediction-based boot interface resolution

Layer / File(s)	Summary
`pick_boot_prediction` model function and unit tests `crates/api-model/src/machine/mod.rs`	Introduces `pick_boot_prediction` with declared-primary → sole-non-underlay → `None` precedence rules and comprehensive unit-test coverage validating all four resolution branches.
`boot_interface_target` and `resolve_boot_interface` updates `crates/machine-controller/src/boot_interface.rs`	Both public functions now accept a `predictions` slice. Adds `boot_interface_target_from` helper with owned-row-first precedence then `pick_boot_prediction` fallback, producing `Pair` or `MacOnly`. Tests validate owned-row preference, prediction fallback before first lease, MacOnly derivation, and None return paths.
BIOS-config call sites wired with predictions `crates/machine-controller/src/handler/bios_config.rs`	`configure_host_bios`, `advance_polling_bios_setup` (signature extended with predictions parameter), and `handle_bios_setup_failed_recovery` each load predictions and forward them into `boot_interface_target`.
Handler prediction loading and call-site integration `crates/machine-controller/src/handler.rs`	Adds `load_boot_predictions` async helper querying `db::predicted_machine_interface::find_by_machine_id`. Wires predictions into `check_host_boot_config`, both `set_host_boot_order` states (`SetBootOrder` and `CheckBootOrder`), and all three BIOS polling sites (`ReprovisionState::PollingHostBiosSetup`, `HostMachineStateHandler::MachineState::PollingBiosSetup`, `InstanceStateHandler::HostPlatformConfigurationState::PollingBiosSetup`).
Admin BMC explorer refactored to shared logic `crates/api-core/src/handlers/bmc_endpoint_explorer.rs`	Replaces inline prediction-filter logic with a delegated call to `model::machine::pick_boot_prediction`, updates resolver documentation to describe declared-primary vs. sole-non-underlay vs. ambiguous behavior, relocates `NetworkSegmentType` import to test scope, and adds a test asserting that declared-primary predictions disambiguate multi-prediction host resolution.

Sequence Diagram

sequenceDiagram
  participant Handler as handler.rs
  participant DB as db
  participant BootInterface as boot_interface.rs
  participant PickPred as pick_boot_prediction
  Handler->>DB: load_boot_predictions(machine_id)
  DB-->>Handler: Vec<PredictedMachineInterface>
  Handler->>BootInterface: boot_interface_target(snapshot, predictions)
  BootInterface->>BootInterface: boot_interface_target_from(row_pair, row_mac, predictions)
  BootInterface->>PickPred: pick_boot_prediction(predictions)
  PickPred-->>BootInterface: Option<&PredictedMachineInterface>
  BootInterface-->>Handler: Option<BootInterfaceTarget>
  Handler->>BootInterface: resolve_boot_interface(snapshot, predictions)
  BootInterface-->>Handler: BootInterfaceResolution

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Resolve the boot interface from predictions in the machine-controller #2658: This PR implements the core functionality requested — prediction-fallback boot-interface resolution via pick_boot_prediction integrated into boot_interface_target and resolve_boot_interface.
Prove the controller has a boot target during NIC-mode zero-DPU provisioning #2633: The PR directly resolves the stated problem by enabling safe boot-interface determination during the pre-first-DHCP-lease window for zero-DPU NIC-mode provisioning, before HostInband leases materialize.
Honor a host's declared primary interface when picking its boot device #2657: The declared primary_interface flag is now honored during prediction-based selection via pick_boot_prediction's declared-primary precedence rule, directly addressing the primary-boot-intent requirements raised in that issue.
Standardize boot interface selection and management across NIC types #2660: This PR constitutes a core implementation artifact of the work tracked in that epic, adding pick_boot_prediction and systematically threading predictions through the machine-controller's boot repair and BIOS setup paths.
Boot from a declared integrated NIC while keeping its DPUs managed #2668: This PR establishes the prediction-consumption infrastructure (via pick_boot_prediction and prediction-aware boot resolution) that complementary prediction-creation logic for integrated NICs on managed-DPU hosts would depend upon.

Possibly related PRs

NVIDIA/infra-controller#2528: Directly overlaps—both modify the admin BMC boot-interface resolution in bmc_endpoint_explorer.rs to use predicted_machine_interfaces with identical "unambiguous single vs. require explicit MAC for multiple" disambiguation semantics and declared-primary precedence.
NVIDIA/infra-controller#2688: Modifies the same boot_interface_target/resolve_boot_interface surface in boot_interface.rs and the downstream handler call sites that this PR extends with prediction-aware forwarding.
NVIDIA/infra-controller#2690: Both PRs honor primary_interface in PredictedMachineInterface for boot-device selection with overlapping changes in bmc_endpoint_explorer.rs prediction logic, establishing the declared-primary semantics used by pick_boot_prediction.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly reflects the main change: extending boot interface resolution to use predictions before the first DHCP lease.
Description check	✅ Passed	The description is comprehensive and clearly related to the changeset, explaining the motivation, implementation strategy, and testing approach.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chet · 2026-06-20T15:46:16Z

@coderabbitai PTAL, thank you!

coderabbitai · 2026-06-20T15:48:05Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

echo "=== PR Info ==="
gh pr view $PR --repo $REPO --json title,body,files,additions,deletions,commits | jq '{title, additions, deletions, files: [.files[].path]}'

echo ""
echo "=== PR Diff ==="
gh pr diff $PR --repo $REPO

Length of output: 28218

@chet: This is a clean, well-structured piece of work. The shared pick_boot_prediction selector is the right abstraction — it eliminates the admin/controller drift risk and the doc comment clearly articulates the two-tier precedence rule. The boot_interface_target_from split mirrors the existing api-model pattern and the test coverage is thorough. A few observations worth addressing:

Medium — Stale comment in `handler.rs`

In set_host_boot_order, the comment around the SetBootOrderState::SetBootOrder arm was not updated alongside the behavior change:

// supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
// whose boot NIC has not taken its first HostInband lease yet has no
// boot interface to resolve -- wait for it rather than failing.
let boot_interface = match resolve_boot_interface(mh_snapshot, &predictions) {

AwaitingNic now fires only when both the real row and any usable prediction are absent. The comment should reflect that:

// supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
// falls back to its predicted boot NIC before its first HostInband lease;
// only when neither a real row nor a usable prediction exists does the
// resolver return AwaitingNic and defer rather than fault.
let boot_interface = match resolve_boot_interface(mh_snapshot, &predictions) {

Minor — `load_boot_predictions` opens a write transaction for a read

async fn load_boot_predictions(...) -> Result<Vec<PredictedMachineInterface>, ...> {
    let mut txn = ctx.services.db_pool.begin().await?;
    let predictions = db::predicted_machine_interface::find_by_machine_id(&mut txn, machine_id).await?;
    Ok(predictions)
    // txn silently rolls back on drop
}

begin() acquires a transaction (implicitly BEGIN) and the drop rolls it back — correct for isolation, but semantically a plain connection (acquire()) is more precise for a read-only query and avoids the no-op rollback round-trip. If this matches the codebase convention for find_by_* calls, it is fine to leave as-is.

Nit — No explicit empty-slice test for `pick_boot_prediction`

The four existing tests are well chosen, but an empty-slice case makes the "returns None" contract explicit without any deduction:

#[test]
fn pick_boot_prediction_returns_none_for_empty_predictions() {
    assert!(pick_boot_prediction(&[]).is_none());
}

Positive observations

Precedence parity: pick_boot_prediction mirrors pick_boot_interface exactly one level down (row → prediction), making the two-level precedence straightforward to reason about.
Single load, multiple uses: set_host_boot_order loads predictions once and reuses them across both resolve_boot_interface call sites — correct.
No snapshot bloat: keeping predictions out of ManagedHostStateSnapshot (loaded on demand) is the right call given the broad load surface of that view.
Test isolation: boot_interface_target_from being a pure function makes the unit tests in boot_interface.rs robust without any snapshot construction overhead.

chet · 2026-06-20T16:06:28Z

@coderabbitai review

coderabbitai · 2026-06-20T16:06:34Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/machine-controller/src/handler.rs (1)

10972-10975: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the SetBootOrder comment to reflect prediction fallback.

Lines 10972-10975 still describe “no boot interface to resolve” before first lease, but the current logic now resolves from predictions first and waits only when both row and usable prediction are absent.

📝 Suggested comment update

-            // Resolve the boot NIC MAC the same way `CheckHostConfig` does,
-            // supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
-            // whose boot NIC has not taken its first HostInband lease yet has no
-            // boot interface to resolve -- wait for it rather than failing.
+            // Resolve the boot NIC the same way `CheckHostConfig` does,
+            // supporting hosts with DPU(s) and zero DPUs alike. Before the first
+            // HostInband lease creates a real row, zero-DPU/NIC-mode hosts can
+            // resolve via predictions; wait only when neither a real row nor a
+            // usable prediction exists.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 10972 - 10975, The
comment block describing the boot NIC MAC resolution logic in the SetBootOrder
handler (around lines 10972-10975) is outdated and does not reflect the current
implementation. Update this comment to accurately describe the resolution
strategy: first attempt to resolve from the database row, then fall back to
using usable predictions if the row is unavailable, and only wait if both the
row and usable predictions are absent. This ensures the comment accurately
reflects the prediction fallback behavior in the current code.

🧹 Nitpick comments (4)

crates/machine-controller/src/handler.rs (1)

3516-3524: ⚡ Quick win

Use a pooled read connection for load_boot_predictions instead of a transaction.

Line 3520 starts a transaction for a read-only find_by_machine_id call, which adds avoidable transaction/rollback overhead on a frequently-invoked path.

💡 Suggested change

 async fn load_boot_predictions(
     ctx: &mut StateHandlerContext<'_, MachineStateHandlerContextObjects>,
     machine_id: &MachineId,
 ) -> Result<Vec<PredictedMachineInterface>, StateHandlerError> {
-    let mut txn = ctx.services.db_pool.begin().await?;
-    let predictions =
-        db::predicted_machine_interface::find_by_machine_id(&mut txn, machine_id).await?;
+    let mut conn = ctx.services.db_pool.acquire().await?;
+    let predictions =
+        db::predicted_machine_interface::find_by_machine_id(&mut conn, machine_id).await?;
     Ok(predictions)
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 3516 - 3524, The
load_boot_predictions function currently begins a transaction via
ctx.services.db_pool.begin() for a read-only database operation
(find_by_machine_id), which introduces unnecessary transaction overhead. Replace
the transaction initialization with a pooled read connection acquired from
ctx.services.db_pool, and pass that connection to the find_by_machine_id call
instead. This will eliminate the unnecessary transaction/rollback overhead on
this frequently-invoked code path while maintaining the same functionality.

crates/api-model/src/machine/mod.rs (1)

3507-3556: ⚡ Quick win

Use table-driven cases for pick_boot_prediction tests.

From Line 3507 onward, this is a single mapping operation exercised with multiple input variants; please convert these into value_scenarios! (or check_values) to keep extension/maintenance one-row-per-case.

As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use carbide_test_support::value_scenarios! macro for total operations."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-model/src/machine/mod.rs` around lines 3507 - 3556, The four test
functions pick_boot_prediction_returns_the_declared_primary,
pick_boot_prediction_returns_the_sole_non_underlay_prediction,
pick_boot_prediction_refuses_multiple_non_primary_predictions, and
pick_boot_prediction_ignores_underlay_only_predictions all test the same
pick_boot_prediction function with different input variants. Convert these
individual test functions into a single table-driven test using the
value_scenarios! macro from carbide_test_support, where each row represents one
test case with its input predictions and expected output, eliminating the
repetitive boilerplate and making the test easier to maintain and extend with
new cases.
Source: Coding guidelines

crates/machine-controller/src/boot_interface.rs (1)

143-193: ⚡ Quick win

Prefer table-driven tests for boot_interface_target_from input variants.

Lines 143-193 test one operation across multiple cases; convert to value_scenarios!/check_values for consistency and easier case expansion.

As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use carbide_test_support::value_scenarios! macro for total operations."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/boot_interface.rs` around lines 143 - 193,
Consolidate the four separate test functions
(boot_target_prefers_the_owned_row_over_predictions,
boot_target_falls_back_to_the_prediction_before_the_first_lease,
boot_target_prediction_without_id_is_mac_only, and
boot_target_is_none_without_row_or_prediction) into a single table-driven test.
Replace these individual tests with a value_scenarios! macro that defines a
table of test cases, each with descriptive names and input/expected output
pairs, then use check_values to verify each scenario against the
boot_interface_target_from function call. This consolidates the multiple
invocations of the same operation into a single parameterized test structure for
better maintainability and easier expansion.
Source: Coding guidelines

crates/api-core/src/handlers/bmc_endpoint_explorer.rs (1)

1357-1379: ⚡ Quick win

Adopt table-driven scenarios for prediction-disambiguation tests

This new case is solid, but this test area now has multiple input/output variants for the same operation (resolve_admin_boot_interface_target). Converting these prediction-selection cases to a table-driven style will make additions safer and less repetitive.

Refactor sketch

+use carbide_test_support::check_values;
+
+#[test]
+fn no_mac_prediction_disambiguation_cases() {
+    struct Case {
+        name: &'static str,
+        candidates: BootInterfaceCandidates,
+        expected: Option<BootInterfaceTarget>,
+    }
+
+    let cases = vec![
+        Case {
+            name: "multiple non-primary predictions refuse",
+            candidates: BootInterfaceCandidates {
+                interfaces: vec![],
+                predicted: vec![
+                    predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")),
+                    predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1")),
+                ],
+            },
+            expected: None,
+        },
+        Case {
+            name: "declared primary prediction wins",
+            candidates: BootInterfaceCandidates {
+                interfaces: vec![],
+                predicted: vec![
+                    predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")),
+                    PredictedMachineInterface {
+                        primary_interface: true,
+                        ..predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1"))
+                    },
+                ],
+            },
+            expected: Some(BootInterfaceTarget::Pair(pair(
+                "00:00:5e:00:53:01",
+                "NIC.Embedded.1-1-1",
+            ))),
+        },
+    ];
+
+    check_values(cases, |c| {
+        resolve_admin_boot_interface_target(None, Some(&c.candidates), None)
+    }, |c| c.expected.clone());
+}

As per coding guidelines, “Prefer table-driven tests … Reach for a table whenever two or more tests call the same operation with different inputs.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs` around lines 1357 -
1379, The test `no_mac_declared_primary_prediction_wins_over_other_predictions`
is one of multiple tests that call the same function
`resolve_admin_boot_interface_target` with different inputs and expected
outputs. Refactor these related prediction-disambiguation tests into a
table-driven test format by creating a struct or tuple that defines test cases
with their respective inputs (such as the different predicted interface
configurations and expected boot interface targets), then replace the individual
test functions with a single test function that iterates through this table of
cases. This approach will make it easier to add new prediction-selection
scenarios and reduce repetition across similar tests.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 10972-10975: The comment block describing the boot NIC MAC
resolution logic in the SetBootOrder handler (around lines 10972-10975) is
outdated and does not reflect the current implementation. Update this comment to
accurately describe the resolution strategy: first attempt to resolve from the
database row, then fall back to using usable predictions if the row is
unavailable, and only wait if both the row and usable predictions are absent.
This ensures the comment accurately reflects the prediction fallback behavior in
the current code.

---

Nitpick comments:
In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs`:
- Around line 1357-1379: The test
`no_mac_declared_primary_prediction_wins_over_other_predictions` is one of
multiple tests that call the same function `resolve_admin_boot_interface_target`
with different inputs and expected outputs. Refactor these related
prediction-disambiguation tests into a table-driven test format by creating a
struct or tuple that defines test cases with their respective inputs (such as
the different predicted interface configurations and expected boot interface
targets), then replace the individual test functions with a single test function
that iterates through this table of cases. This approach will make it easier to
add new prediction-selection scenarios and reduce repetition across similar
tests.

In `@crates/api-model/src/machine/mod.rs`:
- Around line 3507-3556: The four test functions
pick_boot_prediction_returns_the_declared_primary,
pick_boot_prediction_returns_the_sole_non_underlay_prediction,
pick_boot_prediction_refuses_multiple_non_primary_predictions, and
pick_boot_prediction_ignores_underlay_only_predictions all test the same
pick_boot_prediction function with different input variants. Convert these
individual test functions into a single table-driven test using the
value_scenarios! macro from carbide_test_support, where each row represents one
test case with its input predictions and expected output, eliminating the
repetitive boilerplate and making the test easier to maintain and extend with
new cases.

In `@crates/machine-controller/src/boot_interface.rs`:
- Around line 143-193: Consolidate the four separate test functions
(boot_target_prefers_the_owned_row_over_predictions,
boot_target_falls_back_to_the_prediction_before_the_first_lease,
boot_target_prediction_without_id_is_mac_only, and
boot_target_is_none_without_row_or_prediction) into a single table-driven test.
Replace these individual tests with a value_scenarios! macro that defines a
table of test cases, each with descriptive names and input/expected output
pairs, then use check_values to verify each scenario against the
boot_interface_target_from function call. This consolidates the multiple
invocations of the same operation into a single parameterized test structure for
better maintainability and easier expansion.

In `@crates/machine-controller/src/handler.rs`:
- Around line 3516-3524: The load_boot_predictions function currently begins a
transaction via ctx.services.db_pool.begin() for a read-only database operation
(find_by_machine_id), which introduces unnecessary transaction overhead. Replace
the transaction initialization with a pooled read connection acquired from
ctx.services.db_pool, and pass that connection to the find_by_machine_id call
instead. This will eliminate the unnecessary transaction/rollback overhead on
this frequently-invoked code path while maintaining the same functionality.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47327c9c-4b05-4d0d-bd90-a3000bd4e2db

📥 Commits

Reviewing files that changed from the base of the PR and between 79e54c0 and 401b808.

📒 Files selected for processing (5)

crates/api-core/src/handlers/bmc_endpoint_explorer.rs
crates/api-model/src/machine/mod.rs
crates/machine-controller/src/boot_interface.rs
crates/machine-controller/src/handler.rs
crates/machine-controller/src/handler/bios_config.rs

…st lease The admin boot actions learned to resolve a host still awaiting its first DHCP lease in NVIDIA#2528 -- from its predicted boot interface -- but the machine-controller never did: it read only `machine_interfaces` rows, so a zero-DPU or NIC-mode host (which gets no primary row at ingestion) had nothing to boot-target and its boot states waited. This gives the controller the same fallback: before the first lease creates a real row, the host's predicted boot interface answers, and the real row wins the moment it exists. The selection is now one shared `pick_boot_prediction` -- the declared `ExpectedHostNic.primary` (recorded on the prediction in NVIDIA#2657), else the sole non-underlay prediction, else refuse to guess. Both the machine-controller and the admin resolver route through it, so a multi-NIC host with a declared primary resolves instead of refusing, and the two paths can't drift. - Add `pick_boot_prediction` (api-model) and give the controller's `boot_interface_target`/`resolve_boot_interface` a prediction fallback, loaded per resolve via `load_boot_predictions`. `AwaitingNic` now fires only when there is no row and no usable prediction. - Route admin `resolve_admin_boot_interface_target` through the same selector, so it honors the declared-primary flag for multi-NIC hosts (it used to refuse). - The real `machine_interfaces` row always wins the moment it exists; a host with one expected boot NIC resolves from its sole prediction. Tests cover `pick_boot_prediction`'s precedence, the controller's row-wins / prediction-fallback / MAC-only / wait composition (split out so it unit-tests without a full snapshot), and the admin declared-primary case. Part of NVIDIA#2658 (epic NVIDIA#2660). Signed-off-by: Chet Nichols III <chetn@nvidia.com>

chet · 2026-06-20T16:22:36Z

@coderabbitai Thanks for the eyes. Updated -- refreshed the stale set_host_boot_order comment to describe the prediction fallback, and switched load_boot_predictions to a pooled acquire() read connection instead of a transaction. Leaving the table-driven test suggestions as-is: the new tests match the per-case #[test] style of the existing tests in each module (pick_boot_interface_*, classify_*, no_mac_*), so converting only the new ones would split each module's style; a module-wide value_scenarios! conversion (which also adds a carbide-test-support dev-dep to machine-controller) belongs to the ongoing test-restructuring work, not this feature PR. PTAL. Resolve if good, thanks!

coderabbitai · 2026-06-20T16:23:09Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

# List review comments by coderabbitai[bot]
gh api "/repos/$REPO/pulls/$PR/comments" --paginate \
  | jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:80]}]'

Length of output: 301

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

# List issue/review-level comments by coderabbitai[bot]
gh api "/repos/$REPO/issues/$PR/comments" --paginate \
  | jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:120]}]'

Length of output: 819

@chet: Both actionable items are confirmed resolved. The updated comment in set_host_boot_order now accurately describes the prediction-fallback path, and switching load_boot_predictions to acquire() is the correct choice. The rationale for leaving the per-case #[test] style intact is sound — a module-wide value_scenarios! conversion is cleanly separable and belongs in the test-restructuring track rather than here. No further issues.

(≧◡≦) 🐇

[resolve]

github-actions · 2026-06-21T05:37:58Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	264	6	23	99	6	130
machine-validation-runner	704	34	183	258	35	194
machine_validation	704	34	183	258	35	194
nvmetal-carbide	704	34	183	258	35	194
TOTAL	2382	108	572	879	111	712

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

rwthompsonii approved these changes Jun 20, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

chet force-pushed the controller-boot-from-predictions branch from 401b808 to 649c6ff Compare June 20, 2026 16:21

chet marked this pull request as ready for review June 21, 2026 04:16

chet requested a review from a team as a code owner June 21, 2026 04:16

rwthompsonii approved these changes Jun 21, 2026

View reviewed changes

chet merged commit 611e96b into NVIDIA:main Jun 21, 2026
55 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: resolve a host's boot interface from predictions before its first lease#2712

feat: resolve a host's boot interface from predictions before its first lease#2712
chet merged 1 commit into
NVIDIA:mainfrom
chet:controller-boot-from-predictions

chet commented Jun 20, 2026

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Uh oh!

github-actions Bot commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chet commented Jun 20, 2026

What

Notes

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

New Features

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Possibly related PRs

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Medium — Stale comment in handler.rs

Minor — load_boot_predictions opens a write transaction for a read

Nit — No explicit empty-slice test for pick_boot_prediction

Positive observations

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chet commented Jun 20, 2026

Uh oh!

coderabbitai Bot commented Jun 20, 2026

Uh oh!

github-actions Bot commented Jun 21, 2026

🔍 Container Scan Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Medium — Stale comment in `handler.rs`

Minor — `load_boot_predictions` opens a write transaction for a read

Nit — No explicit empty-slice test for `pick_boot_prediction`

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading