Skip to content

feat: resolve a host's boot interface from predictions before its first lease#2712

Merged
chet merged 1 commit into
NVIDIA:mainfrom
chet:controller-boot-from-predictions
Jun 21, 2026
Merged

feat: resolve a host's boot interface from predictions before its first lease#2712
chet merged 1 commit into
NVIDIA:mainfrom
chet:controller-boot-from-predictions

Conversation

@chet

@chet chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

What

Gives the machine-controller the same predicted-boot-interface fallback the admin path got in #2528, so a zero-DPU / NIC-mode host can be boot-targeted before its first DHCP lease instead of waiting.

A zero-DPU or NIC-mode host gets no primary machine_interfaces row at ingestion, so until its NIC leases the controller's boot states had nothing to target. boot_interface_target / resolve_boot_interface now fall back to the host's predicted boot interface; the real row still wins the moment it exists, and AwaitingNic (the #2688 wait) now fires only when there's no row and no usable prediction.

The selection rule lives in one shared pick_boot_prediction (api-model): the declared ExpectedHostNic.primary (recorded on the prediction in #2657) wins, else the sole non-underlay prediction, else it refuses to guess. The admin resolve_admin_boot_interface_target is routed through the same selector, so it now honors the declared-primary flag for multi-NIC hosts (previously it refused) and the two paths can't drift.

Notes

  • Predictions are loaded on demand at each resolve site (mirroring the admin path's boot_interface_candidates), via a load_boot_predictions helper -- not added to ManagedHostStateSnapshot (a fixed SQL-template view loaded broadly). The machine_id index added in Honor a host's declared primary interface when picking its boot device #2657 keeps that query cheap.
  • The boot-target composition is split into a pure boot_interface_target_from, the same "unit-test without a full snapshot" split api-model uses for pick_boot_interface_mac/_pair.

Test plan

  • cargo clippy --all-features --tests clean over the touched crates; cargo +nightly fmt.
  • New tests: pick_boot_prediction precedence (primary / sole non-underlay / refuse-multiple / underlay-filtered); controller boot_interface_target_from (row wins over predictions, prediction fallback as pair, MAC-only without id, none → wait); admin declared-primary disambiguation. Existing boot_interface_resolution + classify_* tests still pass.

Part of #2658 (epic #2660).

@copy-pr-bot

copy-pr-bot Bot commented Jun 20, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 33408363-bedd-4b6a-b411-623e313864fa

📥 Commits

Reviewing files that changed from the base of the PR and between 401b808 and 649c6ff.

📒 Files selected for processing (5)
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/api-model/src/machine/mod.rs
  • crates/machine-controller/src/boot_interface.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/bios_config.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • crates/machine-controller/src/handler/bios_config.rs
  • crates/machine-controller/src/boot_interface.rs
  • crates/api-model/src/machine/mod.rs
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/machine-controller/src/handler.rs

Summary by CodeRabbit

New Features

  • Predicted boot interface support: Enhanced boot interface selection during early host provisioning phases to better handle scenarios where physical network interface data isn't yet available, improving reliability for zero-DPU deployments during initial DHCP provisioning.

  • Improved boot interface resolution: Updated boot repair and configuration workflows to leverage predicted interface information, ensuring consistent boot interface selection across BIOS setup, boot order remediation, and host provisioning operations.

Walkthrough

Introduces pick_boot_prediction in api-model to select a PredictedMachineInterface using declared-primary precedence with sole-non-underlay fallback. This function is integrated into boot_interface_target and resolve_boot_interface (both now accept a predictions slice), propagated through BIOS-config and handler call sites via a new load_boot_predictions helper, and adopted in the admin BMC endpoint explorer to replace prior inline filtering.

Changes

Prediction-based boot interface resolution

Layer / File(s) Summary
pick_boot_prediction model function and unit tests
crates/api-model/src/machine/mod.rs
Introduces pick_boot_prediction with declared-primary → sole-non-underlay → None precedence rules and comprehensive unit-test coverage validating all four resolution branches.
boot_interface_target and resolve_boot_interface updates
crates/machine-controller/src/boot_interface.rs
Both public functions now accept a predictions slice. Adds boot_interface_target_from helper with owned-row-first precedence then pick_boot_prediction fallback, producing Pair or MacOnly. Tests validate owned-row preference, prediction fallback before first lease, MacOnly derivation, and None return paths.
BIOS-config call sites wired with predictions
crates/machine-controller/src/handler/bios_config.rs
configure_host_bios, advance_polling_bios_setup (signature extended with predictions parameter), and handle_bios_setup_failed_recovery each load predictions and forward them into boot_interface_target.
Handler prediction loading and call-site integration
crates/machine-controller/src/handler.rs
Adds load_boot_predictions async helper querying db::predicted_machine_interface::find_by_machine_id. Wires predictions into check_host_boot_config, both set_host_boot_order states (SetBootOrder and CheckBootOrder), and all three BIOS polling sites (ReprovisionState::PollingHostBiosSetup, HostMachineStateHandler::MachineState::PollingBiosSetup, InstanceStateHandler::HostPlatformConfigurationState::PollingBiosSetup).
Admin BMC explorer refactored to shared logic
crates/api-core/src/handlers/bmc_endpoint_explorer.rs
Replaces inline prediction-filter logic with a delegated call to model::machine::pick_boot_prediction, updates resolver documentation to describe declared-primary vs. sole-non-underlay vs. ambiguous behavior, relocates NetworkSegmentType import to test scope, and adds a test asserting that declared-primary predictions disambiguate multi-prediction host resolution.

Sequence Diagram

sequenceDiagram
  participant Handler as handler.rs
  participant DB as db
  participant BootInterface as boot_interface.rs
  participant PickPred as pick_boot_prediction
  Handler->>DB: load_boot_predictions(machine_id)
  DB-->>Handler: Vec<PredictedMachineInterface>
  Handler->>BootInterface: boot_interface_target(snapshot, predictions)
  BootInterface->>BootInterface: boot_interface_target_from(row_pair, row_mac, predictions)
  BootInterface->>PickPred: pick_boot_prediction(predictions)
  PickPred-->>BootInterface: Option<&PredictedMachineInterface>
  BootInterface-->>Handler: Option<BootInterfaceTarget>
  Handler->>BootInterface: resolve_boot_interface(snapshot, predictions)
  BootInterface-->>Handler: BootInterfaceResolution
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/infra-controller#2528: Directly overlaps—both modify the admin BMC boot-interface resolution in bmc_endpoint_explorer.rs to use predicted_machine_interfaces with identical "unambiguous single vs. require explicit MAC for multiple" disambiguation semantics and declared-primary precedence.
  • NVIDIA/infra-controller#2688: Modifies the same boot_interface_target/resolve_boot_interface surface in boot_interface.rs and the downstream handler call sites that this PR extends with prediction-aware forwarding.
  • NVIDIA/infra-controller#2690: Both PRs honor primary_interface in PredictedMachineInterface for boot-device selection with overlapping changes in bmc_endpoint_explorer.rs prediction logic, establishing the declared-primary semantics used by pick_boot_prediction.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly reflects the main change: extending boot interface resolution to use predictions before the first DHCP lease.
Description check ✅ Passed The description is comprehensive and clearly related to the changeset, explaining the motivation, implementation strategy, and testing approach.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@chet

chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thank you!

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

echo "=== PR Info ==="
gh pr view $PR --repo $REPO --json title,body,files,additions,deletions,commits | jq '{title, additions, deletions, files: [.files[].path]}'

echo ""
echo "=== PR Diff ==="
gh pr diff $PR --repo $REPO

Length of output: 28218



@chet: This is a clean, well-structured piece of work. The shared pick_boot_prediction selector is the right abstraction — it eliminates the admin/controller drift risk and the doc comment clearly articulates the two-tier precedence rule. The boot_interface_target_from split mirrors the existing api-model pattern and the test coverage is thorough. A few observations worth addressing:


Medium — Stale comment in handler.rs

In set_host_boot_order, the comment around the SetBootOrderState::SetBootOrder arm was not updated alongside the behavior change:

// supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
// whose boot NIC has not taken its first HostInband lease yet has no
// boot interface to resolve -- wait for it rather than failing.
let boot_interface = match resolve_boot_interface(mh_snapshot, &predictions) {

AwaitingNic now fires only when both the real row and any usable prediction are absent. The comment should reflect that:

// supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
// falls back to its predicted boot NIC before its first HostInband lease;
// only when neither a real row nor a usable prediction exists does the
// resolver return AwaitingNic and defer rather than fault.
let boot_interface = match resolve_boot_interface(mh_snapshot, &predictions) {

Minor — load_boot_predictions opens a write transaction for a read

async fn load_boot_predictions(...) -> Result<Vec<PredictedMachineInterface>, ...> {
    let mut txn = ctx.services.db_pool.begin().await?;
    let predictions = db::predicted_machine_interface::find_by_machine_id(&mut txn, machine_id).await?;
    Ok(predictions)
    // txn silently rolls back on drop
}

begin() acquires a transaction (implicitly BEGIN) and the drop rolls it back — correct for isolation, but semantically a plain connection (acquire()) is more precise for a read-only query and avoids the no-op rollback round-trip. If this matches the codebase convention for find_by_* calls, it is fine to leave as-is.


Nit — No explicit empty-slice test for pick_boot_prediction

The four existing tests are well chosen, but an empty-slice case makes the "returns None" contract explicit without any deduction:

#[test]
fn pick_boot_prediction_returns_none_for_empty_predictions() {
    assert!(pick_boot_prediction(&[]).is_none());
}

Positive observations

  • Precedence parity: pick_boot_prediction mirrors pick_boot_interface exactly one level down (row → prediction), making the two-level precedence straightforward to reason about.
  • Single load, multiple uses: set_host_boot_order loads predictions once and reuses them across both resolve_boot_interface call sites — correct.
  • No snapshot bloat: keeping predictions out of ManagedHostStateSnapshot (loaded on demand) is the right call given the broad load surface of that view.
  • Test isolation: boot_interface_target_from being a pure function makes the unit tests in boot_interface.rs robust without any snapshot construction overhead.

@chet

chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)

10972-10975: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the SetBootOrder comment to reflect prediction fallback.

Lines 10972-10975 still describe “no boot interface to resolve” before first lease, but the current logic now resolves from predictions first and waits only when both row and usable prediction are absent.

📝 Suggested comment update
-            // Resolve the boot NIC MAC the same way `CheckHostConfig` does,
-            // supporting hosts with DPU(s) and zero DPUs alike. A zero-DPU host
-            // whose boot NIC has not taken its first HostInband lease yet has no
-            // boot interface to resolve -- wait for it rather than failing.
+            // Resolve the boot NIC the same way `CheckHostConfig` does,
+            // supporting hosts with DPU(s) and zero DPUs alike. Before the first
+            // HostInband lease creates a real row, zero-DPU/NIC-mode hosts can
+            // resolve via predictions; wait only when neither a real row nor a
+            // usable prediction exists.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 10972 - 10975, The
comment block describing the boot NIC MAC resolution logic in the SetBootOrder
handler (around lines 10972-10975) is outdated and does not reflect the current
implementation. Update this comment to accurately describe the resolution
strategy: first attempt to resolve from the database row, then fall back to
using usable predictions if the row is unavailable, and only wait if both the
row and usable predictions are absent. This ensures the comment accurately
reflects the prediction fallback behavior in the current code.
🧹 Nitpick comments (4)
crates/machine-controller/src/handler.rs (1)

3516-3524: ⚡ Quick win

Use a pooled read connection for load_boot_predictions instead of a transaction.

Line 3520 starts a transaction for a read-only find_by_machine_id call, which adds avoidable transaction/rollback overhead on a frequently-invoked path.

💡 Suggested change
 async fn load_boot_predictions(
     ctx: &mut StateHandlerContext<'_, MachineStateHandlerContextObjects>,
     machine_id: &MachineId,
 ) -> Result<Vec<PredictedMachineInterface>, StateHandlerError> {
-    let mut txn = ctx.services.db_pool.begin().await?;
-    let predictions =
-        db::predicted_machine_interface::find_by_machine_id(&mut txn, machine_id).await?;
+    let mut conn = ctx.services.db_pool.acquire().await?;
+    let predictions =
+        db::predicted_machine_interface::find_by_machine_id(&mut conn, machine_id).await?;
     Ok(predictions)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 3516 - 3524, The
load_boot_predictions function currently begins a transaction via
ctx.services.db_pool.begin() for a read-only database operation
(find_by_machine_id), which introduces unnecessary transaction overhead. Replace
the transaction initialization with a pooled read connection acquired from
ctx.services.db_pool, and pass that connection to the find_by_machine_id call
instead. This will eliminate the unnecessary transaction/rollback overhead on
this frequently-invoked code path while maintaining the same functionality.
crates/api-model/src/machine/mod.rs (1)

3507-3556: ⚡ Quick win

Use table-driven cases for pick_boot_prediction tests.

From Line 3507 onward, this is a single mapping operation exercised with multiple input variants; please convert these into value_scenarios! (or check_values) to keep extension/maintenance one-row-per-case.

As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use carbide_test_support::value_scenarios! macro for total operations."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-model/src/machine/mod.rs` around lines 3507 - 3556, The four test
functions pick_boot_prediction_returns_the_declared_primary,
pick_boot_prediction_returns_the_sole_non_underlay_prediction,
pick_boot_prediction_refuses_multiple_non_primary_predictions, and
pick_boot_prediction_ignores_underlay_only_predictions all test the same
pick_boot_prediction function with different input variants. Convert these
individual test functions into a single table-driven test using the
value_scenarios! macro from carbide_test_support, where each row represents one
test case with its input predictions and expected output, eliminating the
repetitive boilerplate and making the test easier to maintain and extend with
new cases.

Source: Coding guidelines

crates/machine-controller/src/boot_interface.rs (1)

143-193: ⚡ Quick win

Prefer table-driven tests for boot_interface_target_from input variants.

Lines 143-193 test one operation across multiple cases; convert to value_scenarios!/check_values for consistency and easier case expansion.

As per coding guidelines, "Reach for a table whenever two or more tests call the same operation with different inputs" and "Use carbide_test_support::value_scenarios! macro for total operations."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/boot_interface.rs` around lines 143 - 193,
Consolidate the four separate test functions
(boot_target_prefers_the_owned_row_over_predictions,
boot_target_falls_back_to_the_prediction_before_the_first_lease,
boot_target_prediction_without_id_is_mac_only, and
boot_target_is_none_without_row_or_prediction) into a single table-driven test.
Replace these individual tests with a value_scenarios! macro that defines a
table of test cases, each with descriptive names and input/expected output
pairs, then use check_values to verify each scenario against the
boot_interface_target_from function call. This consolidates the multiple
invocations of the same operation into a single parameterized test structure for
better maintainability and easier expansion.

Source: Coding guidelines

crates/api-core/src/handlers/bmc_endpoint_explorer.rs (1)

1357-1379: ⚡ Quick win

Adopt table-driven scenarios for prediction-disambiguation tests

This new case is solid, but this test area now has multiple input/output variants for the same operation (resolve_admin_boot_interface_target). Converting these prediction-selection cases to a table-driven style will make additions safer and less repetitive.

Refactor sketch
+use carbide_test_support::check_values;
+
+#[test]
+fn no_mac_prediction_disambiguation_cases() {
+    struct Case {
+        name: &'static str,
+        candidates: BootInterfaceCandidates,
+        expected: Option<BootInterfaceTarget>,
+    }
+
+    let cases = vec![
+        Case {
+            name: "multiple non-primary predictions refuse",
+            candidates: BootInterfaceCandidates {
+                interfaces: vec![],
+                predicted: vec![
+                    predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")),
+                    predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1")),
+                ],
+            },
+            expected: None,
+        },
+        Case {
+            name: "declared primary prediction wins",
+            candidates: BootInterfaceCandidates {
+                interfaces: vec![],
+                predicted: vec![
+                    predicted("00:00:5e:00:53:02", Some("NIC.Slot.7-1-1")),
+                    PredictedMachineInterface {
+                        primary_interface: true,
+                        ..predicted("00:00:5e:00:53:01", Some("NIC.Embedded.1-1-1"))
+                    },
+                ],
+            },
+            expected: Some(BootInterfaceTarget::Pair(pair(
+                "00:00:5e:00:53:01",
+                "NIC.Embedded.1-1-1",
+            ))),
+        },
+    ];
+
+    check_values(cases, |c| {
+        resolve_admin_boot_interface_target(None, Some(&c.candidates), None)
+    }, |c| c.expected.clone());
+}

As per coding guidelines, “Prefer table-driven tests … Reach for a table whenever two or more tests call the same operation with different inputs.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs` around lines 1357 -
1379, The test `no_mac_declared_primary_prediction_wins_over_other_predictions`
is one of multiple tests that call the same function
`resolve_admin_boot_interface_target` with different inputs and expected
outputs. Refactor these related prediction-disambiguation tests into a
table-driven test format by creating a struct or tuple that defines test cases
with their respective inputs (such as the different predicted interface
configurations and expected boot interface targets), then replace the individual
test functions with a single test function that iterates through this table of
cases. This approach will make it easier to add new prediction-selection
scenarios and reduce repetition across similar tests.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 10972-10975: The comment block describing the boot NIC MAC
resolution logic in the SetBootOrder handler (around lines 10972-10975) is
outdated and does not reflect the current implementation. Update this comment to
accurately describe the resolution strategy: first attempt to resolve from the
database row, then fall back to using usable predictions if the row is
unavailable, and only wait if both the row and usable predictions are absent.
This ensures the comment accurately reflects the prediction fallback behavior in
the current code.

---

Nitpick comments:
In `@crates/api-core/src/handlers/bmc_endpoint_explorer.rs`:
- Around line 1357-1379: The test
`no_mac_declared_primary_prediction_wins_over_other_predictions` is one of
multiple tests that call the same function `resolve_admin_boot_interface_target`
with different inputs and expected outputs. Refactor these related
prediction-disambiguation tests into a table-driven test format by creating a
struct or tuple that defines test cases with their respective inputs (such as
the different predicted interface configurations and expected boot interface
targets), then replace the individual test functions with a single test function
that iterates through this table of cases. This approach will make it easier to
add new prediction-selection scenarios and reduce repetition across similar
tests.

In `@crates/api-model/src/machine/mod.rs`:
- Around line 3507-3556: The four test functions
pick_boot_prediction_returns_the_declared_primary,
pick_boot_prediction_returns_the_sole_non_underlay_prediction,
pick_boot_prediction_refuses_multiple_non_primary_predictions, and
pick_boot_prediction_ignores_underlay_only_predictions all test the same
pick_boot_prediction function with different input variants. Convert these
individual test functions into a single table-driven test using the
value_scenarios! macro from carbide_test_support, where each row represents one
test case with its input predictions and expected output, eliminating the
repetitive boilerplate and making the test easier to maintain and extend with
new cases.

In `@crates/machine-controller/src/boot_interface.rs`:
- Around line 143-193: Consolidate the four separate test functions
(boot_target_prefers_the_owned_row_over_predictions,
boot_target_falls_back_to_the_prediction_before_the_first_lease,
boot_target_prediction_without_id_is_mac_only, and
boot_target_is_none_without_row_or_prediction) into a single table-driven test.
Replace these individual tests with a value_scenarios! macro that defines a
table of test cases, each with descriptive names and input/expected output
pairs, then use check_values to verify each scenario against the
boot_interface_target_from function call. This consolidates the multiple
invocations of the same operation into a single parameterized test structure for
better maintainability and easier expansion.

In `@crates/machine-controller/src/handler.rs`:
- Around line 3516-3524: The load_boot_predictions function currently begins a
transaction via ctx.services.db_pool.begin() for a read-only database operation
(find_by_machine_id), which introduces unnecessary transaction overhead. Replace
the transaction initialization with a pooled read connection acquired from
ctx.services.db_pool, and pass that connection to the find_by_machine_id call
instead. This will eliminate the unnecessary transaction/rollback overhead on
this frequently-invoked code path while maintaining the same functionality.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47327c9c-4b05-4d0d-bd90-a3000bd4e2db

📥 Commits

Reviewing files that changed from the base of the PR and between 79e54c0 and 401b808.

📒 Files selected for processing (5)
  • crates/api-core/src/handlers/bmc_endpoint_explorer.rs
  • crates/api-model/src/machine/mod.rs
  • crates/machine-controller/src/boot_interface.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/bios_config.rs

…st lease

The admin boot actions learned to resolve a host still awaiting its first
DHCP lease in NVIDIA#2528 -- from its predicted boot interface -- but the
machine-controller never did: it read only `machine_interfaces` rows, so a
zero-DPU or NIC-mode host (which gets no primary row at ingestion) had
nothing to boot-target and its boot states waited. This gives the controller
the same fallback: before the first lease creates a real row, the host's
predicted boot interface answers, and the real row wins the moment it exists.

The selection is now one shared `pick_boot_prediction` -- the declared
`ExpectedHostNic.primary` (recorded on the prediction in NVIDIA#2657), else the
sole non-underlay prediction, else refuse to guess. Both the machine-controller
and the admin resolver route through it, so a multi-NIC host with a declared
primary resolves instead of refusing, and the two paths can't drift.

- Add `pick_boot_prediction` (api-model) and give the controller's
  `boot_interface_target`/`resolve_boot_interface` a prediction fallback, loaded
  per resolve via `load_boot_predictions`. `AwaitingNic` now fires only when
  there is no row and no usable prediction.
- Route admin `resolve_admin_boot_interface_target` through the same selector,
  so it honors the declared-primary flag for multi-NIC hosts (it used to refuse).
- The real `machine_interfaces` row always wins the moment it exists; a host
  with one expected boot NIC resolves from its sole prediction.

Tests cover `pick_boot_prediction`'s precedence, the controller's
row-wins / prediction-fallback / MAC-only / wait composition (split out so it
unit-tests without a full snapshot), and the admin declared-primary case.

Part of NVIDIA#2658 (epic NVIDIA#2660).

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the controller-boot-from-predictions branch from 401b808 to 649c6ff Compare June 20, 2026 16:21
@chet

chet commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai Thanks for the eyes. Updated -- refreshed the stale set_host_boot_order comment to describe the prediction fallback, and switched load_boot_predictions to a pooled acquire() read connection instead of a transaction. Leaving the table-driven test suggestions as-is: the new tests match the per-case #[test] style of the existing tests in each module (pick_boot_interface_*, classify_*, no_mac_*), so converting only the new ones would split each module's style; a module-wide value_scenarios! conversion (which also adds a carbide-test-support dev-dep to machine-controller) belongs to the ongoing test-restructuring work, not this feature PR. PTAL. Resolve if good, thanks!

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

# List review comments by coderabbitai[bot]
gh api "/repos/$REPO/pulls/$PR/comments" --paginate \
  | jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:80]}]'

Length of output: 301


🏁 Script executed:

#!/bin/bash
set -euo pipefail

REPO="NVIDIA/infra-controller"
PR=2712

# List issue/review-level comments by coderabbitai[bot]
gh api "/repos/$REPO/issues/$PR/comments" --paginate \
  | jq '[.[] | select(.user.login == "coderabbitai[bot]") | {id, body: .body[:120]}]'

Length of output: 819



@chet: Both actionable items are confirmed resolved. The updated comment in set_host_boot_order now accurately describes the prediction-fallback path, and switching load_boot_predictions to acquire() is the correct choice. The rationale for leaving the per-case #[test] style intact is sound — a module-wide value_scenarios! conversion is cleanly separable and belongs in the test-restructuring track rather than here. No further issues.

(≧◡≦) 🐇

[resolve]

@chet chet marked this pull request as ready for review June 21, 2026 04:16
@chet chet requested a review from a team as a code owner June 21, 2026 04:16
@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 704 34 183 258 35 194
machine_validation 704 34 183 258 35 194
nvmetal-carbide 704 34 183 258 35 194
TOTAL 2382 108 572 879 111 712

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@chet chet merged commit 611e96b into NVIDIA:main Jun 21, 2026
55 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants