Skip to content

[llmd] Work on Azure#930

Open
kpouget wants to merge 14 commits into
openshift-psap:mainfrom
kpouget:aks
Open

[llmd] Work on Azure#930
kpouget wants to merge 14 commits into
openshift-psap:mainfrom
kpouget:aks

Conversation

@kpouget
Copy link
Copy Markdown
Contributor

@kpouget kpouget commented May 11, 2026

Summary by CodeRabbit

  • Chores

    • Updated CI/testing presets and flavor mappings; added an Azure H100 preset and adjusted GPU/H200-focused presets and flags
    • Added local model entries and updated top-level test/inference-service settings
    • Enhanced debug artifact capture to include HTTPRoute YAML and status during deployment
  • New Features

    • Added new inference report generation entries (and reorganized some existing visualizations)
  • Bug Fixes

    • Improved flavor filtering in baseline throughput comparisons

Review Change Stack

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjmonson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

Walkthrough

Updates LLM‑D testing CI presets and local models; extends ISVC reshape to support hostPath models, tensor-parallel args, InfiniBand/AKS hotfix wiring, and router tolerations; adjusts scheduler nodeSelectors; adds Ansible HTTPRoute debug captures; and tweaks visualization/report generation and plotting.

Changes

ISVC reshape, hostPath, tensor-parallel, and InfiniBand

Layer / File(s) Summary
Tensor-parallel helper and prefill integration
projects/llm-d/testing/test_llmd.py
Adds add_tensor_parallelism_to_container, and updates flavor/prefill tensor-parallel application to use the helper.
HostPath model wiring and vLLM args
projects/llm-d/testing/test_llmd.py
apply_model_configuration accepts flavor; hostpath sources become hf://..., hostPath hf-cache volume is wired, containers’ commands rewritten to vllm serve with parsed VLLM_ADDITIONAL_ARGS, and decode/prefill ports chosen by base flavor.
InfiniBand and AKS hotfix handling
projects/llm-d/testing/test_llmd.py
Adds rdma resource requests/limits for pd flavors, AKS UCX/NIC env vars, IPC_LOCK capability, optional vllm-ucx-multiproc-hotfix ConfigMap creation and mounting, and pod ulimit annotations.
Prepare: skip downloads for hostpath models
projects/llm-d/testing/prepare_llmd.py
Early-return from download_single_model when model source starts with hostpath:, skipping PVC/download steps.

CI presets, model registry, and EPP defaults

Layer / File(s) Summary
CI presets and local models
projects/llm-d/testing/config.yaml
Adds ci_presets.to_apply, introduces azure_h100, updates GPU/H200 preset flags, updates prepare.rhoai.tag, adds local llama3-* and gpt-oss-120-local hostpath models, updates test flavors and inference_service router/infiniband/hybrid_kv_cache flags.

Scheduler manifests

Layer / File(s) Summary
Scheduler container and nodeSelector tweaks
projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml, projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml
Removes inline router.scheduler.template.containers block from PD manifest; changes GPU nodeSelector key in simple manifest to nvidia.com/gpu.deploy.container-toolkit: "true".

Ansible debug artifact capture

Layer / File(s) Summary
Capture HTTPRoute YAML and status in tasks
projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml, projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
Adds tasks to save oc get httproute YAML and status outputs to artifacts, using selectors for router component and service name; tasks use failed_when: false.

Visualizations and plotting

Layer / File(s) Summary
Reports config and baseline skip
projects/llm-d/visualizations/llmd_inference/data/reports.yaml, projects/llm-d/visualizations/llmd_inference/data/plots.yaml, projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py
Adds visualize.llmd_inference.generate entries; comments out several plot generation entries; plotting now silently skips simple-tp4-x4 before rendering its header.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm

Poem

🐰 I hopped through configs, changed a line or two,
HostPath caches stitched and vLLM commands anew,
InfiniBand whispers, AKS hotfix in place,
HTTPRoutes captured so traces we can trace,
Reports sleep gently — testing leaps into view!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 79.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[llmd] Work on Azure' is vague and overly broad; it does not clearly indicate the specific changes being made despite the changeset containing multiple significant modifications across testing configs, Ansible tasks, visualizations, and inference service specifications. Revise the title to be more descriptive of the main change, such as '[llmd] Add Azure H100 preset and update inference service configurations' or '[llmd] Configure Azure H100 support and InfiniBand handling'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py (1)

324-335: ⚡ Quick win

Remove unused filter_flavors function or refactor to use it.

The filter_flavors function is defined but never called within BaselineComparisonsReport.do_plot. Meanwhile, the newly added flavor filtering at lines 348-349 uses a simple if/continue pattern instead.

The other report classes (IntelligentRoutingComparisonsReport and PDComparisonsReport) define identical filter_flavors functions and actually use them for filtering. For consistency and to avoid dead code, consider either:

  1. Remove this unused function (simpler if the inline skip is sufficient), or
  2. Refactor lines 348-349 to use filter_flavors for consistency with other reports.
Option 1: Remove the unused function
-        def filter_flavors(setting_lists, flavor_filter):
-            """Filter flavors from setting_lists based on provided filter function"""
-            updated_setting_lists = []
-            for setting_list in setting_lists:
-                if setting_list and setting_list[0][0] == 'flavor':
-                    # Apply the filter function to flavors
-                    filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
-                    if filtered_flavors:
-                        updated_setting_lists.append(filtered_flavors)
-                else:
-                    updated_setting_lists.append(setting_list)
-            setting_lists[:] = updated_setting_lists
-
         # Get simple flavors
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`
around lines 324 - 335, The file defines a helper function filter_flavors that
is never used in BaselineComparisonsReport.do_plot while that method instead
uses an inline "if/continue" flavor check; either delete the unused
filter_flavors function to remove dead code, or replace the inline flavor skip
in BaselineComparisonsReport.do_plot with a call to
filter_flavors(setting_lists, flavor_filter) so behavior matches
IntelligentRoutingComparisonsReport and PDComparisonsReport—update any variable
names to match the existing signature and ensure setting_lists is mutated in
place as filter_flavors does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 7: The config currently forces the azure-specific preset by setting
to_apply: [azure_h100], which causes all runs to inherit Azure behavior
(including the kpouget-dev namespace); remove or unset the global to_apply entry
so the default config is neutral and require selecting the azure_h100 preset
explicitly in CI/job inputs (or move the preset into the specific job/pipeline
entries that need it) to avoid leaking Azure-specific settings into non-Azure
workflows.
- Line 138: The root-level "skip: true" toggle in
projects/llm-d/testing/config.yaml is globally scoped and will disable baseline
prepare steps for all presets; move these Azure-specific skip flags out of the
top-level prepare.*.skip and instead set them only inside the Azure CI preset
(e.g., ci_presets.azure_h100 -> prepare.*.skip) so defaults remain
unchanged—update the three occurrences noted (lines referenced in the review) by
removing or setting them false at root and adding the skip:true entries within
the ci_presets.azure_h100 overrides for the corresponding prepare sections.

---

Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`:
- Around line 324-335: The file defines a helper function filter_flavors that is
never used in BaselineComparisonsReport.do_plot while that method instead uses
an inline "if/continue" flavor check; either delete the unused filter_flavors
function to remove dead code, or replace the inline flavor skip in
BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists,
flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and
PDComparisonsReport—update any variable names to match the existing signature
and ensure setting_lists is mutated in place as filter_flavors does.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7ee7db4-319b-46ac-96f9-5edcb3d1ede1

📥 Commits

Reviewing files that changed from the base of the PR and between 0240e09 and 16f23b4.

📒 Files selected for processing (5)
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
  • projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
  • projects/llm-d/visualizations/llmd_inference/data/plots.yaml
  • projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

Comment thread projects/llm-d/testing/config.yaml Outdated
Comment thread projects/llm-d/testing/config.yaml Outdated
@kpouget kpouget force-pushed the aks branch 3 times, most recently from ec5f6fe to ad0254c Compare May 20, 2026 11:13
@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 20 minutes 16 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 21 minutes 40 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 21 minutes 21 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 intelligentrouting-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 20, 2026

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval llama-70b
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 40 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

Failure indicator: Empty. (See run.log)

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 16 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

@kpouget kpouget force-pushed the aks branch 2 times, most recently from 1efcc76 to f301a7d Compare May 21, 2026 06:53
@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🔴 Test of 'llm-d test test_ci' failed after 00 hours 23 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b

Failure indicator:

/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/000__llmd__deploy_llm_inference_service/FAILURE | [000__llmd__deploy_llm_inference_service] ./run_toolbox.py llmd deploy_llm_inference_service --name=llm-d-simple --namespace=kpouget-dev --yaml_file=/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/llmisvc-simple.yaml --> 2


@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: [simple-tp2]

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: simple-tp2

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 08 minutes 02 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
tests.llmd.flavors: simple-tp2

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 20 minutes 38 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
PR_POSITIONAL_ARG_4: guidellm_heterogeneous_eval

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false
/var tests.llmd.flavors: intelligent-tp4-x4

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 21, 2026

/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false
/var tests.llmd.flavors: intelligent-tp4-x4

@psap-forge-bot
Copy link
Copy Markdown

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 39 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: guidellm_multiturn_eval
PR_POSITIONAL_ARG_3: gpt-oss
tests.llmd.flavors: intelligent-tp4-x4
tests.llmd.inference_service.aks_hotfix_enabled: false

Failure indicator:

/tmp/topsail_202605211779397318/002__plots/FAILURE | An error happened during the visualization post-processing ... (0_matbench_parse.log in /tmp/topsail_202605211779397318/002__plots). Mind that the test that was processed FAILED.


@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 22, 2026

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 22, 2026

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
projects/llm-d/testing/config.yaml (1)

319-319: ⚡ Quick win

Make the flavors type consistent with preset definitions.

Line 319 uses a bare string pd-x2-ptp4-px1-dtp4, while preset definitions at lines 32, 87, and 93 use lists for the same field. This type inconsistency can cause issues if the code expects a list and receives a string.

♻️ Suggested change for type consistency
-    flavors: pd-x2-ptp4-px1-dtp4
+    flavors: [pd-x2-ptp4-px1-dtp4]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/config.yaml` at line 319, The `flavors` field is
currently a bare string (pd-x2-ptp4-px1-dtp4) but other preset definitions use a
list; update the `flavors` value to be a YAML list to match the presets (i.e.,
convert the scalar to a sequence containing the same item) so code that expects
an array/sequence will receive a list; ensure the `flavors` key in this config
matches the same list formatting used by the preset entries referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 340: The global setting tests.llmd.inference_service.aks_hotfix_enabled
is currently true and will apply AKS-specific behavior everywhere; change the
default to false in that section and enable it only in the Azure preset by
adding tests.llmd.inference_service.aks_hotfix_enabled: true under the azure
preset (or, if you prefer to keep it true globally, update the comment next to
aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS
environments and must be overridden where unsafe). Ensure you update the
tests.llmd.inference_service block and the azure preset block (named azure)
accordingly.

In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml`:
- Line 21: The pod nodeSelector in
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently uses the
nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true"; verify
that this exact label/value exists on your AKS H100 node pool (e.g., inspect
node labels) or replace the selector with the repo-standard label
nvidia.com/gpu.present: "true" so pods will schedule consistently; update the
manifest's nodeSelector key accordingly or add a note documenting the cluster
labeling contract if the custom label is intentional.

In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1015-1028: The helper apply_to_containers assumes
container['resources'], ['limits'] and ['requests'] exist which can raise
KeyError; update apply_to_containers to ensure container['resources'] is a dict
and to initialize container['resources']['limits'] and
container['resources']['requests'] if missing (e.g., via setdefault or explicit
checks) before popping or assigning RDMA keys, then proceed to use
infiniband_config to add the appropriate RDMA entries.
- Around line 631-650: The code currently enforces that VLLM_ADDITIONAL_ARGS
must exist by raising a RuntimeError when vllm_args_found is False; remove that
requirement so hostPath models can run with a plain "vllm serve" command: delete
or replace the final if-not-vllm_args_found raise block (the check using
vllm_args_found and the RuntimeError) in the routine that processes
container['env'] so that when VLLM_ADDITIONAL_ARGS is absent the code simply
continues (optionally log a debug/info message) instead of throwing an
exception.
- Around line 723-743: The code currently replaces existing volumeMounts/volumes
(e.g., container['volumeMounts'] = [hf_cache_mount] and
isvc_data['spec']['template']['volumes'] = [hf_cache_volume]) which drops
pre-existing mounts; change these to append the hf_cache_mount/hf_cache_volume
instead: ensure in configure_container (and when handling prefill_container) you
create the keys if missing and extend the existing lists (e.g., use
container.setdefault('volumeMounts', []).append(hf_cache_mount) and
template.setdefault('volumes', []).append(hf_cache_volume') semantics) so
existing TLS/config mounts and other volumes are preserved.

In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml`:
- Around line 95-109: The two tasks "Capture HTTPRoute for routing" and "Capture
HTTPRoute status" only redirect stdout to the artifact files so any oc errors go
to stderr and are lost; update the shell commands that call oc get httproute
(the tasks referencing variables llmd_capture_isvc_state_llmisvc_name,
target_namespace and artifact_extra_logs_dir) to also capture stderr into the
same artifact files (e.g., redirect stderr into stdout) so that diagnostic
messages are preserved while keeping ignore_errors: true.

In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`:
- Around line 167-182: The two tasks "Capture HTTPRoute YAML for routing" and
"Capture HTTPRoute status for routing" drop stderr when the shell runs with
failed_when: false; update each shell invocation in tasks/main.yml so the oc
command's stderr is preserved in the artifact files (e.g., redirect stderr into
the same output file or a companion .stderr file) — keep failed_when: false but
ensure the shell redirect captures both stdout and stderr (or write stderr
separately) so any command failure details are recorded in the artifacts.

---

Nitpick comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 319: The `flavors` field is currently a bare string (pd-x2-ptp4-px1-dtp4)
but other preset definitions use a list; update the `flavors` value to be a YAML
list to match the presets (i.e., convert the scalar to a sequence containing the
same item) so code that expects an array/sequence will receive a list; ensure
the `flavors` key in this config matches the same list formatting used by the
preset entries referenced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6979cfda-842f-457a-a917-08bd6e3884c8

📥 Commits

Reviewing files that changed from the base of the PR and between 16f23b4 and 47821cb.

📒 Files selected for processing (10)
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
  • projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml
  • projects/llm-d/testing/prepare_llmd.py
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
  • projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
  • projects/llm-d/visualizations/llmd_inference/data/plots.yaml
  • projects/llm-d/visualizations/llmd_inference/data/reports.yaml
  • projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py
💤 Files with no reviewable changes (2)
  • projects/llm-d/visualizations/llmd_inference/data/plots.yaml
  • projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
✅ Files skipped from review due to trivial changes (1)
  • projects/llm-d/visualizations/llmd_inference/data/reports.yaml

model: gpt-oss-20b
max_model_len: 40960
infiniband: null # null=unchanged, true=enable rdma/ib: "1", false=disable rdma/ib
aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move AKS-specific setting to the Azure preset or default to false.

aks_hotfix_enabled: true is set globally in the default tests.llmd.inference_service section. According to the layer description, this enables "AKS-specific UCX/NIC env, IPC_LOCK capability, optional vllm-ucx-multiproc-hotfix ConfigMap creation". Setting this to true globally will apply AKS-specific behavior to all environments, including non-AKS deployments, which may fail or create unnecessary resources.

🔧 Recommended fix

Option 1 (preferred): Default to false globally and enable in Azure presets

-      aks_hotfix_enabled: true  # Enable/disable AKS ConfigMap hotfix mounts
+      aks_hotfix_enabled: false  # Enable/disable AKS ConfigMap hotfix mounts

Then add to the azure preset (around line 60):

  azure:
    security.run_as_root: true
    prepare.preload.skip: true
    prepare.operators.skip: true
    prepare.monitoring.skip: true
    prepare.cluster.skip: true
    prepare.rhoai.skip: true
    
    tests.capture_prom: false
    tests.capture_prom_uwm: false
    tests.llmd.skip_prepare: false
+   tests.llmd.inference_service.aks_hotfix_enabled: true

Option 2: Keep true but document that non-AKS environments must override

If the hotfix is safe to apply universally (no-op on non-AKS), update the comment to clarify this behavior.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts
aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mounts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/config.yaml` at line 340, The global setting
tests.llmd.inference_service.aks_hotfix_enabled is currently true and will apply
AKS-specific behavior everywhere; change the default to false in that section
and enable it only in the Azure preset by adding
tests.llmd.inference_service.aks_hotfix_enabled: true under the azure preset
(or, if you prefer to keep it true globally, update the comment next to
aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS
environments and must be overridden where unsafe). Ensure you update the
tests.llmd.inference_service block and the azure preset block (named azure)
accordingly.

- pluginRef: max-score-picker
nodeSelector:
nvidia.com/gpu.present: "true"
nvidia.com/gpu.deploy.container-toolkit: "true"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for references to both old and new nodeSelector labels

echo "=== Searching for old label: nvidia.com/gpu.present ==="
rg -n "nvidia\.com/gpu\.present" -C2

echo -e "\n=== Searching for new label: nvidia.com/gpu.deploy.container-toolkit ==="
rg -n "nvidia\.com/gpu\.deploy\.container-toolkit" -C2

echo -e "\n=== All nvidia.com/gpu nodeSelector patterns ==="
rg -n "nvidia\.com/gpu\.[a-z.-]+" -C2

Repository: openshift-psap/topsail

Length of output: 21530


🌐 Web query:

AKS NVIDIA GPU node labels container toolkit deployment

💡 Result:

In Azure Kubernetes Service (AKS), GPU node pools come with NVIDIA drivers and the NVIDIA Container Toolkit pre-installed by default [1]. Consequently, you typically do not need to manually deploy the container toolkit [2][1]. Node Labeling and Scheduling AKS automatically applies default labels to GPU nodes, such as accelerator=nvidia [3]. When deploying GPU-enabled workloads, it is recommended practice to: 1. Use Node Pools and Taints: Isolate GPU workloads by creating dedicated GPU node pools and applying taints (e.g., sku=gpu:NoSchedule) [4][5]. 2. Use Node Selectors or Affinity: Ensure your workload pods land on the correct nodes using node selectors or affinity that match your labels [4][6]. 3. Custom Labels: You can apply additional labels as needed for specific hardware or configuration requirements using kubectl label nodes = [6]. NVIDIA GPU Operator and Toolkit While AKS provides pre-installed components, you may choose to use the NVIDIA GPU Operator for advanced lifecycle management (e.g., automated driver updates, monitoring) [2][1]. If you use the Operator in an environment where drivers and the toolkit are already pre-installed, you must configure the installation to disable these components to avoid conflicts [1][7]: helm install gpu-operator nvidia/gpu-operator -n gpu-operator --set driver.enabled=false --set toolkit.enabled=false --set operator.runtimeClass=nvidia-container-runtime If you require custom driver management, you can skip the default installation during node pool creation by using the --gpu-driver none flag [2][8]. Once skipped, you assume responsibility for installing the drivers and the container toolkit (e.g., via the GPU Operator or manual DaemonSet deployment) [2][1][8]. Dynamic Resource Allocation (DRA) For newer scenarios like virtualized GPUs (vGPU), AKS utilizes NVIDIA DRA, which runs as a DaemonSet and may require you to manually apply specific labels, such as nvidia.com/gpu.present=true, to your nodes [3]. Always verify the current AKS documentation for your specific GPU architecture and workload requirements [2][9].

Citations:


Harden/align nodeSelector label for router scheduler pods (AKS H100): nvidia.com/gpu.deploy.container-toolkit is repo-unique and not documented as a standard AKS/NVIDIA label

  • In projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml (line 21), pod placement is constrained to nvidia.com/gpu.deploy.container-toolkit: "true"; if that label isn’t present on the target GPU nodes, pods won’t schedule.
  • The repo references nvidia.com/gpu.present=true across configs/manifests/tests, but nvidia.com/gpu.deploy.container-toolkit is only used in this single manifest.
  • Public AKS NVIDIA GPU docs/guidance point to nvidia.com/gpu.present=true (not nvidia.com/gpu.deploy.container-toolkit) for scheduling, and note the toolkit is typically already installed on AKS GPU nodes.
  • Action: confirm the exact label/value exists on the AKS H100 GPU node pool you target (e.g., oc get nodes --show-labels and grep), and/or switch this nodeSelector to the repo’s existing nvidia.com/gpu.present: "true" (or document the cluster labeling contract if intentionally different).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml` at line 21, The pod
nodeSelector in projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently
uses the nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true";
verify that this exact label/value exists on your AKS H100 node pool (e.g.,
inspect node labels) or replace the selector with the repo-standard label
nvidia.com/gpu.present: "true" so pods will schedule consistently; update the
manifest's nodeSelector key accordingly or add a note documenting the cluster
labeling contract if the custom label is intentional.

Comment thread projects/llm-d/testing/test_llmd.py
Comment thread projects/llm-d/testing/test_llmd.py
Comment on lines +1015 to +1028
def apply_to_containers(containers):
for container in containers:
# Always remove existing rdma/ib first
container['resources']['limits'].pop('rdma/ib', None)
container['resources']['requests'].pop('rdma/ib', None)

# Add InfiniBand resource based on config type
if infiniband_config is True:
container['resources']['limits']['rdma/ib'] = "1"
container['resources']['requests']['rdma/ib'] = "1"
elif isinstance(infiniband_config, str):
# Use custom resource string (e.g., "rdma/shared_ib")
container['resources']['limits'][infiniband_config] = "1"
container['resources']['requests'][infiniband_config] = "1"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize resources.requests and resources.limits before writing RDMA keys.

This helper assumes every decode/prefill container already has resources.requests and resources.limits. That isn't guaranteed on the current reshape path, so a plain pd manifest without those blocks can fail here with KeyError.

Suggested fix
     def apply_to_containers(containers):
         for container in containers:
+            resources = container.setdefault('resources', {})
+            limits = resources.setdefault('limits', {})
+            requests = resources.setdefault('requests', {})
+
             # Always remove existing rdma/ib first
-            container['resources']['limits'].pop('rdma/ib', None)
-            container['resources']['requests'].pop('rdma/ib', None)
+            limits.pop('rdma/ib', None)
+            requests.pop('rdma/ib', None)
 
             # Add InfiniBand resource based on config type
             if infiniband_config is True:
-                container['resources']['limits']['rdma/ib'] = "1"
-                container['resources']['requests']['rdma/ib'] = "1"
+                limits['rdma/ib'] = "1"
+                requests['rdma/ib'] = "1"
             elif isinstance(infiniband_config, str):
                 # Use custom resource string (e.g., "rdma/shared_ib")
-                container['resources']['limits'][infiniband_config] = "1"
-                container['resources']['requests'][infiniband_config] = "1"
+                limits[infiniband_config] = "1"
+                requests[infiniband_config] = "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def apply_to_containers(containers):
for container in containers:
# Always remove existing rdma/ib first
container['resources']['limits'].pop('rdma/ib', None)
container['resources']['requests'].pop('rdma/ib', None)
# Add InfiniBand resource based on config type
if infiniband_config is True:
container['resources']['limits']['rdma/ib'] = "1"
container['resources']['requests']['rdma/ib'] = "1"
elif isinstance(infiniband_config, str):
# Use custom resource string (e.g., "rdma/shared_ib")
container['resources']['limits'][infiniband_config] = "1"
container['resources']['requests'][infiniband_config] = "1"
def apply_to_containers(containers):
for container in containers:
resources = container.setdefault('resources', {})
limits = resources.setdefault('limits', {})
requests = resources.setdefault('requests', {})
# Always remove existing rdma/ib first
limits.pop('rdma/ib', None)
requests.pop('rdma/ib', None)
# Add InfiniBand resource based on config type
if infiniband_config is True:
limits['rdma/ib'] = "1"
requests['rdma/ib'] = "1"
elif isinstance(infiniband_config, str):
# Use custom resource string (e.g., "rdma/shared_ib")
limits[infiniband_config] = "1"
requests[infiniband_config] = "1"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/test_llmd.py` around lines 1015 - 1028, The helper
apply_to_containers assumes container['resources'], ['limits'] and ['requests']
exist which can raise KeyError; update apply_to_containers to ensure
container['resources'] is a dict and to initialize
container['resources']['limits'] and container['resources']['requests'] if
missing (e.g., via setdefault or explicit checks) before popping or assigning
RDMA keys, then proceed to use infiniband_config to add the appropriate RDMA
entries.

Comment on lines +95 to +109
- name: Capture HTTPRoute for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
-oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"
ignore_errors: true

- name: Capture HTTPRoute status
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
> "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"
ignore_errors: true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Capture stderr into the HTTPRoute artifacts as well.

At Line 97 and Line 105, only stdout is redirected. When oc get httproute fails, the most useful diagnostics are on stderr, so the artifact file can be empty while the task still passes (ignore_errors: true).

Suggested patch
 - name: Capture HTTPRoute for routing
   shell:
     oc get httproute \
        -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
        -n "{{ target_namespace }}" \
-       -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"
+       --ignore-not-found \
+       -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1
   ignore_errors: true

 - name: Capture HTTPRoute status
   shell:
     oc get httproute \
        -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
        -n "{{ target_namespace }}" \
-       > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"
+       --ignore-not-found \
+       > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1
   ignore_errors: true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Capture HTTPRoute for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
-oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"
ignore_errors: true
- name: Capture HTTPRoute status
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
> "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"
ignore_errors: true
- name: Capture HTTPRoute for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
--ignore-not-found \
-oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1
ignore_errors: true
- name: Capture HTTPRoute status
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
--ignore-not-found \
> "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1
ignore_errors: true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml` around lines
95 - 109, The two tasks "Capture HTTPRoute for routing" and "Capture HTTPRoute
status" only redirect stdout to the artifact files so any oc errors go to stderr
and are lost; update the shell commands that call oc get httproute (the tasks
referencing variables llmd_capture_isvc_state_llmisvc_name, target_namespace and
artifact_extra_logs_dir) to also capture stderr into the same artifact files
(e.g., redirect stderr into stdout) so that diagnostic messages are preserved
while keeping ignore_errors: true.

Comment on lines +167 to +182
- name: Capture HTTPRoute YAML for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
-n "{{ llmd_deploy_llm_inference_service_namespace }}" \
-oyaml \
> "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml"
failed_when: false

- name: Capture HTTPRoute status for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
-n "{{ llmd_deploy_llm_inference_service_namespace }}" \
> "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status"
failed_when: false
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve HTTPRoute command errors in artifact files.

At Line 169 and Line 178, failed_when: false keeps the play running, but stderr is currently discarded. If the query fails, artifacts may miss the failure reason.

Suggested patch
   - name: Capture HTTPRoute YAML for routing
     shell:
       oc get httproute \
          -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
          -n "{{ llmd_deploy_llm_inference_service_namespace }}" \
+         --ignore-not-found \
          -oyaml \
-         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml"
+         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml" 2>&1
     failed_when: false

   - name: Capture HTTPRoute status for routing
     shell:
       oc get httproute \
          -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
          -n "{{ llmd_deploy_llm_inference_service_namespace }}" \
-         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status"
+         --ignore-not-found \
+         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status" 2>&1
     failed_when: false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`
around lines 167 - 182, The two tasks "Capture HTTPRoute YAML for routing" and
"Capture HTTPRoute status for routing" drop stderr when the shell runs with
failed_when: false; update each shell invocation in tasks/main.yml so the oc
command's stderr is preserved in the artifact files (e.g., redirect stderr into
the same output file or a companion .stderr file) — keep failed_when: false but
ensure the shell redirect captures both stdout and stderr (or write stderr
separately) so any command failure details are recorded in the artifacts.

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 16 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: pd-flavors
PR_POSITIONAL_ARG_3: guidellm_multiturn_eval
PR_POSITIONAL_ARG_4: gpt-oss

@kpouget
Copy link
Copy Markdown
Contributor Author

kpouget commented May 22, 2026

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
projects/llm-d/testing/test_llmd.py (1)

1163-1171: 💤 Low value

Remove unused add_hotfix_volumes function.

add_hotfix_volumes (lines 1163-1171) is defined but never called. Only add_hotfix_to_containers (lines 1174-1182) is used, and they are identical.

Suggested fix
-        # Helper function to add hotfix volume mounts to containers
-        def add_hotfix_volumes(containers, template_volumes):
-            for container in containers:
-                # Add volume mounts
-                if 'volumeMounts' not in container:
-                    container['volumeMounts'] = []
-                container['volumeMounts'].extend(hotfix_mounts)
-
-            # Add volume to template
-            template_volumes.append(hotfix_volume)
-
         # Apply hotfix volumes and complete configuration to containers
         def add_hotfix_to_containers(containers, template_volumes):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/test_llmd.py` around lines 1163 - 1171, The function
add_hotfix_volumes is redundant and never invoked; remove its definition (the
entire add_hotfix_volumes function block) and keep the existing
add_hotfix_to_containers implementation instead so behavior remains unchanged;
verify that references to hotfix_mounts, hotfix_volume, containers, and
template_volumes remain used only by add_hotfix_to_containers and adjust any
imports or tests if they referenced add_hotfix_volumes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1163-1171: The function add_hotfix_volumes is redundant and never
invoked; remove its definition (the entire add_hotfix_volumes function block)
and keep the existing add_hotfix_to_containers implementation instead so
behavior remains unchanged; verify that references to hotfix_mounts,
hotfix_volume, containers, and template_volumes remain used only by
add_hotfix_to_containers and adjust any imports or tests if they referenced
add_hotfix_volumes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d86284a-3bfe-4231-a5b1-bc880d6815e3

📥 Commits

Reviewing files that changed from the base of the PR and between 47821cb and 19bf445.

📒 Files selected for processing (2)
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py

@psap-forge-bot
Copy link
Copy Markdown

🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 18 minutes 36 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: pd-flavors
PR_POSITIONAL_ARG_3: guidellm_multiturn_eval
PR_POSITIONAL_ARG_4: gpt-oss

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant