[llmd] Work on Azure by kpouget · Pull Request #930 · openshift-psap/topsail

kpouget · 2026-05-11T08:19:51Z

Summary by CodeRabbit

Chores
- Updated CI/testing presets and flavor mappings; added an Azure H100 preset and adjusted GPU/H200-focused presets and flags
- Added local model entries and updated top-level test/inference-service settings
- Enhanced debug artifact capture to include HTTPRoute YAML and status during deployment
New Features
- Added new inference report generation entries (and reorganized some existing visualizations)
Bug Fixes
- Improved flavor filtering in baseline throughput comparisons

openshift-ci · 2026-05-11T08:19:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjmonson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-11T08:20:04Z

📝 Walkthrough

Walkthrough

Updates LLM‑D testing CI presets and local models; extends ISVC reshape to support hostPath models, tensor-parallel args, InfiniBand/AKS hotfix wiring, and router tolerations; adjusts scheduler nodeSelectors; adds Ansible HTTPRoute debug captures; and tweaks visualization/report generation and plotting.

Changes

ISVC reshape, hostPath, tensor-parallel, and InfiniBand

Layer / File(s)	Summary
Tensor-parallel helper and prefill integration `projects/llm-d/testing/test_llmd.py`	Adds `add_tensor_parallelism_to_container`, and updates flavor/prefill tensor-parallel application to use the helper.
HostPath model wiring and vLLM args `projects/llm-d/testing/test_llmd.py`	`apply_model_configuration` accepts `flavor`; hostpath sources become `hf://...`, hostPath `hf-cache` volume is wired, containers’ commands rewritten to `vllm serve` with parsed `VLLM_ADDITIONAL_ARGS`, and decode/prefill ports chosen by base flavor.
InfiniBand and AKS hotfix handling `projects/llm-d/testing/test_llmd.py`	Adds rdma resource requests/limits for pd flavors, AKS UCX/NIC env vars, IPC_LOCK capability, optional `vllm-ucx-multiproc-hotfix` ConfigMap creation and mounting, and pod ulimit annotations.
Prepare: skip downloads for hostpath models `projects/llm-d/testing/prepare_llmd.py`	Early-return from `download_single_model` when model `source` starts with `hostpath:`, skipping PVC/download steps.

CI presets, model registry, and EPP defaults

Layer / File(s)	Summary
CI presets and local models `projects/llm-d/testing/config.yaml`	Adds `ci_presets.to_apply`, introduces `azure_h100`, updates GPU/H200 preset flags, updates `prepare.rhoai.tag`, adds local `llama3-*` and `gpt-oss-120-local` hostpath models, updates test flavors and inference_service router/infiniband/hybrid_kv_cache flags.

Scheduler manifests

Layer / File(s)	Summary
Scheduler container and nodeSelector tweaks `projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml`, `projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml`	Removes inline router.scheduler.template.containers block from PD manifest; changes GPU nodeSelector key in simple manifest to `nvidia.com/gpu.deploy.container-toolkit: "true"`.

Ansible debug artifact capture

Layer / File(s)	Summary
Capture HTTPRoute YAML and status in tasks `projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml`, `projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`	Adds tasks to save `oc get httproute` YAML and status outputs to artifacts, using selectors for router component and service name; tasks use `failed_when: false`.

Visualizations and plotting

Layer / File(s)	Summary
Reports config and baseline skip `projects/llm-d/visualizations/llmd_inference/data/reports.yaml`, `projects/llm-d/visualizations/llmd_inference/data/plots.yaml`, `projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`	Adds `visualize.llmd_inference.generate` entries; comments out several plot generation entries; plotting now silently skips `simple-tp4-x4` before rendering its header.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

openshift-psap/topsail#926: Modifies throughput_comparisons.py and related baseline/intelligent-routing/P‑D plotting logic.
openshift-psap/topsail#919: Also changes llmisvc-pd.yaml router scheduler container configuration.
openshift-psap/topsail#916: Touches test_llmd.py and prepare_llmd.py related codepaths overlapping reshape and model prepare logic.

Suggested labels

lgtm

Poem

🐰 I hopped through configs, changed a line or two,
HostPath caches stitched and vLLM commands anew,
InfiniBand whispers, AKS hotfix in place,
HTTPRoutes captured so traces we can trace,
Reports sleep gently — testing leaps into view!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 79.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title '[llmd] Work on Azure' is vague and overly broad; it does not clearly indicate the specific changes being made despite the changeset containing multiple significant modifications across testing configs, Ansible tasks, visualizations, and inference service specifications.	Revise the title to be more descriptive of the main change, such as '[llmd] Add Azure H100 preset and update inference service configurations' or '[llmd] Configure Azure H100 support and InfiniBand handling'.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py (1)

324-335: ⚡ Quick win

Remove unused filter_flavors function or refactor to use it.

The filter_flavors function is defined but never called within BaselineComparisonsReport.do_plot. Meanwhile, the newly added flavor filtering at lines 348-349 uses a simple if/continue pattern instead.

The other report classes (IntelligentRoutingComparisonsReport and PDComparisonsReport) define identical filter_flavors functions and actually use them for filtering. For consistency and to avoid dead code, consider either:

Remove this unused function (simpler if the inline skip is sufficient), or
Refactor lines 348-349 to use filter_flavors for consistency with other reports.

Option 1: Remove the unused function

-        def filter_flavors(setting_lists, flavor_filter):
-            """Filter flavors from setting_lists based on provided filter function"""
-            updated_setting_lists = []
-            for setting_list in setting_lists:
-                if setting_list and setting_list[0][0] == 'flavor':
-                    # Apply the filter function to flavors
-                    filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)]
-                    if filtered_flavors:
-                        updated_setting_lists.append(filtered_flavors)
-                else:
-                    updated_setting_lists.append(setting_list)
-            setting_lists[:] = updated_setting_lists
-
         # Get simple flavors

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`
around lines 324 - 335, The file defines a helper function filter_flavors that
is never used in BaselineComparisonsReport.do_plot while that method instead
uses an inline "if/continue" flavor check; either delete the unused
filter_flavors function to remove dead code, or replace the inline flavor skip
in BaselineComparisonsReport.do_plot with a call to
filter_flavors(setting_lists, flavor_filter) so behavior matches
IntelligentRoutingComparisonsReport and PDComparisonsReport—update any variable
names to match the existing signature and ensure setting_lists is mutated in
place as filter_flavors does.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 7: The config currently forces the azure-specific preset by setting
to_apply: [azure_h100], which causes all runs to inherit Azure behavior
(including the kpouget-dev namespace); remove or unset the global to_apply entry
so the default config is neutral and require selecting the azure_h100 preset
explicitly in CI/job inputs (or move the preset into the specific job/pipeline
entries that need it) to avoid leaking Azure-specific settings into non-Azure
workflows.
- Line 138: The root-level "skip: true" toggle in
projects/llm-d/testing/config.yaml is globally scoped and will disable baseline
prepare steps for all presets; move these Azure-specific skip flags out of the
top-level prepare.*.skip and instead set them only inside the Azure CI preset
(e.g., ci_presets.azure_h100 -> prepare.*.skip) so defaults remain
unchanged—update the three occurrences noted (lines referenced in the review) by
removing or setting them false at root and adding the skip:true entries within
the ci_presets.azure_h100 overrides for the corresponding prepare sections.

---

Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`:
- Around line 324-335: The file defines a helper function filter_flavors that is
never used in BaselineComparisonsReport.do_plot while that method instead uses
an inline "if/continue" flavor check; either delete the unused filter_flavors
function to remove dead code, or replace the inline flavor skip in
BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists,
flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and
PDComparisonsReport—update any variable names to match the existing signature
and ensure setting_lists is mutated in place as filter_flavors does.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7ee7db4-319b-46ac-96f9-5edcb3d1ede1

📥 Commits

Reviewing files that changed from the base of the PR and between 0240e09 and 16f23b4.

📒 Files selected for processing (5)

projects/llm-d/testing/config.yaml
projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
projects/llm-d/visualizations/llmd_inference/data/plots.yaml
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

…ns: improve

…oute

kpouget · 2026-05-20T11:19:10Z

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T11:44:27Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 20 minutes 16 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

…config

kpouget · 2026-05-20T12:07:44Z

/test jump-ci llm-d azure_h100 baseline-flavors
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T12:37:17Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 21 minutes 40 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors

kpouget · 2026-05-20T15:58:01Z

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T19:25:11Z

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 21 minutes 21 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

kpouget · 2026-05-20T20:29:17Z

/test jump-ci llm-d azure_h100 intelligentrouting-flavors guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

kpouget · 2026-05-20T20:33:17Z

/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval llama-70b
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-20T20:33:33Z

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 40 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

Failure indicator: Empty. (See run.log)

psap-forge-bot · 2026-05-20T22:55:03Z

🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 16 minutes 44 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: intelligentrouting-flavors
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval

kpouget · 2026-05-21T06:55:13Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-21T07:36:43Z

🔴 Test of 'llm-d test test_ci' failed after 00 hours 23 minutes 44 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b

Failure indicator:

/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/000__llmd__deploy_llm_inference_service/FAILURE | [000__llmd__deploy_llm_inference_service] ./run_toolbox.py llmd deploy_llm_inference_service --name=llm-d-simple --namespace=kpouget-dev --yaml_file=/tmp/topsail_202605211779347575/001__llm_d_testing/000__flavor_simple/llmisvc-simple.yaml --> 2

kpouget · 2026-05-21T09:38:09Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: [simple-tp2]

kpouget · 2026-05-21T10:15:34Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b
/cluster aks-h100
/only test_ci
/var tests.llmd.flavors: simple-tp2

psap-forge-bot · 2026-05-21T10:26:58Z

🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 08 minutes 02 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
tests.llmd.flavors: simple-tp2

kpouget · 2026-05-21T12:54:57Z

/test jump-ci llm-d azure_h100 baseline-flavors llama-70b guidellm_heterogeneous_eval
/cluster aks-h100
/only test_ci

psap-forge-bot · 2026-05-21T18:19:02Z

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 20 minutes 38 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: baseline-flavors
PR_POSITIONAL_ARG_3: llama-70b
PR_POSITIONAL_ARG_4: guidellm_heterogeneous_eval

kpouget · 2026-05-21T20:57:28Z

/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false
/var tests.llmd.flavors: intelligent-tp4-x4

kpouget · 2026-05-21T20:58:27Z

/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false
/var tests.llmd.flavors: intelligent-tp4-x4

psap-forge-bot · 2026-05-21T21:02:41Z

🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 39 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: guidellm_multiturn_eval
PR_POSITIONAL_ARG_3: gpt-oss
tests.llmd.flavors: intelligent-tp4-x4
tests.llmd.inference_service.aks_hotfix_enabled: false

Failure indicator:

/tmp/topsail_202605211779397318/002__plots/FAILURE | An error happened during the visualization post-processing ... (0_matbench_parse.log in /tmp/topsail_202605211779397318/002__plots). Mind that the test that was processed FAILED.

…ison

…iner

kpouget · 2026-05-22T08:08:10Z

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci
/var tests.llmd.inference_service.aks_hotfix_enabled: false

kpouget · 2026-05-22T09:02:34Z

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

projects/llm-d/testing/config.yaml (1)
319-319: ⚡ Quick win

Make the flavors type consistent with preset definitions.

Line 319 uses a bare string pd-x2-ptp4-px1-dtp4, while preset definitions at lines 32, 87, and 93 use lists for the same field. This type inconsistency can cause issues if the code expects a list and receives a string.
♻️ Suggested change for type consistency
-    flavors: pd-x2-ptp4-px1-dtp4
+    flavors: [pd-x2-ptp4-px1-dtp4]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/config.yaml` at line 319, The `flavors` field is
currently a bare string (pd-x2-ptp4-px1-dtp4) but other preset definitions use a
list; update the `flavors` value to be a YAML list to match the presets (i.e.,
convert the scalar to a sequence containing the same item) so code that expects
an array/sequence will receive a list; ensure the `flavors` key in this config
matches the same list formatting used by the preset entries referenced.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 340: The global setting tests.llmd.inference_service.aks_hotfix_enabled
is currently true and will apply AKS-specific behavior everywhere; change the
default to false in that section and enable it only in the Azure preset by
adding tests.llmd.inference_service.aks_hotfix_enabled: true under the azure
preset (or, if you prefer to keep it true globally, update the comment next to
aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS
environments and must be overridden where unsafe). Ensure you update the
tests.llmd.inference_service block and the azure preset block (named azure)
accordingly.

In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml`:
- Line 21: The pod nodeSelector in
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently uses the
nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true"; verify
that this exact label/value exists on your AKS H100 node pool (e.g., inspect
node labels) or replace the selector with the repo-standard label
nvidia.com/gpu.present: "true" so pods will schedule consistently; update the
manifest's nodeSelector key accordingly or add a note documenting the cluster
labeling contract if the custom label is intentional.

In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1015-1028: The helper apply_to_containers assumes
container['resources'], ['limits'] and ['requests'] exist which can raise
KeyError; update apply_to_containers to ensure container['resources'] is a dict
and to initialize container['resources']['limits'] and
container['resources']['requests'] if missing (e.g., via setdefault or explicit
checks) before popping or assigning RDMA keys, then proceed to use
infiniband_config to add the appropriate RDMA entries.
- Around line 631-650: The code currently enforces that VLLM_ADDITIONAL_ARGS
must exist by raising a RuntimeError when vllm_args_found is False; remove that
requirement so hostPath models can run with a plain "vllm serve" command: delete
or replace the final if-not-vllm_args_found raise block (the check using
vllm_args_found and the RuntimeError) in the routine that processes
container['env'] so that when VLLM_ADDITIONAL_ARGS is absent the code simply
continues (optionally log a debug/info message) instead of throwing an
exception.
- Around line 723-743: The code currently replaces existing volumeMounts/volumes
(e.g., container['volumeMounts'] = [hf_cache_mount] and
isvc_data['spec']['template']['volumes'] = [hf_cache_volume]) which drops
pre-existing mounts; change these to append the hf_cache_mount/hf_cache_volume
instead: ensure in configure_container (and when handling prefill_container) you
create the keys if missing and extend the existing lists (e.g., use
container.setdefault('volumeMounts', []).append(hf_cache_mount) and
template.setdefault('volumes', []).append(hf_cache_volume') semantics) so
existing TLS/config mounts and other volumes are preserved.

In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml`:
- Around line 95-109: The two tasks "Capture HTTPRoute for routing" and "Capture
HTTPRoute status" only redirect stdout to the artifact files so any oc errors go
to stderr and are lost; update the shell commands that call oc get httproute
(the tasks referencing variables llmd_capture_isvc_state_llmisvc_name,
target_namespace and artifact_extra_logs_dir) to also capture stderr into the
same artifact files (e.g., redirect stderr into stdout) so that diagnostic
messages are preserved while keeping ignore_errors: true.

In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`:
- Around line 167-182: The two tasks "Capture HTTPRoute YAML for routing" and
"Capture HTTPRoute status for routing" drop stderr when the shell runs with
failed_when: false; update each shell invocation in tasks/main.yml so the oc
command's stderr is preserved in the artifact files (e.g., redirect stderr into
the same output file or a companion .stderr file) — keep failed_when: false but
ensure the shell redirect captures both stdout and stderr (or write stderr
separately) so any command failure details are recorded in the artifacts.

---

Nitpick comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 319: The `flavors` field is currently a bare string (pd-x2-ptp4-px1-dtp4)
but other preset definitions use a list; update the `flavors` value to be a YAML
list to match the presets (i.e., convert the scalar to a sequence containing the
same item) so code that expects an array/sequence will receive a list; ensure
the `flavors` key in this config matches the same list formatting used by the
preset entries referenced.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6979cfda-842f-457a-a917-08bd6e3884c8

📥 Commits

Reviewing files that changed from the base of the PR and between 16f23b4 and 47821cb.

📒 Files selected for processing (10)

projects/llm-d/testing/config.yaml
projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml
projects/llm-d/testing/prepare_llmd.py
projects/llm-d/testing/test_llmd.py
projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml
projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
projects/llm-d/visualizations/llmd_inference/data/plots.yaml
projects/llm-d/visualizations/llmd_inference/data/reports.yaml
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py

💤 Files with no reviewable changes (2)

projects/llm-d/visualizations/llmd_inference/data/plots.yaml
projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml

✅ Files skipped from review due to trivial changes (1)

projects/llm-d/visualizations/llmd_inference/data/reports.yaml

coderabbitai · 2026-05-22T09:07:24Z

      model: gpt-oss-20b
      max_model_len: 40960
+      infiniband: null  # null=unchanged, true=enable rdma/ib: "1", false=disable rdma/ib
+      aks_hotfix_enabled: true  # Enable/disable AKS ConfigMap hotfix mounts


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move AKS-specific setting to the Azure preset or default to false.

aks_hotfix_enabled: true is set globally in the default tests.llmd.inference_service section. According to the layer description, this enables "AKS-specific UCX/NIC env, IPC_LOCK capability, optional vllm-ucx-multiproc-hotfix ConfigMap creation". Setting this to true globally will apply AKS-specific behavior to all environments, including non-AKS deployments, which may fail or create unnecessary resources.

🔧 Recommended fix

Option 1 (preferred): Default to false globally and enable in Azure presets

- aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts + aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mounts

Then add to the azure preset (around line 60):

azure: security.run_as_root: true prepare.preload.skip: true prepare.operators.skip: true prepare.monitoring.skip: true prepare.cluster.skip: true prepare.rhoai.skip: true tests.capture_prom: false tests.capture_prom_uwm: false tests.llmd.skip_prepare: false + tests.llmd.inference_service.aks_hotfix_enabled: true

Option 2: Keep true but document that non-AKS environments must override

If the hotfix is safe to apply universally (no-op on non-AKS), update the comment to clarify this behavior.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts

aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mounts

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/testing/config.yaml` at line 340, The global setting tests.llmd.inference_service.aks_hotfix_enabled is currently true and will apply AKS-specific behavior everywhere; change the default to false in that section and enable it only in the Azure preset by adding tests.llmd.inference_service.aks_hotfix_enabled: true under the azure preset (or, if you prefer to keep it true globally, update the comment next to aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS environments and must be overridden where unsafe). Ensure you update the tests.llmd.inference_service block and the azure preset block (named azure) accordingly.

coderabbitai · 2026-05-22T09:07:24Z

-              - pluginRef: max-score-picker
        nodeSelector:
-          nvidia.com/gpu.present: "true"
+          nvidia.com/gpu.deploy.container-toolkit: "true"


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for references to both old and new nodeSelector labels echo "=== Searching for old label: nvidia.com/gpu.present ===" rg -n "nvidia\.com/gpu\.present" -C2 echo -e "\n=== Searching for new label: nvidia.com/gpu.deploy.container-toolkit ===" rg -n "nvidia\.com/gpu\.deploy\.container-toolkit" -C2 echo -e "\n=== All nvidia.com/gpu nodeSelector patterns ===" rg -n "nvidia\.com/gpu\.[a-z.-]+" -C2

Repository: openshift-psap/topsail

Length of output: 21530

🌐 Web query:

AKS NVIDIA GPU node labels container toolkit deployment

💡 Result:

In Azure Kubernetes Service (AKS), GPU node pools come with NVIDIA drivers and the NVIDIA Container Toolkit pre-installed by default [1]. Consequently, you typically do not need to manually deploy the container toolkit [2][1]. Node Labeling and Scheduling AKS automatically applies default labels to GPU nodes, such as accelerator=nvidia [3]. When deploying GPU-enabled workloads, it is recommended practice to: 1. Use Node Pools and Taints: Isolate GPU workloads by creating dedicated GPU node pools and applying taints (e.g., sku=gpu:NoSchedule) [4][5]. 2. Use Node Selectors or Affinity: Ensure your workload pods land on the correct nodes using node selectors or affinity that match your labels [4][6]. 3. Custom Labels: You can apply additional labels as needed for specific hardware or configuration requirements using kubectl label nodes = [6]. NVIDIA GPU Operator and Toolkit While AKS provides pre-installed components, you may choose to use the NVIDIA GPU Operator for advanced lifecycle management (e.g., automated driver updates, monitoring) [2][1]. If you use the Operator in an environment where drivers and the toolkit are already pre-installed, you must configure the installation to disable these components to avoid conflicts [1][7]: helm install gpu-operator nvidia/gpu-operator -n gpu-operator --set driver.enabled=false --set toolkit.enabled=false --set operator.runtimeClass=nvidia-container-runtime If you require custom driver management, you can skip the default installation during node pool creation by using the --gpu-driver none flag [2][8]. Once skipped, you assume responsibility for installing the drivers and the container toolkit (e.g., via the GPU Operator or manual DaemonSet deployment) [2][1][8]. Dynamic Resource Allocation (DRA) For newer scenarios like virtualized GPUs (vGPU), AKS utilizes NVIDIA DRA, which runs as a DaemonSet and may require you to manually apply specific labels, such as nvidia.com/gpu.present=true, to your nodes [3]. Always verify the current AKS documentation for your specific GPU architecture and workload requirements [2][9].

Citations:

1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html

2: https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu

3: https://blog.aks.azure.com/2026/03/06/dra-with-vGPUs-on-aks

4: https://securityboulevard.com/2026/05/how-do-i-run-ai-workloads-on-kubernetes-without-wasting-gpus/

5: https://blog.colbyford.com/building-gpu-enabled-clusters-in-azure-kubernetes-service-7a6d4b2c4f56

6: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

7: GPU Operator installation failure in AKS NVIDIA/gpu-operator#529

8: https://docs.azure.cn/en-us/aks/use-nvidia-gpu

9: https://blog.aks.azure.com/2025/11/17/dra-devices-and-drivers-on-kubernetes

Harden/align nodeSelector label for router scheduler pods (AKS H100): nvidia.com/gpu.deploy.container-toolkit is repo-unique and not documented as a standard AKS/NVIDIA label

In projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml (line 21), pod placement is constrained to nvidia.com/gpu.deploy.container-toolkit: "true"; if that label isn’t present on the target GPU nodes, pods won’t schedule.

The repo references nvidia.com/gpu.present=true across configs/manifests/tests, but nvidia.com/gpu.deploy.container-toolkit is only used in this single manifest.

Public AKS NVIDIA GPU docs/guidance point to nvidia.com/gpu.present=true (not nvidia.com/gpu.deploy.container-toolkit) for scheduling, and note the toolkit is typically already installed on AKS GPU nodes.

Action: confirm the exact label/value exists on the AKS H100 GPU node pool you target (e.g., oc get nodes --show-labels and grep), and/or switch this nodeSelector to the repo’s existing nvidia.com/gpu.present: "true" (or document the cluster labeling contract if intentionally different).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml` at line 21, The pod nodeSelector in projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently uses the nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true"; verify that this exact label/value exists on your AKS H100 node pool (e.g., inspect node labels) or replace the selector with the repo-standard label nvidia.com/gpu.present: "true" so pods will schedule consistently; update the manifest's nodeSelector key accordingly or add a note documenting the cluster labeling contract if the custom label is intentional.

coderabbitai · 2026-05-22T09:07:24Z

+    def apply_to_containers(containers):
+        for container in containers:
+            # Always remove existing rdma/ib first
+            container['resources']['limits'].pop('rdma/ib', None)
+            container['resources']['requests'].pop('rdma/ib', None)
+
+            # Add InfiniBand resource based on config type
+            if infiniband_config is True:
+                container['resources']['limits']['rdma/ib'] = "1"
+                container['resources']['requests']['rdma/ib'] = "1"
+            elif isinstance(infiniband_config, str):
+                # Use custom resource string (e.g., "rdma/shared_ib")
+                container['resources']['limits'][infiniband_config] = "1"
+                container['resources']['requests'][infiniband_config] = "1"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize resources.requests and resources.limits before writing RDMA keys.

This helper assumes every decode/prefill container already has resources.requests and resources.limits. That isn't guaranteed on the current reshape path, so a plain pd manifest without those blocks can fail here with KeyError.

Suggested fix

def apply_to_containers(containers): for container in containers: + resources = container.setdefault('resources', {}) + limits = resources.setdefault('limits', {}) + requests = resources.setdefault('requests', {}) + # Always remove existing rdma/ib first - container['resources']['limits'].pop('rdma/ib', None) - container['resources']['requests'].pop('rdma/ib', None) + limits.pop('rdma/ib', None) + requests.pop('rdma/ib', None) # Add InfiniBand resource based on config type if infiniband_config is True: - container['resources']['limits']['rdma/ib'] = "1" - container['resources']['requests']['rdma/ib'] = "1" + limits['rdma/ib'] = "1" + requests['rdma/ib'] = "1" elif isinstance(infiniband_config, str): # Use custom resource string (e.g., "rdma/shared_ib") - container['resources']['limits'][infiniband_config] = "1" - container['resources']['requests'][infiniband_config] = "1" + limits[infiniband_config] = "1" + requests[infiniband_config] = "1"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def apply_to_containers(containers):

for container in containers:

# Always remove existing rdma/ib first

container['resources']['limits'].pop('rdma/ib', None)

container['resources']['requests'].pop('rdma/ib', None)

# Add InfiniBand resource based on config type

if infiniband_config is True:

container['resources']['limits']['rdma/ib'] = "1"

container['resources']['requests']['rdma/ib'] = "1"

elif isinstance(infiniband_config, str):

# Use custom resource string (e.g., "rdma/shared_ib")

container['resources']['limits'][infiniband_config] = "1"

container['resources']['requests'][infiniband_config] = "1"

def apply_to_containers(containers):

for container in containers:

resources = container.setdefault('resources', {})

limits = resources.setdefault('limits', {})

requests = resources.setdefault('requests', {})

# Always remove existing rdma/ib first

limits.pop('rdma/ib', None)

requests.pop('rdma/ib', None)

# Add InfiniBand resource based on config type

if infiniband_config is True:

limits['rdma/ib'] = "1"

requests['rdma/ib'] = "1"

elif isinstance(infiniband_config, str):

# Use custom resource string (e.g., "rdma/shared_ib")

limits[infiniband_config] = "1"

requests[infiniband_config] = "1"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/testing/test_llmd.py` around lines 1015 - 1028, The helper apply_to_containers assumes container['resources'], ['limits'] and ['requests'] exist which can raise KeyError; update apply_to_containers to ensure container['resources'] is a dict and to initialize container['resources']['limits'] and container['resources']['requests'] if missing (e.g., via setdefault or explicit checks) before popping or assigning RDMA keys, then proceed to use infiniband_config to add the appropriate RDMA entries.

coderabbitai · 2026-05-22T09:07:24Z

+- name: Capture HTTPRoute for routing
+  shell:
+    oc get httproute \
+       -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
+       -n "{{ target_namespace }}" \
+       -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"
+  ignore_errors: true
+
+- name: Capture HTTPRoute status
+  shell:
+    oc get httproute \
+       -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
+       -n "{{ target_namespace }}" \
+       > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"
+  ignore_errors: true


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Capture stderr into the HTTPRoute artifacts as well.

At Line 97 and Line 105, only stdout is redirected. When oc get httproute fails, the most useful diagnostics are on stderr, so the artifact file can be empty while the task still passes (ignore_errors: true).

Suggested patch

- name: Capture HTTPRoute for routing shell: oc get httproute \ -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ -n "{{ target_namespace }}" \ - -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" + --ignore-not-found \ + -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1 ignore_errors: true - name: Capture HTTPRoute status shell: oc get httproute \ -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ -n "{{ target_namespace }}" \ - > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" + --ignore-not-found \ + > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1 ignore_errors: true

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- name: Capture HTTPRoute for routing

shell:

oc get httproute \

-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \

-n "{{ target_namespace }}" \

-oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"

ignore_errors: true

- name: Capture HTTPRoute status

shell:

oc get httproute \

-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \

-n "{{ target_namespace }}" \

> "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"

ignore_errors: true

- name: Capture HTTPRoute for routing

shell:

oc get httproute \

-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \

-n "{{ target_namespace }}" \

--ignore-not-found \

-oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1

ignore_errors: true

- name: Capture HTTPRoute status

shell:

oc get httproute \

-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \

-n "{{ target_namespace }}" \

--ignore-not-found \

> "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1

ignore_errors: true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml` around lines 95 - 109, The two tasks "Capture HTTPRoute for routing" and "Capture HTTPRoute status" only redirect stdout to the artifact files so any oc errors go to stderr and are lost; update the shell commands that call oc get httproute (the tasks referencing variables llmd_capture_isvc_state_llmisvc_name, target_namespace and artifact_extra_logs_dir) to also capture stderr into the same artifact files (e.g., redirect stderr into stdout) so that diagnostic messages are preserved while keeping ignore_errors: true.

coderabbitai · 2026-05-22T09:07:24Z

+  - name: Capture HTTPRoute YAML for routing
+    shell:
+      oc get httproute \
+         -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
+         -n "{{ llmd_deploy_llm_inference_service_namespace }}" \
+         -oyaml \
+         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml"
+    failed_when: false
+
+  - name: Capture HTTPRoute status for routing
+    shell:
+      oc get httproute \
+         -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
+         -n "{{ llmd_deploy_llm_inference_service_namespace }}" \
+         > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status"
+    failed_when: false


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve HTTPRoute command errors in artifact files.

At Line 169 and Line 178, failed_when: false keeps the play running, but stderr is currently discarded. If the query fails, artifacts may miss the failure reason.

Suggested patch

- name: Capture HTTPRoute YAML for routing shell: oc get httproute \ -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \ -n "{{ llmd_deploy_llm_inference_service_namespace }}" \ + --ignore-not-found \ -oyaml \ - > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml" + > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml" 2>&1 failed_when: false - name: Capture HTTPRoute status for routing shell: oc get httproute \ -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \ -n "{{ llmd_deploy_llm_inference_service_namespace }}" \ - > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status" + --ignore-not-found \ + > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status" 2>&1 failed_when: false

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml` around lines 167 - 182, The two tasks "Capture HTTPRoute YAML for routing" and "Capture HTTPRoute status for routing" drop stderr when the shell runs with failed_when: false; update each shell invocation in tasks/main.yml so the oc command's stderr is preserved in the artifact files (e.g., redirect stderr into the same output file or a companion .stderr file) — keep failed_when: false but ensure the shell redirect captures both stdout and stderr (or write stderr separately) so any command failure details are recorded in the artifacts.

psap-forge-bot · 2026-05-22T10:22:10Z

🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 16 minutes 13 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: pd-flavors
PR_POSITIONAL_ARG_3: guidellm_multiturn_eval
PR_POSITIONAL_ARG_4: gpt-oss

kpouget · 2026-05-22T12:24:13Z

/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss
/cluster aks-h100
/only test_ci

coderabbitai

🧹 Nitpick comments (1)

projects/llm-d/testing/test_llmd.py (1)

1163-1171: 💤 Low value

Remove unused add_hotfix_volumes function.

add_hotfix_volumes (lines 1163-1171) is defined but never called. Only add_hotfix_to_containers (lines 1174-1182) is used, and they are identical.

Suggested fix

-        # Helper function to add hotfix volume mounts to containers
-        def add_hotfix_volumes(containers, template_volumes):
-            for container in containers:
-                # Add volume mounts
-                if 'volumeMounts' not in container:
-                    container['volumeMounts'] = []
-                container['volumeMounts'].extend(hotfix_mounts)
-
-            # Add volume to template
-            template_volumes.append(hotfix_volume)
-
         # Apply hotfix volumes and complete configuration to containers
         def add_hotfix_to_containers(containers, template_volumes):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm-d/testing/test_llmd.py` around lines 1163 - 1171, The function
add_hotfix_volumes is redundant and never invoked; remove its definition (the
entire add_hotfix_volumes function block) and keep the existing
add_hotfix_to_containers implementation instead so behavior remains unchanged;
verify that references to hotfix_mounts, hotfix_volume, containers, and
template_volumes remain used only by add_hotfix_to_containers and adjust any
imports or tests if they referenced add_hotfix_volumes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1163-1171: The function add_hotfix_volumes is redundant and never
invoked; remove its definition (the entire add_hotfix_volumes function block)
and keep the existing add_hotfix_to_containers implementation instead so
behavior remains unchanged; verify that references to hotfix_mounts,
hotfix_volume, containers, and template_volumes remain used only by
add_hotfix_to_containers and adjust any imports or tests if they referenced
add_hotfix_volumes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d86284a-3bfe-4231-a5b1-bc880d6815e3

📥 Commits

Reviewing files that changed from the base of the PR and between 47821cb and 19bf445.

📒 Files selected for processing (2)

projects/llm-d/testing/config.yaml
projects/llm-d/testing/test_llmd.py

psap-forge-bot · 2026-05-22T13:49:40Z

🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 18 minutes 36 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: azure_h100
PR_POSITIONAL_ARG_2: pd-flavors
PR_POSITIONAL_ARG_3: guidellm_multiturn_eval
PR_POSITIONAL_ARG_4: gpt-oss

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

Comment thread projects/llm-d/testing/config.yaml Outdated

Comment thread projects/llm-d/testing/config.yaml Outdated

kpouget added 5 commits May 20, 2026 09:07

[llm-d] testing: config: add the 'to_apply' placeholder

5a054ce

[llm-d] testing: config: update to rhoai 3.4

2b70f5f

[llm-d] visualizations: llmd_inference: plotting/throughput_compariso…

61a03fd

…ns: improve

[llm-d] toolbox: llmd_deploy_llm_inference_service: capture the httpr…

12ebcf9

…oute

[llm-d] toolbox: llmd_capture_isvc_state: capture the httproute

2769ad7

kpouget force-pushed the aks branch 3 times, most recently from ec5f6fe to ad0254c Compare May 20, 2026 11:13

kpouget added 4 commits May 20, 2026 14:07

[llm-d] testing: config: add the azure_h100 preset

d91b9c4

[llm-d] testing: add AKS H100 support

73ef433

[llm-d] testing: llmisvcs: llmisvc-simple: use the default scheduler …

62c073e

…config

[llm-d] testing: test_llmd: allow setting a toleration to the router

fa2d523

kpouget force-pushed the aks branch from ad0254c to c83cba3 Compare May 20, 2026 12:07

kpouget force-pushed the aks branch 2 times, most recently from 1efcc76 to f301a7d Compare May 21, 2026 06:53

kpouget added 3 commits May 22, 2026 08:54

[llm-d] testing: add dynamic IB configuration

2857d6f

[llm-d] visualizations: llmd_inference: dedicated file for the compar…

b5cc3fc

…ison

[llm-d] testing: test_llmd: correctly set the TP in the prefill conta…

68d9fba

…iner

kpouget force-pushed the aks branch from cd85ee0 to 8ae2b7d Compare May 22, 2026 08:03

kpouget force-pushed the aks branch from 8ae2b7d to 4e05357 Compare May 22, 2026 08:31

[llm-d] testing: config: update the PD flavors for AKS

107f561

kpouget force-pushed the aks branch from 4e05357 to 47821cb Compare May 22, 2026 09:00

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

[llm-d] testing: add knobs for AKS and hybrid_kv_cache_manager

19bf445

kpouget force-pushed the aks branch from 47821cb to 19bf445 Compare May 22, 2026 12:23

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

	aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts
	aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mounts

Conversation

kpouget commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented May 11, 2026

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kpouget commented May 20, 2026

Uh oh!

psap-forge-bot Bot commented May 20, 2026

Uh oh!

kpouget commented May 20, 2026

Uh oh!

psap-forge-bot Bot commented May 20, 2026

Uh oh!

kpouget commented May 20, 2026

Uh oh!

psap-forge-bot Bot commented May 20, 2026

Uh oh!

kpouget commented May 20, 2026

Uh oh!

kpouget commented May 20, 2026

Uh oh!

psap-forge-bot Bot commented May 20, 2026

Uh oh!

psap-forge-bot Bot commented May 20, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

psap-forge-bot Bot commented May 21, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

psap-forge-bot Bot commented May 21, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

psap-forge-bot Bot commented May 21, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

kpouget commented May 21, 2026

Uh oh!

psap-forge-bot Bot commented May 21, 2026

Uh oh!

kpouget commented May 22, 2026

Uh oh!

kpouget commented May 22, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

kpouget commented May 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading