[llmd] Work on Azure#930
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughUpdates LLM‑D testing CI presets and local models; extends ISVC reshape to support hostPath models, tensor-parallel args, InfiniBand/AKS hotfix wiring, and router tolerations; adjusts scheduler nodeSelectors; adds Ansible HTTPRoute debug captures; and tweaks visualization/report generation and plotting. ChangesISVC reshape, hostPath, tensor-parallel, and InfiniBand
CI presets, model registry, and EPP defaults
Scheduler manifests
Ansible debug artifact capture
Visualizations and plotting
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py (1)
324-335: ⚡ Quick winRemove unused
filter_flavorsfunction or refactor to use it.The
filter_flavorsfunction is defined but never called withinBaselineComparisonsReport.do_plot. Meanwhile, the newly added flavor filtering at lines 348-349 uses a simpleif/continuepattern instead.The other report classes (
IntelligentRoutingComparisonsReportandPDComparisonsReport) define identicalfilter_flavorsfunctions and actually use them for filtering. For consistency and to avoid dead code, consider either:
- Remove this unused function (simpler if the inline skip is sufficient), or
- Refactor lines 348-349 to use
filter_flavorsfor consistency with other reports.Option 1: Remove the unused function
- def filter_flavors(setting_lists, flavor_filter): - """Filter flavors from setting_lists based on provided filter function""" - updated_setting_lists = [] - for setting_list in setting_lists: - if setting_list and setting_list[0][0] == 'flavor': - # Apply the filter function to flavors - filtered_flavors = [(k, v) for k, v in setting_list if flavor_filter(v)] - if filtered_flavors: - updated_setting_lists.append(filtered_flavors) - else: - updated_setting_lists.append(setting_list) - setting_lists[:] = updated_setting_lists - # Get simple flavors🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py` around lines 324 - 335, The file defines a helper function filter_flavors that is never used in BaselineComparisonsReport.do_plot while that method instead uses an inline "if/continue" flavor check; either delete the unused filter_flavors function to remove dead code, or replace the inline flavor skip in BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists, flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and PDComparisonsReport—update any variable names to match the existing signature and ensure setting_lists is mutated in place as filter_flavors does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 7: The config currently forces the azure-specific preset by setting
to_apply: [azure_h100], which causes all runs to inherit Azure behavior
(including the kpouget-dev namespace); remove or unset the global to_apply entry
so the default config is neutral and require selecting the azure_h100 preset
explicitly in CI/job inputs (or move the preset into the specific job/pipeline
entries that need it) to avoid leaking Azure-specific settings into non-Azure
workflows.
- Line 138: The root-level "skip: true" toggle in
projects/llm-d/testing/config.yaml is globally scoped and will disable baseline
prepare steps for all presets; move these Azure-specific skip flags out of the
top-level prepare.*.skip and instead set them only inside the Azure CI preset
(e.g., ci_presets.azure_h100 -> prepare.*.skip) so defaults remain
unchanged—update the three occurrences noted (lines referenced in the review) by
removing or setting them false at root and adding the skip:true entries within
the ci_presets.azure_h100 overrides for the corresponding prepare sections.
---
Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py`:
- Around line 324-335: The file defines a helper function filter_flavors that is
never used in BaselineComparisonsReport.do_plot while that method instead uses
an inline "if/continue" flavor check; either delete the unused filter_flavors
function to remove dead code, or replace the inline flavor skip in
BaselineComparisonsReport.do_plot with a call to filter_flavors(setting_lists,
flavor_filter) so behavior matches IntelligentRoutingComparisonsReport and
PDComparisonsReport—update any variable names to match the existing signature
and ensure setting_lists is mutated in place as filter_flavors does.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7ee7db4-319b-46ac-96f9-5edcb3d1ede1
📒 Files selected for processing (5)
projects/llm-d/testing/config.yamlprojects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.ymlprojects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.ymlprojects/llm-d/visualizations/llmd_inference/data/plots.yamlprojects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py
ec5f6fe to
ad0254c
Compare
|
/test jump-ci llm-d azure_h100 baseline-flavors |
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 20 minutes 16 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 baseline-flavors |
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 21 minutes 40 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval |
|
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 21 minutes 21 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 intelligentrouting-flavors guidellm_heterogeneous_eval |
|
/test jump-ci llm-d azure_h100 baseline-flavors guidellm_heterogeneous_eval llama-70b |
|
🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 40 seconds. 🔴 • Link to the test results. • No reports index generated... Test configuration: Failure indicator: Empty. (See run.log) |
|
🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 16 minutes 44 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
1efcc76 to
f301a7d
Compare
|
/test jump-ci llm-d azure_h100 baseline-flavors llama-70b |
|
🔴 Test of 'llm-d test test_ci' failed after 00 hours 23 minutes 44 seconds. 🔴 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 baseline-flavors llama-70b |
|
/test jump-ci llm-d azure_h100 baseline-flavors llama-70b |
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 08 minutes 02 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 baseline-flavors llama-70b guidellm_heterogeneous_eval |
|
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 20 minutes 38 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss |
|
/test jump-ci llm-d azure_h100 guidellm_multiturn_eval gpt-oss |
|
🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 39 seconds. 🔴 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss |
|
/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
projects/llm-d/testing/config.yaml (1)
319-319: ⚡ Quick winMake the
flavorstype consistent with preset definitions.Line 319 uses a bare string
pd-x2-ptp4-px1-dtp4, while preset definitions at lines 32, 87, and 93 use lists for the same field. This type inconsistency can cause issues if the code expects a list and receives a string.♻️ Suggested change for type consistency
- flavors: pd-x2-ptp4-px1-dtp4 + flavors: [pd-x2-ptp4-px1-dtp4]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/testing/config.yaml` at line 319, The `flavors` field is currently a bare string (pd-x2-ptp4-px1-dtp4) but other preset definitions use a list; update the `flavors` value to be a YAML list to match the presets (i.e., convert the scalar to a sequence containing the same item) so code that expects an array/sequence will receive a list; ensure the `flavors` key in this config matches the same list formatting used by the preset entries referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 340: The global setting tests.llmd.inference_service.aks_hotfix_enabled
is currently true and will apply AKS-specific behavior everywhere; change the
default to false in that section and enable it only in the Azure preset by
adding tests.llmd.inference_service.aks_hotfix_enabled: true under the azure
preset (or, if you prefer to keep it true globally, update the comment next to
aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS
environments and must be overridden where unsafe). Ensure you update the
tests.llmd.inference_service block and the azure preset block (named azure)
accordingly.
In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml`:
- Line 21: The pod nodeSelector in
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently uses the
nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true"; verify
that this exact label/value exists on your AKS H100 node pool (e.g., inspect
node labels) or replace the selector with the repo-standard label
nvidia.com/gpu.present: "true" so pods will schedule consistently; update the
manifest's nodeSelector key accordingly or add a note documenting the cluster
labeling contract if the custom label is intentional.
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1015-1028: The helper apply_to_containers assumes
container['resources'], ['limits'] and ['requests'] exist which can raise
KeyError; update apply_to_containers to ensure container['resources'] is a dict
and to initialize container['resources']['limits'] and
container['resources']['requests'] if missing (e.g., via setdefault or explicit
checks) before popping or assigning RDMA keys, then proceed to use
infiniband_config to add the appropriate RDMA entries.
- Around line 631-650: The code currently enforces that VLLM_ADDITIONAL_ARGS
must exist by raising a RuntimeError when vllm_args_found is False; remove that
requirement so hostPath models can run with a plain "vllm serve" command: delete
or replace the final if-not-vllm_args_found raise block (the check using
vllm_args_found and the RuntimeError) in the routine that processes
container['env'] so that when VLLM_ADDITIONAL_ARGS is absent the code simply
continues (optionally log a debug/info message) instead of throwing an
exception.
- Around line 723-743: The code currently replaces existing volumeMounts/volumes
(e.g., container['volumeMounts'] = [hf_cache_mount] and
isvc_data['spec']['template']['volumes'] = [hf_cache_volume]) which drops
pre-existing mounts; change these to append the hf_cache_mount/hf_cache_volume
instead: ensure in configure_container (and when handling prefill_container) you
create the keys if missing and extend the existing lists (e.g., use
container.setdefault('volumeMounts', []).append(hf_cache_mount) and
template.setdefault('volumes', []).append(hf_cache_volume') semantics) so
existing TLS/config mounts and other volumes are preserved.
In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml`:
- Around line 95-109: The two tasks "Capture HTTPRoute for routing" and "Capture
HTTPRoute status" only redirect stdout to the artifact files so any oc errors go
to stderr and are lost; update the shell commands that call oc get httproute
(the tasks referencing variables llmd_capture_isvc_state_llmisvc_name,
target_namespace and artifact_extra_logs_dir) to also capture stderr into the
same artifact files (e.g., redirect stderr into stdout) so that diagnostic
messages are preserved while keeping ignore_errors: true.
In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`:
- Around line 167-182: The two tasks "Capture HTTPRoute YAML for routing" and
"Capture HTTPRoute status for routing" drop stderr when the shell runs with
failed_when: false; update each shell invocation in tasks/main.yml so the oc
command's stderr is preserved in the artifact files (e.g., redirect stderr into
the same output file or a companion .stderr file) — keep failed_when: false but
ensure the shell redirect captures both stdout and stderr (or write stderr
separately) so any command failure details are recorded in the artifacts.
---
Nitpick comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 319: The `flavors` field is currently a bare string (pd-x2-ptp4-px1-dtp4)
but other preset definitions use a list; update the `flavors` value to be a YAML
list to match the presets (i.e., convert the scalar to a sequence containing the
same item) so code that expects an array/sequence will receive a list; ensure
the `flavors` key in this config matches the same list formatting used by the
preset entries referenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6979cfda-842f-457a-a917-08bd6e3884c8
📒 Files selected for processing (10)
projects/llm-d/testing/config.yamlprojects/llm-d/testing/llmisvcs/llmisvc-pd.yamlprojects/llm-d/testing/llmisvcs/llmisvc-simple.yamlprojects/llm-d/testing/prepare_llmd.pyprojects/llm-d/testing/test_llmd.pyprojects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.ymlprojects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.ymlprojects/llm-d/visualizations/llmd_inference/data/plots.yamlprojects/llm-d/visualizations/llmd_inference/data/reports.yamlprojects/llm-d/visualizations/llmd_inference/plotting/throughput_comparisons.py
💤 Files with no reviewable changes (2)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/testing/llmisvcs/llmisvc-pd.yaml
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/reports.yaml
| model: gpt-oss-20b | ||
| max_model_len: 40960 | ||
| infiniband: null # null=unchanged, true=enable rdma/ib: "1", false=disable rdma/ib | ||
| aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts |
There was a problem hiding this comment.
Move AKS-specific setting to the Azure preset or default to false.
aks_hotfix_enabled: true is set globally in the default tests.llmd.inference_service section. According to the layer description, this enables "AKS-specific UCX/NIC env, IPC_LOCK capability, optional vllm-ucx-multiproc-hotfix ConfigMap creation". Setting this to true globally will apply AKS-specific behavior to all environments, including non-AKS deployments, which may fail or create unnecessary resources.
🔧 Recommended fix
Option 1 (preferred): Default to false globally and enable in Azure presets
- aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts
+ aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mountsThen add to the azure preset (around line 60):
azure:
security.run_as_root: true
prepare.preload.skip: true
prepare.operators.skip: true
prepare.monitoring.skip: true
prepare.cluster.skip: true
prepare.rhoai.skip: true
tests.capture_prom: false
tests.capture_prom_uwm: false
tests.llmd.skip_prepare: false
+ tests.llmd.inference_service.aks_hotfix_enabled: trueOption 2: Keep true but document that non-AKS environments must override
If the hotfix is safe to apply universally (no-op on non-AKS), update the comment to clarify this behavior.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| aks_hotfix_enabled: true # Enable/disable AKS ConfigMap hotfix mounts | |
| aks_hotfix_enabled: false # Enable/disable AKS ConfigMap hotfix mounts |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm-d/testing/config.yaml` at line 340, The global setting
tests.llmd.inference_service.aks_hotfix_enabled is currently true and will apply
AKS-specific behavior everywhere; change the default to false in that section
and enable it only in the Azure preset by adding
tests.llmd.inference_service.aks_hotfix_enabled: true under the azure preset
(or, if you prefer to keep it true globally, update the comment next to
aks_hotfix_enabled to state explicitly that the hotfix is a no-op on non-AKS
environments and must be overridden where unsafe). Ensure you update the
tests.llmd.inference_service block and the azure preset block (named azure)
accordingly.
| - pluginRef: max-score-picker | ||
| nodeSelector: | ||
| nvidia.com/gpu.present: "true" | ||
| nvidia.com/gpu.deploy.container-toolkit: "true" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for references to both old and new nodeSelector labels
echo "=== Searching for old label: nvidia.com/gpu.present ==="
rg -n "nvidia\.com/gpu\.present" -C2
echo -e "\n=== Searching for new label: nvidia.com/gpu.deploy.container-toolkit ==="
rg -n "nvidia\.com/gpu\.deploy\.container-toolkit" -C2
echo -e "\n=== All nvidia.com/gpu nodeSelector patterns ==="
rg -n "nvidia\.com/gpu\.[a-z.-]+" -C2Repository: openshift-psap/topsail
Length of output: 21530
🌐 Web query:
AKS NVIDIA GPU node labels container toolkit deployment
💡 Result:
In Azure Kubernetes Service (AKS), GPU node pools come with NVIDIA drivers and the NVIDIA Container Toolkit pre-installed by default [1]. Consequently, you typically do not need to manually deploy the container toolkit [2][1]. Node Labeling and Scheduling AKS automatically applies default labels to GPU nodes, such as accelerator=nvidia [3]. When deploying GPU-enabled workloads, it is recommended practice to: 1. Use Node Pools and Taints: Isolate GPU workloads by creating dedicated GPU node pools and applying taints (e.g., sku=gpu:NoSchedule) [4][5]. 2. Use Node Selectors or Affinity: Ensure your workload pods land on the correct nodes using node selectors or affinity that match your labels [4][6]. 3. Custom Labels: You can apply additional labels as needed for specific hardware or configuration requirements using kubectl label nodes = [6]. NVIDIA GPU Operator and Toolkit While AKS provides pre-installed components, you may choose to use the NVIDIA GPU Operator for advanced lifecycle management (e.g., automated driver updates, monitoring) [2][1]. If you use the Operator in an environment where drivers and the toolkit are already pre-installed, you must configure the installation to disable these components to avoid conflicts [1][7]: helm install gpu-operator nvidia/gpu-operator -n gpu-operator --set driver.enabled=false --set toolkit.enabled=false --set operator.runtimeClass=nvidia-container-runtime If you require custom driver management, you can skip the default installation during node pool creation by using the --gpu-driver none flag [2][8]. Once skipped, you assume responsibility for installing the drivers and the container toolkit (e.g., via the GPU Operator or manual DaemonSet deployment) [2][1][8]. Dynamic Resource Allocation (DRA) For newer scenarios like virtualized GPUs (vGPU), AKS utilizes NVIDIA DRA, which runs as a DaemonSet and may require you to manually apply specific labels, such as nvidia.com/gpu.present=true, to your nodes [3]. Always verify the current AKS documentation for your specific GPU architecture and workload requirements [2][9].
Citations:
- 1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html
- 2: https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu
- 3: https://blog.aks.azure.com/2026/03/06/dra-with-vGPUs-on-aks
- 4: https://securityboulevard.com/2026/05/how-do-i-run-ai-workloads-on-kubernetes-without-wasting-gpus/
- 5: https://blog.colbyford.com/building-gpu-enabled-clusters-in-azure-kubernetes-service-7a6d4b2c4f56
- 6: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
- 7: GPU Operator installation failure in AKS NVIDIA/gpu-operator#529
- 8: https://docs.azure.cn/en-us/aks/use-nvidia-gpu
- 9: https://blog.aks.azure.com/2025/11/17/dra-devices-and-drivers-on-kubernetes
Harden/align nodeSelector label for router scheduler pods (AKS H100): nvidia.com/gpu.deploy.container-toolkit is repo-unique and not documented as a standard AKS/NVIDIA label
- In
projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml(line 21), pod placement is constrained tonvidia.com/gpu.deploy.container-toolkit: "true"; if that label isn’t present on the target GPU nodes, pods won’t schedule. - The repo references
nvidia.com/gpu.present=trueacross configs/manifests/tests, butnvidia.com/gpu.deploy.container-toolkitis only used in this single manifest. - Public AKS NVIDIA GPU docs/guidance point to
nvidia.com/gpu.present=true(notnvidia.com/gpu.deploy.container-toolkit) for scheduling, and note the toolkit is typically already installed on AKS GPU nodes. - Action: confirm the exact label/value exists on the AKS H100 GPU node pool you target (e.g.,
oc get nodes --show-labelsand grep), and/or switch this nodeSelector to the repo’s existingnvidia.com/gpu.present: "true"(or document the cluster labeling contract if intentionally different).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml` at line 21, The pod
nodeSelector in projects/llm-d/testing/llmisvcs/llmisvc-simple.yaml currently
uses the nonstandard label key nvidia.com/gpu.deploy.container-toolkit: "true";
verify that this exact label/value exists on your AKS H100 node pool (e.g.,
inspect node labels) or replace the selector with the repo-standard label
nvidia.com/gpu.present: "true" so pods will schedule consistently; update the
manifest's nodeSelector key accordingly or add a note documenting the cluster
labeling contract if the custom label is intentional.
| def apply_to_containers(containers): | ||
| for container in containers: | ||
| # Always remove existing rdma/ib first | ||
| container['resources']['limits'].pop('rdma/ib', None) | ||
| container['resources']['requests'].pop('rdma/ib', None) | ||
|
|
||
| # Add InfiniBand resource based on config type | ||
| if infiniband_config is True: | ||
| container['resources']['limits']['rdma/ib'] = "1" | ||
| container['resources']['requests']['rdma/ib'] = "1" | ||
| elif isinstance(infiniband_config, str): | ||
| # Use custom resource string (e.g., "rdma/shared_ib") | ||
| container['resources']['limits'][infiniband_config] = "1" | ||
| container['resources']['requests'][infiniband_config] = "1" |
There was a problem hiding this comment.
Initialize resources.requests and resources.limits before writing RDMA keys.
This helper assumes every decode/prefill container already has resources.requests and resources.limits. That isn't guaranteed on the current reshape path, so a plain pd manifest without those blocks can fail here with KeyError.
Suggested fix
def apply_to_containers(containers):
for container in containers:
+ resources = container.setdefault('resources', {})
+ limits = resources.setdefault('limits', {})
+ requests = resources.setdefault('requests', {})
+
# Always remove existing rdma/ib first
- container['resources']['limits'].pop('rdma/ib', None)
- container['resources']['requests'].pop('rdma/ib', None)
+ limits.pop('rdma/ib', None)
+ requests.pop('rdma/ib', None)
# Add InfiniBand resource based on config type
if infiniband_config is True:
- container['resources']['limits']['rdma/ib'] = "1"
- container['resources']['requests']['rdma/ib'] = "1"
+ limits['rdma/ib'] = "1"
+ requests['rdma/ib'] = "1"
elif isinstance(infiniband_config, str):
# Use custom resource string (e.g., "rdma/shared_ib")
- container['resources']['limits'][infiniband_config] = "1"
- container['resources']['requests'][infiniband_config] = "1"
+ limits[infiniband_config] = "1"
+ requests[infiniband_config] = "1"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def apply_to_containers(containers): | |
| for container in containers: | |
| # Always remove existing rdma/ib first | |
| container['resources']['limits'].pop('rdma/ib', None) | |
| container['resources']['requests'].pop('rdma/ib', None) | |
| # Add InfiniBand resource based on config type | |
| if infiniband_config is True: | |
| container['resources']['limits']['rdma/ib'] = "1" | |
| container['resources']['requests']['rdma/ib'] = "1" | |
| elif isinstance(infiniband_config, str): | |
| # Use custom resource string (e.g., "rdma/shared_ib") | |
| container['resources']['limits'][infiniband_config] = "1" | |
| container['resources']['requests'][infiniband_config] = "1" | |
| def apply_to_containers(containers): | |
| for container in containers: | |
| resources = container.setdefault('resources', {}) | |
| limits = resources.setdefault('limits', {}) | |
| requests = resources.setdefault('requests', {}) | |
| # Always remove existing rdma/ib first | |
| limits.pop('rdma/ib', None) | |
| requests.pop('rdma/ib', None) | |
| # Add InfiniBand resource based on config type | |
| if infiniband_config is True: | |
| limits['rdma/ib'] = "1" | |
| requests['rdma/ib'] = "1" | |
| elif isinstance(infiniband_config, str): | |
| # Use custom resource string (e.g., "rdma/shared_ib") | |
| limits[infiniband_config] = "1" | |
| requests[infiniband_config] = "1" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm-d/testing/test_llmd.py` around lines 1015 - 1028, The helper
apply_to_containers assumes container['resources'], ['limits'] and ['requests']
exist which can raise KeyError; update apply_to_containers to ensure
container['resources'] is a dict and to initialize
container['resources']['limits'] and container['resources']['requests'] if
missing (e.g., via setdefault or explicit checks) before popping or assigning
RDMA keys, then proceed to use infiniband_config to add the appropriate RDMA
entries.
| - name: Capture HTTPRoute for routing | ||
| shell: | ||
| oc get httproute \ | ||
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | ||
| -n "{{ target_namespace }}" \ | ||
| -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" | ||
| ignore_errors: true | ||
|
|
||
| - name: Capture HTTPRoute status | ||
| shell: | ||
| oc get httproute \ | ||
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | ||
| -n "{{ target_namespace }}" \ | ||
| > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" | ||
| ignore_errors: true |
There was a problem hiding this comment.
Capture stderr into the HTTPRoute artifacts as well.
At Line 97 and Line 105, only stdout is redirected. When oc get httproute fails, the most useful diagnostics are on stderr, so the artifact file can be empty while the task still passes (ignore_errors: true).
Suggested patch
- name: Capture HTTPRoute for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
- -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml"
+ --ignore-not-found \
+ -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1
ignore_errors: true
- name: Capture HTTPRoute status
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \
-n "{{ target_namespace }}" \
- > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status"
+ --ignore-not-found \
+ > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1
ignore_errors: true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Capture HTTPRoute for routing | |
| shell: | |
| oc get httproute \ | |
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | |
| -n "{{ target_namespace }}" \ | |
| -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" | |
| ignore_errors: true | |
| - name: Capture HTTPRoute status | |
| shell: | |
| oc get httproute \ | |
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | |
| -n "{{ target_namespace }}" \ | |
| > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" | |
| ignore_errors: true | |
| - name: Capture HTTPRoute for routing | |
| shell: | |
| oc get httproute \ | |
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | |
| -n "{{ target_namespace }}" \ | |
| --ignore-not-found \ | |
| -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.yaml" 2>&1 | |
| ignore_errors: true | |
| - name: Capture HTTPRoute status | |
| shell: | |
| oc get httproute \ | |
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_capture_isvc_state_llmisvc_name }}" \ | |
| -n "{{ target_namespace }}" \ | |
| --ignore-not-found \ | |
| > "{{ artifact_extra_logs_dir }}/artifacts/llminferenceservice.httproute.status" 2>&1 | |
| ignore_errors: true |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm-d/toolbox/llmd_capture_isvc_state/tasks/main.yml` around lines
95 - 109, The two tasks "Capture HTTPRoute for routing" and "Capture HTTPRoute
status" only redirect stdout to the artifact files so any oc errors go to stderr
and are lost; update the shell commands that call oc get httproute (the tasks
referencing variables llmd_capture_isvc_state_llmisvc_name, target_namespace and
artifact_extra_logs_dir) to also capture stderr into the same artifact files
(e.g., redirect stderr into stdout) so that diagnostic messages are preserved
while keeping ignore_errors: true.
| - name: Capture HTTPRoute YAML for routing | ||
| shell: | ||
| oc get httproute \ | ||
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \ | ||
| -n "{{ llmd_deploy_llm_inference_service_namespace }}" \ | ||
| -oyaml \ | ||
| > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml" | ||
| failed_when: false | ||
|
|
||
| - name: Capture HTTPRoute status for routing | ||
| shell: | ||
| oc get httproute \ | ||
| -l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \ | ||
| -n "{{ llmd_deploy_llm_inference_service_namespace }}" \ | ||
| > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status" | ||
| failed_when: false |
There was a problem hiding this comment.
Preserve HTTPRoute command errors in artifact files.
At Line 169 and Line 178, failed_when: false keeps the play running, but stderr is currently discarded. If the query fails, artifacts may miss the failure reason.
Suggested patch
- name: Capture HTTPRoute YAML for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
-n "{{ llmd_deploy_llm_inference_service_namespace }}" \
+ --ignore-not-found \
-oyaml \
- > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml"
+ > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.yaml" 2>&1
failed_when: false
- name: Capture HTTPRoute status for routing
shell:
oc get httproute \
-l "app.kubernetes.io/component=llminferenceservice-router,app.kubernetes.io/name={{ llmd_deploy_llm_inference_service_name }}" \
-n "{{ llmd_deploy_llm_inference_service_namespace }}" \
- > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status"
+ --ignore-not-found \
+ > "{{ artifact_extra_logs_dir }}/artifacts/llm_inference_service.httproute.status" 2>&1
failed_when: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/llm-d/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml`
around lines 167 - 182, The two tasks "Capture HTTPRoute YAML for routing" and
"Capture HTTPRoute status for routing" drop stderr when the shell runs with
failed_when: false; update each shell invocation in tasks/main.yml so the oc
command's stderr is preserved in the artifact files (e.g., redirect stderr into
the same output file or a companion .stderr file) — keep failed_when: false but
ensure the shell redirect captures both stdout and stderr (or write stderr
separately) so any command failure details are recorded in the artifacts.
|
🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 16 minutes 13 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d azure_h100 pd-flavors guidellm_multiturn_eval gpt-oss |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
projects/llm-d/testing/test_llmd.py (1)
1163-1171: 💤 Low valueRemove unused
add_hotfix_volumesfunction.
add_hotfix_volumes(lines 1163-1171) is defined but never called. Onlyadd_hotfix_to_containers(lines 1174-1182) is used, and they are identical.Suggested fix
- # Helper function to add hotfix volume mounts to containers - def add_hotfix_volumes(containers, template_volumes): - for container in containers: - # Add volume mounts - if 'volumeMounts' not in container: - container['volumeMounts'] = [] - container['volumeMounts'].extend(hotfix_mounts) - - # Add volume to template - template_volumes.append(hotfix_volume) - # Apply hotfix volumes and complete configuration to containers def add_hotfix_to_containers(containers, template_volumes):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm-d/testing/test_llmd.py` around lines 1163 - 1171, The function add_hotfix_volumes is redundant and never invoked; remove its definition (the entire add_hotfix_volumes function block) and keep the existing add_hotfix_to_containers implementation instead so behavior remains unchanged; verify that references to hotfix_mounts, hotfix_volume, containers, and template_volumes remain used only by add_hotfix_to_containers and adjust any imports or tests if they referenced add_hotfix_volumes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1163-1171: The function add_hotfix_volumes is redundant and never
invoked; remove its definition (the entire add_hotfix_volumes function block)
and keep the existing add_hotfix_to_containers implementation instead so
behavior remains unchanged; verify that references to hotfix_mounts,
hotfix_volume, containers, and template_volumes remain used only by
add_hotfix_to_containers and adjust any imports or tests if they referenced
add_hotfix_volumes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4d86284a-3bfe-4231-a5b1-bc880d6815e3
📒 Files selected for processing (2)
projects/llm-d/testing/config.yamlprojects/llm-d/testing/test_llmd.py
|
🟢 Test of 'llm-d test test_ci' succeeded after 01 hours 18 minutes 36 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
Summary by CodeRabbit
Chores
New Features
Bug Fixes