[code not in mergable state yet] Add MI325X DeepSeek-R1 FP8 disaggregated inference with Broadcom Thor 2 IBGDA #985
JordanNanos wants to merge 4 commits into main
…or 2 IBGDA)
Port the MI355X disagg recipe to MI325X (gfx942/CDNA3) on a Vultr Slurm cluster with Broadcom BCM5760X Thor 2 NICs using IBGDA for GPU-Direct RDMA via MoRI.
Container image: ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt
Built from akao-amd/sglang rocm.Dockerfile with:
- GPU_ARCH=gfx942, ENABLE_MORI=1, NIC_BACKEND=ibgda
- Broadcom bnxt_rocelib (bcm5760x_231.2.63.0a) for RDMA userspace
- MoRI pinned to HEAD (c0eccaf2) for bundled bnxt headers + dlopen
- smg-wasm pinned to =1.0.0 (v1.0.1 breaks sgl-model-gateway v0.5.9 API)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
functionstackx
left a comment
thanks for the PR. can u add the sweep-enabled PR label or run the /sweep command to get ur PR into a mergeable state so that u can merge ur first line of code into the main repo?
.github/configs/amd-master.yaml
Outdated
dsr1-fp8-mi325x-sglang-disagg:
  image: ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt
what is this built with? can u add the permalink to dockerfile and ur docker build commands?
its described in the PR description
can u add the exact commands for reproducibility: the git clone (with the sglang commit hash), the wget for the broadcom drivers, and the docker build?
its not a wget, had to manually download from the broadcom site and then copy the tarball over to the cluster, based on exact firmware version installed on the cluster. happens for all thor2 NICs
the build command is included in the sbatch and/or the shell script: https://github.com/JordanNanos/sglang/blob/main/docker/build-sglang-bnxt.sh and https://github.com/JordanNanos/sglang/blob/main/docker/build-sglang-bnxt.sbatch
you want this build command in this repo?
yes, maybe in utils/? and add a readme too about how to manually download the tarball from broadcom?
that way we can get an amd engineer to read ur dockerfile & build command and fix upstream builds
- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Low concurrency" (1 prefill worker at TP4, 1 decode worker at TP8)
r u sure that TP4 is on the pareto here? do u have a graph?
u only have the TP4 curve and u have "hide non-optimal" on? can u run the rest of the 24 datapoints?
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=0"
ur missing MTP submissions, u only have non-MTP so far
@@ -357,7 +360,7 @@ exec sudo docker run --rm \
     --privileged \
     -v ${MODEL_DIR}:/models \
     -v \$HOME/.ssh:/root/.ssh \
-    -v $(which nicctl):/usr/sbin/nicctl \
+    $(command -v nicctl &>/dev/null && echo "-v $(which nicctl):/usr/sbin/nicctl") \
can u verify if these changes break mi355 disagg? +viz @Oseltamivir
the check for nicctl was breaking on this cluster, MoRI needs it to enforce QoS, disabled for now as it's not installed on these nodes or in the container built and seems unnecessary
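The conditional mount discussed above can be sketched in isolation as follows. This is a minimal illustration, not the code from the PR: the helper name `nicctl_mount_args` is hypothetical, `<image>` is a placeholder, and the sketch only prints the command line it would build rather than starting a container.

```shell
#!/bin/sh
# Sketch only: nicctl_mount_args is a hypothetical helper, not code from the PR.
# It prints a docker bind-mount flag for nicctl when the binary is on PATH,
# and prints nothing otherwise.
nicctl_mount_args() {
    # command -v prints the resolved path and fails if nicctl is not found.
    if nicctl_path="$(command -v nicctl 2>/dev/null)"; then
        printf '%s\n' "-v ${nicctl_path}:/usr/sbin/nicctl"
    fi
}

# Preview the resulting command line instead of launching a container.
# The expansion is left unquoted on purpose so that an empty result adds
# no argument; this assumes the nicctl path contains no whitespace.
# "<image>" is a placeholder.
echo docker run --rm --privileged $(nicctl_mount_args) "<image>"
```

On a node without `nicctl` the flag is simply omitted, so a launch script shared between clusters would not need a per-cluster branch for this mount.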
- "DECODE_MTP_SIZE=1"

dsr1-fp8-mi325x-sglang-disagg:
ur missing the perf-changelog.yaml entry too
- Add dsr1-fp8-mi325x-sglang-disagg-mtp config with MTP=1/2 across all curve points (top/middle/bottom/low-conc) for both 1k/1k and 8k/1k
- Expand concurrency lists to cover full pareto frontier including non-optimal points
- Update image tag to v0.5.9-bnxt-good (the pushed image)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi325x-sglang-disagg dsr1-fp8-mi325x-sglang-disagg-mtp
@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23812520838
holy


Description
Port the MI355X DeepSeek-R1 FP8 disaggregated inference recipe to MI325X (gfx942/CDNA3) on a Vultr Slurm cluster with Broadcom BCM5760X Thor 2 NICs using IBGDA for GPU-Direct RDMA via MoRI.
Container image
ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt-good
Built from a patched rocm.Dockerfile based on https://github.com/sgl-project/sglang. The Dockerfile and build scripts with all patches applied are published at:
Three patches were required to make the upstream Dockerfile build for MI325X + Broadcom IBGDA:
- install_bcm_lib.sh: the upstream script used tar zxf on a .zip archive; fixed to detect archive format and use unzip for .zip files
- smg-wasm pinned to =1.0.0: upstream v0.5.9 ships without a Cargo.lock; smg-wasm 1.0.1 (published 2026-02-23) changed the WasmModuleManager API, breaking the sgl-model-gateway Rust build
- MoRI pinned to HEAD (c0eccaf2): the previously pinned commit (2f88d06) requires system-installed infiniband/bnxt_re_dv.h / bnxt_re_hsi.h headers that the Broadcom BCM driver package does not ship; HEAD uses bundled headers + dlopen at runtime (commit ead84d86)
The Broadcom BCM5760X driver (bcm5760x_231.2.63.0a.zip) must be placed in the build context. Download from https://www.broadcom.com/support/download-search (search "BCM5760X" or "Thor 2", select the Linux OFED package matching your firmware version).
Build command:
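The install_bcm_lib.sh fix described in the patch list above (detect the archive format rather than assuming a gzip tarball) might look roughly like the following. This is a minimal sketch under stated assumptions, not the patched script from the PR and not the build command: the function name `extract_bcm_archive` and its two arguments are hypothetical, and the format is inferred from the file extension only.

```shell
#!/bin/sh
# Sketch only: extract_bcm_archive is a hypothetical helper, not the PR's script.
# Usage: extract_bcm_archive <archive> <destination-dir>
extract_bcm_archive() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    case "$archive" in
        *.zip)
            # Broadcom ships the driver as a .zip, which tar zxf cannot read.
            unzip -q -o "$archive" -d "$dest"
            ;;
        *.tar.gz|*.tgz)
            tar -xzf "$archive" -C "$dest"
            ;;
        *)
            echo "unsupported archive format: $archive" >&2
            return 1
            ;;
    esac
}
```

A call such as `extract_bcm_archive bcm5760x_231.2.63.0a.zip /tmp/bcm` would then take the unzip branch; the destination path here is only an example.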
Cluster hardware
bnxt_re), 400Gbps RoCEv2, FW 231.2.63.0
mlx5), 100Gbps
Benchmark results: DeepSeek-R1-0528 FP8, ISL=1024 OSL=1024, 1P(TP4)+1D(TP8) = 12 GPUs
MI325X throughput is ~4x lower than MI355X at the same GPU count, with ~3x higher decode latency. This is expected given:
- nicctl is unavailable in the container, so MORI_RDMA_TC / MORI_RDMA_SL default to 0 (no PFC priority); this impacts throughput at higher concurrencies
Files changed
- .github/configs/amd-master.yaml: add dsr1-fp8-mi325x-sglang-disagg config (mirrors MI355X bottom-of-curve: TP4p/TP8d, conc 1-64)
- .github/configs/runners.yaml: add mi325x-disagg runner entry
- benchmarks/multi_node/dsr1_fp8_mi325x_sglang-disagg.sh: new benchmark script
- benchmarks/multi_node/amd_utils/env.sh: add chi-mi325x* hostname detection for Broadcom bnxt_re IB devices (skip bnxt_re6 which is DOWN)
- benchmarks/multi_node/amd_utils/job.slurm: minor fixes for MI325X Docker device passthrough
- benchmarks/multi_node/amd_utils/server.sh: add model config compat
- runners/launch_mi325x-amd.sh: multi-node disagg launch support via sbatch+Docker
- scripts/manual-test-mi325x.sh: manual test entry point
Related Issue
Fixes #981
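For the env.sh change listed under Files changed (hostname detection plus skipping a Broadcom device that is DOWN), one possible shape is sketched below. It is not the code from the PR: the function name `list_active_bnxt_devices` and the variable `IB_DEVICES` are hypothetical, and the sketch assumes the standard Linux sysfs layout in which `/sys/class/infiniband/<dev>/ports/1/state` reads like `4: ACTIVE` or `1: DOWN`. It filters on port state rather than hard-coding bnxt_re6.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical and not taken from the PR.
# Prints a comma-separated list of bnxt_re* devices whose port 1 is ACTIVE.
# The optional argument overrides the sysfs root, which makes testing easier.
list_active_bnxt_devices() {
    ib_root="${1:-/sys/class/infiniband}"
    devices=""
    for dev_dir in "$ib_root"/bnxt_re*; do
        [ -d "$dev_dir" ] || continue
        state_file="$dev_dir/ports/1/state"
        [ -r "$state_file" ] || continue
        case "$(cat "$state_file")" in
            # Append the device name, comma-separated, only when the port is up.
            *ACTIVE*) devices="${devices:+$devices,}$(basename "$dev_dir")" ;;
        esac
    done
    printf '%s\n' "$devices"
}

# Apply the Broadcom device list only on hosts matching the MI325X naming scheme.
case "$(uname -n)" in
    chi-mi325x*)
        IB_DEVICES="$(list_active_bnxt_devices)"
        export IB_DEVICES
        ;;
esac
```

Checking the port state at launch time avoids editing the script if a different port goes down later, at the cost of depending on the sysfs layout assumed above.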
Type of Change
Checklist
perf-changelog.yaml