[code not in mergable state yet] Add MI325X DeepSeek-R1 FP8 disaggregated inference with Broadcom Thor 2 IBGDA #985

Draft
JordanNanos wants to merge 4 commits into main from jordan/mi325x-disagg-bnxt

@JordanNanos (Collaborator) commented Mar 31, 2026

Description

Port the MI355X DeepSeek-R1 FP8 disaggregated inference recipe to MI325X (gfx942/CDNA3) on a Vultr Slurm cluster with Broadcom BCM5760X Thor 2 NICs using IBGDA for GPU-Direct RDMA via MoRI.

Container image

ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt-good

Built from a patched rocm.Dockerfile based on https://github.com/sgl-project/sglang. The Dockerfile and build scripts with all patches applied are published at:

https://github.com/JordanNanos/sglang (branch main, directory docker/)

Three patches were required to make the upstream Dockerfile build for MI325X + Broadcom IBGDA:

  1. install_bcm_lib.sh — the upstream script used tar zxf on a .zip archive; fixed to detect archive format and use unzip for .zip files
  2. smg-wasm pinned to =1.0.0 — upstream v0.5.9 ships without a Cargo.lock; smg-wasm 1.0.1 (published 2026-02-23) changed the WasmModuleManager API, breaking the sgl-model-gateway Rust build
  3. MoRI commit updated to HEAD (c0eccaf2) — the previously pinned commit (2f88d06) requires system-installed infiniband/bnxt_re_dv.h / bnxt_re_hsi.h headers that the Broadcom BCM driver package does not ship; HEAD uses bundled headers + dlopen at runtime (commit ead84d86)
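
Patch 1 comes down to choosing the extractor from the archive type instead of always calling tar. A minimal sketch of that idea, assuming detection by file extension; the function name `extract_cmd_for` is hypothetical and this is not the actual change made to install_bcm_lib.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the extraction command for an archive,
# chosen from its file extension (an assumption; the real patch may
# detect the format differently).
extract_cmd_for() {
  case "$1" in
    *.zip)            echo "unzip -q" ;;
    *.tar.gz | *.tgz) echo "tar zxf" ;;
    *)                echo "unsupported archive: $1" >&2; return 1 ;;
  esac
}

# The driver package named in this PR is a .zip, so this prints "unzip -q".
extract_cmd_for "bcm5760x_231.2.63.0a.zip"
```

Returning a non-zero status for unknown extensions lets a calling script fail early rather than run the wrong tool on the archive.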

The Broadcom BCM5760X driver (bcm5760x_231.2.63.0a.zip) must be placed in the build context. Download from https://www.broadcom.com/support/download-search (search "BCM5760X" or "Thor 2", select the Linux OFED package matching your firmware version).

Build command:

Cluster hardware

| Component | Details |
| --- | --- |
| GPUs | 8x AMD Instinct MI325X (gfx942) per node |
| CPUs | 2x AMD EPYC 9575F 64-Core |
| RDMA NICs | 9x Broadcom BCM5760X Thor 2 (bnxt_re), 400Gbps RoCEv2, FW 231.2.63.0 |
| Mgmt NICs | 2x Mellanox ConnectX-6 Dx (mlx5), 100Gbps |

Benchmark results — DeepSeek-R1-0528 FP8, ISL=1024 OSL=1024, 1P(TP4)+1D(TP8) = 12 GPUs

| Metric | MI355X c=1 | MI325X c=1 | MI355X c=4 | MI325X c=4 |
| --- | --- | --- | --- | --- |
| Output tok/s (total) | 208.4 | 47.8 | 645.1 | 164.7 |
| Output tok/s/gpu | 17.37 | 3.98 | 53.76 | 13.72 |
| Median TTFT (ms) | 185.4 | 154.8 | 174.0 | 431.8 |
| Median TPOT (ms) | 7.00 | 20.79 | 8.72 | 23.68 |
| Median ITL (ms) | 6.99 | 20.77 | 8.73 | 23.68 |

MI325X throughput is ~4x lower than MI355X at the same GPU count, with ~3x higher decode latency. This is expected given:

  • CDNA3 vs CDNA4: MI355X has higher HBM bandwidth and compute
  • No RDMA QoS: nicctl is unavailable in the container, so MORI_RDMA_TC / MORI_RDMA_SL default to 0 (no PFC priority) — impacts throughput at higher concurrencies
  • Baseline MVP: first working disagg config on MI325X with Broadcom NICs; optimization is future work
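
The ~4x throughput and ~3x decode-latency figures can be checked against the benchmark numbers reported above. The sketch below recomputes the ratios; `ratio` is a hypothetical helper, and the inputs are copied from the results table:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a / b rounded to two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# Output tok/s (total), MI355X over MI325X
ratio 208.4 47.8    # c=1 -> 4.36
ratio 645.1 164.7   # c=4 -> 3.92

# Median TPOT (ms), MI325X over MI355X
ratio 20.79 7.00    # c=1 -> 2.97
ratio 23.68 8.72    # c=4 -> 2.72
```

The throughput gap is roughly 3.9x to 4.4x and the TPOT gap roughly 2.7x to 3.0x, consistent with the rounded figures in the description.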

Files changed

  • .github/configs/amd-master.yaml — add dsr1-fp8-mi325x-sglang-disagg config (mirrors MI355X bottom-of-curve: TP4p/TP8d, conc 1-64)
  • .github/configs/runners.yaml — add mi325x-disagg runner entry
  • benchmarks/multi_node/dsr1_fp8_mi325x_sglang-disagg.sh — new benchmark script
  • benchmarks/multi_node/amd_utils/env.sh — add chi-mi325x* hostname detection for Broadcom bnxt_re IB devices (skip bnxt_re6 which is DOWN)
  • benchmarks/multi_node/amd_utils/job.slurm — minor fixes for MI325X Docker device passthrough
  • benchmarks/multi_node/amd_utils/server.sh — add model config compat
  • runners/launch_mi325x-amd.sh — multi-node disagg launch support via sbatch+Docker
  • scripts/manual-test-mi325x.sh — manual test entry point
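
The env.sh change is described as hostname-based detection that skips one Broadcom device. A rough sketch of how such a selection could look, not the code in this PR: the function name, the example hostname, and the assumption that devices are numbered bnxt_re0 upward are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build a comma-separated list of bnxt_re devices
# for chi-mi325x* hosts, leaving out one device that is known to be DOWN.
# Assumes devices are named bnxt_re0 .. bnxt_re(count-1).
ib_devices_for_host() {
  local host="$1" count="$2" skip="$3" devs=() i
  case "$host" in
    chi-mi325x*)
      for ((i = 0; i < count; i++)); do
        [ "bnxt_re$i" = "$skip" ] && continue
        devs+=("bnxt_re$i")
      done
      ;;
    *) return 1 ;;  # not an MI325X node; caller falls back to other logic
  esac
  (IFS=,; echo "${devs[*]}")
}

# Example with a made-up hostname and the NIC count from the hardware table.
ib_devices_for_host "chi-mi325x-node1" 9 bnxt_re6
```

In practice the skip list would more likely come from querying port state than from a hard-coded name; the fixed name here only mirrors the bnxt_re6 note above.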

Related Issue

Fixes #981

Type of Change

  • Bug fix
  • New feature
  • Configuration change
  • Documentation update
  • Other (please describe)

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml

…or 2 IBGDA)

Port the MI355X disagg recipe to MI325X (gfx942/CDNA3) on a Vultr Slurm cluster
with Broadcom BCM5760X Thor 2 NICs using IBGDA for GPU-Direct RDMA via MoRI.

Container image: ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt
Built from akao-amd/sglang rocm.Dockerfile with:
  - GPU_ARCH=gfx942, ENABLE_MORI=1, NIC_BACKEND=ibgda
  - Broadcom bnxt_rocelib (bcm5760x_231.2.63.0a) for RDMA userspace
  - MoRI pinned to HEAD (c0eccaf2) for bundled bnxt headers + dlopen
  - smg-wasm pinned to =1.0.0 (v1.0.1 breaks sgl-model-gateway v0.5.9 API)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

@functionstackx (Contributor) left a comment

thanks for the PR. can u add the sweep-enabled PR label or run the /sweep command to get ur PR into a mergeable state, so that u can merge ur first line of code into the main repo?



dsr1-fp8-mi325x-sglang-disagg:
image: ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt
Contributor

what is this built with? can u add the permalink to dockerfile and ur docker build commands?

Collaborator Author

its described in the PR description

Contributor

can u add the exact commands for reproducibility: the git clone (and which sglang hash), the wget for the broadcom drivers, and the docker build?

Contributor

the build command is empty rn in ur PR description. can u fix? generally prefer that the build scripts be checked into the repo instead of PR descr

[image]

Collaborator Author

its not a wget, had to manually download from the broadcom site and then copy the tarball over to the cluster, based on exact firmware version installed on the cluster. happens for all thor2 NICs

Collaborator Author

the build command is included in the sbatch and/or the shell script: https://github.com/JordanNanos/sglang/blob/main/docker/build-sglang-bnxt.sh and https://github.com/JordanNanos/sglang/blob/main/docker/build-sglang-bnxt.sbatch

you want this build command in this repo?

Contributor

yes, maybe in utils/? and add a readme too about how to manually download the tarball from broadcom?

that way we can get amd engineer to read ur dockerfile & build command and fix upstream builds

- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Low concurrency" (1 prefill worker at TP4, 1 decode worker at TP8)
Contributor

r u sure that TP4 is on the pareto here? do u have a graph?

Collaborator Author

[image] just the initial sweep

Contributor

u only have TP4 curve and u have "hide non-optimal"? can u run the rest of the 24 datapoints?

additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=0"

Contributor

ur missing MTP submissions, u only have non-MTP so far

Comment on lines 256 to +363
@@ -357,7 +360,7 @@ exec sudo docker run --rm \
 --privileged \
 -v ${MODEL_DIR}:/models \
 -v \$HOME/.ssh:/root/.ssh \
--v $(which nicctl):/usr/sbin/nicctl \
+$(command -v nicctl &>/dev/null && echo "-v $(which nicctl):/usr/sbin/nicctl") \
Contributor

can u verify if these changes break mi355 disagg? +viz @Oseltamivir

Collaborator Author

the check for nicctl was breaking on this cluster, MoRI needs it to enforce QoS, disabled for now as it's not installed on these nodes or in the container built and seems unnecessary
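
The conditional mount discussed here could also be factored into a small helper, which makes the "only mount it if it exists" behaviour easier to test on clusters with and without nicctl. This is a sketch only; `optional_mount` is a hypothetical name and is not part of the launch script in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a docker "-v host:container" argument only
# when the named binary is found on PATH; print nothing otherwise.
optional_mount() {
  local bin="$1" target="$2" path
  if path="$(command -v "$bin" 2>/dev/null)"; then
    echo "-v ${path}:${target}"
  fi
}

# On nodes without nicctl this prints nothing, so the docker command
# simply omits the bind mount.
optional_mount nicctl /usr/sbin/nicctl
```

The output is meant to be expanded unquoted inside the docker run command, so it assumes the resolved path contains no spaces.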

- "DECODE_MTP_SIZE=1"


dsr1-fp8-mi325x-sglang-disagg:
Contributor

ur missing perf-changelog.yaml too

@functionstackx functionstackx marked this pull request as draft March 31, 2026 17:45
@functionstackx functionstackx changed the title [AMD] Add MI325X DeepSeek-R1 FP8 disaggregated inference with Broadcom Thor 2 IBGDA [code not in mergable state yet] Add MI325X DeepSeek-R1 FP8 disaggregated inference with Broadcom Thor 2 IBGDA Mar 31, 2026
root and others added 2 commits March 31, 2026 17:51
- Add dsr1-fp8-mi325x-sglang-disagg-mtp config with MTP=1/2 across
  all curve points (top/middle/bottom/low-conc) for both 1k/1k and 8k/1k
- Expand concurrency lists to cover full pareto frontier including
  non-optimal points
- Update image tag to v0.5.9-bnxt-good (the pushed image)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Mar 31, 2026
@JordanNanos
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi325x-sglang-disagg dsr1-fp8-mi325x-sglang-disagg-mtp

@github-actions
Contributor

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23812520838
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi325x-sglang-disagg dsr1-fp8-mi325x-sglang-disagg-mtp
Pinned ref: 2421ca5
Approval: not required (trusted collaborator).

@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Mar 31, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Mar 31, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions bot Mar 31, 2026
@JordanNanos
Collaborator Author

holy

Development

Successfully merging this pull request may close these issues.

starter task: MVP port mi355 deepseek disagg recipe to mi325
