[sgl][recipe]update sglang-atom ds fp4 recipe by zhuyuhua-v · Pull Request #975 · ROCm/ATOM

zhuyuhua-v · 2026-05-29T08:49:41Z

PR description: Update ATOM SGLang Benchmark, Accuracy, and DeepSeek Recipe

Summary

This PR updates the ATOM SGLang benchmark, accuracy validation, PR CI, and the DeepSeek-R1 recipe. The goal is to make launch arguments for the DeepSeek-R1-0528 FP8/MXFP4, Qwen3.5, and Mesh/OOB benchmarks more explicit, to group the benchmark matrix more clearly, and to remove runtime environment variables from the accuracy/benchmark configs that should not be maintained in more than one place.

Main changes:

Extend and reorganize the SGLang benchmark catalog. The catalog now has 22 model configs, which expand to 120 OOB benchmark configs and 71 Mesh benchmark configs.
Add a templates + models abstraction to the benchmark catalog to reuse repeated SGLang server args and env fragments; when the workflow reads the catalog, it expands them back into the original full model list.
Add or adjust DeepSeek-R1-0528-MXFP4 OOB coverage, including TP4/TP8, TP4 DP4 EP4, TP4 DP8 EP8, TP8 MTP1, TP8 MTP3, MTP3 TP4 DP4 EP4, and MTP3 TP8 DP8 EP8.
Keep and reorganize SGLang-Mesh coverage, including FP8 DPA4/DPA8, FP4 TP4/TP8, FP4 DPA4/DPA8, and the corresponding MTP Mesh configs.
Adjust the SGLang benchmark schedule so that OOB non-MTP, OOB MTP + Qwen, the full Mesh set, and the full OOB set run on different days.
Update the manual-trigger presets: add OOB DeepSeek MTP / non-MTP groups; Mesh presets continue to support finer splits by FP4/FP8, TP, DPA, and MTP.
Explicitly add SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models to the Qwen3.5-397B-A17B-FP8 benchmark so that the benchmark catalog matches the nightly accuracy config.
Remove duplicated or no-longer-needed env from the DeepSeek MTP/DP cases, for example CUDA_VISIBLE_DEVICES, TP_SIZE, MTP, SPECULATIVE_NUM_DRAFT_TOKENS, and ATOM_DUAL_STREAM_MOE_TOKEN_THRESHOLD=0.
Remove AITER_QUICK_REDUCE_QUANTIZATION=INT4 from the accuracy configs; the DeepSeek benchmarks in the benchmark catalog still keep this env.
Change the 1x1024-specific env for FP4 Mesh TP cases from ATOM_USE_FP4_TRITON_GEMM=1 to ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1.
PR CI keeps only 3 lightweight SGLang accuracy cases; the legacy amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 scenario is covered only in nightly/full accuracy validation.
Update recipes/atom_sglang/DeepSeek-R1.md with SGLang launch commands and notes for DeepSeek-R1-0528 FP8/MXFP4 and TP/DP/EP/MTP.

Benchmark Catalog Coverage

.github/benchmark/sglang_benchmark_models.json currently contains 22 model configs:

SGLang-OOB: 12 model configs, which expand to 120 benchmark configs with the default OOB parameters.
SGLang-Mesh: 10 model configs, which come to 71 benchmark configs after filtering by each config's supported input/output pairs and concurrency values.
Total: 191 benchmark configs.

These counts are the benchmark configs/cells after the workflow expands the catalog, not the number of model entries in the JSON.

OOB Coverage

The default OOB parameters form 10 combinations:

1024x1024, concurrency: 4, 8, 16, 32, 64
8192x1024, concurrency: 4, 8, 16, 32, 64

The current OOB model configs expand as follows:

DeepSeek-R1-0528 FP8 TP4: 10 configs
DeepSeek-R1-0528 FP8 TP8: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP4: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP8: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP4 DP4 EP4: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP4 DP8 EP8: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP1: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP3: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP4 DP4 EP4: 10 configs
DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP8 DP8 EP8: 10 configs
Qwen3.5-397B-A17B-FP8 TP4: 10 configs
Qwen3.5-397B-A17B-FP8 TP8: 10 configs

OOB total: 120 benchmark configs.
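
The OOB count can be reproduced as a plain cross product of models, input/output pairs, and concurrency levels. The sketch below only illustrates that arithmetic; the function name `expand_oob` is hypothetical and is not the workflow's actual expansion code.

```python
from itertools import product

# Default OOB parameters as listed above: two input x output pairs, five concurrency levels.
OOB_PAIRS = ["1024x1024", "8192x1024"]
OOB_CONCURRENCY = [4, 8, 16, 32, 64]

# The 12 OOB model configs named in this description.
OOB_MODELS = [
    "DeepSeek-R1-0528 FP8 TP4",
    "DeepSeek-R1-0528 FP8 TP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4 DP4 EP4",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4 DP8 EP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP1",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP3",
    "DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP4 DP4 EP4",
    "DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP8 DP8 EP8",
    "Qwen3.5-397B-A17B-FP8 TP4",
    "Qwen3.5-397B-A17B-FP8 TP8",
]


def expand_oob(models, pairs, concurrency):
    """Return one (model, pair, concurrency) cell per combination (hypothetical helper)."""
    return list(product(models, pairs, concurrency))


cells = expand_oob(OOB_MODELS, OOB_PAIRS, OOB_CONCURRENCY)
print(len(cells))  # 12 models * 2 pairs * 5 concurrency levels = 120
```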

Mesh Coverage

The default Mesh parameters include two request types, 8192x1 and 1x1024, filtered by each model config's own supported_input_output_pairs and supported_concurrency_values_by_pair.

The current Mesh model configs expand as follows:

DeepSeek-R1-0528 FP8 SGLang-Mesh DPA4 EP4: 3 configs
DeepSeek-R1-0528 FP8 SGLang-Mesh DPA8 EP8: 6 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP4: 6 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP8: 16 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA4 EP4: 3 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA8 EP8: 6 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP4 MTP: 6 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP8 MTP: 16 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA4 EP4 MTP: 3 configs
DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA8 EP8 MTP: 6 configs

Mesh total: 71 benchmark configs.
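
Unlike OOB, the Mesh counts come from filtering rather than a full cross product. The following is a minimal sketch of that filtering, assuming supported_input_output_pairs is a list of pair names and supported_concurrency_values_by_pair maps a pair name to a list of integers. The default cases, the 8192x1 concurrency values, and the helper name `expand_mesh` are illustrative assumptions, not values taken from the catalog.

```python
def expand_mesh(model, default_cases):
    """Keep only the default (pair, concurrency) cases the model declares support for."""
    pairs = model["supported_input_output_pairs"]
    by_pair = model["supported_concurrency_values_by_pair"]
    return [
        (pair, conc)
        for pair, conc in default_cases
        if pair in pairs and conc in by_pair.get(pair, [])
    ]


# Hypothetical defaults; the real default Mesh parameters live in the workflow.
DEFAULT_CASES = (
    [("8192x1", c) for c in (1, 2, 4, 8)]
    + [("1x1024", c) for c in (64, 128, 256, 512)]
)

# Hypothetical model entry using the two field names mentioned above.
example_model = {
    "supported_input_output_pairs": ["8192x1", "1x1024"],
    "supported_concurrency_values_by_pair": {
        "8192x1": [1, 2, 4],
        "1x1024": [64, 128, 256],
    },
}

cells = expand_mesh(example_model, DEFAULT_CASES)
print(len(cells))  # 6 with these illustrative values
```

Because each model config carries its own filter, two configs with the same parallelism layout can still expand to different cell counts.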

Scheduled Benchmark Groups

All scheduled runs in atom-sglang-benchmark.yaml trigger at 10 PM Beijing time, i.e. 14:00 UTC.

Cron	Coverage	Model configs	Benchmark configs
`0 14 * * 1,3`	OOB DeepSeek non-MTP	6	60
`0 14 * * 2,4`	OOB DeepSeek MTP + Qwen	6	60
`0 14 * * 5`	Full Mesh	10	71
`0 14 * * 6`	Full OOB	12	120
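
GitHub Actions evaluates schedule cron expressions in UTC, so the hour field 14 has to be converted to check the Beijing-time claim. A small sketch of that conversion follows; the helper name `utc_to_beijing` and the date used are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset with no daylight saving.
BEIJING = timezone(timedelta(hours=8), "CST")


def utc_to_beijing(hour_utc, minute_utc=0):
    """Convert a UTC time of day to Beijing time (the date is only a carrier)."""
    t = datetime(2026, 6, 8, hour_utc, minute_utc, tzinfo=timezone.utc)
    return t.astimezone(BEIJING)


local = utc_to_beijing(14)
print(local.strftime("%H:%M"))  # 22:00
```

Since 14:00 UTC and 22:00 Beijing time fall on the same calendar day, the day-of-week fields in the table (1 is Monday, 6 is Saturday) read the same in both time zones.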

Manual Preset Coverage

Expanded sizes of the OOB manual presets:

all-deepseek: 10 DeepSeek model configs, 100 benchmark configs
all-deepseek-non-mtp: 6 DeepSeek non-MTP model configs, 60 benchmark configs
all-deepseek-mtp: 4 DeepSeek MTP model configs, 40 benchmark configs
all-qwen: 2 Qwen model configs, 20 benchmark configs
all-oob: 12 OOB model configs, 120 benchmark configs

Expanded sizes of the Mesh manual presets:

ds-all / all: 10 Mesh model configs, 71 benchmark configs
ds-fp4-all / fp4-all: 8 FP4 Mesh model configs, 62 benchmark configs
ds-fp8-all / fp8-all: 2 FP8 Mesh model configs, 9 benchmark configs
non-mtp: 6 Mesh non-MTP model configs, 40 benchmark configs
mtp: 4 Mesh MTP model configs, 31 benchmark configs
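
The Mesh preset sizes follow from the per-config counts in the Mesh coverage list. The sketch below recomputes them by treating each preset as a filter; the tuple keys and the `PRESETS` predicates are an illustrative encoding, not how the workflow stores presets.

```python
# (quantization, layout, is_mtp) -> benchmark configs, copied from the Mesh coverage list.
MESH_COUNTS = {
    ("FP8", "DPA4 EP4", False): 3,
    ("FP8", "DPA8 EP8", False): 6,
    ("FP4", "TP4", False): 6,
    ("FP4", "TP8", False): 16,
    ("FP4", "DPA4 EP4", False): 3,
    ("FP4", "DPA8 EP8", False): 6,
    ("FP4", "TP4", True): 6,
    ("FP4", "TP8", True): 16,
    ("FP4", "DPA4 EP4", True): 3,
    ("FP4", "DPA8 EP8", True): 6,
}

# Hypothetical predicate per preset name.
PRESETS = {
    "all": lambda quant, layout, mtp: True,
    "fp4-all": lambda quant, layout, mtp: quant == "FP4",
    "fp8-all": lambda quant, layout, mtp: quant == "FP8",
    "non-mtp": lambda quant, layout, mtp: not mtp,
    "mtp": lambda quant, layout, mtp: mtp,
}


def preset_size(name):
    """Return (model config count, benchmark config count) for a preset."""
    keys = [k for k in MESH_COUNTS if PRESETS[name](*k)]
    return len(keys), sum(MESH_COUNTS[k] for k in keys)


for name in PRESETS:
    print(name, preset_size(name))
```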

Benchmark Server Args and Env

The benchmark catalog now spells out server args explicitly wherever possible, reducing reliance on hidden defaults:

templates.extra_args factors out shared launch-argument fragments such as trust_remote_code, aiter_runtime, qwen_reasoning, mtp1_nextn, mtp3_nextn_256, and mtp3_nextn_4096.
templates.env_vars factors out shared env fragments for DeepSeek OOB, DP, MTP, Mesh, Mesh MTP, and Qwen.
Each model config keeps only its own TP/DP/EP, MTP step count, path, prefix, preset, and supported benchmark parameter ranges.
DeepSeek FP8/FP4 OOB and Mesh server args uniformly use --mem-fraction-static 0.85.
DeepSeek cases explicitly pass key arguments such as --trust-remote-code, --attention-backend aiter, --kv-cache-dtype fp8_e4m3, --page-size 1, and --disable-radix-cache.
DP/EP cases explicitly use --enable-dp-attention, --data-parallel-size, and --expert-parallel-size.
MTP cases explicitly use --speculative-draft-model-path SGLang/DeepSeek-R1-NextN, --speculative-algorithm NEXTN, --speculative-num-steps, --speculative-num-draft-tokens, and --cuda-graph-bs.
Mesh DPA cases add --chunked-prefill-size 65536 to the 8192x1 case through case_extra_args_by_pair.
FP4 Mesh TP cases add ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1 to the 1x1024 case through case_env_vars_by_pair.
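
To make the templates + models idea concrete, here is a rough sketch of how template fragments might be expanded back into a full model entry. The JSON schema is not shown in this description, so the keys `extra_args_templates` and `env_vars_templates`, the template name `deepseek_oob`, and the contents assigned to each fragment are all assumptions made for illustration; only the template names trust_remote_code and aiter_runtime and the individual flags come from the text above.

```python
def expand_model(model, templates):
    """Resolve template references into full extra_args and env_vars strings."""
    extra_args = []
    for name in model.get("extra_args_templates", []):
        extra_args += templates["extra_args"][name]
    extra_args += model.get("extra_args", [])

    env_vars = []
    for name in model.get("env_vars_templates", []):
        env_vars += templates["env_vars"][name]
    env_vars += model.get("env_vars", [])

    # Drop the (hypothetical) reference keys and emit the flattened fields.
    expanded = {k: v for k, v in model.items() if not k.endswith("_templates")}
    expanded["extra_args"] = " ".join(extra_args)
    expanded["env_vars"] = "\n".join(env_vars)
    return expanded


# Assumed fragment contents, for illustration only.
templates = {
    "extra_args": {
        "trust_remote_code": ["--trust-remote-code"],
        "aiter_runtime": ["--attention-backend aiter", "--kv-cache-dtype fp8_e4m3"],
    },
    "env_vars": {
        "deepseek_oob": ["SGLANG_USE_AITER=1", "SGLANG_AITER_FP8_PREFILL_ATTN=0"],
    },
}

model = {
    "name": "DeepSeek-R1-0528 FP8 TP4",
    "extra_args_templates": ["trust_remote_code", "aiter_runtime"],
    "extra_args": ["--tensor-parallel-size 4", "--mem-fraction-static 0.85"],
    "env_vars_templates": ["deepseek_oob"],
}

expanded = expand_model(model, templates)
print(expanded["extra_args"])
print(expanded["env_vars"])
```

Keeping template fragments first and model-specific arguments last means a model entry only needs to state what differs from the shared fragments.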

The DeepSeek benchmark env keeps:

AITER_QUICK_REDUCE_QUANTIZATION=INT4
SGLANG_AITER_FP8_PREFILL_ATTN=0
SGLANG_USE_AITER=1
ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1
SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models
SGLANG_ENABLE_TORCH_COMPILE=1
TORCHINDUCTOR_COMPILE_THREADS=128

The accuracy configs no longer set AITER_QUICK_REDUCE_QUANTIZATION=INT4.
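
The review diff on this PR shows env_vars stored as a single string of newline-separated KEY=VALUE pairs. A minimal sketch of parsing that form and deriving an accuracy-style env without AITER_QUICK_REDUCE_QUANTIZATION is shown below; `parse_env_vars` is a hypothetical helper, not a function from the repository.

```python
def parse_env_vars(block):
    """Parse newline-separated KEY=VALUE pairs; an empty value is allowed."""
    env = {}
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got {line!r}")
        env[key] = value
    return env


# String taken from the added line in the review diff on this PR.
benchmark_env = parse_env_vars(
    "SGLANG_DEFAULT_SERVER_ARGS=\n"
    "AITER_QUICK_REDUCE_QUANTIZATION=INT4\n"
    "SGLANG_AITER_FP8_PREFILL_ATTN=0\n"
    "SGLANG_USE_AITER=1\n"
    "ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1\n"
    "SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models\n"
    "TORCHINDUCTOR_COMPILE_THREADS=128"
)

# Accuracy configs drop the quick-reduce quantization setting.
accuracy_env = {
    k: v for k, v in benchmark_env.items()
    if k != "AITER_QUICK_REDUCE_QUANTIZATION"
}
print(sorted(set(benchmark_env) - set(accuracy_env)))
```

Note that SGLANG_DEFAULT_SERVER_ARGS= parses to an empty string rather than being dropped, which matters if the intent is to override a non-empty default.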

Qwen Benchmark Env Alignment

The benchmark env for Qwen3.5-397B-A17B-FP8 TP4 and Qwen3.5-397B-A17B-FP8 TP8 now explicitly includes:

SGLANG_DEFAULT_SERVER_ARGS=
SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0

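SGLANG_EXTERNAL_MODEL_PACKAGE was already exported by default by .github/scripts/atom_sglang_test.sh, but leaving it out of the benchmark catalog made the catalog look inconsistent with the nightly accuracy config. This PR writes it into the JSON to make review and config alignment easier.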

SGLang PR CI Accuracy

The PR CI in .github/workflows/atom-sglang-test.yaml now keeps only 3 lightweight accuracy cases:

DeepSeek-R1-FP8 TP4
DeepSeek-R1-FP4 TP4
Qwen3.5-35B-A3B-FP8 TP2

None of these PR CI cases sets AITER_QUICK_REDUCE_QUANTIZATION=INT4 anymore.

The legacy amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 cases are not added to PR CI, to keep the cost of PR checks down.

Nightly / Full Accuracy Validation

.github/benchmark/sglang_models_accuracy.json and .github/workflows/atom-sglang-accuracy-validation.yaml currently cover 15 nightly/full accuracy cases:

DeepSeek-R1-FP8 TP4
Qwen3.5-35B-A3B-FP8 TP2
Qwen3.5-35B-A3B TP2
Qwen3.5-397B-A17B-FP8 TP4
Qwen3.5-397B-A17B-FP8 TP8
DeepSeek-R1-FP8 TP8
DeepSeek-R1-FP4 TP4
DeepSeek-R1-FP4 TP4 DP4 EP4
DeepSeek-R1-FP4 TP4 DP8 EP8
DeepSeek-R1-FP4 TP8
DeepSeek-R1-FP4 TP8 MTP3
DeepSeek-R1-FP4-MTP-MoEFP4 TP8
DeepSeek-R1-FP4-MTP-MoEFP4 TP8 DP8 EP8
DeepSeek-R1-FP4-MTP-MoEFP4 TP8 MTP3
DeepSeek-R1-FP4 TP8 MTP1

Where the nightly/full accuracy DeepSeek cases have a counterpart in the benchmark catalog, their server args are kept identical. The main differences are that accuracy does not set AITER_QUICK_REDUCE_QUANTIZATION=INT4 and that it uses a different runner from the benchmark.

Qwen3.5-35B-A3B* and the legacy DeepSeek-R1-FP4-MTP-MoEFP4* are accuracy-only coverage and have no corresponding entry in the benchmark catalog.

Mesh Benchmark Workflow

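Changes to atom-sglang-benchmark.yaml and atom_sglang_mesh_benchmark.sh make the Mesh benchmark consume the model config more accurately:

The workflow applies case_env_vars_by_pair before the benchmark container starts.
The workflow applies case_extra_args_by_pair before launching SGLang, covering both the sglang-atom and sglang-mori server modes.
Mesh DPA/EP cases can inject --chunked-prefill-size 65536 through case_extra_args_by_pair.
atom_sglang_mesh_benchmark.sh parses --chunked-prefill-size from SERVER_EXTRA_ARGS and converts it into the script's internal prefill parameter, avoiding duplicated or inconsistent arguments.
For Mesh DP+EP cases, the benchmark client uses a smaller request sample size to reduce the cost of running high-concurrency DP/EP cases.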
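
The flag extraction described for atom_sglang_mesh_benchmark.sh can be illustrated in Python. This is only a sketch of the idea, not a port of the shell script: the helper name `pop_flag_value` is hypothetical, and removing the flag from the remaining arguments is an assumption about how duplication is avoided.

```python
import shlex


def pop_flag_value(extra_args, flag):
    """Return (value, remaining_args); accepts '--flag value' and '--flag=value'."""
    tokens = shlex.split(extra_args)
    value = None
    remaining = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == flag and i + 1 < len(tokens):
            value = tokens[i + 1]  # the flag's value is the next token
            i += 2
        elif tok.startswith(flag + "="):
            value = tok.split("=", 1)[1]
            i += 1
        else:
            remaining.append(tok)
            i += 1
    return value, shlex.join(remaining)


server_extra_args = "--enable-dp-attention --chunked-prefill-size 65536 --page-size 1"
chunk_size, rest = pop_flag_value(server_extra_args, "--chunked-prefill-size")
print(chunk_size)  # 65536
print(rest)        # --enable-dp-attention --page-size 1
```

Extracting the value once and passing the rest through keeps a single source of truth for the prefill size, which is the stated goal of the change.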

DeepSeek Recipe

recipes/atom_sglang/DeepSeek-R1.md is updated in step with the DeepSeek-R1-0528 SGLang usage notes:

Covers deepseek-ai/DeepSeek-R1-0528 FP8.
Covers amd/DeepSeek-R1-0528-MXFP4-v2 FP4.
Adds launch commands for TP, DP, EP, and MTP combinations.
Aligns the key server args and env with those used in the benchmark catalog.

ZhiweiYan-96 · 2026-05-29T12:20:21Z

      "1x1024": [64, 128, 256]
    },
-    "env_vars": "AITER_QUICK_REDUCE_QUANTIZATION=INT4\nSGLANG_AITER_FP8_PREFILL_ATTN=0\nSGLANG_USE_AITER=1\nATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1"
+    "env_vars": "SGLANG_DEFAULT_SERVER_ARGS=\nAITER_QUICK_REDUCE_QUANTIZATION=INT4\nSGLANG_AITER_FP8_PREFILL_ATTN=0\nSGLANG_USE_AITER=1\nATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1\nSGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models\nTORCHINDUCTOR_COMPILE_THREADS=128",


Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

qichu-yun previously approved these changes May 29, 2026

ZLkanyo009 previously approved these changes May 29, 2026

zhuyuhua-v dismissed stale reviews from ZLkanyo009 and qichu-yun via 8954765 May 29, 2026 09:54

zhuyuhua-v force-pushed the yuhua/sgl-recipe branch from 47fea28 to 3abdf6f Compare May 29, 2026 12:07

ZhiweiYan-96 reviewed May 29, 2026

Comment thread .github/benchmark/sglang_benchmark_models.json Outdated

ZhiweiYan-96 reviewed May 29, 2026

Comment thread .github/benchmark/sglang_benchmark_models.json Outdated

ZhiweiYan-96 reviewed May 29, 2026

Comment thread .github/benchmark/sglang_benchmark_models.json Outdated

zhuyuhua-v and others added 6 commits June 8, 2026 10:20

Update SGLang benchmark coverage

6ed3122

update benchmark

7c9e532

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

update ci, nightly and schedule

255b878

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

update ci, nightly casses

6ce4b62

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

update benchmark.json for read

8feaefd

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

update sglang benchmark presets after rebase

acd0ae6

Co-authored-by: Cursor <cursoragent@cursor.com>

zhangxinyuanliuhengyu force-pushed the yuhua/sgl-recipe branch from 1c0aebc to acd0ae6 Compare June 8, 2026 02:25

zhangxinyuanliuhengyu requested review from ZLkanyo009, ZhiweiYan-96 and qichu-yun June 8, 2026 03:14

zhangxinyuanliuhengyu marked this pull request as ready for review June 8, 2026 03:18

qichu-yun approved these changes Jun 8, 2026

zhuyuhua-v wants to merge 6 commits into main from yuhua/sgl-recipe
