[sgl][recipe]update sglang-atom ds fp4 recipe#975
Open
zhuyuhua-v wants to merge 6 commits into
qichu-yun
previously approved these changes
May 29, 2026
ZLkanyo009
previously approved these changes
May 29, 2026
47fea28 to
3abdf6f
  "1x1024": [64, 128, 256]
  },
- "env_vars": "AITER_QUICK_REDUCE_QUANTIZATION=INT4\nSGLANG_AITER_FP8_PREFILL_ATTN=0\nSGLANG_USE_AITER=1\nATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1"
+ "env_vars": "SGLANG_DEFAULT_SERVER_ARGS=\nAITER_QUICK_REDUCE_QUANTIZATION=INT4\nSGLANG_AITER_FP8_PREFILL_ATTN=0\nSGLANG_USE_AITER=1\nATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1\nSGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models\nTORCHINDUCTOR_COMPILE_THREADS=128",
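The `env_vars` value in the diff above packs several settings into one string, one `KEY=VALUE` pair per `\n`. Below is a minimal sketch of how such a string could be split into a mapping, assuming exactly that format. `parse_env_vars` is a hypothetical helper written for illustration, not a function from the repository.

```python
def parse_env_vars(raw: str) -> dict[str, str]:
    """Split a newline-separated KEY=VALUE string into a dict."""
    env: dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got {line!r}")
        # An empty value such as "SGLANG_DEFAULT_SERVER_ARGS=" is kept as "".
        env[key] = value
    return env


# The updated value from the diff, written as a Python string literal.
new_env_vars = (
    "SGLANG_DEFAULT_SERVER_ARGS=\n"
    "AITER_QUICK_REDUCE_QUANTIZATION=INT4\n"
    "SGLANG_AITER_FP8_PREFILL_ATTN=0\n"
    "SGLANG_USE_AITER=1\n"
    "ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1\n"
    "SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models\n"
    "TORCHINDUCTOR_COMPILE_THREADS=128"
)
env = parse_env_vars(new_env_vars)
for key, value in env.items():
    print(f"{key} = {value!r}")
```

Keeping the empty value for `SGLANG_DEFAULT_SERVER_ARGS` matters here: the key is present in the string, so dropping it would change what the string sets.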
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1c0aebc to
acd0ae6
qichu-yun
approved these changes
Jun 8, 2026
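PR Description: Update the ATOM SGLang Benchmark, Accuracy, and DeepSeek Recipe
Summary
This PR updates the ATOM SGLang benchmark, accuracy validation, PR CI, and the DeepSeek-R1 recipe. The goal is to make the launch arguments of the DeepSeek-R1-0528 FP8/MXFP4, Qwen3.5, and Mesh/OOB benchmarks more explicit, to group the benchmark matrix more clearly, and to clean out runtime environment variables that should not be maintained redundantly in the accuracy/benchmark configs.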
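Main changes:
- The benchmark catalog uses a `templates` + `models` abstraction to reuse repeated SGLang server args and env fragments; when the workflow reads the file, it expands the abstraction back into the original full model list.
- `SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models` is written out explicitly, so that the benchmark catalog and the nightly accuracy config agree.
- Runtime environment variables cleaned out of the configs: `CUDA_VISIBLE_DEVICES`, `TP_SIZE`, `MTP`, `SPECULATIVE_NUM_DRAFT_TOKENS`, and `ATOM_DUAL_STREAM_MOE_TOKEN_THRESHOLD=0`.
- The accuracy configs no longer set `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, but the DeepSeek benchmarks in the benchmark catalog keep this env.
- The `1x1024`-specific env changes from `ATOM_USE_FP4_TRITON_GEMM=1` to `ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1`.
- The legacy `amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` scenarios are kept only in nightly/full accuracy validation.
- `recipes/atom_sglang/DeepSeek-R1.md` is updated with SGLang launch commands and notes covering DeepSeek-R1-0528 FP8/MXFP4 and TP/DP/EP/MTP.
Benchmark Catalog Coverage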
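`.github/benchmark/sglang_benchmark_models.json` currently contains 22 model configs:
- SGLang-OOB: 12 model configs, which expand to 120 benchmark configs under the default OOB parameters.
- SGLang-Mesh: 10 model configs, which come to 71 benchmark configs after filtering by each model's supported input/output pairs and concurrency values.
These counts are the benchmark configs/cells that the workflow actually produces after expansion, not the number of model entries in the JSON.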
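As a rough illustration of the `templates` + `models` idea listed under the main changes, the sketch below resolves template names into full per-model settings. The reference keys `extra_args_templates` and `env_vars_templates`, the `display` field, the template contents, and the `expand_models` helper are all assumptions made for illustration. The actual schema of `sglang_benchmark_models.json` is not shown in this PR description.

```python
def expand_models(catalog: dict) -> list[dict]:
    """Resolve template names into full server-arg and env strings."""
    templates = catalog["templates"]
    expanded = []
    for model in catalog["models"]:
        # Hypothetical reference keys: each lists template names to splice in.
        args = [templates["extra_args"][n] for n in model.get("extra_args_templates", [])]
        envs = [templates["env_vars"][n] for n in model.get("env_vars_templates", [])]
        expanded.append(
            {
                "display": model["display"],
                "extra_args": " ".join(args),
                "env_vars": "\n".join(envs),
            }
        )
    return expanded


# Sample catalog. The template names come from the PR text; their contents are assumed.
catalog = {
    "templates": {
        "extra_args": {
            "trust_remote_code": "--trust-remote-code",
            "aiter_runtime": "--attention-backend aiter --kv-cache-dtype fp8_e4m3",
        },
        "env_vars": {
            "deepseek_oob": "SGLANG_USE_AITER=1\nAITER_QUICK_REDUCE_QUANTIZATION=INT4",
        },
    },
    "models": [
        {
            "display": "DeepSeek-R1-0528 FP8 TP8",
            "extra_args_templates": ["trust_remote_code", "aiter_runtime"],
            "env_vars_templates": ["deepseek_oob"],
        }
    ],
}

full_models = expand_models(catalog)
print(full_models[0]["extra_args"])
```

Expanding at read time means the rest of the workflow keeps seeing a flat model list, so only the JSON file has to know about templates.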
OOB Coverage
The OOB defaults comprise 10 combinations:
- `1024x1024`, concurrency: 4, 8, 16, 32, 64
- `8192x1024`, concurrency: 4, 8, 16, 32, 64
The current OOB model configs expand as follows:
- DeepSeek-R1-0528 FP8 TP4: 10 configs
- DeepSeek-R1-0528 FP8 TP8: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP4: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP8: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP4 DP4 EP4: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP4 DP8 EP8: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP1: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP3: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP4 DP4 EP4: 10 configs
- DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP8 DP8 EP8: 10 configs
- Qwen3.5-397B-A17B-FP8 TP4: 10 configs
- Qwen3.5-397B-A17B-FP8 TP8: 10 configs
OOB total: 120 benchmark configs.
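The OOB totals above follow from a plain cross product: 2 input/output pairs times 5 concurrency values gives 10 cells per model, and 12 models give 120 cells. The sketch below reproduces that arithmetic. The variable names are illustrative and not taken from the workflow.

```python
from itertools import product

OOB_PAIRS = ["1024x1024", "8192x1024"]
OOB_CONCURRENCY = [4, 8, 16, 32, 64]
OOB_MODELS = [
    "DeepSeek-R1-0528 FP8 TP4",
    "DeepSeek-R1-0528 FP8 TP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4 DP4 EP4",
    "DeepSeek-R1-0528-MXFP4 FP4 TP4 DP8 EP8",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP1",
    "DeepSeek-R1-0528-MXFP4 FP4 TP8 MTP3",
    "DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP4 DP4 EP4",
    "DeepSeek-R1-0528-MXFP4 FP4 MTP3 TP8 DP8 EP8",
    "Qwen3.5-397B-A17B-FP8 TP4",
    "Qwen3.5-397B-A17B-FP8 TP8",
]

# Every model runs every (pair, concurrency) combination of the defaults.
cells_per_model = list(product(OOB_PAIRS, OOB_CONCURRENCY))
all_cells = [(model, pair, conc) for model in OOB_MODELS for pair, conc in cells_per_model]

print(len(cells_per_model), len(all_cells))  # 10 120
```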
Mesh Coverage
The Mesh defaults contain two request types, `8192x1` and `1x1024`, and are filtered by each model config's own `supported_input_output_pairs` and `supported_concurrency_values_by_pair`.
The current Mesh model configs expand as follows:
- DeepSeek-R1-0528 FP8 SGLang-Mesh DPA4 EP4: 3 configs
- DeepSeek-R1-0528 FP8 SGLang-Mesh DPA8 EP8: 6 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP4: 6 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP8: 16 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA4 EP4: 3 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA8 EP8: 6 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP4 MTP: 6 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh TP8 MTP: 16 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA4 EP4 MTP: 3 configs
- DeepSeek-R1-0528-MXFP4 FP4 SGLang-Mesh DPA8 EP8 MTP: 6 configs
Mesh total: 71 benchmark configs.
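The Mesh filtering described above can be sketched as follows. The default concurrency grid in `DEFAULT_MESH_GRID` is hypothetical, because the real defaults are not listed in this PR description. The `mesh_cells` helper and its fallback rule (no per-pair entry means all defaults are kept) are also assumptions. Only the two field names and the `"1x1024": [64, 128, 256]` entry come from the PR itself.

```python
# Hypothetical defaults: pair -> concurrency values offered by the workflow.
DEFAULT_MESH_GRID = {
    "8192x1": [1, 2, 4, 8],
    "1x1024": [16, 32, 64, 128, 256],
}


def mesh_cells(model: dict, grid: dict = DEFAULT_MESH_GRID) -> list[tuple[str, int]]:
    """Keep only the (pair, concurrency) cells that the model config supports."""
    cells = []
    by_pair = model.get("supported_concurrency_values_by_pair", {})
    for pair, default_values in grid.items():
        if pair not in model["supported_input_output_pairs"]:
            continue  # the model does not run this request shape at all
        allowed = by_pair.get(pair, default_values)  # assumed fallback: all defaults
        cells.extend((pair, c) for c in default_values if c in allowed)
    return cells


model = {
    "supported_input_output_pairs": ["1x1024"],
    "supported_concurrency_values_by_pair": {"1x1024": [64, 128, 256]},
}
print(mesh_cells(model))  # [('1x1024', 64), ('1x1024', 128), ('1x1024', 256)]

# Per-model counts from the list above, in order; they should add up to 71.
mesh_counts = [3, 6, 6, 16, 3, 6, 6, 16, 3, 6]
print(sum(mesh_counts))  # 71
```

Because the count depends on each model's own filters, Mesh totals cannot be derived from the number of model entries alone, which is why the catalog counts are stated per expanded cell.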
Scheduled Benchmark Groups
Every scheduled run in `atom-sglang-benchmark.yaml` triggers at 10 PM Beijing time, i.e. 14:00 UTC. The cron schedules are:
- `0 14 * * 1,3`
- `0 14 * * 2,4`
- `0 14 * * 5`
- `0 14 * * 6`
Manual Preset Coverage
Expansion sizes of the OOB manual presets:
- `all-deepseek`: 10 DeepSeek model configs, 100 benchmark configs
- `all-deepseek-non-mtp`: 6 DeepSeek non-MTP model configs, 60 benchmark configs
- `all-deepseek-mtp`: 4 DeepSeek MTP model configs, 40 benchmark configs
- `all-qwen`: 2 Qwen model configs, 20 benchmark configs
- `all-oob`: 12 OOB model configs, 120 benchmark configs
Expansion sizes of the Mesh manual presets:
- `ds-all` / `all`: 10 Mesh model configs, 71 benchmark configs
- `ds-fp4-all` / `fp4-all`: 8 FP4 Mesh model configs, 62 benchmark configs
- `ds-fp8-all` / `fp8-all`: 2 FP8 Mesh model configs, 9 benchmark configs
- `non-mtp`: 6 Mesh non-MTP model configs, 40 benchmark configs
- `mtp`: 4 Mesh MTP model configs, 31 benchmark configs
Benchmark Server Args and Env
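The benchmark catalog now spells out server args explicitly wherever possible, to reduce reliance on hidden defaults:
- `templates.extra_args` factors out shared launch-argument fragments such as `trust_remote_code`, `aiter_runtime`, `qwen_reasoning`, `mtp1_nextn`, `mtp3_nextn_256`, and `mtp3_nextn_4096`.
- `templates.env_vars` factors out the shared env fragments for DeepSeek OOB, DP, MTP, Mesh, Mesh MTP, and Qwen.
- `--mem-fraction-static 0.85` is set explicitly.
- Key arguments such as `--trust-remote-code`, `--attention-backend aiter`, `--kv-cache-dtype fp8_e4m3`, `--page-size 1`, and `--disable-radix-cache` are written out.
- The data-parallel and expert-parallel arguments `--enable-dp-attention`, `--data-parallel-size`, and `--expert-parallel-size` are written out.
- The speculative-decoding arguments `--speculative-draft-model-path SGLang/DeepSeek-R1-NextN`, `--speculative-algorithm NEXTN`, `--speculative-num-steps`, `--speculative-num-draft-tokens`, and `--cuda-graph-bs` are written out.
- `case_extra_args_by_pair` adds `--chunked-prefill-size 65536` to the `8192x1` case.
- `case_env_vars_by_pair` adds `ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM=1` to the `1x1024` case.
The DeepSeek benchmark env keeps `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, whereas the accuracy configs no longer set it.
Qwen Benchmark Env Alignment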
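The benchmark env for `Qwen3.5-397B-A17B-FP8 TP4` and `Qwen3.5-397B-A17B-FP8 TP8` now explicitly includes `SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models`. `.github/scripts/atom_sglang_test.sh` already exports `SGLANG_EXTERNAL_MODEL_PACKAGE` by default, but leaving it out of the benchmark catalog made the catalog look inconsistent with the nightly accuracy config. This PR writes it into the JSON to make review and config alignment easier.
SGLang PR CI Accuracy
PR CI in `.github/workflows/atom-sglang-test.yaml` now keeps only 3 lightweight accuracy cases:
- DeepSeek-R1-FP8 TP4
- DeepSeek-R1-FP4 TP4
- Qwen3.5-35B-A3B-FP8 TP2
None of these PR CI cases sets `AITER_QUICK_REDUCE_QUANTIZATION=INT4` any longer.
The legacy `amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` cases are not added to PR CI, to keep the cost of PR checks down.
Nightly / Full Accuracy Validation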
`.github/benchmark/sglang_models_accuracy.json` and `.github/workflows/atom-sglang-accuracy-validation.yaml` currently cover 15 nightly/full accuracy cases:
- DeepSeek-R1-FP8 TP4
- Qwen3.5-35B-A3B-FP8 TP2
- Qwen3.5-35B-A3B TP2
- Qwen3.5-397B-A17B-FP8 TP4
- Qwen3.5-397B-A17B-FP8 TP8
- DeepSeek-R1-FP8 TP8
- DeepSeek-R1-FP4 TP4
- DeepSeek-R1-FP4 TP4 DP4 EP4
- DeepSeek-R1-FP4 TP4 DP8 EP8
- DeepSeek-R1-FP4 TP8
- DeepSeek-R1-FP4 TP8 MTP3
- DeepSeek-R1-FP4-MTP-MoEFP4 TP8
- DeepSeek-R1-FP4-MTP-MoEFP4 TP8 DP8 EP8
- DeepSeek-R1-FP4-MTP-MoEFP4 TP8 MTP3
- DeepSeek-R1-FP4 TP8 MTP1
Where a nightly/full accuracy DeepSeek case has a counterpart in the benchmark catalog, the server args are kept identical. The main differences are that accuracy does not set `AITER_QUICK_REDUCE_QUANTIZATION=INT4` and that it uses a different runner from the benchmark.
Of these cases, `Qwen3.5-35B-A3B*` and the legacy `DeepSeek-R1-FP4-MTP-MoEFP4*` are accuracy-only coverage and have no corresponding benchmark entry in the benchmark catalog.
Mesh Benchmark Workflow
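The changes to `atom-sglang-benchmark.yaml` and `atom_sglang_mesh_benchmark.sh` let the Mesh benchmark consume the model config more accurately:
- `case_env_vars_by_pair` is consumed.
- `case_extra_args_by_pair` is consumed for both the `sglang-atom` and `sglang-mori` server modes.
- `case_extra_args_by_pair` injects `--chunked-prefill-size 65536`.
- `atom_sglang_mesh_benchmark.sh` parses `--chunked-prefill-size` out of `SERVER_EXTRA_ARGS` and converts it into the script's internal prefill parameter, which avoids duplicated or inconsistent arguments.
DeepSeek Recipe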
`recipes/atom_sglang/DeepSeek-R1.md` is updated in step with the DeepSeek-R1-0528 SGLang usage notes:
- `deepseek-ai/DeepSeek-R1-0528`: FP8.
- `amd/DeepSeek-R1-0528-MXFP4-v2`: FP4.
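The recipe's actual commands are not reproduced in this description, so the following is only a sketch of how a launch command for the two checkpoints might be assembled from flags named in the benchmark section. `build_launch_command` is a hypothetical helper. The `python3 -m sglang.launch_server` entry point and the `--model-path` and `--tp-size` options are standard SGLang usage, but whether the recipe uses this exact flag set, these values, or TP8 is an assumption.

```python
import shlex


def build_launch_command(model_path: str, tp_size: int) -> list[str]:
    """Assemble an SGLang server command line (illustrative flag set)."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        # Flags below are the ones listed in the benchmark section of this PR.
        "--trust-remote-code",
        "--attention-backend", "aiter",
        "--kv-cache-dtype", "fp8_e4m3",
        "--page-size", "1",
        "--disable-radix-cache",
        "--mem-fraction-static", "0.85",
    ]


commands = {
    model_path: build_launch_command(model_path, tp_size=8)  # TP8 chosen for illustration
    for model_path in ("deepseek-ai/DeepSeek-R1-0528", "amd/DeepSeek-R1-0528-MXFP4-v2")
}
for argv in commands.values():
    print(shlex.join(argv))
```

Building the command as a list and quoting it only for display keeps each flag and its value together, which makes it harder to pass the same option twice.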