Add Qwen3.5 h200 MTP #921

Open: hshrivastava-droid wants to merge 32 commits into main from nv/h200-qwen35

Conversation

@hshrivastava-droid (Collaborator) commented Mar 20, 2026

Summary

Add Qwen3.5-397B-A17B-FP8 MTP benchmark configuration for H200 single-node with SGLang and EAGLE speculative decoding.

Changes

New benchmark script: benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh

  • Model: Qwen/Qwen3.5-397B-A17B-FP8
  • Framework: SGLang with EAGLE speculative decoding
  • MTP Config: num-steps=3, draft-tokens=4, eagle-topk=1
  • Server flags: TP/EP from env, FP8 quantization, FP8 E4M3 KV cache, FlashInfer attention backend, chunked prefill (16384), mem-fraction-static 0.8, radix cache disabled, --enable-flashinfer-allreduce-fusion
  • Parsers: --reasoning-parser qwen3, --tool-call-parser qwen3_coder
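For illustration, the flags listed above would assemble into a launch command roughly like the following. This is a sketch only: TP, EP_SIZE, CONC, and MAX_MODEL_LEN are normally injected by the benchmark harness, and the defaults shown here are assumptions, not values from the script.

```shell
#!/usr/bin/env bash
# Sketch: reassemble the launch command from the flags summarized above.
# Harness-provided variables get illustrative fallbacks for standalone use.
TP=${TP:-8}
EP_SIZE=${EP_SIZE:-8}
CONC=${CONC:-128}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-16384}

CMD="python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B-FP8 \
  --tp $TP --expert-parallel-size $EP_SIZE \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --chunked-prefill-size 16384 --mem-fraction-static 0.8 \
  --context-length $MAX_MODEL_LEN \
  --cuda-graph-max-bs $CONC \
  --disable-radix-cache --enable-flashinfer-allreduce-fusion \
  --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4 \
  --speculative-eagle-topk 1"
echo "$CMD"
```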

New config: qwen3.5-fp8-h200-sglang-mtp in nvidia-master.yaml

  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Runner: H200 single-node
  • Search space: TP8/EP8, concurrency 4–128
  • Sequence lengths: 1k/1k, 1k/8k, 8k/1k
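For reference, the new config entry might look roughly like this. The field names below are guesses reconstructed from the summary, not the repo's actual nvidia-master.yaml schema:

```yaml
# Hypothetical sketch of the nvidia-master.yaml entry; key names are
# inferred from the PR summary and may not match the real schema.
qwen3.5-fp8-h200-sglang-mtp:
  image: lmsysorg/sglang:v0.5.9-cu129-amd64
  runner: h200-single-node
  script: benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh
  search-space:
    tp: [8]
    ep: [8]
    concurrency: [4, 8, 16, 32, 64, 128]
  sequence-lengths:
    - {isl: 1024, osl: 1024}
    - {isl: 1024, osl: 8192}
    - {isl: 8192, osl: 1024}
```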

perf-changelog.yaml

  • Added entry for qwen3.5-fp8-h200-sglang-mtp documenting the new config.

@github-actions (Contributor) commented

Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

2 similar comments

- qwen3.5-fp8-h200-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
🟡 Nit: The pr-link for the new qwen3.5-fp8-h200-sglang-mtp entry uses a placeholder /pull/XXX instead of /pull/921. Please update before merging.

Extended reasoning...

Bug Description

The new perf-changelog entry added at line 987 for qwen3.5-fp8-h200-sglang-mtp uses a placeholder PR link:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

instead of the actual PR number:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Code Path

The diff adds a new changelog block at the end of perf-changelog.yaml (lines 982-987). Every other entry in the file that was finalized has a concrete PR number in its pr-link field, making this an outlier that needs updating.

Pre-existing Context

There are several other pre-existing XXX placeholders in the file (e.g., for glm5-fp8-mi355x-sglang, dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.

Impact

The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.

Suggested Fix

Replace XXX with 921 on line 987:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before final merge. Flagging it here as a reminder.

@hshrivastava-droid (Collaborator, Author) replied:

updated

Comment on lines +56 to +59
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-num-draft-tokens 3 \
--speculative-eagle-topk 1 \

Nitpicking here: can you please set these values as env variables?
For example:

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1
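Applied to the script, the suggestion would look roughly like this (a sketch; the variable names follow the reviewer's snippet, and the defaults are the suggested values):

```shell
# Sketch: wire the env vars from the reviewer's snippet into the server
# flags so the MTP config is overridable without editing the script.
SPECULATIVE_NUM_STEPS=${SPECULATIVE_NUM_STEPS:-3}
SPECULATIVE_DRAFT_TOKENS=${SPECULATIVE_DRAFT_TOKENS:-4}
SPECULATIVE_EAGLE_TOPK=${SPECULATIVE_EAGLE_TOPK:-1}

SPEC_FLAGS="--speculative-algorithm EAGLE \
  --speculative-num-steps $SPECULATIVE_NUM_STEPS \
  --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
  --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK"
echo "$SPEC_FLAGS"
```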


SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MAX_SEQ_LEN=$((ISL + OSL + 20))
@ankursingh-nv (Contributor) commented Mar 24, 2026:

You can use MAX_MODEL_LEN here. This env is made available to the benchmark script just like TP, CONC, etc.
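A minimal sketch of that change, assuming MAX_MODEL_LEN is exported by the harness; the ISL/OSL fallback is only there so the script still runs standalone:

```shell
# Sketch: prefer the harness-provided MAX_MODEL_LEN; fall back to the
# script's original ISL/OSL computation when it is not set.
ISL=${ISL:-1024}
OSL=${OSL:-1024}
MAX_SEQ_LEN=${MAX_MODEL_LEN:-$((ISL + OSL + 20))}
echo "$MAX_SEQ_LEN"
```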

--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--decode-log-interval 1 \
@ankursingh-nv (Contributor) commented Mar 24, 2026:

Are you sure we need the --decode-log-interval flag when benchmarking?

@kedarpotdar-nv kedarpotdar-nv changed the title [WIP] Add Qwen3.5 h200 MTP Add Qwen3.5 h200 MTP Mar 28, 2026
@Ankur-singh (Collaborator) commented:

@claude Add description based on diff and perf-changelog

@Klaud-Cold (Collaborator) commented Mar 30, 2026:

Claude finished @Ankur-singh's task in 49s.


Adding PR Description

  • Gather context from diff and perf-changelog
  • Draft and update PR description

PR description has been updated with a summary of all changes: new benchmark script, nvidia-master.yaml config entry, and perf-changelog entry for qwen3.5-fp8-h200-sglang-mtp.

@Ankur-singh (Collaborator) commented:

@functionstackx can you please review the PR?

@jgangani (Collaborator) left a comment:

Looks good to me.

@hshrivastava-droid (Collaborator, Author) commented:

@functionstackx - could you please help review this?

Comment on lines +42 to +61
--tp "$TP" \
--expert-parallel-size "$EP_SIZE" \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--cuda-graph-max-bs "$CONC" \
--context-length "$MAX_MODEL_LEN" \
--kv-cache-dtype fp8_e4m3 \
--quantization fp8 \
--attention-backend flashinfer \
--stream-interval 50 \
--tokenizer-worker-num 6 \
--mamba-ssm-dtype bfloat16 \
--disable-radix-cache \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps "$SPECULATIVE_NUM_STEPS" \

Please add this to the cookbook and merge it there:

https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5

@hshrivastava-droid (Collaborator, Author) replied:

working on this, will share PR link here

@functionstackx (Contributor) left a comment:

PR generally looks good to me; it still needs the cookbook recipe and the removal of the duplicate 1k/8k config, though.

Removed duplicate configuration for isl: 1024 and osl: 8192.
@hshrivastava-droid (Collaborator, Author) commented:

@functionstackx - Resolved the comments, could you review again?

@functionstackx (Contributor) commented:

> @functionstackx - Resolved the comments, could you review again

Thanks for addressing it. Can you please share the cookbook recipe PR link?

@hshrivastava-droid (Collaborator, Author) commented:

Sure, will share once the PR is ready.

@kedarpotdar-nv (Collaborator) commented:

@faradawn is working on the cookbook PR. Can we get a review, @cquil11 @functionstackx?

@functionstackx (Contributor) commented:

@kedarpotdar-nv the PR LGTM; just waiting on the recipe before we can merge. Does NVIDIA have a Claude slash command to make it easy to convert these to SGLang cookbook recipes?

7 participants