Add Qwen3.5 h200 MTP #921

Open: hshrivastava-droid wants to merge 32 commits into main from nv/h200-qwen35

Conversation

@hshrivastava-droid (Collaborator) commented Mar 20, 2026

Summary

Add Qwen3.5-397B-A17B-FP8 MTP benchmark configuration for H200 single-node with SGLang and EAGLE speculative decoding.

Changes

New benchmark script: benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh

  • Model: Qwen/Qwen3.5-397B-A17B-FP8
  • Framework: SGLang with EAGLE speculative decoding
  • MTP Config: num-steps=3, draft-tokens=4, eagle-topk=1
  • Server flags: TP/EP from env, FP8 quantization, FP8 E4M3 KV cache, FlashInfer attention backend, chunked prefill (16384), mem-fraction-static 0.8, radix cache disabled, --enable-flashinfer-allreduce-fusion
  • Parsers: --reasoning-parser qwen3, --tool-call-parser qwen3_coder
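For illustration, the flags listed above would assemble into a launch command roughly like the following. This is a sketch only: TP, EP_SIZE, CONC, and MAX_MODEL_LEN are normally injected by the benchmark harness, and the defaults shown here are assumptions, not values from the script.

```shell
#!/usr/bin/env bash
# Sketch: reassemble the launch command from the flags summarized above.
# Harness-provided variables get illustrative fallbacks for standalone use.
TP=${TP:-8}
EP_SIZE=${EP_SIZE:-8}
CONC=${CONC:-128}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-16384}

CMD="python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B-FP8 \
  --tp $TP --expert-parallel-size $EP_SIZE \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --chunked-prefill-size 16384 --mem-fraction-static 0.8 \
  --context-length $MAX_MODEL_LEN \
  --cuda-graph-max-bs $CONC \
  --disable-radix-cache --enable-flashinfer-allreduce-fusion \
  --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4 \
  --speculative-eagle-topk 1"
echo "$CMD"
```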

New config: qwen3.5-fp8-h200-sglang-mtp in nvidia-master.yaml

  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Runner: H200 single-node
  • Search space: TP8/EP8, concurrency 4–128
  • Sequence lengths: 1k/1k, 1k/8k, 8k/1k
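For reference, the new config entry might look roughly like this. The field names below are guesses reconstructed from the summary, not the repo's actual nvidia-master.yaml schema:

```yaml
# Hypothetical sketch of the nvidia-master.yaml entry; key names are
# inferred from the PR summary and may not match the real schema.
qwen3.5-fp8-h200-sglang-mtp:
  image: lmsysorg/sglang:v0.5.9-cu129-amd64
  runner: h200-single-node
  script: benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh
  search-space:
    tp: [8]
    ep: [8]
    concurrency: [4, 8, 16, 32, 64, 128]
  sequence-lengths:
    - {isl: 1024, osl: 1024}
    - {isl: 1024, osl: 8192}
    - {isl: 8192, osl: 1024}
```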

perf-changelog.yaml

  • Added entry for qwen3.5-fp8-h200-sglang-mtp documenting the new config.

@github-actions (Contributor) commented

Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

2 similar comments

- qwen3.5-fp8-h200-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
🟡 Nit: The pr-link for the new qwen3.5-fp8-h200-sglang-mtp entry uses a placeholder /pull/XXX instead of /pull/921. Please update before merging.

Extended reasoning...

Bug Description

The new perf-changelog entry added at line 987 for qwen3.5-fp8-h200-sglang-mtp uses a placeholder PR link:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

instead of the actual PR number:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Code Path

The diff adds a new changelog block at the end of perf-changelog.yaml (lines 982-987). Every other entry in the file that was finalized has a concrete PR number in its pr-link field, making this an outlier that needs updating.

Pre-existing Context

There are several other pre-existing XXX placeholders in the file (e.g., for glm5-fp8-mi355x-sglang, dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.

Impact

The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.

Suggested Fix

Replace XXX with 921 on line 987:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before final merge. Flagging it here as a reminder.

@hshrivastava-droid (Collaborator, Author) replied:

updated

Comment on lines +56 to +59
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-num-draft-tokens 3 \
--speculative-eagle-topk 1 \

Nitpicking here: can you please set these values as env variables?
For example:

# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
SPECULATIVE_NUM_STEPS=3
SPECULATIVE_DRAFT_TOKENS=4
SPECULATIVE_EAGLE_TOPK=1
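Applied to the script, the suggestion would look roughly like this (a sketch; the variable names follow the reviewer's snippet, and the defaults are the suggested values):

```shell
# Sketch: wire the env vars from the reviewer's snippet into the server
# flags so the MTP config is overridable without editing the script.
SPECULATIVE_NUM_STEPS=${SPECULATIVE_NUM_STEPS:-3}
SPECULATIVE_DRAFT_TOKENS=${SPECULATIVE_DRAFT_TOKENS:-4}
SPECULATIVE_EAGLE_TOPK=${SPECULATIVE_EAGLE_TOPK:-1}

SPEC_FLAGS="--speculative-algorithm EAGLE \
  --speculative-num-steps $SPECULATIVE_NUM_STEPS \
  --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
  --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK"
echo "$SPEC_FLAGS"
```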


SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MAX_SEQ_LEN=$((ISL + OSL + 20))
@ankursingh-nv (Contributor) commented Mar 24, 2026:

You can use MAX_MODEL_LEN here. This env is made available to the benchmark script just like TP, CONC, etc.
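A minimal sketch of that change, assuming MAX_MODEL_LEN is exported by the harness; the ISL/OSL fallback is only there so the script still runs standalone:

```shell
# Sketch: prefer the harness-provided MAX_MODEL_LEN; fall back to the
# script's original ISL/OSL computation when it is not set.
ISL=${ISL:-1024}
OSL=${OSL:-1024}
MAX_SEQ_LEN=${MAX_MODEL_LEN:-$((ISL + OSL + 20))}
echo "$MAX_SEQ_LEN"
```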

--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--decode-log-interval 1 \
@ankursingh-nv (Contributor) commented Mar 24, 2026:

Are you sure we need the --decode-log-interval flag when benchmarking?

@kedarpotdar-nv kedarpotdar-nv changed the title [WIP] Add Qwen3.5 h200 MTP Add Qwen3.5 h200 MTP Mar 28, 2026
@Ankur-singh (Collaborator) commented:

@claude Add description based on diff and perf-changelog

@Klaud-Cold (Collaborator) commented Mar 30, 2026:

Claude finished @Ankur-singh's task in 49s.


Adding PR Description

  • Gather context from diff and perf-changelog
  • Draft and update PR description

PR description has been updated with a summary of all changes: new benchmark script, nvidia-master.yaml config entry, and perf-changelog entry for qwen3.5-fp8-h200-sglang-mtp.

@Ankur-singh (Collaborator) commented:

@functionstackx can you please review the PR?

@jgangani (Collaborator) left a comment:

Looks good to me.

@hshrivastava-droid (Collaborator, Author) commented:

@functionstackx - could you please help review this?

Comment on lines +42 to +61
--tp "$TP" \
--expert-parallel-size "$EP_SIZE" \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--cuda-graph-max-bs "$CONC" \
--context-length "$MAX_MODEL_LEN" \
--kv-cache-dtype fp8_e4m3 \
--quantization fp8 \
--attention-backend flashinfer \
--stream-interval 50 \
--tokenizer-worker-num 6 \
--mamba-ssm-dtype bfloat16 \
--disable-radix-cache \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps "$SPECULATIVE_NUM_STEPS" \

Please add this to the cookbook and merge it there:

https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5

@hshrivastava-droid (Collaborator, Author) replied:

working on this, will share PR link here

@functionstackx (Contributor) left a comment:

PR generally looks good to me; it still needs the cookbook recipe and the removal of the duplicate 1k/8k config, though.

Removed duplicate configuration for isl: 1024 and osl: 8192.
@hshrivastava-droid (Collaborator, Author) commented:

@functionstackx - Resolved the comments, could you review again?

@functionstackx (Contributor) commented:

> @functionstackx - Resolved the comments, could you review again

Thanks for addressing it. Can you please share the cookbook recipe PR link?

@hshrivastava-droid (Collaborator, Author) commented:

Sure, will share once the PR is ready.

@kedarpotdar-nv (Collaborator) commented:

@faradawn is working on the cookbook PR. Can we get a review, @cquil11 @functionstackx?

@functionstackx (Contributor) commented:

@kedarpotdar-nv the PR LGTM; just waiting on the recipe before we can merge. Does NVIDIA have a Claude slash command to make it easy to convert these to SGLang cookbook recipes?

7 participants