Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
perf-changelog.yaml
Outdated
- qwen3.5-fp8-h200-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
🟡 Nit: The pr-link for the new qwen3.5-fp8-h200-sglang-mtp entry uses a placeholder /pull/XXX instead of /pull/921. Please update before merging.
Bug Description
The new perf-changelog entry added at line 987 for qwen3.5-fp8-h200-sglang-mtp uses a placeholder PR link:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
instead of the actual PR number:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
Code Path
The diff adds a new changelog block at the end of perf-changelog.yaml (lines 982-987). Every other finalized entry in the file has a concrete PR number in its pr-link field, making this one an outlier that needs updating.
Pre-existing Context
There are several other pre-existing XXX placeholders in the file (e.g., for glm5-fp8-mi355x-sglang, dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.
Impact
The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.
Suggested Fix
Replace XXX with 921 on line 987:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before the final merge. Flagging it here as a reminder.
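Since the file contains other pre-existing XXX placeholders that belong to other PRs, a scoped substitution avoids touching them. One way to sketch this (the sed address range assumes the pr-link line follows the entry name, as in the diff above):

```shell
# Limit the substitution to the qwen3.5-fp8-h200-sglang-mtp block so the
# pre-existing XXX placeholders for other entries stay untouched.
sed -i '/qwen3.5-fp8-h200-sglang-mtp/,/pr-link/ s#pull/XXX#pull/921#' perf-changelog.yaml
```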
updated
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-num-draft-tokens 3 \
--speculative-eagle-topk 1 \
Nitpicking here, but can you please set these values as environment variables?
Refer 👇
InferenceX/benchmarks/single_node/qwen3.5_fp8_b200_mtp.sh
Lines 44 to 47 in 26d415b
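A parameterized version of the speculative flags might look like this (a sketch only; the variable names are assumptions modeled on the b200 script's pattern, and the defaults mirror the hard-coded values in the diff above):

```shell
# Let the harness override speculative-decoding settings; defaults mirror
# the hard-coded values above (names/defaults are assumptions).
SPECULATIVE_NUM_STEPS=${SPECULATIVE_NUM_STEPS:-2}
SPECULATIVE_NUM_DRAFT_TOKENS=${SPECULATIVE_NUM_DRAFT_TOKENS:-3}
SPECULATIVE_EAGLE_TOPK=${SPECULATIVE_EAGLE_TOPK:-1}

# Collect the flags so they can be spliced into the launch command,
# e.g. python -m sglang.launch_server ... "${SPEC_ARGS[@]}"
SPEC_ARGS=(
  --speculative-algorithm EAGLE
  --speculative-num-steps "$SPECULATIVE_NUM_STEPS"
  --speculative-num-draft-tokens "$SPECULATIVE_NUM_DRAFT_TOKENS"
  --speculative-eagle-topk "$SPECULATIVE_EAGLE_TOPK"
)
echo "${SPEC_ARGS[@]}"
```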
SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MAX_SEQ_LEN=$((ISL + OSL + 20))
You can use MAX_MODEL_LEN here. This env var is made available to the benchmark script just like TP, CONC, etc.
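For instance, the snippet could honor the harness-provided variable while keeping the old computation as a fallback (a sketch; it assumes MAX_MODEL_LEN is exported to the script like TP and CONC, and the ISL/OSL defaults are illustrative):

```shell
PORT=${PORT:-8888}
# Illustrative defaults; the real script receives ISL/OSL from the harness.
ISL=${ISL:-1024}
OSL=${OSL:-1024}
# Prefer the harness-provided MAX_MODEL_LEN; fall back to the local
# ISL+OSL+20 computation only when it is unset.
MAX_MODEL_LEN=${MAX_MODEL_LEN:-$((ISL + OSL + 20))}
echo "$MAX_MODEL_LEN"
```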
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--decode-log-interval 1 \
Are you sure we need the --decode-log-interval flag when benchmarking?
@claude Add description based on diff and perf-changelog
Claude finished @Ankur-singh's task in 49s. Adding PR Description:
PR description has been updated with a summary of all changes: new benchmark script, nvidia-master.yaml config entry, and perf-changelog entry.
Updated PR link formatting in the changelog.
@functionstackx can you please review the PR?
@functionstackx - could you please help review this?
--tp "$TP" \
--expert-parallel-size "$EP_SIZE" \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--cuda-graph-max-bs "$CONC" \
--context-length "$MAX_MODEL_LEN" \
--kv-cache-dtype fp8_e4m3 \
--quantization fp8 \
--attention-backend flashinfer \
--stream-interval 50 \
--tokenizer-worker-num 6 \
--mamba-ssm-dtype bfloat16 \
--disable-radix-cache \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps "$SPECULATIVE_NUM_STEPS" \
add and merge it to cookbook please
working on this, will share PR link here
functionstackx left a comment:
PR generally looks good to me, needs recipe & deleting 1k8k tho
Removed duplicate configuration for isl: 1024 and osl: 8192.
@functionstackx - Resolved the comments, could you review again?
thanks for addressing it, can you please share the cookbook recipe PR link
sure, will share once the PR is ready
@faradawn is working on the cookbook PR. Can we get a review, @cquil11 @functionstackx?
@kedarpotdar-nv the PR LGTM. Just waiting on the recipe before we can merge. Does NVIDIA have a Claude slash command to make it easy to convert to SGLang cookbook recipes?

Summary
Add Qwen3.5-397B-A17B-FP8 MTP benchmark configuration for H200 single-node with SGLang and EAGLE speculative decoding.
Changes
New benchmark script:
- benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh
- Model: Qwen/Qwen3.5-397B-A17B-FP8
- EAGLE speculative decoding: num-steps=3, draft-tokens=4, eagle-topk=1
- --enable-flashinfer-allreduce-fusion
- --reasoning-parser qwen3, --tool-call-parser qwen3_coder
New config:
- qwen3.5-fp8-h200-sglang-mtp in nvidia-master.yaml
- Image: lmsysorg/sglang:v0.5.9-cu129-amd64
Changelog:
- New entry in perf-changelog.yaml for qwen3.5-fp8-h200-sglang-mtp documenting the new config.