
[Do not merge][AMD] Update cli args for qwen3.5 #980

Open
zhentaocc wants to merge 10 commits into main from update_cli_args_qwen

Conversation

@zhentaocc (Collaborator) commented Mar 30, 2026

New perf data:

BF16 local results

conc 64, 1k in / 1k out: TPUT 501.29 → 661.46 tokens/s/gpu, a 31.95% boost. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  222.97    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587369    
Request throughput (req/s):              2.87      
Input token throughput (tok/s):          2649.88   
Output token throughput (tok/s):         2641.81   
Peak output token throughput (tok/s):    3137.00   
Peak concurrent requests:                80        
Total token throughput (tok/s):          5291.69   
Concurrency:                             62.41     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21744.73  
Median E2E Latency (ms):                 21584.66  
P90 E2E Latency (ms):                    24098.07  
P99 E2E Latency (ms):                    25117.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          500.13    
Median TTFT (ms):                        475.10    
P99 TTFT (ms):                           1085.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.11     
Median TPOT (ms):                        23.18     
P99 TPOT (ms):                           24.63     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           23.11     
Median ITL (ms):                         20.92     
P95 ITL (ms):                            22.33     
P99 ITL (ms):                            106.41    
Max ITL (ms):                            1403.09   
==================================================
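The 31.95% figure quoted above can be reproduced with quick arithmetic from the two per-GPU throughput numbers:

```python
# Reproduce the quoted per-GPU throughput gain for the BF16 run.
baseline = 501.29  # tokens/s/gpu before this PR (quoted above)
updated = 661.46   # tokens/s/gpu with the new CLI args
boost_pct = (updated / baseline - 1) * 100
print(f"{boost_pct:.2f}% boost")  # → 31.95% boost
```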

FP8 local test results

conc 64, 1k in / 1k out: TPUT 708.75 tokens/s/gpu. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  209.64    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    554942    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          2818.38   
Output token throughput (tok/s):         2809.80   
Peak output token throughput (tok/s):    3682.00   
Peak concurrent requests:                82        
Total token throughput (tok/s):          5628.19   
Concurrency:                             62.31     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20411.98  
Median E2E Latency (ms):                 20282.34  
P90 E2E Latency (ms):                    23190.49  
P99 E2E Latency (ms):                    26606.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          455.92    
Median TTFT (ms):                        423.01    
P99 TTFT (ms):                           1617.35   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.71     
Median TPOT (ms):                        21.62     
P99 TPOT (ms):                           26.91     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.71     
Median ITL (ms):                         19.30     
P95 ITL (ms):                            20.90     
P99 ITL (ms):                            89.05     
Max ITL (ms):                            3259.61   
==================================================
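A few of the derived metrics in the FP8 report can be cross-checked from the raw counts (the total-throughput figure agrees only to within rounding, since the printed duration is truncated):

```python
# Cross-check derived metrics in the FP8 benchmark report above.
duration_s = 209.64
requests = 640
in_tokens = 590851
out_tokens = 589052

req_tput = requests / duration_s                    # reported: 3.05 req/s
total_tput = (in_tokens + out_tokens) / duration_s  # reported: 5628.19 tok/s
print(round(req_tput, 2), round(total_tput, 2))
```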

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

Comment on lines +1226 to +1229
description:
- "Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X to achieve better performance"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942

Contributor

🟡 The new perf-changelog entry references pr-link #942, but the CLI arg changes it describes (adding --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, etc.) are physically introduced in this PR (#980). The pr-link should point to #980 to maintain accurate changelog cross-referencing.

Extended reasoning...

What the bug is: The perf-changelog.yaml entry added by this PR (#980) contains pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942. The description says 'Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X to achieve better performance', which is precisely what this PR does.

The specific code path: The diff in this PR adds seven new CLI arguments to both qwen3.5_bf16_mi355x.sh and qwen3.5_fp8_mi355x.sh: --tokenizer-worker-num 6, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs $CONC, --context-length $CONTEXT_LENGTH, --disable-radix-cache, --max-prefill-tokens $MAX_PREFILL_TOKENS, and --scheduler-recv-interval 30. These are the exact changes described in the perf-changelog entry.

Why existing code doesn't prevent it: The perf-changelog is a manually maintained YAML file. There's no automated validation that a pr-link matches the PR number that introduces the described changes.

Addressing the refutation: The refutation argues that because the PR title contains '[Do not merge]', PR #942 is the intended merge vehicle, making the pr-link intentional. However, this reasoning is speculative. Looking at the existing changelog history, PR #942 has no prior entry—there is an entry pointing to PR #943 (a separate SGLang image update), but nothing for #942. If #942 were the real merge vehicle, it would already have its own changelog entry reflecting any prior changes. The '[Do not merge]' prefix is commonly used for testing/review branches, and doesn't conclusively establish that #942 will absorb these exact changes. If this PR is ever merged or its changes cherry-picked into another PR with a different number, the #942 link will be a broken cross-reference.

Impact: If the changelog entry is merged with a link to #942, anyone trying to trace the origin of these CLI arg changes will land on the wrong PR. This is a documentation/traceability issue with no functional impact.

How to fix: Change pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942 to pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/980 (or whichever PR number this branch ultimately merges as).

Step-by-step proof:

  1. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 modifies qwen3.5_bf16_mi355x.sh adding 7 new CLI args.
  2. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 modifies qwen3.5_fp8_mi355x.sh adding the same 7 CLI args.
  3. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 adds a perf-changelog entry describing 'Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X'.
  4. The pr-link in that entry is pull/942, not pull/980.
  5. Searching the existing perf-changelog shows no prior entry for PR [WIP][AMD]feat: enhance Qwen benchmark scripts with additional parameters #942, ruling out that [WIP][AMD]feat: enhance Qwen benchmark scripts with additional parameters #942 is a known related PR.
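The review notes that nothing currently validates that a `pr-link` matches the PR introducing the entry. A minimal pre-merge check could look like the sketch below (this is not the repo's actual tooling; the field name is taken from the snippet above, and the `current_pr` value is illustrative):

```python
import re

def check_pr_links(changelog_text: str, current_pr: int) -> list:
    """Return pr-link values that do not point at the given PR number.

    Sketch only: scans raw text for 'pr-link:' fields rather than
    parsing the full YAML schema.
    """
    links = re.findall(r"pr-link:\s*(\S+)", changelog_text)
    bad = []
    for link in links:
        m = re.search(r"/pull/(\d+)$", link)
        if not m or int(m.group(1)) != current_pr:
            bad.append(link)
    return bad

entry = "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942"
# Checking the entry against this PR's number flags the mismatch:
print(check_pr_links(entry, current_pr=980))
```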

@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23739126783
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: 8ea832e
Approval: not required (trusted collaborator).

@SemiAnalysisAI deleted a comment from github-actions bot Mar 30, 2026
@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23751700196
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: ffacd58
Approval: not required (trusted collaborator).

@cquil11 (Collaborator) commented Mar 30, 2026

@zhentaocc these runs are hanging and clogging up the CI queue: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23751700196/job/69195336352

Comment on lines +40 to +46
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--context-length $CONTEXT_LENGTH \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
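For context, the flagged flags would slot into a full launch command roughly like this (a sketch only: the model path placeholder and the values of CONC, CONTEXT_LENGTH, and MAX_PREFILL_TOKENS are assumptions, not taken from the PR's scripts):

```shell
# Illustrative values; the real scripts set these elsewhere.
CONC=64
CONTEXT_LENGTH=4096
MAX_PREFILL_TOKENS=8192

# Assemble the new CLI args from this PR's diff.
ARGS="--tokenizer-worker-num 6"
ARGS="$ARGS --enable-aiter-allreduce-fusion"
ARGS="$ARGS --cuda-graph-max-bs $CONC"
ARGS="$ARGS --context-length $CONTEXT_LENGTH"
ARGS="$ARGS --disable-radix-cache"
ARGS="$ARGS --max-prefill-tokens $MAX_PREFILL_TOKENS"
ARGS="$ARGS --scheduler-recv-interval 30"

echo "python3 -m sglang.launch_server --model-path <model> $ARGS"
```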
Contributor

awesome work! @zhentaocc

Can you create a PR first in the SGLang cookbook (https://github.com/sgl-project/sgl-cookbook) before we merge your PR into the master branch? Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

It seems the SGLang recipe doesn't have it yet:
https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5
[screenshot of the cookbook page]

@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23790183854
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: a350dc3
Approval: not required (trusted collaborator).

Chen, Todd added 10 commits March 31, 2026 04:34
* Added CONTEXT_LENGTH and MAX_PREFILL_TOKENS variables for better configuration.
* Updated launch_server command with new options: --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, --context-length, --disable-radix-cache, --max-prefill-tokens, and --scheduler-recv-interval.
… benchmark configurations for MI355X, enhancing performance with updated CLI arguments.
….yaml to v0.5.9, ensuring compatibility with recent changes.
… and BF16 SGLang benchmarks on MI355X, ensuring accurate tracking of performance enhancements.
… configurations and adjust perf-changelog.yaml to reflect the changes, ensuring accurate performance tracking and compatibility.
…ngelog.yaml to reflect improved CLI arguments for MI355X, ensuring better performance tracking.
…ter and adjusting memory fraction. Updated launch_server command to include data-parallel-size and improved context length handling for better performance.
…chmarks, increasing conc-end values and adding new entries for improved performance tuning on MI355X and MI300X.
…cripts for MI355X to streamline configuration and improve performance.
@zhentaocc force-pushed the update_cli_args_qwen branch from a350dc3 to 6fe8422 on March 31, 2026 09:34