
[Do not merge][AMD] Update cli args for qwen3.5 #980

Open
zhentaocc wants to merge 10 commits into main from update_cli_args_qwen

Conversation

@zhentaocc (Collaborator) commented Mar 30, 2026

New perf data:

BF16 local results

conc 64, 1k in / 1k out: TPUT 501.29 → 661.46 tokens/s/gpu, a 31.95% boost. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  222.97    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587369    
Request throughput (req/s):              2.87      
Input token throughput (tok/s):          2649.88   
Output token throughput (tok/s):         2641.81   
Peak output token throughput (tok/s):    3137.00   
Peak concurrent requests:                80        
Total token throughput (tok/s):          5291.69   
Concurrency:                             62.41     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21744.73  
Median E2E Latency (ms):                 21584.66  
P90 E2E Latency (ms):                    24098.07  
P99 E2E Latency (ms):                    25117.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          500.13    
Median TTFT (ms):                        475.10    
P99 TTFT (ms):                           1085.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.11     
Median TPOT (ms):                        23.18     
P99 TPOT (ms):                           24.63     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           23.11     
Median ITL (ms):                         20.92     
P95 ITL (ms):                            22.33     
P99 ITL (ms):                            106.41    
Max ITL (ms):                            1403.09   
==================================================
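The 31.95% figure quoted above can be reproduced with quick arithmetic from the two per-GPU throughput numbers:

```python
# Reproduce the quoted per-GPU throughput gain for the BF16 run.
baseline = 501.29  # tokens/s/gpu before this PR (quoted above)
updated = 661.46   # tokens/s/gpu with the new CLI args
boost_pct = (updated / baseline - 1) * 100
print(f"{boost_pct:.2f}% boost")  # → 31.95% boost
```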

FP8 local test results

conc 64, 1k in / 1k out: TPUT 708.75 tokens/s/gpu. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  209.64    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    554942    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          2818.38   
Output token throughput (tok/s):         2809.80   
Peak output token throughput (tok/s):    3682.00   
Peak concurrent requests:                82        
Total token throughput (tok/s):          5628.19   
Concurrency:                             62.31     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20411.98  
Median E2E Latency (ms):                 20282.34  
P90 E2E Latency (ms):                    23190.49  
P99 E2E Latency (ms):                    26606.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          455.92    
Median TTFT (ms):                        423.01    
P99 TTFT (ms):                           1617.35   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.71     
Median TPOT (ms):                        21.62     
P99 TPOT (ms):                           26.91     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.71     
Median ITL (ms):                         19.30     
P95 ITL (ms):                            20.90     
P99 ITL (ms):                            89.05     
Max ITL (ms):                            3259.61   
==================================================
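A few of the derived metrics in the FP8 report can be cross-checked from the raw counts (the total-throughput figure agrees only to within rounding, since the printed duration is truncated):

```python
# Cross-check derived metrics in the FP8 benchmark report above.
duration_s = 209.64
requests = 640
in_tokens = 590851
out_tokens = 589052

req_tput = requests / duration_s                    # reported: 3.05 req/s
total_tput = (in_tokens + out_tokens) / duration_s  # reported: 5628.19 tok/s
print(round(req_tput, 2), round(total_tput, 2))
```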

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

Comment on lines +1226 to +1229
description:
- "Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X to achieve better performance"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942

Contributor

🟡 The new perf-changelog entry references pr-link #942, but the CLI arg changes it describes (adding --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, etc.) are physically introduced in this PR (#980). The pr-link should point to #980 to maintain accurate changelog cross-referencing.

Extended reasoning...

What the bug is: The perf-changelog.yaml entry added by this PR (#980) contains pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942. The description says 'Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X to achieve better performance', which is precisely what this PR does.

The specific code path: The diff in this PR adds seven new CLI arguments to both qwen3.5_bf16_mi355x.sh and qwen3.5_fp8_mi355x.sh: --tokenizer-worker-num 6, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs $CONC, --context-length $CONTEXT_LENGTH, --disable-radix-cache, --max-prefill-tokens $MAX_PREFILL_TOKENS, and --scheduler-recv-interval 30. These are the exact changes described in the perf-changelog entry.

Why existing code doesn't prevent it: The perf-changelog is a manually maintained YAML file. There's no automated validation that a pr-link matches the PR number that introduces the described changes.

Addressing the refutation: The refutation argues that because the PR title contains '[Do not merge]', PR #942 is the intended merge vehicle, making the pr-link intentional. However, this reasoning is speculative. Looking at the existing changelog history, PR #942 has no prior entry—there is an entry pointing to PR #943 (a separate SGLang image update), but nothing for #942. If #942 were the real merge vehicle, it would already have its own changelog entry reflecting any prior changes. The '[Do not merge]' prefix is commonly used for testing/review branches, and doesn't conclusively establish that #942 will absorb these exact changes. If this PR is ever merged or its changes cherry-picked into another PR with a different number, the #942 link will be a broken cross-reference.

Impact: If the changelog entry is merged with a link to #942, anyone trying to trace the origin of these CLI arg changes will land on the wrong PR. This is a documentation/traceability issue with no functional impact.

How to fix: Change pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942 to pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/980 (or whichever PR number this branch ultimately merges as).

Step-by-step proof:

  1. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 modifies qwen3.5_bf16_mi355x.sh adding 7 new CLI args.
  2. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 modifies qwen3.5_fp8_mi355x.sh adding the same 7 CLI args.
  3. PR [Do not merge][AMD]Update cli args for qwen3.5 #980 adds a perf-changelog entry describing 'Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI355X'.
  4. The pr-link in that entry is pull/942, not pull/980.
  5. Searching the existing perf-changelog shows no prior entry for PR [WIP][AMD]feat: enhance Qwen benchmark scripts with additional parameters #942, ruling out that [WIP][AMD]feat: enhance Qwen benchmark scripts with additional parameters #942 is a known related PR.
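The review notes that nothing currently validates that a `pr-link` matches the PR introducing the entry. A minimal pre-merge check could look like the sketch below (this is not the repo's actual tooling; the field name is taken from the snippet above, and the `current_pr` value is illustrative):

```python
import re

def check_pr_links(changelog_text: str, current_pr: int) -> list:
    """Return pr-link values that do not point at the given PR number.

    Sketch only: scans raw text for 'pr-link:' fields rather than
    parsing the full YAML schema.
    """
    links = re.findall(r"pr-link:\s*(\S+)", changelog_text)
    bad = []
    for link in links:
        m = re.search(r"/pull/(\d+)$", link)
        if not m or int(m.group(1)) != current_pr:
            bad.append(link)
    return bad

entry = "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/942"
# Checking the entry against this PR's number flags the mismatch:
print(check_pr_links(entry, current_pr=980))
```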

@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23739126783
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: 8ea832e
Approval: not required (trusted collaborator).

@SemiAnalysisAI deleted a comment from github-actions bot Mar 30, 2026
@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23751700196
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: ffacd58
Approval: not required (trusted collaborator).

@cquil11 (Collaborator) commented Mar 30, 2026

@zhentaocc these runs are hanging and clogging up the CI queue: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23751700196/job/69195336352

Comment on lines +40 to +46
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--context-length $CONTEXT_LENGTH \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
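For context, the flagged flags would slot into a full launch command roughly like this (a sketch only: the model path placeholder and the values of CONC, CONTEXT_LENGTH, and MAX_PREFILL_TOKENS are assumptions, not taken from the PR's scripts):

```shell
# Illustrative values; the real scripts set these elsewhere.
CONC=64
CONTEXT_LENGTH=4096
MAX_PREFILL_TOKENS=8192

# Assemble the new CLI args from this PR's diff.
ARGS="--tokenizer-worker-num 6"
ARGS="$ARGS --enable-aiter-allreduce-fusion"
ARGS="$ARGS --cuda-graph-max-bs $CONC"
ARGS="$ARGS --context-length $CONTEXT_LENGTH"
ARGS="$ARGS --disable-radix-cache"
ARGS="$ARGS --max-prefill-tokens $MAX_PREFILL_TOKENS"
ARGS="$ARGS --scheduler-recv-interval 30"

echo "python3 -m sglang.launch_server --model-path <model> $ARGS"
```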
Contributor

awesome work! @zhentaocc

Can you create a PR first in the SGLang cookbook (https://github.com/sgl-project/sgl-cookbook) before we merge your PR into the master branch? Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

It seems the SGLang recipe doesn't have it yet:
https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5
[screenshot of the cookbook page]

@zhentaocc (Collaborator, Author)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions (Contributor)

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23790183854
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: a350dc3
Approval: not required (trusted collaborator).

Chen, Todd added 10 commits March 31, 2026 04:34
* Added CONTEXT_LENGTH and MAX_PREFILL_TOKENS variables for better configuration.
* Updated launch_server command with new options: --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, --context-length, --disable-radix-cache, --max-prefill-tokens, and --scheduler-recv-interval.
… benchmark configurations for MI355X, enhancing performance with updated CLI arguments.
….yaml to v0.5.9, ensuring compatibility with recent changes.
… and BF16 SGLang benchmarks on MI355X, ensuring accurate tracking of performance enhancements.
… configurations and adjust perf-changelog.yaml to reflect the changes, ensuring accurate performance tracking and compatibility.
…ngelog.yaml to reflect improved CLI arguments for MI355X, ensuring better performance tracking.
…ter and adjusting memory fraction. Updated launch_server command to include data-parallel-size and improved context length handling for better performance.
…chmarks, increasing conc-end values and adding new entries for improved performance tuning on MI355X and MI300X.
…cripts for MI355X to streamline configuration and improve performance.
@zhentaocc force-pushed the update_cli_args_qwen branch from a350dc3 to 6fe8422 on March 31, 2026 09:34