[AMD] improve dsr1 fp4 disagg perf on mi355x #983
Conversation
Transformers v5 incorrectly rebuilds the pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token-count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. This change applies the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4. Made-with: Cursor
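The client-side fix described above can be sketched as follows. This is a hedged, duck-typed illustration, not the actual sglang helper: the function name `restore_tokenizer_components`, the attribute layout, and the stub classes are assumptions, and real code would rebuild the components via the `tokenizers` library rather than assigning raw dicts.

```python
import json
import tempfile

def restore_tokenizer_components(tokenizer, tokenizer_json_path):
    """Re-apply the pre_tokenizer/decoder serialized in tokenizer.json to the
    fast tokenizer's backend and recover add_bos_token from the post-processor
    template, so the benchmark client tokenizes the same way as the server.
    No-op when the file lacks those sections (transformers v4 never rebuilds
    them, so there is nothing to restore)."""
    with open(tokenizer_json_path) as f:
        spec = json.load(f)

    backend = tokenizer.backend_tokenizer
    # Restore the components that transformers v5 incorrectly rebuilt for
    # LlamaTokenizerFast models with a non-Llama tokenizer architecture.
    # (Real code would deserialize via the `tokenizers` library; plain dict
    # assignment stands in for that here.)
    if spec.get("pre_tokenizer") is not None:
        backend.pre_tokenizer = spec["pre_tokenizer"]
    if spec.get("decoder") is not None:
        backend.decoder = spec["decoder"]

    # Recover add_bos_token: if the single-sequence post-processing template
    # prepends the BOS special token, the flag should be True.
    bos = getattr(tokenizer, "bos_token", None)
    single = (spec.get("post_processor") or {}).get("single") or []
    tokenizer.add_bos_token = any(
        step.get("SpecialToken", {}).get("id") == bos for step in single
    )

# Minimal stand-ins for a fast tokenizer, used only to exercise the sketch.
class _Backend:
    pre_tokenizer = None
    decoder = None

class _Tokenizer:
    bos_token = "<BOS>"
    add_bos_token = False
    backend_tokenizer = _Backend()

spec = {
    "pre_tokenizer": {"type": "ByteLevel"},
    "decoder": {"type": "ByteLevel"},
    "post_processor": {
        "single": [
            {"SpecialToken": {"id": "<BOS>", "type_id": 0}},
            {"Sequence": {"id": "A", "type_id": 0}},
        ]
    },
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(spec, f)
    path = f.name

tok = _Tokenizer()
restore_tokenizer_components(tok, path)
print(tok.add_bos_token, tok.backend_tokenizer.pre_tokenizer["type"])
```

Because the restoration only fires when tokenizer.json actually serializes these sections, the same helper is safe to call unconditionally on both transformers v4 and v5.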
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
  dsr1-fp8-mi355x-sglang-disagg:
-   image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2
+   image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0327
In early March, you said that after consulting with @HaiShaw and others in the org, you would be using upstream images by the end of March. Can you please update this to use upstream nightly images instead of second-class forks?
Let's ensure that we work towards AMD being a first-class platform on sglang instead of continuing to submit second-class forks.
The new patch adds the following optimization: