Fix: fix possible result value error in internode and intranode EP #87
Conversation
Hi @dongmin-ra, thanks to the Moreh team! About Issue 1: have you run the test and seen the incorrect result? For the intranode kernel we set call_reset=False, which means that the reset operation will not be invoked. We will conduct further tests. FYI, we will release our new kernel soon.
Yes, I found that the response comes out incorrectly when sending a curl request in the vLLM + mori environment. The response contained random output strings. (FYI, the vLLM server was launched with 8 DP and EP.)
When a curl request is sent to vLLM, only one DP rank handles the request while the others run a dummy batch with 1 token. This difference in token count may cause DP ranks to be processing different layers at the same time, leading to incorrect results. I'll create a unit test to reproduce this.
@jhchouuu I created a PR (#92) to enhance the EP unit tests.
@dongmin-ra Sorry for the late reply; the team's bandwidth has been saturated recently. We will start testing and reviewing the series of PRs raised by the Moreh team this week. Thanks again!
@dongmin-ra Thanks for this commit. I conducted further tests for EP8 in vLLM and also discovered similar issues. Your fix helps us solve the problem. I will first merge PR #87 and #93; #92 will be merged as soon as it is tested.
Motivation
Fix possible result value error in internode and intranode EP
Technical Details
In this PR, the CUDA graph hang issue was resolved, but there are parts where result errors could occur.
- […] crossDeviceBarrierFlag value was missing.
- […] crossDeviceBarrierFlag by 1 at the end of the internode combine kernel, but there is no barrier before this.

The changes are as follows:
- […] crossDeviceBarrierFlag value in the intranode combine kernel.
- […] crossDeviceBarrierFlag as a local variable at the start of the kernel.
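
As a rough illustration of how a barrier flag that is not advanced between launches can let a rank run ahead of its peers, here is a small deterministic Python model. It is not mori's kernel code. The `Rank` class and the `signal`, `can_pass`, and `stale_pass` helpers are hypothetical names, and the signalling scheme (each rank writes its current flag value into a slot on every peer, then waits until all of its own slots hold that value) is an assumption about how an epoch-style cross-device barrier commonly works.

```python
class Rank:
    """One participant in a hypothetical epoch-flag barrier."""

    def __init__(self, rank_id: int, world_size: int) -> None:
        self.rank_id = rank_id
        self.flag = 1                  # stand-in for crossDeviceBarrierFlag
        self.slots = [0] * world_size  # slots[p] is written by peer p


def signal(sender: Rank, ranks: list, value: int) -> None:
    """Sender announces arrival by writing `value` into its slot on every rank."""
    for peer in ranks:
        peer.slots[sender.rank_id] = value


def can_pass(rank: Rank, expected: int) -> bool:
    """A rank may leave the barrier once every slot holds the expected value."""
    return all(v == expected for v in rank.slots)


def stale_pass(bump_flag: bool) -> bool:
    """Return True if rank 0 leaves the second barrier before rank 1 arrives."""
    ranks = [Rank(0, 2), Rank(1, 2)]

    # Launch 1: both ranks arrive and signal, so both may pass.
    for r in ranks:
        signal(r, ranks, r.flag)
    assert all(can_pass(r, r.flag) for r in ranks)

    # End of launch 1: advance the flag (or forget to).
    if bump_flag:
        for r in ranks:
            r.flag += 1

    # Launch 2: only rank 0 has arrived; rank 1 is still busy elsewhere.
    expected = ranks[0].flag  # read once into a local at the start
    signal(ranks[0], ranks, expected)
    return can_pass(ranks[0], expected)


print(stale_pass(bump_flag=False))  # True: stale signals satisfy the barrier
print(stale_pass(bump_flag=True))   # False: rank 0 must wait for rank 1
```

In this model, skipping the flag update means the values left over from the previous launch already satisfy the wait condition, so rank 0 proceeds while rank 1 may still be working on something else. That is the kind of situation the comment thread describes, where DP ranks run different layers at the same time.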