You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, Mori EP unit test (i.e. test_dispatch_combine.py) does not sufficiently cover real cases.
Problem 1. Each test case performed dispatch/combine only once.
In practice, depending on the model, dispatch/combine is executed multiple times corresponding to the number of layers.
e.g. In DeepSeek-R1, dispatch/combine is performed 58 times per step.
When dispatch/combine is executed multiple times, there may be cases where the results are incorrect, but currently this is not being checked.
Problem 2. There is synchronization across devices before checking the results.
In practice, for performance reasons, explicit synchronization is not performed after dispatch and combine.
Most issues with the output values occur due to device-to-device timing differences.
Since the unit test is synchronizing across devices after dispatch/combine, it cannot detect the timing issue.
Technical Details
Updated the unit tests to more closely resemble real use cases:
Introduced the num_reps parameter to allow each test case to perform dispatch/combine multiple times.
Removed explicitly synchronize across devices after dispatch/combine.
Instead, results are stored in a list during execution.
verification is performed after all dispatch/combine operations are complete.
Note
Currently, there is an issue in Mori EP where results are incorrect if num_reps is greater than 1. I will create a PR to fix this.
Fixed an issue where the test would hang if an error occurred on some ranks:
Previously, the parent process waited for results from all processes, but if an error occurred on some ranks, other ranks continued executing dispatch/combine and never sent a test result to the result queue.
As a result, the parent process would wait indefinitely for responses that never arrive, causing the test to hang.
Changed the behavior so that if an error occurs in any process, the error is raised immediately without waiting for the result queue.
@dongmin-ra Could I be granted more access to the moreh-dev repo?
I replicated your issue during the test, and use your pr to fix this issue.
but I found that your modification cause benchmark init issue, python3 tests/python/ops/bench_dispatch_combine.py.
Then I have fixed it, and I found combine result mismatch during warmup step, could you please help to check this benchmark status after cherry-pick this commit?
@jhchouuu Could I be granted more access to the moreh-dev repo?
Okay. I granted you write access to moreh-dev/mori repo.
Then I have fixed it, and I found combine result mismatch during warmup step, could you please help to check this benchmark status after cherry-pick this commit?
@jhchouuu I fixed the errors on bench_dispatch_combine.py and bench_dispatch_combine_tune.py scripts. Please take a look when you get a chance.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a work by a team from Moreh Inc.
Motivation
Currently, Mori EP unit test (i.e.
test_dispatch_combine.py) does not sufficiently cover real cases.Problem 1. Each test case performed dispatch/combine only once.
Problem 2. There is synchronization across devices before checking the results.
Technical Details
Updated the unit tests to more closely resemble real use cases:
num_repsparameter to allow each test case to perform dispatch/combine multiple times.Note
Currently, there is an issue in Mori EP where results are incorrect if
num_repsis greater than 1. I will create a PR to fix this.Fixed an issue where the test would hang if an error occurred on some ranks:
Test Plan
Test Result
Submission Checklist