Bug fix: profiler kernelch event type field #1973
Open
+1
−1
Description
I ran the NCCL profiler to check kernelCh events on a single node with 8 A100 GPUs, using code cloned from the NCCL master branch. The results were unexpected: most kernelCh events have stopTs 0 and StopGpuClk 0.
I built with the recommended procedure, a plain make.
The launch script is as follows:
The resulting events are (only a portion is shown to save space):
With the ext-profiler plugin code from nccl-2.27.3, the same run produces the correct result.
I found that the root cause is a wrong type in struct kernelCh in ext-profiler/example/event.h:
the type field is declared as uint8_t. However, the updateEvent function in plugin.cc assumes that every event's type field is uint64_t, so the type read for a kernelCh event does not match what was stored. Here is the code snippet.
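To illustrate the mismatch independently of the plugin sources, here is a minimal sketch. It assumes a dispatcher that reads the first 8 bytes of an event handle as the type. The struct layouts, the kKernelCh value, and all function names are hypothetical stand-ins, not the actual definitions in event.h or plugin.cc.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for illustration only; not the real NCCL structs.
struct NarrowEvent {
  uint8_t type;       // 1-byte type field, as reported for struct kernelCh
  uint8_t channelId;  // a neighbouring field that lands in the first 8 bytes
  uint64_t stopTs;
};

struct WideEvent {
  uint64_t type;      // 8-byte type field, matching what the reader expects
  uint8_t channelId;
  uint64_t stopTs;
};

constexpr uint64_t kKernelCh = 5;  // hypothetical type value

// Mimics a dispatcher that assumes every event starts with a uint64_t type.
uint64_t readTypeAs64(const void* handle) {
  assert(handle != nullptr);
  uint64_t t;
  std::memcpy(&t, handle, sizeof t);  // read the first 8 bytes of the event
  return t;
}

bool dispatchMatches(const void* handle) {
  return readTypeAs64(handle) == kKernelCh;
}

// With the 1-byte field, the 8-byte read also picks up channelId,
// so the comparison against kKernelCh fails.
bool narrowEventMatches() {
  NarrowEvent ev;
  std::memset(&ev, 0, sizeof ev);
  ev.type = static_cast<uint8_t>(kKernelCh);
  ev.channelId = 3;
  return dispatchMatches(&ev);
}

// With the 8-byte field, the read returns exactly the stored type.
bool wideEventMatches() {
  WideEvent ev;
  std::memset(&ev, 0, sizeof ev);
  ev.type = kKernelCh;
  ev.channelId = 3;
  return dispatchMatches(&ev);
}
```

In this sketch the narrow event is never recognized, which is consistent with stop fields being left at 0 if the update path is skipped for such events; whether that is exactly what happens in the plugin depends on the real code.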
Related Issues
Changes & Impact
Change the declaration of type in struct kernelCh from uint8_t to uint64_t.
Performance Impact
After changing the kernelCh type declaration, I get the correct result.