Fix stream wait events referencing future correlation IDs (#1339)#1339
Closed
jiannanWang wants to merge 1 commit intopytorch:mainfrom
Closed
Fix stream wait events referencing future correlation IDs (#1339)#1339jiannanWang wants to merge 1 commit intopytorch:mainfrom
jiannanWang wants to merge 1 commit intopytorch:mainfrom
Conversation
|
@jiannanWang has imported this pull request. If you are a Meta employee, you can view this in D99131994. |
jiannanWang
added a commit
to jiannanWang/kineto
that referenced
this pull request
Apr 8, 2026
Summary: ### Problem `waitEventMap()` stored only the most recent `cudaEventRecord` for each `(context, eventId)` pair. This caused two bugs: Bug 1 Future correlation ID: A CUDA event can be recorded multiple times (re-used). CUPTI may deliver the activity records out of buffer order. If a later `cudaEventRecord` (higher correlationId) was delivered before the `cudaStreamWaitEvent` was processed, it would overwrite the map entry. The wait event would then be linked to a future `cudaEventRecord`. Bug 2 Stale entries across sessions: The static `waitEventMap()` was never cleared between profiling sessions. If a CUDA event ID was reused in a subsequent session, the wait event would pick up a correlation ID from the previous session. ### Fix Bug 1: Change `waitEventMap()` from `unordered_map<CtxEventPair, WaitEventInfo>` to `unordered_map<CtxEventPair, vector<WaitEventInfo>>`, storing all records for each `(context, eventId)`. Pass `correlationId` of the wait event into `getWaitEventInfo` and use `std::upper_bound` to find the most recent `cudaEventRecord` that occurred strictly before the wait (largest `correlationId ≤ queryCorrelationId`). **Note**: since now we record all WaitEventInfo in the vector, it may accumulate during the profiling session. I'm not sure if that would be a problem for taking up the memory. Bug 2: Clear waitEventMap() in onResetTraceData(). The sorted insert (`std::lower_bound`) at record time keeps the vector ordered, making the lookup O(log n) instead of O(n). ### Tests - `StreamWaitEventFutureCorrelation`: simulates out-of-order CUPTI delivery where a re-record (`corrId=200`) arrives before the wait (`corrId=101`). Verify the wait links to `corrId=100`, not `corrId=200`. - `WaitEventMapClearedOnReset`: runs two back-to-back profiling sessions with the same event ID. Verify the second session does not pick up the correlation ID from the first. Differential Revision: D99131994 Pulled By: jiannanWang
52a7f29 to
45beda9
Compare
|
@jiannanWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99131994. |
Summary: ### Problem `waitEventMap()` stored only the most recent `cudaEventRecord` for each `(context, eventId)` pair. This caused two bugs: Bug 1 Future correlation ID: A CUDA event can be recorded multiple times (re-used). CUPTI may deliver the activity records out of buffer order. If a later `cudaEventRecord` (higher correlationId) was delivered before the `cudaStreamWaitEvent` was processed, it would overwrite the map entry. The wait event would then be linked to a future `cudaEventRecord`. Bug 2 Stale entries across sessions: The static `waitEventMap()` was never cleared between profiling sessions. If a CUDA event ID was reused in a subsequent session, the wait event would pick up a correlation ID from the previous session. ### Fix Bug 1: Change `waitEventMap()` from `unordered_map<CtxEventPair, WaitEventInfo>` to `unordered_map<CtxEventPair, vector<WaitEventInfo>>`, storing all records for each `(context, eventId)`. Pass `correlationId` of the wait event into `getWaitEventInfo` and use `std::upper_bound` to find the most recent `cudaEventRecord` that occurred strictly before the wait (largest `correlationId ≤ queryCorrelationId`). **Note**: since now we record all WaitEventInfo in the vector, it may accumulate during the profiling session. I'm not sure if that would be a problem for taking up the memory. Bug 2: Clear waitEventMap() in onResetTraceData(). The sorted insert (`std::lower_bound`) at record time keeps the vector ordered, making the lookup O(log n) instead of O(n). ### Tests - `StreamWaitEventFutureCorrelation`: simulates out-of-order CUPTI delivery where a re-record (`corrId=200`) arrives before the wait (`corrId=101`). Verify the wait links to `corrId=100`, not `corrId=200`. - `WaitEventMapClearedOnReset`: runs two back-to-back profiling sessions with the same event ID. Verify the second session does not pick up the correlation ID from the first. Reviewed By: ryanzhang22 Differential Revision: D99131994 Pulled By: jiannanWang
45beda9 to
de74a4b
Compare
|
@jiannanWang merged this pull request in 23b5bb5. |
scotts
added a commit
to scotts/pytorch
that referenced
this pull request
Apr 16, 2026
Includes the following commits: - Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5 - Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960 - Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce - Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5 - Fix toggle test (pytorch/kineto#1369) ee2103c - Link to correct fmt repo (pytorch/kineto#1345) 3447834 - Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499 - Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c - Add XPU workflow (pytorch/kineto#1302) 11cc1e0 - Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d - Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045 - Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423 - Add kineto-release skill (pytorch/kineto#1360) 675b6cd Authored with Claude.
pytorchmergebot
pushed a commit
to pytorch/pytorch
that referenced
this pull request
Apr 17, 2026
Includes the following commits: - Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5 - Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960 - Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce - Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5 - Fix toggle test (pytorch/kineto#1369) ee2103c - Link to correct fmt repo (pytorch/kineto#1345) 3447834 - Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499 - Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c - Add XPU workflow (pytorch/kineto#1302) 11cc1e0 - Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d - Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045 - Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423 - Add kineto-release skill (pytorch/kineto#1360) 675b6cd Authored with Claude. Pull Request resolved: #180606 Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Problem
waitEventMap()stored only the most recentcudaEventRecordfor each(context, eventId)pair. This caused two bugs:Bug 1 Future correlation ID: A CUDA event can be recorded multiple times (re-used). CUPTI may deliver the activity records out of buffer order. If a later
cudaEventRecord(higher correlationId) was delivered before thecudaStreamWaitEventwas processed, it would overwrite the map entry. The wait event would then be linked to a futurecudaEventRecord.Bug 2 Stale entries across sessions: The static
waitEventMap()was never cleared between profiling sessions. If a CUDA event ID was reused in a subsequent session, the wait event would pick up a correlation ID from the previous session.Fix
Bug 1: Change
waitEventMap()fromunordered_map<CtxEventPair, WaitEventInfo>tounordered_map<CtxEventPair, vector<WaitEventInfo>>, storing all records for each(context, eventId). PasscorrelationIdof the wait event intogetWaitEventInfoand usestd::upper_boundto find the most recentcudaEventRecordthat occurred strictly before the wait (largestcorrelationId ≤ queryCorrelationId).Note: since now we record all WaitEventInfo in the vector, it may accumulate during the profiling session. I'm not sure if that would be a problem for taking up the memory.
Bug 2: Clear waitEventMap() in onResetTraceData().
The sorted insert (
std::lower_bound) at record time keeps the vector ordered, making the lookup O(log n) instead of O(n).Tests
StreamWaitEventFutureCorrelation: simulates out-of-order CUPTI delivery where a re-record (corrId=200) arrives before the wait (corrId=101). Verify the wait links tocorrId=100, notcorrId=200.WaitEventMapClearedOnReset: runs two back-to-back profiling sessions with the same event ID. Verify the second session does not pick up the correlation ID from the first.Reviewed By: ryanzhang22
Differential Revision: D99131994
Pulled By: jiannanWang