
Fix stream wait events referencing future correlation IDs (#1339)

Closed
jiannanWang wants to merge 1 commit into pytorch:main from jiannanWang:fix/t2t4-future-correlation-ids

Conversation

@jiannanWang
Contributor

@jiannanWang jiannanWang commented Apr 1, 2026

Summary:

### Problem

`waitEventMap()` stored only the most recent `cudaEventRecord` for each `(context, eventId)` pair. This caused two bugs:

**Bug 1 (future correlation ID)**: A CUDA event can be recorded multiple times (re-used). CUPTI may deliver activity records out of order across buffers. If a later `cudaEventRecord` (higher `correlationId`) was delivered before the `cudaStreamWaitEvent` was processed, it would overwrite the map entry, and the wait event would then be linked to a future `cudaEventRecord`.

**Bug 2 (stale entries across sessions)**: The static `waitEventMap()` was never cleared between profiling sessions. If a CUDA event ID was reused in a subsequent session, the wait event would pick up a correlation ID from the previous session.

### Fix

**Bug 1**: Change `waitEventMap()` from `unordered_map<CtxEventPair, WaitEventInfo>` to `unordered_map<CtxEventPair, vector<WaitEventInfo>>`, storing all records for each `(context, eventId)`. Pass the wait event's `correlationId` into `getWaitEventInfo` and use `std::upper_bound` to find the most recent `cudaEventRecord` that occurred before the wait (the largest `correlationId` ≤ `queryCorrelationId`).

**Note**: since we now keep every `WaitEventInfo` in the vector, entries accumulate over the profiling session. I'm not sure whether that memory growth could become a problem for long sessions.

**Bug 2**: Clear `waitEventMap()` in `onResetTraceData()`.

The sorted insert (`std::lower_bound`) at record time keeps each vector ordered, making the lookup O(log n) instead of O(n).

### Tests

- `StreamWaitEventFutureCorrelation`: simulates out-of-order CUPTI delivery where a re-record (`corrId=200`) arrives before the wait (`corrId=101`). Verifies that the wait links to `corrId=100`, not `corrId=200`.
- `WaitEventMapClearedOnReset`: runs two back-to-back profiling sessions with the same event ID. Verifies that the second session does not pick up the correlation ID from the first.

Reviewed By: ryanzhang22

Differential Revision: D99131994

Pulled By: jiannanWang

@meta-codesync

meta-codesync bot commented Apr 1, 2026

@jiannanWang has imported this pull request. If you are a Meta employee, you can view this in D99131994.

@meta-codesync meta-codesync bot changed the title Fix stream wait events referencing future correlation IDs Fix stream wait events referencing future correlation IDs (#1339) Apr 8, 2026
jiannanWang added a commit to jiannanWang/kineto that referenced this pull request Apr 8, 2026
@jiannanWang jiannanWang force-pushed the fix/t2t4-future-correlation-ids branch from 52a7f29 to 45beda9 Compare April 8, 2026 17:40
@meta-codesync

meta-codesync bot commented Apr 8, 2026

@jiannanWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99131994.

@jiannanWang jiannanWang force-pushed the fix/t2t4-future-correlation-ids branch from 45beda9 to de74a4b Compare April 15, 2026 23:38
@meta-codesync meta-codesync bot closed this in 23b5bb5 Apr 16, 2026
@meta-codesync

meta-codesync bot commented Apr 16, 2026

@jiannanWang merged this pull request in 23b5bb5.

scotts added a commit to scotts/pytorch that referenced this pull request Apr 16, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 17, 2026
Pull Request resolved: #180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
