
Fix stream wait events referencing future correlation IDs (#1339)

Closed
jiannanWang wants to merge 1 commit into pytorch:main from jiannanWang:fix/t2t4-future-correlation-ids

Conversation

@jiannanWang
Contributor

@jiannanWang jiannanWang commented Apr 1, 2026

Summary:

### Problem

`waitEventMap()` stored only the most recent `cudaEventRecord` for each `(context, eventId)` pair. This caused two bugs:

**Bug 1 (future correlation ID)**: A CUDA event can be recorded multiple times (re-used). CUPTI may deliver activity records out of order across buffers. If a later `cudaEventRecord` (higher `correlationId`) was delivered before the `cudaStreamWaitEvent` was processed, it would overwrite the map entry, and the wait event would then be linked to a future `cudaEventRecord`.

**Bug 2 (stale entries across sessions)**: The static `waitEventMap()` was never cleared between profiling sessions. If a CUDA event ID was reused in a subsequent session, the wait event would pick up a correlation ID from the previous session.

### Fix

**Bug 1**: Change `waitEventMap()` from `unordered_map<CtxEventPair, WaitEventInfo>` to `unordered_map<CtxEventPair, vector<WaitEventInfo>>`, storing all records for each `(context, eventId)`. Pass the wait event's `correlationId` into `getWaitEventInfo` and use `std::upper_bound` to find the most recent `cudaEventRecord` that occurred before the wait (the largest `correlationId` ≤ `queryCorrelationId`).

**Note**: since we now keep every `WaitEventInfo` in the vector, entries accumulate over the profiling session. I'm not sure whether that memory growth could become a problem for long sessions.

**Bug 2**: Clear `waitEventMap()` in `onResetTraceData()`.

The sorted insert (`std::lower_bound`) at record time keeps each vector ordered, making the lookup O(log n) instead of O(n).

### Tests

- `StreamWaitEventFutureCorrelation`: simulates out-of-order CUPTI delivery where a re-record (`corrId=200`) arrives before the wait (`corrId=101`). Verifies that the wait links to `corrId=100`, not `corrId=200`.
- `WaitEventMapClearedOnReset`: runs two back-to-back profiling sessions with the same event ID. Verifies that the second session does not pick up the correlation ID from the first.

Reviewed By: ryanzhang22

Differential Revision: D99131994

Pulled By: jiannanWang

@meta-codesync

meta-codesync bot commented Apr 1, 2026

@jiannanWang has imported this pull request. If you are a Meta employee, you can view this in D99131994.

@meta-codesync meta-codesync bot changed the title Fix stream wait events referencing future correlation IDs Fix stream wait events referencing future correlation IDs (#1339) Apr 8, 2026
jiannanWang added a commit to jiannanWang/kineto that referenced this pull request Apr 8, 2026
@jiannanWang jiannanWang force-pushed the fix/t2t4-future-correlation-ids branch from 52a7f29 to 45beda9 Compare April 8, 2026 17:40
@meta-codesync

meta-codesync bot commented Apr 8, 2026

@jiannanWang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99131994.

@jiannanWang jiannanWang force-pushed the fix/t2t4-future-correlation-ids branch from 45beda9 to de74a4b Compare April 15, 2026 23:38
@meta-codesync meta-codesync bot closed this in 23b5bb5 Apr 16, 2026
@meta-codesync

meta-codesync bot commented Apr 16, 2026

@jiannanWang merged this pull request in 23b5bb5.

scotts added a commit to scotts/pytorch that referenced this pull request Apr 16, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 17, 2026
Pull Request resolved: #180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
