Summary:

# Context

Add a `comms_id` to PyTorch profiler traces that uniquely identifies each collective/P2P communication operation across all ranks. This enables trace analysis tools to correlate the same operation across different ranks when debugging distributed training performance.

# How comms_id is computed

`comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)`

- `pg_name` — identifies the process group
- `seqNumber` — per-PG operation counter; identifies which operation within the PG
- `isP2P` — distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
- `globalRankStart`, `globalRankStride`, `worldSize` — encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)

# Changes by layer

1. Data model (`ParamCommsUtils.hpp`/`.cpp`) — added `seqNumber` and `isP2P` fields to `ParamCommsDebugInfo`, the class that carries communication metadata through the profiling stack.
2. Hash computation (`profiler/util.cpp`/`.h`) — in `saveNcclMeta()`, computes `comms_id` from the six fields above and emits it as "Comms Id" in the profiler metadata map.
3. Trace output (`output_json.cpp`) — Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
4. Tests (`comms_id.cpp`, `CuptiActivityProfilerTest.cpp`) — nine unit tests covering:
   - storage/retrieval of `seqNumber` and `isP2P`
   - default values
   - end-to-end: `comms_id` appears in `saveNcclMeta()` output with the correct hash
   - determinism across instances
   - uniqueness across different PG names, sequence numbers, P2P vs. collective ops, and communicator topologies

Differential Revision: D95659539
ycui1984 added a commit to ycui1984/pytorch that referenced this pull request on Mar 7, 2026:
Summary: X-link: pytorch/kineto#1286 — same summary as the PR description above.

Test Plan: added unit tests.

Differential Revision: D95659539
ycui1984 added a commit to ycui1984/kineto that referenced this pull request on Mar 11, 2026:
Summary: This is part of a larger effort to expose `comms_id` in profiler traces so that the same communication operation can be correlated across ranks.

This diff adds the Kineto side: reading "Comms Id" from the profiler metadata and writing it into the Chrome trace JSON output. When the metadata key is present, it is included in the trace event args, making it visible in trace viewers.

The PyTorch side (computing and emitting the `comms_id`) will land in a follow-up diff after the Kineto submodule is updated.

Differential Revision: D96153960