Expose comms_id in traces #1286

Open
ycui1984 wants to merge 1 commit into pytorch:main from ycui1984:export-D95659539
ycui1984 commented Mar 7, 2026

Summary:

Context

Add a comms_id to PyTorch profiler traces that uniquely identifies each collective or P2P communication operation across all ranks. Trace analysis tools can then correlate the same operation across ranks when debugging distributed training performance.
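As a rough illustration of the cross-rank correlation this enables, the sketch below groups events from several ranks by comms_id and measures how far apart their start times are. The `CommEvent` struct, its field names, and both helper functions are hypothetical and are not part of this PR or of any trace schema.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of one communication event parsed from a
// per-rank trace. Field names are illustrative, not the actual trace schema.
struct CommEvent {
  int rank;
  std::string name;
  uint64_t commsId;
  double startUs;
};

// Group events from all ranks by comms_id, so each bucket holds the
// per-rank instances of one logical operation.
std::map<uint64_t, std::vector<CommEvent>> groupByCommsId(
    const std::vector<CommEvent>& events) {
  std::map<uint64_t, std::vector<CommEvent>> groups;
  for (const auto& e : events) {
    groups[e.commsId].push_back(e);
  }
  return groups;
}

// Difference between the latest and earliest start time within one bucket.
double startSkewUs(const std::vector<CommEvent>& group) {
  if (group.empty()) {
    return 0.0;
  }
  const auto mm = std::minmax_element(
      group.begin(), group.end(),
      [](const CommEvent& a, const CommEvent& b) {
        return a.startUs < b.startUs;
      });
  return mm.second->startUs - mm.first->startUs;
}
```

A large skew within one bucket would suggest that some rank entered the operation late, which is the kind of question a shared identifier makes easy to ask.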

How comms_id is computed

comms_id = hash(pg_name, seqNumber, isP2P, globalRankStart, globalRankStride, worldSize)

  • pg_name: identifies the process group
  • seqNumber: per-PG operation counter that identifies which operation within the PG
  • isP2P: distinguishes P2P ops (send/recv) from collectives (allreduce, etc.), since they use separate sequence number counters
  • globalRankStart, globalRankStride, worldSize: encode the communicator topology, disambiguating cases where one PG creates multiple communicators (e.g., comm splits)
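The description does not say which hash function is used, so the following is only a sketch of how the six inputs listed above could be folded into one 64-bit value. It uses FNV-1a as a stand-in because its output depends only on the input bytes; `std::hash`, by contrast, is only required to be consistent within a single program execution, which matters when every rank computes the id in its own process. The helper names (`mixU64`, `mixString`, `computeCommsId`) and the parameter types are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a constants.
constexpr uint64_t kFnvOffsetBasis = 0xcbf29ce484222325ULL;
constexpr uint64_t kFnvPrime = 0x100000001b3ULL;

// Mix the 8 bytes of `v` into `h`, least significant byte first.
inline void mixU64(uint64_t& h, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    h ^= (v >> (8 * i)) & 0xffULL;
    h *= kFnvPrime;
  }
}

// Mix a string, prefixed by its length so field boundaries stay unambiguous.
inline void mixString(uint64_t& h, const std::string& s) {
  mixU64(h, static_cast<uint64_t>(s.size()));
  for (unsigned char c : s) {
    h ^= c;
    h *= kFnvPrime;
  }
}

// Hypothetical helper: combines the six inputs in the order listed above.
// This is not the hash used by the PR, only an illustration of the idea.
uint64_t computeCommsId(
    const std::string& pgName,
    int64_t seqNumber,
    bool isP2P,
    int globalRankStart,
    int globalRankStride,
    int worldSize) {
  uint64_t h = kFnvOffsetBasis;
  mixString(h, pgName);
  mixU64(h, static_cast<uint64_t>(seqNumber));
  mixU64(h, isP2P ? 1u : 0u);
  mixU64(h, static_cast<uint64_t>(static_cast<int64_t>(globalRankStart)));
  mixU64(h, static_cast<uint64_t>(static_cast<int64_t>(globalRankStride)));
  mixU64(h, static_cast<uint64_t>(static_cast<int64_t>(worldSize)));
  return h;
}
```

Because every rank taking part in the same operation supplies the same six values, each rank arrives at the same id without any extra communication.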

Changes by layer

  1. Data model (ParamCommsUtils.hpp/.cpp): Added seqNumber and isP2P fields to ParamCommsDebugInfo, the class that carries communication metadata through the profiling stack.
  2. Hash computation (profiler/util.cpp/.h): saveNcclMeta() computes comms_id from the six fields above and emits it as "Comms Id" in the profiler metadata map.
  3. Trace output (output_json.cpp): Kineto reads "Comms Id" from the metadata and writes it into the Chrome trace JSON, making it visible in trace viewers.
  4. Tests (comms_id.cpp, CuptiActivityProfilerTest.cpp): 9 unit tests covering:
     • Storage/retrieval of seqNumber and isP2P
     • Default values
     • End-to-end: comms_id appears in saveNcclMeta() output with the correct hash
     • Determinism across instances
     • Uniqueness across different PG names, sequence numbers, P2P vs collective, and communicator topologies
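The data-model and metadata steps above can be pictured with the reduced sketch below. `DebugInfoSketch` is a hypothetical stand-in for the two new fields on ParamCommsDebugInfo, the default values (0 and false) are assumptions since the description does not state them, and `emitCommsId` is an invented helper that only shows an id landing in a string-to-string map under the "Comms Id" key.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, reduced stand-in for the two fields added to
// ParamCommsDebugInfo. The defaults below are assumptions.
class DebugInfoSketch {
 public:
  DebugInfoSketch() = default;
  DebugInfoSketch(int64_t seqNumber, bool isP2P)
      : seqNumber_(seqNumber), isP2P_(isP2P) {}

  int64_t getSeqNumber() const { return seqNumber_; }
  bool getIsP2P() const { return isP2P_; }

 private:
  int64_t seqNumber_{0};  // assumed default
  bool isP2P_{false};     // assumed default
};

// Hypothetical helper: store an already computed comms_id in a metadata map
// under the "Comms Id" key, formatted as a decimal string.
void emitCommsId(
    std::unordered_map<std::string, std::string>& metadata,
    uint64_t commsId) {
  metadata["Comms Id"] = std::to_string(commsId);
}
```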

Differential Revision: D95659539

@meta-cla meta-cla bot added the cla signed label Mar 7, 2026

meta-codesync bot commented Mar 7, 2026

@ycui1984 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95659539.

ycui1984 added a commit to ycui1984/pytorch that referenced this pull request Mar 7, 2026
Summary:
X-link: pytorch/kineto#1286

Test Plan: Added unit tests.

Differential Revision: D95659539
ycui1984 added a commit to ycui1984/kineto that referenced this pull request Mar 11, 2026
Summary:

This is part of a larger effort to expose comms_id in profiler traces to enable correlating the same communication operation across different ranks.

This diff adds the Kineto side: reading "Comms Id" from the profiler metadata and writing it into the Chrome trace JSON output. When the metadata key is present, its value is included in the trace event args, making it visible in trace viewers.
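A minimal sketch of the "include only when present" behavior described above, assuming the metadata arrives as a string-to-string map. `commsIdArg` is a hypothetical helper, not Kineto's actual output code. Emitting the value as a quoted string is also an assumption: it keeps a 64-bit id intact in JavaScript-based viewers, which cannot represent every 64-bit integer exactly as a number. The helper further assumes the value contains no characters that need JSON escaping.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: returns a JSON fragment to append inside an event's
// "args" object when "Comms Id" is present, and an empty string otherwise.
std::string commsIdArg(
    const std::unordered_map<std::string, std::string>& metadata) {
  const auto it = metadata.find("Comms Id");
  if (it == metadata.end()) {
    return "";
  }
  // Assumes the value is a plain decimal string with nothing to escape.
  return ", \"Comms Id\": \"" + it->second + "\"";
}
```

Events without the key produce an empty fragment, so traces from builds that do not emit a comms_id keep the args they had before.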

The PyTorch side (computing and emitting the comms_id) will be in a follow-up diff after the Kineto submodule is updated.

Differential Revision: D96153960