Skip to content

Conversation

@ksivaman
Copy link
Member

@ksivaman ksivaman commented Jan 1, 2026

Description

When passing an int as the barrier_id to cutlass::arch::NamedBarrier::sync, this overload of the sync function is invoked which already accounts for the reserved barriers. Passing 10 results in barrier_id=18 which goes over the 16 provided by hardware.

Introduced in #2411.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Use a smaller barrier ID to be within range.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman requested review from ptrendx and zhongbozhu January 1, 2026 14:45
@ksivaman ksivaman added the bug Something isn't working label Jan 1, 2026
@ksivaman
Copy link
Member Author

ksivaman commented Jan 1, 2026

/te-ci

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 1, 2026

Greptile Summary

This PR fixes a critical out-of-bounds barrier ID bug where passing barrier_id=10 to cutlass::arch::NamedBarrier::sync resulted in hardware barrier ID 18, exceeding the 16 available barriers (0-15).

  • Changed row_quant_barrier_id from 10 to 2, ensuring final barrier ID is 10 after CUTLASS adds the 8-barrier offset
  • Added clear comment explaining the CUTLASS offset behavior to prevent future mistakes
  • Changed from int to constexpr int for compile-time optimization
  • The fix aligns with other barrier usage in the codebase (e.g., ReservedNamedBarriers::EpilogueBarrier)

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a critical hardware constraint violation with a minimal, well-documented change
  • The fix is straightforward, directly addresses the bug described in the PR, adds helpful documentation, and makes a small optimization (constexpr). No logic changes beyond the barrier ID value.
  • No files require special attention

Important Files Changed

Filename Overview
transformer_engine/common/hadamard_transform/group_row_cast_col_hadamard_transform_cast_fusion.cu Fixed out-of-bounds barrier ID from 10 to 2, preventing hardware overflow (CUTLASS adds 8, so 2→10, not 10→18)

Sequence Diagram

sequenceDiagram
    participant K as CUDA Kernel
    participant C as CUTLASS NamedBarrier
    participant H as Hardware Barriers
    
    Note over K,H: Before Fix (barrier_id=10)
    K->>C: sync(threads, barrier_id=10)
    C->>C: Add reserved barrier offset (+8)
    C->>H: bar.sync with id=18
    Note over H: ERROR: Only 16 barriers available (0-15)
    
    Note over K,H: After Fix (barrier_id=2)
    K->>C: sync(threads, barrier_id=2)
    C->>C: Add reserved barrier offset (+8)
    C->>H: bar.sync with id=10
    Note over H: SUCCESS: Within valid range (0-15)
Loading

@ksivaman
Copy link
Member Author

ksivaman commented Jan 2, 2026

/te-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant