NCC_ILLC059: ring_attention_spmd_fwd collective_permute send_k_buf fails on LNC=2 #1352

Description

@yahavb

Summary

nkilib.experimental.attention.ring_attention_fwd.ring_attention_spmd_fwd fails to compile under LNC=2 with:

[INTERNAL_ERROR] [NCC_ILLC059] Could not find MemoryLocation named inst__I-5-0:send_k_buf on core 1

The same kernel compiles and runs correctly under LNC=1 on the same hardware. The bug is in neuronx-cc's collective_permute buffer allocation under paired-core (LNC=2) topology — HBM capacity is irrelevant.

Minimal Repro

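import torch
import torch_neuronx
from nkilib.experimental.attention.ring_attention_fwd import ring_attention_spmd_fwd

# q/k/v: [1, 16, 11136, 64] bf16 (also fails at [1, 8, 22272, 64] and head_dim=128).
# Placeholder inputs with the failing shape; device placement is omitted here.
q, k, v = (torch.randn(1, 16, 11136, 64, dtype=torch.bfloat16) for _ in range(3))

# Wrapped via torch_neuronx.nki_hop.wrap_nki (wrapping code omitted).
ring_attention_spmd_fwd(q, k, v,
    replica_groups=((0, 1, 2, 3, 4, 5, 6, 7),),  # cp=8
    num_workers=8, use_causal_mask=False, tp_q=True, tp_k=True)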
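
For orientation, the sketch below lists the send/receive pairing that a single ring step over the 8-rank replica group would imply. The direction (each rank sends its K block to its successor) is an assumption for illustration; the pairs the kernel actually passes to collective_permute were not inspected, and `ring_pairs` is a hypothetical helper name.

```python
def ring_pairs(group):
    """Return (source, target) pairs for one ring step.

    Assumption: each rank sends to the next rank in the group,
    and the last rank wraps around to the first.
    """
    n = len(group)
    return [(group[i], group[(i + 1) % n]) for i in range(n)]


# Replica group taken from the repro above (cp=8).
replica_group = (0, 1, 2, 3, 4, 5, 6, 7)
pairs = ring_pairs(replica_group)
```

Under LNC=2 each logical core is backed by two physical cores, so a send buffer such as send_k_buf presumably has to be resolvable on both; the error message names core 1 specifically.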

Result by LNC setting:

  • NEURON_RT_VIRTUAL_CORE_SIZE=2 (LNC=2) → NCC_ILLC059
  • NEURON_RT_VIRTUAL_CORE_SIZE=1 (LNC=1) → compiles and runs
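
A minimal sketch of how the two settings can be toggled from a driver script, assuming the repro is launched as a child process so the variable is set before the Neuron runtime initializes. `neuron_env` is a hypothetical helper; the variable name and the compiler flag are the ones listed in this report.

```python
import os


def neuron_env(lnc, base=None):
    """Return an environment dict with the LNC setting used in this report.

    `base` defaults to a copy of the current process environment.
    """
    if lnc not in (1, 2):
        raise ValueError("this report only covers LNC=1 and LNC=2")
    env = dict(os.environ if base is None else base)
    env["NEURON_RT_VIRTUAL_CORE_SIZE"] = str(lnc)
    # Compiler flags as listed under Environment; overwrites any existing value.
    env["NEURON_CC_FLAGS"] = "--model-type=transformer"
    return env


failing_env = neuron_env(2, base={})
working_env = neuron_env(1, base={})
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching the repro script.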

Environment

  • Hardware: trn2.48xlarge, 8 NeuronCores
  • Instance types tested: s-lnc1-trn2 (12 GB/bank) and m-trn2 (24 GB/bank)
  • Model context: non-causal MHA, embed_dim=1024, 16 heads, head_dim=64, bf16
  • NEURON_CC_FLAGS: --model-type=transformer

What works vs. what fails

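| cp | head_dim     | LNC | HBM/bank | Result                                                |
|----|--------------|-----|----------|-------------------------------------------------------|
| 8  | 64           | 2   | 12 GB    | NCC_ILLC059 send_k_buf on core 1                      |
| 4  | 64           | 2   | 12 GB    | NCC_ILLC059 identical                                 |
| 4  | 128 (padded) | 2   | 12 GB    | NCC_ILLC059 identical                                 |
| 8  | 64           | 2   | 24 GB    | NCC_ILLC059 identical (proves not memory-related)     |
| 4  | 128 (padded) | 1   | 12 GB    | Compiles and runs                                     |
| 8  | 64           | 1   | 12 GB    | Compiles and runs                                     |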

Key observations

  1. LNC=2 is the sole trigger — independent of cp degree (4 or 8), head_dim (64 or 128), and HBM capacity (12 GB or 24 GB per bank).
  2. LNC=1 always works — the kernel compiles and executes correctly.
  3. The nkilib ring attention tests cover only cp≤4, head_dim=128, nheads≤3. Our config is outside that envelope, but the failure is topology-specific (LNC=2), not shape-specific.
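
Observation 1 can be checked mechanically against the results table. The sketch below encodes the six runs and reports which factors fully determine the outcome on their own; `RUNS`, `FACTORS`, and `determining_factors` are illustrative names, and the rows are copied from the table above.

```python
from collections import defaultdict

# (cp, head_dim, lnc, hbm_gb_per_bank, compiled) copied from the results table.
RUNS = [
    (8, 64, 2, 12, False),
    (4, 64, 2, 12, False),
    (4, 128, 2, 12, False),
    (8, 64, 2, 24, False),
    (4, 128, 1, 12, True),
    (8, 64, 1, 12, True),
]
FACTORS = ("cp", "head_dim", "lnc", "hbm_gb_per_bank")


def determining_factors(runs):
    """Return the factors whose value alone fixes the outcome in every run."""
    result = []
    for idx, name in enumerate(FACTORS):
        outcomes = defaultdict(set)
        for run in runs:
            outcomes[run[idx]].add(run[-1])
        # A factor determines the result if no value maps to both outcomes.
        if all(len(seen) == 1 for seen in outcomes.values()):
            result.append(name)
    return result


sole_trigger = determining_factors(RUNS)
```

On these six rows only `lnc` separates failures from successes. The table has no LNC=1 run at 24 GB, so that cell remains untested.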

Question

Why does ring_attention_spmd_fwd's collective_permute send_k_buf fail to resolve a MemoryLocation on core 1 under LNC=2? Is LNC=2 supported for this kernel?

Happy to provide full compiler logs if needed.

Labels: Trn2, bug