Summary
nkilib.experimental.attention.ring_attention_fwd.ring_attention_spmd_fwd fails to compile under LNC=2 with:
[INTERNAL_ERROR] [NCC_ILLC059] Could not find MemoryLocation named inst__I-5-0:send_k_buf on core 1
The same kernel compiles and runs correctly under LNC=1 on the same hardware. The bug is in neuronx-cc's collective_permute buffer allocation under paired-core (LNC=2) topology — HBM capacity is irrelevant.
Minimal Repro
from nkilib.experimental.attention.ring_attention_fwd import ring_attention_spmd_fwd
import torch_neuronx
# q/k/v: [1, 16, 11136, 64] bf16 (also fails at [1, 8, 22272, 64] and head_dim=128)
# Wrapped via torch_neuronx.nki_hop.wrap_nki
ring_attention_spmd_fwd(
    q, k, v,
    replica_groups=((0, 1, 2, 3, 4, 5, 6, 7),),  # cp=8
    num_workers=8, use_causal_mask=False, tp_q=True, tp_k=True,
)
Result by LNC setting:
- NEURON_RT_VIRTUAL_CORE_SIZE=2 (LNC=2) → NCC_ILLC059
- NEURON_RT_VIRTUAL_CORE_SIZE=1 (LNC=1) → compiles and runs
Environment
- Hardware: trn2.48xlarge, 8 NeuronCores
- Instance types tested: s-lnc1-trn2 (12 GB/bank) and m-trn2 (24 GB/bank)
- Model context: non-causal MHA, embed_dim=1024, 16 heads, head_dim=64, bf16
- NEURON_CC_FLAGS: --model-type=transformer
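To make the LNC=1 vs. LNC=2 comparison easy to reproduce, here is a minimal sketch of a driver that runs the same repro script once per setting in a fresh process. The helper names `lnc_env` and `run_repro` and the script path `repro.py` are hypothetical and not part of nkilib or torch_neuronx. Only the two environment variables already listed in this report are set, and the sketch assumes they are read when the process starts.

```python
import os
import subprocess
import sys


def lnc_env(lnc, base=None):
    """Return a copy of the environment with the settings used in this report."""
    if lnc not in (1, 2):
        raise ValueError("this report only covers LNC=1 and LNC=2")
    env = dict(os.environ if base is None else base)
    env["NEURON_RT_VIRTUAL_CORE_SIZE"] = str(lnc)
    env["NEURON_CC_FLAGS"] = "--model-type=transformer"
    return env


def run_repro(script, lnc):
    """Run a repro script (hypothetical path) in a fresh process under one LNC setting.

    Returns (exit code, whether NCC_ILLC059 appeared in the captured output).
    """
    proc = subprocess.run(
        [sys.executable, script],
        env=lnc_env(lnc),
        capture_output=True,
        text=True,
    )
    return proc.returncode, "NCC_ILLC059" in (proc.stdout + proc.stderr)


# Example usage (assumes a repro.py containing the call from "Minimal Repro"):
#   for lnc in (1, 2):
#       print(lnc, run_repro("repro.py", lnc))
```

Running each setting in its own process keeps a cached compile or a previously initialized runtime from one LNC mode from leaking into the other.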
What works vs. what fails
| cp | head_dim | LNC | HBM/bank | Result |
|---|---|---|---|---|
| 8 | 64 | 2 | 12 GB | NCC_ILLC059 send_k_buf on core 1 |
| 4 | 64 | 2 | 12 GB | NCC_ILLC059 identical |
| 4 | 128 (padded) | 2 | 12 GB | NCC_ILLC059 identical |
| 8 | 64 | 2 | 24 GB | NCC_ILLC059 identical (proves not memory-related) |
| 4 | 128 (padded) | 1 | 12 GB | Compiles and runs |
| 8 | 64 | 1 | 12 GB | Compiles and runs |
Key observations
- LNC=2 is the sole trigger — independent of cp degree (4 or 8), head_dim (64 or 128), and HBM capacity (12 GB or 24 GB per bank).
- LNC=1 always works — the kernel compiles and executes correctly.
- The nkilib ring attention tests cover only cp≤4, head_dim=128, nheads≤3. Our config is outside that envelope, but the failure is topology-specific (LNC=2), not shape-specific.
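The "sole trigger" claim can be checked mechanically against the six runs reported above. The sketch below transcribes those rows and looks for any factor whose value alone fixes the outcome in every run. The names `RUNS`, `FACTORS`, and `determining_factors` are illustrative only.

```python
from collections import defaultdict

# Rows transcribed from the results table: (cp, head_dim, lnc, hbm_gb, compiles)
RUNS = [
    (8, 64, 2, 12, False),
    (4, 64, 2, 12, False),
    (4, 128, 2, 12, False),
    (8, 64, 2, 24, False),
    (4, 128, 1, 12, True),
    (8, 64, 1, 12, True),
]
FACTORS = ("cp", "head_dim", "lnc", "hbm_gb")


def determining_factors(runs):
    """Return the factors whose value alone fixes the outcome in every run."""
    result = []
    for i, name in enumerate(FACTORS):
        outcomes = defaultdict(set)
        for run in runs:
            outcomes[run[i]].add(run[-1])
        # Each value of the factor must map to exactly one outcome...
        consistent = all(len(seen) == 1 for seen in outcomes.values())
        # ...and the factor must actually separate passing from failing runs.
        separates = len({next(iter(seen)) for seen in outcomes.values()}) > 1
        if consistent and separates:
            result.append(name)
    return result


print(determining_factors(RUNS))  # → ['lnc']
```

Every other factor shows both outcomes for at least one of its values (for example cp=8 both fails and passes). The table has no LNC=1 run at 24 GB, so the HBM conclusion rests on the LNC=2 runs failing identically at both capacities.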
Question
Why does ring_attention_spmd_fwd's collective_permute send_k_buf fail to resolve a MemoryLocation on core 1 under LNC=2? Is LNC=2 supported for this kernel?
Happy to provide full compiler logs if needed.