NCCL_TOPO_XML_MAX_NODES=256 limit hit during intra-node XML fusion on 32-NIC hosts (AWS p5.48xlarge)

# Issue: hit `NCCL_TOPO_XML_MAX_NODES=256` limit during intra-node XML fusion on 32-NIC hosts (AWS p5.48xlarge)

## Summary

On AWS p5.48xlarge (8 x H100, 32 EFA NICs), `ncclTopoFuseXml` hits its 256-node XML buffer cap while fusing the 8 local-rank XMLs at initialization. The error fires as:

```
graph/xml.h: NCCL WARN Error : too many XML nodes (max 256)
```

Each rank autogenerates ~126 XML nodes from its `/sys/class/pci_bus/*` walk on this host shape, so the 8 local ranks contribute roughly 1,008 nodes before dedup. On our reproducer, the per-rank dedup in `ncclTopoFuseXmlRecursive` does not collapse enough of them to keep the merged output within 256 nodes.
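To put a rough number on how much dedup the fuse has to achieve, here is a small sketch of the node budget. It rests on a simplifying assumption of ours, not on NCCL's actual behavior: each rank's XML splits into nodes that are identical on every rank (and collapse to one copy) and nodes that never match another rank's (and are kept once per rank). The function names are hypothetical and the model ignores tree structure.

```c
#include <assert.h>

/* Toy model (not NCCL code): merged node count when each rank contributes
 * `per_rank` nodes, of which `unique` never dedup against another rank and
 * the rest are identical everywhere and collapse to a single copy. */
int merged_nodes(int ranks, int per_rank, int unique) {
  assert(ranks > 0 && unique >= 0 && unique <= per_rank);
  int shared = per_rank - unique;   /* kept once in the fused output */
  return shared + ranks * unique;   /* unique nodes are kept once per rank */
}

/* Largest per-rank `unique` count whose merged total still fits in `cap`,
 * or -1 if even perfect dedup (unique == 0) does not fit. */
int max_unique_within_cap(int ranks, int per_rank, int cap) {
  int best = -1;
  for (int u = 0; u <= per_rank; u++) {
    if (merged_nodes(ranks, per_rank, u) <= cap) best = u;
  }
  return best;
}
```

With the figures from this report (8 ranks, ~126 nodes each, cap 256), `max_unique_within_cap(8, 126, 256)` returns 18, so under this model the fuse overflows as soon as more than 18 of each rank's nodes fail to dedup. With no dedup at all, `merged_nodes(8, 126, 126)` gives 1008.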

## Reproducer

- Hardware: AWS p5.48xlarge (8 GPUs, 32 EFA NICs)
- NCCL: 2.30.4 (`nvidia-nccl-cu13==2.30.4` pip wheel)
- Plugin: stock aws-ofi-nccl v1.19.1, no topology patches, no `NCCL_TOPO_FILE` override
- Workload: DeepSeek DeepEP V2 `tests/elastic/test_ep.py`. We expect any workload that exercises the same intra-node XML fusion path on this host shape to reproduce, but we have not attempted a minimal `nccl-tests` repro.

The overflow fires during early init, before any collective.

## Observation vs. inferred mechanism

**Observed:** The overflow reproduces consistently on this host shape with the stock setup above. We additionally tried supplying an identical 122-node `NCCL_TOPO_FILE` to every rank; the overflow still fires. We also tried narrowing visible EFA devices via libfabric (`FI_EFA_DEVICE_LIST`); the overflow still fires, because NCCL walks `/sys/class/pci_bus/*` independently of what libfabric exposes.

**Inferred (from reading the 2.30.4 source, not a live-debugger trace):** The non-MNNVL branch of `ncclTopoGetSystem` reuses the initial 256-slot XML buffer for the fused output rather than allocating a larger one, as the MNNVL branch does. This path therefore relies on the `xmlFindNode` dedup in `ncclTopoFuseXmlRecursive` to keep the merged buffer within 256 nodes. `xmlFindNode` (static inline in `src/graph/xml.h`) requires an equal `nAttrs` and a match on every attribute key/value pair, so attributes generated per rank on the `gpu`/`pci` nodes may be preventing otherwise-equivalent subtrees from collapsing. We did not step through a live debugger to confirm which specific attribute difference blocks which dedup call.
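The suspected failure mode can be illustrated with a toy model of a strict attribute match. This is not the NCCL implementation: `toy_node`, `toy_nodes_match`, and `toy_fuse` are hypothetical names, the node-name comparison is our assumption, and the `busid`/`rank` attributes in the usage note are only illustrative. The sketch shows how a single differing attribute value, or a differing attribute count, makes an otherwise-identical node take a new slot.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_ATTRS 4

/* Hypothetical stand-in for an XML node; not the real ncclXmlNode layout. */
struct toy_node {
  const char *name;
  int n_attrs;
  const char *keys[TOY_MAX_ATTRS];
  const char *vals[TOY_MAX_ATTRS];
};

/* Strict match: same name, same attribute count, and every key/value pair
 * of `a` present with the same value in `b` (attribute order is ignored). */
int toy_nodes_match(const struct toy_node *a, const struct toy_node *b) {
  if (strcmp(a->name, b->name) != 0 || a->n_attrs != b->n_attrs) return 0;
  for (int i = 0; i < a->n_attrs; i++) {
    int found = 0;
    for (int j = 0; j < b->n_attrs; j++) {
      if (strcmp(a->keys[i], b->keys[j]) == 0 &&
          strcmp(a->vals[i], b->vals[j]) == 0) {
        found = 1;
        break;
      }
    }
    if (!found) return 0;
  }
  return 1;
}

/* Append `src` to dst[0..*count) unless a strictly matching node is already
 * there. Returns 0 on success, -1 if the fixed-size destination is full. */
int toy_fuse(struct toy_node *dst, int *count, int cap,
             const struct toy_node *src) {
  for (int i = 0; i < *count; i++) {
    if (toy_nodes_match(&dst[i], src)) return 0;  /* deduplicated */
  }
  if (*count >= cap) return -1;                   /* buffer cap reached */
  dst[(*count)++] = *src;                         /* shallow copy of the node */
  return 0;
}
```

For example, two `pci` nodes that share `busid="0000:53:00.0"` but carry `rank="0"` and `rank="1"` respectively occupy two slots after `toy_fuse`, while a second copy with identical attributes occupies none.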

## Candidate fixes

**(a) Allocate a larger destination XML buffer in the non-MNNVL intra-node fuse path** (analogous to the MNNVL path). Architecturally symmetric.

**(b) Bump `NCCL_TOPO_XML_MAX_NODES`** (the one-line `#define` change). Simpler but increases the buffer footprint for every topology-load path.

**(c) Fix the dedup** in `xmlFindNode` / `ncclTopoFuseXmlRecursive` so rank-specific attributes do not block collapse of otherwise-identical subtrees. We are not familiar enough with the invariants your dedup code relies on to recommend a specific change here.

We tested option (b) locally (`#define` 256 -> 2048) across three DeepEP-based images on the reproducer hardware: 351 combined `test_ep.py` config runs produced zero `too many XML nodes` warnings. That validates option (b) works for our workload; it does not say anything about the structural correctness of other paths in NCCL that allocate XML buffers.

## Happy to send a minimal PR

If the maintainers prefer option (a), I am happy to attempt a PR, though I recognize that dynamic allocation in the fuse path may require touching the underlying `ncclXml` struct if it embeds a fixed-size array. If you prefer option (b), it is a one-line bump. Either way, I'd rather not guess which shape you'd accept without a maintainer nudge.
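As a rough illustration of what option (a) could look like, the sketch below sizes a destination buffer for the worst case of no dedup across local ranks. Everything here is hypothetical: `toy_xml`, `toy_xml_alloc`, and `toy_fuse_dst_alloc` are made-up names, and the flexible-array layout is an assumption for the example, not a description of the real `ncclXml` struct or of NCCL's allocation helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node payload; the real node type carries much more. */
struct toy_xml_node {
  char name[32];
};

/* Hypothetical container whose capacity is chosen at allocation time. */
struct toy_xml {
  int max_nodes;                  /* capacity of nodes[] */
  int used;                       /* slots currently filled */
  struct toy_xml_node nodes[];    /* C99 flexible array member */
};

/* Allocate a zeroed container with room for `max_nodes` nodes.
 * Returns NULL on invalid size or allocation failure. */
struct toy_xml *toy_xml_alloc(int max_nodes) {
  if (max_nodes <= 0) return NULL;
  size_t bytes = sizeof(struct toy_xml) +
                 (size_t)max_nodes * sizeof(struct toy_xml_node);
  struct toy_xml *xml = calloc(1, bytes);
  if (xml != NULL) xml->max_nodes = max_nodes;
  return xml;
}

/* Size the fused destination for the worst case: every local rank fills its
 * own per-rank cap and nothing dedups. Caller releases it with free(). */
struct toy_xml *toy_fuse_dst_alloc(int local_ranks, int per_rank_max) {
  if (local_ranks <= 0 || per_rank_max <= 0) return NULL;
  return toy_xml_alloc(local_ranks * per_rank_max);
}
```

For 8 local ranks and a 256-node per-rank cap, `toy_fuse_dst_alloc(8, 256)` yields 2048 slots, which happens to equal the value we tested locally for option (b).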

## Context / incident writeup

Full incident writeup with raw logs, controlled experiments, and the source-reading chain:
https://github.com/antonai-work/deepep-v2-efa-base/blob/main/docs/INCIDENT-PR1226-AWS-OFI-NCCL-XML-OVERFLOW.md

Thanks for your work on NCCL.


NCCL_TOPO_XML_MAX_NODES=256 limit hit during intra-node XML fusion on 32-NIC hosts (AWS p5.48xlarge) #2160
