Issue: hit NCCL_TOPO_XML_MAX_NODES=256 limit during intra-node XML fusion on 32-NIC hosts (AWS p5.48xlarge)
Summary
On AWS p5.48xlarge (8 x H100, 32 EFA NICs), ncclTopoFuseXml hits its 256-node XML buffer cap while fusing the 8 local-rank XMLs at initialization. The error fires as:
graph/xml.h: NCCL WARN Error : too many XML nodes (max 256)
Each rank autogenerates ~126 XML nodes from its /sys/class/pci_bus/* walk on this host shape; per-rank dedup during ncclTopoFuseXmlRecursive is insufficient to keep the merged output within the 256-node cap on our reproducer.
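To make the budget concrete, here is a back-of-the-envelope sketch of the fused node count. It is not NCCL code: it assumes every node is either shared by all 8 ranks or unique to exactly one rank, and it treats the ~126 figure as exact. The function name is ours.

```python
# Rough model of the fused node count (illustrative only, not NCCL code).
MAX_NODES = 256        # cap reported in the warning text
LOCAL_RANKS = 8        # one rank per GPU on p5.48xlarge
NODES_PER_RANK = 126   # approximate per-rank node count reported above


def fused_node_count(shared_per_rank: int) -> int:
    """Fused size if `shared_per_rank` nodes collapse across all ranks and
    every remaining node is unique to its rank (a modeling assumption)."""
    unique_per_rank = NODES_PER_RANK - shared_per_rank
    return shared_per_rank + LOCAL_RANKS * unique_per_rank


no_dedup = fused_node_count(0)                    # nothing collapses
perfect_dedup = fused_node_count(NODES_PER_RANK)  # everything collapses

# Smallest number of shared nodes per rank that still fits in the cap.
min_shared = next(
    s for s in range(NODES_PER_RANK + 1) if fused_node_count(s) <= MAX_NODES
)

print(no_dedup, perfect_dedup, min_shared, NODES_PER_RANK - min_shared)
# prints: 1008 126 108 18
```

Under these assumptions, the fused tree only fits if each rank contributes at most about 18 nodes that fail to dedup against the other ranks.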
Reproducer
- Hardware: AWS p5.48xlarge (8 GPUs, 32 EFA NICs)
- NCCL: 2.30.4 (nvidia-nccl-cu13==2.30.4 pip wheel)
- Plugin: stock aws-ofi-nccl v1.19.1, no topology patches, no NCCL_TOPO_FILE override
- Workload: DeepSeek DeepEP V2 tests/elastic/test_ep.py. We expect any workload that exercises the same intra-node XML fusion path on this host shape to reproduce, but we have not attempted a minimal nccl-tests repro.
The overflow fires during early init, before any collective.
Observation vs. inferred mechanism
Observed: The overflow reproduces consistently on this host shape with the stock setup above. We additionally tried supplying an identical 122-node NCCL_TOPO_FILE to every rank; the overflow still fires. We also tried narrowing visible EFA devices via libfabric (FI_EFA_DEVICE_LIST); the overflow still fires, because NCCL walks /sys/class/pci_bus/* independently of what libfabric exposes.
Inferred (from reading the 2.30.4 source, not a live-debugger trace): The non-MNNVL branch of ncclTopoGetSystem reuses the initial 256-slot XML buffer for the fused output rather than allocating a larger buffer (the MNNVL branch does allocate a larger buffer). This path relies on xmlFindNode dedup in ncclTopoFuseXmlRecursive to keep the merged buffer within 256. xmlFindNode (static inline in src/graph/xml.h) requires nAttrs match plus every attribute key/value match, and per-rank generated attrs on the gpu/pci nodes may be preventing sufficient collapse of otherwise-equivalent subtrees. We did not step through a live debugger to confirm which specific attribute difference blocks which dedup call.
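The inferred mechanism can be illustrated with a toy model. This is a flat Python sketch of the matching rule as we read it (same tag, same attribute count, every key/value equal), not a port of xmlFindNode or ncclTopoFuseXmlRecursive, which operate recursively on trees. The attribute name seen_by is hypothetical and stands in for whatever per-rank attribute may differ; we have not identified the real one.

```python
from typing import Callable, Dict, List, Set

Node = Dict[str, object]  # {"name": str, "attrs": Dict[str, str]}


def strict_equal(a: Node, b: Node) -> bool:
    """Toy version of the strict rule: tag, attr count, and all pairs match."""
    return (
        a["name"] == b["name"]
        and len(a["attrs"]) == len(b["attrs"])
        and all(b["attrs"].get(k) == v for k, v in a["attrs"].items())
    )


def relaxed_equal(a: Node, b: Node, ignore: Set[str]) -> bool:
    """Hypothetical relaxed rule: compare attrs after dropping `ignore` keys."""
    def strip(n: Node) -> Dict[str, str]:
        return {k: v for k, v in n["attrs"].items() if k not in ignore}
    return a["name"] == b["name"] and strip(a) == strip(b)


def fuse(rank_nodes: List[List[Node]],
         equal: Callable[[Node, Node], bool]) -> List[Node]:
    """Merge per-rank node lists, keeping a node only if no equal one exists."""
    fused: List[Node] = []
    for nodes in rank_nodes:
        for node in nodes:
            if not any(equal(node, seen) for seen in fused):
                fused.append(node)
    return fused


# Eight ranks describe the same PCI device but tag it with a per-rank attr.
rank_xmls = [
    [{"name": "pci", "attrs": {"busid": "0000:10:00.0", "seen_by": str(r)}}]
    for r in range(8)
]

strict_count = len(fuse(rank_xmls, strict_equal))
relaxed_count = len(
    fuse(rank_xmls, lambda a, b: relaxed_equal(a, b, {"seen_by"}))
)
print(strict_count, relaxed_count)  # prints: 8 1
```

In this model a single differing attribute is enough to keep eight copies of an otherwise identical node, which is the kind of blow-up we suspect but have not confirmed in the real fuse path.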
Candidate fixes
(a) Allocate a larger destination XML buffer in the non-MNNVL intra-node fuse path (analogous to the MNNVL path). Architecturally symmetric.
(b) Bump NCCL_TOPO_XML_MAX_NODES (the one-line #define change). Simpler but increases the buffer footprint for every topology-load path.
(c) Fix the dedup in xmlFindNode / ncclTopoFuseXmlRecursive so rank-specific attributes do not block collapse of otherwise-identical subtrees. We are not familiar enough with the invariants your dedup code relies on to recommend a specific change here.
We tested option (b) locally (#define 256 -> 2048) across three DeepEP-based images on the reproducer hardware: 351 combined test_ep.py config runs produced zero too many XML nodes warnings. That validates option (b) works for our workload; it does not say anything about the structural correctness of other paths in NCCL that allocate XML buffers.
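For anyone repeating this check, a minimal sketch of how the warning can be counted across run logs follows. The log directory layout, the *.log pattern, and the function names are assumptions for illustration; only the matched substring comes from the warning quoted in the summary.

```python
from pathlib import Path
from typing import Dict

# Substring of the warning quoted in the summary above.
OVERFLOW_MARKER = "too many XML nodes"


def count_overflow_warnings(log_text: str) -> int:
    """Count the lines in one log that contain the overflow warning."""
    return sum(1 for line in log_text.splitlines() if OVERFLOW_MARKER in line)


def scan_log_dir(log_dir: str, pattern: str = "*.log") -> Dict[str, int]:
    """Map each log file name under `log_dir` to its warning count.
    The flat one-file-per-run layout is an assumption, not our actual setup."""
    return {
        path.name: count_overflow_warnings(path.read_text(errors="replace"))
        for path in sorted(Path(log_dir).glob(pattern))
    }
```

A run set is clean when every value returned by scan_log_dir is zero.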
Happy to send a minimal PR
If the maintainers prefer option (a), I am happy to attempt a PR, though I recognize dynamic allocation in the fuse path may require touching the underlying ncclXml struct if it embeds a fixed-size array. If you prefer option (b), it is a one-line bump. Either way, I'd rather not guess which shape you'd accept without a maintainer nudge.
Context / incident writeup
Full incident writeup with raw logs, controlled experiments, and the source-reading chain:
https://github.com/antonai-work/deepep-v2-efa-base/blob/main/docs/INCIDENT-PR1226-AWS-OFI-NCCL-XML-OVERFLOW.md
Thanks for your work on NCCL.