Please provide the below details to ensure we understand your needs
What is the goal of this request?
Replace assert() statements with NCCL's proper error handling mechanisms (e.g., WARN() + ncclResult_t return codes).
Issues with current assert() usage:
- Causes immediate process crash without graceful degradation
- Provides minimal error context (no actual values logged)
- Not integrated with NCCL's logging infrastructure
Example locations (NCCL 2.30.4):
src/transport/coll_net.cc:1316-1317
src/transport/net_ib/p2p_resiliency.cc:78
Who will benefit from this feature?
Production deployments, system administrators, and developers who need better diagnostics and error recovery.
Is this request for a specific GPU architecture or network infrastructure?
No, this is architecture-agnostic.
How will this feature improve current workflows or processes?
Current:
assert(reqSize == sizeof(struct collnetRegInfo));
- Crashes immediately
- No diagnostic information
Proposed:
if (reqSize != sizeof(struct collnetRegInfo)) {
  WARN("Size mismatch: expected %zu, got %zu",
       sizeof(struct collnetRegInfo), reqSize);
  return ncclInternalError;
}
- Detailed error messages with actual values
- Allows graceful error propagation
- Integrated with NCCL logging
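To make the proposed pattern concrete, here is a minimal, standalone sketch of how a size check could log and return an error that the caller then propagates. It is only an illustration under stated assumptions: the `ncclResult_t` enum, the `WARN` macro, and the `collnetRegInfo` layout below are simplified stand-ins so the snippet compiles outside the NCCL tree, and `checkRegRequestSize` and `handleRegRequest` are hypothetical names, not existing NCCL functions.

```cpp
#include <cstddef>
#include <cstdio>

// Simplified stand-ins (assumptions) for definitions that NCCL provides itself.
enum ncclResult_t { ncclSuccess = 0, ncclInternalError = 3 };

// Stand-in for NCCL's WARN(): prints a formatted message to stderr.
#define WARN(...)                           \
  do {                                      \
    std::fprintf(stderr, "NCCL WARN ");     \
    std::fprintf(stderr, __VA_ARGS__);      \
    std::fprintf(stderr, "\n");             \
  } while (0)

// Hypothetical layout; the real struct lives in the NCCL sources.
struct collnetRegInfo {
  void* buffer;
  size_t size;
};

// Hypothetical helper: validates the request size and reports both the
// expected and the actual value instead of aborting the process.
static ncclResult_t checkRegRequestSize(size_t reqSize) {
  if (reqSize != sizeof(struct collnetRegInfo)) {
    WARN("Size mismatch: expected %zu, got %zu",
         sizeof(struct collnetRegInfo), reqSize);
    return ncclInternalError;
  }
  return ncclSuccess;
}

// Hypothetical caller: forwards the error code so upper layers can decide
// how to react, rather than crashing at the point of detection.
static ncclResult_t handleRegRequest(size_t reqSize) {
  ncclResult_t res = checkRegRequestSize(reqSize);
  if (res != ncclSuccess) return res;
  // ... the actual registration work would go here ...
  return ncclSuccess;
}
```

One detail to watch when converting real call sites: the `printf`-style format specifier has to match the actual type of the variable being logged (`%zu` is correct only for `size_t`), so each replacement should be checked against the declared type at that location.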
What is the priority level of this request?
Medium - Improves production reliability and debuggability without requiring API changes.