[RFE]: Replace assert() with proper error handling #2163

@LeviMee

Please provide the below details to ensure we understand your needs

What is the goal of this request?

Replace assert() statements with NCCL's proper error handling mechanisms (e.g., WARN() + ncclResult_t return codes).

Issues with current assert() usage:

  • Causes immediate process crash without graceful degradation
  • Provides minimal error context (no actual values logged)
  • Not integrated with NCCL's logging infrastructure

Example locations (NCCL 2.30.4):

  • src/transport/coll_net.cc:1316-1317
  • src/transport/net_ib/p2p_resiliency.cc:78

Who will benefit from this feature?

Production deployments, system administrators, and developers who need better diagnostics and error recovery.

Is this request for a specific GPU architecture or network infrastructure?

No, this is architecture-agnostic.

How will this feature improve current workflows or processes?

Current:

assert(reqSize == sizeof(struct collnetRegInfo));
  • Crashes immediately
  • No diagnostic information

Proposed:

if (reqSize != sizeof(struct collnetRegInfo)) {
  WARN("Size mismatch: expected %zu, got %zu",
       sizeof(struct collnetRegInfo), (size_t)reqSize);
  return ncclInternalError;
}
  • Detailed error messages with actual values
  • Allows graceful error propagation
  • Integrated with NCCL logging
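
To show how the returned code would travel up the call chain, here is a minimal, self-contained sketch of the check-and-propagate pattern. It does not use NCCL headers: `sketchResult_t`, `SKETCH_WARN`, `SKETCH_CHECK`, `sketchRegInfo`, `validateReqSize`, and `handleRegRequest` are all hypothetical stand-ins invented for illustration, and the `int reqSize` parameter is an assumption about the handler signature rather than something confirmed here.

```cpp
#include <cstddef>
#include <cstdio>

// Stand-in result type for this sketch only (not NCCL's ncclResult_t).
enum sketchResult_t { sketchSuccess, sketchInternalError };

// Hypothetical logging macro standing in for WARN(): prints file, line, and
// a formatted message to stderr instead of aborting the process.
#define SKETCH_WARN(fmt, ...) \
  std::fprintf(stderr, "%s:%d WARN " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)

// Hypothetical propagate-on-error helper: returns early from the enclosing
// function if the wrapped call did not succeed.
#define SKETCH_CHECK(call)                      \
  do {                                          \
    sketchResult_t res_ = (call);               \
    if (res_ != sketchSuccess) return res_;     \
  } while (0)

// Placeholder payload; the real struct layout is not assumed here.
struct sketchRegInfo {
  void* buffer;
  std::size_t size;
};

// Replaces an assert-style check: logs both expected and actual sizes,
// then reports the failure through the return code.
static sketchResult_t validateReqSize(int reqSize) {
  if (reqSize < 0 ||
      static_cast<std::size_t>(reqSize) != sizeof(sketchRegInfo)) {
    SKETCH_WARN("Size mismatch: expected %zu, got %d",
                sizeof(sketchRegInfo), reqSize);
    return sketchInternalError;
  }
  return sketchSuccess;
}

// A caller that forwards the error instead of crashing. *handled is set to
// true only if validation passed and the (omitted) work was reached.
static sketchResult_t handleRegRequest(int reqSize, bool* handled) {
  *handled = false;
  SKETCH_CHECK(validateReqSize(reqSize));
  // ... the actual registration work would go here ...
  *handled = true;
  return sketchSuccess;
}
```

Calling `handleRegRequest` with a wrong size logs one warning line containing both sizes and returns `sketchInternalError` with `*handled` left false. The process keeps running, so the caller can decide whether to tear down the communicator or retry.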

What is the priority level of this request?

Medium - Improves production reliability and debuggability without requiring API changes.
