This PR introduces a specialized skill definition for NCCL to help developers and AI assistants navigate the complexities of multi-GPU communication and cluster-level performance tuning.
Description
This PR adds nccl-expert to SKILL.md. This is an initial trial to explore how structured skill definitions can assist in managing NVIDIA Collective Communications Library (NCCL) configurations. It provides a foundational knowledge base for:
Optimizing collective primitives (AllReduce, AllGather).
Configuring high-bandwidth interconnects (InfiniBand, RoCE, NVLink).
Debugging cluster-wide hangs and topology bottlenecks.
Leveraging the latest features for modern architectures such as Blackwell (SM100).
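As a rough illustration of what "optimizing collective primitives" is usually measured against, the sketch below converts a timed AllReduce or AllGather into the algorithm and bus bandwidth figures that the nccl-tests benchmarks report. The function name and arguments are hypothetical and not part of the skill definition; the correction factors follow the nccl-tests performance notes, where `size_bytes` is the total buffer size for the collective.

```python
def bus_bandwidth_gbs(collective: str, size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Estimate bus bandwidth in GB/s for a timed collective (hypothetical helper)."""
    if n_ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")

    # Algorithm bandwidth: bytes processed per second, in GB/s.
    algbw = size_bytes / seconds / 1e9

    # Factors that make the number comparable to hardware link bandwidth
    # regardless of rank count (as described in the nccl-tests docs).
    factors = {
        "allreduce": 2 * (n_ranks - 1) / n_ranks,
        "allgather": (n_ranks - 1) / n_ranks,
    }
    try:
        return algbw * factors[collective.lower()]
    except KeyError:
        raise ValueError(f"unsupported collective: {collective}") from None


if __name__ == "__main__":
    # 1 GB AllReduce across 8 ranks that took 100 ms.
    print(bus_bandwidth_gbs("allreduce", 1e9, 0.1, 8))
```

Comparing this figure with the nominal NVLink or NIC bandwidth gives a quick sense of whether a tuning change helped.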
Related Issues
This is an exploratory PR aimed at improving developer experience for distributed training and inference. It serves as a companion to the NIXL SKILL.md trial to see how these definitions work across different communication libraries.
Changes & Impact
New File/Section: Adds nccl-expert skill definition.
Guideline Standardization: Formally documents critical environment variables (e.g., NCCL_P2P_LEVEL, NCCL_IB_GID_INDEX) and debugging workflows.
Impact: There is no impact on the NCCL binary or runtime. This is a documentation-only change intended to improve the quality of AI-generated configuration advice and manual troubleshooting.
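To make the environment-variable guidance concrete, here is a minimal sketch of how a launcher script might assemble NCCL settings before starting a job. The helper `build_nccl_env` is hypothetical and not part of this PR, and the default values (`NVL`, GID index `3`, `INFO`) are placeholders; the right GID index in particular depends on the host's RoCE configuration.

```python
import os

# String values NCCL accepts for these variables.
P2P_LEVELS = {"LOC", "NVL", "PIX", "PXB", "PHB", "SYS"}
DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def build_nccl_env(base=None, p2p_level="NVL", ib_gid_index=3, debug="INFO"):
    """Return a copy of `base` (default: os.environ) with NCCL settings applied.

    Hypothetical helper for illustration; defaults are placeholders only.
    """
    if p2p_level not in P2P_LEVELS:
        raise ValueError(f"NCCL_P2P_LEVEL must be one of {sorted(P2P_LEVELS)}")
    if debug not in DEBUG_LEVELS:
        raise ValueError(f"NCCL_DEBUG must be one of {sorted(DEBUG_LEVELS)}")

    env = dict(os.environ if base is None else base)
    env.update(
        {
            "NCCL_P2P_LEVEL": p2p_level,               # max topology distance for GPU peer-to-peer
            "NCCL_IB_GID_INDEX": str(int(ib_gid_index)),  # GID table entry used for RoCE
            "NCCL_DEBUG": debug,                       # log verbosity, useful when chasing hangs
        }
    )
    return env


if __name__ == "__main__":
    env = build_nccl_env()
    for key in ("NCCL_P2P_LEVEL", "NCCL_IB_GID_INDEX", "NCCL_DEBUG"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training command, which keeps the settings scoped to that job rather than the whole shell.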
Performance Impact
Direct Impact: None on the library code itself.
Indirect Impact: Expected to be positive. Following the documented "Protocol Selection" and "Blackwell Optimization" guidelines should help users reduce latency and increase throughput in multi-node environments.
Testing: Validated against common failure scenarios (RoCE v2 route ambiguity and NVLink topology mismatches) to ensure the troubleshooting steps provided in the skill are actionable.