@zhenghax

This PR introduces a specialized skill definition for NCCL to help developers and AI assistants navigate the complexities of multi-GPU communication and cluster-level performance tuning.

Description
This PR adds nccl-expert to SKILL.md. This is an initial trial to explore how structured skill definitions can assist in managing NVIDIA Collective Communications Library (NCCL) configurations. It provides a foundational knowledge base for:
- Optimizing collective primitives (AllReduce, AllGather).
- Configuring high-bandwidth interconnects (InfiniBand, RoCE, NVLink).
- Debugging cluster-wide hangs and topology bottlenecks.
- Leveraging the latest features for modern architectures such as Blackwell (SM100).
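As an illustrative sketch of the kind of guidance such a skill might encode (not an excerpt from the skill file itself), the debugging and interconnect items above map onto a handful of well-known NCCL environment variables; the specific values below are assumptions for a hypothetical multi-node NVLink/InfiniBand cluster, not recommendations from this PR:

```shell
# Hypothetical tuning sketch; the values are illustrative assumptions.

# Turn on NCCL logging to verify topology detection at startup.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Allow direct GPU peer-to-peer transfers over NVLink where available.
export NCCL_P2P_LEVEL=NVL

# Keep the InfiniBand verbs transport enabled; setting this to 1 forces
# a socket fallback, which can help isolate suspected fabric issues.
export NCCL_IB_DISABLE=0
```

In practice these would be set in the job launcher (e.g., an mpirun or torchrun wrapper script) so every rank sees the same configuration.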

Related Issues
This is an exploratory PR aimed at improving developer experience for distributed training and inference. It serves as a companion to the NIXL SKILL.md trial to see how these definitions work across different communication libraries.

Changes & Impact
- New File/Section: Adds the nccl-expert skill definition.
- Guideline Standardization: Formally documents critical environment variables (e.g., NCCL_P2P_LEVEL, NCCL_IB_GID_INDEX) and debugging workflows.
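To give a concrete sense of the two variables named above, a minimal sketch follows; the GID index shown is an assumption (on many RoCE v2 fabrics the IPv4-mapped GID sits at index 3, but this must be verified per cluster, e.g. with the `show_gids` utility):

```shell
# Illustrative sketch only; verify the correct GID index for your
# fabric before setting it -- index 3 is a common RoCE v2 default,
# not a universal one.
export NCCL_IB_GID_INDEX=3

# Restrict peer-to-peer transfers to GPUs under the same PCIe host
# bridge (other levels include LOC, NVL, PIX, PXB, SYS).
export NCCL_P2P_LEVEL=PHB
```

Pinning the GID index explicitly is exactly the kind of step that resolves RoCE v2 route ambiguity, where NCCL may otherwise pick a GID that routes over the wrong VLAN or IP version.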

Impact: There is no impact on the NCCL binary or runtime. This is a documentation-only change intended to improve the quality of AI-generated configuration advice and manual troubleshooting.

Performance Impact
- Direct Impact: None on the library code itself.
- Indirect Impact: Positive. By following the documented "Protocol Selection" and "Blackwell Optimization" guidelines, users can significantly reduce latency and increase throughput in multi-node environments.
- Testing: Validated against common failure scenarios (RoCE v2 route ambiguity and NVLink topology mismatches) to ensure the troubleshooting steps provided in the skill are actionable.
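A troubleshooting workflow of the kind referenced above can be sketched as a shell fragment; the variables are real NCCL knobs, but the chosen values and the isolation strategy are illustrative assumptions rather than text from the skill:

```shell
# Hypothetical hang-isolation sketch; values are illustrative.

# 1. Reproduce the hang with full logging to find the last
#    collective that completed before the stall.
export NCCL_DEBUG=INFO

# 2. Pin a single algorithm/protocol pair to rule out a bad
#    autotuning choice (valid protocols: LL, LL128, Simple).
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple

# 3. If the hang disappears, restore autotuning and compare bus
#    bandwidth with nccl-tests (e.g., all_reduce_perf) to confirm
#    no regression was masked by the pinned configuration.
unset NCCL_ALGO NCCL_PROTO
```

The point of step 2 is bisection: fixing one variable at a time narrows whether the failure lives in transport selection, protocol tuning, or the fabric itself.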
