Create SKILL.md for NCCL #2004

zhenghax · 2026-01-27T22:44:01Z

This PR introduces a specialized skill definition for NCCL to help developers and AI assistants navigate the complexities of multi-GPU communication and cluster-level performance tuning.

Description
This PR adds the nccl-expert skill definition as a new SKILL.md file. It is an initial trial to explore how structured skill definitions can assist in managing NVIDIA Collective Communications Library (NCCL) configurations. It provides a foundational knowledge base for:
Optimizing collective primitives (AllReduce, AllGather).
Configuring high-bandwidth interconnects (InfiniBand, RoCE, NVLink).
Debugging cluster-wide hangs and topology bottlenecks.
Leveraging the latest features for modern architectures such as Blackwell (SM100).
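
As an illustration of the hang-debugging item above, the following sketch builds an environment that enables NCCL's own logging before a job is launched. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are documented NCCL environment variables; the helper name `nccl_debug_env`, the default subsystems, and the log path are assumptions made for illustration and are not taken from the SKILL.md in this PR.

```python
import os

# Subset of the subsystem names accepted by NCCL_DEBUG_SUBSYS.
KNOWN_SUBSYS = {"INIT", "COLL", "P2P", "SHM", "NET",
                "GRAPH", "TUNING", "ENV", "ALLOC", "ALL"}


def nccl_debug_env(subsys=("INIT", "NET"),
                   log_file="/tmp/nccl.%h.%p.log",
                   base=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    unknown = [s for s in subsys if s not in KNOWN_SUBSYS]
    if unknown:
        raise ValueError(f"unknown NCCL debug subsystem(s): {unknown}")
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = ",".join(subsys)
    # NCCL expands %h to the hostname and %p to the process ID,
    # which gives one log file per rank.
    env["NCCL_DEBUG_FILE"] = log_file
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a training or benchmark command, so the logging settings apply only to that launch.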

Related Issues
This is an exploratory PR aimed at improving developer experience for distributed training and inference. It serves as a companion to the NIXL SKILL.md trial to see how these definitions work across different communication libraries.

Changes & Impact
New File/Section: Adds nccl-expert skill definition.
Guideline Standardization: Formally documents critical environment variables (e.g., NCCL_P2P_LEVEL, NCCL_IB_GID_INDEX) and debugging workflows.

Impact: There is no impact on the NCCL binary or runtime. This is a documentation-only change intended to improve the quality of AI-generated configuration advice and manual troubleshooting.
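
To make the environment-variable item concrete, here is a minimal sketch of a pre-launch sanity check for the two variables named above. The accepted `NCCL_P2P_LEVEL` strings (LOC, NVL, PIX, PXB, PHB, SYS) follow the NCCL documentation; the function name `check_nccl_env` and the decision to reject legacy numeric levels are assumptions of this sketch, not behavior defined by NCCL or by the SKILL.md.

```python
# String values documented for NCCL_P2P_LEVEL.
P2P_LEVELS = ("LOC", "NVL", "PIX", "PXB", "PHB", "SYS")


def check_nccl_env(env):
    """Return a list of problems found in an environment mapping.

    Hypothetical helper: an empty list means nothing suspicious was found.
    """
    problems = []

    p2p = env.get("NCCL_P2P_LEVEL")
    if p2p is not None and p2p.upper() not in P2P_LEVELS:
        # Assumption: this sketch deliberately ignores legacy numeric levels.
        problems.append(f"NCCL_P2P_LEVEL={p2p!r} is not one of {P2P_LEVELS}")

    gid = env.get("NCCL_IB_GID_INDEX")
    if gid is not None:
        try:
            int(gid)
        except ValueError:
            problems.append(f"NCCL_IB_GID_INDEX={gid!r} is not an integer")

    return problems
```

Calling `check_nccl_env(os.environ)` before launching a job would surface typos in these settings early, before they show up as unexpected transport choices at runtime.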

Performance Impact
Direct Impact: None on the library code itself.
Indirect Impact: Potentially positive. Users who follow the documented "Protocol Selection" and "Blackwell Optimization" guidelines may see lower latency and higher throughput in multi-node environments.
Testing: Validated against common failure scenarios (RoCE v2 route ambiguity and NVLink topology mismatches) to ensure the troubleshooting steps provided in the skill are actionable.

.github/skills/nccl-experts/SKILL.md

zhenghax added 2 commits January 27, 2026 14:35

Create SKILL.md

8f2a011

Update SKILL.md

dd8ac80

chenhengqi reviewed Jan 29, 2026

.github/skills/nccl-experts/SKILL.md (outdated, resolved)

chenhengqi reviewed Jan 29, 2026

.github/skills/nccl-experts/SKILL.md (outdated, resolved)

Update SKILL.md

c1dd4e8


2 participants
