A high-performance, hardware-agnostic 3D Sparse Convolution library implemented purely in Triton. This project aims to provide a seamless torch.nn compatible interface that runs on any device supported by the Triton compiler (NVIDIA, AMD, etc.), eliminating the dependency on proprietary CUDA/C++ extensions.
- Vendor Agnostic: Zero CUDA C++ code. Fully compatible with NVIDIA and AMD (ROCm) via Triton.
- Memory Efficiency: Utilizes sparse data structures to handle large-scale 3D point clouds or medical volumes.
- Performance: Highly optimized kernels for Gather-GEMM-Scatter operations.
This project is currently under active development. While core functionalities are implemented, some features may be unstable, undocumented, or subject to change. Performance optimizations and comprehensive testing are ongoing. Your contributions and feedback are highly welcome!
- Submanifold Sparse Convolution: Preserves input sparsity patterns for deep architectures.
- Standard Sparse Convolution: Supports stride and padding for downsampling.
- GPU-based Rule Generation: Fast index mapping and rulebook generation using Triton-based hashing.
- Autograd Support: Full backward pass implementation for end-to-end training.
- Coordinate Hashing: Map 3D coordinates to linear indices using a parallel hash table.
- Rulebook Generation: Identify active neighbor pairs for each kernel offset.
- Gather-GEMM-Scatter:
- Gather: Collect features based on the rulebook.
- GEMM: Perform matrix multiplication using Triton's fused kernels.
- Scatter: Distribute results back to the output sparse tensor.
- Backward pass weight gradient incorrect (99.5% mismatch) -
DEBUG_SUMMARY.md - Transposed convolution forward fails for stride=2 cases
- Pooling operations fail on CPU (atomic operations not supported)
- CPU implementation or fallback (many tests skipped on CPU)
- Autotuner config tuning (C_in=1, 64 show issues)
- Memory efficiency for large-scale point clouds
- More activation functions
- Batch normalization for CPU
- Documentation & API reference