Context
CUDA LU solve is 3.7ms vs cuDSS 0.6ms (6x gap) on c6288 (25380x25380 circuit Jacobian). Factor is already at cuDSS parity (2.85ms vs 2.5ms).
Current solve architecture:
- Separate kernel dispatches for sparse-elim forward L / backward U phases
- Per-lump CPU iteration for 16 dense lumps
- Three flush() barriers (permutation → L solve → U solve)
- Multiple cuBLAS calls for dense GEMV/TRSV
Proposed approach
Implement a persistent kernel solve using cuBLASDx (device-side BLAS), following the cuDSS architecture from Sparse Days 2024:
- Single kernel launch for the entire triangular solve (forward L + backward U)
- Inter-CTA synchronization via atomicAdd on done[] counters — thread blocks spin-wait until their dependencies are satisfied, then immediately process their supernode
- cuBLASDx for device-side GEMV/TRSV — no separate cuBLAS dispatch overhead
- Level-set parallelism — independent supernodes at the same tree level processed by different thread blocks simultaneously
Key requirements
- cuBLASDx — compile-time template library for device-side BLAS
- NVIDIA forward-progress guarantee for resident thread blocks
- Pre-computed dependency graph (already available via LevelSetSchedule)
- Shared memory sizing for per-supernode TRSV/GEMV working storage
Expected improvement
- Solve: 3.7ms → <1ms (eliminate all kernel launch + cuBLAS dispatch overhead)
- Total LU: 6.5ms → ~4ms (approaching cuDSS's 3.1ms)
Notes
- This is CUDA-only; Metal lacks forward-progress guarantees and device-side BLAS
- The existing modular solve (Solver.cpp internalSolveLRangeUnit / internalSolveURange) should remain as a fallback for non-CUDA backends
- cuBLASDx requires specifying matrix sizes at compile time via templates — may need a few size specializations or runtime dispatch