Context
CUDA LU solve is 3.7ms vs cuDSS 0.6ms (6x gap) on c6288 (25380x25380 circuit Jacobian). Factor is already at cuDSS parity (2.85ms vs 2.5ms).
Current solve architecture:
- Separate kernel dispatches for sparse-elim forward L / backward U phases
- Per-lump CPU iteration for 16 dense lumps
- Three flush() barriers (permutation → L solve → U solve)
- Multiple cuBLAS calls for dense GEMV/TRSV
Proposed approach
Implement a persistent kernel solve using cuBLASDx (device-side BLAS), following the cuDSS architecture from Sparse Days 2024:
- Single kernel launch for the entire triangular solve (forward L + backward U)
- Inter-CTA synchronization via atomicAdd on done[] counters — thread blocks spin-wait until their dependencies are satisfied, then immediately process their supernode
- cuBLASDx for device-side GEMV/TRSV — no separate cuBLAS dispatch overhead
- Level-set parallelism — independent supernodes at the same tree level processed by different thread blocks simultaneously
Key requirements
- cuBLASDx — compile-time template library for device-side BLAS
- NVIDIA forward-progress guarantee for resident thread blocks
- Pre-computed dependency graph (already available via LevelSetSchedule)
- Shared memory sizing for per-supernode TRSV/GEMV working storage
Expected improvement
- Solve: 3.7ms → <1ms (eliminate all kernel launch + cuBLAS dispatch overhead)
- Total LU: 6.5ms → ~4ms (approaching cuDSS's 3.1ms)
Notes
- This is CUDA-only; Metal lacks forward-progress guarantees and device-side BLAS
- The existing modular solve (Solver.cpp internalSolveLRangeUnit / internalSolveURange) should remain as a fallback for non-CUDA backends
- cuBLASDx requires specifying matrix sizes at compile time via templates — may need a few size specializations or runtime dispatch