(Experimental) A high-throughput, memory-efficient inference and serving engine for LLMs, optimized for GB10 homelabs
nvidia cuda-kernels cutlass local-inference vllm llm-inference qwen paged-attention self-hosted-ai gb10 sm120 nvfp4 dgx-spark fp4-quantization attention-kernel fp8-kv-cache
Updated Apr 14, 2026 - Python