RTX 5090: 23x Performance Regression in vLLM GGUF Inference After nvidia-570 → nvidia-580 Upgrade
Summary
After upgrading from nvidia-driver-570 to nvidia-driver-580.105.08, vLLM inference throughput for Gemma 3 GGUF models dropped from 115 tok/s to 5 tok/s (a ~23x slowdown). The regression persists even with a PyTorch cu130 build matched to the driver's CUDA 13.0.
System Information
- GPU: NVIDIA GeForce RTX 5090 (Blackwell, 32GB VRAM)
- Driver: nvidia-driver-580.105.08 (previously nvidia-driver-570-open)
- OS: Ubuntu 24.04 LTS
- Kernel: 6.8.0-87-generic
- CUDA: 13.0
- PyTorch: 2.9.1+cu130 (also tested cu128)
- vLLM: 0.14.0rc1
Reproduction
- Install nvidia-driver-580.105.08 on an RTX 5090 system
- Run vLLM with the Gemma 3 12B GGUF model:

  python -m vllm.entrypoints.openai.api_server \
      --model unsloth/gemma-3-12b-it-GGUF \
      --dtype float32 \
      --quantization gguf
- Benchmark throughput (a minimal client sketch is included below)
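For reference, a throughput check along the following lines reproduces the measurement against the OpenAI-compatible server; the prompt, max_tokens, timeout, and port are illustrative assumptions rather than the exact harness used:

```python
# Minimal throughput check against the vLLM OpenAI-compatible server.
# Assumptions (not from the original benchmark): server on localhost:8000,
# a single fixed prompt, max_tokens=256, greedy decoding.
import time
import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "unsloth/gemma-3-12b-it-GGUF",
    "prompt": "Explain the difference between TCP and UDP.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

# The completions response includes a usage block with the generated token count.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s")
```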
Results
| Driver | Throughput | Notes |
|---|---|---|
| nvidia-570-open | 115 tok/s | Baseline (2026-01-10) |
| nvidia-580.105.08 | 5 tok/s | After upgrade (2026-01-12) |
| nvidia-580 + cu130 | 5 tok/s | Ruled out CUDA mismatch |
Kernel Logs
VF BAR assignment failures observed at boot:
pci 0000:02:00.0: VF BAR 2 [mem size 0x10000000 64bit pref]: failed to assign
pci 0000:02:00.0: VF BAR 4 [mem size 0x02000000 64bit pref]: failed to assign
No Xid errors observed.
What Was Ruled Out
- Flash Attention version: tested with VLLM_FLASH_ATTN_VERSION=2; no improvement
- CUDA version mismatch: built PyTorch 2.9.1+cu130 to match the driver's CUDA 13.0; no improvement (a verification sketch follows this list)
- Code changes: running code identical to the pre-upgrade setup reproduces the regression
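As a sanity check on the first two items, a snippet like the following (not part of the original report) prints the Flash Attention override and confirms which CUDA runtime the installed PyTorch build carries on the Blackwell device:

```python
# Assumed verification snippet: confirms the env override and CUDA/PyTorch alignment.
import os
import torch

print("VLLM_FLASH_ATTN_VERSION:", os.environ.get("VLLM_FLASH_ATTN_VERSION"))
print("torch:", torch.__version__)                          # e.g. 2.9.1+cu130
print("cuda runtime:", torch.version.cuda)                  # e.g. 13.0
print("device:", torch.cuda.get_device_name(0))             # NVIDIA GeForce RTX 5090
print("capability:", torch.cuda.get_device_capability(0))   # (12, 0) on Blackwell
```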
Affected Models
- Gemma 3 12B GGUF (Q5_K_M) - 115 → 5 tok/s
- Other GGUF models show similar regression
- Non-GGUF models (AWQ, fp16) less affected or improved
Expected Behavior
Throughput should remain consistent (~115 tok/s) after the driver upgrade.
Additional Context
- The regression specifically affects GGUF-quantized models on the Blackwell architecture
- vLLM logs show the model loads successfully, but inference is extremely slow
- No errors appear in vLLM output, only degraded performance