RTX 5090: 23x Performance Regression in vLLM GGUF Inference After nvidia-570 → nvidia-580 Upgrade
Summary
After upgrading from nvidia-driver-570 to nvidia-driver-580.105.08, vLLM inference throughput for Gemma 3 GGUF models dropped from 115 tok/s to 5 tok/s (a ~23x slowdown). The regression persists even with a PyTorch cu130 build matched to the driver's CUDA 13.0.
System Information
- GPU: NVIDIA GeForce RTX 5090 (Blackwell, 32GB VRAM)
- Driver: nvidia-driver-580.105.08 (previously nvidia-driver-570-open)
- OS: Ubuntu 24.04 LTS
- Kernel: 6.8.0-87-generic
- CUDA: 13.0
- PyTorch: 2.9.1+cu130 (also tested cu128)
- vLLM: 0.14.0rc1
Reproduction
- Install nvidia-driver-580.105.08 on an RTX 5090 system
- Run vLLM with the Gemma 3 12B GGUF model:

  python -m vllm.entrypoints.openai.api_server \
      --model unsloth/gemma-3-12b-it-GGUF \
      --dtype float32 \
      --quantization gguf
- Benchmark throughput (a minimal client sketch is included below)
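For reference, a throughput check along the following lines reproduces the measurement against the OpenAI-compatible server; the prompt, max_tokens, timeout, and port are illustrative assumptions rather than the exact harness used:

```python
# Minimal throughput check against the vLLM OpenAI-compatible server.
# Assumptions (not from the original benchmark): server on localhost:8000,
# a single fixed prompt, max_tokens=256, greedy decoding.
import time
import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "unsloth/gemma-3-12b-it-GGUF",
    "prompt": "Explain the difference between TCP and UDP.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

# The completions response includes a usage block with the generated token count.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s")
```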
Results
| Driver | Throughput | Notes |
|---|---|---|
| nvidia-570-open | 115 tok/s | Baseline (2026-01-10) |
| nvidia-580.105.08 | 5 tok/s | After upgrade (2026-01-12) |
| nvidia-580 + cu130 | 5 tok/s | Ruled out CUDA mismatch |
Kernel Logs
VF BAR assignment failures observed at boot:
pci 0000:02:00.0: VF BAR 2 [mem size 0x10000000 64bit pref]: failed to assign
pci 0000:02:00.0: VF BAR 4 [mem size 0x02000000 64bit pref]: failed to assign
No Xid errors observed.
What Was Ruled Out
- Flash Attention version: tested with VLLM_FLASH_ATTN_VERSION=2; no improvement
- CUDA version mismatch: built PyTorch 2.9.1+cu130 to match the driver's CUDA 13.0; no improvement (a verification sketch follows this list)
- Code changes: running code identical to the pre-upgrade setup reproduces the regression
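As a sanity check on the first two items, a snippet like the following (not part of the original report) prints the Flash Attention override and confirms which CUDA runtime the installed PyTorch build carries on the Blackwell device:

```python
# Assumed verification snippet: confirms the env override and CUDA/PyTorch alignment.
import os
import torch

print("VLLM_FLASH_ATTN_VERSION:", os.environ.get("VLLM_FLASH_ATTN_VERSION"))
print("torch:", torch.__version__)                          # e.g. 2.9.1+cu130
print("cuda runtime:", torch.version.cuda)                  # e.g. 13.0
print("device:", torch.cuda.get_device_name(0))             # NVIDIA GeForce RTX 5090
print("capability:", torch.cuda.get_device_capability(0))   # (12, 0) on Blackwell
```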
Affected Models
- Gemma 3 12B GGUF (Q5_K_M) - 115 → 5 tok/s
- Other GGUF models show similar regression
- Non-GGUF models (AWQ, fp16) less affected or improved
Expected Behavior
Throughput should remain consistent (~115 tok/s) after the driver upgrade.
Additional Context
- The regression specifically affects GGUF-quantized models on the Blackwell architecture
- vLLM logs show the model loads successfully, but inference is extremely slow
- No errors appear in vLLM output, only degraded performance