
RTX 5090: 23x Performance Regression in vLLM GGUF Inference After nvidia-570 → nvidia-580 Upgrade #1001

@kitaekatt

Description

Summary

After upgrading from nvidia-driver-570 to nvidia-driver-580.105.08, vLLM inference throughput for Gemma 3 GGUF models dropped from 115 tok/s to 5 tok/s (a 23x slowdown). The regression persists even with a PyTorch cu130 build aligned with the driver's CUDA 13.0.

System Information

  • GPU: NVIDIA GeForce RTX 5090 (Blackwell, 32GB VRAM)
  • Driver: nvidia-driver-580.105.08 (previously nvidia-driver-570-open)
  • OS: Ubuntu 24.04 LTS
  • Kernel: 6.8.0-87-generic
  • CUDA: 13.0
  • PyTorch: 2.9.1+cu130 (also tested cu128)
  • vLLM: 0.14.0rc1

Reproduction

  1. Install nvidia-driver-580.105.08 on RTX 5090
  2. Run vLLM with Gemma 3 12B GGUF model:
    python -m vllm.entrypoints.openai.api_server \
      --model unsloth/gemma-3-12b-it-GGUF \
      --dtype float32 --quantization gguf
  3. Benchmark throughput
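
As a rough way to do step 3, one can time a single completion against the OpenAI-compatible endpoint. This is only a sketch: it assumes the server from step 2 is listening on the default localhost:8000 and that jq is installed; the prompt and max_tokens values are illustrative.

    # Time one completion and divide generated tokens by wall-clock seconds.
    START=$(date +%s.%N)
    RESP=$(curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "unsloth/gemma-3-12b-it-GGUF",
           "prompt": "Explain the difference between TCP and UDP.",
           "max_tokens": 256}')
    END=$(date +%s.%N)
    # completion_tokens is reported in the usage block of the response.
    TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
    echo "scale=1; $TOKENS / ($END - $START)" | bc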

Results

Driver               Throughput   Notes
nvidia-570-open      115 tok/s    Baseline (2026-01-10)
nvidia-580.105.08    5 tok/s      After upgrade (2026-01-12)
nvidia-580 + cu130   5 tok/s      Ruled out CUDA mismatch

Kernel Logs

VF BAR assignment failures observed at boot:

pci 0000:02:00.0: VF BAR 2 [mem size 0x10000000 64bit pref]: failed to assign
pci 0000:02:00.0: VF BAR 4 [mem size 0x02000000 64bit pref]: failed to assign

No Xid errors observed.
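
For anyone checking for the same symptoms, the relevant kernel and driver details can be pulled with standard tools; the grep pattern below simply matches the lines quoted above plus any Xid reports.

    # Look for VF BAR assignment failures and NVIDIA Xid errors in the kernel log.
    sudo dmesg | grep -iE "BAR.*failed to assign|NVRM: Xid"

    # Confirm which driver version the runtime actually sees.
    nvidia-smi --query-gpu=name,driver_version --format=csv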

What Was Ruled Out

  1. Flash Attention version - Tested VLLM_FLASH_ATTN_VERSION=2, no improvement
  2. CUDA version mismatch - Built PyTorch 2.9.1+cu130 to match driver's CUDA 13.0, no improvement
  3. Code changes - Running the identical code from before the upgrade still reproduces the regression
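
Items 1 and 2 can be re-verified along these lines (a sketch; the flags mirror the reproduction command above):

    # Confirm the PyTorch wheel's CUDA toolkit version and the visible GPU.
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"

    # Force Flash Attention 2 (item 1) with the same serve command as the reproduction.
    VLLM_FLASH_ATTN_VERSION=2 python -m vllm.entrypoints.openai.api_server \
      --model unsloth/gemma-3-12b-it-GGUF --dtype float32 --quantization gguf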

Affected Models

  • Gemma 3 12B GGUF (Q5_K_M) - 115 → 5 tok/s
  • Other GGUF models show a similar regression
  • Non-GGUF models (AWQ, fp16) are less affected or even improved

Expected Behavior

Throughput should remain consistent (~115 tok/s) after the driver upgrade.

Additional Context

  • The regression specifically affects GGUF quantized models on Blackwell architecture
  • vLLM logs show the model loading successfully, but inference is extremely slow
  • No errors in vLLM output, just degraded performance
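
One extra data point that may help triage: sampling SM utilization while a slow request is in flight would show whether the GPU is busy in slow kernels or mostly idle waiting on the host. A minimal way to do that with standard nvidia-smi device monitoring (one-second interval):

    # Sample GPU utilization once per second while a generation request runs.
    # Sustained high sm% would point at slow GGUF dequantization kernels;
    # low sm% would suggest host-side or kernel-launch stalls instead.
    nvidia-smi dmon -s u -d 1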
