Eval bug: llama.cpp (AMD GPU): partial layers run on CPU when loading dflash model, cannot fully offload to GPU #45

@rykerzhou

Name and Version

.\llama-cli.exe --version
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 89976 MiB):
Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 89976 MiB
version: 9459 (07ac3ce)
built with Clang 22.0.0 for Windows AMD64

Operating systems

Windows

GGML backends

HIP

Hardware

OS: Microsoft Windows 11 Pro 10.0.26200 26200
CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
GPU: AMD Radeon(TM) 8060S Graphics 4293918720 32.0.31007.5012

Models

Qwen3.6-27B-Q5_K_S.gguf
Qwen3.6-27B-DFlash-Q4_K_M.gguf

Problem description & steps to reproduce

.\llama-server.exe -m C:\Users\xh\Desktop\work\code\novamax\data\models_dir\llm\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q5_K_S.gguf --mmproj C:\Users\xh\Desktop\work\code\novamax\data\models_dir\llm\unsloth\Qwen3.5-27B-GGUF\mmproj-BF16.gguf --no-mmproj-offload --spec-draft-model C:\Users\xh\Downloads\Qwen3.6-27B-DFlash-Q4_K_M.gguf --spec-type dflash --spec-dflash-cross-ctx 1024 --host 0.0.0.0 --port 1234 --parallel 1 --kv-unified --n-gpu-layers all --spec-draft-ngl all -b 2048 -ub 512 --ctx-size 102400 --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on --cache-ram 0 --jinja --no-mmap --mlock --temperature 0.6 --reasoning off --top-p 1.0 --top-k 20 --min-p 0.0 --spec-draft-n-max 8

On an AMD GPU system, when llama.cpp loads and runs the DFlash draft model, only some of the model layers are offloaded to the GPU and the rest still execute on the CPU, even though both --n-gpu-layers all and --spec-draft-ngl all are set. GPU utilization is partial and CPU load stays high, which slows inference.

dflash_model.log
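
To make the partial offload easier to confirm from a log such as the one attached above, here is a minimal Python sketch that scans log text for layer-offload summary lines. It assumes llama.cpp prints lines of the form "offloaded N/M layers to GPU"; the exact prefix differs between versions, so only that fragment is matched. The helper names and the sample text are hypothetical, and the sample is not taken from dflash_model.log. The check covers layer placement only and would not reveal individual operations falling back to the CPU.

```python
import re

# Assumed log format (not confirmed against the attached log):
#   load_tensors: offloaded 65/65 layers to GPU
# Only the "offloaded N/M layers to GPU" fragment is matched.
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)/(\d+)\s+layers\s+to\s+GPU")


def offload_summary(log_text):
    """Return (offloaded, total) pairs, one per matching log line."""
    return [(int(done), int(total)) for done, total in OFFLOAD_RE.findall(log_text)]


def partially_offloaded(log_text):
    """Return only the entries where fewer layers were offloaded than exist."""
    return [(done, total) for done, total in offload_summary(log_text) if done < total]


if __name__ == "__main__":
    # Illustrative sample only; these lines are NOT from dflash_model.log.
    sample = (
        "load_tensors: offloaded 65/65 layers to GPU\n"
        "load_tensors: offloaded 3/6 layers to GPU\n"
    )
    for done, total in offload_summary(sample):
        print(f"{done}/{total} layers on GPU")
    print("partial:", partially_offloaded(sample))
```

When a server run loads both a target and a draft model, one summary line per model would be expected, so a non-empty result from partially_offloaded points at the model whose layers stayed on the CPU.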

First Bad Commit

No response

Relevant log output

Logs
