Name and Version
.\llama-cli.exe --version
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 89976 MiB):
Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 89976 MiB
version: 9459 (07ac3ce)
built with Clang 22.0.0 for Windows AMD64
Operating systems
Windows
GGML backends
HIP
Hardware
OS: Microsoft Windows 11 Pro 10.0.26200 26200
CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
GPU: AMD Radeon(TM) 8060S Graphics 4293918720 32.0.31007.5012
Models
Qwen3.6-27B-Q5_K_S.gguf
Qwen3.6-27B-DFlash-Q4_K_M.gguf
Problem description & steps to reproduce
.\llama-server.exe -m C:\Users\xh\Desktop\work\code\novamax\data\models_dir\llm\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-Q5_K_S.gguf --mmproj C:\Users\xh\Desktop\work\code\novamax\data\models_dir\llm\unsloth\Qwen3.5-27B-GGUF\mmproj-BF16.gguf --no-mmproj-offload --spec-draft-model C:\Users\xh\Downloads\Qwen3.6-27B-DFlash-Q4_K_M.gguf --spec-type dflash --spec-dflash-cross-ctx 1024 --host 0.0.0.0 --port 1234 --parallel 1 --kv-unified --n-gpu-layers all --spec-draft-ngl all -b 2048 -ub 512 --ctx-size 102400 --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on --cache-ram 0 --jinja --no-mmap --mlock --temperature 0.6 --reasoning off --top-p 1.0 --top-k 20 --min-p 0.0 --spec-draft-n-max 8
On an AMD GPU system, when loading and running a DFlash draft model with llama.cpp, only some of the model layers are offloaded to the GPU, even though both `--n-gpu-layers all` and `--spec-draft-ngl all` are set. The remaining layers still execute on the CPU. GPU utilization is only partial and CPU load stays high, which slows inference.
dflash_model.log
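To make the partial offload easier to confirm from the attached log, here is a minimal sketch that scans log text for layer-offload summary lines and reports any where fewer layers were offloaded than the total. It assumes the log contains lines of the form `offloaded N/M layers to GPU`; that pattern and the helper name `find_partial_offloads` are assumptions for illustration and should be checked against the actual contents of dflash_model.log.

```python
import re

# Assumed shape of the offload summary line; verify against the real log.
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)/(\d+)\s+layers\s+to\s+GPU")


def find_partial_offloads(log_text):
    """Return (line_no, offloaded, total) for every summary line
    where fewer layers were offloaded than the model has."""
    partial = []
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        match = OFFLOAD_RE.search(line)
        if match is None:
            continue
        offloaded, total = int(match.group(1)), int(match.group(2))
        if offloaded < total:
            partial.append((line_no, offloaded, total))
    return partial


# Example usage (hypothetical path, not run here):
#   with open("dflash_model.log", encoding="utf-8", errors="replace") as f:
#       for line_no, offloaded, total in find_partial_offloads(f.read()):
#           print(f"line {line_no}: only {offloaded}/{total} layers offloaded")
```

If the target model reports full offload and only the draft model shows a partial count, that would narrow the problem to the DFlash draft path.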
First Bad Commit
No response
Relevant log output
Logs