Add Strix Halo llama.cpp GGUF model knowledge (aliases + verified perf) #81

Open
rjckkkkk wants to merge 1 commit into develop from feat/strix-halo-gguf-model-knowledge

@rjckkkkk (Collaborator) commented Jun 8, 2026

What

Adds catalog (YAML-only) knowledge for running the models listed below via llama.cpp + GGUF on the AMD Strix Halo rig (Ryzen AI MAX+ 395 / Radeon 8060S iGPU, RDNA3.5), verified end-to-end through AIMA's native runtime.

  model            scanned GGUF alias        decode (b9180 HIP, Q4_K_M, all layers)
  qwen3.5-9b       Qwen3.5-9B-Q4_K_M         33.8 tok/s
  qwen3.5-27b      Qwen3.5-27B-Q4_K_M        11.7 tok/s
  glm-4.7-flash    GLM-4.7-Flash-Q4_K_M      58.9 tok/s (30B-A3B MoE, deepseek2 arch)
  qwen3.5-35b-a3b  Qwen3.5-35B-A3B-Q4_K_M    63.0 tok/s (35B-A3B MoE)

Why

The local scanner reports a model by its on-disk GGUF dir/file name (Qwen3.5-9B-Q4_K_M), which normalizeModelLookupKey could not match to metadata.name (qwen3.5-9b). Deploy therefore fell back to auto-detect and ignored the curated variant/config. Adding metadata.aliases (the documented scan-name matching mechanism) fixes this with no Go changes (INV-1/2).
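
The mismatch and the alias fix can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `normalize` is a hypothetical stand-in that only lowercases and trims (the real normalizeModelLookupKey rules are not shown in this PR), and the catalog is modelled as plain dicts rather than AIMA's actual YAML loader.

```python
def normalize(key: str) -> str:
    # Hypothetical stand-in for normalizeModelLookupKey: trim and lowercase only.
    return key.strip().lower()


def build_index(catalog):
    """Map every normalized name and alias to the entry's canonical metadata.name."""
    index = {}
    for entry in catalog:
        meta = entry["metadata"]
        for key in [meta["name"], *meta.get("aliases", [])]:
            index[normalize(key)] = meta["name"]
    return index


def resolve(index, scanned_name):
    """Return the canonical model name for a scanned name, or None if unmatched."""
    return index.get(normalize(scanned_name))


# Catalog entry before this PR: only metadata.name is indexed.
without_alias = [{"metadata": {"name": "qwen3.5-9b"}}]
# Catalog entry after this PR: the scanned GGUF name is listed as an alias.
with_alias = [
    {"metadata": {"name": "qwen3.5-9b", "aliases": ["Qwen3.5-9B-Q4_K_M"]}}
]

print(resolve(build_index(without_alias), "Qwen3.5-9B-Q4_K_M"))  # None
print(resolve(build_index(with_alias), "Qwen3.5-9B-Q4_K_M"))     # qwen3.5-9b
```

In this sketch an unmatched lookup returns `None`, which corresponds to the case where deploy falls back to auto-detect.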

Changes

  • All four: add metadata.aliases with the scanned GGUF name; record verified Strix Halo decode perf in the matching llamacpp variant's expected_performance.
  • qwen3.5-27b, glm-4.7-flash: were catalogued only for safetensors (vLLM/SGLang on GB10/Ada). Add gguf to storage.formats, a GGUF source, and a universal (gpu_arch: "*") llamacpp variant so GGUF deploys resolve to curated config.
  • qwen3.5-9b, qwen3.5-35b-a3b: already had a llamacpp variant — alias + verified perf only.
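
As a rough sketch of why a universal variant lets GGUF deploys resolve to curated config, the Python below models one catalog entry as a dict and picks a variant by engine and GPU architecture. The dict layout, the `engine` and `decode_tok_s` keys, the `"Ada"` and `"RDNA3.5"` architecture strings, and the exact-match-before-wildcard precedence are all assumptions made for illustration; only the field names metadata.aliases, storage.formats, gpu_arch and expected_performance come from this PR.

```python
# Hypothetical in-memory view of the qwen3.5-27b entry after this PR.
entry = {
    "metadata": {"name": "qwen3.5-27b", "aliases": ["Qwen3.5-27B-Q4_K_M"]},
    "storage": {"formats": ["safetensors", "gguf"]},
    "variants": [
        # Pre-existing safetensors variant (architecture string is assumed).
        {"engine": "vllm", "gpu_arch": "Ada"},
        # New universal llamacpp variant; the perf key name is assumed.
        {
            "engine": "llamacpp",
            "gpu_arch": "*",
            "expected_performance": {"decode_tok_s": 11.7},
        },
    ],
}


def pick_variant(entry, engine, gpu_arch):
    """Prefer an exact gpu_arch match for the engine, else a "*" variant."""
    candidates = [v for v in entry["variants"] if v["engine"] == engine]
    for variant in candidates:
        if variant["gpu_arch"] == gpu_arch:
            return variant
    for variant in candidates:
        if variant["gpu_arch"] == "*":
            return variant
    return None  # no curated variant: the caller would fall back to auto-detect


chosen = pick_variant(entry, "llamacpp", "RDNA3.5")
print(chosen["gpu_arch"])                        # *
print(pick_variant(entry, "vllm", "RDNA3.5"))    # None
```

Without the `"*"` variant, the first lookup would also return `None` on this rig, since no variant names its architecture.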

Verification

Each model was deployed via aima deploy <model> --engine llamacpp on the rig: llama-server (b9180 HIP) launched on the iGPU with all 999 layers offloaded and served the OpenAI-compatible API, and decode speed was measured from the /v1/chat/completions timings. (Depends on engine discovery from #80.)
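
For readers who want to reproduce the measurement, here is a minimal Python sketch of deriving decode throughput from a response body. It assumes the response carries a `timings` object with `predicted_n` (generated tokens) and `predicted_ms` (generation time in milliseconds), as llama-server responses commonly do; the sample payload uses made-up numbers and is not a captured response from the rig.

```python
import json


def decode_tok_per_s(response_body: str) -> float:
    """Compute decode tokens/second from the assumed `timings` fields."""
    timings = json.loads(response_body)["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


# Illustrative payload only: 256 generated tokens in 8 seconds.
sample = json.dumps({"timings": {"predicted_n": 256, "predicted_ms": 8000.0}})
print(round(decode_tok_per_s(sample), 1))  # 32.0
```

Only the generation phase is counted here; prompt processing time is reported separately and is excluded from the decode figure.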

🤖 Generated with Claude Code

On the AMD Strix Halo (Radeon 8060S iGPU) rig, scanned on-disk GGUF model
names (e.g. Qwen3.5-9B-Q4_K_M) did not match catalog metadata.name, so
deploy fell back to auto-detect instead of the curated config. Add
metadata.aliases so the local scanner matches, and record llama.cpp b9180
HIP verified decode perf (all 999 layers offloaded, Q4_K_M):

  qwen3.5-9b       33.8 tok/s   (alias only; llamacpp variant already present)
  qwen3.5-27b      11.7 tok/s   (added universal llamacpp variant + GGUF source)
  glm-4.7-flash    58.9 tok/s   (added universal llamacpp variant + GGUF source)
  qwen3.5-35b-a3b  63.0 tok/s   (alias + verified perf; variant already present)

New llamacpp variants use gpu_arch "*" so they apply on any device (GGUF is
the path for low-VRAM hardware). No Go changes — knowledge-only (INV-1/2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>