Add Strix Halo llama.cpp GGUF model knowledge (aliases + verified perf) by rjckkkkk · Pull Request #81 · Approaching-AI/AIMA

rjckkkkk · 2026-06-08T15:25:43Z

What

Catalog (YAML-only) knowledge for running these models via llama.cpp + GGUF on the AMD Strix Halo (Ryzen AI MAX+ 395 / Radeon 8060S iGPU, RDNA3.5) rig, verified end-to-end through AIMA's native runtime.

| model | scanned GGUF alias | decode (b9180 HIP, Q4_K_M, all layers) |
| --- | --- | --- |
| qwen3.5-9b | `Qwen3.5-9B-Q4_K_M` | 33.8 tok/s |
| qwen3.5-27b | `Qwen3.5-27B-Q4_K_M` | 11.7 tok/s |
| glm-4.7-flash | `GLM-4.7-Flash-Q4_K_M` | 58.9 tok/s (30B-A3B MoE, deepseek2 arch) |
| qwen3.5-35b-a3b | `Qwen3.5-35B-A3B-Q4_K_M` | 63.0 tok/s (35B-A3B MoE) |

Why

The local scanner reports a model by its on-disk GGUF dir/file name (`Qwen3.5-9B-Q4_K_M`), which `normalizeModelLookupKey` could not match to `metadata.name` (`qwen3.5-9b`). Deploy therefore fell back to auto-detect and ignored the curated variant/config. Adding `metadata.aliases` (the documented scan-name matching mechanism) fixes this with no Go changes (INV-1/2).
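The mismatch and the alias fix can be illustrated with a small Python sketch. This is not AIMA's Go implementation: `normalize_key`, `lookup`, and the in-memory `CATALOG` are hypothetical stand-ins, and the normalization rule (lowercase, collapse separators) is an assumption about what `normalizeModelLookupKey` might do.

```python
import re

def normalize_key(name: str) -> str:
    """Hypothetical normalizer: lowercase, collapse non-alphanumeric runs (except '.') to '-'."""
    return re.sub(r"[^a-z0-9.]+", "-", name.lower()).strip("-")

# Illustrative catalog entries; the names and aliases are taken from the PR table.
CATALOG = [
    {"name": "qwen3.5-9b", "aliases": ["Qwen3.5-9B-Q4_K_M"]},
    {"name": "glm-4.7-flash", "aliases": ["GLM-4.7-Flash-Q4_K_M"]},
]

def lookup(scanned: str, catalog=CATALOG, use_aliases: bool = True):
    """Return the catalog name matching a scanned GGUF name, or None."""
    key = normalize_key(scanned)
    for entry in catalog:
        candidates = [entry["name"]]
        if use_aliases:
            candidates += entry.get("aliases", [])
        if any(normalize_key(c) == key for c in candidates):
            return entry["name"]
    return None  # the caller would fall back to auto-detect

print(lookup("Qwen3.5-9B-Q4_K_M", use_aliases=False))  # no match on name alone
print(lookup("Qwen3.5-9B-Q4_K_M"))                     # matches through the alias
```

Under these assumptions the quantization suffix (`-Q4_K_M`) is what keeps the scanned name from normalizing to `qwen3.5-9b`, so an exact alias is the simplest way to bridge the two.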

Changes

- All four: add `metadata.aliases` with the scanned GGUF name, and record the verified Strix Halo decode perf in the matching llamacpp variant's `expected_performance`.
- qwen3.5-27b, glm-4.7-flash: these were previously catalogued only for safetensors (vLLM/SGLang on GB10/Ada). Add `gguf` to `storage.formats`, a GGUF source, and a universal (`gpu_arch: "*"`) llamacpp variant so GGUF deploys resolve to the curated config.
- qwen3.5-9b, qwen3.5-35b-a3b: already had a llamacpp variant, so only the alias and verified perf are added.

Verification

Each model was deployed via `aima deploy <model> --engine llamacpp` on the rig: llama-server (b9180 HIP) launched on the iGPU with all layers offloaded (999 requested) and served the OpenAI API, and decode throughput was measured from the `/v1/chat/completions` timings. (Depends on engine discovery from #80.)
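As a rough sketch of how a decode figure can be derived from a response, the helper below divides generated tokens by generation time. It assumes the response carries a `timings` object with `predicted_n` and `predicted_ms` fields, as llama-server responses commonly do; the PR does not name the fields, and the sample values are invented for illustration, not measurements.

```python
def decode_tok_per_s(response: dict) -> float:
    """Decode throughput = generated tokens / generation time in seconds."""
    timings = response["timings"]  # assumed llama-server extension to the OpenAI response
    seconds = timings["predicted_ms"] / 1000.0
    return timings["predicted_n"] / seconds

# Invented sample payload (not a measured result): 200 tokens in 4 seconds.
sample = {"timings": {"predicted_n": 200, "predicted_ms": 4000.0}}
print(f"{decode_tok_per_s(sample):.1f} tok/s")
```

Prompt processing time is excluded here on purpose, since the table reports decode speed only.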

🤖 Generated with Claude Code

On the AMD Strix Halo (Radeon 8060S iGPU) rig, scanned on-disk GGUF model names (e.g. Qwen3.5-9B-Q4_K_M) did not match catalog metadata.name, so deploy fell back to auto-detect instead of the curated config.

Add metadata.aliases so the local scanner matches, and record llama.cpp b9180 HIP verified decode perf (all 999 layers offloaded, Q4_K_M):

- qwen3.5-9b 33.8 tok/s (alias only; llamacpp variant already present)
- qwen3.5-27b 11.7 tok/s (added universal llamacpp variant + GGUF source)
- glm-4.7-flash 58.9 tok/s (added universal llamacpp variant + GGUF source)
- qwen3.5-35b-a3b 63.0 tok/s (alias + verified perf; variant already present)

New llamacpp variants use gpu_arch "*" so they apply on any device (GGUF is the path for low-VRAM hardware). No Go changes; knowledge-only (INV-1/2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rjckkkkk wants to merge 1 commit into develop from feat/strix-halo-gguf-model-knowledge
