Add Strix Halo llama.cpp GGUF model knowledge (aliases + verified perf) #81
Open
rjckkkkk wants to merge 1 commit into
On the AMD Strix Halo (Radeon 8060S iGPU) rig, scanned on-disk GGUF model names (e.g. Qwen3.5-9B-Q4_K_M) did not match catalog metadata.name, so deploy fell back to auto-detect instead of the curated config.

Add metadata.aliases so the local scanner matches, and record llama.cpp b9180 HIP verified decode perf (all 999 layers offloaded, Q4_K_M):

- qwen3.5-9b: 33.8 tok/s (alias only; llamacpp variant already present)
- qwen3.5-27b: 11.7 tok/s (added universal llamacpp variant + GGUF source)
- glm-4.7-flash: 58.9 tok/s (added universal llamacpp variant + GGUF source)
- qwen3.5-35b-a3b: 63.0 tok/s (alias + verified perf; variant already present)

New llamacpp variants use gpu_arch "*" so they apply on any device (GGUF is the path for low-VRAM hardware). No Go changes; knowledge-only (INV-1/2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Catalog (YAML-only) knowledge for running these models via llama.cpp + GGUF on the AMD Strix Halo (Ryzen AI MAX+ 395 / Radeon 8060S iGPU, RDNA3.5) rig, verified end-to-end through AIMA's native runtime.
- `Qwen3.5-9B-Q4_K_M`
- `Qwen3.5-27B-Q4_K_M`
- `GLM-4.7-Flash-Q4_K_M`
- `Qwen3.5-35B-A3B-Q4_K_M`

Why
The local scanner reports a model by its on-disk GGUF dir/file name (`Qwen3.5-9B-Q4_K_M`), which `normalizeModelLookupKey` could not match to `metadata.name` (`qwen3.5-9b`). Deploy therefore fell back to auto-detect and ignored the curated variant/config. Adding `metadata.aliases` (the documented scan-name matching mechanism) fixes this with no Go changes (INV-1/2).

Changes
- Add `metadata.aliases` with the scanned GGUF name; record verified Strix Halo decode perf in the matching llamacpp variant's `expected_performance`.
- For the models that had no llamacpp variant (qwen3.5-27b and glm-4.7-flash, per the commit message): add `gguf` to `storage.formats`, a GGUF source, and a universal (`gpu_arch: "*"`) llamacpp variant so GGUF deploys resolve to curated config.

Verification
Each model was deployed via `aima deploy <model> --engine llamacpp` on the rig: `llama-server` (b9180 HIP) launched on the iGPU with all layers offloaded (layer count set to 999) and served the OpenAI-compatible API; decode throughput was measured from the `/v1/chat/completions` timings. (Depends on engine discovery from #80.)

🤖 Generated with Claude Code
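For reviewers who want to reproduce the decode figures quoted under Verification, here is a minimal sketch of deriving tok/s from a chat-completion response. It assumes the response carries a `timings` object with `predicted_n` (generated tokens) and `predicted_ms` (decode time in milliseconds), as llama-server builds typically report; those field names are an assumption and are not confirmed by this PR for b9180. The helper name `decode_tokens_per_second` and the sample numbers are hypothetical.

```python
from typing import Any, Mapping


def decode_tokens_per_second(response: Mapping[str, Any]) -> float:
    """Return decode throughput (tokens/s) from a chat-completion response.

    Assumption: the response has a llama-server style `timings` object with
    `predicted_n` (generated tokens) and `predicted_ms` (decode time, ms).
    """
    timings = response.get("timings")
    if not timings:
        raise ValueError("response has no 'timings' object")
    tokens = timings["predicted_n"]
    millis = timings["predicted_ms"]
    if millis <= 0:
        raise ValueError("predicted_ms must be positive")
    return tokens / (millis / 1000.0)


# Made-up numbers for illustration only; not measurements from the rig.
sample = {"timings": {"predicted_n": 256, "predicted_ms": 4000.0}}
print(f"{decode_tokens_per_second(sample):.1f} tok/s")  # prints "64.0 tok/s"
```

The function only covers the decode phase; prompt processing is deliberately excluded so the result is comparable to the decode perf recorded in `expected_performance`.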