feat(model): mark speculative draft heads (DFlash/MTP) as non-standalone #82
Open
rjckkkkk wants to merge 1 commit into
AIMA's model scanner lists every weight artifact on disk as an independently deployable model. Speculative draft heads (DFlash / MTP) only make sense paired with their parent model for speculative decoding, yet they showed up as standalone models with a deploy button (e.g. Qwen3.6-35B-A3B-DFlash, Qwen3.6-35B-A3B-DFlash-Q4_K_M).

The catalog already names each draft via its parent variant's speculative_config.model, so detection needs no new per-model YAML:

- knowledge.NormalizeModelKey: lowercases a model name and strips quantization/precision/layout suffixes (q4_k_m, bf16, ud, unfused, ...) while keeping role tokens like "dflash", so all on-disk artifacts of one logical draft share a key and stay distinct from the parent.
- Catalog.SpeculativeDraftModelKeys: harvests speculative_config.model across all variants into a set of normalized draft keys.
- annotateModelsFromCatalog: a scanned model whose normalized name is a draft key gets standalone_deploy=false + ui.role=draft (only when not already set); the embedded UI already hides the deploy button on standalone_deploy=false.

Knowledge-driven (INV-1/2): no hardcoded model names, derived entirely from the catalog. Parent models and quantization variants are unaffected, and DB rows / deploy-by-name paths are untouched (annotation is applied at model.list time only).

Tests: NormalizeModelKey table cases (incl. "glm-4.7-flash" not stripped), SpeculativeDraftModelKeys harvest + nil-catalog, and the annotate wiring (drafts -> non-standalone draft, parent stays deployable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The model scanner is filesystem-first: every weight artifact on disk becomes a separate, independently-deployable model card. Two consequences surfaced while onboarding models on a Strix Halo box:
1. Speculative draft heads get a deploy button. Qwen3.6-35B-A3B-DFlash (safetensors, ~0.9G) and Qwen3.6-35B-A3B-DFlash-Q4_K_M (gguf, ~0.3M) showed up as standalone models you could deploy via the "部署" (Deploy) button. A DFlash/MTP head is a speculative-decoding companion of its parent; it cannot run on its own.
2. (Not in this PR) One logical model appears as many cards: qwen3.6-35b-a3b was scanned as 8 entries (safetensors + bf16/bf16-unfused gguf + q4_k_m/ud-q4_k_m quants + 2 DFlash copies). See Follow-up below.

Fix (scope: #1, the clear correctness bug)
The catalog already names each draft via its parent variant's speculative_config.model (e.g. qwen3.6-35b-a3b.yaml → model: /models/Qwen3.6-35B-A3B-DFlash). So detection is fully knowledge-driven, with no new per-model YAML and no hardcoded names (INV-1/2):

- knowledge.NormalizeModelKey: lowercases a name and strips quant/precision/layout suffixes (q4_k_m, bf16, ud, unfused, …) while keeping role tokens like dflash, so every on-disk artifact of one logical draft shares a key and stays distinct from the parent. (glm-4.7-flash is left intact: flash is identity, not a quant.)
- Catalog.SpeculativeDraftModelKeys: harvests speculative_config.model across all variants into a set of normalized draft keys.
- annotateModelsFromCatalog: a scanned model whose normalized name is a draft key gets standalone_deploy=false + ui.role=draft (only when not already set). The embedded UI already hides the deploy button when standalone_deploy=false.

Parent models and quant variants are untouched; DB rows and deploy-by-name paths are unchanged (annotation is applied at model.list time only).

Tests
- NormalizeModelKey table cases (incl. the glm-4.7-flash negative case).
- SpeculativeDraftModelKeys harvest + nil-catalog guard.
- annotateModelsFromCatalog wiring: drafts → non-standalone draft, parent stays deployable.
- go test ./..., go build ./..., go vet, gofmt all clean.

Follow-up (separate PR)
Variant grouping for #2: fold bf16 / bf16-unfused / q4_k_m / ud-q4_k_m of one model into a single card with a variant/quant selector. This needs a model.list grouping shape plus a UI change; dropping distinct quants outright would be wrong since they're genuinely different deployables. NormalizeModelKey, introduced here, is the intended base-key primitive for that work.
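
For readers who want the gist without opening the diff, here is a minimal sketch of the normalize-then-harvest idea described under Fix. It is not the code in this PR: normalizeModelKey and draftKeys are hypothetical stand-ins for knowledge.NormalizeModelKey and Catalog.SpeculativeDraftModelKeys, the suffix set contains only the four tokens named above, and both splitting on "-" and taking the basename of speculative_config.model are assumptions about how the real functions behave.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// strippable lists trailing tokens treated as quantization, precision, or
// layout markers. Illustrative only: it holds just the tokens named in the
// PR description, and the real list is presumably longer.
var strippable = map[string]bool{
	"q4_k_m": true, "bf16": true, "ud": true, "unfused": true,
}

// normalizeModelKey is a hypothetical stand-in for knowledge.NormalizeModelKey.
// It lowercases the name and drops trailing strippable tokens, so role tokens
// such as "dflash" and identity tokens such as "flash" are kept.
func normalizeModelKey(name string) string {
	tokens := strings.Split(strings.ToLower(name), "-")
	for len(tokens) > 1 && strippable[tokens[len(tokens)-1]] {
		tokens = tokens[:len(tokens)-1]
	}
	return strings.Join(tokens, "-")
}

// draftKeys is a hypothetical stand-in for Catalog.SpeculativeDraftModelKeys.
// It takes speculative_config.model values already read from the catalog and
// returns the set of normalized draft keys (assumption: the basename of the
// path is what gets normalized).
func draftKeys(specModels []string) map[string]bool {
	keys := make(map[string]bool)
	for _, m := range specModels {
		if m == "" {
			continue // variant without a speculative draft
		}
		keys[normalizeModelKey(path.Base(m))] = true
	}
	return keys
}

func main() {
	keys := draftKeys([]string{"/models/Qwen3.6-35B-A3B-DFlash"})
	for _, scanned := range []string{
		"Qwen3.6-35B-A3B-DFlash",
		"Qwen3.6-35B-A3B-DFlash-Q4_K_M",
		"qwen3.6-35b-a3b-ud-q4_k_m",
		"glm-4.7-flash",
	} {
		key := normalizeModelKey(scanned)
		// A scanned model whose key is in the set would be annotated as
		// standalone_deploy=false, ui.role=draft.
		fmt.Printf("%s -> %s draft=%v\n", scanned, key, keys[key])
	}
}
```

In this sketch both DFlash artifacts collapse to the same key and match the harvested set, while the parent's ud-q4_k_m quant normalizes to the parent key and glm-4.7-flash is returned unchanged, so neither is marked as a draft.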