
feat(model): mark speculative draft heads (DFlash/MTP) as non-standalone #82

Open
rjckkkkk wants to merge 1 commit into develop from feat/speculative-draft-detection


rjckkkkk commented Jun 9, 2026

Collaborator

Problem

The model scanner is filesystem-first: every weight artifact on disk becomes a separate, independently-deployable model card. Two consequences surfaced while onboarding models on a Strix Halo box:

  1. Speculative draft heads get a deploy button. Qwen3.6-35B-A3B-DFlash (safetensors, ~0.9G) and Qwen3.6-35B-A3B-DFlash-Q4_K_M (gguf, ~0.3M) showed up as standalone models you could "Deploy". A DFlash/MTP head is a speculative-decoding companion of its parent: it cannot run on its own.

  2. (Not in this PR) One logical model appears as many cards — qwen3.6-35b-a3b was scanned as 8 entries (safetensors + bf16/bf16-unfused gguf + q4_k_m/ud-q4_k_m quants + 2 DFlash copies). See Follow-up below.

Fix (scope: #1, the clear correctness bug)

The catalog already names each draft via its parent variant's speculative_config.model (e.g. qwen3.6-35b-a3b.yaml sets model: /models/Qwen3.6-35B-A3B-DFlash). So detection is fully knowledge-driven, with no new per-model YAML and no hardcoded names (INV-1/2):

  • knowledge.NormalizeModelKey — lowercases a name and strips quant/precision/layout suffixes (q4_k_m, bf16, ud, unfused, …) while keeping role tokens like dflash, so every on-disk artifact of one logical draft shares a key and stays distinct from the parent. (glm-4.7-flash is left intact — flash is identity, not a quant.)
  • Catalog.SpeculativeDraftModelKeys — harvests speculative_config.model across all variants into a set of normalized draft keys.
  • annotateModelsFromCatalog — a scanned model whose normalized name is a draft key gets standalone_deploy=false + ui.role=draft (only when not already set). The embedded UI already hides the deploy button when standalone_deploy=false.
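
For reviewers who want the idea without opening the diff, here is a minimal sketch of the first two steps. It is not the code in this PR: the struct shapes, the lowercase function names, the suffix set, and the use of path.Base to reduce a catalog path to a model name are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// strippable lists trailing tokens that describe how weights are stored
// (quantization, precision, layout), not which model they are.
// Illustrative subset only; the real list in the PR may differ.
var strippable = map[string]bool{
	"q4_k_m": true, "bf16": true, "ud": true, "unfused": true,
}

// normalizeModelKey lowercases a name and drops trailing storage tokens.
// Role tokens such as "dflash" (and identity tokens such as "flash") are
// not in the set, so they stop the stripping and stay part of the key.
func normalizeModelKey(name string) string {
	parts := strings.Split(strings.ToLower(strings.TrimSpace(name)), "-")
	for len(parts) > 1 && strippable[parts[len(parts)-1]] {
		parts = parts[:len(parts)-1]
	}
	return strings.Join(parts, "-")
}

// Hypothetical catalog shape: only the fields this sketch needs.
type speculativeConfig struct{ Model string }
type variant struct{ SpeculativeConfig *speculativeConfig }
type modelEntry struct{ Variants []variant }
type catalog struct{ Models []modelEntry }

// speculativeDraftModelKeys collects the normalized key of every draft
// named by a variant's speculative_config.model. A nil catalog yields an
// empty set rather than a panic.
func (c *catalog) speculativeDraftModelKeys() map[string]bool {
	keys := map[string]bool{}
	if c == nil {
		return keys
	}
	for _, m := range c.Models {
		for _, v := range m.Variants {
			if v.SpeculativeConfig == nil || v.SpeculativeConfig.Model == "" {
				continue
			}
			// Assumption: the catalog stores a path; its last element is the name.
			keys[normalizeModelKey(path.Base(v.SpeculativeConfig.Model))] = true
		}
	}
	return keys
}

func main() {
	cat := &catalog{Models: []modelEntry{{Variants: []variant{
		{SpeculativeConfig: &speculativeConfig{Model: "/models/Qwen3.6-35B-A3B-DFlash"}},
		{}, // a variant without speculative decoding
	}}}}
	drafts := cat.speculativeDraftModelKeys()

	for _, name := range []string{
		"Qwen3.6-35B-A3B-DFlash",
		"Qwen3.6-35B-A3B-DFlash-Q4_K_M",
		"qwen3.6-35b-a3b-ud-q4_k_m",
		"glm-4.7-flash",
	} {
		key := normalizeModelKey(name)
		fmt.Printf("%s -> key=%s draft=%v\n", name, key, drafts[key])
	}
}
```

Stripping only from the tail is what keeps glm-4.7-flash intact in this sketch: "flash" is never in the strippable set, so the loop stops at it.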

Parent models and quant variants are untouched; DB rows and deploy-by-name paths are unchanged (annotation is applied at model.list time only).

Tests

  • NormalizeModelKey table cases (incl. the glm-4.7-flash negative case).
  • SpeculativeDraftModelKeys harvest + nil-catalog guard.
  • annotateModelsFromCatalog wiring: drafts → non-standalone draft, parent stays deployable.

go test ./..., go build ./..., go vet, gofmt all clean.

Follow-up (separate PR)

Variant grouping for #2 — fold bf16 / bf16-unfused / q4_k_m / ud-q4_k_m of one model into a single card with a variant/quant selector (needs a model.list grouping shape + a UI change; dropping distinct quants outright would be wrong since they're genuinely different deployables). NormalizeModelKey introduced here is the intended base-key primitive for that work.

AIMA's model scanner lists every weight artifact on disk as an
independently deployable model. Speculative draft heads (DFlash / MTP)
only make sense paired with their parent model for speculative decoding,
yet they showed up as standalone models with a deploy button
(e.g. Qwen3.6-35B-A3B-DFlash, Qwen3.6-35B-A3B-DFlash-Q4_K_M).

The catalog already names each draft via its parent variant's
speculative_config.model, so detection needs no new per-model YAML:

- knowledge.NormalizeModelKey: lowercases a model name and strips
  quantization/precision/layout suffixes (q4_k_m, bf16, ud, unfused, ...)
  while keeping role tokens like "dflash", so all on-disk artifacts of one
  logical draft share a key and stay distinct from the parent.
- Catalog.SpeculativeDraftModelKeys: harvests speculative_config.model
  across all variants into a set of normalized draft keys.
- annotateModelsFromCatalog: a scanned model whose normalized name is a
  draft key gets standalone_deploy=false + ui.role=draft (only when not
  already set); the embedded UI already hides the deploy button on
  standalone_deploy=false.

Knowledge-driven (INV-1/2): no hardcoded model names, derived entirely
from the catalog. Parent models and quantization variants are unaffected,
and DB rows / deploy-by-name paths are untouched (annotation is applied at
model.list time only).

Tests: NormalizeModelKey table cases (incl. "glm-4.7-flash" not stripped),
SpeculativeDraftModelKeys harvest + nil-catalog, and the annotate wiring
(drafts -> non-standalone draft, parent stays deployable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>