v1.0.1: Fix tokenizer loading, prompt formatting, cache wiring, MCP intelligence #11

Merged: devwhodevs merged 17 commits into main from hotfix/v1.0.1-tokenizer-and-review-fixes on Mar 25, 2026
@devwhodevs devwhodevs commented Mar 25, 2026

Summary

Hotfix for v1.0.0: switches the inference backend from candle to llama.cpp and fixes the issues raised in code review.

Backend Switch (candle → llama.cpp)

  • llama-cpp-2 replaces candle-core/candle-nn/candle-transformers (-1500 lines net)
  • Metal GPU acceleration — 88 files indexed in 70 seconds (was 37+ minutes on CPU with candle)
  • Same model stack as qmd: embeddinggemma-300M, qwen3-reranker, qwen3 for expansion
  • llama.cpp handles tokenization from GGUF metadata (no separate tokenizer.json needed)

Embedding Fixes

  • Use encode() not decode() for embedding models
  • Set n_ubatch >= n_tokens (llama.cpp assertion requirement)
  • Use AddBos::Never since PromptFormat already adds <bos>
  • Apply prompt format: embed_one → search_query prefix, embed_batch → search_document prefix

Code Review Fixes

  • Fix dimension migration: Store::init() reads stored dim, reset_for_reindex clears FTS
  • Wire LLM cache into search_with_intelligence
  • Wire orchestrator + reranker into MCP server search handler
  • Progress bar during indexing via indicatif

Search Quality (verified on real vault)

  • "Brilliant Earth checkout" → finds BRE-2579, Brilliant Earth note, Luke Clifton
  • "Scentbird Drift engineering manager" → finds Scentbird, Drift Team, Drift notes
  • "who works on checkout" → Intent: Relationship, finds John Nelson, BRE-1728, Brilliant Earth

Test plan

  • cargo test --lib — 270/270 pass
  • cargo clippy -- -D warnings — clean
  • Real vault: 88 files, 3493 chunks, 70s index time on Metal
  • Search quality verified across 5 query types
  • CI green (needs CMake on runners)

GGUF repos rarely ship tokenizer.json, and Google Gemma tokenizers
are gated on HuggingFace. The FlexTokenizer enum wraps both the HuggingFace
tokenizers crate and shimmytok (which extracts the tokenizer from GGUF metadata).
CandleEmbed uses FlexTokenizer; the orchestrator and reranker are HF-only.

embed_one now calls prompt_format.format_query() and embed_batch calls
prompt_format.format_document() before passing text to embed_text().
This is required for asymmetric models like embeddinggemma that need
specific prefixes for queries vs documents.
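
The query/document split described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `PromptFormat` variants, the `EXAMPLE_FORMAT` constant, and the prefix strings are assumptions made up for the example, and `embed_text` is passed in as a closure so the sketch stays independent of any inference backend.

```rust
/// Hypothetical stand-in for the PR's `PromptFormat`. The variant names and
/// prefix strings are illustrative assumptions, not the project's real values.
#[derive(Clone, Copy)]
pub enum PromptFormat {
    /// Symmetric model: text is embedded unchanged.
    Raw,
    /// Asymmetric model: queries and documents get different prefixes.
    Asymmetric {
        query_prefix: &'static str,
        document_prefix: &'static str,
    },
}

impl PromptFormat {
    pub fn format_query(&self, text: &str) -> String {
        match self {
            PromptFormat::Raw => text.to_string(),
            PromptFormat::Asymmetric { query_prefix, .. } => format!("{query_prefix}{text}"),
        }
    }

    pub fn format_document(&self, text: &str) -> String {
        match self {
            PromptFormat::Raw => text.to_string(),
            PromptFormat::Asymmetric { document_prefix, .. } => {
                format!("{document_prefix}{text}")
            }
        }
    }
}

/// Placeholder prefixes (assumed). The format already carries `<bos>`, which
/// is why the tokenizer must not add a second one.
pub const EXAMPLE_FORMAT: PromptFormat = PromptFormat::Asymmetric {
    query_prefix: "<bos>search_query: ",
    document_prefix: "<bos>search_document: ",
};

/// Single-text path: treated as a search query.
pub fn embed_one<F>(fmt: PromptFormat, embed_text: F, query: &str) -> Vec<f32>
where
    F: Fn(&str) -> Vec<f32>,
{
    embed_text(&fmt.format_query(query))
}

/// Batch path: treated as documents being indexed.
pub fn embed_batch<F>(fmt: PromptFormat, embed_text: F, docs: &[&str]) -> Vec<Vec<f32>>
where
    F: Fn(&str) -> Vec<f32>,
{
    docs.iter()
        .map(|d| embed_text(&fmt.format_document(d)))
        .collect()
}
```

Keeping the prefix logic inside the format type means callers cannot forget it, which is the failure mode this commit fixes.
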

reset_for_reindex now also deletes from chunks_fts so stale keyword
entries don't survive a dimension migration. Store::init() reads the
stored embedding_dim from meta to create the vec table with the correct
dimension, preventing a stale 384-dim table from persisting when the
model outputs 256-dim vectors.
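
A rough in-memory sketch of the migration behaviour described above, assuming the meta table can be modelled as a string map and the vec/FTS tables as plain vectors. The `Store` fields here are hypothetical stand-ins; only the `embedding_dim` key, `Store::init()`, and `reset_for_reindex` come from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the on-disk store.
pub struct Store {
    pub meta: HashMap<String, String>,
    /// Dimension the vector table was created with.
    pub vec_dim: usize,
    pub vectors: Vec<Vec<f32>>,
    /// Stand-in for the `chunks_fts` keyword index.
    pub chunks_fts: Vec<String>,
}

impl Store {
    /// Prefer the dimension recorded in `meta`; fall back to the default only
    /// when nothing has been stored yet.
    pub fn init(meta: HashMap<String, String>, default_dim: usize) -> Store {
        let vec_dim = meta
            .get("embedding_dim")
            .and_then(|v| v.parse::<usize>().ok())
            .unwrap_or(default_dim);
        Store {
            meta,
            vec_dim,
            vectors: Vec::new(),
            chunks_fts: Vec::new(),
        }
    }

    /// Clear both the vector rows and the keyword rows, then record the new
    /// dimension so the next `init` creates a matching table.
    pub fn reset_for_reindex(&mut self, new_dim: usize) {
        self.vectors.clear();
        self.chunks_fts.clear();
        self.meta
            .insert("embedding_dim".to_string(), new_dim.to_string());
        self.vec_dim = new_dim;
    }
}
```

Clearing the keyword rows together with the vectors keeps the two indexes in step, so a reindex never mixes old chunks with new embeddings.
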

When an orchestrator is present, compute a SHA256 cache key from the
query and check the llm_cache table first. On miss, call the
orchestrator and store the result. Adds Serialize/Deserialize to
QueryIntent and OrchestrationResult for JSON round-tripping.
Removes #[allow(dead_code)] from orchestration_cache_key.
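
The check-then-store flow described above looks roughly like this. It is a sketch under stated assumptions: the llm_cache table is modelled as a `HashMap`, the orchestration result is a plain JSON-like `String`, and `std`'s `DefaultHasher` is swapped in for SHA256 purely to keep the example dependency-free (it is not a cryptographic hash and is not what the PR uses).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in cache key. The PR derives the key with SHA256; `DefaultHasher`
/// is substituted here only so the sketch needs no external crate.
pub fn cache_key(query: &str) -> String {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Cache-aside lookup: return the stored result on a hit, otherwise call the
/// orchestrator once and store what it returned.
pub fn orchestrate_cached<F>(
    cache: &mut HashMap<String, String>,
    query: &str,
    orchestrate: F,
) -> String
where
    F: FnOnce(&str) -> String,
{
    let key = cache_key(query);
    if let Some(hit) = cache.get(&key) {
        return hit.clone();
    }
    let result = orchestrate(query);
    cache.insert(key, result.clone());
    result
}
```

Storing the serialized result is what makes the Serialize/Deserialize derives necessary: the cached value has to round-trip back into the same types on a hit.
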

The search tool handler now calls search_with_intelligence with the
orchestrator and reranker from EngraphServer, enabling LLM-powered
query expansion and result reranking in the MCP server. Removes
#[allow(dead_code)] from the orchestrator and reranker fields.

…MiniLM-L6-v2

Add BertLayer struct with LayerNorm+bias, absolute position embeddings,
and GELU FFN activation alongside the existing Gemma EmbedLayer. The
CandleEmbed struct now wraps an EmbedModelVariant enum (Gemma | Bert)
and detects architecture from GGUF metadata (general.architecture).

Switch default embedding model from embeddinggemma-300M (256-dim) to
all-MiniLM-L6-v2-GGUF Q8_0 (384-dim, 25MB). Users can still override
to embeddinggemma via config.toml. Update store default dimension to 384.

- Print [N/M] file progress during indexing (was silent for minutes)
- Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
- Add accelerate feature flag for Apple vecLib optimization

Replace candle_transformers::quantized_nn::RmsNorm (which lacks a Metal
kernel) with candle_nn::RmsNorm throughout the Gemma embedding code.
QTensor weights are dequantized to f32 Tensor at load time so the
standard RmsNorm forward pass runs on Metal without error.

Also restores embeddinggemma as the default model (256-dim), replaces
eprint indexing progress with an indicatif progress bar, and fixes
store tests to match the new default dimension.

…support

candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul).
llama.cpp has mature Metal support and auto-detects GPU at build time.

- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator,
  CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer,
  EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU
  detection at CMake build time)
- LlamaContext is !Send so contexts are created per-call from the
  stored LlamaModel (which is Send+Sync)
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer,
  PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
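
The per-call context pattern mentioned in the list above can be illustrated with stand-in types. `Model`, `Context`, and `Embedder` below are hypothetical and do not use the llama-cpp-2 API; a raw-pointer `PhantomData` marker is used only to make the toy `Context` `!Send`, mirroring the constraint the commit describes.

```rust
use std::marker::PhantomData;
use std::sync::Arc;
use std::thread;

/// Stand-in for a loaded model: immutable after load, so it is Send + Sync.
pub struct Model {
    pub dim: usize,
}

/// Stand-in for an inference context. The raw-pointer marker makes this type
/// !Send and !Sync, so it cannot be stored in a struct shared across threads.
pub struct Context<'m> {
    model: &'m Model,
    _not_send: PhantomData<*const ()>,
}

impl Model {
    pub fn new_context(&self) -> Context<'_> {
        Context {
            model: self,
            _not_send: PhantomData,
        }
    }
}

impl Context<'_> {
    /// Toy "embedding": the byte sum of the text repeated `dim` times.
    pub fn embed(&mut self, text: &str) -> Vec<f32> {
        let sum: u32 = text.bytes().map(u32::from).sum();
        vec![sum as f32; self.model.dim]
    }
}

/// Holds only the model, so the embedder itself stays Send + Sync.
pub struct Embedder {
    model: Arc<Model>,
}

impl Embedder {
    pub fn new(model: Model) -> Embedder {
        Embedder {
            model: Arc::new(model),
        }
    }

    /// The context is created, used, and dropped inside this call, so it
    /// never has to cross a thread boundary.
    pub fn embed_one(&self, text: &str) -> Vec<f32> {
        let mut ctx = self.model.new_context();
        ctx.embed(text)
    }
}

/// Each worker thread shares the embedder and builds its own context.
pub fn embed_on_threads(embedder: Arc<Embedder>, texts: Vec<String>) -> Vec<Vec<f32>> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|text| {
            let embedder = Arc::clone(&embedder);
            thread::spawn(move || embedder.embed_one(&text))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

The trade-off is paying context setup cost on every call in exchange for keeping the public types shareable; whether that cost matters depends on the backend.
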

Replace candle with llama-cpp-2 for all ML inference. Gets Metal GPU
acceleration (88 files in 70s vs 37+ min on CPU).

Fixes: use encode() not decode() for embeddings, set n_ubatch >= n_tokens,
use AddBos::Never (PromptFormat already adds <bos>), force CPU device
for quantized ops (candle Metal unsupported).

Keeps BERT GGUF support code for fallback. Default: embeddinggemma-300M.

run_search now loads orchestrator + reranker when intelligence is
enabled and calls search_with_intelligence instead of search_internal.

…eranker

Bug 1: LlamaBackend::init() fails with BackendAlreadyInitialized if called
more than once. Add a module-level llama_backend() function using OnceLock +
a Mutex-guarded double-checked init (get_or_try_init is still unstable on
stable Rust). Remove the backend field from LlamaEmbed, LlamaOrchestrator,
and LlamaRerank; all three now share the single static backend.
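
A minimal sketch of the OnceLock plus Mutex double-checked init described in Bug 1, using a stand-in `Backend` type rather than `LlamaBackend`. The `INIT_CALLS` counter and the error string are assumptions added so the single-init behaviour is observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

/// Counts how often the fallible init actually ran (demonstration only).
pub static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a backend handle that must be initialized at
/// most once per process.
pub struct Backend {
    pub id: usize,
}

impl Backend {
    /// Fails on any call after the first, like an "already initialized" error.
    fn init() -> Result<Backend, String> {
        let previous = INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        if previous > 0 {
            return Err("backend already initialized".to_string());
        }
        Ok(Backend { id: previous })
    }
}

static BACKEND: OnceLock<Backend> = OnceLock::new();
static INIT_LOCK: Mutex<()> = Mutex::new(());

/// Shared accessor: a lock-free fast path, then a mutex-guarded re-check so
/// only one thread ever runs the fallible init.
pub fn backend() -> Result<&'static Backend, String> {
    if let Some(existing) = BACKEND.get() {
        return Ok(existing);
    }
    // Tolerate a poisoned lock: the guarded data is just `()`.
    let _guard = INIT_LOCK.lock().unwrap_or_else(|p| p.into_inner());
    if let Some(existing) = BACKEND.get() {
        return Ok(existing);
    }
    let created = Backend::init()?;
    // Cannot race: every writer holds INIT_LOCK at this point.
    Ok(BACKEND.get_or_init(|| created))
}
```

The mutex exists only because the init can fail; with an infallible init, `OnceLock::get_or_init` alone would be enough.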

Bug 2: LlamaOrchestrator and LlamaRerank were loading an external
tokenizer.json via load_hf_tokenizer(), which does not exist in Qwen3 GGUF
repos. Switch both to llama.cpp's built-in tokenizer: str_to_token() for
encoding, token_to_piece() for decoding, and str_to_token("Yes"/"No") for
Yes/No token ID lookup. Remove the tokenizer field from both structs and
drop the load_hf_tokenizer() helper. Add encoding_rs as a direct dependency
(required by token_to_piece's Decoder parameter; was already a transitive dep).
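
For context on why the Yes/No token IDs are looked up: a common way to turn them into a relevance score is a two-way softmax over the logits at those two positions. The sketch below assumes that scheme; the commit message does not spell out the scoring formula, and `yes_probability` is a hypothetical helper, not a function from the project or from llama-cpp-2.

```rust
/// Probability of "Yes" from a two-way softmax over the "Yes" and "No"
/// logits. Returns `None` if either token ID is outside the logits slice.
///
/// Assumption: the reranker scores a (query, document) pair by comparing
/// these two logits for the next token; the PR only states that the two
/// token IDs are looked up.
pub fn yes_probability(logits: &[f32], yes_id: usize, no_id: usize) -> Option<f32> {
    let yes = *logits.get(yes_id)?;
    let no = *logits.get(no_id)?;
    // softmax([no, yes])[yes] == sigmoid(yes - no) == 1 / (1 + e^(no - yes))
    Some(1.0 / (1.0 + (no - yes).exp()))
}

/// Sort candidate indices by descending score.
pub fn rank_by_score(scores: &[f32]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
}
```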

All 270 unit tests pass, clippy clean, fmt clean.

…elligence

- LlamaBackend shared via OnceLock (was re-initialized per model, crashed)
- Orchestrator/reranker use llama.cpp built-in tokenizer (GGUF-embedded)
- CLI search loads intelligence models when enabled
- Debug log for orchestration results
- README: llama.cpp references, Metal GPU, 270 tests, CMake requirement
- CHANGELOG: v1.0.1 entry with all fixes and backend switch
- CLAUDE.md: llama-cpp-2 deps, LlamaEmbed/LlamaOrchestrator/LlamaRerank
- Release workflow: CMake on Ubuntu, cmake dep in Homebrew formula
- Vault spec: updated with hotfix PR reference
@devwhodevs devwhodevs merged commit 3631c25 into main Mar 25, 2026
3 checks passed