v1.0.1: Fix tokenizer loading, prompt formatting, cache wiring, MCP intelligence #11
Merged
GGUF repos rarely ship tokenizer.json, and Google Gemma tokenizers are gated on HuggingFace. The FlexTokenizer enum wraps both the HuggingFace tokenizers crate and shimmytok (which extracts the tokenizer from GGUF metadata). CandleEmbed uses FlexTokenizer; the orchestrator and reranker remain HF-only.
embed_one now calls prompt_format.format_query() and embed_batch calls prompt_format.format_document() before passing text to embed_text(). This is required for asymmetric models like embeddinggemma that need specific prefixes for queries vs documents.
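A minimal sketch of the query/document asymmetry described above. The `PromptFormat` variants, the prefix strings, and the stand-in `embed_text` are illustrative assumptions; only the names `format_query`, `format_document`, `embed_one`, `embed_batch`, and `embed_text` come from the commit message.

```rust
/// Hypothetical prompt formats. Variant names and prefix strings are
/// assumptions for illustration, not the project's actual values.
#[derive(Debug, Clone, Copy)]
enum PromptFormat {
    /// Symmetric model: text is embedded unchanged.
    Raw,
    /// Asymmetric model: queries and documents get different prefixes.
    Prefixed {
        query_prefix: &'static str,
        document_prefix: &'static str,
    },
}

impl PromptFormat {
    fn format_query(&self, text: &str) -> String {
        match self {
            PromptFormat::Raw => text.to_string(),
            PromptFormat::Prefixed { query_prefix, .. } => format!("{query_prefix}{text}"),
        }
    }

    fn format_document(&self, text: &str) -> String {
        match self {
            PromptFormat::Raw => text.to_string(),
            PromptFormat::Prefixed { document_prefix, .. } => format!("{document_prefix}{text}"),
        }
    }
}

/// Stand-in for the real model call: returns the input length as a
/// one-element vector so the effect of the prefix is observable.
fn embed_text(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Queries are formatted with the query prefix before embedding.
fn embed_one(fmt: &PromptFormat, query: &str) -> Vec<f32> {
    embed_text(&fmt.format_query(query))
}

/// Documents are formatted with the document prefix before embedding.
fn embed_batch(fmt: &PromptFormat, docs: &[&str]) -> Vec<Vec<f32>> {
    docs.iter()
        .map(|d| embed_text(&fmt.format_document(d)))
        .collect()
}

fn main() {
    let fmt = PromptFormat::Prefixed {
        query_prefix: "search_query: ",
        document_prefix: "search_document: ",
    };
    println!("{}", fmt.format_query("rust tokenizers"));
    println!("{}", fmt.format_document("Notes on GGUF metadata"));
    println!("{:?}", embed_one(&PromptFormat::Raw, "rust tokenizers"));
    println!("{:?}", embed_batch(&fmt, &["a", "b"]));
}
```

Keeping the formatting in the embedding entry points, rather than at each call site, means a caller cannot accidentally embed a query with the document prefix.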
reset_for_reindex now also deletes from chunks_fts so stale keyword entries don't survive a dimension migration. Store::init() reads the stored embedding_dim from meta to create the vec table with the correct dimension, preventing a stale 384-dim table from persisting when the model outputs 256-dim vectors.
When an orchestrator is present, compute a SHA256 cache key from the query and check the llm_cache table first. On miss, call the orchestrator and store the result. Adds Serialize/Deserialize to QueryIntent and OrchestrationResult for JSON round-tripping. Removes #[allow(dead_code)] from orchestration_cache_key.
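The check-then-store flow above might look roughly like the following. This is a dependency-free sketch under stated assumptions: a `HashMap` stands in for the `llm_cache` table, a plain `String` stands in for the JSON-serialized `OrchestrationResult`, and std's `DefaultHasher` is swapped in for SHA256 (it is not a cryptographic hash). `LlmCache` and `orchestrate_cached` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the `llm_cache` table: cache key -> serialized result.
#[derive(Default)]
struct LlmCache {
    rows: HashMap<String, String>,
}

/// Derives a cache key from the query text.
/// Assumption: DefaultHasher replaces SHA256 to avoid external crates.
fn orchestration_cache_key(query: &str) -> String {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Returns the cached result on a hit; on a miss, calls the orchestrator
/// and stores its result under the query's key.
fn orchestrate_cached<F>(cache: &mut LlmCache, query: &str, orchestrate: F) -> String
where
    F: FnOnce(&str) -> String,
{
    let key = orchestration_cache_key(query);
    if let Some(hit) = cache.rows.get(&key) {
        return hit.clone();
    }
    let result = orchestrate(query);
    cache.rows.insert(key, result.clone());
    result
}

fn main() {
    let mut cache = LlmCache::default();
    let mut calls = 0;
    for _ in 0..2 {
        let out = orchestrate_cached(&mut cache, "how does indexing work", |q| {
            calls += 1; // counts real orchestrator invocations
            format!("expanded: {q}")
        });
        println!("{out}");
    }
    println!("orchestrator calls: {calls}");
}
```

The second lookup is served from the cache, so the closure standing in for the orchestrator runs only once.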
The search tool handler now calls search_with_intelligence with the orchestrator and reranker from EngraphServer, enabling LLM-powered query expansion and result reranking in the MCP server. Removes #[allow(dead_code)] from the orchestrator and reranker fields.
…MiniLM-L6-v2
Add BertLayer struct with LayerNorm+bias, absolute position embeddings, and GELU FFN activation alongside the existing Gemma EmbedLayer. The CandleEmbed struct now wraps an EmbedModelVariant enum (Gemma | Bert) and detects architecture from GGUF metadata (general.architecture). Switch default embedding model from embeddinggemma-300M (256-dim) to all-MiniLM-L6-v2-GGUF Q8_0 (384-dim, 25MB). Users can still override to embeddinggemma via config.toml. Update store default dimension to 384.
- Print [N/M] file progress during indexing (was silent for minutes)
- Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
- Add accelerate feature flag for Apple vecLib optimization
Replace candle_transformers::quantized_nn::RmsNorm (which lacks a Metal kernel) with candle_nn::RmsNorm throughout the Gemma embedding code. QTensor weights are dequantized to f32 Tensor at load time so the standard RmsNorm forward pass runs on Metal without error. Also restores embeddinggemma as the default model (256-dim), replaces eprint indexing progress with an indicatif progress bar, and fixes store tests to match the new default dimension.
…support
candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul). llama.cpp has mature Metal support and auto-detects GPU at build time.
- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator, CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer, EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU detection at CMake build time)
- LlamaContext is !Send so contexts are created per-call from the stored LlamaModel (which is Send+Sync)
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer, PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
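The per-call context pattern noted above can be illustrated without llama-cpp-2. In this sketch `Model`, `Context`, and `Embedder` are hypothetical stand-ins: the model is `Send + Sync` and stored once, while the context is made `!Send` with a raw-pointer `PhantomData` marker and is created inside each call, so it never crosses a thread boundary.

```rust
use std::marker::PhantomData;
use std::sync::Arc;
use std::thread;

/// Stand-in for a loaded model: plain data, so it is Send + Sync.
struct Model {
    dim: usize,
}

/// Stand-in for an inference context. The raw-pointer marker makes the
/// type !Send, mimicking a context that must stay on its creating thread.
struct Context<'m> {
    model: &'m Model,
    _not_send: PhantomData<*mut ()>,
}

impl Model {
    fn new_context(&self) -> Context<'_> {
        Context {
            model: self,
            _not_send: PhantomData,
        }
    }
}

impl Context<'_> {
    /// Fake encode step: fills a `dim`-sized vector with the text length.
    fn encode(&mut self, text: &str) -> Vec<f32> {
        vec![text.len() as f32; self.model.dim]
    }
}

/// Holds only the shareable model; contexts are never stored.
struct Embedder {
    model: Arc<Model>,
}

impl Embedder {
    fn embed(&self, text: &str) -> Vec<f32> {
        // The context is created and dropped within this call, so it
        // lives only on the calling thread.
        let mut ctx = self.model.new_context();
        ctx.encode(text)
    }
}

fn main() {
    let embedder = Arc::new(Embedder {
        model: Arc::new(Model { dim: 4 }),
    });
    let handles: Vec<_> = (0..2)
        .map(|i| {
            let e = Arc::clone(&embedder);
            thread::spawn(move || e.embed(&"x".repeat(i + 1)))
        })
        .collect();
    for h in handles {
        println!("{:?}", h.join().unwrap());
    }
}
```

Storing a `Context` as a field of `Embedder` would make `Embedder` itself `!Send`, and the `thread::spawn` call would then fail to compile. The trade-off is paying context creation cost on every call.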
Replace candle with llama-cpp-2 for all ML inference. Gets Metal GPU acceleration (88 files in 70s vs 37+ min on CPU). Fixes: use encode() not decode() for embeddings, set n_ubatch >= n_tokens, use AddBos::Never (PromptFormat already adds <bos>), force CPU device for quantized ops (candle Metal unsupported). Keeps BERT GGUF support code for fallback. Default: embeddinggemma-300M.
run_search now loads orchestrator + reranker when intelligence is enabled and calls search_with_intelligence instead of search_internal.
…eranker
Bug 1: LlamaBackend::init() fails with BackendAlreadyInitialized if called
more than once. Add a module-level llama_backend() function using OnceLock +
a Mutex-guarded double-checked init (OnceLock::get_or_try_init is not yet
stabilized). Remove the backend field from LlamaEmbed, LlamaOrchestrator,
and LlamaRerank; all three now share the single static backend.
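
A minimal sketch of the OnceLock plus Mutex double-checked init described in Bug 1, using only the standard library. `Backend`, `InitError`, `init_backend`, and `shared_backend` are hypothetical stand-ins for the llama-cpp-2 types and the module-level llama_backend() function.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

/// Stand-in for a backend that may only be initialized once per process.
#[derive(Debug)]
struct Backend {
    id: u32,
}

#[derive(Debug)]
struct InitError;

/// Counts how many times the fallible initializer actually ran.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the fallible, call-once initializer.
fn init_backend() -> Result<Backend, InitError> {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    Ok(Backend { id: 1 })
}

static BACKEND: OnceLock<Backend> = OnceLock::new();
static INIT_LOCK: Mutex<()> = Mutex::new(());

/// Returns the shared backend, initializing it on first use.
fn shared_backend() -> Result<&'static Backend, InitError> {
    // First check: fast path once the backend exists.
    if let Some(backend) = BACKEND.get() {
        return Ok(backend);
    }
    // Serialize initialization so the fallible init runs at most once
    // at a time; a poisoned lock is recovered since it guards no data.
    let _guard = INIT_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // Second check: another thread may have finished while we waited.
    if let Some(backend) = BACKEND.get() {
        return Ok(backend);
    }
    // On error nothing is stored, so a later call can retry.
    let backend = init_backend()?;
    Ok(BACKEND.get_or_init(|| backend))
}

fn main() {
    let a = shared_backend().expect("init failed");
    let b = shared_backend().expect("init failed");
    println!("same instance: {}", std::ptr::eq(a, b));
    println!("init calls: {}", INIT_CALLS.load(Ordering::SeqCst));
}
```

The explicit Mutex is what makes the fallible path safe: OnceLock::get_or_init alone cannot propagate an error from the initializer.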
Bug 2: LlamaOrchestrator and LlamaRerank were loading an external
tokenizer.json via load_hf_tokenizer(), which does not exist in Qwen3 GGUF
repos. Switch both to llama.cpp's built-in tokenizer: str_to_token() for
encoding, token_to_piece() for decoding, and str_to_token("Yes"/"No") for
Yes/No token ID lookup. Remove the tokenizer field from both structs and
drop the load_hf_tokenizer() helper. Add encoding_rs as a direct dependency
(required by token_to_piece's Decoder parameter; was already a transitive dep).
All 270 unit tests pass, clippy clean, fmt clean.
…elligence
- LlamaBackend shared via OnceLock (was re-initialized per model, crashed)
- Orchestrator/reranker use llama.cpp built-in tokenizer (GGUF-embedded)
- CLI search loads intelligence models when enabled
- Debug log for orchestration results
- README: llama.cpp references, Metal GPU, 270 tests, CMake requirement
- CHANGELOG: v1.0.1 entry with all fixes and backend switch
- CLAUDE.md: llama-cpp-2 deps, LlamaEmbed/LlamaOrchestrator/LlamaRerank
- Release workflow: CMake on Ubuntu, cmake dep in Homebrew formula
- Vault spec: updated with hotfix PR reference
Summary
Hotfix for v1.0.0: switches the inference backend from candle to llama.cpp and fixes all code review issues.
Backend Switch (candle → llama.cpp)
Embedding Fixes
- encode() not decode() for embedding models
- n_ubatch >= n_tokens (llama.cpp assertion requirement)
- AddBos::Never since PromptFormat already adds <bos>
- embed_one → search_query prefix, embed_batch → search_document prefix
Code Review Fixes
- Store::init() reads stored dim, reset_for_reindex clears FTS
- search_with_intelligence
Search Quality (verified on real vault)
Test plan
- cargo test --lib: 270/270 pass
- cargo clippy -- -D warnings: clean