v1.0.1: Fix tokenizer loading, prompt formatting, cache wiring, MCP intelligence by devwhodevs · Pull Request #11 · devwhodevs/engraph

devwhodevs · 2026-03-25T18:19:22Z

Summary

Hotfix for v1.0.0 — switches inference backend from candle to llama.cpp and fixes all code review issues.

Backend Switch (candle → llama.cpp)

llama-cpp-2 replaces candle-core/candle-nn/candle-transformers (-1500 lines net)
Metal GPU acceleration — 88 files indexed in 70 seconds (was 37+ minutes on CPU with candle)
Same model stack as qmd: embeddinggemma-300M, qwen3-reranker, qwen3 for expansion
llama.cpp handles tokenization from GGUF metadata (no separate tokenizer.json needed)

Embedding Fixes

Use encode() not decode() for embedding models
Set n_ubatch >= n_tokens (llama.cpp assertion requirement)
Use AddBos::Never since PromptFormat already adds <bos>
Apply prompt format: embed_one → search_query prefix, embed_batch → search_document prefix

Code Review Fixes

Fix dimension migration: Store::init() reads stored dim, reset_for_reindex clears FTS
Wire LLM cache into search_with_intelligence
Wire orchestrator + reranker into MCP server search handler
Progress bar during indexing via indicatif

Search Quality (verified on real vault)

"Brilliant Earth checkout" → finds BRE-2579, Brilliant Earth note, Luke Clifton
"Scentbird Drift engineering manager" → finds Scentbird, Drift Team, Drift notes
"who works on checkout" → Intent: Relationship, finds John Nelson, BRE-1728, Brilliant Earth

Test plan

cargo test --lib — 270/270 pass
cargo clippy -- -D warnings — clean
Real vault: 88 files, 3493 chunks, 70s index time on Metal
Search quality verified across 5 query types
CI green (needs CMake on runners)

GGUF repos rarely ship tokenizer.json, and Google Gemma tokenizers are gated on HuggingFace. The FlexTokenizer enum wraps both the HuggingFace tokenizers crate and shimmytok (which extracts the tokenizer from GGUF metadata). CandleEmbed uses FlexTokenizer; the orchestrator and reranker remain HF-only.

embed_one now calls prompt_format.format_query() and embed_batch calls prompt_format.format_document() before passing text to embed_text(). This is required for asymmetric models like embeddinggemma that need specific prefixes for queries vs documents.
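As a rough illustration of the query/document split described above, here is a minimal sketch. The struct layout, the `Embedder` type, and the prefix strings are assumptions made for the example (the real prefixes depend on the embedding model), and `embed_text` returns the prepared text instead of running a model.

```rust
/// Hypothetical stand-in for the PR's PromptFormat. The prefix strings are
/// supplied by the caller and must match what the embedding model expects.
pub struct PromptFormat {
    pub query_prefix: String,
    pub document_prefix: String,
}

impl PromptFormat {
    pub fn new(query_prefix: &str, document_prefix: &str) -> Self {
        Self {
            query_prefix: query_prefix.to_string(),
            document_prefix: document_prefix.to_string(),
        }
    }

    /// Prepend the query-side prefix.
    pub fn format_query(&self, text: &str) -> String {
        format!("{}{}", self.query_prefix, text)
    }

    /// Prepend the document-side prefix.
    pub fn format_document(&self, text: &str) -> String {
        format!("{}{}", self.document_prefix, text)
    }
}

/// Hypothetical embedder that only shows where formatting happens.
pub struct Embedder {
    pub format: PromptFormat,
}

impl Embedder {
    pub fn new(format: PromptFormat) -> Self {
        Self { format }
    }

    /// Queries get the query prefix before reaching the model.
    pub fn embed_one(&self, query: &str) -> String {
        self.embed_text(&self.format.format_query(query))
    }

    /// Documents get the document prefix before reaching the model.
    pub fn embed_batch(&self, docs: &[&str]) -> Vec<String> {
        docs.iter()
            .map(|d| self.embed_text(&self.format.format_document(d)))
            .collect()
    }

    /// Placeholder: a real implementation would tokenize and run inference.
    /// Here the text that would be sent to the model is returned unchanged.
    fn embed_text(&self, formatted: &str) -> String {
        formatted.to_string()
    }
}
```

Keeping the prefixing in `embed_one` and `embed_batch` rather than in `embed_text` means callers cannot accidentally embed a query with the document prefix.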

reset_for_reindex now also deletes from chunks_fts so stale keyword entries don't survive a dimension migration. Store::init() reads the stored embedding_dim from meta to create the vec table with the correct dimension, preventing a stale 384-dim table from persisting when the model outputs 256-dim vectors.
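The dimension handling above can be sketched with an in-memory stand-in. This is not the actual SQLite-backed Store: the field layout and the `default_dim` and `new_dim` parameters are assumptions; only the `embedding_dim` meta key and the two method names come from the commit message.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the store (hypothetical layout).
pub struct Store {
    pub meta: HashMap<String, String>,
    pub chunks: Vec<String>,
    pub chunks_fts: Vec<String>,
    /// Dimension the vector table was created with.
    pub vec_dim: usize,
}

impl Store {
    /// Prefer the dimension recorded in `meta`; use the default only when
    /// nothing has been stored yet (for example, a fresh database).
    pub fn init(meta: HashMap<String, String>, default_dim: usize) -> Self {
        let vec_dim = meta
            .get("embedding_dim")
            .and_then(|v| v.parse::<usize>().ok())
            .unwrap_or(default_dim);
        Store {
            meta,
            chunks: Vec::new(),
            chunks_fts: Vec::new(),
            vec_dim,
        }
    }

    /// Clear vector-side and keyword-side data together so that stale
    /// full-text entries do not survive a dimension change.
    pub fn reset_for_reindex(&mut self, new_dim: usize) {
        self.chunks.clear();
        self.chunks_fts.clear();
        self.meta
            .insert("embedding_dim".to_string(), new_dim.to_string());
        self.vec_dim = new_dim;
    }
}
```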

When an orchestrator is present, compute a SHA256 cache key from the query and check the llm_cache table first. On miss, call the orchestrator and store the result. Adds Serialize/Deserialize to QueryIntent and OrchestrationResult for JSON round-tripping. Removes #[allow(dead_code)] from orchestration_cache_key.
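The check-then-fill flow described above might look roughly like the following. To keep the sketch dependency-free it swaps SHA256 for the standard library's `DefaultHasher` and the llm_cache table for a `HashMap`; both are stand-ins, and the fields of `OrchestrationResult` and the `orchestrate_cached` helper are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical shape of an orchestration result.
#[derive(Clone, Debug, PartialEq)]
pub struct OrchestrationResult {
    pub intent: String,
    pub expansions: Vec<String>,
}

/// Stand-in cache key. The PR uses SHA256; DefaultHasher is used here only
/// to avoid an external crate and is not stable across Rust versions.
pub fn orchestration_cache_key(query: &str) -> String {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Look the query up in the cache first; on a miss, call the orchestrator
/// and store its result under the same key.
pub fn orchestrate_cached<F>(
    cache: &mut HashMap<String, OrchestrationResult>,
    query: &str,
    orchestrate: F,
) -> OrchestrationResult
where
    F: FnOnce(&str) -> OrchestrationResult,
{
    let key = orchestration_cache_key(query);
    if let Some(hit) = cache.get(&key) {
        return hit.clone();
    }
    let result = orchestrate(query);
    cache.insert(key, result.clone());
    result
}
```

In the real code the cached value is serialized to JSON, which is why the PR adds Serialize/Deserialize to the result types.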

The search tool handler now calls search_with_intelligence with the orchestrator and reranker from EngraphServer, enabling LLM-powered query expansion and result reranking in the MCP server. Removes #[allow(dead_code)] from the orchestrator and reranker fields.

…MiniLM-L6-v2

Add BertLayer struct with LayerNorm+bias, absolute position embeddings, and GELU FFN activation alongside the existing Gemma EmbedLayer. The CandleEmbed struct now wraps an EmbedModelVariant enum (Gemma | Bert) and detects architecture from GGUF metadata (general.architecture).

Switch default embedding model from embeddinggemma-300M (256-dim) to all-MiniLM-L6-v2-GGUF Q8_0 (384-dim, 25MB). Users can still override to embeddinggemma via config.toml. Update store default dimension to 384.

- Print [N/M] file progress during indexing (was silent for minutes)
- Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
- Add accelerate feature flag for Apple vecLib optimization

Replace candle_transformers::quantized_nn::RmsNorm (which lacks a Metal kernel) with candle_nn::RmsNorm throughout the Gemma embedding code. QTensor weights are dequantized to f32 Tensor at load time so the standard RmsNorm forward pass runs on Metal without error. Also restores embeddinggemma as the default model (256-dim), replaces eprint indexing progress with an indicatif progress bar, and fixes store tests to match the new default dimension.

…support

candle lacks Metal kernels for quantized GGUF models (rms-norm, QMatMul). llama.cpp has mature Metal support and auto-detects GPU at build time.

- Replace candle-core/candle-nn/candle-transformers with llama-cpp-2
- CandleEmbed -> LlamaEmbed, CandleOrchestrator -> LlamaOrchestrator, CandleRerank -> LlamaRerank
- Remove select_device(), CandleQMatMul, EmbedLayer, BertLayer, EmbedModelVariant (llama.cpp handles all model loading internally)
- Remove metal/accelerate/cuda feature flags (llama.cpp handles GPU detection at CMake build time)
- LlamaContext is !Send so contexts are created per-call from the stored LlamaModel (which is Send+Sync)
- Public API unchanged: traits, MockLlm, download infra, FlexTokenizer, PromptFormat, heuristic_orchestrate all preserved
- 270 tests pass (net -1: removed select_device test)
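The per-call context pattern mentioned in the list above can be illustrated with stand-in types. `Model`, `Context`, and `Embedder` here are hypothetical and do not use the llama-cpp-2 API; the raw-pointer `PhantomData` only simulates a context type that is not `Send`.

```rust
use std::marker::PhantomData;
use std::sync::Arc;

/// Stand-in for a loaded model: plain data, so it is Send + Sync.
pub struct Model {
    pub dim: usize,
}

/// Stand-in for an inference context. The raw-pointer PhantomData makes it
/// !Send and !Sync, so it cannot be stored in a struct shared across threads.
pub struct Context<'m> {
    model: &'m Model,
    _not_send: PhantomData<*const ()>,
}

impl Model {
    /// Contexts borrow the model and are cheap to create in this sketch.
    pub fn new_context(&self) -> Context<'_> {
        Context {
            model: self,
            _not_send: PhantomData,
        }
    }
}

impl Context<'_> {
    /// Placeholder "inference": a vector of `dim` copies of the text length.
    pub fn encode(&mut self, text: &str) -> Vec<f32> {
        vec![text.len() as f32; self.model.dim]
    }
}

/// Holds only the model, so the embedder itself stays Send + Sync.
pub struct Embedder {
    model: Arc<Model>,
}

impl Embedder {
    pub fn new(dim: usize) -> Self {
        Self {
            model: Arc::new(Model { dim }),
        }
    }

    /// Create a context for this call only; it is dropped before returning.
    pub fn embed(&self, text: &str) -> Vec<f32> {
        let mut ctx = self.model.new_context();
        ctx.encode(text)
    }
}

/// Compile-time check helper: only accepts Send + Sync types.
pub fn assert_send_sync<T: Send + Sync>() {}
```

The trade-off is that context setup cost is paid on every call, in exchange for an embedder that can be shared freely across threads.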

Replace candle with llama-cpp-2 for all ML inference. Gets Metal GPU acceleration (88 files in 70s vs 37+ min on CPU). Fixes: use encode() not decode() for embeddings, set n_ubatch >= n_tokens, use AddBos::Never (PromptFormat already adds <bos>), force CPU device for quantized ops (candle Metal unsupported). Keeps BERT GGUF support code for fallback. Default: embeddinggemma-300M.

run_search now loads orchestrator + reranker when intelligence is enabled and calls search_with_intelligence instead of search_internal.

…eranker

Bug 1: LlamaBackend::init() fails with BackendAlreadyInitialized if called more than once. Add a module-level llama_backend() function using OnceLock + a Mutex-guarded double-checked init (get_or_try_init is still unstable on stable Rust). Remove the backend field from LlamaEmbed, LlamaOrchestrator, and LlamaRerank; all three now share the single static backend.

Bug 2: LlamaOrchestrator and LlamaRerank were loading an external tokenizer.json via load_hf_tokenizer(), which does not exist in Qwen3 GGUF repos. Switch both to llama.cpp's built-in tokenizer: str_to_token() for encoding, token_to_piece() for decoding, and str_to_token("Yes"/"No") for Yes/No token ID lookup. Remove the tokenizer field from both structs and drop the load_hf_tokenizer() helper. Add encoding_rs as a direct dependency (required by token_to_piece's Decoder parameter; was already a transitive dep).

All 270 unit tests pass, clippy clean, fmt clean.
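The OnceLock plus Mutex double-checked initialization from Bug 1 can be sketched using only the standard library. `Backend`, `InitError`, and `shared_backend` are hypothetical stand-ins for LlamaBackend and llama_backend(); the stand-in `init` fails on a second call to mimic BackendAlreadyInitialized.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

/// Counts calls to the stand-in backend initializer.
pub static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a process-wide backend handle.
#[derive(Debug)]
pub struct Backend;

#[derive(Debug, PartialEq)]
pub enum InitError {
    AlreadyInitialized,
}

impl Backend {
    /// Fallible init that rejects a second call, like the real backend.
    fn init() -> Result<Self, InitError> {
        if INIT_CALLS.fetch_add(1, Ordering::SeqCst) > 0 {
            Err(InitError::AlreadyInitialized)
        } else {
            Ok(Backend)
        }
    }
}

static BACKEND: OnceLock<Backend> = OnceLock::new();
static INIT_LOCK: Mutex<()> = Mutex::new(());

/// Return the shared backend, initializing it at most once.
pub fn shared_backend() -> Result<&'static Backend, InitError> {
    // Fast path: already initialized, no lock needed.
    if let Some(backend) = BACKEND.get() {
        return Ok(backend);
    }
    // Slow path: serialize initializers, then check again under the lock.
    let _guard = INIT_LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    if let Some(backend) = BACKEND.get() {
        return Ok(backend);
    }
    // Only one thread reaches this point per successful init. On error the
    // cell stays empty, so a later call may retry.
    let backend = Backend::init()?;
    Ok(BACKEND.get_or_init(|| backend))
}
```

The extra Mutex is needed because `OnceLock::get_or_init` takes an infallible closure; the lock guarantees that the fallible `init` runs on a single thread before the value is stored.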

…elligence

- LlamaBackend shared via OnceLock (was re-initialized per model, crashed)
- Orchestrator/reranker use llama.cpp built-in tokenizer (GGUF-embedded)
- CLI search loads intelligence models when enabled
- Debug log for orchestration results

- README: llama.cpp references, Metal GPU, 270 tests, CMake requirement
- CHANGELOG: v1.0.1 entry with all fixes and backend switch
- CLAUDE.md: llama-cpp-2 deps, LlamaEmbed/LlamaOrchestrator/LlamaRerank
- Release workflow: CMake on Ubuntu, cmake dep in Homebrew formula
- Vault spec: updated with hotfix PR reference

devwhodevs added 17 commits March 25, 2026 20:09

feat: add accelerate feature flag for optimized CPU on macOS

7171792

fix: add indexing progress output, fix Qwen3 GGUF filename case

4892309

- Print [N/M] file progress during indexing (was silent for minutes)
- Fix expand model URI: Qwen3-0.6B-Q8_0.gguf (uppercase, was 404)
- Add accelerate feature flag for Apple vecLib optimization

style: cargo fmt

8c4f192

ci: install CMake on Ubuntu for llama.cpp build

f55dd85

feat(search): wire intelligence models into CLI search path

55f67ad

run_search now loads orchestrator + reranker when intelligence is enabled and calls search_with_intelligence instead of search_internal.

devwhodevs merged commit 3631c25 into main Mar 25, 2026
3 checks passed

devwhodevs merged 17 commits into main from hotfix/v1.0.1-tokenizer-and-review-fixes
