Merged
17 commits
b478bec
fix(llm): add shimmytok fallback for GGUF-embedded tokenizers
devwhodevs Mar 25, 2026
bd87d71
fix(llm): apply prompt format in CandleEmbed embed_one and embed_batch
devwhodevs Mar 25, 2026
7c6ba89
fix(store): clear FTS on reindex and use stored dim for vec table init
devwhodevs Mar 25, 2026
9017388
fix(search): wire LLM cache into search_with_intelligence
devwhodevs Mar 25, 2026
6a55e31
fix(serve): wire orchestrator and reranker into MCP search handler
devwhodevs Mar 25, 2026
dd21f41
feat(llm): add BERT GGUF architecture support, switch default to all-…
devwhodevs Mar 25, 2026
7171792
feat: add accelerate feature flag for optimized CPU on macOS
devwhodevs Mar 25, 2026
4892309
fix: add indexing progress output, fix Qwen3 GGUF filename case
devwhodevs Mar 25, 2026
20be487
fix: use float32 RmsNorm for Metal GPU compatibility in Gemma embedding
devwhodevs Mar 25, 2026
ebb814b
refactor(llm): replace candle backend with llama-cpp-2 for Metal GPU …
devwhodevs Mar 25, 2026
3db1ae5
feat(llm): switch to llama.cpp backend, fix embedding params
devwhodevs Mar 25, 2026
8c4f192
style: cargo fmt
devwhodevs Mar 25, 2026
f55dd85
ci: install CMake on Ubuntu for llama.cpp build
devwhodevs Mar 25, 2026
55f67ad
feat(search): wire intelligence models into CLI search path
devwhodevs Mar 25, 2026
5c7f285
fix: singleton LlamaBackend and built-in tokenizer for orchestrator/r…
devwhodevs Mar 25, 2026
8b85976
fix(llm): global backend singleton, built-in tokenizers, wire CLI int…
devwhodevs Mar 25, 2026
e8c159f
docs: update README, CHANGELOG, CLAUDE.md for llama.cpp backend
devwhodevs Mar 25, 2026
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -14,6 +14,9 @@ jobs:
     runs-on: ${{ matrix.os }}
     steps:
       - uses: actions/checkout@v4
+      - name: Install CMake (Ubuntu)
+        if: runner.os == 'Linux'
+        run: sudo apt-get update && sudo apt-get install -y cmake
       - uses: dtolnay/rust-toolchain@stable
         with:
           components: rustfmt, clippy
4 changes: 4 additions & 0 deletions .github/workflows/release.yml
@@ -19,6 +19,9 @@ jobs:
       contents: write
     steps:
       - uses: actions/checkout@v4
+      - name: Install CMake (Ubuntu)
+        if: runner.os == 'Linux'
+        run: sudo apt-get update && sudo apt-get install -y cmake
       - uses: dtolnay/rust-toolchain@stable
       - run: cargo build --release
       - name: Archive binary
@@ -60,6 +63,7 @@ jobs:
sha256 "SHA256"
license "MIT"

+depends_on "cmake" => :build
depends_on "rust" => :build

def install
24 changes: 24 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,29 @@
# Changelog

## [1.0.1] - 2026-03-26

### Changed
- **Inference backend switched from candle to llama.cpp** — via `llama-cpp-2` Rust bindings. Gets full Metal GPU acceleration on macOS (88 files indexed in 70s vs 37+ minutes on CPU with candle). Same backend as [qmd](https://github.com/tobi/qmd).
- Default embedding model produces 256-dim vectors via embeddinggemma-300M (Matryoshka truncation)
- BERT GGUF architecture support added alongside Gemma (future model flexibility)
- Progress bar during indexing via indicatif (indexing was previously silent for minutes)
- CI workflow installs CMake on Ubuntu (required for llama.cpp build)
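
Note on the 256-dim default: Matryoshka truncation keeps only the leading components of a wider embedding. The sketch below shows one common way to do that, followed by L2 re-normalization so cosine comparisons stay meaningful. The function name and the 768-wide input are assumptions for illustration, and this is not necessarily how `llm.rs` does it.

```rust
/// Keep the first `dim` components of an embedding and L2-normalize the result.
/// Hypothetical helper; the name and signature are illustrative only.
fn truncate_and_normalize(full: &[f32], dim: usize) -> Vec<f32> {
    // Matryoshka-style models place the most useful information in the
    // leading components, so a plain prefix is taken.
    let mut v: Vec<f32> = full.iter().take(dim).copied().collect();

    // Re-normalize: the prefix of a unit vector is generally not unit length.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}

fn main() {
    // Stand-in for a full-width model output (768 is an assumed width).
    let full: Vec<f32> = (0..768).map(|i| (i as f32 * 0.01).sin()).collect();

    let small = truncate_and_normalize(&full, 256);
    let norm = small.iter().map(|x| x * x).sum::<f32>().sqrt();
    println!("dims = {}, norm = {:.3}", small.len(), norm);
}
```

Vectors truncated to different widths are not comparable with each other, which is presumably why a dimension change forces a reindex.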

### Fixed
- **Prompt format applied during embedding** — `embed_one` uses the search_query prefix, `embed_batch` uses the search_document prefix. Without these prefixes, embeddinggemma ran in the wrong (symmetric) mode.
- **GGUF tokenizer fallback** — added `shimmytok` crate to extract tokenizer from GGUF metadata when tokenizer.json is unavailable (Google Gemma repos are gated)
- **LlamaBackend singleton** — global `OnceLock` prevents double-initialization crash when loading multiple models
- **Orchestrator/reranker use built-in tokenizer** — llama.cpp reads tokenizer from GGUF metadata, no external tokenizer.json needed
- **Dimension migration clears FTS** — `reset_for_reindex` now also clears `chunks_fts` to prevent duplicate entries
- **LLM cache wired into search** — `search_with_intelligence` checks/populates `llm_cache` table
- **MCP server wires intelligence** — search handler passes orchestrator + reranker via `SearchConfig`
- **CLI search wires intelligence** — `run_search` loads models when intelligence enabled
- **Qwen3 GGUF filename** — fixed case sensitivity (was 404)
- **Embedding batch params** — set `n_ubatch >= n_tokens` to satisfy the batch-size assertion, call `encode()` instead of `decode()`, and use `AddBos::Never` because `PromptFormat` already adds `<bos>`
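
Note on the `OnceLock` singleton: the pattern behind the LlamaBackend fix can be sketched with the standard library alone. `Backend`, `backend()` and `INIT_COUNT` are hypothetical stand-ins, not names from this PR. The real backend handle comes from `llama-cpp-2` and its initialization can fail, which this minimal version ignores.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for an expensive handle that must be created once.
struct Backend {
    id: usize,
}

/// Counts how many times the initializer actually ran.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Process-wide slot; empty until the first call to `backend()`.
static BACKEND: OnceLock<Backend> = OnceLock::new();

/// Returns the shared backend, initializing it on first use only.
fn backend() -> &'static Backend {
    BACKEND.get_or_init(|| {
        let n = INIT_COUNT.fetch_add(1, Ordering::SeqCst) + 1;
        Backend { id: n }
    })
}

fn main() {
    // Each model loader (embedder, orchestrator, reranker) would call
    // `backend()` instead of constructing its own handle.
    let a = backend();
    let b = backend();
    assert!(std::ptr::eq(a, b));
    println!(
        "backend id {} initialized {} time(s)",
        a.id,
        INIT_COUNT.load(Ordering::SeqCst)
    );
}
```

`get_or_init` also blocks concurrent callers until the first initializer finishes, so loading models from several threads would still produce a single backend.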

### Removed
- `candle-core`, `candle-nn`, `candle-transformers` dependencies (replaced by `llama-cpp-2`)

## [1.0.0] - 2026-03-25

### Added
14 changes: 7 additions & 7 deletions CLAUDE.md
@@ -9,7 +9,7 @@ Single binary with 19 modules behind a lib crate:
- `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`. Includes `intelligence: Option<bool>` and `[models]` section for model overrides. `Config::save()` writes back to disk.
- `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
- `docid.rs` — deterministic 6-char hex IDs for files (SHA-256 of path, truncated). Shown in search results for quick reference
-- `llm.rs` — candle model management. Three traits: `EmbedModel` (embeddings), `RerankModel` (cross-encoder scoring), `OrchestratorModel` (query intent + expansion). Three candle implementations: `CandleEmbed` (custom bidirectional transformer from GGUF for embeddinggemma), `CandleOrchestrator` (quantized_qwen3 for query analysis), `CandleRerank` (quantized_qwen3 for relevance scoring). Also: `MockLlm` for testing, `HfModelUri` for model download, `PromptFormat` for model-family prompt templates, `heuristic_orchestrate()` fast path, `LaneWeights` per query intent
+- `llm.rs` — ML inference via llama.cpp (Rust bindings: `llama-cpp-2`). Three traits: `EmbedModel` (embeddings), `RerankModel` (cross-encoder scoring), `OrchestratorModel` (query intent + expansion). Three llama.cpp implementations: `LlamaEmbed` (embeddinggemma-300M GGUF on Metal GPU), `LlamaOrchestrator` (Qwen3-0.6B for query analysis + expansion), `LlamaRerank` (Qwen3-Reranker-0.6B for relevance scoring). Global `LlamaBackend` via `OnceLock`. Also: `MockLlm` for testing, `HfModelUri` for model download, `FlexTokenizer` (HuggingFace tokenizers + shimmytok GGUF fallback), `PromptFormat` for model-family prompt templates, `heuristic_orchestrate()` fast path, `LaneWeights` per query intent
- `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 + graph + reranker results. Supports per-lane weighting, `--explain` output with intent + per-lane detail
- `context.rs` — context engine. Six functions: `read` (full note content + metadata), `list` (filtered note listing with `created_by` filter), `vault_map` (structure overview), `who` (person context bundle), `project` (project context bundle), `context_topic` (rich topic context with budget trimming). Pure functions taking `ContextParams` — no model loading except `context_topic` which reuses `search_internal`
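
Note on `fusion.rs`: weighted Reciprocal Rank Fusion can be sketched as below. Each lane adds `weight / (k + rank)` for every document it returns, and documents are ordered by the summed score. The `Lane` type, the `rrf_fuse` name and `k = 60` (the constant commonly used for RRF) are assumptions for illustration. The module's actual types and constants are not shown in this diff.

```rust
use std::collections::HashMap;

/// One retrieval lane: a weight and its results ordered best-first.
/// Hypothetical type; the real module likely carries more detail.
struct Lane<'a> {
    weight: f64,
    ranked: Vec<&'a str>,
}

/// Weighted RRF: score(d) = sum over lanes of weight / (k + rank),
/// where rank starts at 1 for the top result of a lane.
fn rrf_fuse(lanes: &[Lane<'_>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for lane in lanes {
        for (i, doc) in lane.ranked.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(doc.to_string()).or_insert(0.0) += lane.weight / (k + rank);
        }
    }

    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    let lanes = vec![
        Lane { weight: 1.0, ranked: vec!["a", "b", "c"] }, // e.g. semantic lane
        Lane { weight: 1.0, ranked: vec!["b", "d"] },      // e.g. FTS5 lane
    ];
    for (id, score) in rrf_fuse(&lanes, 60.0) {
        println!("{id}: {score:.5}");
    }
}
```

Because only ranks enter the formula, lanes with incompatible score scales (BM25, cosine distance, reranker logits) can be merged without calibration. Per-intent weights such as `LaneWeights` would then only change the `weight` field.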
@@ -52,14 +52,13 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr

## Dependencies to be aware of

-- `candle-core` (0.9) — HuggingFace pure Rust ML framework. GGUF model loading, tensor ops. `metal` feature for macOS GPU acceleration
-- `candle-nn` (0.9) — neural network building blocks (RmsNorm, rotary embeddings, etc.)
-- `candle-transformers` (0.9) — pre-built transformer model architectures. Used: `quantized_qwen3` for orchestrator + reranker
+- `llama-cpp-2` (0.1) — Rust bindings to llama.cpp. GGUF model loading + inference. Metal GPU on macOS, CUDA on Linux. Compiles llama.cpp C++ via build script (requires CMake)
+- `shimmytok` (0.7) — pure Rust tokenizer that reads from GGUF metadata. Fallback when tokenizer.json is unavailable (gated HuggingFace repos)
+- `tokenizers` (0.22) — HuggingFace tokenizer. Kept for FlexTokenizer HuggingFace backend
- `sqlite-vec` (0.1.8-alpha.1) — SQLite extension for vector search. Provides vec0 virtual tables with KNN via `vec_distance_cosine()`
- `zerocopy` (0.7) — zero-copy serialization for vector data passed to sqlite-vec
- `strsim` (0.11) — string similarity for fuzzy tag matching and fuzzy link matching
- `time` (0.3) — date/time handling for frontmatter timestamps
-- `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature. Used for all three GGUF models
- `ignore` (0.4) — vault walking with `.gitignore` support
- `rusqlite` (0.32) — bundled SQLite with FTS5 support
- `rmcp` (1.2) — MCP server SDK for stdio transport
@@ -68,12 +67,13 @@ Single vault only. Re-indexing a different vault path triggers a confirmation pr

## Testing

-- Unit tests in each module (`cargo test --lib`) — 271 tests, no network required
+- Unit tests in each module (`cargo test --lib`) — 270 tests, no network required
- Integration tests (`cargo test --test integration -- --ignored`) — require GGUF model download
+- Build requires CMake (for llama.cpp C++ compilation)

## CI/CD

-- CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu
+- CI: `cargo fmt --check` + `cargo clippy -- -D warnings` + `cargo test --lib` on macOS + Ubuntu. Ubuntu step installs CMake.
- Release: native builds on macOS arm64 (macos-14) + Linux x86_64 (ubuntu-latest). Triggered by `v*` tags
- Homebrew: `devwhodevs/homebrew-tap` — formula builds from source tarball
