diff --git a/CLAUDE.md b/CLAUDE.md index 57a493c..1c967b1 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,7 +4,7 @@ Local hybrid search CLI for Obsidian vaults. Rust, MIT licensed. ## Architecture -Single binary with 14 modules behind a lib crate: +Single binary with 19 modules behind a lib crate: - `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()` - `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap @@ -14,46 +14,56 @@ Single binary with 14 modules behind a lib crate: - `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search - `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 + graph results. Supports lane weighting, `--explain` output with per-lane detail - `context.rs` — context engine. Six functions: `read` (full note content + metadata), `list` (filtered note listing), `vault_map` (structure overview), `who` (person context bundle), `project` (project context bundle), `context_topic` (rich topic context with budget trimming). Pure functions taking `ContextParams` — no model loading except `context_topic` which reuses `search_internal` -- `serve.rs` — MCP stdio server via rmcp SDK. Exposes 7 read-only tools (search, read, list, vault_map, who, project, context). EngraphServer struct with Arc+Mutex wrapping for async handlers. Loads all resources at startup. +- `vecstore.rs` — sqlite-vec virtual table integration. Manages the `vec_chunks` vec0 table for vector storage and KNN search. Handles insert, delete, and search operations against the virtual table +- `tags.rs` — tag registry module. Maintains a `tag_registry` table tracking known tags with source attribution. 
Supports fuzzy matching for tag suggestions during note creation +- `links.rs` — link discovery module. Scans note content for potential wikilink targets using fuzzy basename matching and heading detection. Suggests links that could be added to improve vault connectivity +- `placement.rs` — folder placement engine. Uses folder centroids (average embeddings per folder) to suggest the best folder for new notes. Falls back to inbox when confidence is low +- `writer.rs` — write pipeline orchestrator. 5-step pipeline: resolve tags (fuzzy match + register new), discover links, place in folder, atomic file write (temp + rename), and index update. Supports create, append, update_metadata, and move_note operations with mtime-based conflict detection and crash recovery via temp file cleanup +- `serve.rs` — MCP stdio server via rmcp SDK. Exposes 11 tools: 7 read (search, read, list, vault_map, who, project, context) + 4 write (create, append, update_metadata, move_note). EngraphServer struct with Arc+Mutex wrapping for async handlers. Loads all resources at startup - `graph.rs` — vault graph agent. Extracts wikilink targets, expands search results by following graph connections 1-2 hops. Relevance filtering via FTS5 term check and shared tags - `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml` -- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5), `edges` (vault graph), `tombstones`. Handles incremental diffing via content hashes -- `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. 
The index is rebuilt from vectors stored in SQLite on every index run -- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + HNSW + FTS5, and vault graph edge building (wikilinks + people detection) +- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5), `edges` (vault graph), `tombstones`, `tag_registry`, `folder_centroids`. `vec_chunks` virtual table (sqlite-vec) for KNN search. Handles incremental diffing via content hashes +- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + sqlite-vec + FTS5, vault graph edge building (wikilinks + people detection), and folder centroid computation +- `search.rs` — hybrid search orchestrator. Runs semantic (sqlite-vec KNN), keyword (FTS5 BM25), and graph expansion lanes, then fuses via RRF -`main.rs` is a thin clap CLI (async via `#[tokio::main]`). Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats), `context` (read/list/vault-map/who/project/topic), `serve` (MCP stdio server). +`main.rs` is a thin clap CLI (async via `#[tokio::main]`). Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats), `context` (read/list/vault-map/who/project/topic), `write` (create/append/update-metadata/move), `serve` (MCP stdio server). ## Key patterns -- **3-lane hybrid search:** Queries run through three lanes — semantic (HNSW embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). 
Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8) +- **3-lane hybrid search:** Queries run through three lanes — semantic (sqlite-vec KNN embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8) - **Vault graph:** `edges` table stores bidirectional wikilink edges and mention edges. Built during indexing after all files are written. People detection scans for person name/alias mentions using notes from the configured People folder -- **Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0.8× for 1-hop, 0.5× for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score +- **Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0.8x for 1-hop, 0.5x for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score - **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Code fence protection prevents splitting inside code blocks -- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks and edges deleted, then are re-processed. FTS5 entries cleaned up alongside vector entries -- **HNSW rebuild on every run:** Vectors stored as BLOBs. Full HNSW index rebuilt from `store.get_all_vectors()` after SQLite update (hnsw_rs limitation) +- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks, vectors, and edges deleted, then are re-processed. 
FTS5 and sqlite-vec entries cleaned up alongside store entries +- **sqlite-vec for vector search:** Vectors stored in a `vec_chunks` virtual table (vec0). KNN search via `vec_distance_cosine()`. Real deletes — no tombstone filtering needed during search +- **Write pipeline:** 5-step process for creating/modifying notes: (1) resolve tags via fuzzy matching against tag registry, (2) discover potential wikilinks via basename matching, (3) suggest folder placement via centroid similarity, (4) atomic file write (temp + rename for crash safety), (5) immediate index update (embed + insert into sqlite-vec + FTS5 + edges) - **Docids:** Each file gets a deterministic 6-char hex ID. Displayed in search results - **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml` - **Pluggable models:** `ModelBackend` trait enables future model swapping ## Data directory -`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + edges), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config). +`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + sqlite-vec + edges), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config). Single vault only. Re-indexing a different vault path triggers a confirmation prompt. ## Dependencies to be aware of - `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. Does not provide prebuilt binaries for all targets -- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` in `load()`. Read-only after load +- `sqlite-vec` (0.1.8-alpha.1) — SQLite extension for vector search. 
Provides vec0 virtual tables with KNN via `vec_distance_cosine()` +- `zerocopy` (0.7) — zero-copy serialization for vector data passed to sqlite-vec +- `strsim` (0.11) — string similarity for fuzzy tag matching in the write pipeline +- `time` (0.3) — date/time handling for frontmatter timestamps - `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature - `ignore` (0.4) — vault walking with `.gitignore` support - `rusqlite` (0.32) — bundled SQLite with FTS5 support +- `rmcp` (1.2) — MCP server SDK for stdio transport ## Testing -- Unit tests in each module (`cargo test --lib`) — 146 tests, no network required +- Unit tests in each module (`cargo test --lib`) — 190 tests, no network required - 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding -- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download +- Integration tests (`cargo test --test integration -- --ignored`) — require model download ## CI/CD diff --git a/Cargo.lock b/Cargo.lock index 60f0716..419059f 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -19,7 +19,7 @@ dependencies = [ "once_cell", "serde", "version_check", - "zerocopy", + "zerocopy 0.8.42", ] [[package]] @@ -31,12 +31,6 @@ dependencies = [ "memchr", ] -[[package]] -name = "allocator-api2" -version = "0.2.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "683d7910e743518b0e34f1186f92494becacb047c7b6bf616c96772180fef923" - [[package]] name = "android_system_properties" version = "0.1.5" @@ -46,38 +40,6 @@ dependencies = [ "libc", ] -[[package]] -name = "anndists" -version = "0.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8238f99889a837cd6641360f9f3ead18f70b07bf6ce1f04a319bc6bd8a2f48f1" -dependencies = [ - "anyhow", - "cfg-if", - "cpu-time", - "env_logger", - "lazy_static", - "log", - "num-traits", - "num_cpus", - "rayon", -] - -[[package]] -name = "anstream" -version = "0.6.21" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "43d5b281e737544384e969a5ccad3f1cdd24b48086a0fc1b2a5262a26b8f4f4a" -dependencies = [ - "anstyle", - "anstyle-parse 0.2.7", - "anstyle-query", - "anstyle-wincon", - "colorchoice", - "is_terminal_polyfill", - "utf8parse", -] - [[package]] name = "anstream" version = "1.0.0" @@ -85,7 +47,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "824a212faf96e9acacdbd09febd34438f8f711fb84e09a8916013cd7815ca28d" dependencies = [ "anstyle", - "anstyle-parse 1.0.0", + "anstyle-parse", "anstyle-query", "anstyle-wincon", "colorchoice", @@ -99,15 +61,6 @@ version = "1.0.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "940b3a0ca603d1eade50a4846a2afffd5ef57a9feac2c0e2ec2e14f9ead76000" -[[package]] -name = "anstyle-parse" -version = "0.2.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4e7644824f0aa2c7b9384579234ef10eb7efb6a0deb83f9630a49594dd9c15c2" -dependencies = [ - "utf8parse", -] - [[package]] name = "anstyle-parse" version = "1.0.0" @@ -178,15 +131,6 @@ version = "1.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2af50177e190e07a26ab74f8b1efbfe2ef87da2116221318cb1c2e82baf7de06" -[[package]] -name = "bincode" -version = "1.3.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b1f45e9417d87227c7a56d22e471c6206462cba514c7590c09aff4cf6d1ddcad" -dependencies = [ - "serde", -] - [[package]] name = "bit-set" version = "0.8.0" @@ -270,12 +214,6 @@ version = "1.0.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" -[[package]] -name = "cfg_aliases" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724" - [[package]] name = "chrono" version = "0.4.44" @@ -306,7 
+244,7 @@ version = "4.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "714a53001bf66416adb0e2ef5ac857140e7dc3a0c48fb28b2f10762fc4b5069f" dependencies = [ - "anstream 1.0.0", + "anstream", "anstyle", "clap_lex", "strsim", @@ -336,16 +274,6 @@ version = "1.0.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1d07550c9036bf2ae0c684c4297d503f838287c83c53686d05370d0e139ae570" -[[package]] -name = "combine" -version = "4.6.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba5a308b75df32fe02788e748662718f03fde005016435c444eea572398219fd" -dependencies = [ - "bytes", - "memchr", -] - [[package]] name = "compact_str" version = "0.9.0" @@ -390,16 +318,6 @@ version = "0.8.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b" -[[package]] -name = "cpu-time" -version = "1.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e9e393a7668fe1fad3075085b86c781883000b4ede868f43627b34a87c8b7ded" -dependencies = [ - "libc", - "winapi", -] - [[package]] name = "cpufeatures" version = "0.2.17" @@ -541,6 +459,15 @@ dependencies = [ "zeroize", ] +[[package]] +name = "deranged" +version = "0.5.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7cd812cc2bc1d69d4764bd80df88b4317eaef9e773c75226407d9bc0876b211c" +dependencies = [ + "powerfmt", +] + [[package]] name = "derive_builder" version = "0.20.2" @@ -634,12 +561,11 @@ checksum = "34aa73646ffb006b8f5147f3dc182bd4bcb190227ce861fc4a4844bf8e3cb2c0" [[package]] name = "engraph" -version = "0.4.0" +version = "0.6.0" dependencies = [ "anyhow", "clap", "dirs", - "hnsw_rs", "ignore", "indicatif", "ndarray", @@ -650,48 +576,17 @@ dependencies = [ "serde", "serde_json", "sha2", + "sqlite-vec", + "strsim", "tempfile", + "time", "tokenizers", "tokio", "toml", "tracing", "tracing-subscriber", "ureq 2.12.1", 
-] - -[[package]] -name = "enum-as-inner" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a1e6a265c649f3f5979b601d26f1d05ada116434c87741c9493cb56218f76cbc" -dependencies = [ - "heck", - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "env_filter" -version = "1.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a1c3cc8e57274ec99de65301228b537f1e4eedc1b8e0f9411c6caac8ae7308f" -dependencies = [ - "log", - "regex", -] - -[[package]] -name = "env_logger" -version = "0.11.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b2daee4ea451f429a58296525ddf28b45a3b64f1acf6587e2067437bb11e218d" -dependencies = [ - "anstream 0.6.21", - "anstyle", - "env_filter", - "jiff", - "log", + "zerocopy 0.7.35", ] [[package]] @@ -959,8 +854,6 @@ version = "0.15.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1" dependencies = [ - "allocator-api2", - "equivalent", "foldhash", ] @@ -985,43 +878,12 @@ version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" -[[package]] -name = "hermit-abi" -version = "0.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c" - [[package]] name = "hmac-sha256" version = "1.1.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ec9d92d097f4749b64e8cc33d924d9f40a2d4eb91402b458014b781f5733d60f" -[[package]] -name = "hnsw_rs" -version = "0.3.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "43a5258f079b97bf2e8311ff9579e903c899dcbac0d9a138d62e9a066778bd07" -dependencies = [ - "anndists", - "anyhow", - "bincode", - "cfg-if", - "cpu-time", - "env_logger", - "hashbrown 0.15.5", - "indexmap", - 
"lazy_static", - "log", - "mmap-rs", - "num-traits", - "num_cpus", - "parking_lot", - "rand", - "rayon", - "serde", -] - [[package]] name = "http" version = "1.4.0" @@ -1238,30 +1100,6 @@ version = "1.0.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "92ecc6618181def0457392ccd0ee51198e065e016d1d527a7ac1b6dc7c1f09d2" -[[package]] -name = "jiff" -version = "0.2.23" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1a3546dc96b6d42c5f24902af9e2538e82e39ad350b0c766eb3fbf2d8f3d8359" -dependencies = [ - "jiff-static", - "log", - "portable-atomic", - "portable-atomic-util", - "serde_core", -] - -[[package]] -name = "jiff-static" -version = "0.2.23" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a8c8b344124222efd714b73bb41f8b5120b27a7cc1c75593a6ff768d9d05aa4" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "js-sys" version = "0.3.91" @@ -1322,15 +1160,6 @@ version = "0.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6373607a59f0be73a39b6fe456b8192fcc3585f602af20751600e974dd455e77" -[[package]] -name = "lock_api" -version = "0.4.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "224399e74b87b5f3557511d98dff8b14089b3dadafcab6bb93eab67d3aace965" -dependencies = [ - "scopeguard", -] - [[package]] name = "log" version = "0.4.29" @@ -1343,15 +1172,6 @@ version = "0.15.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1670343e58806300d87950e3401e820b519b9384281bbabfb15e3636689ffd69" -[[package]] -name = "mach2" -version = "0.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d640282b302c0bb0a2a8e0233ead9035e3bed871f0b7e81fe4a1ec829765db44" -dependencies = [ - "libc", -] - [[package]] name = "macro_rules_attribute" version = "0.2.2" @@ -1409,23 +1229,6 @@ dependencies = [ "simd-adler32", ] -[[package]] -name = "mmap-rs" -version = 
"0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4ecce9d566cb9234ae3db9e249c8b55665feaaf32b0859ff1e27e310d2beb3d8" -dependencies = [ - "bitflags", - "combine", - "libc", - "mach2", - "nix", - "sysctl", - "thiserror 2.0.18", - "widestring", - "windows", -] - [[package]] name = "monostate" version = "0.1.18" @@ -1480,18 +1283,6 @@ dependencies = [ "rawpointer", ] -[[package]] -name = "nix" -version = "0.30.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "74523f3a35e05aba87a1d978330aef40f67b0304ac79c1c00b294c9830543db6" -dependencies = [ - "bitflags", - "cfg-if", - "cfg_aliases", - "libc", -] - [[package]] name = "nom" version = "7.1.3" @@ -1520,6 +1311,12 @@ dependencies = [ "num-traits", ] +[[package]] +name = "num-conv" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c6673768db2d862beb9b39a78fdcb1a69439615d5794a1be50caa9bc92c81967" + [[package]] name = "num-integer" version = "0.1.46" @@ -1538,16 +1335,6 @@ dependencies = [ "autocfg", ] -[[package]] -name = "num_cpus" -version = "1.17.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "91df4bbde75afed763b708b7eee1e8e7651e02d97f6d5dd763e89367e957b23b" -dependencies = [ - "hermit-abi", - "libc", -] - [[package]] name = "number_prefix" version = "0.4.0" @@ -1640,29 +1427,6 @@ dependencies = [ "ureq 3.2.0", ] -[[package]] -name = "parking_lot" -version = "0.12.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "93857453250e3077bd71ff98b6a65ea6621a19bb0f559a85248955ac12c45a1a" -dependencies = [ - "lock_api", - "parking_lot_core", -] - -[[package]] -name = "parking_lot_core" -version = "0.9.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2621685985a2ebf1c516881c026032ac7deafcda1a2c9b7850dc81e3dfcb64c1" -dependencies = [ - "cfg-if", - "libc", - "redox_syscall", - "smallvec", - "windows-link", -] - [[package]] 
name = "paste" version = "1.0.15" @@ -1726,13 +1490,19 @@ dependencies = [ "zerovec", ] +[[package]] +name = "powerfmt" +version = "0.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "439ee305def115ba05938db6eb1644ff94165c5ab5e9420d1c1bcedbba909391" + [[package]] name = "ppv-lite86" version = "0.2.21" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" dependencies = [ - "zerocopy", + "zerocopy 0.8.42", ] [[package]] @@ -1841,15 +1611,6 @@ dependencies = [ "crossbeam-utils", ] -[[package]] -name = "redox_syscall" -version = "0.5.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed2bf2547551a7053d6fdfafda3f938979645c44812fbfcda098faae3f1a362d" -dependencies = [ - "bitflags", -] - [[package]] name = "redox_users" version = "0.4.6" @@ -2077,12 +1838,6 @@ dependencies = [ "syn", ] -[[package]] -name = "scopeguard" -version = "1.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "94143f37725109f92c262ed2cf5e59bce7498c01bcc1502d7b9afe439a4e9f49" - [[package]] name = "security-framework" version = "3.7.0" @@ -2242,6 +1997,15 @@ dependencies = [ "unicode-segmentation", ] +[[package]] +name = "sqlite-vec" +version = "0.1.8-alpha.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4179f99d8cf20a813b52b0746e3d00e26d2a1a6ea2ff8ed228b171afd5ce9a6f" +dependencies = [ + "cc", +] + [[package]] name = "stable_deref_trait" version = "1.2.1" @@ -2288,20 +2052,6 @@ dependencies = [ "syn", ] -[[package]] -name = "sysctl" -version = "0.6.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "01198a2debb237c62b6826ec7081082d951f46dbb64b0e8c7649a452230d1dfc" -dependencies = [ - "bitflags", - "byteorder", - "enum-as-inner", - "libc", - "thiserror 1.0.69", - "walkdir", -] - [[package]] name = "tempfile" version = "3.27.0" @@ -2364,6 +2114,25 @@ 
dependencies = [ "cfg-if", ] +[[package]] +name = "time" +version = "0.3.47" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "743bd48c283afc0388f9b8827b976905fb217ad9e647fae3a379a9283c4def2c" +dependencies = [ + "deranged", + "num-conv", + "powerfmt", + "serde_core", + "time-core", +] + +[[package]] +name = "time-core" +version = "0.1.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7694e1cfe791f8d31026952abf09c69ca6f6fa4e1a1229e18988f06a04a12dca" + [[package]] name = "tinystr" version = "0.8.2" @@ -2839,12 +2608,6 @@ dependencies = [ "rustls-pki-types", ] -[[package]] -name = "widestring" -version = "1.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "72069c3113ab32ab29e5584db3c6ec55d416895e60715417b5b883a357c3e471" - [[package]] name = "winapi" version = "0.3.9" @@ -2876,15 +2639,6 @@ version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" -[[package]] -name = "windows" -version = "0.48.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e686886bc078bc1b0b600cac0147aadb815089b6e4da64016cbd754b6342700f" -dependencies = [ - "windows-targets 0.48.5", -] - [[package]] name = "windows-core" version = "0.62.2" @@ -3227,13 +2981,34 @@ dependencies = [ "synstructure", ] +[[package]] +name = "zerocopy" +version = "0.7.35" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1b9b4fd18abc82b8136838da5d50bae7bdea537c574d8dc1a34ed098d6c166f0" +dependencies = [ + "byteorder", + "zerocopy-derive 0.7.35", +] + [[package]] name = "zerocopy" version = "0.8.42" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f2578b716f8a7a858b7f02d5bd870c14bf4ddbbcf3a4c05414ba6503640505e3" dependencies = [ - "zerocopy-derive", + "zerocopy-derive 0.8.42", +] + +[[package]] +name = "zerocopy-derive" +version = "0.7.35" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fa4f8080344d4671fb4e831a13ad1e68092748387dfc4f55e356242fae12ce3e" +dependencies = [ + "proc-macro2", + "quote", + "syn", ] [[package]] diff --git a/Cargo.toml b/Cargo.toml index 90defef..19b9f97 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "engraph" -version = "0.5.0" +version = "0.6.0" edition = "2024" description = "Local semantic search for Obsidian vaults" license = "MIT" @@ -21,8 +21,11 @@ sha2 = "0.10" ureq = "2.12" indicatif = "0.17" ndarray = "0.17" -hnsw_rs = "0.3" +sqlite-vec = "0.1.8-alpha.1" +zerocopy = { version = "0.7", features = ["derive"] } rayon = "1" +time = "0.3" +strsim = "0.11" ignore = "0.4" rmcp = { version = "1.2", features = ["transport-io"] } tokio = { version = "1", features = ["macros", "rt-multi-thread"] } diff --git a/src/context.rs b/src/context.rs index 23d0446..076caa1 100644 --- a/src/context.rs +++ b/src/context.rs @@ -626,16 +626,15 @@ pub fn context_topic_from_results( }) } -/// Full context topic function (requires embedder + HNSW). +/// Full context topic function (requires embedder + sqlite-vec store). /// Called from CLI handler which provides the heavy resources. pub fn context_topic_with_search( params: &ContextParams, topic: &str, max_chars: usize, embedder: &mut crate::embedder::Embedder, - index: &crate::hnsw::HnswIndex, ) -> Result { - let search_output = crate::search::search_internal(topic, 5, params.store, embedder, index)?; + let search_output = crate::search::search_internal(topic, 5, params.store, embedder)?; context_topic_from_results(params, topic, &search_output.results, max_chars) } diff --git a/src/fusion.rs b/src/fusion.rs index 39b86e4..67d1810 100644 --- a/src/fusion.rs +++ b/src/fusion.rs @@ -1,6 +1,6 @@ /// Reciprocal Rank Fusion (RRF) engine. /// -/// Merges ranked results from multiple search lanes (e.g. semantic HNSW +/// Merges ranked results from multiple search lanes (e.g. 
semantic vector /// and FTS5 keyword search) into a single ranked list using the RRF formula: /// /// rrf_score = sum( weight_i / (k + rank_i) ) diff --git a/src/hnsw.rs b/src/hnsw.rs deleted file mode 100644 index d861e8c..0000000 --- a/src/hnsw.rs +++ /dev/null @@ -1,206 +0,0 @@ -use std::collections::HashSet; -use std::path::Path; - -use anyhow::{Context, Result}; -use hnsw_rs::anndists::dist::distances::DistCosine; -use hnsw_rs::api::AnnT; -use hnsw_rs::hnsw::Hnsw; -use hnsw_rs::hnswio::HnswIo; - -const EMBEDDING_DIM: usize = 384; -const MAX_NB_CONNECTION: usize = 16; -const MAX_LAYER: usize = 16; -const EF_CONSTRUCTION: usize = 200; -const FILE_BASENAME: &str = "engraph"; - -/// Wrapper around the HNSW index for vector similarity search. -pub struct HnswIndex { - inner: Hnsw<'static, f32, DistCosine>, - next_id: u64, -} - -impl HnswIndex { - /// Create a new empty HNSW index. - pub fn new(max_elements: usize) -> Self { - let inner = Hnsw::new( - MAX_NB_CONNECTION, - max_elements, - MAX_LAYER, - EF_CONSTRUCTION, - DistCosine, - ); - Self { inner, next_id: 0 } - } - - /// Load an HNSW index from the given directory. - /// - /// The `HnswIo` is leaked to satisfy the `'static` lifetime on the inner `Hnsw`. - /// This is acceptable because we only load an index once for the lifetime of the process. - pub fn load(dir: &Path) -> Result<Self> { - let hnsw_io = Box::new(HnswIo::new(dir, FILE_BASENAME)); - let hnsw_io: &'static mut HnswIo = Box::leak(hnsw_io); - let inner: Hnsw<'static, f32, DistCosine> = hnsw_io - .load_hnsw() - .context("failed to load HNSW index from disk")?; - - let nb_point = inner.get_nb_point(); - let next_id = nb_point as u64; - - Ok(Self { inner, next_id }) - } - - /// Insert a single vector and return its assigned vector ID. 
- pub fn insert(&mut self, vector: &[f32]) -> u64 { - assert_eq!( - vector.len(), - EMBEDDING_DIM, - "vector dimension mismatch: expected {EMBEDDING_DIM}, got {}", - vector.len() - ); - let id = self.next_id; - self.inner.insert((vector, id as usize)); - self.next_id += 1; - id - } - - /// Insert a vector with a specific ID (used when rebuilding from stored vectors). - pub fn insert_with_id(&mut self, vector: &[f32], id: u64) { - assert_eq!( - vector.len(), - EMBEDDING_DIM, - "vector dimension mismatch: expected {EMBEDDING_DIM}, got {}", - vector.len() - ); - self.inner.insert((vector, id as usize)); - if id >= self.next_id { - self.next_id = id + 1; - } - } - - /// Insert a batch of vectors and return their assigned vector IDs. - pub fn insert_batch(&mut self, vectors: &[Vec<f32>]) -> Vec<u64> { - vectors.iter().map(|v| self.insert(v)).collect() - } - - /// Search for the k nearest neighbors of the query vector. - /// - /// Returns `(vector_id, score)` pairs sorted by ascending distance, - /// excluding any IDs in `tombstones`. Requests `k * 2` results from - /// the underlying index for tombstone headroom. - pub fn search(&self, query: &[f32], k: usize, tombstones: &HashSet<u64>) -> Vec<(u64, f32)> { - if self.inner.get_nb_point() == 0 { - return Vec::new(); - } - - let ef_search = (k * 2).max(EF_CONSTRUCTION); - let neighbours = self.inner.search(query, k * 2, ef_search); - - let mut results: Vec<(u64, f32)> = neighbours - .into_iter() - .filter(|n| !tombstones.contains(&(n.d_id as u64))) - .map(|n| (n.d_id as u64, n.distance)) - .collect(); - - results.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap()); - results.truncate(k); - results - } - - /// Save the index to the given directory. 
- pub fn save(&self, dir: &Path) -> Result<()> { - std::fs::create_dir_all(dir).context("failed to create HNSW save directory")?; - self.inner - .file_dump(dir, FILE_BASENAME) - .context("failed to dump HNSW index to disk")?; - Ok(()) - } -} - -#[cfg(test)] -mod tests { - use super::*; - use tempfile::TempDir; - - fn random_vector(seed: u64) -> Vec<f32> { - // Simple deterministic pseudo-random using a linear congruential generator. - let mut state = seed.wrapping_mul(6364136223846793005).wrapping_add(1); - (0..EMBEDDING_DIM) - .map(|_| { - state = state.wrapping_mul(6364136223846793005).wrapping_add(1); - // Normalize to [-1, 1] - ((state >> 33) as f32) / (u32::MAX as f32) * 2.0 - 1.0 - }) - .collect() - } - - #[test] - fn test_insert_and_search() { - let mut index = HnswIndex::new(100); - let vectors: Vec<Vec<f32>> = (0..10).map(|i| random_vector(i)).collect(); - let ids = index.insert_batch(&vectors); - assert_eq!(ids.len(), 10); - - // Search for the first vector — it should be the top result. - let results = index.search(&vectors[0], 5, &HashSet::new()); - assert!(!results.is_empty(), "search returned no results"); - assert_eq!( - results[0].0, ids[0], - "expected the query vector itself to be the top result" - ); - // Distance to itself should be ~0 for cosine. - assert!( - results[0].1 < 0.01, - "distance to self should be near zero, got {}", - results[0].1 - ); - } - - #[test] - fn test_search_with_tombstones() { - let mut index = HnswIndex::new(100); - let vectors: Vec<Vec<f32>> = (0..5).map(|i| random_vector(i + 100)).collect(); - let ids = index.insert_batch(&vectors); - - // Tombstone the first vector. 
- let mut tombstones = HashSet::new(); - tombstones.insert(ids[0]); - - let results = index.search(&vectors[0], 5, &tombstones); - for (id, _score) in &results { - assert_ne!(*id, ids[0], "tombstoned ID should not appear in results"); - } - } - - #[test] - fn test_save_and_load() { - let tmpdir = TempDir::new().unwrap(); - let vectors: Vec> = (0..10).map(|i| random_vector(i + 200)).collect(); - - // Build and save. - { - let mut index = HnswIndex::new(100); - index.insert_batch(&vectors); - index.save(tmpdir.path()).unwrap(); - } - - // Load and search. - let index = HnswIndex::load(tmpdir.path()).unwrap(); - let results = index.search(&vectors[0], 3, &HashSet::new()); - assert!( - !results.is_empty(), - "search after reload returned no results" - ); - assert_eq!( - results[0].0, 0, - "expected vector 0 to be the top result after reload" - ); - } - - #[test] - fn test_empty_index_search() { - let index = HnswIndex::new(100); - let query = random_vector(999); - let results = index.search(&query, 5, &HashSet::new()); - assert!(results.is_empty(), "empty index should return no results"); - } -} diff --git a/src/indexer.rs b/src/indexer.rs index 3d96649..830105f 100644 --- a/src/indexer.rs +++ b/src/indexer.rs @@ -13,7 +13,6 @@ use crate::config::Config; use crate::docid::generate_docid; use crate::embedder::Embedder; use crate::graph::extract_wikilink_targets; -use crate::hnsw::HnswIndex; use crate::store::{FileRecord, Store}; /// Summary of an indexing run. @@ -201,7 +200,7 @@ pub fn load_people_entities( } /// Extract aliases from YAML frontmatter. -fn extract_aliases_from_frontmatter(content: &str) -> Option> { +pub fn extract_aliases_from_frontmatter(content: &str) -> Option> { let trimmed = content.trim_start(); if !trimmed.starts_with("---") { return None; @@ -265,22 +264,42 @@ pub fn build_people_edges( /// Main indexing orchestrator. 
/// /// Walks the vault, diffs against the store, processes new/changed/deleted files, -/// embeds chunks in parallel, and writes everything to the store and HNSW index. +/// embeds chunks in parallel, and writes everything to the store. pub fn run_index(vault_path: &Path, config: &Config, rebuild: bool) -> Result { let start = Instant::now(); let data_dir = Config::data_dir()?; std::fs::create_dir_all(&data_dir)?; + let cleaned = crate::writer::cleanup_temp_files(vault_path)?; + if cleaned > 0 { + info!(cleaned, "cleaned up incomplete writes from previous run"); + } + let db_path = data_dir.join("engraph.db"); let store = Store::open(&db_path)?; - let hnsw_dir = data_dir.join("hnsw"); + let orphans = crate::writer::verify_index_integrity(&store, vault_path)?; + if orphans > 0 { + info!(orphans, "cleaned up orphan DB entries for missing files"); + } + + // Build exclude list: config excludes + archive folder (if detected) + let mut exclude = config.exclude.clone(); + if let Ok(Some(profile)) = crate::config::Config::load_vault_profile() + && let Some(archive) = &profile.structure.folders.archive + { + let archive_pattern = format!("{}/", archive); + if !exclude.contains(&archive_pattern) { + exclude.push(archive_pattern); + } + } // If rebuild, treat everything as new. - let files = walk_vault(vault_path, &config.exclude)?; + let files = walk_vault(vault_path, &exclude)?; let (new_files, changed_files, deleted_files) = if rebuild { // On rebuild we skip diffing — all files are "new". + store.clear_vec()?; (files.clone(), Vec::new(), Vec::new()) } else { let (n, c, d) = diff_vault(&files, vault_path, &store)?; @@ -294,25 +313,25 @@ pub fn run_index(vault_path: &Path, config: &Config, rebuild: bool) -> Result = new_files.clone(); for file_path in &changed_files { let rel = file_path.strip_prefix(vault_path).unwrap_or(file_path); let rel_str = rel.to_string_lossy().to_string(); if let Some(record) = store.get_file(&rel_str)? 
{ let vector_ids = store.get_vector_ids_for_file(record.id)?; - if !vector_ids.is_empty() { - store.add_tombstones(&vector_ids)?; + for &vid in &vector_ids { + store.delete_vec(vid)?; } store.delete_fts_chunks_for_file(record.id)?; store.delete_file(record.id)?; @@ -424,15 +443,7 @@ pub fn run_index(vault_path: &Path, config: &Config, rebuild: bool) -> Result Result Result Result>> = HashMap::new(); + for result in &results { + let folder = result + .rel_path + .split('/') + .next() + .unwrap_or("(root)") + .to_string(); + for (_heading, _snippet, vector, _token_count) in &result.chunks { + folder_vecs.entry(folder.clone()).or_default().push(vector); + } } - info!( - vectors = all_vectors.len(), - "rebuilt HNSW index from stored vectors" - ); - - // Step 12: Save HNSW index to disk. - hnsw.save(&hnsw_dir)?; + for (folder, vectors) in &folder_vecs { + if vectors.is_empty() { + continue; + } + let dim = 384; + let mut centroid = vec![0.0f32; dim]; + for v in vectors { + for (i, val) in v.iter().enumerate() { + centroid[i] += val; + } + } + let n = vectors.len() as f32; + for val in &mut centroid { + *val /= n; + } + store.upsert_folder_centroid(folder, &centroid, vectors.len())?; + } let duration = start.elapsed(); info!( diff --git a/src/lib.rs b/src/lib.rs index d48802a..611dd05 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -6,10 +6,14 @@ pub mod embedder; pub mod fts; pub mod fusion; pub mod graph; -pub mod hnsw; pub mod indexer; +pub mod links; pub mod model; +pub mod placement; pub mod profile; pub mod search; pub mod serve; pub mod store; +pub mod tags; +pub mod vecstore; +pub mod writer; diff --git a/src/links.rs b/src/links.rs new file mode 100644 index 0000000..fcc0683 --- /dev/null +++ b/src/links.rs @@ -0,0 +1,406 @@ +use anyhow::Result; +use std::path::Path; + +use crate::indexer::extract_aliases_from_frontmatter; +use crate::store::Store; + +/// A potential wikilink discovered in note content. 
+#[derive(Debug, Clone, PartialEq)] +pub struct DiscoveredLink { + pub matched_text: String, + pub target_path: String, + pub display: Option, + pub match_type: LinkMatchType, +} + +/// How a link target was matched. +#[derive(Debug, Clone, PartialEq)] +pub enum LinkMatchType { + /// Matched note filename (basename without .md) + ExactName, + /// Matched an alias from frontmatter + Alias, +} + +/// An entry in the name-to-path lookup table. +#[derive(Debug, Clone)] +pub(crate) struct NameEntry { + name: String, + name_lower: String, + path: String, + match_type: LinkMatchType, +} + +/// Build a lookup table of (name, path, match_type) from all indexed files. +/// +/// For each file: extract basename (without .md) as ExactName (if len >= 3), +/// then read the file from disk to extract aliases (each len >= 2) as Alias entries. +/// Results are sorted by name length descending so longer names match first. +pub(crate) fn build_name_index(store: &Store, vault_path: &Path) -> Result> { + let all_files = store.get_all_files()?; + let mut entries = Vec::new(); + + for file in &all_files { + // Extract basename without .md + let basename = file + .path + .rsplit('/') + .next() + .unwrap_or(&file.path) + .trim_end_matches(".md"); + + if basename.len() >= 3 { + entries.push(NameEntry { + name: basename.to_string(), + name_lower: basename.to_lowercase(), + path: file.path.clone(), + match_type: LinkMatchType::ExactName, + }); + } + + // Read file from disk to extract aliases + let full_path = vault_path.join(&file.path); + if let Ok(content) = std::fs::read_to_string(&full_path) + && let Some(aliases) = extract_aliases_from_frontmatter(&content) + { + for alias in aliases { + if alias.len() >= 2 { + let alias_lower = alias.to_lowercase(); + entries.push(NameEntry { + name: alias, + name_lower: alias_lower, + path: file.path.clone(), + match_type: LinkMatchType::Alias, + }); + } + } + } + } + + // Sort by name length descending — match longer names first + entries.sort_by(|a, 
b| b.name.len().cmp(&a.name.len())); + Ok(entries) +} + +/// Find byte ranges of existing `[[...]]` wikilinks in content. +/// +/// Returns `(start, end)` pairs where start is the index of the first `[` +/// and end is one past the last `]`. +pub fn find_wikilink_regions(content: &str) -> Vec<(usize, usize)> { + let bytes = content.as_bytes(); + let mut regions = Vec::new(); + let mut i = 0; + + while i + 1 < bytes.len() { + if bytes[i] == b'[' && bytes[i + 1] == b'[' { + // Find the closing ]] + let start = i; + let mut j = i + 2; + while j + 1 < bytes.len() { + if bytes[j] == b']' && bytes[j + 1] == b']' { + regions.push((start, j + 2)); + i = j + 2; + break; + } + j += 1; + } + if j + 1 >= bytes.len() { + // No closing ]] found + i += 2; + } + } else { + i += 1; + } + } + + regions +} + +/// Check if a byte position falls inside any of the given regions. +fn inside_region(pos: usize, end: usize, regions: &[(usize, usize)]) -> bool { + regions.iter().any(|&(rs, re)| pos >= rs && end <= re) +} + +/// Check if a match position overlaps with any already-claimed range. +fn overlaps_claimed(pos: usize, end: usize, claimed: &[(usize, usize)]) -> bool { + claimed.iter().any(|&(cs, ce)| pos < ce && end > cs) +} + +/// Check word boundary at a byte position in content. +fn is_word_boundary(content: &[u8], pos: usize) -> bool { + if pos == 0 { + return true; + } + let ch = content[pos - 1]; + // Word boundary: previous char is not alphanumeric or underscore + !ch.is_ascii_alphanumeric() && ch != b'_' +} + +/// Check word boundary after a match ends. +fn is_word_boundary_after(content: &[u8], end: usize) -> bool { + if end >= content.len() { + return true; + } + let ch = content[end]; + !ch.is_ascii_alphanumeric() && ch != b'_' +} + +/// Discover potential wikilink targets in content by matching note names and aliases. 
+/// +/// Builds a name index from the store, then scans content for case-insensitive +/// matches that aren't inside existing wikilinks and don't overlap with longer +/// already-matched names. +pub fn discover_links( + store: &Store, + content: &str, + vault_path: &Path, +) -> Result> { + let name_index = build_name_index(store, vault_path)?; + let wikilink_regions = find_wikilink_regions(content); + let content_lower = content.to_lowercase(); + let content_bytes = content.as_bytes(); + + let mut links = Vec::new(); + let mut claimed: Vec<(usize, usize)> = Vec::new(); + + for entry in &name_index { + let needle = &entry.name_lower; + let mut search_from = 0; + + while let Some(rel_pos) = content_lower[search_from..].find(needle.as_str()) { + let pos = search_from + rel_pos; + let end = pos + needle.len(); + search_from = end; + + // Skip if inside existing wikilink + if inside_region(pos, end, &wikilink_regions) { + continue; + } + + // Skip if overlapping with an already-claimed (longer) match + if overlaps_claimed(pos, end, &claimed) { + continue; + } + + // Check word boundaries + if !is_word_boundary(content_bytes, pos) || !is_word_boundary_after(content_bytes, end) + { + continue; + } + + let matched_text = content[pos..end].to_string(); + + let display = match entry.match_type { + LinkMatchType::Alias => Some(matched_text.clone()), + LinkMatchType::ExactName => None, + }; + + links.push(DiscoveredLink { + matched_text, + target_path: entry.path.clone(), + display, + match_type: entry.match_type.clone(), + }); + + claimed.push((pos, end)); + } + } + + Ok(links) +} + +/// Apply discovered links to content, replacing matched text with `[[wikilinks]]`. +/// +/// For exact name matches: `[[TargetName]]` +/// For alias matches: `[[TargetName|DisplayText]]` +/// +/// Replacements are applied from end to start to preserve byte positions. 
+pub fn apply_links(content: &str, links: &[DiscoveredLink]) -> String { + if links.is_empty() { + return content.to_string(); + } + + let content_lower = content.to_lowercase(); + let content_bytes = content.as_bytes(); + let wikilink_regions = find_wikilink_regions(content); + + // Find the position of each link in the content + let mut replacements: Vec<(usize, usize, String)> = Vec::new(); + let mut claimed: Vec<(usize, usize)> = Vec::new(); + + for link in links { + let needle = link.matched_text.to_lowercase(); + let mut search_from = 0; + + while let Some(rel_pos) = content_lower[search_from..].find(needle.as_str()) { + let pos = search_from + rel_pos; + let end = pos + needle.len(); + search_from = end; + + if inside_region(pos, end, &wikilink_regions) { + continue; + } + if overlaps_claimed(pos, end, &claimed) { + continue; + } + if !is_word_boundary(content_bytes, pos) || !is_word_boundary_after(content_bytes, end) + { + continue; + } + + let target_name = link + .target_path + .rsplit('/') + .next() + .unwrap_or(&link.target_path) + .trim_end_matches(".md"); + + let replacement = match &link.display { + Some(display) => format!("[[{}|{}]]", target_name, display), + None => format!("[[{}]]", target_name), + }; + + replacements.push((pos, end, replacement)); + claimed.push((pos, end)); + break; // Only replace first occurrence per link + } + } + + // Sort by position descending so we can replace from end to start + replacements.sort_by(|a, b| b.0.cmp(&a.0)); + + let mut result = content.to_string(); + for (start, end, replacement) in replacements { + result.replace_range(start..end, &replacement); + } + + result +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::store::Store; + + fn setup_store_and_vault() -> (Store, tempfile::TempDir) { + let vault_dir = tempfile::TempDir::new().unwrap(); + let store = Store::open_memory().unwrap(); + + // Insert files into store + store + .insert_file( + "03-Resources/People/Steve Barbera.md", + "h1", + 0, + 
&[], + "aaa111", + ) + .unwrap(); + store + .insert_file( + "03-Resources/Code-Snippets/Reciprocal Rank Fusion.md", + "h2", + 0, + &[], + "bbb222", + ) + .unwrap(); + + // Create files on disk for alias reading + let people = vault_dir.path().join("03-Resources/People"); + std::fs::create_dir_all(&people).unwrap(); + std::fs::write(people.join("Steve Barbera.md"), "# Steve Barbera\n").unwrap(); + + let snippets = vault_dir.path().join("03-Resources/Code-Snippets"); + std::fs::create_dir_all(&snippets).unwrap(); + std::fs::write( + snippets.join("Reciprocal Rank Fusion.md"), + "---\naliases: [RRF]\n---\n# Reciprocal Rank Fusion\n", + ) + .unwrap(); + + (store, vault_dir) + } + + #[test] + fn test_exact_name_match() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "Talked to Steve Barbera"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 1); + assert_eq!(links[0].matched_text, "Steve Barbera"); + assert_eq!(links[0].match_type, LinkMatchType::ExactName); + } + + #[test] + fn test_skip_existing_wikilinks() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "Talked to [[Steve Barbera]]"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 0); + } + + #[test] + fn test_multiple_matches() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "Steve Barbera explained Reciprocal Rank Fusion"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 2); + + let names: Vec<&str> = links.iter().map(|l| l.matched_text.as_str()).collect(); + assert!(names.contains(&"Steve Barbera")); + assert!(names.contains(&"Reciprocal Rank Fusion")); + } + + #[test] + fn test_alias_match() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "We use RRF for search"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 1); + 
assert_eq!(links[0].match_type, LinkMatchType::Alias); + assert_eq!( + links[0].target_path, + "03-Resources/Code-Snippets/Reciprocal Rank Fusion.md" + ); + assert_eq!(links[0].display, Some("RRF".to_string())); + } + + #[test] + fn test_apply_links() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "Steve Barbera explained RRF to me"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + let result = apply_links(content, &links); + + assert!(result.contains("[[Steve Barbera]]")); + assert!(result.contains("[[Reciprocal Rank Fusion|RRF]]")); + } + + #[test] + fn test_find_wikilink_regions() { + let content = "Hello [[World]] and [[Foo|Bar]]"; + let regions = find_wikilink_regions(content); + assert_eq!(regions.len(), 2); + assert_eq!(&content[regions[0].0..regions[0].1], "[[World]]"); + assert_eq!(&content[regions[1].0..regions[1].1], "[[Foo|Bar]]"); + } + + #[test] + fn test_case_insensitive_match() { + let (store, vault_dir) = setup_store_and_vault(); + let content = "talked to steve barbera today"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 1); + assert_eq!(links[0].matched_text, "steve barbera"); + } + + #[test] + fn test_word_boundary_check() { + let (store, vault_dir) = setup_store_and_vault(); + // "RRF" embedded inside a word should not match + let content = "The xRRFy algorithm"; + let links = discover_links(&store, content, vault_dir.path()).unwrap(); + assert_eq!(links.len(), 0); + } +} diff --git a/src/main.rs b/src/main.rs index 013e585..1faffe1 100644 --- a/src/main.rs +++ b/src/main.rs @@ -7,7 +7,7 @@ use engraph::store; use anyhow::Result; use clap::{Parser, Subcommand}; -use std::io::{self, BufRead, Write}; +use std::io::{self, BufRead, Read as _, Write}; use std::path::PathBuf; use config::Config; @@ -62,7 +62,7 @@ enum Command { /// Clear cached data. Clear { - /// Remove everything including the HNSW index and embeddings. 
+ /// Remove everything including the database and embeddings. #[arg(long)] all: bool, }, @@ -96,6 +96,12 @@ enum Command { #[command(subcommand)] action: ContextAction, }, + + /// Write a note to the vault. + Write { + #[command(subcommand)] + action: WriteAction, + }, } #[derive(Subcommand, Debug)] @@ -150,6 +156,46 @@ enum ContextAction { }, } +#[derive(Subcommand, Debug)] +enum WriteAction { + /// Create a new note. + Create { + /// Note content (reads from stdin if omitted). + #[arg(long)] + content: Option, + /// Filename (without .md). + #[arg(long)] + filename: Option, + /// Type hint for placement. + #[arg(long)] + type_hint: Option, + /// Tags (comma-separated). + #[arg(long, value_delimiter = ',')] + tags: Vec, + /// Explicit folder (skips placement). + #[arg(long)] + folder: Option, + }, + /// Append content to an existing note. + Append { + /// Target note (path, basename, or #docid). + file: String, + /// Content to append (reads from stdin if omitted). + #[arg(long)] + content: Option, + }, + /// Archive a note (soft delete — moves to archive, removes from index). + Archive { + /// Target note (path, basename, or #docid). + file: String, + }, + /// Restore an archived note to its original location. + Unarchive { + /// Archived note path (e.g., "04-Archive/01-Projects/note.md"). + file: String, + }, +} + #[derive(Subcommand, Debug)] enum ModelsAction { /// List available models. @@ -185,7 +231,7 @@ fn remove_dir_if_exists(path: &std::path::Path) -> Result { async fn main() -> Result<()> { let cli = Cli::parse(); - // Set up tracing. Default: suppress all logs (ort and hnsw_rs are very noisy). + // Set up tracing. Default: suppress all logs (ort is very noisy). // --verbose enables debug for engraph, info for everything else. 
let filter = if cli.verbose { "engraph=debug,info" @@ -291,22 +337,11 @@ async fn main() -> Result<()> { println!("Nothing to clear (data directory does not exist)."); } } else { - // Delete only index files: engraph.db and hnsw directory. - let mut deleted_any = false; - + // Delete only index files: engraph.db. let db_path = data_dir.join("engraph.db"); if remove_if_exists(&db_path)? { println!("Removed {}", db_path.display()); - deleted_any = true; - } - - let hnsw_dir = data_dir.join("hnsw"); - if remove_dir_if_exists(&hnsw_dir)? { - println!("Removed {}", hnsw_dir.display()); - deleted_any = true; - } - - if !deleted_any { + } else { println!("Nothing to clear (no index files found)."); } } @@ -673,15 +708,12 @@ async fn main() -> Result<()> { ContextAction::Topic { query, budget } => { let models_dir = data_dir.join("models"); let mut embedder = engraph::embedder::Embedder::new(&models_dir)?; - let hnsw_dir = data_dir.join("hnsw"); - let index = engraph::hnsw::HnswIndex::load(&hnsw_dir)?; let bundle = engraph::context::context_topic_with_search( &params, &query, budget, &mut embedder, - &index, )?; if cli.json { println!("{}", serde_json::to_string_pretty(&bundle)?); @@ -716,6 +748,111 @@ async fn main() -> Result<()> { engraph::serve::run_serve(&data_dir).await?; } + Command::Write { action } => { + if !index_exists(&data_dir) { + eprintln!("No index found. Run 'engraph index ' first."); + std::process::exit(1); + } + let db_path = data_dir.join("engraph.db"); + let store = store::Store::open(&db_path)?; + let vault_path_str = store + .get_meta("vault_path")? 
+ .ok_or_else(|| anyhow::anyhow!("No vault path in index."))?; + let vault_path = PathBuf::from(&vault_path_str); + let models_dir = data_dir.join("models"); + let mut embedder = engraph::embedder::Embedder::new(&models_dir)?; + let profile = config::Config::load_vault_profile().ok().flatten(); + + match action { + WriteAction::Create { + content, + filename, + type_hint, + tags, + folder, + } => { + let content = match content { + Some(c) => c, + None => { + let mut buf = String::new(); + io::stdin().lock().read_to_string(&mut buf)?; + buf + } + }; + let input = engraph::writer::CreateNoteInput { + content, + filename, + type_hint, + tags, + folder, + created_by: "cli".into(), + }; + let result = engraph::writer::create_note( + input, + &store, + &mut embedder, + &vault_path, + profile.as_ref(), + )?; + if cli.json { + println!("{}", serde_json::to_string_pretty(&result)?); + } else { + println!( + "Created: {} (#{}) [{}]", + result.path, result.docid, result.strategy + ); + if !result.links_added.is_empty() { + println!("Links: {}", result.links_added.join(", ")); + } + } + } + WriteAction::Append { file, content } => { + let content = match content { + Some(c) => c, + None => { + let mut buf = String::new(); + io::stdin().lock().read_to_string(&mut buf)?; + buf + } + }; + let input = engraph::writer::AppendInput { + file, + content, + modified_by: "cli".into(), + }; + let result = + engraph::writer::append_to_note(input, &store, &mut embedder, &vault_path)?; + if cli.json { + println!("{}", serde_json::to_string_pretty(&result)?); + } else { + println!("Appended to: {} (#{})", result.path, result.docid); + } + } + WriteAction::Archive { file } => { + let result = engraph::writer::archive_note( + &file, + &store, + &vault_path, + profile.as_ref(), + )?; + if cli.json { + println!("{}", serde_json::to_string_pretty(&result)?); + } else { + println!("Archived: {} → {}", file, result.path); + } + } + WriteAction::Unarchive { file } => { + let result = + 
engraph::writer::unarchive_note(&file, &store, &mut embedder, &vault_path)?; + if cli.json { + println!("{}", serde_json::to_string_pretty(&result)?); + } else { + println!("Restored: {} → {}", file, result.path); + } + } + } + } + Command::Models { action } => { let registry = model::ModelRegistry::default(); match action { diff --git a/src/placement.rs b/src/placement.rs new file mode 100644 index 0000000..97b18b4 --- /dev/null +++ b/src/placement.rs @@ -0,0 +1,512 @@ +use anyhow::Result; + +use crate::embedder::Embedder; +use crate::profile::VaultProfile; +use crate::store::Store; + +#[derive(Debug, Clone)] +pub struct PlacementResult { + pub folder: String, + pub confidence: f64, + pub strategy: PlacementStrategy, + pub reason: String, + /// When strategy is InboxFallback and semantic matching was attempted, + /// this holds the best-matching folder even though it was below threshold. + /// Used to inject `suggested_folder` frontmatter for user triage. + pub suggestion: Option<(String, f64)>, // (folder, confidence) +} + +#[derive(Debug, Clone, PartialEq)] +pub enum PlacementStrategy { + TypeRule, + SemanticCentroid, + InboxFallback, +} + +pub struct PlacementHints { + pub type_hint: Option, + pub tags: Vec, +} + +/// Main entry point. Tries 3 strategies in order: +/// 1. Type-based rules +/// 2. Semantic centroid matching +/// 3. Inbox fallback +pub fn place_note( + content: &str, + hints: &PlacementHints, + profile: Option<&VaultProfile>, + store: &Store, + embedder: Option<&mut Embedder>, +) -> Result { + // Strategy A: Type-based rules + if let Some(result) = try_type_rules(content, hints, profile) { + return Ok(result); + } + + // Strategy B: Semantic centroid matching + let mut semantic_suggestion: Option<(String, f64)> = None; + if let Some(embedder) = embedder + && let Some(result) = try_semantic_placement(content, store, embedder)? 
+ { + if result.strategy == PlacementStrategy::SemanticCentroid { + return Ok(result); + } + // Below threshold — carry suggestion into inbox fallback + semantic_suggestion = result.suggestion; + } + + // Strategy C: Inbox fallback + let inbox = profile + .and_then(|p| p.structure.folders.inbox.clone()) + .unwrap_or_else(|| "00-Inbox".to_string()); + + Ok(PlacementResult { + folder: inbox, + confidence: 0.0, + strategy: PlacementStrategy::InboxFallback, + reason: "No confident placement".to_string(), + suggestion: semantic_suggestion, + }) +} + +/// Strategy A: Type-based rules. +/// Maps explicit type hints or content patterns to known folder roles. +fn try_type_rules( + content: &str, + hints: &PlacementHints, + profile: Option<&VaultProfile>, +) -> Option { + let profile = profile?; + let folders = &profile.structure.folders; + + // Check explicit type hints first + if let Some(ref type_hint) = hints.type_hint { + match type_hint.as_str() { + "person" => { + let folder = folders.people.clone()?; + return Some(PlacementResult { + folder, + confidence: 0.95, + strategy: PlacementStrategy::TypeRule, + reason: "type_hint: person".to_string(), + suggestion: None, + }); + } + "daily" => { + let folder = folders.daily.clone()?; + return Some(PlacementResult { + folder, + confidence: 0.95, + strategy: PlacementStrategy::TypeRule, + reason: "type_hint: daily".to_string(), + suggestion: None, + }); + } + "workout" => { + // areas/Health — need areas folder configured + let areas = folders.areas.as_ref()?; + let folder = format!("{areas}/Health"); + return Some(PlacementResult { + folder, + confidence: 0.90, + strategy: PlacementStrategy::TypeRule, + reason: "type_hint: workout".to_string(), + suggestion: None, + }); + } + "decision" => { + // Decision records go to projects folder (or inbox if no projects folder) + let folder = folders.projects.clone()?; + return Some(PlacementResult { + folder, + confidence: 0.90, + strategy: PlacementStrategy::TypeRule, + reason: 
"type_hint: decision".to_string(), + suggestion: None, + }); + } + _ => {} + } + } + + // Content-based: person note detection + // First line is "# First Last" (2-4 words) AND content contains "Role:" or "Company:" + if let Some(first_line) = content.lines().next() + && let Some(heading) = first_line.strip_prefix("# ") + { + let words: Vec<&str> = heading.split_whitespace().collect(); + if (2..=4).contains(&words.len()) + && (content.contains("Role:") || content.contains("Company:")) + { + let folder = folders.people.clone()?; + return Some(PlacementResult { + folder, + confidence: 0.85, + strategy: PlacementStrategy::TypeRule, + reason: "content pattern: person note (heading + Role:/Company:)".to_string(), + suggestion: None, + }); + } + } + + // Content-based: ticket/work note detection + // BRE-XXXX or DRIFT-XXX patterns → projects folder + if contains_ticket_id(content) { + let folder = folders.projects.clone()?; + return Some(PlacementResult { + folder, + confidence: 0.80, + strategy: PlacementStrategy::TypeRule, + reason: "content pattern: ticket ID detected".to_string(), + suggestion: None, + }); + } + + // Content-based: daily/meeting note detection + // Has a date-like heading and action items (- [ ] checkboxes) + if looks_like_meeting_note(content) { + let folder = folders.daily.clone().or_else(|| folders.inbox.clone())?; + return Some(PlacementResult { + folder, + confidence: 0.75, + strategy: PlacementStrategy::TypeRule, + reason: "content pattern: date heading + action items".to_string(), + suggestion: None, + }); + } + + None +} + +/// Check if content contains ticket IDs like BRE-1234 or DRIFT-567. 
+fn contains_ticket_id(content: &str) -> bool { + let bytes = content.as_bytes(); + let mut i = 0; + while i < bytes.len() { + // Look for uppercase letter sequences followed by -digits + if bytes[i].is_ascii_uppercase() { + let start = i; + while i < bytes.len() && bytes[i].is_ascii_uppercase() { + i += 1; + } + let prefix_len = i - start; + if prefix_len >= 2 && i < bytes.len() && bytes[i] == b'-' { + i += 1; // skip '-' + let digit_start = i; + while i < bytes.len() && bytes[i].is_ascii_digit() { + i += 1; + } + if i - digit_start >= 2 { + return true; // Found pattern like XX-123 + } + } + } else { + i += 1; + } + } + false +} + +/// Check if content looks like a meeting/daily note (date-like heading + action items). +fn looks_like_meeting_note(content: &str) -> bool { + let has_date_heading = content.lines().any(|l| { + let t = l.trim(); + // "# 2026-03-25" or "# Meeting 2026-03-25" or "## Action Items" + (t.starts_with("# ") || t.starts_with("## ")) + && (t.contains("202") || t.contains("action item") || t.contains("Action Item")) + }); + let has_checkboxes = content.contains("- [ ]") || content.contains("- [x]"); + has_date_heading && has_checkboxes +} + +/// Strategy B: Semantic centroid matching. +/// Embeds content and compares against precomputed folder centroids. 
+fn try_semantic_placement( + content: &str, + store: &Store, + embedder: &mut Embedder, +) -> Result<Option<PlacementResult>> { + let centroids = store.get_folder_centroids()?; + if centroids.is_empty() { + return Ok(None); + } + + let embedding = embedder.embed_one(content)?; + + let mut best_folder = String::new(); + let mut best_sim = f64::NEG_INFINITY; + + for (folder, centroid) in &centroids { + let sim = cosine_similarity(&embedding, centroid); + if sim > best_sim { + best_sim = sim; + best_folder = folder.clone(); + } + } + + if best_sim > 0.65 { + Ok(Some(PlacementResult { + folder: best_folder, + confidence: best_sim, + strategy: PlacementStrategy::SemanticCentroid, + reason: format!("semantic similarity: {best_sim:.3}"), + suggestion: None, + })) + } else if best_sim > 0.0 && !best_folder.is_empty() { + // Below threshold but we have a candidate — store as suggestion + // so the inbox fallback can surface it in frontmatter + Ok(Some(PlacementResult { + folder: String::new(), // will be overridden by inbox fallback + confidence: 0.0, + strategy: PlacementStrategy::InboxFallback, + reason: "No confident placement".to_string(), + suggestion: Some((best_folder, best_sim)), + })) + } else { + Ok(None) + } +} + +/// Compute cosine similarity between two vectors. 
+fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 { + let dot: f64 = a + .iter() + .zip(b) + .map(|(x, y)| (*x as f64) * (*y as f64)) + .sum(); + let norm_a: f64 = a.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt(); + let norm_b: f64 = b.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt(); + if norm_a == 0.0 || norm_b == 0.0 { + 0.0 + } else { + dot / (norm_a * norm_b) + } +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::profile::{ + FolderMap, StructureDetection, StructureMethod, VaultProfile, VaultStats, + }; + use crate::store::Store; + use std::path::PathBuf; + + fn make_profile(folders: FolderMap) -> VaultProfile { + VaultProfile { + vault_path: PathBuf::from("/test/vault"), + vault_type: crate::profile::VaultType::Obsidian, + structure: StructureDetection { + method: StructureMethod::Para, + folders, + }, + stats: VaultStats::default(), + } + } + + #[test] + fn test_inbox_fallback() { + let store = Store::open_memory().unwrap(); + let hints = PlacementHints { + type_hint: None, + tags: vec![], + }; + let result = place_note("Some random note.", &hints, None, &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::InboxFallback); + assert_eq!(result.folder, "00-Inbox"); + } + + #[test] + fn test_type_rule_person_no_profile() { + // Without a profile, type rules return None -> falls through to inbox + let store = Store::open_memory().unwrap(); + let hints = PlacementHints { + type_hint: Some("person".into()), + tags: vec![], + }; + let result = place_note("# John Doe", &hints, None, &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::InboxFallback); + } + + #[test] + fn test_type_rule_person_with_profile() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + people: Some("03-Resources/People".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: Some("person".into()), + tags: vec![], + }; + let result = place_note("# 
John Doe", &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::TypeRule); + assert_eq!(result.folder, "03-Resources/People"); + assert!(result.confidence > 0.9); + } + + #[test] + fn test_type_rule_daily() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + daily: Some("07-Daily".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: Some("daily".into()), + tags: vec![], + }; + let result = place_note("Today's notes", &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::TypeRule); + assert_eq!(result.folder, "07-Daily"); + } + + #[test] + fn test_type_rule_workout() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + areas: Some("02-Areas".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: Some("workout".into()), + tags: vec![], + }; + let result = place_note("Leg day workout", &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::TypeRule); + assert_eq!(result.folder, "02-Areas/Health"); + } + + #[test] + fn test_content_based_person_detection() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + people: Some("03-Resources/People".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: None, + tags: vec![], + }; + let content = "# Jane Smith\nRole: Engineering Manager\nCompany: Acme Corp"; + let result = place_note(content, &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::TypeRule); + assert_eq!(result.folder, "03-Resources/People"); + } + + #[test] + fn test_content_based_not_person_no_role() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + people: Some("03-Resources/People".into()), + 
inbox: Some("00-Inbox".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: None, + tags: vec![], + }; + // Heading with 2 words but no Role: or Company: + let content = "# Jane Smith\nJust some notes about a topic."; + let result = place_note(content, &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::InboxFallback); + } + + #[test] + fn test_inbox_fallback_uses_profile_inbox() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + inbox: Some("Inbox".into()), + ..FolderMap::default() + }; + let profile = make_profile(folders); + let hints = PlacementHints { + type_hint: None, + tags: vec![], + }; + let result = place_note("Random note", &hints, Some(&profile), &store, None).unwrap(); + assert_eq!(result.strategy, PlacementStrategy::InboxFallback); + assert_eq!(result.folder, "Inbox"); + } + + #[test] + fn test_ticket_id_detection() { + assert!(contains_ticket_id("Working on BRE-1234 today")); + assert!(contains_ticket_id("DRIFT-567 is in progress")); + assert!(contains_ticket_id("See JIRA-99 for details")); + assert!(!contains_ticket_id("No ticket here")); + assert!(!contains_ticket_id("AB-1")); // digits too short + assert!(!contains_ticket_id("a-1234")); // lowercase prefix + } + + #[test] + fn test_meeting_note_detection() { + let meeting = "# Meeting 2026-03-25\n## Attendees\n- Alice\n## Action Items\n- [ ] Follow up\n- [x] Done"; + assert!(looks_like_meeting_note(meeting)); + + let not_meeting = "# Just a heading\nSome notes without checkboxes"; + assert!(!looks_like_meeting_note(not_meeting)); + + let only_checkboxes = "- [ ] a task\n- [x] done task"; + assert!(!looks_like_meeting_note(only_checkboxes)); + } + + #[test] + fn test_decision_type_hint() { + let store = Store::open_memory().unwrap(); + let folders = FolderMap { + projects: Some("01-Projects".into()), + ..FolderMap::default() + }; + let profile = 
make_profile(folders); + let hints = PlacementHints { + type_hint: Some("decision".into()), + tags: vec![], + }; + let result = place_note( + "Architecture decision", + &hints, + Some(&profile), + &store, + None, + ) + .unwrap(); + assert_eq!(result.strategy, PlacementStrategy::TypeRule); + assert_eq!(result.folder, "01-Projects"); + assert!(result.confidence >= 0.90); + } + + #[test] + fn test_cosine_similarity_identical() { + let a = vec![1.0, 0.0, 0.0]; + let b = vec![1.0, 0.0, 0.0]; + assert!((cosine_similarity(&a, &b) - 1.0).abs() < 0.001); + } + + #[test] + fn test_cosine_similarity_orthogonal() { + let a = vec![1.0, 0.0, 0.0]; + let b = vec![0.0, 1.0, 0.0]; + assert!(cosine_similarity(&a, &b).abs() < 0.001); + } + + #[test] + fn test_cosine_similarity_opposite() { + let a = vec![1.0, 0.0]; + let b = vec![-1.0, 0.0]; + assert!((cosine_similarity(&a, &b) + 1.0).abs() < 0.001); + } + + #[test] + fn test_cosine_similarity_zero_vector() { + let a = vec![0.0, 0.0, 0.0]; + let b = vec![1.0, 2.0, 3.0]; + assert_eq!(cosine_similarity(&a, &b), 0.0); + } +} diff --git a/src/search.rs b/src/search.rs index 1ed0b92..f776cff 100644 --- a/src/search.rs +++ b/src/search.rs @@ -7,7 +7,6 @@ use serde_json::json; use crate::embedder::Embedder; use crate::fusion::{self, RankedResult}; use crate::graph; -use crate::hnsw::HnswIndex; use crate::store::{Store, StoreStats}; /// A single search result with metadata. @@ -43,14 +42,13 @@ pub fn search_internal( top_n: usize, store: &Store, embedder: &mut Embedder, - index: &HnswIndex, ) -> Result { // --- Semantic lane --- let query_vec = embedder.embed_one(query).context("embedding query")?; - let tombstones = store.get_tombstones().context("loading tombstones")?; + let tombstones = std::collections::HashSet::new(); - // Request extra results to account for tombstone filtering and file-level dedup. - let raw_results = index.search(&query_vec, top_n * 3, &tombstones); + // Request extra results to account for file-level dedup. 
+ let raw_results = store.search_vec(&query_vec, top_n * 3, &tombstones)?; // Group semantic results by file_path, keeping best per file. let mut sem_by_file: HashMap = HashMap::new(); @@ -181,7 +179,7 @@ pub fn search_internal( /// Run a search query and print results. /// -/// Performs both semantic (HNSW) and keyword (FTS5) search, then fuses +/// Performs both semantic (sqlite-vec) and keyword (FTS5) search, then fuses /// results using Reciprocal Rank Fusion. When `explain` is true, each /// result includes per-lane score breakdown. pub fn run_search( @@ -194,13 +192,10 @@ pub fn run_search( let models_dir = data_dir.join("models"); let mut embedder = Embedder::new(&models_dir).context("loading embedder")?; - let hnsw_dir = data_dir.join("hnsw"); - let index = HnswIndex::load(&hnsw_dir).context("loading HNSW index")?; - let db_path = data_dir.join("engraph.db"); let store = Store::open(&db_path).context("opening store")?; - let output = search_internal(query, top_n, &store, &mut embedder, &index)?; + let output = search_internal(query, top_n, &store, &mut embedder)?; let results: Vec = output .results @@ -235,9 +230,8 @@ pub fn run_status(json: bool, data_dir: &Path) -> Result<()> { let store = Store::open(&db_path).context("opening store")?; let stats = store.stats()?; - // Compute index size on disk (sum of HNSW files). - let hnsw_dir = data_dir.join("hnsw"); - let index_size = dir_size(&hnsw_dir); + // Compute index size on disk (sqlite db file). + let index_size = std::fs::metadata(&db_path).map(|m| m.len()).unwrap_or(0); let model_name = "all-MiniLM-L6-v2"; @@ -383,24 +377,6 @@ fn format_bytes(bytes: u64) -> String { } } -/// Compute total size of all files in a directory (non-recursive is fine for HNSW). 
-fn dir_size(path: &Path) -> u64 { - if !path.exists() { - return 0; - } - let mut total = 0u64; - if let Ok(entries) = std::fs::read_dir(path) { - for entry in entries.flatten() { - if let Ok(meta) = entry.metadata() - && meta.is_file() - { - total += meta.len(); - } - } - } - total -} - #[cfg(test)] mod tests { use super::*; diff --git a/src/serve.rs b/src/serve.rs index 97b57b7..65ee0ab 100644 --- a/src/serve.rs +++ b/src/serve.rs @@ -14,7 +14,6 @@ use tokio::sync::Mutex; use crate::config::Config; use crate::context::{self, ContextParams}; use crate::embedder::Embedder; -use crate::hnsw::HnswIndex; use crate::profile::VaultProfile; use crate::search; use crate::store::Store; @@ -67,6 +66,58 @@ pub struct ContextToolParams { pub budget: Option, } +#[derive(Debug, Deserialize, JsonSchema)] +pub struct CreateParams { + /// Note content (markdown body). + pub content: String, + /// Optional filename (without .md). Auto-generated if omitted. + pub filename: Option, + /// Type hint for placement: "person", "daily", "meeting", "decision". + pub type_hint: Option, + /// Proposed tags (auto-resolved against registry). + pub tags: Option>, + /// Explicit folder path (skips placement engine). + pub folder: Option, +} + +#[derive(Debug, Deserialize, JsonSchema)] +pub struct AppendParams { + /// Target note: file path, basename, or #docid. + pub file: String, + /// Content to append to the note. + pub content: String, +} + +#[derive(Debug, Deserialize, JsonSchema)] +pub struct UpdateMetadataParams { + /// Target note: file path, basename, or #docid. + pub file: String, + /// New tags (replaces existing). + pub tags: Option>, + /// New aliases. + pub aliases: Option>, +} + +#[derive(Debug, Deserialize, JsonSchema)] +pub struct MoveNoteParams { + /// Target note: file path, basename, or #docid. + pub file: String, + /// New folder path (relative to vault root). 
+ pub new_folder: String, +} + +#[derive(Debug, Deserialize, JsonSchema)] +pub struct ArchiveParams { + /// Target note: file path, basename, or #docid. + pub file: String, +} + +#[derive(Debug, Deserialize, JsonSchema)] +pub struct UnarchiveParams { + /// Archived note path (e.g., "04-Archive/01-Projects/note.md"). + pub file: String, +} + // --------------------------------------------------------------------------- // Server // --------------------------------------------------------------------------- @@ -75,7 +126,6 @@ pub struct ContextToolParams { pub struct EngraphServer { store: Arc>, embedder: Arc>, - hnsw_index: Arc, vault_path: Arc, profile: Arc>, tool_router: ToolRouter, @@ -110,14 +160,8 @@ impl EngraphServer { let top_n = params.0.top_n.unwrap_or(10); let store = self.store.lock().await; let mut embedder = self.embedder.lock().await; - let output = search::search_internal( - &params.0.query, - top_n, - &store, - &mut embedder, - &self.hnsw_index, - ) - .map_err(|e| mcp_err(&e))?; + let output = search::search_internal(&params.0.query, top_n, &store, &mut embedder) + .map_err(|e| mcp_err(&e))?; to_json_result(&output.results) } @@ -215,15 +259,124 @@ impl EngraphServer { vault_path: &self.vault_path, profile: self.profile.as_ref().as_ref(), }; - let bundle = context::context_topic_with_search( - &ctx, - &params.0.topic, - budget, + let bundle = + context::context_topic_with_search(&ctx, &params.0.topic, budget, &mut embedder) + .map_err(|e| mcp_err(&e))?; + to_json_result(&bundle) + } + + #[tool( + name = "create", + description = "Create a new note with automatic tag resolution, link discovery, and folder placement. Returns the created file's path, docid, and what was auto-resolved." 
+ )] + async fn create(&self, params: Parameters) -> Result { + let store = self.store.lock().await; + let mut embedder = self.embedder.lock().await; + let input = crate::writer::CreateNoteInput { + content: params.0.content, + filename: params.0.filename, + type_hint: params.0.type_hint, + tags: params.0.tags.unwrap_or_default(), + folder: params.0.folder, + created_by: "claude-code".into(), + }; + let result = crate::writer::create_note( + input, + &store, &mut embedder, - &self.hnsw_index, + &self.vault_path, + self.profile.as_ref().as_ref(), ) .map_err(|e| mcp_err(&e))?; - to_json_result(&bundle) + to_json_result(&result) + } + + #[tool( + name = "append", + description = "Append content to an existing note. Safe: only adds content, never overwrites. Detects conflicts via mtime checking." + )] + async fn append(&self, params: Parameters) -> Result { + let store = self.store.lock().await; + let mut embedder = self.embedder.lock().await; + let input = crate::writer::AppendInput { + file: params.0.file, + content: params.0.content, + modified_by: "claude-code".into(), + }; + let result = crate::writer::append_to_note(input, &store, &mut embedder, &self.vault_path) + .map_err(|e| mcp_err(&e))?; + to_json_result(&result) + } + + #[tool( + name = "update_metadata", + description = "Update a note's tags or aliases. Uses mtime conflict detection." + )] + async fn update_metadata( + &self, + params: Parameters, + ) -> Result { + let store = self.store.lock().await; + let input = crate::writer::UpdateMetadataInput { + file: params.0.file, + tags: params.0.tags, + aliases: params.0.aliases, + modified_by: "claude-code".into(), + }; + let result = crate::writer::update_metadata(input, &store, &self.vault_path) + .map_err(|e| mcp_err(&e))?; + to_json_result(&result) + } + + #[tool( + name = "move_note", + description = "Move a note to a different folder. Updates the index path." 
+ )] + async fn move_note( + &self, + params: Parameters, + ) -> Result { + let store = self.store.lock().await; + let result = crate::writer::move_note( + &params.0.file, + &params.0.new_folder, + &store, + &self.vault_path, + ) + .map_err(|e| mcp_err(&e))?; + to_json_result(&result) + } + + #[tool( + name = "archive", + description = "Archive a note: moves it to the archive folder, removes from search index. The note is preserved on disk but invisible to search/context. Use unarchive to restore." + )] + async fn archive(&self, params: Parameters) -> Result { + let store = self.store.lock().await; + let result = crate::writer::archive_note( + &params.0.file, + &store, + &self.vault_path, + self.profile.as_ref().as_ref(), + ) + .map_err(|e| mcp_err(&e))?; + to_json_result(&result) + } + + #[tool( + name = "unarchive", + description = "Restore an archived note to its original location and re-index it for search." + )] + async fn unarchive( + &self, + params: Parameters, + ) -> Result { + let store = self.store.lock().await; + let mut embedder = self.embedder.lock().await; + let result = + crate::writer::unarchive_note(&params.0.file, &store, &mut embedder, &self.vault_path) + .map_err(|e| mcp_err(&e))?; + to_json_result(&result) } } @@ -232,8 +385,9 @@ impl rmcp::handler::server::ServerHandler for EngraphServer { fn get_info(&self) -> ServerInfo { ServerInfo::new(ServerCapabilities::builder().enable_tools().build()).with_instructions( "engraph: vault intelligence for Obsidian. \ - Use vault_map to orient, search to find, \ - read for full content, who/project for context bundles.", + Read: vault_map to orient, search to find, read for content, who/project for context. \ + Write: create for new notes, append to add content, update_metadata for tags/aliases, move_note to relocate. 
\ + Lifecycle: archive to soft-delete (moves to archive, removes from index), unarchive to restore.", ) } } @@ -245,23 +399,33 @@ impl rmcp::handler::server::ServerHandler for EngraphServer { pub async fn run_serve(data_dir: &Path) -> Result<()> { let db_path = data_dir.join("engraph.db"); let models_dir = data_dir.join("models"); - let hnsw_dir = data_dir.join("hnsw"); let store = Store::open(&db_path)?; let embedder = Embedder::new(&models_dir)?; - let hnsw_index = HnswIndex::load(&hnsw_dir)?; let vault_path_str = store.get_meta("vault_path")?.ok_or_else(|| { anyhow::anyhow!("No vault path in index. Run 'engraph index ' first.") })?; let vault_path = PathBuf::from(&vault_path_str); + let cleaned = crate::writer::cleanup_temp_files(&vault_path)?; + if cleaned > 0 { + eprintln!( + "Cleaned up {} incomplete write(s) from previous run", + cleaned + ); + } + + let orphans = crate::writer::verify_index_integrity(&store, &vault_path)?; + if orphans > 0 { + eprintln!("Cleaned up {} orphan DB entries for missing files", orphans); + } + let profile = Config::load_vault_profile().ok().flatten(); let server = EngraphServer { store: Arc::new(Mutex::new(store)), embedder: Arc::new(Mutex::new(embedder)), - hnsw_index: Arc::new(hnsw_index), vault_path: Arc::new(vault_path), profile: Arc::new(profile), tool_router: EngraphServer::tool_router(), diff --git a/src/store.rs b/src/store.rs index f62e6bd..7fa405c 100644 --- a/src/store.rs +++ b/src/store.rs @@ -100,6 +100,7 @@ pub struct Store { impl Store { /// Open a store backed by a file on disk. pub fn open(path: &Path) -> Result { + crate::vecstore::init_sqlite_vec(); let conn = Connection::open(path) .with_context(|| format!("failed to open database at {}", path.display()))?; let store = Self { conn }; @@ -109,6 +110,7 @@ impl Store { /// Open an in-memory store (useful for tests). 
pub fn open_memory() -> Result { + crate::vecstore::init_sqlite_vec(); let conn = Connection::open_in_memory().context("failed to open in-memory database")?; let store = Self { conn }; store.init()?; @@ -121,6 +123,46 @@ impl Store { .context("failed to initialize schema")?; self.migrate()?; self.ensure_fts_table()?; + crate::vecstore::init_vec_table(&self.conn)?; + self.migrate_vectors_to_vec0()?; + Ok(()) + } + + /// One-time migration: copy BLOB vectors from `chunks.vector` into the vec0 virtual table. + /// Safe to call on every startup — skips if vec0 is already populated or no BLOBs exist. + pub fn migrate_vectors_to_vec0(&self) -> Result<()> { + let vec_count: i64 = self + .conn + .query_row("SELECT count(*) FROM chunks_vec", [], |row| row.get(0)) + .unwrap_or(0); + let blob_count: i64 = self + .conn + .query_row( + "SELECT count(*) FROM chunks WHERE vector IS NOT NULL", + [], + |row| row.get(0), + ) + .unwrap_or(0); + + if vec_count == 0 && blob_count > 0 { + tracing::info!(blob_count, "migrating BLOB vectors to vec0"); + let mut stmt = self + .conn + .prepare("SELECT vector_id, vector FROM chunks WHERE vector IS NOT NULL")?; + let rows: Vec<(i64, Vec)> = stmt + .query_map([], |row| Ok((row.get(0)?, row.get(1)?)))? 
+ .filter_map(|r| r.ok()) + .collect(); + + for (vid, blob) in &rows { + self.conn.execute( + "INSERT OR IGNORE INTO chunks_vec(rowid, embedding) VALUES (?1, ?2)", + rusqlite::params![vid, blob], + )?; + } + tracing::info!(migrated = rows.len(), "BLOB vector migration complete"); + } + Ok(()) } @@ -170,6 +212,26 @@ impl Store { )?; } + // Folder centroids table + self.conn.execute_batch( + "CREATE TABLE IF NOT EXISTS folder_centroids ( + folder TEXT PRIMARY KEY, + centroid BLOB NOT NULL, + file_count INTEGER NOT NULL DEFAULT 0, + updated_at TEXT NOT NULL DEFAULT (datetime('now')) + );", + )?; + + // Tag registry table + self.conn.execute_batch( + "CREATE TABLE IF NOT EXISTS tag_registry ( + name TEXT PRIMARY KEY, + usage_count INTEGER NOT NULL DEFAULT 0, + last_used TEXT, + created_by TEXT NOT NULL DEFAULT 'indexer' + );", + )?; + Ok(()) } @@ -317,7 +379,7 @@ impl Store { Ok(()) } - /// Get all stored vectors with their IDs for HNSW index rebuild. + /// Get all stored vectors with their IDs. /// Returns (vector_id, vector) pairs. 
pub fn get_all_vectors(&self) -> Result)>> { let mut stmt = self @@ -972,6 +1034,152 @@ impl Store { None => Ok(None), } } + + // ── Vec (sqlite-vec) ──────────────────────────────────────── + + pub fn insert_vec(&self, vector_id: u64, embedding: &[f32]) -> Result<()> { + crate::vecstore::insert_vec(&self.conn, vector_id, embedding) + } + + pub fn delete_vec(&self, vector_id: u64) -> Result<()> { + crate::vecstore::delete_vec(&self.conn, vector_id) + } + + pub fn search_vec( + &self, + query: &[f32], + k: usize, + tombstones: &std::collections::HashSet, + ) -> Result> { + crate::vecstore::search_vec(&self.conn, query, k, tombstones) + } + + pub fn clear_vec(&self) -> Result<()> { + crate::vecstore::clear_vec(&self.conn) + } + + // ── Transactions ──────────────────────────────────────────── + + pub fn begin_transaction(&self) -> Result<()> { + self.conn.execute_batch("BEGIN IMMEDIATE")?; + Ok(()) + } + + pub fn commit(&self) -> Result<()> { + self.conn.execute_batch("COMMIT")?; + Ok(()) + } + + pub fn rollback(&self) -> Result<()> { + self.conn.execute_batch("ROLLBACK")?; + Ok(()) + } + + // ── Folder centroids ───────────────────────────────────────── + + pub fn upsert_folder_centroid( + &self, + folder: &str, + centroid: &[f32], + file_count: usize, + ) -> Result<()> { + let blob: Vec = centroid.iter().flat_map(|f| f.to_le_bytes()).collect(); + self.conn.execute( + "INSERT INTO folder_centroids (folder, centroid, file_count, updated_at) + VALUES (?1, ?2, ?3, datetime('now')) + ON CONFLICT(folder) DO UPDATE SET centroid = ?2, file_count = ?3, updated_at = datetime('now')", + params![folder, blob, file_count as i64], + )?; + Ok(()) + } + + pub fn get_folder_centroids(&self) -> Result)>> { + let mut stmt = self + .conn + .prepare("SELECT folder, centroid FROM folder_centroids")?; + let rows = stmt.query_map([], |row| { + let folder: String = row.get(0)?; + let blob: Vec = row.get(1)?; + let centroid: Vec = blob + .chunks_exact(4) + .map(|b| f32::from_le_bytes([b[0], 
b[1], b[2], b[3]])) + .collect(); + Ok((folder, centroid)) + })?; + let mut results = Vec::new(); + for row in rows { + results.push(row?); + } + Ok(results) + } + + // ── Helpers ───────────────────────────────────────────────── + + pub fn next_vector_id(&self) -> Result { + let max: Option = self + .conn + .query_row("SELECT MAX(vector_id) FROM chunks", [], |row| row.get(0)) + .ok() + .flatten(); + Ok(max.map_or(0, |m| m as u64 + 1)) + } + + // ── Tags ──────────────────────────────────────────────────── + + /// Tags created by agents (not by indexer). + pub fn agent_created_tags(&self) -> Result> { + let mut stmt = self.conn.prepare( + "SELECT name, created_by, usage_count FROM tag_registry WHERE created_by != 'indexer' ORDER BY usage_count DESC", + )?; + let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)))?; + rows.collect::, _>>().map_err(|e| e.into()) + } + + /// Tags used fewer than N times (cleanup candidates). + pub fn low_usage_tags(&self, max_count: i64) -> Result> { + let mut stmt = self.conn.prepare( + "SELECT name, usage_count FROM tag_registry WHERE usage_count < ?1 ORDER BY usage_count", + )?; + let rows = stmt.query_map(params![max_count], |row| Ok((row.get(0)?, row.get(1)?)))?; + rows.collect::, _>>().map_err(|e| e.into()) + } + + /// Tags unused for more than N days. + pub fn stale_tags(&self, days: i64) -> Result> { + let mut stmt = self.conn.prepare( + "SELECT name, last_used FROM tag_registry WHERE last_used IS NOT NULL AND julianday('now') - julianday(last_used) > ?1 ORDER BY last_used", + )?; + let rows = stmt.query_map(params![days], |row| Ok((row.get(0)?, row.get(1)?)))?; + rows.collect::, _>>().map_err(|e| e.into()) + } + + /// Borrow the underlying connection (for modules that need direct access). + pub fn conn(&self) -> &Connection { + &self.conn + } + + /// Resolve a file reference (path, basename, or #docid) to a FileRecord. 
+ pub fn resolve_file(&self, file_or_docid: &str) -> Result> { + if file_or_docid.starts_with('#') && file_or_docid.len() == 7 { + return self.get_file_by_docid(&file_or_docid[1..]); + } + if let Some(f) = self.get_file(file_or_docid)? { + return Ok(Some(f)); + } + self.find_file_by_basename(file_or_docid) + } + + pub fn resolve_tag(&self, proposed: &str) -> Result { + crate::tags::resolve_tag(&self.conn, proposed) + } + + pub fn resolve_tags(&self, proposed: &[String]) -> Result> { + crate::tags::resolve_tags(&self.conn, proposed) + } + + pub fn register_tag(&self, name: &str, created_by: &str) -> Result<()> { + crate::tags::register_tag(&self.conn, name, created_by) + } } fn parse_tags(json: &str) -> Vec { @@ -1635,4 +1843,112 @@ mod tests { let empty = store.edge_counts_for_files(&[]).unwrap(); assert!(empty.is_empty()); } + + // ── Vec integration tests ─────────────────────────────────── + + #[test] + fn test_store_has_vec_table() { + let store = Store::open_memory().unwrap(); + let count: i64 = store + .conn + .query_row( + "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='chunks_vec'", + [], + |row| row.get(0), + ) + .unwrap(); + assert_eq!(count, 1); + } + + #[test] + fn test_store_vec_roundtrip() { + let store = Store::open_memory().unwrap(); + let vector: Vec = (0..384).map(|i| (i as f32) / 384.0).collect(); + store.insert_vec(0, &vector).unwrap(); + + let results = store + .search_vec(&vector, 1, &std::collections::HashSet::new()) + .unwrap(); + assert_eq!(results.len(), 1); + assert_eq!(results[0].0, 0); + assert!(results[0].1 < 0.01); + } + + #[test] + fn test_migrate_vectors_to_vec0() { + let store = Store::open_memory().unwrap(); + // Insert a file + chunk with a vector BLOB. 
+ let file_id = store + .insert_file("test.md", "hash123", 0, &[], "abc123") + .unwrap(); + let vector: Vec = (0..384).map(|i| (i as f32) / 384.0).collect(); + store + .insert_chunk_with_vector(file_id, "heading", "snippet", 0, 100, &vector) + .unwrap(); + + // Clear vec0 to simulate a pre-migration state, then re-run the migration. + store.clear_vec().unwrap(); + store.migrate_vectors_to_vec0().unwrap(); + + // Verify vec0 is now populated. + let results = store + .search_vec(&vector, 1, &std::collections::HashSet::new()) + .unwrap(); + assert_eq!(results.len(), 1); + assert_eq!(results[0].0, 0); + } + + #[test] + fn test_store_transaction() { + let store = Store::open_memory().unwrap(); + store.begin_transaction().unwrap(); + store.set_meta("test_key", "test_value").unwrap(); + store.commit().unwrap(); + assert_eq!( + store.get_meta("test_key").unwrap(), + Some("test_value".into()) + ); + } + + #[test] + fn test_next_vector_id_empty() { + let store = Store::open_memory().unwrap(); + assert_eq!(store.next_vector_id().unwrap(), 0); + } + + // ── Tag query tests ────────────────────────────────────────── + + #[test] + fn test_tag_query_functions() { + let store = Store::open_memory().unwrap(); + + // Register tags with different creators + store.register_tag("rust", "indexer").unwrap(); + store.register_tag("work", "indexer").unwrap(); + store.register_tag("engraph", "claude-code").unwrap(); + store.register_tag("decision", "claude-code").unwrap(); + + // Bump usage counts + store.register_tag("rust", "indexer").unwrap(); + store.register_tag("rust", "indexer").unwrap(); + + // agent_created_tags: should return only non-indexer tags + let agent_tags = store.agent_created_tags().unwrap(); + assert_eq!(agent_tags.len(), 2); + assert!(agent_tags.iter().all(|(_, by, _)| by != "indexer")); + let names: Vec<&str> = agent_tags.iter().map(|(n, _, _)| n.as_str()).collect(); + assert!(names.contains(&"engraph")); + assert!(names.contains(&"decision")); + + // low_usage_tags: 
tags with usage_count < 2 + let low = store.low_usage_tags(2).unwrap(); + // engraph and decision have count 1, work has count 1, rust has count 3 + assert!(low.iter().any(|(n, _)| n == "engraph")); + assert!(low.iter().any(|(n, _)| n == "work")); + assert!(!low.iter().any(|(n, _)| n == "rust")); + + // stale_tags: no tags should be stale since they were just created + let stale = store.stale_tags(1).unwrap(); + assert!(stale.is_empty()); + } } diff --git a/src/tags.rs b/src/tags.rs new file mode 100644 index 0000000..3e51332 --- /dev/null +++ b/src/tags.rs @@ -0,0 +1,223 @@ +use anyhow::Result; +use rusqlite::{Connection, params}; +use strsim::levenshtein; + +/// Result of resolving a proposed tag against the registry. +#[derive(Debug, Clone, PartialEq)] +pub enum TagResolution { + /// Exact case-insensitive match found. + Exact(String), + /// Fuzzy match: Levenshtein distance ≤ 2. + Fuzzy { + proposed: String, + resolved: String, + distance: usize, + }, + /// Proposed tag extends an existing tag via `/` hierarchy. + Extension(String), + /// No match — this is a brand-new tag. + New(String), +} + +/// Resolve a single proposed tag against the registry. +/// +/// Resolution tiers (priority order): +/// 1. Exact match (case-insensitive) +/// 2. Fuzzy match (Levenshtein distance ≤ 2, pick closest) +/// 3. Prefix extension (proposed starts with `existing_tag/`) +/// 4. New tag +pub fn resolve_tag(conn: &Connection, proposed: &str) -> Result { + let lower = proposed.to_lowercase(); + + // Tier 1: Exact case-insensitive match. + let exact: Option = conn + .prepare("SELECT name FROM tag_registry WHERE LOWER(name) = ?1")? + .query_map(params![lower], |row| row.get::<_, String>(0))? + .filter_map(|r| r.ok()) + .next(); + + if let Some(name) = exact { + return Ok(TagResolution::Exact(name)); + } + + // Load all registered tags for fuzzy + prefix checks. + let all_tags: Vec = conn + .prepare("SELECT name FROM tag_registry")? + .query_map([], |row| row.get::<_, String>(0))? 
+ .filter_map(|r| r.ok()) + .collect(); + + // Tier 2: Fuzzy match — Levenshtein distance ≤ 2. + let mut best: Option<(String, usize)> = None; + for tag in &all_tags { + let dist = levenshtein(&lower, &tag.to_lowercase()); + if dist > 0 && dist <= 2 && (best.is_none() || dist < best.as_ref().unwrap().1) { + best = Some((tag.clone(), dist)); + } + } + if let Some((resolved, distance)) = best { + return Ok(TagResolution::Fuzzy { + proposed: proposed.to_string(), + resolved, + distance, + }); + } + + // Tier 3: Prefix extension — proposed starts with `existing_tag/`. + for tag in &all_tags { + if lower.starts_with(&format!("{}/", tag.to_lowercase())) { + return Ok(TagResolution::Extension(proposed.to_string())); + } + } + + // Tier 4: New tag. + Ok(TagResolution::New(proposed.to_string())) +} + +/// Register (upsert) a tag: increment usage_count if it exists, insert otherwise. +pub fn register_tag(conn: &Connection, name: &str, created_by: &str) -> Result<()> { + conn.execute( + "INSERT INTO tag_registry (name, usage_count, last_used, created_by) + VALUES (?1, 1, datetime('now'), ?2) + ON CONFLICT(name) DO UPDATE SET + usage_count = usage_count + 1, + last_used = datetime('now')", + params![name, created_by], + )?; + Ok(()) +} + +/// Resolve a list of proposed tags, returning the final tag names. +/// +/// - Exact / Fuzzy matches map to the resolved name. +/// - Extension / New tags pass through as-is. +pub fn resolve_tags(conn: &Connection, proposed: &[String]) -> Result> { + let mut out = Vec::with_capacity(proposed.len()); + for tag in proposed { + let resolved = resolve_tag(conn, tag)?; + let name = match resolved { + TagResolution::Exact(name) => name, + TagResolution::Fuzzy { resolved, .. 
} => resolved, + TagResolution::Extension(name) => name, + TagResolution::New(name) => name, + }; + out.push(name); + } + Ok(out) +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::store::Store; + + fn setup_store() -> Store { + let store = Store::open_memory().unwrap(); + let conn = store.conn(); + // Seed tags with varying usage counts. + for (name, count) in [ + ("domaine", 15), + ("scentbird", 10), + ("engraph", 8), + ("work", 20), + ("work/domaine", 5), + ] { + conn.execute( + "INSERT INTO tag_registry (name, usage_count, last_used, created_by) + VALUES (?1, ?2, datetime('now'), 'test')", + params![name, count], + ) + .unwrap(); + } + store + } + + #[test] + fn test_exact_match() { + let store = setup_store(); + let res = resolve_tag(store.conn(), "domaine").unwrap(); + assert_eq!(res, TagResolution::Exact("domaine".to_string())); + } + + #[test] + fn test_exact_match_case_insensitive() { + let store = setup_store(); + let res = resolve_tag(store.conn(), "Domaine").unwrap(); + assert_eq!(res, TagResolution::Exact("domaine".to_string())); + } + + #[test] + fn test_fuzzy_match() { + let store = setup_store(); + // "doamine" is Levenshtein distance 2 from "domaine" (transposition). 
+ let res = resolve_tag(store.conn(), "doamine").unwrap(); + match res { + TagResolution::Fuzzy { + proposed, + resolved, + distance, + } => { + assert_eq!(proposed, "doamine"); + assert_eq!(resolved, "domaine"); + assert!(distance <= 2); + } + other => panic!("expected Fuzzy, got {other:?}"), + } + } + + #[test] + fn test_hierarchy_extension() { + let store = setup_store(); + let res = resolve_tag(store.conn(), "work/domaine/bre").unwrap(); + assert_eq!( + res, + TagResolution::Extension("work/domaine/bre".to_string()) + ); + } + + #[test] + fn test_new_tag() { + let store = setup_store(); + let res = resolve_tag(store.conn(), "completely-new").unwrap(); + assert_eq!(res, TagResolution::New("completely-new".to_string())); + } + + #[test] + fn test_register_tag() { + let store = setup_store(); + register_tag(store.conn(), "new-tag", "test").unwrap(); + let count: i64 = store + .conn() + .query_row( + "SELECT usage_count FROM tag_registry WHERE name = 'new-tag'", + [], + |row| row.get(0), + ) + .unwrap(); + assert_eq!(count, 1); + + // Register again — count should increment. + register_tag(store.conn(), "new-tag", "test").unwrap(); + let count: i64 = store + .conn() + .query_row( + "SELECT usage_count FROM tag_registry WHERE name = 'new-tag'", + [], + |row| row.get(0), + ) + .unwrap(); + assert_eq!(count, 2); + } + + #[test] + fn test_resolve_tags() { + let store = setup_store(); + let input = vec![ + "domaine".to_string(), + "doamine".to_string(), + "completely-new".to_string(), + ]; + let resolved = resolve_tags(store.conn(), &input).unwrap(); + assert_eq!(resolved, vec!["domaine", "domaine", "completely-new"]); + } +} diff --git a/src/vecstore.rs b/src/vecstore.rs new file mode 100644 index 0000000..a47065c --- /dev/null +++ b/src/vecstore.rs @@ -0,0 +1,199 @@ +use std::collections::HashSet; + +use anyhow::Result; +use rusqlite::Connection; +use sqlite_vec::sqlite3_vec_init; + +/// Register the sqlite-vec extension as an auto-extension. 
+/// +/// Must be called **before** opening any `Connection` that needs vec0 tables. +/// Safe to call multiple times (idempotent). +pub fn init_sqlite_vec() { + unsafe { + rusqlite::ffi::sqlite3_auto_extension(Some( + #[allow(clippy::missing_transmute_annotations)] + std::mem::transmute(sqlite3_vec_init as *const ()), + )); + } +} + +/// Create the `chunks_vec` virtual table if it doesn't already exist. +pub fn init_vec_table(conn: &Connection) -> Result<()> { + conn.execute( + "CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0( + embedding float[384] distance_metric=cosine + )", + [], + )?; + Ok(()) +} + +/// Insert a vector with the given ID. +pub fn insert_vec(conn: &Connection, vector_id: u64, embedding: &[f32]) -> Result<()> { + use zerocopy::AsBytes; + conn.execute( + "INSERT INTO chunks_vec(rowid, embedding) VALUES (?, ?)", + rusqlite::params![vector_id as i64, embedding.as_bytes()], + )?; + Ok(()) +} + +/// Delete a vector by its ID. +pub fn delete_vec(conn: &Connection, vector_id: u64) -> Result<()> { + conn.execute( + "DELETE FROM chunks_vec WHERE rowid = ?", + rusqlite::params![vector_id as i64], + )?; + Ok(()) +} + +/// Search for the `k` nearest neighbors of `query`, excluding `tombstones`. +/// +/// Returns `(vector_id, distance)` pairs sorted by ascending distance. +/// Cosine distance: 0.0 = identical, 2.0 = opposite. +pub fn search_vec( + conn: &Connection, + query: &[f32], + k: usize, + tombstones: &HashSet<u64>, +) -> Result<Vec<(u64, f32)>> { + use zerocopy::AsBytes; + + // Request extra results to compensate for tombstone filtering. 
+ let fetch_k = if tombstones.is_empty() { k } else { k * 2 }; + + let mut stmt = conn.prepare( + "SELECT rowid, distance + FROM chunks_vec + WHERE embedding MATCH ?1 + AND k = ?2", + )?; + + let rows = stmt.query_map(rusqlite::params![query.as_bytes(), fetch_k as i64], |row| { + let id: i64 = row.get(0)?; + let dist: f32 = row.get(1)?; + Ok((id as u64, dist)) + })?; + + let mut results: Vec<(u64, f32)> = Vec::new(); + for row in rows { + let (id, dist) = row?; + if tombstones.contains(&id) { + continue; + } + results.push((id, dist)); + if results.len() == k { + break; + } + } + + Ok(results) +} + +/// Delete all vectors from the table. +pub fn clear_vec(conn: &Connection) -> Result<()> { + conn.execute("DELETE FROM chunks_vec", [])?; + Ok(()) +} + +#[cfg(test)] +mod tests { + use super::*; + + fn setup_conn() -> Connection { + init_sqlite_vec(); + let conn = Connection::open_in_memory().unwrap(); + init_vec_table(&conn).unwrap(); + conn + } + + fn random_vector(seed: u64) -> Vec<f32> { + let mut state = seed.wrapping_mul(6364136223846793005).wrapping_add(1); + (0..384) + .map(|_| { + state = state.wrapping_mul(6364136223846793005).wrapping_add(1); + ((state >> 33) as f32) / (u32::MAX as f32) * 2.0 - 1.0 + }) + .collect() + } + + #[test] + fn test_init_vec_table() { + let conn = setup_conn(); + // Verify the table exists by querying sqlite_master. 
+ let count: i64 = conn + .query_row( + "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='chunks_vec'", + [], + |row| row.get(0), + ) + .unwrap(); + assert_eq!(count, 1, "chunks_vec table should exist"); + } + + #[test] + fn test_insert_and_search() { + let conn = setup_conn(); + let vectors: Vec<Vec<f32>> = (0..10).map(random_vector).collect(); + + for (i, v) in vectors.iter().enumerate() { + insert_vec(&conn, i as u64, v).unwrap(); + } + + let results = search_vec(&conn, &vectors[0], 5, &HashSet::new()).unwrap(); + assert!(!results.is_empty(), "search returned no results"); + assert_eq!( + results[0].0, 0, + "expected the query vector itself to be the top result" + ); + assert!( + results[0].1 < 0.01, + "distance to self should be near zero, got {}", + results[0].1 + ); + } + + #[test] + fn test_search_with_tombstones() { + let conn = setup_conn(); + let vectors: Vec<Vec<f32>> = (0..5).map(|i| random_vector(i + 100)).collect(); + + for (i, v) in vectors.iter().enumerate() { + insert_vec(&conn, i as u64, v).unwrap(); + } + + let mut tombstones = HashSet::new(); + tombstones.insert(0u64); + + let results = search_vec(&conn, &vectors[0], 5, &tombstones).unwrap(); + for (id, _) in &results { + assert_ne!(*id, 0, "tombstoned ID should not appear in results"); + } + } + + #[test] + fn test_delete_vec() { + let conn = setup_conn(); + insert_vec(&conn, 1, &random_vector(42)).unwrap(); + + let count_before: i64 = conn + .query_row("SELECT count(*) FROM chunks_vec", [], |row| row.get(0)) + .unwrap(); + assert_eq!(count_before, 1); + + delete_vec(&conn, 1).unwrap(); + + let count_after: i64 = conn + .query_row("SELECT count(*) FROM chunks_vec", [], |row| row.get(0)) + .unwrap(); + assert_eq!(count_after, 0); + } + + #[test] + fn test_empty_search() { + let conn = setup_conn(); + let query = random_vector(999); + let results = search_vec(&conn, &query, 5, &HashSet::new()).unwrap(); + assert!(results.is_empty(), "empty table should return no results"); + } +} diff 
--git 
a/src/writer.rs b/src/writer.rs new file mode 100644 index 0000000..529bd4f --- /dev/null +++ b/src/writer.rs @@ -0,0 +1,1155 @@ +use std::path::Path; + +use anyhow::{Result, bail}; +use ignore::WalkBuilder; +use sha2::{Digest, Sha256}; +use time::OffsetDateTime; + +use crate::chunker::{chunk_markdown, split_oversized_chunks}; +use crate::docid::generate_docid; +use crate::embedder::Embedder; +use crate::indexer::build_edges_for_file; +use crate::links; +use crate::placement::{self, PlacementHints}; +use crate::profile::VaultProfile; +use crate::store::Store; + +// ── Input / Output types ──────────────────────────────────────── + +#[derive(Debug, Clone)] +pub struct CreateNoteInput { + pub content: String, + pub filename: Option<String>, + pub type_hint: Option<String>, + pub tags: Vec<String>, + pub folder: Option<String>, + pub created_by: String, +} + +#[derive(Debug, Clone)] +pub struct AppendInput { + pub file: String, + pub content: String, + pub modified_by: String, +} + +#[derive(Debug, Clone)] +pub struct UpdateMetadataInput { + pub file: String, + pub tags: Option<Vec<String>>, + pub aliases: Option<Vec<String>>, + pub modified_by: String, +} + +#[derive(Debug, Clone, serde::Serialize)] +pub struct WriteResult { + pub path: String, + pub docid: String, + pub tags: Vec<String>, + pub links_added: Vec<String>, + pub folder: String, + pub confidence: f64, + pub strategy: String, +} + +// ── Helper functions ──────────────────────────────────────────── + +/// Strip characters that are invalid in filenames, keeping alphanumeric, spaces, dashes, underscores, and dots. +pub fn generate_filename(title: &str) -> String { + title + .chars() + .filter(|c| c.is_alphanumeric() || *c == ' ' || *c == '-' || *c == '_' || *c == '.') + .collect() +} + +/// Extract a title from content: first `# heading` or first non-empty line, truncated to 50 chars. 
+pub fn extract_title(content: &str) -> String { + for line in content.lines() { + let trimmed = line.trim(); + if trimmed.is_empty() { + continue; + } + if let Some(heading) = trimmed.strip_prefix("# ") { + let heading = heading.trim(); + if heading.len() > 50 { + return heading[..50].to_string(); + } + return heading.to_string(); + } + // First non-empty line + if trimmed.len() > 50 { + return trimmed[..50].to_string(); + } + return trimmed.to_string(); + } + "Untitled".to_string() +} + +/// Optional placement suggestion metadata for inbox notes. +pub struct PlacementSuggestion { + pub suggested_folder: String, + pub confidence: f64, + pub reason: String, +} + +/// Build YAML frontmatter string. +pub fn build_frontmatter( + tags: &[String], + created_by: Option<&str>, + aliases: Option<&[String]>, + suggestion: Option<&PlacementSuggestion>, +) -> String { + let mut fm = String::from("---\n"); + + if !tags.is_empty() { + fm.push_str("tags:\n"); + for tag in tags { + fm.push_str(&format!(" - {}\n", tag)); + } + } + + if let Some(aliases) = aliases + && !aliases.is_empty() + { + fm.push_str("aliases:\n"); + for alias in aliases { + fm.push_str(&format!(" - {}\n", alias)); + } + } + + fm.push_str(&format!("created: {}\n", today_date())); + + if let Some(by) = created_by { + fm.push_str(&format!("created_by: {}\n", by)); + } + + // Placement suggestion for inbox notes — user sees why it landed here + if let Some(s) = suggestion { + fm.push_str(&format!("suggested_folder: {}\n", s.suggested_folder)); + fm.push_str(&format!("confidence: {:.2}\n", s.confidence)); + fm.push_str(&format!("reason: \"{}\"\n", s.reason)); + } + + fm.push_str("---\n\n"); + fm +} + +/// Split content into (frontmatter_string, body_string). +/// If no frontmatter, returns ("", content). 
+pub fn split_frontmatter(content: &str) -> (String, String) { + let trimmed = content.trim_start(); + if !trimmed.starts_with("---") { + return (String::new(), content.to_string()); + } + + // Find the closing --- + let after_open = &trimmed[3..]; + // Skip past any remaining dashes and the newline + let after_open = after_open.trim_start_matches('-'); + let after_open = after_open.strip_prefix('\n').unwrap_or(after_open); + + if let Some(end_pos) = after_open.find("\n---") { + let fm_content = &after_open[..end_pos]; + let rest_start = end_pos + 4; // "\n---" + let rest = &after_open[rest_start..]; + // Skip trailing dashes and newline after closing --- + let rest = rest.trim_start_matches('-'); + let rest = rest.strip_prefix('\n').unwrap_or(rest); + + let fm = format!("---\n{}\n---\n", fm_content); + (fm, rest.to_string()) + } else { + (String::new(), content.to_string()) + } +} + +/// Returns today's date as "YYYY-MM-DD". +pub fn today_date() -> String { + let now = OffsetDateTime::now_utc(); + format!( + "{:04}-{:02}-{:02}", + now.year(), + now.month() as u8, + now.day() + ) +} + +/// Compute SHA-256 hash of content bytes, returned as hex string. +fn compute_content_hash(content: &str) -> String { + let mut hasher = Sha256::new(); + hasher.update(content.as_bytes()); + format!("{:x}", hasher.finalize()) +} + +/// Get file mtime as seconds since epoch. +fn file_mtime(path: &Path) -> Result<i64> { + let meta = std::fs::metadata(path)?; + let mtime = meta + .modified()? + .duration_since(std::time::UNIX_EPOCH) + .unwrap_or_default(); + Ok(mtime.as_secs() as i64) +} + +/// Pre-computed chunk data ready for store insertion. +type ChunkData = (String, String, Vec<f32>, i64); // (heading, snippet, vector, token_count) + +/// Chunk content, embed, and return pre-computed data ready for store insertion. 
+fn precompute_chunks(content: &str, embedder: &mut Embedder) -> Result> { + let parsed = chunk_markdown(content); + let chunks = split_oversized_chunks(parsed.chunks, &|s| s.split_whitespace().count(), 512, 50); + + let texts: Vec<&str> = chunks.iter().map(|c| c.text.as_str()).collect(); + let embeddings = embedder.embed_batch(&texts)?; + + let mut results = Vec::with_capacity(chunks.len()); + for (chunk, embedding) in chunks.into_iter().zip(embeddings) { + let heading = chunk.heading.unwrap_or_default(); + let token_count = chunk.text.split_whitespace().count() as i64; + results.push((heading, chunk.snippet, embedding, token_count)); + } + Ok(results) +} + +/// Write content to a temp file and atomically rename to final path. +/// Returns error if final_path already exists and `allow_overwrite` is false. +fn atomic_write(final_path: &Path, content: &str, allow_overwrite: bool) -> Result<()> { + if !allow_overwrite && final_path.exists() { + bail!( + "file already exists at {}, refusing to overwrite", + final_path.display() + ); + } + + // Ensure parent directory exists + if let Some(parent) = final_path.parent() { + std::fs::create_dir_all(parent)?; + } + + let temp_path = final_path.with_extension("md.tmp"); + std::fs::write(&temp_path, content)?; + std::fs::rename(&temp_path, final_path)?; + Ok(()) +} + +/// Clean up incomplete writes from a previous crash. +/// Scans vault for .md.tmp files and removes them. +pub fn cleanup_temp_files(vault_path: &Path) -> Result { + let mut cleaned = 0; + for entry in WalkBuilder::new(vault_path).standard_filters(true).build() { + let entry = entry?; + let path = entry.path(); + if path.is_file() + && path.extension().is_some_and(|e| e == "tmp") + && path.to_string_lossy().ends_with(".md.tmp") + { + std::fs::remove_file(path)?; + cleaned += 1; + } + } + Ok(cleaned) +} + +// ── Pipeline functions ────────────────────────────────────────── + +/// Create a new note via the 5-step write pipeline. 
+pub fn create_note( + input: CreateNoteInput, + store: &Store, + embedder: &mut Embedder, + vault_path: &Path, + profile: Option<&VaultProfile>, +) -> Result { + // Step 1: Determine filename + let title = if let Some(ref name) = input.filename { + name.clone() + } else { + extract_title(&input.content) + }; + let filename = generate_filename(&title); + let filename = if filename.ends_with(".md") { + filename + } else { + format!("{}.md", filename) + }; + + // Step 2: Resolve tags + let resolved_tags = store.resolve_tags(&input.tags)?; + + // Step 3: Discover links and apply them + let discovered = links::discover_links(store, &input.content, vault_path)?; + let links_added: Vec = discovered.iter().map(|l| l.target_path.clone()).collect(); + + // Apply discovered links to content — wrap matched text in [[wikilinks]] + let mut content_with_links = input.content.clone(); + // Apply in reverse order of position to preserve offsets + let mut replacements: Vec<(usize, usize, String)> = Vec::new(); + let content_lower = content_with_links.to_lowercase(); + for link in &discovered { + let search_lower = link.matched_text.to_lowercase(); + if let Some(pos) = content_lower.find(&search_lower) { + let end = pos + link.matched_text.len(); + let original_text = &content_with_links[pos..end]; + let wikilink = if let Some(ref display) = link.display { + format!( + "[[{}|{}]]", + link.target_path.trim_end_matches(".md"), + display + ) + } else { + format!("[[{}]]", original_text) + }; + replacements.push((pos, end, wikilink)); + } + } + // Sort by position descending so replacements don't shift offsets + replacements.sort_by(|a, b| b.0.cmp(&a.0)); + for (start, end, replacement) in replacements { + content_with_links.replace_range(start..end, &replacement); + } + + // Step 4: Determine folder placement + let placement_result = if let Some(ref folder) = input.folder { + placement::PlacementResult { + folder: folder.clone(), + confidence: 1.0, + strategy: 
placement::PlacementStrategy::TypeRule, + reason: "Explicit folder".to_string(), + suggestion: None, + } + } else { + let hints = PlacementHints { + type_hint: input.type_hint.clone(), + tags: resolved_tags.clone(), + }; + placement::place_note(&content_with_links, &hints, profile, store, Some(embedder))? + }; + + // Step 5: Build frontmatter and assemble content + // If placement fell back to inbox with a suggestion, inject suggested_folder metadata + let suggestion = if placement_result.strategy == placement::PlacementStrategy::InboxFallback { + placement_result + .suggestion + .as_ref() + .map(|(folder, conf)| PlacementSuggestion { + suggested_folder: folder.clone(), + confidence: *conf, + reason: format!("semantic similarity: {conf:.3}"), + }) + } else { + None + }; + let frontmatter = build_frontmatter( + &resolved_tags, + Some(&input.created_by), + None, + suggestion.as_ref(), + ); + let full_content = format!("{}{}", frontmatter, content_with_links); + + let rel_path = format!("{}/{}", placement_result.folder, filename); + let final_path = vault_path.join(&rel_path); + + // Check for existing file before doing expensive work + if final_path.exists() { + bail!( + "file already exists at {}, refusing to overwrite", + final_path.display() + ); + } + + // Step 6: Pre-compute chunks + embeddings BEFORE transaction + let chunk_data = precompute_chunks(&full_content, embedder)?; + + let content_hash = compute_content_hash(&full_content); + let docid = generate_docid(&rel_path); + + // Write to temp file first + if let Some(parent) = final_path.parent() { + std::fs::create_dir_all(parent)?; + } + let temp_path = final_path.with_extension("md.tmp"); + std::fs::write(&temp_path, &full_content)?; + + // Step 7: BEGIN IMMEDIATE transaction + store.begin_transaction()?; + let result = (|| -> Result { + let mtime = file_mtime(&temp_path).unwrap_or(0); + let file_id = store.insert_file(&rel_path, &content_hash, mtime, &resolved_tags, &docid)?; + + let mut next_vid = 
store.next_vector_id()?; + for (chunk_seq, (heading, snippet, vector, token_count)) in chunk_data.iter().enumerate() { + let vid = next_vid; + next_vid += 1; + store.insert_chunk_with_vector(file_id, heading, snippet, vid, *token_count, vector)?; + store.insert_vec(vid, vector)?; + store.insert_fts_chunk(file_id, chunk_seq as i64, snippet)?; + } + + build_edges_for_file(store, file_id, &full_content)?; + + // Register new tags + for tag in &resolved_tags { + store.register_tag(tag, &input.created_by)?; + } + + Ok(file_id) + })(); + + match result { + Ok(_) => { + // Step 8: COMMIT + store.commit()?; + // Step 9: Atomic rename temp → final + std::fs::rename(&temp_path, &final_path)?; + // Update stored mtime to match the actual file after rename + // (OS may adjust mtime during rename) + let actual_mtime = file_mtime(&final_path).unwrap_or(0); + store.insert_file( + &rel_path, + &content_hash, + actual_mtime, + &resolved_tags, + &docid, + )?; + + // Incrementally update folder centroid with new note's vectors + if let Ok(centroids) = store.get_folder_centroids() { + let folder = &placement_result.folder; + let new_vecs: Vec<&[f32]> = + chunk_data.iter().map(|(_, _, v, _)| v.as_slice()).collect(); + if !new_vecs.is_empty() { + let existing = centroids.iter().find(|(f, _)| f == folder); + let dim = 384; + let updated_centroid = if let Some((_, old_centroid)) = existing { + // Weighted merge: old centroid already represents N vectors, + // new vectors are added. Approximate by averaging old centroid with new mean. 
+ let mut new_mean = vec![0.0f32; dim]; + for v in &new_vecs { + for (i, val) in v.iter().enumerate() { + new_mean[i] += val; + } + } + let n = new_vecs.len() as f32; + for val in &mut new_mean { + *val /= n; + } + // Weighted average: existing has more weight + let old_weight = 0.9f32; + let new_weight = 0.1f32; + old_centroid + .iter() + .zip(new_mean.iter()) + .map(|(o, n)| o * old_weight + n * new_weight) + .collect::>() + } else { + // First note in this folder — centroid IS the mean of new vectors + let mut mean = vec![0.0f32; dim]; + for v in &new_vecs { + for (i, val) in v.iter().enumerate() { + mean[i] += val; + } + } + let n = new_vecs.len() as f32; + for val in &mut mean { + *val /= n; + } + mean + }; + let _ = store.upsert_folder_centroid(folder, &updated_centroid, new_vecs.len()); + } + } + } + Err(e) => { + let _ = store.rollback(); + let _ = std::fs::remove_file(&temp_path); + return Err(e); + } + } + + let strategy_name = format!("{:?}", placement_result.strategy); + Ok(WriteResult { + path: rel_path, + docid, + tags: resolved_tags, + links_added, + folder: placement_result.folder, + confidence: placement_result.confidence, + strategy: strategy_name, + }) +} + +/// Append content to an existing note. +pub fn append_to_note( + input: AppendInput, + store: &Store, + embedder: &mut Embedder, + vault_path: &Path, +) -> Result { + // Step 1: Resolve file + let file_record = store + .resolve_file(&input.file)? 
+ .ok_or_else(|| anyhow::anyhow!("file not found: {}", input.file))?; + + let full_path = vault_path.join(&file_record.path); + + // Step 2: Mtime conflict check + let disk_mtime = file_mtime(&full_path)?; + if disk_mtime != file_record.mtime { + bail!( + "mtime conflict: file {} was modified outside engraph (disk={}, indexed={})", + file_record.path, + disk_mtime, + file_record.mtime + ); + } + + // Step 3: Append content + let existing_content = std::fs::read_to_string(&full_path)?; + let new_content = format!("{}\n{}", existing_content.trim_end(), input.content); + + // Step 4: Pre-compute new chunks + embeddings + let chunk_data = precompute_chunks(&new_content, embedder)?; + + let content_hash = compute_content_hash(&new_content); + let docid = file_record + .docid + .clone() + .unwrap_or_else(|| generate_docid(&file_record.path)); + + // Write to temp file + let temp_path = full_path.with_extension("md.tmp"); + std::fs::write(&temp_path, &new_content)?; + + // Step 5: Transaction — delete old data, re-insert + store.begin_transaction()?; + let result = (|| -> Result { + // Tombstone old vectors + let old_vids = store.get_vector_ids_for_file(file_record.id)?; + + for vid in &old_vids { + store.delete_vec(*vid)?; + } + + // Delete old chunks, FTS, edges + store.delete_fts_chunks_for_file(file_record.id)?; + store.delete_edges_for_file(file_record.id)?; + store.delete_file(file_record.id)?; + + // Re-insert file + let mtime = file_mtime(&temp_path).unwrap_or(0); + let file_id = store.insert_file( + &file_record.path, + &content_hash, + mtime, + &file_record.tags, + &docid, + )?; + + let mut next_vid = store.next_vector_id()?; + for (chunk_seq, (heading, snippet, vector, token_count)) in chunk_data.iter().enumerate() { + let vid = next_vid; + next_vid += 1; + store.insert_chunk_with_vector(file_id, heading, snippet, vid, *token_count, vector)?; + store.insert_vec(vid, vector)?; + store.insert_fts_chunk(file_id, chunk_seq as i64, snippet)?; + } + + 
build_edges_for_file(store, file_id, &new_content)?; + Ok(file_id) + })(); + + match result { + Ok(_) => { + store.commit()?; + // Step 6: Rename temp → final + std::fs::rename(&temp_path, &full_path)?; + // Update stored mtime to match actual file after rename + let actual_mtime = file_mtime(&full_path).unwrap_or(0); + store.insert_file( + &file_record.path, + &content_hash, + actual_mtime, + &file_record.tags, + &docid, + )?; + } + Err(e) => { + let _ = store.rollback(); + let _ = std::fs::remove_file(&temp_path); + return Err(e); + } + } + + let folder = file_record + .path + .rsplit_once('/') + .map(|(f, _)| f.to_string()) + .unwrap_or_default(); + + Ok(WriteResult { + path: file_record.path, + docid, + tags: file_record.tags, + links_added: vec![], + folder, + confidence: 1.0, + strategy: "Append".to_string(), + }) +} + +/// Update frontmatter metadata only (tags, aliases). +pub fn update_metadata( + input: UpdateMetadataInput, + store: &Store, + vault_path: &Path, +) -> Result { + // Step 1: Resolve file + let file_record = store + .resolve_file(&input.file)? 
+ .ok_or_else(|| anyhow::anyhow!("file not found: {}", input.file))?; + + let full_path = vault_path.join(&file_record.path); + + // Step 2: Mtime conflict check + let disk_mtime = file_mtime(&full_path)?; + if disk_mtime != file_record.mtime { + bail!( + "mtime conflict: file {} was modified outside engraph (disk={}, indexed={})", + file_record.path, + disk_mtime, + file_record.mtime + ); + } + + // Step 3: Parse existing frontmatter and build new + let existing_content = std::fs::read_to_string(&full_path)?; + let (_old_fm, body) = split_frontmatter(&existing_content); + + let tags = input.tags.unwrap_or_else(|| file_record.tags.clone()); + let aliases_vec = input.aliases.unwrap_or_default(); + let aliases_ref: Option<&[String]> = if aliases_vec.is_empty() { + None + } else { + Some(&aliases_vec) + }; + + let new_fm = build_frontmatter(&tags, Some(&input.modified_by), aliases_ref, None); + let new_content = format!("{}{}", new_fm, body); + + // Step 4: Write via temp + rename + let content_hash = compute_content_hash(&new_content); + let docid = file_record + .docid + .clone() + .unwrap_or_else(|| generate_docid(&file_record.path)); + + atomic_write(&full_path, &new_content, true)?; + + // Step 5: Update store record (metadata-only, no re-chunking) + let mtime = file_mtime(&full_path)?; + store.insert_file(&file_record.path, &content_hash, mtime, &tags, &docid)?; + + // Register tags + for tag in &tags { + store.register_tag(tag, &input.modified_by)?; + } + + let folder = file_record + .path + .rsplit_once('/') + .map(|(f, _)| f.to_string()) + .unwrap_or_default(); + + Ok(WriteResult { + path: file_record.path, + docid, + tags, + links_added: vec![], + folder, + confidence: 1.0, + strategy: "UpdateMetadata".to_string(), + }) +} + +/// Move a note to a new folder. +pub fn move_note( + file: &str, + new_folder: &str, + store: &Store, + vault_path: &Path, +) -> Result { + // Step 1: Resolve file + let file_record = store + .resolve_file(file)? 
+ .ok_or_else(|| anyhow::anyhow!("file not found: {}", file))?; + + let old_path = vault_path.join(&file_record.path); + let basename = file_record + .path + .rsplit('/') + .next() + .unwrap_or(&file_record.path); + let new_rel_path = format!("{}/{}", new_folder, basename); + let new_full_path = vault_path.join(&new_rel_path); + + if new_full_path.exists() { + bail!("target path already exists: {}", new_full_path.display()); + } + + // Read content for re-indexing + let content = std::fs::read_to_string(&old_path)?; + let content_hash = compute_content_hash(&content); + let new_docid = generate_docid(&new_rel_path); + + // Ensure target directory exists + if let Some(parent) = new_full_path.parent() { + std::fs::create_dir_all(parent)?; + } + + // Step 2: Transaction — delete old record, insert new + store.begin_transaction()?; + let result = (|| -> Result<()> { + // Tombstone old vectors + let old_vids = store.get_vector_ids_for_file(file_record.id)?; + + for vid in &old_vids { + store.delete_vec(*vid)?; + } + store.delete_fts_chunks_for_file(file_record.id)?; + store.delete_edges_for_file(file_record.id)?; + store.delete_file(file_record.id)?; + + // Insert with new path (reuse existing chunks data via insert_file only for the record) + let mtime = file_mtime(&old_path)?; + store.insert_file( + &new_rel_path, + &content_hash, + mtime, + &file_record.tags, + &new_docid, + )?; + + Ok(()) + })(); + + match result { + Ok(()) => { + store.commit()?; + // Step 3: Rename file on disk + std::fs::rename(&old_path, &new_full_path)?; + } + Err(e) => { + let _ = store.rollback(); + return Err(e); + } + } + + Ok(WriteResult { + path: new_rel_path, + docid: new_docid, + tags: file_record.tags, + links_added: vec![], + folder: new_folder.to_string(), + confidence: 1.0, + strategy: "Move".to_string(), + }) +} + +// ── Archive / Unarchive ───────────────────────────────────────── + +/// Archive a note: move to archive folder, add archived frontmatter, remove from index. 
+/// The note becomes invisible to search/context but is physically preserved. +pub fn archive_note( + file: &str, + store: &Store, + vault_path: &Path, + profile: Option<&crate::profile::VaultProfile>, +) -> Result { + let file_record = store + .resolve_file(file)? + .ok_or_else(|| anyhow::anyhow!("file not found: {}", file))?; + + let archive_folder = profile + .and_then(|p| p.structure.folders.archive.as_deref()) + .unwrap_or("04-Archive"); + + // Don't archive something already in the archive + if file_record.path.starts_with(archive_folder) { + bail!("note is already archived: {}", file_record.path); + } + + let old_path = vault_path.join(&file_record.path); + let new_rel_path = format!("{}/{}", archive_folder, file_record.path); + let new_full_path = vault_path.join(&new_rel_path); + + // Read content and inject archive frontmatter + let content = std::fs::read_to_string(&old_path)?; + let (_old_fm, body) = split_frontmatter(&content); + + // Preserve existing tags, add archived metadata + let mut tags = file_record.tags.clone(); + if !tags.contains(&"archived".to_string()) { + tags.push("archived".to_string()); + } + + let archive_fm = format!( + "---\n\ + archived: true\n\ + archived_at: {}\n\ + archived_from: {}\n\ + tags:\n{}\ + ---\n\n", + today_date(), + file_record.path, + tags.iter() + .map(|t| format!(" - {}\n", t)) + .collect::(), + ); + let new_content = format!("{}{}", archive_fm, body); + + // Ensure target directory + if let Some(parent) = new_full_path.parent() { + std::fs::create_dir_all(parent)?; + } + + // Write archived file to new location + atomic_write(&new_full_path, &new_content, false)?; + + // Remove from index (note disappears from search) + let old_vids = store.get_vector_ids_for_file(file_record.id)?; + for vid in &old_vids { + store.delete_vec(*vid)?; + } + store.delete_fts_chunks_for_file(file_record.id)?; + store.delete_edges_for_file(file_record.id)?; + store.delete_file(file_record.id)?; + + // Remove original file + 
std::fs::remove_file(&old_path)?; + + let docid = file_record.docid.unwrap_or_default(); + + Ok(WriteResult { + path: new_rel_path, + docid, + tags, + links_added: vec![], + folder: archive_folder.to_string(), + confidence: 1.0, + strategy: "Archive".to_string(), + }) +} + +/// Unarchive a note: move back to original location, strip archive frontmatter, re-index. +pub fn unarchive_note( + file: &str, + store: &Store, + embedder: &mut Embedder, + vault_path: &Path, +) -> Result { + // Resolve — the file may not be in the index (archived notes are excluded). + // Try resolving by direct path on disk. + let archive_path = vault_path.join(file); + if !archive_path.exists() { + bail!("archived note not found: {}", file); + } + + let content = std::fs::read_to_string(&archive_path)?; + let (fm_str, body) = split_frontmatter(&content); + + // Extract archived_from from frontmatter + let original_path = fm_str + .lines() + .find(|l| l.starts_with("archived_from:")) + .and_then(|l| l.strip_prefix("archived_from:")) + .map(|s| s.trim().to_string()) + .ok_or_else(|| { + anyhow::anyhow!("no archived_from in frontmatter — cannot determine original location") + })?; + + let restore_full_path = vault_path.join(&original_path); + + if restore_full_path.exists() { + bail!( + "cannot unarchive: a file already exists at {}", + original_path + ); + } + + // Rebuild frontmatter without archive fields + let mut tags: Vec = fm_str + .lines() + .skip_while(|l| !l.starts_with("tags:")) + .skip(1) + .take_while(|l| l.starts_with(" - ")) + .filter_map(|l| l.strip_prefix(" - ")) + .map(|s| s.trim().to_string()) + .filter(|t| t != "archived") + .collect(); + if tags.is_empty() { + // Try inline tags format + if let Some(line) = fm_str.lines().find(|l| l.starts_with("tags:")) + && let Some(rest) = line.strip_prefix("tags:") + { + let rest = rest.trim().trim_start_matches('[').trim_end_matches(']'); + tags = rest + .split(',') + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty() && s != 
"archived") + .collect(); + } + } + + let new_fm = build_frontmatter(&tags, None, None, None); + let restored_content = format!("{}{}", new_fm, body); + + // Ensure target directory + if let Some(parent) = restore_full_path.parent() { + std::fs::create_dir_all(parent)?; + } + + // Write restored file + atomic_write(&restore_full_path, &restored_content, false)?; + + // Index the restored note + let chunk_data = precompute_chunks(&restored_content, embedder)?; + let content_hash = compute_content_hash(&restored_content); + let docid = generate_docid(&original_path); + let mtime = file_mtime(&restore_full_path).unwrap_or(0); + + store.begin_transaction()?; + let result = (|| -> Result<()> { + let file_id = store.insert_file(&original_path, &content_hash, mtime, &tags, &docid)?; + + let mut next_vid = store.next_vector_id()?; + for (seq, (heading, snippet, vector, token_count)) in chunk_data.iter().enumerate() { + let vid = next_vid; + next_vid += 1; + store.insert_chunk_with_vector(file_id, heading, snippet, vid, *token_count, vector)?; + store.insert_vec(vid, vector)?; + store.insert_fts_chunk(file_id, seq as i64, snippet)?; + } + + build_edges_for_file(store, file_id, &restored_content)?; + + for tag in &tags { + store.register_tag(tag, "unarchive")?; + } + + Ok(()) + })(); + + match result { + Ok(()) => store.commit()?, + Err(e) => { + let _ = store.rollback(); + let _ = std::fs::remove_file(&restore_full_path); + return Err(e); + } + } + + // Remove archived file + std::fs::remove_file(&archive_path)?; + + let folder = original_path + .rsplit_once('/') + .map(|(f, _)| f.to_string()) + .unwrap_or_default(); + + Ok(WriteResult { + path: original_path, + docid, + tags, + links_added: vec![], + folder, + confidence: 1.0, + strategy: "Unarchive".to_string(), + }) +} + +// ── Index integrity ───────────────────────────────────────────── + +/// Verify that all indexed files still exist on disk. +/// Removes orphan DB entries for files that no longer exist. 
+/// Returns the number of orphan entries cleaned up. +pub fn verify_index_integrity(store: &Store, vault_path: &Path) -> Result<usize> { + let all_files = store.get_all_files()?; + let mut orphans = 0; + for file in &all_files { + let full_path = vault_path.join(&file.path); + if !full_path.exists() { + // Clean up orphan: vectors, FTS, edges, file record + let vids = store.get_vector_ids_for_file(file.id)?; + for vid in &vids { + store.delete_vec(*vid)?; + } + store.delete_fts_chunks_for_file(file.id)?; + store.delete_edges_for_file(file.id)?; + store.delete_file(file.id)?; + orphans += 1; + } + } + Ok(orphans) +} + +// ── Tests ─────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_generate_filename() { + assert_eq!(generate_filename("My Great Note"), "My Great Note"); + assert_eq!(generate_filename("Note/With:Bad*Chars"), "NoteWithBadChars"); + } + + #[test] + fn test_extract_title() { + assert_eq!(extract_title("# Hello World\nBody"), "Hello World"); + assert_eq!(extract_title("Just some text"), "Just some text"); + } + + #[test] + fn test_extract_title_empty() { + assert_eq!(extract_title(""), "Untitled"); + } + + #[test] + fn test_extract_title_truncation() { + let long_title = "a".repeat(100); + let content = format!("# {}\nBody", long_title); + assert_eq!(extract_title(&content).len(), 50); + } + + #[test] + fn test_build_frontmatter() { + let fm = build_frontmatter( + &["work".to_string(), "engraph".to_string()], + Some("claude-code"), + None, + None, + ); + assert!(fm.starts_with("---\n")); + assert!(fm.ends_with("---\n\n")); + assert!(fm.contains("work")); + assert!(fm.contains("created_by: claude-code")); + } + + #[test] + fn test_build_frontmatter_with_aliases() { + let fm = build_frontmatter( + &["test".to_string()], + Some("writer"), + Some(&["alias1".to_string(), "alias2".to_string()]), + None, + ); + assert!(fm.contains("aliases:")); + assert!(fm.contains(" - alias1")); + 
assert!(fm.contains(" - alias2")); + } + + #[test] + fn test_split_frontmatter() { + let content = "---\ntags: [a]\n---\n\nBody text"; + let (fm, body) = split_frontmatter(content); + assert!(fm.contains("tags")); + assert_eq!(body.trim(), "Body text"); + } + + #[test] + fn test_split_frontmatter_no_fm() { + let content = "Just body text"; + let (fm, body) = split_frontmatter(content); + assert!(fm.is_empty()); + assert_eq!(body, "Just body text"); + } + + #[test] + fn test_cleanup_temp_files() { + let dir = tempfile::TempDir::new().unwrap(); + std::fs::write(dir.path().join("note.md.tmp"), "incomplete").unwrap(); + std::fs::write(dir.path().join("good.md"), "complete").unwrap(); + std::fs::write(dir.path().join("other.tmp"), "not md tmp").unwrap(); + + let cleaned = cleanup_temp_files(dir.path()).unwrap(); + assert_eq!(cleaned, 1); + assert!(!dir.path().join("note.md.tmp").exists()); + assert!(dir.path().join("good.md").exists()); + assert!(dir.path().join("other.tmp").exists()); // .tmp but not .md.tmp + } + + #[test] + fn test_today_date_format() { + let date = today_date(); + assert_eq!(date.len(), 10); + assert_eq!(&date[4..5], "-"); + assert_eq!(&date[7..8], "-"); + } + + #[test] + fn test_build_frontmatter_with_suggestion() { + let suggestion = PlacementSuggestion { + suggested_folder: "02-Areas/Development".to_string(), + confidence: 0.58, + reason: "semantic similarity: 0.580".to_string(), + }; + let fm = build_frontmatter( + &["work".to_string()], + Some("claude-code"), + None, + Some(&suggestion), + ); + assert!(fm.contains("suggested_folder: 02-Areas/Development")); + assert!(fm.contains("confidence: 0.58")); + assert!(fm.contains("reason: \"semantic similarity: 0.580\"")); + } + + #[test] + fn test_verify_index_integrity() { + let dir = tempfile::TempDir::new().unwrap(); + let vault = dir.path(); + std::fs::create_dir_all(vault.join("notes")).unwrap(); + std::fs::write(vault.join("notes/existing.md"), "# Exists").unwrap(); + + let store = 
crate::store::Store::open_memory().unwrap(); + // Insert two files: one exists on disk, one does not + store + .insert_file( + "notes/existing.md", + "hash1", + 100, + &[], + &crate::docid::generate_docid("notes/existing.md"), + ) + .unwrap(); + store + .insert_file( + "notes/gone.md", + "hash2", + 100, + &[], + &crate::docid::generate_docid("notes/gone.md"), + ) + .unwrap(); + + let orphans = verify_index_integrity(&store, vault).unwrap(); + assert_eq!(orphans, 1); + + // The gone file should be removed from the store + assert!(store.get_file("notes/gone.md").unwrap().is_none()); + // The existing file should still be there + assert!(store.get_file("notes/existing.md").unwrap().is_some()); + } + + #[test] + fn test_compute_content_hash() { + let h1 = compute_content_hash("hello"); + let h2 = compute_content_hash("hello"); + let h3 = compute_content_hash("world"); + assert_eq!(h1, h2); + assert_ne!(h1, h3); + assert_eq!(h1.len(), 64); // SHA-256 hex + } +} diff --git a/tests/write_pipeline.rs b/tests/write_pipeline.rs new file mode 100644 index 0000000..302f66d --- /dev/null +++ b/tests/write_pipeline.rs @@ -0,0 +1,160 @@ +//! Integration tests for the write pipeline. +//! 
Run with: cargo test --test write_pipeline -- --ignored + +use std::path::Path; + +use engraph::embedder::Embedder; +use engraph::store::Store; +use engraph::vecstore; +use engraph::writer::{AppendInput, CreateNoteInput, append_to_note, create_note}; + +fn setup(vault_dir: &Path) -> (Store, Embedder) { + // Register sqlite-vec extension + vecstore::init_sqlite_vec(); + + // Create minimal vault structure + std::fs::create_dir_all(vault_dir.join("00-Inbox")).unwrap(); + std::fs::create_dir_all(vault_dir.join("03-Resources/People")).unwrap(); + std::fs::write( + vault_dir.join("03-Resources/People/Steve Barbera.md"), + "# Steve Barbera\n\nRole: VP Engineering\n", + ) + .unwrap(); + + // Open store and set vault path + let data_dir = tempfile::TempDir::new().unwrap(); + let db_path = data_dir.path().join("engraph.db"); + let store = Store::open(&db_path).unwrap(); + store + .set_meta("vault_path", &vault_dir.to_string_lossy()) + .unwrap(); + + // Index the existing file so it's in the store + let docid = engraph::docid::generate_docid("03-Resources/People/Steve Barbera.md"); + store + .insert_file( + "03-Resources/People/Steve Barbera.md", + "hash1", + 0, + &[], + &docid, + ) + .unwrap(); + + // Load embedder + let models_dir = engraph::config::Config::data_dir().unwrap().join("models"); + let embedder = Embedder::new(&models_dir).unwrap(); + + (store, embedder) +} + +#[test] +#[ignore] // requires model download +fn test_create_note_is_immediately_searchable() { + let vault_dir = tempfile::TempDir::new().unwrap(); + let (store, mut embedder) = setup(vault_dir.path()); + + let input = CreateNoteInput { + content: + "# RRF Tuning Notes\n\nWe tested reciprocal rank fusion with k=60 and got good results." 
+ .into(), + filename: Some("RRF Tuning".into()), + type_hint: None, + tags: vec!["engraph".into(), "search".into()], + folder: Some("00-Inbox".into()), + created_by: "test".into(), + }; + + let result = create_note(input, &store, &mut embedder, vault_dir.path(), None).unwrap(); + assert!(result.path.starts_with("00-Inbox/")); + assert!(result.path.ends_with(".md")); + assert!(!result.docid.is_empty()); + + // Verify the file exists on disk + assert!(vault_dir.path().join(&result.path).exists()); + + // Verify it's immediately searchable via sqlite-vec + let search = + engraph::search::search_internal("reciprocal rank fusion", 5, &store, &mut embedder) + .unwrap(); + assert!( + !search.results.is_empty(), + "created note should be searchable immediately" + ); + assert!( + search.results.iter().any(|r| r.file_path == result.path), + "created note should appear in search results" + ); +} + +#[test] +#[ignore] +fn test_append_updates_index() { + let vault_dir = tempfile::TempDir::new().unwrap(); + let (store, mut embedder) = setup(vault_dir.path()); + + // Create a note first + let input = CreateNoteInput { + content: "# Meeting Notes\n\nDiscussed the roadmap for Q2.".into(), + filename: Some("Meeting 2026-03-25".into()), + type_hint: None, + tags: vec![], + folder: Some("00-Inbox".into()), + created_by: "test".into(), + }; + let created = create_note(input, &store, &mut embedder, vault_dir.path(), None).unwrap(); + + // Append new content + let append_input = AppendInput { + file: created.path.clone(), + content: "## Action Items\n\n- Ship sqlite-vec migration by Friday\n- Review PR #42".into(), + modified_by: "test".into(), + }; + let _appended = append_to_note(append_input, &store, &mut embedder, vault_dir.path()).unwrap(); + + // Verify appended content is searchable + let search = + engraph::search::search_internal("sqlite-vec migration", 5, &store, &mut embedder).unwrap(); + assert!( + search.results.iter().any(|r| r.file_path == created.path), + "appended 
content should be searchable" + ); +} + +#[test] +#[ignore] +fn test_conflict_detection() { + let vault_dir = tempfile::TempDir::new().unwrap(); + let (store, mut embedder) = setup(vault_dir.path()); + + let input = CreateNoteInput { + content: "# Test Note\n\nOriginal content.".into(), + filename: Some("conflict-test".into()), + type_hint: None, + tags: vec![], + folder: Some("00-Inbox".into()), + created_by: "test".into(), + }; + let created = create_note(input, &store, &mut embedder, vault_dir.path(), None).unwrap(); + + // Modify file externally (simulates Obsidian edit) + let abs_path = vault_dir.path().join(&created.path); + // Wait a moment so mtime changes + std::thread::sleep(std::time::Duration::from_millis(1100)); + std::fs::write(&abs_path, "# Modified externally\n\nNew content.").unwrap(); + + // Attempt append — should fail with conflict + let append_input = AppendInput { + file: created.path, + content: "appended content".into(), + modified_by: "test".into(), + }; + let result = append_to_note(append_input, &store, &mut embedder, vault_dir.path()); + assert!(result.is_err(), "should detect mtime conflict"); + let err_msg = result.unwrap_err().to_string(); + assert!( + err_msg.contains("mtime conflict") || err_msg.contains("CONFLICT"), + "error should mention conflict, got: {}", + err_msg + ); +}
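
Note on `test_conflict_detection`: the test sleeps for 1100 ms before rewriting the file, which suggests the conflict check compares modification times at whole-second granularity. The sketch below illustrates that kind of check in isolation. The names `mtime_secs` and `check_mtime_conflict` are hypothetical and are not taken from `writer.rs`; the real `file_mtime` helper and its error type may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Hypothetical helper: modification time of `path` in whole seconds
/// since the Unix epoch. Sub-second precision is discarded, so two
/// writes within the same second are indistinguishable.
fn mtime_secs(path: &Path) -> io::Result<i64> {
    let modified = fs::metadata(path)?.modified()?;
    let secs = modified
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0); // timestamps before the epoch are treated as 0
    Ok(secs)
}

/// Hypothetical check: refuse to proceed when the mtime recorded in the
/// index differs from the mtime currently observed on disk.
fn check_mtime_conflict(recorded: i64, on_disk: i64) -> Result<(), String> {
    if recorded == on_disk {
        Ok(())
    } else {
        Err(format!(
            "mtime conflict: index has {recorded}, disk has {on_disk}"
        ))
    }
}
```

Under this assumption, an external edit that lands in the same second as the indexed write would go undetected, which would explain why the test waits just over one second before modifying the file.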