JorgeV92/minisearch

minisearch

minisearch is a small Rust search engine crate and CLI for indexing local text content. It combines an inverted index with BM25-style scoring; phrase, proximity, and fuzzy matching; metadata filters; highlighted snippets; and on-disk persistence.

Features

  • Recursive directory indexing for .txt and .md files by default
  • Custom indexing options for file extensions and maximum file size
  • Lowercased alphanumeric tokenization with positional postings
  • BM25-style ranking for term queries
  • Phrase search with quoted queries like "distributed systems"
  • Proximity search with slop syntax like "distributed systems"~3
  • Fuzzy search with typo-tolerant term syntax like serch~1
  • Metadata filters like ext:rs, path:guides/, and title:search
  • Highlighted snippets with [[...]] markers around matched text
  • Required and excluded terms or phrases via +term, -term, +"phrase", and -"phrase"
  • Search-time filters for path prefixes and minimum score thresholds
  • Simple save/load support for persisting an index to disk
  • Lightweight stats helpers for vocabulary inspection and top terms
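
To make "lowercased alphanumeric tokenization with positional postings" concrete, here is a minimal sketch of the idea. It is not the crate's implementation: the names tokenize, Postings, and add_to_postings are hypothetical, and the exact splitting rules minisearch applies may differ.

```rust
use std::collections::HashMap;

/// Split text into lowercased alphanumeric tokens, pairing each token with
/// its 0-based token position in the document.
/// (Hypothetical helper, not part of the minisearch API.)
fn tokenize(text: &str) -> Vec<(String, usize)> {
    text.split(|c: char| !c.is_alphanumeric())
        // Drop the empty pieces produced by consecutive separators
        // before numbering, so positions count real tokens only.
        .filter(|piece| !piece.is_empty())
        .enumerate()
        .map(|(position, piece)| (piece.to_lowercase(), position))
        .collect()
}

/// Inverted index shape assumed for this sketch:
/// term -> document id -> token positions of the term in that document.
type Postings = HashMap<String, HashMap<usize, Vec<usize>>>;

/// Record every token of `text` under `doc_id`.
fn add_to_postings(postings: &mut Postings, doc_id: usize, text: &str) {
    for (term, position) in tokenize(text) {
        postings
            .entry(term)
            .or_default()
            .entry(doc_id)
            .or_default()
            .push(position);
    }
}
```

Storing positions rather than bare counts is what makes phrase and proximity queries possible: the term frequency is the length of a position list, and the document frequency is the number of document ids stored under a term.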

Install

cargo add minisearch

To use the CLI locally:

cargo run -- <command>

Python Bindings

The repository also includes a small Python wrapper in python/minisearch that calls the Rust backend through a C ABI exposed by this crate.

Build the shared library:

cargo build --release --features python-bindings

Run the Python example from the repo root:

python3 python/examples/basic.py

Or use it directly:

from minisearch import SearchEngine, SearchOptions

engine = SearchEngine()
engine.add_document("guides/rust.txt", "Rust phrase search and BM25 ranking.")
results = engine.search("rust", SearchOptions(top_k=5))

for result in results:
    print(result.path, result.score)

If the shared library lives outside target/release or target/debug, set MINISEARCH_LIBRARY to the full path before importing minisearch.

Quick Start

use minisearch::{SearchEngine, SearchOptions};

fn main() {
    let mut engine = SearchEngine::new();
    engine.add_document(
        "guides/project.txt",
        "A mini search engine in Rust with BM25 ranking and phrase search.",
    );
    engine.add_document(
        "notes/distributed.txt",
        "This document talks about distributed systems and indexing.",
    );

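    // The query combines metadata filters (path:, ext:), an optional term (rust),
    // a fuzzy term (serch~1), and a required phrase (+"phrase search").
    let results = engine.search_with_options(
        "path:guides/ ext:txt rust serch~1 +\"phrase search\"",
        &SearchOptions::new(10).with_path_prefix("guides/"),
    );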

    for result in results {
        println!(
            "{} -> {:.3} [{}]",
            result.path,
            result.score,
            result.matched_terms.join(", ")
        );
        if let Some(snippet) = result.snippet {
            println!("snippet: {snippet}");
        }
    }
}

Query Syntax

| Syntax | Meaning | Example |
| --- | --- | --- |
| rust bm25 | Optional terms ranked by BM25 | rust bm25 |
| +rust | Required term | +rust search |
| -java | Excluded term | rust -java |
| serch~1 | Fuzzy term match within edit distance 1 | serch~1 |
| ext:rs | Required extension filter | ext:rs |
| path:guides/ | Required path-prefix filter | path:guides/ |
| title:search | Required title-term filter | title:search |
| "phrase search" | Phrase boost / phrase-only search | "phrase search" |
| "distributed systems"~3 | Ordered proximity search with up to 3 extra tokens between terms | "distributed systems"~3 |
| +"phrase search" | Required phrase | rust +"phrase search" |
| -"toy example" | Excluded phrase | rust -"toy example" |

Notes:

  • Optional terms contribute score when they appear.
  • Required terms and required phrases must match for a document to be returned.
  • Excluded terms and phrases remove a document from the result set.
  • Fuzzy terms use term~N and match indexed terms within edit distance N.
  • Metadata filters default to required; use -ext:md, -path:notes/, or -title:generated to exclude.
  • Phrase-only queries work even when no standalone terms are present.
  • Proximity phrases preserve term order and allow up to N extra intervening tokens.
  • Search results include an optional highlighted snippet built from the original stored document text.
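
The fuzzy and proximity rules in the notes above can be sketched as two small functions. This is an illustration under stated assumptions, not the crate's code: edit_distance and phrase_within_slop are hypothetical names, the distance is assumed to be plain Levenshtein distance over characters, and "extra intervening tokens" is read as the matched span minus the length of the phrase itself.

```rust
/// Levenshtein edit distance between two strings, counted in characters.
/// (Assumption: the README does not say which edit distance is used.)
fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // previous[j] = distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut previous: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, a_char) in a.chars().enumerate() {
        let mut current = vec![i + 1];
        for (j, b_char) in b_chars.iter().enumerate() {
            let substitution = previous[j] + usize::from(a_char != *b_char);
            let deletion = previous[j + 1] + 1;
            let insertion = current[j] + 1;
            current.push(substitution.min(deletion).min(insertion));
        }
        previous = current;
    }
    previous[b_chars.len()]
}

/// Ordered proximity check for one document.
/// `term_positions[i]` holds the token positions of the i-th phrase term.
/// Returns true when the terms occur in order with at most `slop`
/// extra tokens spread between them.
fn phrase_within_slop(term_positions: &[Vec<usize>], slop: usize) -> bool {
    let (first, rest) = match term_positions.split_first() {
        Some(parts) => parts,
        None => return false,
    };
    first.iter().any(|&start| {
        let mut last = start;
        for positions in rest {
            // Taking the earliest later occurrence keeps the span as
            // short as possible for this starting position.
            match positions.iter().copied().filter(|&p| p > last).min() {
                Some(next) => last = next,
                None => return false,
            }
        }
        // An exact phrase spans rest.len() steps; anything beyond is slop.
        last - start <= rest.len() + slop
    })
}
```

With these definitions, serch~1 would accept the indexed term search because edit_distance("serch", "search") is 1, and "distributed systems"~3 would accept a document where up to three other tokens sit between the two words, but not one where systems comes first.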

Library API

Build an Index from a Directory

use minisearch::{IndexOptions, SearchEngine};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let options = IndexOptions::default()
        .with_extensions(["md", "txt", "rs"])
        .with_max_file_size_bytes(250_000);

    let engine = SearchEngine::build_from_directory_with_options("src", &options)?;
    println!("Indexed {} documents", engine.document_count());
    Ok(())
}

Filter Search Results

use minisearch::{SearchEngine, SearchOptions};

fn main() {
    let mut engine = SearchEngine::new();
    engine.add_document("guides/rust.md", "rust search engine rust phrase search");
    engine.add_document("notes/rust.md", "rust notes");

    let options = SearchOptions::new(5)
        .with_path_prefix("guides/")
        .with_min_score(1.0);

    for result in engine.search_with_options("rust", &options) {
        println!("{} -> {:.3}", result.path, result.score);
    }
}
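
A threshold such as with_min_score(1.0) is easier to choose with the scoring formula in mind. The README describes the ranking only as "BM25-style", so the sketch below shows the textbook Okapi BM25 per-term score rather than the crate's exact variant. The function name bm25_term_score and the constants k1 = 1.2 and b = 0.75 are assumptions for illustration.

```rust
/// Textbook Okapi BM25 score of one term in one document.
/// (Hypothetical helper; minisearch's constants and IDF form may differ.)
///
/// * `tf`          - occurrences of the term in the document
/// * `df`          - number of documents containing the term
/// * `doc_count`   - total number of indexed documents
/// * `doc_len`     - length of this document in tokens
/// * `avg_doc_len` - average document length in tokens
fn bm25_term_score(tf: f64, df: f64, doc_count: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    const K1: f64 = 1.2; // term-frequency saturation (assumed value)
    const B: f64 = 0.75; // strength of length normalization (assumed value)

    // Rare terms get a larger inverse document frequency.
    let idf = (1.0 + (doc_count - df + 0.5) / (df + 0.5)).ln();
    // Documents longer than average are penalized, shorter ones favored.
    let length_norm = 1.0 - B + B * doc_len / avg_doc_len;

    idf * tf * (K1 + 1.0) / (tf + K1 * length_norm)
}
```

Under this formula a document's score for a multi-term query is the sum of its per-term scores, so scores grow with rarer terms, repeated matches, and shorter documents. In a tiny index every term is common and scores stay small, which means a fixed minimum score can filter out everything.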

Inspect the Vocabulary

use minisearch::SearchEngine;

fn main() {
    let mut engine = SearchEngine::new();
    engine.add_document("guide.txt", "rust rust search");
    engine.add_document("notes.txt", "rust indexing");

    println!("document frequency: {}", engine.document_frequency("rust"));
    println!("term frequency in doc 0: {}", engine.term_frequency(0, "rust"));

    for stat in engine.top_terms(3) {
        println!(
            "{} -> total {}, docs {}",
            stat.term, stat.total_frequency, stat.document_frequency
        );
    }
}

Save and Reload an Index

use minisearch::SearchEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut engine = SearchEngine::new();
    engine.add_document("guide.txt", "rust search engine rust bm25");
    engine.save_to_path("sample.idx")?;

    let loaded = SearchEngine::load_from_path("sample.idx")?;
    println!("loaded {} documents", loaded.document_count());
    Ok(())
}

CLI

Commands

minisearch index <docs_dir> <index_file> [--ext=txt,md,rs] [--max-bytes=1048576]
minisearch search <index_file> <query> [top_k] [--path-prefix=guides/] [--min-score=1.0]
minisearch stats <index_file> [top_terms]
minisearch demo

Examples

cargo run -- index docs search.idx --ext=txt,md,rs --max-bytes=100000
cargo run -- search search.idx 'rust +"phrase search"' 5 --path-prefix=guides/
cargo run -- search search.idx 'bm25' --min-score=1.0
cargo run -- stats search.idx 10
cargo run -- demo

Included Examples

Run any example with cargo run --example <name>.

  • basic: in-memory indexing plus filtered search
  • custom_indexing: directory indexing with custom extensions and file size limits
  • filtered_search: search-time path and score filters
  • persistence: save/load and vocabulary statistics
  • query_syntax: inspect parsed queries and required/excluded phrases

Public Types

  • SearchEngine: the main in-memory index
  • SearchOptions: search-time filters like top_k, path_prefix, and min_score
  • IndexOptions: directory indexing controls for extensions and max file size
  • SearchResult: a matched document with its score and matched terms; its snippet field contains a highlighted excerpt using [[...]] markers when source content is available
  • TermStat: aggregated term statistics for reporting
  • ParsedQuery / PhraseQuery / FuzzyTermQuery / MetadataFilter: parsed query structures if you want to inspect or cache queries
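
The [[...]] markers in SearchResult snippets can be illustrated with a short sketch. It covers only the highlighting step, not how an excerpt window is chosen, and it is not taken from the crate: highlight and flush_word are hypothetical names, and matching is assumed to be a case-insensitive comparison of whole alphanumeric words.

```rust
/// Append a finished word to `output`, wrapped in [[...]] when its
/// lowercased form is one of the matched terms, then reset the buffer.
fn flush_word(word: &mut String, terms: &[&str], output: &mut String) {
    if word.is_empty() {
        return;
    }
    let lowered = word.to_lowercase();
    if terms.contains(&lowered.as_str()) {
        output.push_str("[[");
        output.push_str(word);
        output.push_str("]]");
    } else {
        output.push_str(word);
    }
    word.clear();
}

/// Wrap every word of `text` that matches one of `terms` in [[...]],
/// leaving the original casing and punctuation untouched.
/// `terms` are expected to be lowercased already.
fn highlight(text: &str, terms: &[&str]) -> String {
    let mut output = String::with_capacity(text.len());
    let mut word = String::new();
    for c in text.chars() {
        if c.is_alphanumeric() {
            word.push(c);
        } else {
            flush_word(&mut word, terms, &mut output);
            output.push(c);
        }
    }
    // Handle a word that runs to the end of the text.
    flush_word(&mut word, terms, &mut output);
    output
}
```

For example, highlight("Rust phrase search and BM25 ranking.", &["rust", "bm25"]) returns "[[Rust]] phrase search and [[BM25]] ranking.". Working from the original text rather than from tokens is why the index has to store document content for snippets to be available.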

Persistence Format

Indexes are stored in a plain-text format that begins with the MSE3 header and records:

  • average document length
  • document metadata including extension, title, and modified timestamp
  • original content for snippet generation
  • positional postings for each term

Older MSE1 and MSE2 indexes still load. Legacy indexes derive missing metadata from the stored path/content, and MSE1 results still lack snippets because those files never stored original content.
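
A loader that accepts all three formats has to branch on the header before parsing anything else. The sketch below shows one way to do that. It assumes the header tag is the first thing on the first line of the file, which the README implies but does not spell out, and IndexVersion, detect_version, and can_build_snippets are hypothetical names rather than crate APIs.

```rust
/// Index format generations named in the README.
#[derive(Debug, PartialEq, Clone, Copy)]
enum IndexVersion {
    Mse1,
    Mse2,
    Mse3,
}

/// Identify the format from the first line of an index file.
/// Returns None for empty input or an unrecognized header.
/// (Assumption: the first line starts with the header tag.)
fn detect_version(file_contents: &str) -> Option<IndexVersion> {
    let first_line = file_contents.lines().next()?;
    if first_line.starts_with("MSE3") {
        Some(IndexVersion::Mse3)
    } else if first_line.starts_with("MSE2") {
        Some(IndexVersion::Mse2)
    } else if first_line.starts_with("MSE1") {
        Some(IndexVersion::Mse1)
    } else {
        None
    }
}

/// Whether results loaded from this format can carry snippets.
/// The README states that MSE1 never stored original content; treating
/// MSE2 as snippet-capable is inferred from that wording.
fn can_build_snippets(version: IndexVersion) -> bool {
    version != IndexVersion::Mse1
}
```

Rejecting unknown headers up front, instead of guessing, keeps a corrupted or foreign file from being parsed as an index.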
