Tools for working with text embeddings. Pick the right model for your data, find duplicates, classify, search by meaning. Runs on a laptop.
You have text data. Support tickets, product listings, documents, commit messages, whatever. You want to do something useful with it: categorize it, find the duplicates, make it searchable, group similar items together.
These tools do that. They use embedding models under the hood, but you don't need to know what that means to get value out of them.
| Tool | What it does | Status |
|---|---|---|
| ignite-eval | Figures out which embedding model works best for your specific data | Working |
| ignite-read | Shows you what your data looks like before you process it | Working |
| ignite-explore | Finds duplicates, natural groupings, outliers in your text | Planned |
| ignite-classify | Sorts text into categories without training a model | Planned |
| ignite-index | Makes your content searchable by meaning, not just keywords | Planned |
git clone https://github.com/Artain-AI/ignite-tools.git
cd ignite-tools
pip install -e . # core (local data only)
pip install -e ".[cloud]" # + S3/Azure download support
pip install -e ".[all]" # cloud + all optional formats
Requires Python 3.10 or newer.
ignite-eval your-data/ --yes
What happens:
- Reads your data. Figures out the language, text length, domain.
- Picks 3-4 models that make sense for your situation.
- Downloads them, runs each one on your data.
- Measures speed and quality.
- Tells you which one to use, and why.
Output looks like this:
-- Result ---------------------------------------------------
Recommendation: BGE-small
Best balance of quality (AUC 0.72) and speed (1200 texts/sec).
384-dimensional embeddings, 450 MB memory.
Confidence: +++ (clear winner)
Why not the others:
- MiniLM-L12: lower quality (AUC 0.65)
- E5-small: similar quality but slower
Benchmarked on:
Apple M2 Max, 32 GB RAM, Apple Silicon GPU
ignite-read your-data/ --yes
Shows file structure, text lengths, detected languages, topic distribution, label balance. Useful for sanity-checking before you run anything expensive.
One config file controls everything: ignite-format.yaml. It tells the tools where your data is, how to read it, and what each tool should do.
If you don't have a config, the tool creates one for you:
ignite-eval your-data/ --save-config ./ignite-format.yaml
This sniffs your data (format, fields, languages, structure) and writes a proposed config. Review it, edit if needed, then use it for all runs.
# Where the data is
storage:
  type: local
  path: ./your-data/
  recursive: true
# How to parse it
format:
  type: jsonl
# Where the text lives in each record
text:
  fields: [body, title]
# Optional: which field has category labels
labels:
  field: category

Each tool can have its own settings in the same file. Use the tool's name as the key:
# Shared data reading (used by all tools)
storage: ...
text: ...
labels: ...
# ignite-read settings
ignite-read:
  sections: [corpus_stats, per_source, top_words]
  top_words:
    count: 30
# ignite-eval settings
ignite-eval:
  task: classify
  priority: quality
  constraints:
    max_size_mb: 500

You can also put tool settings in a separate file:
# In ignite-format.yaml:
ignite-read: ./read-settings.yaml
ignite-eval: ./eval-settings.yaml

The tool loads its own block and ignores the others.
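
If you want a mental model of "loads its own block and ignores the others", here is a minimal sketch. It is not the tools' actual code. `tool_settings` and `load_file` are hypothetical names, and the rule "a string value is a path to a separate file" is inferred from the example above. In real use `load_file` would probably wrap a YAML parser such as PyYAML's `yaml.safe_load`; the sketch takes it as a parameter so it runs without extra packages.

```python
from pathlib import Path
from typing import Any, Callable


def tool_settings(
    config: dict[str, Any],
    tool: str,
    load_file: Callable[[Path], dict[str, Any]],
    base_dir: Path = Path("."),
) -> dict[str, Any]:
    """Return the settings block for one tool, ignoring other tools' blocks."""
    value = config.get(tool, {})
    if isinstance(value, str):
        # Assumption: a string value points at a separate settings file.
        return load_file(base_dir / value)
    # Inline block: return a copy so callers cannot mutate the shared config.
    return dict(value)


# Parsed form of a config with one inline block and one external file.
config = {
    "storage": {"type": "local", "path": "./your-data/"},
    "ignite-read": {"sections": ["corpus_stats", "per_source", "top_words"]},
    "ignite-eval": "./eval-settings.yaml",
}


def fake_loader(path: Path) -> dict[str, Any]:
    # Stand-in for reading and parsing the file at `path`.
    return {"task": "classify", "priority": "quality"}


read_settings = tool_settings(config, "ignite-read", fake_loader)
eval_settings = tool_settings(config, "ignite-eval", fake_loader)
```

Passing the loader in keeps the lookup rule separate from the file format, which also makes it easy to test.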
The tools look for a config in this order:
1. --config path (explicit, always wins)
2. ./ignite-format.yaml in the current directory
3. ignite-format.yaml next to the data
4. ~/.config/ignite-tools/ignite-format.yaml (global default)
5. Auto-detect and propose (interactive)
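
The lookup order can be sketched in a few lines of Python. This is an illustration, not the tools' implementation. `find_config` is a hypothetical helper, and "next to the data" is read here as "inside the data directory", which is an assumption. A return value of `None` stands for the last step, where the tool would auto-detect and propose a config.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "ignite-format.yaml"


def find_config(
    data_dir: Path,
    explicit: Optional[Path] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first config in the documented order, or None if none exists."""
    if explicit is not None:
        # --config always wins, so no other location is checked.
        return explicit
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = [
        cwd / CONFIG_NAME,                                # current directory
        data_dir / CONFIG_NAME,                           # next to the data (assumed: inside it)
        home / ".config" / "ignite-tools" / CONFIG_NAME,  # global default
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller falls back to auto-detect and propose
```

`cwd` and `home` are parameters only so the function can be exercised against a temporary directory.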
Full config reference: docs/format-config.md
- JSONL: one JSON object per line. Supports .gz and .zst compression.
- CSV/TSV: tabular. Header row expected.
- Plain text: one record per line, or one file per record.
Reads from local disk, S3 (s3://bucket/path/), or Azure Blob (azure://account/container/).
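
To make the JSONL case concrete, here is a rough sketch of reading local JSONL records and pulling text out of the configured fields (`text.fields` in the config). It is not the tools' reader. `read_jsonl_texts` is a hypothetical name, and joining the fields with a newline is an assumption. It handles plain and .gz files with the standard library only. .zst would need a third-party package, so it is left out.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonl_texts(path: Path, fields: list[str]) -> Iterator[str]:
    """Yield one text per JSONL record, built from the configured fields."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only fields that are present and non-empty, in config order.
            parts = [str(record[name]) for name in fields if record.get(name)]
            if parts:
                yield "\n".join(parts)  # assumption: fields joined by newline
```

With `fields=["body", "title"]`, a record like `{"title": "Hi", "body": "There"}` yields the body followed by the title, and records with neither field are skipped.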
The evaluator looks at three things:
- Your data. Language, average text length, how many records, what domain the vocabulary suggests.
- Your requirements (optional). What task you're doing (search, classification, clustering), whether you care more about speed or quality.
- Your hardware. CPU, Apple Silicon GPU, or NVIDIA GPU.
Based on those three inputs, it picks models from a registry of 42 open-source options, runs them, and tells you which one performed best on your actual data. Not on a generic benchmark. On yours.
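
This README does not say how the quality score (the AUC in the sample output) is computed. One common way to score embeddings on labeled data, shown here purely as an assumption, is a pairwise AUC: the probability that two texts with the same label are more similar than two texts with different labels. `cosine` and `pair_auc` are hypothetical helpers, and the vectors below are toy values, not model output.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_auc(embeddings: list[list[float]], labels: list[str]) -> float:
    """Probability that a same-label pair outscores a different-label pair."""
    same, diff = [], []
    for i, j in combinations(range(len(labels)), 2):
        score = cosine(embeddings[i], embeddings[j])
        (same if labels[i] == labels[j] else diff).append(score)
    if not same or not diff:
        raise ValueError("need at least one same-label and one different-label pair")
    # Count wins over all (same, diff) combinations; ties count as half a win.
    wins = sum((s > d) + 0.5 * (s == d) for s in same for d in diff)
    return wins / (len(same) * len(diff))


# Toy data: two tight clusters that line up with the labels.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
score = pair_auc(vectors, ["a", "a", "b", "b"])
```

A score near 1.0 means the labels are well separated in embedding space, and 0.5 is what random vectors would give. The all-pairs loop is quadratic, so a real evaluator would likely sample pairs.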
These tools run fine on a MacBook or a cheap cloud VM. At some point your data gets big enough that embedding takes too long:
- Under 10K texts: seconds.
- 10K to 100K: minutes.
- Over 1M: you probably want IgniteMS, which does the same thing 100x faster on GPU hardware.
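
For a rough sense of where those thresholds come from, divide the corpus size by the model's throughput. The sketch below uses the 1200 texts/sec figure from the sample output above, which is one model on one machine, so treat the numbers as an illustration only. `embedding_seconds` is a hypothetical helper.

```python
def embedding_seconds(n_texts: int, texts_per_sec: float) -> float:
    """Estimated wall-clock time for one embedding pass over the corpus."""
    return n_texts / texts_per_sec


# 1200 texts/sec is the example throughput from the sample output.
for n in (10_000, 100_000, 1_000_000):
    secs = embedding_seconds(n, 1200)
    print(f"{n:>9,} texts: {secs:7.1f} s ({secs / 60:.1f} min)")
```

At that rate, 10K texts take about 8 seconds, 100K about 83 seconds, and 1M about 14 minutes per pass. An evaluation runs several models, so multiply accordingly.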
Apache 2.0