ignite-tools

Tools for working with text embeddings. Pick the right model for your data, find duplicates, classify, search by meaning. Runs on a laptop.

What this is

You have text data. Support tickets, product listings, documents, commit messages, whatever. You want to do something useful with it: categorize it, find the duplicates, make it searchable, group similar items together.

These tools do that. They use embedding models under the hood, but you don't need to know what that means to get value out of them.

Tools

Tool	What it does	Status
`ignite-eval`	Figures out which embedding model works best for your specific data	Working
`ignite-read`	Shows you what your data looks like before you process it	Working
`ignite-explore`	Finds duplicates, natural groupings, outliers in your text	Planned
`ignite-classify`	Sorts text into categories without training a model	Planned
`ignite-index`	Makes your content searchable by meaning, not just keywords	Planned

Install

git clone https://github.com/Artain-AI/ignite-tools.git
cd ignite-tools
pip install -e .            # core (local data only)
pip install -e ".[cloud]"   # + S3/Azure download support
pip install -e ".[all]"     # cloud + all optional formats

Requires Python 3.10 or newer.

Quick start

Find the right model for your data

ignite-eval your-data/ --yes

What happens:

1. Reads your data. Figures out the language, text length, domain.
2. Picks 3-4 models that make sense for your situation.
3. Downloads them, runs each one on your data.
4. Measures speed and quality.
5. Tells you which one to use, and why.
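If you want a mental model of the run-and-measure part, here is a minimal sketch. It is not ignite-eval's actual code: `benchmark`, `pair_auc`, and the `embed` callables are hypothetical names, and scoring quality as the AUC of same-label versus different-label cosine similarities is one common choice that this README does not confirm.

```python
import math
import time
from itertools import combinations
from typing import Callable, Sequence

Vector = Sequence[float]
EmbedFn = Callable[[Sequence[str]], list]  # texts in, one vector per text out


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_auc(vectors: Sequence[Vector], labels: Sequence[str]) -> float:
    """Chance that a same-label pair is more similar than a different-label pair."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (same if labels[i] == labels[j] else diff).append(sim)
    if not same or not diff:
        raise ValueError("need both same-label and different-label pairs")
    # Ties count as half a win, as in the usual rank-based AUC.
    wins = sum((s > d) + 0.5 * (s == d) for s in same for d in diff)
    return wins / (len(same) * len(diff))


def benchmark(models: dict, texts: Sequence[str], labels: Sequence[str]) -> list:
    """Time each hypothetical embed function and score it, best AUC first."""
    results = []
    for name, embed in models.items():
        start = time.perf_counter()
        vectors = embed(texts)
        elapsed = time.perf_counter() - start
        results.append({
            "model": name,
            "auc": pair_auc(vectors, labels),
            "texts_per_sec": len(texts) / elapsed if elapsed > 0 else float("inf"),
        })
    return sorted(results, key=lambda r: r["auc"], reverse=True)
```

The pairwise loop is quadratic in the number of texts, so a sketch like this only makes sense on a sample of the data.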

Output looks like this:

-- Result ---------------------------------------------------

  Recommendation: BGE-small
  Best balance of quality (AUC 0.72) and speed (1200 texts/sec).
  384-dimensional embeddings, 450 MB memory.
  Confidence: +++ (clear winner)

  Why not the others:
    - MiniLM-L12: lower quality (AUC 0.65)
    - E5-small: similar quality but slower

  Benchmarked on:
    Apple M2 Max, 32 GB RAM, Apple Silicon GPU

Look at your data first

ignite-read your-data/ --yes

Shows file structure, text lengths, detected languages, topic distribution, label balance. Useful for sanity-checking before you run anything expensive.

Configuration

One config file controls everything: ignite-format.yaml. It tells the tools where your data is, how to read it, and what each tool should do.

Auto-detection

If you don't have a config, the tool creates one for you:

ignite-eval your-data/ --save-config ./ignite-format.yaml

This sniffs your data (format, fields, languages, structure) and writes a proposed config. Review it, edit if needed, then use it for all runs.

Basic structure

# Where the data is
storage:
  type: local
  path: ./your-data/
  recursive: true

# How to parse it
format:
  type: jsonl

# Where the text lives in each record
text:
  fields: [body, title]

# Optional: which field has category labels
labels:
  field: category
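
To make the `text` and `labels` keys concrete, here is a rough sketch of how one JSONL record could be turned into a text and a label. `extract` is a hypothetical helper, and joining the fields in config order with a newline is an assumption; the tools may combine fields differently.

```python
from __future__ import annotations

import json


def extract(line: str, text_fields: list[str], label_field: str | None = None):
    """Return (text, label) for one JSONL line, following text.fields and labels.field."""
    record = json.loads(line)
    # Skip fields that are missing or empty instead of failing on them.
    parts = [str(record[name]) for name in text_fields if record.get(name)]
    text = "\n".join(parts)  # assumption: fields joined in config order
    label = record.get(label_field) if label_field else None
    return text, label


line = '{"title": "Refund request", "body": "Order arrived damaged.", "category": "billing"}'
text, label = extract(line, ["body", "title"], "category")
```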

Tool-specific settings

Each tool can have its own settings in the same file. Use the tool's name as the key:

# Shared data reading (used by all tools)
storage: ...
text: ...
labels: ...

# ignite-read settings
ignite-read:
  sections: [corpus_stats, per_source, top_words]
  top_words:
    count: 30

# ignite-eval settings
ignite-eval:
  task: classify
  priority: quality
  constraints:
    max_size_mb: 500

You can also put tool settings in a separate file:

# In ignite-format.yaml:
ignite-read: ./read-settings.yaml
ignite-eval: ./eval-settings.yaml

The tool loads its own block and ignores the others.
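
A small sketch of that lookup, assuming the config has already been parsed into a dict. `tool_settings` and `load_file` are hypothetical names, and resolving a relative path against the config file's directory is an assumption, not documented behavior.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def tool_settings(
    config: dict[str, Any],
    tool: str,
    config_dir: Path,
    load_file: Callable[[Path], dict],
) -> dict:
    """Return the settings block for one tool, inline or from a separate file."""
    block = config.get(tool)
    if isinstance(block, str):
        # A string value is treated as a path to a separate settings file.
        return load_file(config_dir / block)
    return block or {}  # inline mapping, or nothing configured for this tool
```

In practice `load_file` could wrap a YAML parser, for example PyYAML's `yaml.safe_load` applied to the file's contents.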

Config discovery

The tools look for a config in this order:

1. --config path (explicit, always wins)
2. ./ignite-format.yaml in the current directory
3. ignite-format.yaml next to the data
4. ~/.config/ignite-tools/ignite-format.yaml (global default)
5. Auto-detect and propose (interactive)
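
The search order above can be sketched in a few lines. `find_config` is a hypothetical function, and reading "next to the data" as "inside the data directory, or beside the data file" is an assumption.

```python
from __future__ import annotations

from pathlib import Path

CONFIG_NAME = "ignite-format.yaml"


def find_config(
    data_path: Path,
    explicit: Path | None = None,
    cwd: Path | None = None,
    home: Path | None = None,
) -> Path | None:
    """Return the first config found in the documented order, or None."""
    if explicit is not None:
        return explicit  # --config always wins
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    # Assumption: "next to the data" means the data directory itself,
    # or the parent directory when the data path is a single file.
    data_dir = data_path if data_path.is_dir() else data_path.parent
    candidates = [
        cwd / CONFIG_NAME,
        data_dir / CONFIG_NAME,
        home / ".config" / "ignite-tools" / CONFIG_NAME,
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller falls back to auto-detect and propose
```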

Full config reference: docs/format-config.md

Data formats

JSONL: one JSON object per line. Supports .gz and .zst compression.
CSV/TSV: tabular. Header row expected.
Plain text: one record per line, or one file per record.

The tools read from local disk, S3 (s3://bucket/path/), or Azure Blob (azure://account/container/). S3 and Azure need the cloud extra: pip install -e ".[cloud]".
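
For a feel of what the JSONL format means in practice, here is a minimal standard-library reader for local files. `read_jsonl` is a hypothetical helper, not part of ignite-tools. It handles plain and .gz files only; .zst is left out because it needs a third-party package.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```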

How model selection works

The evaluator looks at three things:

Your data. Language, average text length, how many records, what domain the vocabulary suggests.
Your requirements (optional). What task you're doing (search, classification, clustering), whether you care more about speed or quality.
Your hardware. CPU, Apple Silicon GPU, or NVIDIA GPU.

Based on those three inputs, it picks models from a registry of 42 open-source options, runs them, and tells you which one performed best on your actual data. Not on a generic benchmark. On yours.
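
As a rough picture of the shortlisting step, the sketch below filters a registry by language, task, and size limit, then orders what is left by priority. Everything in it is hypothetical: the `ModelInfo` fields, the `shortlist` function, and especially the ordering rule, which uses model size as a crude stand-in for the speed/quality trade-off. The real evaluator's rules are not described in this README.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical registry entry; the real registry fields may differ."""
    name: str
    size_mb: int
    languages: frozenset[str]
    tasks: frozenset[str]


def shortlist(
    registry: list[ModelInfo],
    language: str,
    task: str,
    priority: str = "quality",
    max_size_mb: int | None = None,
    limit: int = 4,
) -> list[ModelInfo]:
    """Keep models that fit the data and constraints, then order by priority."""
    fits = [
        m for m in registry
        if language in m.languages
        and task in m.tasks
        and (max_size_mb is None or m.size_mb <= max_size_mb)
    ]
    # Crude proxy (assumption): larger models first for quality, smaller first for speed.
    fits.sort(key=lambda m: m.size_mb, reverse=(priority == "quality"))
    return fits[:limit]
```

The arguments line up with the `ignite-eval` block in the config: `task`, `priority`, and `constraints.max_size_mb`. The shortlisted models still have to be run on your data; this step only decides which ones are worth downloading.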

When you outgrow your laptop

These tools run fine on a MacBook or a cheap cloud VM. At some point your data gets big enough that embedding takes too long:

Under 10K texts: seconds.
10K to 100K: minutes.
Over 1M: you probably want IgniteMS, which does the same thing 100x faster on GPU hardware.
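
To estimate where your own data falls, divide the number of texts by the throughput ignite-eval reports. The sketch below plugs in the 1200 texts/sec from the sample output above; that figure is only an example from one machine, your throughput will depend on model, text length, and hardware, and model download and load time are not included.

```python
def embedding_time_seconds(num_texts: int, texts_per_sec: float) -> float:
    """Estimated wall-clock time to embed num_texts at a steady throughput."""
    if texts_per_sec <= 0:
        raise ValueError("texts_per_sec must be positive")
    return num_texts / texts_per_sec


# 1200 texts/sec is the example figure from the sample output, not a guarantee.
for n in (10_000, 100_000, 1_000_000):
    secs = embedding_time_seconds(n, 1200)
    print(f"{n:>9,} texts: {secs:7.0f} s ({secs / 60:.1f} min)")
```

At that rate, 10K texts take about 8 seconds, 100K about a minute and a half, and 1M about 14 minutes per model.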

License

Apache 2.0
