manigen is a fast and reliable CLI tool designed to generate file manifests (lists of file paths) for Machine Learning and Deep Learning datasets.
Whether you are preparing a small local dataset or parsing a massive image corpus like ImageNet, manigen handles recursive scanning, multithreading, path formatting, and Train/Val/Test splitting out of the box.
- ⏱️ Efficient & Multithreaded: Uses a thread pool to parallelize I/O operations, significantly speeding up the scanning of large and deeply nested directory trees compared to sequential scripts.
- ✂️ Portable Manifests: Generate relative paths by stripping absolute prefixes (`--strip-prefix`), making it easy to move datasets between local machines and cloud servers.
- 🔀 ML-Ready Splits: Built-in shuffling and automatic Train/Validation/Test dataset splitting directly into separate files (`--split`).
- 🛡️ Robust Architecture: Built with modern Python, featuring thread-safe list operations, strict input validation, and clear error handling.
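The thread-pool idea behind the first bullet can be sketched roughly as follows. This is an illustration only, not manigen's actual implementation: `scan`, `scan_subtree`, `is_match`, and `IMAGE_EXTENSIONS` are hypothetical names, and the sketch assumes the work is divided by top-level subdirectory (for example, one ImageNet class folder per task).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: same extensions as manigen's documented defaults, matched case-insensitively.
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".bmp"}


def is_match(path: Path) -> bool:
    """True for regular files whose extension is in IMAGE_EXTENSIONS."""
    return path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS


def scan_subtree(directory: Path) -> list[Path]:
    """Recursively collect matching files below one directory."""
    return [p for p in directory.rglob("*") if is_match(p)]


def scan(root: Path, threads: int = 8) -> list[Path]:
    """Scan root, handing each top-level subdirectory to a worker thread."""
    entries = list(root.iterdir())
    found = [p for p in entries if is_match(p)]
    subdirs = [p for p in entries if p.is_dir()]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each worker returns its own list; merging happens in this thread.
        for files in pool.map(scan_subtree, subdirs):
            found.extend(files)
    return sorted(found)
```

In this sketch each worker builds a private list and the results are merged in the calling thread, so no lock around a shared list is needed. Threads help here because directory listing is I/O-bound.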
While working on Super-Resolution Deep Learning projects, I found myself repeatedly copying the same massive datasets across multiple project directories. To save disk space, I decided to store all datasets in a single central location (e.g., ~/.local/share/datasets) and feed the models using simple text files containing absolute paths to the images.
Initially, I wrote a bash script for this task. However, generating a manifest for the ImageNet dataset took about 30 minutes. By rewriting the tool in Python and leveraging multithreading, manigen can now generate a manifest for ImageNet (1,281,167 images) in 12 seconds.
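On the training side, consuming such a manifest can be very simple. The sketch below assumes the manifest is a UTF-8 text file with one path per line; `load_manifest` is a hypothetical helper for illustration and is not part of manigen.

```python
from pathlib import Path


def load_manifest(manifest_path: str) -> list[Path]:
    """Read a manifest, assuming one file path per line."""
    with open(manifest_path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not yield an empty path.
        return [Path(line.strip()) for line in handle if line.strip()]
```

A dataset class can then index into the returned list and open each image lazily, so the images themselves stay in the central location.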
You can install manigen directly from PyPI using pip:

    pip install manigen

Or, if you use uv (recommended for CLI tools):

    uv tool install manigen

Generate a manifest of all images in a dataset directory:

    manigen -i ./datasets/ImageNet/train -o manifest.txt

Speed up scanning for datasets with heavily nested directories (like ImageNet) by utilizing multiple threads and recursive search:

    manigen -i ./datasets/ImageNet/train -o train_paths.txt -t 8 -r

If your absolute path is `/Users/ml_engineer/projects/data/images/cat.jpg`, but you want the manifest to only contain `data/images/cat.jpg`:

    manigen -i /Users/ml_engineer/projects/data -o dataset.txt --strip-prefix /Users/ml_engineer/projects/

Automatically shuffle the dataset and split it into training (70%), validation (20%), and testing (10%) sets:

    manigen -i ./dataset -o manifest.txt --shuffle --split 0.7 0.2 0.1

Override the default extensions to scan for audio, text, or any other formats:

    manigen -i ./audio_dataset -o audio_manifest.txt -e wav mp3 flac

| Argument | Short | Description | Default |
|---|---|---|---|
| `--input-dir` | `-i` | (Required) One or more dataset directories to scan. | - |
| `--output-file` | `-o` | (Required) Output file path (e.g., `manifest.txt`). | - |
| `--threads` | `-t` | Number of threads for parallel scanning. | `1` |
| `--recursive` | `-r` | Scan subdirectories recursively. | `False` |
| `--extensions` | `-e` | Allowed file extensions. | `png jpeg jpg webp bmp` |
| `--strip-prefix` | | Prefix to strip from absolute paths for relative outputs. | `None` |
| `--shuffle` | | Shuffle paths randomly before saving. | `False` |
| `--split` | | Dataset split ratios, must sum to 1.0 (e.g., `0.8 0.2`). | `None` |
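To make the `--strip-prefix` and `--split` semantics concrete, here is a minimal Python sketch of both operations. It is not manigen's code: `strip_prefix`, `split_paths`, and the `seed` parameter are hypothetical, and the rounding rule (cut at rounded cumulative ratios, with the last part taking the remainder) is an assumption about one reasonable way to do it.

```python
import math
import random


def strip_prefix(path: str, prefix: str) -> str:
    """Drop prefix from path when present; otherwise return path unchanged."""
    return path[len(prefix):] if path.startswith(prefix) else path


def split_paths(paths, ratios, seed=0):
    """Shuffle paths and cut them into len(ratios) consecutive parts."""
    # Compare with a tolerance: 0.7 + 0.2 + 0.1 is not exactly 1.0 in floats.
    if not math.isclose(sum(ratios), 1.0):
        raise ValueError("split ratios must sum to 1.0")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    # The last part absorbs any rounding remainder.
    parts.append(shuffled[start:])
    return parts
```

For example, `split_paths(paths, (0.7, 0.2, 0.1))` on ten paths yields parts of 7, 2, and 1 items, and `strip_prefix("/Users/ml_engineer/projects/data/images/cat.jpg", "/Users/ml_engineer/projects/")` returns `data/images/cat.jpg`.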
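To set up a development environment, clone the repository:

    git clone https://github.com/ash1ra/manigen
    cd manigen

Install the dependencies:

    uv sync

Activate the virtual environment:

    # On Windows
    .venv\Scripts\activate
    # On Unix or macOS
    source .venv/bin/activate

Format and lint the code:

    uv run ruff format .
    uv run ruff check .

Run the tests:

    uv run pytest tests/ -v

If you'd like to contribute, please fork the repository and open a pull request to the main branch.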