A lightweight, zero-dependency KMH metagenomics batch pipeline.
KMH = KneadData → MetaPhlAn → HUMAnN
EasyBioBakery is a single-script alternative to bioBakery Workflows, designed for rapid deployment on small-to-medium HPC systems. It retains a fully standardised KMH analysis while eliminating unnecessary complexity — no Snakemake, no AnADAMA2, no extra Python packages beyond the standard library.
| Feature | Description |
|---|---|
| Zero extra dependencies | Pure Python standard library — nothing to install beyond the biobakery conda environment. |
| Concurrent download and analysis | Data acquisition and processing run in parallel, maximising utilisation of I/O wait time. |
| Resume on restart | A checkpoint file is written after each step; completed steps are skipped automatically on re-runs. |
| Per-sample failure isolation | A failed sample is logged and skipped without interrupting the rest of the batch. |
| Per-sample log files | Each sample writes to its own log, so concurrent output never interleaves. |
| Automatic cleanup | Intermediate files are removed after each sample to contain peak disk usage. |
| Dry-run mode | `--dry-run` prints every command that would be executed without writing a single byte. |
| Debug mode | `--no-cleanup` retains all intermediate files for troubleshooting. |
| Config file support | TOML (Python 3.11+) or INI (Python 3.8+) project-level config files. |
| Global persistent config | `--set-global` / `--show-config` persist database paths to `~/.config/easybiobakery/global.ini` — no need to edit `~/.bashrc`. |
| Force re-run | `--force` clears all checkpoints and re-runs every step from scratch. |
| Safe interruption | SIGINT (Ctrl+C) and SIGTERM (`kill` / `pkill`) both terminate every child process group, preventing orphaned grandchild processes on shared servers. |
| Environment variable awareness | `KNEADDATA_DB` and related variables are read automatically. |
| | EasyBioBakery | bioBakery Workflows |
|---|---|---|
| Installation | Minimal (one Python script, no extra deps) | Complex |
| Scheduler | Python ThreadPoolExecutor | Snakemake / AnADAMA2 (DAG) |
| HPC cluster support | Not yet (planned) | SLURM / SGE / LSF |
| Resume on restart | Checkpoint files | Output file timestamps |
| Per-sample failure isolation | Yes | Yes |
| Safe interruption (kill / pkill) | Yes — process groups | Yes |
| Customisability | High — code is the documentation | Low — black-box |
| Best suited for | Small-to-medium projects, rapid deployment | Large-scale production, HPC clusters |
You can download the standalone script directly:

```bash
wget https://raw.githubusercontent.com/FrankYannn/EasyBioBakery/main/easybiobakery.py
```

Alternatively, clone the entire repository:

```bash
git clone https://github.com/FrankYannn/EasyBioBakery.git
cd EasyBioBakery
```

Create and activate the conda environment:

```bash
conda create -n biobakery -c biobakery python=3.9
conda activate biobakery
conda install -c biobakery kneaddata metaphlan humann

# Trimmomatic is normally installed alongside KneadData.
# If it is missing, install it separately:
conda install -c bioconda trimmomatic
```

Consult the official bioBakery documentation for the correct database versions.
```bash
# KneadData: human reference genome
kneaddata_database --download human_genome bowtie2 $DIR

# MetaPhlAn
metaphlan --install --index mpa_vJun23_CHOCOPhlAnSGB_202403 --bowtie2db <database_dir>

# HUMAnN
humann_databases --download chocophlan full $INSTALL_LOCATION
humann_databases --download uniref uniref90_diamond $INSTALL_LOCATION
```

Run this once per machine so you never have to type database paths again:

```bash
python easybiobakery.py --set-global kneaddata_db /ref/kneaddata_hg39
python easybiobakery.py --set-global metaphlan_db /ref/metaphlan4
python easybiobakery.py --set-global chocophlan_db /ref/humann/chocophlan
python easybiobakery.py --set-global uniref_db /ref/humann/uniref

# Verify
python easybiobakery.py --show-config
```

Create `samples.txt` with three tab- or space-separated columns. Lines beginning with `#` are treated as comments. Local paths and remote URLs may be mixed freely within the same file.
```text
# Sample       R1                                                  R2
ERR1234567     ftp://ftp.sra.ebi.ac.uk/.../ERR1234567_1.fastq.gz   ftp://.../_2.fastq.gz
SampleLocal    /home/user/raw/SampleLocal_R1.fastq.gz              /home/user/raw/SampleLocal_R2.fastq.gz
```
Supported compression formats: .fastq.gz, .fq.gz, .fastq.bz2, .fq.bz2, .fastq, .fq.
ftp:// URLs are automatically rewritten to https:// to improve connectivity on networks where FTP port 21 is blocked.
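The ftp-to-https rewrite is a simple prefix substitution. The sketch below illustrates the idea; the function name is an assumption, not the script's actual internal API.

```python
# Illustrative sketch of the ftp:// -> https:// rewrite (hypothetical name;
# the real implementation lives inside easybiobakery.py).
def rewrite_ftp_url(url: str) -> str:
    """Rewrite ftp:// URLs to https:// for networks where port 21 is blocked."""
    if url.startswith("ftp://"):
        return "https://" + url[len("ftp://"):]
    return url  # local paths and https URLs pass through unchanged

print(rewrite_ftp_url("ftp://ftp.sra.ebi.ac.uk/vol1/ERR1234567_1.fastq.gz"))
# -> https://ftp.sra.ebi.ac.uk/vol1/ERR1234567_1.fastq.gz
```

ENA exposes the same paths over HTTPS, which is why this rewrite is safe for the EBI FTP mirrors.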
```bash
python easybiobakery.py \
    --sample-list samples.txt \
    --output-dir /data/my_project/results \
    --parallel 4 \
    --threads 16 \
    --kneaddata-memory 20g
```

If database paths were saved with `--set-global`, they do not need to be repeated here.
Project-level config files are convenient for parameters that are fixed within a project.
`config.toml` (requires Python 3.11+):

```toml
[pipeline]
output_dir = "/data/results"
kneaddata_db = "/ref/kneaddata_hg39"
metaphlan_db = "/ref/metaphlan4"
chocophlan_db = "/ref/humann/chocophlan"
uniref_db = "/ref/humann/uniref"
parallel = 4
threads = 16
kneaddata_memory = "20g"
```

`config.ini` (Python 3.8+ compatible):

```ini
[pipeline]
output_dir = /data/results
kneaddata_db = /ref/kneaddata_hg39
metaphlan_db = /ref/metaphlan4
chocophlan_db = /ref/humann/chocophlan
uniref_db = /ref/humann/uniref
parallel = 4
threads = 16
kneaddata_memory = 20g
```

```bash
python easybiobakery.py --config config.toml --sample-list samples.txt
python easybiobakery.py --config config.ini --sample-list samples.txt --parallel 2
```

Settings are resolved in the following order (highest to lowest):
```text
CLI argument > --config file > global config (~/.config/easybiobakery/global.ini)
             > environment variable > built-in default
```
Environment variables recognised: `KNEADDATA_DB`, `METAPHLAN_DB_DIR`, `HUMANN_NUCLEOTIDE_DB`, `HUMANN_PROTEIN_DB`.
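The five-level precedence amounts to "return the first non-empty source". A minimal sketch, assuming hypothetical names (`resolve` is not the script's actual function):

```python
# Illustrative resolver for the precedence chain:
# CLI > --config file > global config > environment variable > built-in default.
import os

def resolve(cli=None, config_file=None, global_cfg=None,
            env_var=None, default=None):
    """Return the highest-priority value that was actually set."""
    candidates = (
        cli,                                          # 1. CLI argument
        config_file,                                  # 2. --config file
        global_cfg,                                   # 3. global config
        os.environ.get(env_var) if env_var else None, # 4. environment variable
        default,                                      # 5. built-in default
    )
    for value in candidates:
        if value is not None:
            return value
    return None

os.environ["KNEADDATA_DB"] = "/env/kneaddata"
print(resolve(cli="/cli/db", env_var="KNEADDATA_DB"))  # CLI wins: /cli/db
print(resolve(env_var="KNEADDATA_DB"))                 # falls through to /env/kneaddata
```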
```text
results/
├── 00_Temp_Raw/       # Raw FASTQ (removed after KneadData)
├── 01_Kneaddata/      # KneadData output (alignment intermediates removed)
├── 02_MetaPhlAn/      # MetaPhlAn profiles (*.tsv + *.bowtie2.bz2)
├── 03_HUMAnN/         # HUMAnN output (temp dirs removed; per-sample TSVs retained)
├── 04_Final_Tables/   # Final merged and normalised tables
│   ├── merged_genefamilies.tsv
│   ├── merged_genefamilies_relab.tsv   # Relative abundance (for MelonnPan)
│   ├── merged_pathabundance.tsv
│   ├── merged_pathabundance_cpm.tsv    # CPM normalised (for MaAsLin2/3)
│   └── merged_metaphlan_taxa.tsv
└── logs/
    ├── pipeline_20240101_120000.log
    ├── ERR1234567.log
    └── ERR1234568.log
```
| Argument | Required | Default | Description |
|---|---|---|---|
| `--sample-list` | Yes | — | Sample manifest file. |
| `--output-dir` | Yes | — | Root output directory. |
| `--kneaddata-db` | Yes* | — | KneadData host reference database. (*Can be set via `--set-global` or `KNEADDATA_DB`.) |
| `--metaphlan-db` | Yes* | — | MetaPhlAn database directory. (*Can be set via `--set-global` or `METAPHLAN_DB_DIR`.) |
| `--chocophlan-db` | Yes* | — | HUMAnN ChocoPhlAn nucleotide database. (*Can be set via `--set-global` or `HUMANN_NUCLEOTIDE_DB`.) |
| `--uniref-db` | Yes* | — | HUMAnN UniRef protein database. (*Can be set via `--set-global` or `HUMANN_PROTEIN_DB`.) |
| `--config` | No | — | Project-level config file (`.toml` / `.ini` / `.cfg`). |
| `--set-global KEY VALUE` | No | — | Persist a value to the global config and exit. |
| `--show-config` | No | — | Print the current global config and exit. |
| `--parallel` | No | 4 | Number of samples processed concurrently. |
| `--threads` | No | 12 | CPU threads per sample. |
| `--kneaddata-memory` | No | 20g | Maximum memory per KneadData process. |
| `--metaphlan-index` | No | mpa_vJun23_CHOCOPhlAnSGB_202403 | MetaPhlAn database index name. |
| `--humann-search-mode` | No | uniref90 | HUMAnN protein search mode (uniref90 / uniref50). |
| `--keep-kneaddata` | No | False | Retain KneadData output directory. |
| `--no-cleanup` | No | False | Retain all intermediate files (debug mode). |
| `--force` | No | False | Clear all checkpoints and re-run from scratch. |
| `--dry-run` | No | False | Print commands without executing them. |
```text
Total CPU cores   = --parallel × --threads
KneadData memory  = --kneaddata-memory × --parallel   (keep ≤ 50% of physical RAM)
Peak disk usage   ≈ n_samples × 50 GB   (drops to ~5–10 GB per sample after cleanup)
```
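As a worked example of these formulas, here is the arithmetic for `--parallel 4 --threads 16 --kneaddata-memory 20g` on a hypothetical 256 GB node:

```python
# Sizing check for a hypothetical node (256 GB RAM); values mirror the
# run command shown earlier in this README.
parallel, threads = 4, 16
kneaddata_mem_gb, ram_gb = 20, 256

total_cores = parallel * threads                  # 4 x 16 = 64 CPU cores needed
kneaddata_total_gb = kneaddata_mem_gb * parallel  # 20 x 4 = 80 GB for KneadData
within_budget = kneaddata_total_gb <= 0.5 * ram_gb  # 80 <= 128 -> True

print(total_cores, kneaddata_total_gb, within_budget)
# 64 80 True
```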
Memory warning: `--kneaddata-memory` constrains KneadData only. HUMAnN can consume 20–40 GB per process when loading the UniRef database. On memory-constrained servers, reduce `--parallel` first.
Q: Which KMH versions are supported?
A: EasyBioBakery was developed and tested with KneadData v0.12.0, MetaPhlAn v4.0.6, and HUMAnN v3.9. We anticipate that MetaPhlAn 4.x and HUMAnN 3.x should function correctly. We recommend using a Conda environment with Python 3.8+; however, please avoid excessively high Python versions (3.12+) to prevent potential dependency conflicts with the core bioBakery tools. If you verify a different version combination (working or not), please open an issue.
Q: KneadData cannot find its output files.
A: Output filename formats vary slightly across KneadData versions; EasyBioBakery handles the two most common patterns. If it still fails, inspect `logs/<sample>.log` for the actual filename and open an issue.
Q: KneadData fails with `Error: Invalid or corrupt jarfile .../trimmomatic`.
A: This is a known conda packaging conflict (fixed in v1.0.1). The `share/trimmomatic-<ver>/` directory contains both a `trimmomatic` wrapper script and `trimmomatic.jar`; KneadData's glob picks the wrapper first alphabetically and tries to run it with `java -jar`. EasyBioBakery v1.0.1 auto-detects the real JAR and passes a JAR-only directory to KneadData at startup. If you see this error, ensure you are running v1.0.1 or later. If it persists, pass `--trimmomatic-path /path/to/dir-containing-trimmomatic.jar` to override the auto-detection.
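The JAR-only directory trick can be sketched as follows. Paths and the function name are illustrative, not the script's actual internals:

```python
# Hedged sketch: build a directory containing only the real Trimmomatic JAR,
# so a `trimmomatic*` glob matches exactly one file.
from pathlib import Path

def make_jar_only_dir(share_dir: Path, cache_dir: Path) -> Path:
    """Symlink trimmomatic*.jar (and nothing else) into a clean directory."""
    jars = sorted(share_dir.glob("trimmomatic*.jar"))
    if not jars:
        raise FileNotFoundError(f"no Trimmomatic JAR under {share_dir}")
    cache_dir.mkdir(parents=True, exist_ok=True)
    link = cache_dir / jars[0].name
    if not link.exists():
        link.symlink_to(jars[0])  # the wrapper script is deliberately left out
    return cache_dir
```

Pointing KneadData at the returned directory means its glob can no longer pick up the extension-less wrapper script.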
Q: A sample fails with `Error: Could not load sequence. Empty file or bad format.` in the KneadData log.
A: This is a TRF crash caused by an extremely low-quality sample — Trimmomatic discarded nearly all reads, leaving a near-empty orphaned single-end file that TRF cannot parse. Since v1.0.1, EasyBioBakery automatically detects this by inspecting the KneadData log, wipes the failed output directory (to avoid accidentally reusing un-decontaminated Trimmomatic intermediates), and re-runs KneadData with `--bypass-trf`, which skips TRF while preserving the full Bowtie2 host-decontamination step. If the retry also fails, the sample is then truly marked as failed.
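The detect-and-retry behaviour can be sketched roughly like this; the real logic lives inside `easybiobakery.py` and may differ in detail:

```python
# Hedged sketch of KneadData's TRF-crash fallback as described above.
import shutil
import subprocess
from pathlib import Path

TRF_ERROR = "Error: Could not load sequence. Empty file or bad format."

def run_kneaddata_with_trf_fallback(cmd: list, out_dir: Path, log: Path) -> bool:
    """Run KneadData; on a TRF crash, wipe partial output and retry with --bypass-trf."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    if log.exists() and TRF_ERROR in log.read_text():
        # Never reuse un-decontaminated Trimmomatic intermediates.
        shutil.rmtree(out_dir, ignore_errors=True)
        retry = subprocess.run(cmd + ["--bypass-trf"], capture_output=True, text=True)
        return retry.returncode == 0
    return False  # some other failure: report it as-is
```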
Q: Wrong MetaPhlAn index version.
A: Use `--metaphlan-index` to specify the index installed locally (`ls /path/to/metaphlan_db/`). Different MetaPhlAn versions may only recognise certain index names — consult the bioBakery documentation if in doubt.
Q: How do I check my command before running?
A: Use `--dry-run`. All commands are printed but nothing is executed.
Q: Only some samples failed. How do I retry just those?
A: Create a new manifest containing only the failed samples and re-run with the same arguments. Completed samples will be skipped automatically.
Q: Will kill / pkill leave orphan processes?
A: No. Since v1.0.0, every subprocess is launched in its own process group. Both SIGINT (Ctrl + C) and SIGTERM (kill / pkill) send the termination signal to the entire process group, reaching every grandchild process spawned by KneadData, MetaPhlAn, and HUMAnN.
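The process-group pattern looks roughly like the sketch below (POSIX-only; an assumption-labelled illustration, not the script's exact code):

```python
# Minimal sketch of launching children in their own process group and
# signalling the whole group, so grandchildren are reached too.
import os
import signal
import subprocess

def launch_in_group(cmd):
    """Start the child in a new session, making it lead its own process group."""
    return subprocess.Popen(cmd, start_new_session=True)

def terminate_group(proc, sig=signal.SIGTERM):
    """Send the signal to the child's entire process group."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # the group already exited
```

Because `start_new_session=True` detaches the child from the parent's group, `os.killpg` reaches every process KneadData, MetaPhlAn, or HUMAnN spawns underneath it.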
Q: I only have curl, not wget.
A: EasyBioBakery falls back to `curl` automatically. To install wget: `conda install wget`.
Q: Can I mix local paths and remote URLs in the same manifest?
A: Yes. Local paths are symlinked (zero-copy); remote URLs are downloaded. They can appear in the same `samples.txt` without any special configuration.
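The dispatch between the two entry types could look like this (illustrative function, not the script's actual API):

```python
# Hypothetical sketch: symlink local manifest entries, flag remote ones
# for download.
from pathlib import Path

def stage_input(entry: str, dest: Path) -> str:
    """Return 'download' for URLs; symlink local files (zero-copy) and return 'symlink'."""
    if entry.startswith(("http://", "https://", "ftp://")):
        return "download"
    src = Path(entry)
    if not dest.exists():
        dest.symlink_to(src.resolve())  # no data is copied
    return "symlink"
```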
Q: Which config source takes priority when multiple are set?
A: CLI > --config file > global config > environment variable > built-in default. Avoid specifying the same parameter through multiple sources to keep your configuration unambiguous.
- 🐛 **KneadData selects the Trimmomatic wrapper script instead of the JAR.** When Trimmomatic is installed via conda (bioconda), `share/trimmomatic-<ver>/` contains both a shell wrapper (`trimmomatic`, no extension) and the real archive (`trimmomatic.jar`). KneadData globs `trimmomatic*` and, because the extension-less name sorts first alphabetically, tries to run the wrapper with `java -jar`, crashing with `Error: Invalid or corrupt jarfile`. EasyBioBakery now locates the real JAR at startup and passes KneadData a small JAR-only symlink directory (`~/.cache/easybiobakery/trimmomatic_jar_<hash>/`), so the glob always matches exactly one file.
- 🐛 **TRF crash on near-empty KneadData output causes false sample failure.** On extremely low-quality samples, Trimmomatic discards nearly all reads. The resulting near-empty orphaned single-end file causes TRF (called internally by KneadData) to crash with `Error: Could not load sequence. Empty file or bad format.`, making KneadData exit non-zero before the Bowtie2 host-decontamination step completes. Any files left on disk at that point are un-decontaminated Trimmomatic intermediates; reusing them would silently contaminate all downstream MetaPhlAn and HUMAnN results with human reads. EasyBioBakery now inspects the KneadData log after a non-zero exit: if a TRF crash is detected, the failed output directory is wiped and KneadData is re-run with `--bypass-trf` (skipping TRF while keeping the full Bowtie2 host-removal pipeline intact). Only if the retry also fails is the sample marked as failed.
- 🐛 **Robust download validation.** Added a mandatory local integrity check (`gzip -t`) for all downloaded `.fastq.gz` files. Previously, network interruptions (especially from distant FTP servers like ENA) could result in silently truncated files that would permanently crash the pipeline during the KneadData unzipping step. Now, corrupted files are automatically detected, rejected, and deleted before the download checkpoint is created, ensuring downstream tools only process healthy data.
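A pure-Python stand-in for the `gzip -t` integrity check looks like this (an illustrative sketch; the script may shell out to `gzip` instead):

```python
# Validate a .fastq.gz file by fully decompressing it, as `gzip -t` does.
# A truncated download fails before the end-of-stream marker is reached.
import gzip

def is_valid_gzip(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip stream decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):  # stream in chunks; never hold it all in RAM
                pass
        return True
    except (OSError, EOFError):  # BadGzipFile is an OSError; truncation raises EOFError
        return False
```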
- 🎉 Initial release.
If EasyBioBakery contributes to published research, please cite the original bioBakery tool papers:
KneadData / bioBakery
McIver LJ, Abu-Ali G, Franzosa EA, et al. bioBakery: a meta'omic analysis environment. Bioinformatics. 2018;34(7):1235–1237.
MetaPhlAn 4
Blanco-Míguez A, Beghini F, Cumbo F, et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology. 2023;41:1633–1644.
HUMAnN 3 / bioBakery 3
Beghini F, McIver LJ, Blanco-Míguez A, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife. 2021;10:e65088.
MIT License — free to use, modify, and distribute.
EasyBioBakery is built on top of the bioBakery toolchain. We gratefully acknowledge the Huttenhower Lab for their foundational work.