A lightweight, zero-dependency KMH metagenomics batch pipeline.
KMH = KneadData → MetaPhlAn → HUMAnN
EasyBioBakery is a single-script alternative to bioBakery Workflows, designed for rapid deployment on small-to-medium HPC systems. It retains a fully standardised KMH analysis while eliminating unnecessary complexity — no Snakemake, no AnADAMA2, no extra Python packages beyond the standard library.
| Feature | Description |
|---|---|
| Zero extra dependencies | Pure Python standard library — nothing to install beyond the biobakery conda environment. |
| Concurrent download and analysis | Data acquisition and processing run in parallel, maximising utilisation of I/O wait time. |
| Resume on restart | A checkpoint file is written after each step; completed steps are skipped automatically on re-runs. |
| Per-sample failure isolation | A failed sample is logged and skipped without interrupting the rest of the batch. |
| Per-sample log files | Each sample writes to its own log, so concurrent output never interleaves. |
| Automatic cleanup | Intermediate files are removed after each sample to contain peak disk usage. |
| Dry-run mode | `--dry-run` prints every command that would be executed without writing a single byte. |
| Debug mode | `--no-cleanup` retains all intermediate files for troubleshooting. |
| Config file support | TOML (Python 3.11+) or INI (Python 3.8+) project-level config files. |
| Global persistent config | `--set-global` / `--show-config` persist database paths to `~/.config/easybiobakery/global.ini` — no need to edit `~/.bashrc`. |
| Force re-run | `--force` clears all checkpoints and re-runs every step from scratch. |
| Safe interruption | SIGINT (Ctrl+C) and SIGTERM (`kill` / `pkill`) both terminate every child process group, preventing orphaned grandchild processes on shared servers. |
| Environment variable awareness | `KNEADDATA_DB` and related variables are read automatically. |
| | EasyBioBakery | bioBakery Workflows |
|---|---|---|
| Installation | Minimal (one Python script, no extra deps) | Complex |
| Scheduler | Python ThreadPoolExecutor | Snakemake / AnADAMA2 (DAG) |
| HPC cluster support | Not yet (planned) | SLURM / SGE / LSF |
| Resume on restart | Checkpoint files | Output file timestamps |
| Per-sample failure isolation | Yes | Yes |
| Safe interruption (kill / pkill) | Yes — process groups | Yes |
| Customisability | High — code is the documentation | Low — black-box |
| Best suited for | Small-to-medium projects, rapid deployment | Large-scale production, HPC clusters |
You can download the standalone script directly:

```bash
wget https://raw.githubusercontent.com/FrankYannn/EasyBioBakery/main/easybiobakery.py
```

Alternatively, clone the entire repository:

```bash
git clone https://github.com/FrankYannn/EasyBioBakery.git
cd EasyBioBakery
```

Create and activate the conda environment:

```bash
conda create -n biobakery -c biobakery python=3.9
conda activate biobakery
conda install -c biobakery kneaddata metaphlan humann

# Trimmomatic is normally installed alongside KneadData.
# If it is missing, install it separately:
conda install -c bioconda trimmomatic
```

Consult the official bioBakery documentation for the correct database versions.
```bash
# KneadData: human reference genome
kneaddata_database --download human_genome bowtie2 $DIR

# MetaPhlAn
metaphlan --install --index mpa_vJun23_CHOCOPhlAnSGB_202403 --bowtie2db <database_dir>

# HUMAnN
humann_databases --download chocophlan full $INSTALL_LOCATION
humann_databases --download uniref uniref90_diamond $INSTALL_LOCATION
```

Run this once per machine so you never have to type database paths again:

```bash
python easybiobakery.py --set-global kneaddata_db /ref/kneaddata_hg39
python easybiobakery.py --set-global metaphlan_db /ref/metaphlan4
python easybiobakery.py --set-global chocophlan_db /ref/humann/chocophlan
python easybiobakery.py --set-global uniref_db /ref/humann/uniref

# Verify
python easybiobakery.py --show-config
```

Create `samples.txt` with three tab- or space-separated columns. Lines beginning with `#` are treated as comments. Local paths and remote URLs may be mixed freely within the same file.
```text
# Sample       R1                                                  R2
ERR1234567     ftp://ftp.sra.ebi.ac.uk/.../ERR1234567_1.fastq.gz   ftp://.../_2.fastq.gz
SampleLocal    /home/user/raw/SampleLocal_R1.fastq.gz              /home/user/raw/SampleLocal_R2.fastq.gz
```
Supported compression formats: .fastq.gz, .fq.gz, .fastq.bz2, .fq.bz2, .fastq, .fq.
ftp:// URLs are automatically rewritten to https:// to improve connectivity on networks where FTP port 21 is blocked.
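The ftp-to-https rewrite is a simple prefix substitution. The sketch below illustrates the idea; the function name is an assumption, not the script's actual internal API.

```python
# Illustrative sketch of the ftp:// -> https:// rewrite (hypothetical name;
# the real implementation lives inside easybiobakery.py).
def rewrite_ftp_url(url: str) -> str:
    """Rewrite ftp:// URLs to https:// for networks where port 21 is blocked."""
    if url.startswith("ftp://"):
        return "https://" + url[len("ftp://"):]
    return url  # local paths and https URLs pass through unchanged

print(rewrite_ftp_url("ftp://ftp.sra.ebi.ac.uk/vol1/ERR1234567_1.fastq.gz"))
# -> https://ftp.sra.ebi.ac.uk/vol1/ERR1234567_1.fastq.gz
```

ENA exposes the same paths over HTTPS, which is why this rewrite is safe for the EBI FTP mirrors.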
```bash
python easybiobakery.py \
    --sample-list samples.txt \
    --output-dir /data/my_project/results \
    --parallel 4 \
    --threads 16 \
    --kneaddata-memory 20g
```

If database paths were saved with `--set-global`, they do not need to be repeated here.
Project-level config files are convenient for parameters that are fixed within a project.
`config.toml` (requires Python 3.11+):

```toml
[pipeline]
output_dir = "/data/results"
kneaddata_db = "/ref/kneaddata_hg39"
metaphlan_db = "/ref/metaphlan4"
chocophlan_db = "/ref/humann/chocophlan"
uniref_db = "/ref/humann/uniref"
parallel = 4
threads = 16
kneaddata_memory = "20g"
```

`config.ini` (Python 3.8+ compatible):

```ini
[pipeline]
output_dir = /data/results
kneaddata_db = /ref/kneaddata_hg39
metaphlan_db = /ref/metaphlan4
chocophlan_db = /ref/humann/chocophlan
uniref_db = /ref/humann/uniref
parallel = 4
threads = 16
kneaddata_memory = 20g
```

```bash
python easybiobakery.py --config config.toml --sample-list samples.txt
python easybiobakery.py --config config.ini --sample-list samples.txt --parallel 2
```

Settings are resolved in the following order (highest to lowest):
```text
CLI argument > --config file > global config (~/.config/easybiobakery/global.ini)
             > environment variable > built-in default
```
Environment variables recognised: `KNEADDATA_DB`, `METAPHLAN_DB_DIR`, `HUMANN_NUCLEOTIDE_DB`, `HUMANN_PROTEIN_DB`.
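The five-level precedence amounts to "return the first non-empty source". A minimal sketch, assuming hypothetical names (`resolve` is not the script's actual function):

```python
# Illustrative resolver for the precedence chain:
# CLI > --config file > global config > environment variable > built-in default.
import os

def resolve(cli=None, config_file=None, global_cfg=None,
            env_var=None, default=None):
    """Return the highest-priority value that was actually set."""
    candidates = (
        cli,                                          # 1. CLI argument
        config_file,                                  # 2. --config file
        global_cfg,                                   # 3. global config
        os.environ.get(env_var) if env_var else None, # 4. environment variable
        default,                                      # 5. built-in default
    )
    for value in candidates:
        if value is not None:
            return value
    return None

os.environ["KNEADDATA_DB"] = "/env/kneaddata"
print(resolve(cli="/cli/db", env_var="KNEADDATA_DB"))  # CLI wins: /cli/db
print(resolve(env_var="KNEADDATA_DB"))                 # falls through to /env/kneaddata
```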
```text
results/
├── 00_Temp_Raw/       # Raw FASTQ (removed after KneadData)
├── 01_Kneaddata/      # KneadData output (alignment intermediates removed)
├── 02_MetaPhlAn/      # MetaPhlAn profiles (*.tsv + *.bowtie2.bz2)
├── 03_HUMAnN/         # HUMAnN output (temp dirs removed; per-sample TSVs retained)
├── 04_Final_Tables/   # Final merged and normalised tables
│   ├── merged_genefamilies.tsv
│   ├── merged_genefamilies_relab.tsv   # Relative abundance (for MelonnPan)
│   ├── merged_pathabundance.tsv
│   ├── merged_pathabundance_cpm.tsv    # CPM normalised (for MaAsLin2/3)
│   └── merged_metaphlan_taxa.tsv
└── logs/
    ├── pipeline_20240101_120000.log
    ├── ERR1234567.log
    └── ERR1234568.log
```
| Argument | Required | Default | Description |
|---|---|---|---|
| `--sample-list` | Yes | — | Sample manifest file. |
| `--output-dir` | Yes | — | Root output directory. |
| `--kneaddata-db` | Yes* | — | KneadData host reference database. (*Can be set via `--set-global` or `KNEADDATA_DB`.) |
| `--metaphlan-db` | Yes* | — | MetaPhlAn database directory. (*Can be set via `--set-global` or `METAPHLAN_DB_DIR`.) |
| `--chocophlan-db` | Yes* | — | HUMAnN ChocoPhlAn nucleotide database. (*Can be set via `--set-global` or `HUMANN_NUCLEOTIDE_DB`.) |
| `--uniref-db` | Yes* | — | HUMAnN UniRef protein database. (*Can be set via `--set-global` or `HUMANN_PROTEIN_DB`.) |
| `--config` | No | — | Project-level config file (`.toml` / `.ini` / `.cfg`). |
| `--set-global KEY VALUE` | No | — | Persist a value to the global config and exit. |
| `--show-config` | No | — | Print the current global config and exit. |
| `--parallel` | No | 4 | Number of samples processed concurrently. |
| `--threads` | No | 12 | CPU threads per sample. |
| `--kneaddata-memory` | No | 20g | Maximum memory per KneadData process. |
| `--metaphlan-index` | No | mpa_vJun23_CHOCOPhlAnSGB_202403 | MetaPhlAn database index name. |
| `--humann-search-mode` | No | uniref90 | HUMAnN protein search mode (uniref90 / uniref50). |
| `--keep-kneaddata` | No | False | Retain KneadData output directory. |
| `--no-cleanup` | No | False | Retain all intermediate files (debug mode). |
| `--force` | No | False | Clear all checkpoints and re-run from scratch. |
| `--dry-run` | No | False | Print commands without executing them. |
```text
Total CPU cores   = --parallel × --threads
KneadData memory  = --kneaddata-memory × --parallel   (keep ≤ 50% of physical RAM)
Peak disk usage   ≈ n_samples × 50 GB   (drops to ~5–10 GB per sample after cleanup)
```
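As a worked example of these formulas, here is the arithmetic for `--parallel 4 --threads 16 --kneaddata-memory 20g` on a hypothetical 256 GB node:

```python
# Sizing check for a hypothetical node (256 GB RAM); values mirror the
# run command shown earlier in this README.
parallel, threads = 4, 16
kneaddata_mem_gb, ram_gb = 20, 256

total_cores = parallel * threads                  # 4 x 16 = 64 CPU cores needed
kneaddata_total_gb = kneaddata_mem_gb * parallel  # 20 x 4 = 80 GB for KneadData
within_budget = kneaddata_total_gb <= 0.5 * ram_gb  # 80 <= 128 -> True

print(total_cores, kneaddata_total_gb, within_budget)
# 64 80 True
```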
Memory warning: `--kneaddata-memory` constrains KneadData only. HUMAnN can consume 20–40 GB per process when loading the UniRef database. On memory-constrained servers, reduce `--parallel` first.
Q: Which KMH versions are supported?
A: EasyBioBakery was developed and tested with KneadData v0.12.0, MetaPhlAn v4.0.6, and HUMAnN v3.9. We anticipate that MetaPhlAn 4.x and HUMAnN 3.x should function correctly. We recommend using a Conda environment with Python 3.8+; however, please avoid excessively high Python versions (3.12+) to prevent potential dependency conflicts with the core bioBakery tools. If you verify a different version combination (working or not), please open an issue.
Q: KneadData cannot find its output files.
A: Output filename formats vary slightly across KneadData versions; EasyBioBakery handles the two most common patterns. If it still fails, inspect `logs/<sample>.log` for the actual filename and open an issue.
Q: KneadData fails with `Error: Invalid or corrupt jarfile .../trimmomatic`.
A: This is a known conda packaging conflict (fixed in v1.0.1). The `share/trimmomatic-<ver>/` directory contains both a `trimmomatic` wrapper script and `trimmomatic.jar`; KneadData's glob picks the wrapper first alphabetically and tries to run it with `java -jar`. EasyBioBakery v1.0.1 auto-detects the real JAR and passes a JAR-only directory to KneadData at startup. If you see this error, ensure you are running v1.0.1 or later. If it persists, pass `--trimmomatic-path /path/to/dir-containing-trimmomatic.jar` to override the auto-detection.
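The JAR-only directory trick can be sketched as follows. Paths and the function name are illustrative, not the script's actual internals:

```python
# Hedged sketch: build a directory containing only the real Trimmomatic JAR,
# so a `trimmomatic*` glob matches exactly one file.
from pathlib import Path

def make_jar_only_dir(share_dir: Path, cache_dir: Path) -> Path:
    """Symlink trimmomatic*.jar (and nothing else) into a clean directory."""
    jars = sorted(share_dir.glob("trimmomatic*.jar"))
    if not jars:
        raise FileNotFoundError(f"no Trimmomatic JAR under {share_dir}")
    cache_dir.mkdir(parents=True, exist_ok=True)
    link = cache_dir / jars[0].name
    if not link.exists():
        link.symlink_to(jars[0])  # the wrapper script is deliberately left out
    return cache_dir
```

Pointing KneadData at the returned directory means its glob can no longer pick up the extension-less wrapper script.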
Q: A sample fails with `Error: Could not load sequence. Empty file or bad format.` in the KneadData log.
A: This is a TRF crash caused by an extremely low-quality sample — Trimmomatic discarded nearly all reads, leaving a near-empty orphaned single-end file that TRF cannot parse. Since v1.0.1, EasyBioBakery automatically detects this by inspecting the KneadData log, wipes the failed output directory (to avoid accidentally reusing un-decontaminated Trimmomatic intermediates), and re-runs KneadData with `--bypass-trf`, which skips TRF while preserving the full Bowtie2 host-decontamination step. If the retry also fails, the sample is then truly marked as failed.
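The detect-and-retry behaviour can be sketched roughly like this; the real logic lives inside `easybiobakery.py` and may differ in detail:

```python
# Hedged sketch of KneadData's TRF-crash fallback as described above.
import shutil
import subprocess
from pathlib import Path

TRF_ERROR = "Error: Could not load sequence. Empty file or bad format."

def run_kneaddata_with_trf_fallback(cmd: list, out_dir: Path, log: Path) -> bool:
    """Run KneadData; on a TRF crash, wipe partial output and retry with --bypass-trf."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    if log.exists() and TRF_ERROR in log.read_text():
        # Never reuse un-decontaminated Trimmomatic intermediates.
        shutil.rmtree(out_dir, ignore_errors=True)
        retry = subprocess.run(cmd + ["--bypass-trf"], capture_output=True, text=True)
        return retry.returncode == 0
    return False  # some other failure: report it as-is
```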
Q: Wrong MetaPhlAn index version.
A: Use `--metaphlan-index` to specify the index installed locally (`ls /path/to/metaphlan_db/`). Different MetaPhlAn versions may only recognise certain index names — consult the bioBakery documentation if in doubt.
Q: How do I check my command before running?
A: Use `--dry-run`. All commands are printed but nothing is executed.
Q: Only some samples failed. How do I retry just those?
A: Create a new manifest containing only the failed samples and re-run with the same arguments. Completed samples will be skipped automatically.
Q: Will kill / pkill leave orphan processes?
A: No. Since v1.0.0, every subprocess is launched in its own process group. Both SIGINT (Ctrl + C) and SIGTERM (kill / pkill) send the termination signal to the entire process group, reaching every grandchild process spawned by KneadData, MetaPhlAn, and HUMAnN.
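The process-group pattern looks roughly like the sketch below (POSIX-only; an assumption-labelled illustration, not the script's exact code):

```python
# Minimal sketch of launching children in their own process group and
# signalling the whole group, so grandchildren are reached too.
import os
import signal
import subprocess

def launch_in_group(cmd):
    """Start the child in a new session, making it lead its own process group."""
    return subprocess.Popen(cmd, start_new_session=True)

def terminate_group(proc, sig=signal.SIGTERM):
    """Send the signal to the child's entire process group."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # the group already exited
```

Because `start_new_session=True` detaches the child from the parent's group, `os.killpg` reaches every process KneadData, MetaPhlAn, or HUMAnN spawns underneath it.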
Q: I only have curl, not wget.
A: EasyBioBakery falls back to `curl` automatically. To install wget: `conda install wget`.
Q: Can I mix local paths and remote URLs in the same manifest?
A: Yes. Local paths are symlinked (zero-copy); remote URLs are downloaded. They can appear in the same `samples.txt` without any special configuration.
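The dispatch between the two entry types could look like this (illustrative function, not the script's actual API):

```python
# Hypothetical sketch: symlink local manifest entries, flag remote ones
# for download.
from pathlib import Path

def stage_input(entry: str, dest: Path) -> str:
    """Return 'download' for URLs; symlink local files (zero-copy) and return 'symlink'."""
    if entry.startswith(("http://", "https://", "ftp://")):
        return "download"
    src = Path(entry)
    if not dest.exists():
        dest.symlink_to(src.resolve())  # no data is copied
    return "symlink"
```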
Q: Which config source takes priority when multiple are set?
A: CLI > --config file > global config > environment variable > built-in default. Avoid specifying the same parameter through multiple sources to keep your configuration unambiguous.
- 🐛 **KneadData selects the Trimmomatic wrapper script instead of the JAR.** When Trimmomatic is installed via conda (bioconda), `share/trimmomatic-<ver>/` contains both a shell wrapper (`trimmomatic`, no extension) and the real archive (`trimmomatic.jar`). KneadData globs `trimmomatic*` and, because the extension-less name sorts first alphabetically, tries to run the wrapper with `java -jar`, crashing with `Error: Invalid or corrupt jarfile`. EasyBioBakery now locates the real JAR at startup and passes KneadData a small JAR-only symlink directory (`~/.cache/easybiobakery/trimmomatic_jar_<hash>/`), so the glob always matches exactly one file.
- 🐛 **TRF crash on near-empty KneadData output causes false sample failure.** On extremely low-quality samples, Trimmomatic discards nearly all reads. The resulting near-empty orphaned single-end file causes TRF (called internally by KneadData) to crash with `Error: Could not load sequence. Empty file or bad format.`, making KneadData exit non-zero before the Bowtie2 host-decontamination step completes. Any files left on disk at that point are un-decontaminated Trimmomatic intermediates; reusing them would silently contaminate all downstream MetaPhlAn and HUMAnN results with human reads. EasyBioBakery now inspects the KneadData log after a non-zero exit: if a TRF crash is detected, the failed output directory is wiped and KneadData is re-run with `--bypass-trf` (skipping TRF while keeping the full Bowtie2 host-removal pipeline intact). Only if the retry also fails is the sample marked as failed.
- 🐛 **Robust download validation.** Added a mandatory local integrity check (`gzip -t`) for all downloaded `.fastq.gz` files. Previously, network interruptions (especially from distant FTP servers like ENA) could result in silently truncated files that would permanently crash the pipeline during the KneadData unzipping step. Now, corrupted files are automatically detected, rejected, and deleted before the download checkpoint is created, ensuring downstream tools only process healthy data.
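A pure-Python stand-in for the `gzip -t` integrity check looks like this (an illustrative sketch; the script may shell out to `gzip` instead):

```python
# Validate a .fastq.gz file by fully decompressing it, as `gzip -t` does.
# A truncated download fails before the end-of-stream marker is reached.
import gzip

def is_valid_gzip(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip stream decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):  # stream in chunks; never hold it all in RAM
                pass
        return True
    except (OSError, EOFError):  # BadGzipFile is an OSError; truncation raises EOFError
        return False
```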
- 🎉 Initial release.
If EasyBioBakery contributes to published research, please cite the original bioBakery tool papers:
KneadData / bioBakery
McIver LJ, Abu-Ali G, Franzosa EA, et al. bioBakery: a meta'omic analysis environment. Bioinformatics. 2018;34(7):1235–1237.
MetaPhlAn 4
Blanco-Míguez A, Beghini F, Cumbo F, et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology. 2023;41:1633–1644.
HUMAnN 3 / bioBakery 3
Beghini F, McIver LJ, Blanco-Míguez A, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife. 2021;10:e65088.
MIT License — free to use, modify, and distribute.
EasyBioBakery is built on top of the bioBakery toolchain. We gratefully acknowledge the Huttenhower Lab for their foundational work.