This guide shows you how to compare a VCF file generated by your pipeline
(e.g. DRAGEN) against a trusted truth set using hap.py, all inside a
Docker container.
You don't need to install anything besides Docker and Python, and you don't need to be a bioinformatician to follow this.
Make sure you have:
- Docker installed and running.
  - Run docker --version to check.
  - Pull the hap.py image: docker pull pkrusche/hap.py
- happy-cli installed.
  - From the project directory: pip install -e .
  - Verify: happy --help
- A known truth VCF (included in data/):
  - Example: data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz
- The VCF you want to evaluate, produced by your pipeline:
  - Example: /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz
- A reference genome in FASTA format:
  - Example: /path/to/hg38_reference/WholeGenomeFasta/genome.fa
  - Make sure .fai and .dict index files are present in the same directory.
- A high-confidence BED file (included in data/):
  - Example: data/ConfidentRegions/ConfidentRegions.bed
- For exome only: a target regions BED file from your capture kit vendor:
  - Example: data/ConfidentRegions/DRAGEN_Illumina_exome/hg38_Twist_Bioscience_for_Illumina_Exome_2_5_Mito.bed
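
If you would like to confirm the prerequisites before launching a run, a small Python check along these lines may help. It is only a sketch and not part of happy-cli: the function names are hypothetical, and the index naming (genome.fa.fai, plus either genome.dict or genome.fa.dict) is an assumption based on common conventions, so adjust it if your reference is laid out differently.

```python
import shutil
from pathlib import Path


def docker_on_path():
    """Return True if a `docker` executable can be found on PATH."""
    return shutil.which("docker") is not None


def missing_inputs(paths):
    """Return the paths (as strings) that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


def missing_reference_indexes(fasta):
    """Return the names of index files that appear to be missing.

    Assumed naming: <fasta>.fai for the FASTA index, and either
    <stem>.dict or <fasta>.dict for the sequence dictionary.
    """
    fasta = Path(fasta)
    missing = []
    fai = Path(str(fasta) + ".fai")
    if not fai.exists():
        missing.append(fai.name)
    dict_candidates = [fasta.with_suffix(".dict"), Path(str(fasta) + ".dict")]
    if not any(c.exists() for c in dict_candidates):
        missing.append(dict_candidates[0].name)
    return missing


if __name__ == "__main__":
    # Placeholder paths copied from the examples in this guide; replace them.
    reference = "/path/to/hg38_reference/WholeGenomeFasta/genome.fa"
    inputs = [
        "data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz",
        "/path/to/dragen_output/P23_001471.hard-filtered.vcf.gz",
        reference,
        "data/ConfidentRegions/ConfidentRegions.bed",
    ]
    print("docker on PATH:", docker_on_path())
    print("missing inputs:", missing_inputs(inputs))
    print("missing reference indexes:", missing_reference_indexes(reference))
```

An empty list from both checks, together with Docker being found, suggests the inputs are in place. It does not verify that the Docker daemon is actually running.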
- Open a terminal.
- Navigate to the project directory:
      cd /path/to/happy-cli
- Run the command:
      happy \
        data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
        /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
        -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
        -f data/ConfidentRegions/ConfidentRegions.bed \
        -o /path/to/output/NA12877_vs_DRAGEN
  For exome data, add -T with your capture kit target regions BED, and use --engine vcfeval with --pass-only for more accurate results:
      happy \
        data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
        /path/to/dragen_exome_output/P23_001471.hard-filtered.vcf.gz \
        -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
        -f data/ConfidentRegions/ConfidentRegions.bed \
        -T /path/to/exome_capture_targets.bed \
        -o /path/to/output/NA12877_vs_DRAGEN_exome \
        --engine vcfeval \
        --pass-only
  What -f and -T do:
  - -f (confident regions): defines where the truth set is reliable. Variants outside are classified as unknown, not false positives.
  - -T (target regions): restricts analysis to your exome capture footprint. Variants outside are removed entirely.
  - hap.py intersects them internally, so there is no need to pre-intersect with bedtools.

  All paths are normal paths on your machine. The tool handles Docker volume mounting automatically.
- Preview first (optional): Add --dry-run to see the Docker command without running it:
      happy \
        data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
        /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
        -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
        -f data/ConfidentRegions/ConfidentRegions.bed \
        -o /path/to/output/NA12877_vs_DRAGEN \
        --dry-run
- Run in background (optional): Add -bg to run the process in the background. Output is logged to happy_YYYYMMDD_HHMMSS.log:
      happy \
        data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
        /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
        -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
        -f data/ConfidentRegions/ConfidentRegions.bed \
        -o /path/to/output/NA12877_vs_DRAGEN \
        -bg
  You can then check progress with tail -f happy_YYYYMMDD_HHMMSS.log.
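
To make the effect of -f and -T in the run step more concrete, here is a toy Python model of how a single call from your VCF would be treated, following the description in this guide. It is a deliberate simplification and not hap.py's actual implementation: the function names are hypothetical, variants are reduced to a single position, and intervals follow the BED convention of 0-based, half-open coordinates.

```python
def in_any(pos, intervals):
    """True if 0-based `pos` lies inside any half-open [start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def classify_query_call(pos, matches_truth, confident, targets=None):
    """Toy classification of one call from the VCF under evaluation.

    - outside the target regions (-T): removed from the analysis
    - outside the confident regions (-f): unknown, not a false positive
    - otherwise: TP if it matches the truth set, FP if it does not
    """
    if targets is not None and not in_any(pos, targets):
        return "removed"
    if not in_any(pos, confident):
        return "UNK"
    return "TP" if matches_truth else "FP"


if __name__ == "__main__":
    confident = [(100, 200)]  # stands in for the -f BED
    targets = [(150, 300)]    # stands in for the -T BED
    for pos, matches in [(120, False), (250, False), (160, False), (160, True)]:
        print(pos, classify_query_call(pos, matches, confident, targets))
```

Only calls that fall inside both sets of regions can count as true or false positives, which is why a pre-intersected BED is unnecessary.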
After it runs, the following output files will be created:
- NA12877_vs_DRAGEN.summary.csv: summary table with precision, recall, F1
- NA12877_vs_DRAGEN.vcf.gz: annotated comparison VCF
- NA12877_vs_DRAGEN.json: detailed metrics
- NA12877_vs_DRAGEN.log: run log

These files show how many variants matched, how many were missed, and how many false positives were found.
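
If you prefer to pull the headline numbers out of the summary table programmatically, something like the following Python sketch may be a starting point. The column names (Type, Filter, METRIC.Recall, METRIC.Precision, METRIC.F1_Score) are assumed from hap.py's usual summary layout, so check them against the header of your own file. The sample rows are made-up numbers for illustration only.

```python
import csv
import io

# Made-up example rows, NOT real results; column names are assumptions.
SAMPLE = """Type,Filter,METRIC.Recall,METRIC.Precision,METRIC.F1_Score
SNP,ALL,0.95,0.969388,0.959596
SNP,PASS,0.94,0.979167,0.959184
INDEL,PASS,0.90,0.92,0.909890
"""


def read_pass_metrics(handle):
    """Return {variant type: (recall, precision, f1)} for rows with Filter == PASS."""
    metrics = {}
    for row in csv.DictReader(handle):
        if row.get("Filter") != "PASS":
            continue
        metrics[row["Type"]] = (
            float(row["METRIC.Recall"]),
            float(row["METRIC.Precision"]),
            float(row["METRIC.F1_Score"]),
        )
    return metrics


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # For a real run, replace io.StringIO(SAMPLE) with
    # open("/path/to/output/NA12877_vs_DRAGEN.summary.csv", newline="").
    for vtype, (recall, precision, f1) in read_pass_metrics(io.StringIO(SAMPLE)).items():
        print(f"{vtype}: recall={recall:.4f} precision={precision:.4f} F1={f1:.4f}")
```

Recomputing F1 from precision and recall with f1_score is a quick sanity check that you are reading the intended columns.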
- If you see "Docker is not installed or not in PATH", install Docker first.
- If you see "File not found", double-check that the path exists.
- If you see "no space left on device", free up disk space.
- If you see "Please specify a valid reference path using -r", make sure your FASTA file has .fai and .dict index files in the same directory.
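
For the "no space left on device" case, a quick way to see how much room is left where the output will be written is sketched below in Python. The function names and the 10 GiB threshold are arbitrary placeholders for illustration, not requirements stated by hap.py or happy-cli.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path=".", required_gib=10.0):
    """True if at least `required_gib` GiB are free (threshold is a placeholder)."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # Point this at the directory you pass to -o; "." is only a default.
    print(f"free space: {free_gib('.'):.1f} GiB, enough: {has_room('.')}")
```

Docker may store images and temporary data on a different filesystem from your output directory, so it can be worth checking that location as well.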
Still stuck? Ask a bioinformatics colleague or open an issue in the project repository.