HOW TO: Benchmark Variant Calls Using hap.py

This guide shows you how to compare a VCF file generated by your pipeline (e.g. DRAGEN) against a trusted truth set using hap.py, all inside a Docker container.

The only software you need to install is Docker, Python, and the happy-cli wrapper described below, and you don't need to be a bioinformatician to follow along.

What You Need Before You Begin

Make sure you have:

- Docker installed and running.
  - Run docker --version to check.
  - Pull the hap.py image: docker pull pkrusche/hap.py
- happy-cli installed.
  - From the project directory: pip install -e .
  - Verify: happy --help
- A known truth VCF (included in data/):
  - Example: data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz
- The VCF you want to evaluate, produced by your pipeline:
  - Example: /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz
- A reference genome in FASTA format:
  - Example: /path/to/hg38_reference/WholeGenomeFasta/genome.fa
  - Make sure .fai and .dict index files are present in the same directory.
- A high-confidence BED file (included in data/):
  - Example: data/ConfidentRegions/ConfidentRegions.bed
- For exome only: a target regions BED file from your capture kit vendor:
  - Example: data/ConfidentRegions/DRAGEN_Illumina_exome/hg38_Twist_Bioscience_for_Illumina_Exome_2_5_Mito.bed
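
The checklist above can be turned into a quick preflight check before a long run. The following is a minimal sketch and is not part of happy-cli: the function name preflight and its arguments are hypothetical, and it only checks that the docker executable is on the PATH and that each input file exists.

```python
import shutil
from pathlib import Path


def preflight(input_paths, docker_cmd="docker"):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration only; it is not part of happy-cli.
    """
    problems = []
    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which(docker_cmd) is None:
        problems.append(f"{docker_cmd} is not installed or not in PATH")
    for raw in input_paths:
        path = Path(raw).expanduser()
        if not path.is_file():
            problems.append(f"File not found: {path}")
    return problems
```

Pass it the truth VCF, the query VCF, the reference FASTA, and the BED files you plan to use, and fix anything it reports before starting the benchmark.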

How to Run

Open a terminal.
Navigate to the project directory:
```
cd /path/to/happy-cli
```

Run the command that matches your data type:

Whole Genome (WGS)

happy \
  data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
  /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
  -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
  -f data/ConfidentRegions/ConfidentRegions.bed \
  -o /path/to/output/NA12877_vs_DRAGEN

Exome (WES)

For exome data, add -T with the target regions BED file for your capture kit, and use --engine vcfeval together with --pass-only (which restricts the comparison to PASS variants) for more accurate results:

happy \
  data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
  /path/to/dragen_exome_output/P23_001471.hard-filtered.vcf.gz \
  -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
  -f data/ConfidentRegions/ConfidentRegions.bed \
  -T /path/to/exome_capture_targets.bed \
  -o /path/to/output/NA12877_vs_DRAGEN_exome \
  --engine vcfeval \
  --pass-only

What -f and -T do:

-f (confident regions): defines where the truth set is reliable. Query variants outside these regions are reported as unknown (UNK) rather than counted as false positives.
-T (target regions): restricts the analysis to your exome capture footprint. Variants outside these regions are excluded from the comparison entirely.
hap.py intersects the two region sets internally, so there is no need to pre-intersect them with bedtools.
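
To make the difference between -f and -T concrete, here is a toy model of what happens to a query call that does not match the truth set. It is a deliberate simplification and not hap.py's actual logic: it handles single positions only, and the function names are hypothetical. It does respect the coordinate conventions of the two formats (BED intervals are 0-based and half-open, VCF positions are 1-based).

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based VCF position lies inside any 0-based, half-open BED interval."""
    zero_based = pos - 1
    return any(c == chrom and start <= zero_based < end for c, start, end in regions)


def classify_unmatched_call(chrom, pos, confident, targets=None):
    """Toy classification of a query call that has no match in the truth set.

    confident: intervals from the -f BED file, as (chrom, start, end) tuples.
    targets:   intervals from the -T BED file, or None when -T is not given.
    """
    if targets is not None and not in_regions(chrom, pos, targets):
        return "excluded"  # outside -T: removed from the analysis
    if not in_regions(chrom, pos, confident):
        return "UNK"       # outside -f: unknown, not a false positive
    return "FP"            # inside both: counted as a false positive
```

With confident = [("chr1", 100, 200)] and targets = [("chr1", 150, 300)], an unmatched call at chr1:160 is an FP, one at chr1:250 is UNK, and one at chr1:120 is excluded.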

All paths are normal paths on your machine. The tool handles Docker volume mounting automatically.
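
For readers curious about what automatic volume mounting involves, the sketch below shows one way a wrapper could translate host paths into container paths. This is not happy-cli's actual implementation: the function name plan_mounts and the /mnt/<n> layout are assumptions made for illustration. Use --dry-run to see the command the tool really builds.

```python
from pathlib import Path


def plan_mounts(host_paths):
    """Map host file paths to container paths, mounting each parent directory once.

    Hypothetical sketch; the /mnt/<n> layout is an assumption, not happy-cli's scheme.
    """
    mounts = {}  # host directory -> container directory
    container_paths = []
    for raw in host_paths:
        path = Path(raw).expanduser().resolve()
        # Reuse the mount point if this directory was already seen.
        container_dir = mounts.setdefault(path.parent, f"/mnt/{len(mounts)}")
        container_paths.append(f"{container_dir}/{path.name}")
    volume_args = []
    for host_dir, container_dir in mounts.items():
        volume_args += ["-v", f"{host_dir}:{container_dir}"]
    return volume_args, container_paths
```

The returned volume arguments would be placed in a docker run call, and the container paths would replace the host paths in the hap.py command line.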

Preview first (optional): Add --dry-run to see the Docker command without running it:

happy \
  data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
  /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
  -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
  -f data/ConfidentRegions/ConfidentRegions.bed \
  -o /path/to/output/NA12877_vs_DRAGEN \
  --dry-run

Run in background (optional): Add -bg to run the process in the background. Output is logged to happy_YYYYMMDD_HHMMSS.log:

happy \
  data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
  /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
  -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
  -f data/ConfidentRegions/ConfidentRegions.bed \
  -o /path/to/output/NA12877_vs_DRAGEN \
  -bg

You can then check progress with tail -f happy_YYYYMMDD_HHMMSS.log.
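
If you have started several background runs, the timestamped names make it easy to find the most recent log. The helper below is a small sketch: the function name is hypothetical, and it assumes the logs are written to the directory you pass in (the guide does not say where they are written, so adjust the path if needed).

```python
from pathlib import Path


def latest_happy_log(directory="."):
    """Return the newest happy_YYYYMMDD_HHMMSS.log in a directory, or None if there is none."""
    # Zero-padded timestamps sort chronologically when sorted as plain text.
    logs = sorted(Path(directory).glob("happy_????????_??????.log"))
    return logs[-1] if logs else None
```

The returned path can then be passed to tail -f.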

What You'll Get

When the run finishes, the following files are written using the prefix you passed to -o:

NA12877_vs_DRAGEN.summary.csv — summary table with precision, recall, F1
NA12877_vs_DRAGEN.vcf.gz — annotated comparison VCF
NA12877_vs_DRAGEN.json — detailed metrics
NA12877_vs_DRAGEN.log — run log

Together, these files show how many variants matched the truth set (true positives), how many truth variants were missed (false negatives), and how many calls were false positives.
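
To pull the headline numbers out of the summary table, a few lines of Python are enough. This sketch assumes the hap.py-style column names Type, Filter, METRIC.Recall, METRIC.Precision, and METRIC.F1_Score; check the header row of your own summary.csv before relying on them. The function name read_summary is hypothetical.

```python
import csv
import io


def _to_float(value):
    """Convert a CSV cell to float, treating an empty cell as NaN."""
    return float(value) if value else float("nan")


def read_summary(csv_text, filter_name="PASS"):
    """Return {variant type: {"recall", "precision", "f1"}} for rows with the given Filter.

    Column names are assumed to follow hap.py's summary format.
    """
    metrics = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Filter"] != filter_name:
            continue
        metrics[row["Type"]] = {
            "recall": _to_float(row["METRIC.Recall"]),
            "precision": _to_float(row["METRIC.Precision"]),
            "f1": _to_float(row["METRIC.F1_Score"]),
        }
    return metrics
```

Read the file with pathlib.Path("/path/to/output/NA12877_vs_DRAGEN.summary.csv").read_text() and pass the text to read_summary; the result is keyed by the values in the Type column.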

Need Help?

If you see "Docker is not installed or not in PATH", install Docker first.
If you see "File not found", double-check that the path exists.
If you see "no space left on device", free up disk space.
If you see "Please specify a valid reference path using -r", make sure your FASTA file has .fai and .dict index files in the same directory.
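
For the last error, the sketch below checks which index files are missing next to a FASTA file and suggests commands to create them. It assumes samtools is available (this guide does not otherwise require it) and that the sequence dictionary is named either genome.dict or genome.fa.dict; both the function name and the naming assumption are illustrative.

```python
from pathlib import Path


def missing_index_commands(fasta):
    """Return samtools commands (as argument lists) for index files that are missing."""
    fasta = Path(fasta)
    commands = []
    fai = Path(str(fasta) + ".fai")  # e.g. genome.fa.fai
    if not fai.is_file():
        commands.append(["samtools", "faidx", str(fasta)])
    # Assumption: the dictionary is named genome.dict or genome.fa.dict.
    dict_candidates = [fasta.with_suffix(".dict"), Path(str(fasta) + ".dict")]
    if not any(p.is_file() for p in dict_candidates):
        commands.append(["samtools", "dict", "-o", str(dict_candidates[0]), str(fasta)])
    return commands
```

An empty list means both index files were found. Otherwise, each returned list can be run as-is, for example with subprocess.run(cmd, check=True).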

Still stuck? Ask a bioinformatics colleague or open an issue in the project repository.
