HOW TO: Benchmark Variant Calls Using hap.py

This guide shows you how to compare a VCF file generated by your pipeline (e.g. DRAGEN) against a trusted truth set using hap.py, all inside a Docker container.

Apart from Docker, Python, and the happy-cli wrapper from this project, there is nothing to install, and you don't need to be a bioinformatician to follow along.


What You Need Before You Begin

Make sure you have:

  1. Docker installed and running.

    • Run docker --version to check.
    • Pull the hap.py image: docker pull pkrusche/hap.py
  2. happy-cli installed.

    • From the project directory: pip install -e .
    • Verify: happy --help
  3. A known truth VCF (included in data/):

    • Example: data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz
  4. The VCF you want to evaluate, produced by your pipeline:

    • Example: /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz
  5. A reference genome in FASTA format:

    • Example: /path/to/hg38_reference/WholeGenomeFasta/genome.fa
    • Make sure .fai and .dict index files are present in the same directory.
  6. A high-confidence BED file (included in data/):

    • Example: data/ConfidentRegions/ConfidentRegions.bed
  7. For exome only: a target regions BED file from your capture kit vendor:

    • Example: data/ConfidentRegions/DRAGEN_Illumina_exome/hg38_Twist_Bioscience_for_Illumina_Exome_2_5_Mito.bed
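
Before running anything, it can help to confirm that every input from the checklist above is actually in place. The following is a minimal sketch, not part of happy-cli: the function name `missing_inputs` is hypothetical, and it assumes the sequence dictionary is named either `genome.dict` or `genome.fa.dict` (both conventions exist, so both are checked).

```python
from pathlib import Path


def missing_inputs(truth_vcf, query_vcf, reference, confident_bed, target_bed=None):
    """Return a list of problems; an empty list means all inputs look fine."""
    problems = []
    required = {
        "truth VCF": truth_vcf,
        "query VCF": query_vcf,
        "reference FASTA": reference,
        "confident regions BED": confident_bed,
    }
    if target_bed is not None:  # only needed for exome runs
        required["target regions BED"] = target_bed
    for label, path in required.items():
        if not Path(path).is_file():
            problems.append(f"{label} not found: {path}")

    ref = Path(reference)
    # The FASTA index sits next to the FASTA as <name>.fai.
    if not Path(str(ref) + ".fai").is_file():
        problems.append(f"FASTA index not found: {ref}.fai")
    # Assumption: the dictionary is either genome.dict or genome.fa.dict.
    dict_candidates = [ref.with_suffix(".dict"), Path(str(ref) + ".dict")]
    if not any(p.is_file() for p in dict_candidates):
        problems.append(f"sequence dictionary (.dict) not found next to {ref}")
    return problems


if __name__ == "__main__":
    # Example paths copied from the checklist; replace them with your own.
    for problem in missing_inputs(
        "data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz",
        "/path/to/dragen_output/P23_001471.hard-filtered.vcf.gz",
        "/path/to/hg38_reference/WholeGenomeFasta/genome.fa",
        "data/ConfidentRegions/ConfidentRegions.bed",
    ):
        print(problem)
```

If the script prints nothing, all files were found. Pass `target_bed=...` as well when preparing an exome run.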

How to Run

  1. Open a terminal.

  2. Navigate to the project directory:

    cd /path/to/happy-cli
  3. Run the command:

    Whole Genome (WGS)

    happy \
      data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
      /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
      -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
      -f data/ConfidentRegions/ConfidentRegions.bed \
      -o /path/to/output/NA12877_vs_DRAGEN

    Exome (WES)

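    For exome data, add -T with your capture kit's target regions BED file. Using --engine vcfeval (variant comparison with RTG vcfeval) together with --pass-only (ignore calls that did not pass filters) is recommended for more accurate results: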

    happy \
      data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
      /path/to/dragen_exome_output/P23_001471.hard-filtered.vcf.gz \
      -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
      -f data/ConfidentRegions/ConfidentRegions.bed \
      -T /path/to/exome_capture_targets.bed \
      -o /path/to/output/NA12877_vs_DRAGEN_exome \
      --engine vcfeval \
      --pass-only

    What -f and -T do:

    • -f (confident regions): defines where the truth set is reliable. Query variants outside these regions are classified as unknown (UNK), not as false positives.
    • -T (target regions): restricts the analysis to your exome capture footprint. Variants outside these regions are removed entirely.
    • hap.py intersects the two internally, so there is no need to pre-intersect them with bedtools.

    All paths are ordinary paths on your host machine. The tool takes care of mounting them into the Docker container automatically.

  4. Preview first (optional): Add --dry-run to see the Docker command without running it:

    happy \
      data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
      /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
      -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
      -f data/ConfidentRegions/ConfidentRegions.bed \
      -o /path/to/output/NA12877_vs_DRAGEN \
      --dry-run
  5. Run in background (optional): Add -bg to run the process in the background. Output is logged to happy_YYYYMMDD_HHMMSS.log:

    happy \
      data/PlatinumGenomesIllumina/vcf/NA12877.vcf.gz \
      /path/to/dragen_output/P23_001471.hard-filtered.vcf.gz \
      -r /path/to/hg38_reference/WholeGenomeFasta/genome.fa \
      -f data/ConfidentRegions/ConfidentRegions.bed \
      -o /path/to/output/NA12877_vs_DRAGEN \
      -bg

    You can then check progress with tail -f happy_YYYYMMDD_HHMMSS.log.
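
To make the difference between -f and -T from step 3 concrete, here is a deliberately simplified sketch of what happens to a query variant that does not match the truth set. It is an illustration of the rules described above, not hap.py's actual implementation; the function names and the example coordinates are hypothetical. BED intervals are 0-based and half-open, while VCF positions are 1-based.

```python
def in_regions(chrom, pos, regions):
    """True if 1-based position `pos` lies in any BED interval (0-based, half-open)."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def classify_unmatched_query_variant(chrom, pos, confident, targets=None):
    """Label a query variant that has no match in the truth set."""
    if targets is not None and not in_regions(chrom, pos, targets):
        return "removed"  # outside -T: dropped from the analysis
    if not in_regions(chrom, pos, confident):
        return "UNK"      # outside -f: unknown, not a false positive
    return "FP"           # inside both: a genuine false positive


# Hypothetical regions: a confident interval nested in a larger capture target.
confident = [("chr1", 100, 200)]
targets = [("chr1", 50, 300)]
for pos in (150, 250, 400):
    print(pos, classify_unmatched_query_variant("chr1", pos, confident, targets))
```

Position 150 lies in both region sets, 250 only in the target, and 400 in neither, so the three calls cover each outcome once.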


What You'll Get

After the run finishes, the following output files are created, named after the prefix given to -o (NA12877_vs_DRAGEN in the WGS example):

  • NA12877_vs_DRAGEN.summary.csv — summary table with precision, recall, F1
  • NA12877_vs_DRAGEN.vcf.gz — annotated comparison VCF
  • NA12877_vs_DRAGEN.json — detailed metrics
  • NA12877_vs_DRAGEN.log — run log

Together, these files show how many truth variants were matched (true positives), how many were missed (false negatives), and how many calls in your VCF were false positives.
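
For a quick look at the headline numbers, the summary CSV can be read with a few lines of Python. This is a minimal sketch that assumes the file keeps hap.py's usual summary columns (`Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision`, `METRIC.F1_Score`); if your file differs, adjust the column names. The function names are hypothetical.

```python
import csv


def _to_float(value):
    """Convert a CSV cell to float; empty cells become None."""
    return float(value) if value else None


def read_summary(path, filter_name="PASS"):
    """Return {variant type: (recall, precision, f1)} for rows with the given Filter."""
    metrics = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["Filter"] != filter_name:
                continue
            metrics[row["Type"]] = (
                _to_float(row["METRIC.Recall"]),
                _to_float(row["METRIC.Precision"]),
                _to_float(row["METRIC.F1_Score"]),
            )
    return metrics


def f1_score(recall, precision):
    """F1 is the harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    import sys

    # Usage: python read_summary.py /path/to/output/NA12877_vs_DRAGEN.summary.csv
    for vtype, (recall, precision, f1) in read_summary(sys.argv[1]).items():
        print(f"{vtype}: recall={recall} precision={precision} F1={f1}")
```

Recall is the fraction of truth variants your pipeline found, precision is the fraction of your calls that were correct, and F1 combines the two into a single score.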


Need Help?

  • If you see "Docker is not installed or not in PATH", install Docker first.
  • If you see "File not found", double-check that the path exists.
  • If you see "no space left on device", free up disk space.
  • If you see "Please specify a valid reference path using -r", make sure your FASTA file has .fai and .dict index files in the same directory.

Still stuck? Ask a bioinformatics colleague or open an issue in the project repository.