LuoGroup2023/HapFold

HapFold: Efficient and Accurate T2T-level Haplotype Reconstruction

Description

HapFold is a hybrid scaffolding framework for chromosome-scale, near-T2T haplotype reconstruction of large diploid genomes, including human, animal, and plant genomes.

HapFold integrates assembly graph topology with long-range Hi-C/Pore-C signals to achieve robust haplotype-aware scaffolding. Its performance mainly depends on the quality and completeness of the input unitig graph, and it is particularly effective for graphs containing well-resolved bubble-chain structures.

By combining graph-based haplotype information with sequence-based scaffolding, HapFold reduces chromosomal misassignments and improves assembly continuity, enabling accurate and scalable reconstruction of near-T2T diploid haplotypes.

Installation and Dependencies

Prerequisites

  • g++ (GCC 9.4 or later)
  • zlib
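Before compiling from source, it can help to confirm that the required build tools are on the PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and is not part of HapFold, and the check covers commands only (zlib is a library, so its headers need to be verified separately).

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is an illustrative helper, not a HapFold command.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools used when compiling from source.
missing=$(missing_tools g++ make git)
if [ -n "$missing" ]; then
  echo "Missing tools: $missing" >&2
fi
```

An empty result means all listed commands were found.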

1. Install via Bioconda (Recommended)

HapFold is officially available on Bioconda. This is the fastest and easiest way to install the tool:

# Create and activate a new environment
conda create -n hapfold
conda activate hapfold

# Install HapFold
conda install -c conda-forge -c bioconda hapfold

# Alternatively, you can use mamba for faster dependency resolution:
# mamba install -c conda-forge -c bioconda hapfold

Typical installation time: HapFold can usually be deployed within 1 minute via Bioconda on a standard Linux workstation or server with conda/mamba available.

2. Install from Source Code

If you prefer to compile from source:

git clone https://github.com/LuoGroup2023/HapFold.git
cd HapFold

# Compile the source code
make -j8

🚀 Quick Start & Workflow

HapFold uses a two-step workflow: mapping (the mapping command) followed by haplotype resolution (the scaffolding command).

🎯 Primary Application

HapFold is primarily designed for large diploid genome scaffolding, including human, animal, and plant genomes. It is especially suitable for assemblies generated by hifiasm, where the input graph contains informative bubble-chain structures that can be used for robust haplotype-aware scaffolding.

HapFold requires three primary GFA files from the initial assembly:

  1. Unphased unitig graph (*.p_utg.gfa)
  2. Haplotype 1 contig graph (*.hap1.p_ctg.gfa)
  3. Haplotype 2 contig graph (*.hap2.p_ctg.gfa)

The final scaffolding quality is strongly influenced by the quality of the input unitig graph. In general, HapFold performs better and more robustly when the unitig graph is more complete, better connected, and contains clear bubble-chain structures.
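Since all three graphs are needed, a quick pre-flight check can catch a missing or truncated file before a long run. The sketch below is only an illustration: the function names are hypothetical, and the only assumption about the files is that they follow the GFA convention of tab-separated lines whose first field is S for segments.

```shell
# Count segment (S) records in a GFA file.
count_segments() {
  awk -F'\t' '$1 == "S" { n++ } END { print n + 0 }' "$1"
}

# Return non-zero if any given GFA file is missing, empty, or has no segments.
check_gfa_inputs() {
  status=0
  for gfa in "$@"; do
    if [ ! -s "$gfa" ]; then
      echo "missing or empty: $gfa" >&2
      status=1
    elif [ "$(count_segments "$gfa")" -eq 0 ]; then
      echo "no S records: $gfa" >&2
      status=1
    else
      echo "$gfa: $(count_segments "$gfa") segments"
    fi
  done
  return "$status"
}

# Example call (file names are placeholders for your own assembly prefix):
#   check_gfa_inputs asm.p_utg.gfa asm.hap1.p_ctg.gfa asm.hap2.p_ctg.gfa
```

This only verifies that the files are present and contain segments; it says nothing about graph completeness or bubble-chain quality.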

General Usage

Usage: HapFold <command> <arguments> <inputs>

Commands:
  scaffolding            use Hi-C/Pore-C data to resolve haplotypes
  mapping                map Hi-C/Pore-C data to sequences in the graph
  version                print version number

Step 1: Hi-C Mapping (mapping)

Before mapping, you need to extract the node sequences from your hifiasm unitig graph into a FASTA file:

awk '/^S/{print ">"$2;print $3}' hifiasm_p_utg.gfa > hifiasm_p_utg.fa
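To confirm that the extraction produced one FASTA record per graph node, the conversion can be wrapped and checked as in the sketch below. The function names are hypothetical; the conversion matches the first tab-separated field exactly against S, which is a slightly stricter variant of the command above rather than a replacement for it.

```shell
# Convert GFA segment (S) lines to FASTA records; GFA fields are tab-separated.
gfa_to_fasta() {
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Succeed only if the FASTA file has exactly one record per GFA segment.
fasta_matches_gfa() {
  n_seg=$(awk -F'\t' '$1 == "S" { n++ } END { print n + 0 }' "$1")
  n_rec=$(awk '/^>/ { n++ } END { print n + 0 }' "$2")
  [ "$n_seg" -eq "$n_rec" ]
}

# Example call (file names as used in this step):
#   gfa_to_fasta hifiasm_p_utg.gfa > hifiasm_p_utg.fa
#   fasta_matches_gfa hifiasm_p_utg.gfa hifiasm_p_utg.fa && echo "record counts match"
```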

Then, map the raw Hi-C reads to these node sequences:

HapFold mapping -t 32 -1 hic.R1.fastq.gz -2 hic.R2.fastq.gz -o mapping.txt hifiasm_p_utg.fa

Key Options for mapping:

  • -1 FILE, -2 FILE: (Required) Paths to the Hi-C forward (R1) and reverse (R2) reads.
  • -t INT: Number of worker threads [32].
  • -o FILE: Output file for the mapping results (e.g., mapping.txt).
  • -k INT: k-mer size [31].
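Because mapping is the longest-running step, it may be worth checking the inputs and printing the exact command before launching it. The sketch below does only that and executes nothing; mapping_command is a hypothetical helper, and it uses only the -t, -1, -2, and -o options listed above.

```shell
# Print the HapFold mapping command for a read pair after checking the inputs.
# Arguments: R1 reads, R2 reads, node FASTA, output file, threads (default 32).
# Nothing is executed; paths containing spaces would need extra quoting.
mapping_command() {
  r1=$1; r2=$2; fasta=$3; out=$4; threads=${5:-32}
  for f in "$r1" "$r2" "$fasta"; do
    [ -s "$f" ] || { echo "missing or empty input: $f" >&2; return 1; }
  done
  printf 'HapFold mapping -t %s -1 %s -2 %s -o %s %s\n' \
    "$threads" "$r1" "$r2" "$out" "$fasta"
}

# Example call:
#   mapping_command hic.R1.fastq.gz hic.R2.fastq.gz hifiasm_p_utg.fa mapping.txt 32
```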

Step 2: Haplotype Resolution (scaffolding)

Once the mapping is complete, use the mapping results alongside the GFA files to resolve haplotypes and build chromosome-scale scaffolds.

Usage: HapFold scaffolding [options] <mapping.txt> <unitig.gfa> <output_dir> -1 *.hap1.p_ctg.gfa -2 *.hap2.p_ctg.gfa

(Positional arguments, in order: mapping result from Step 1, unphased unitig graph (*.p_utg.gfa), output directory.)

Key Options for scaffolding:

Option                                Description
-t INT                                Number of threads [8].
-n INT                                Expected number of chromosomes (e.g., 46 for human, 78 for chicken) [0].
-1 FILE                               (Required) Path to the haplotype 1 GFA file (*.hap1.p_ctg.gfa).
-2 FILE                               (Required) Path to the haplotype 2 GFA file (*.hap2.p_ctg.gfa).
-u FILE                               Path to the utg_to_ctg relationship file. Highly recommended for accurate graph traversal.
-i BOOL                               Enable the identity check on contigs (true/false) [false].
-f FILE                               Path to a precomputed identity file; if omitted while -i is true, the identity check is run automatically.
-e STR                                Restriction enzyme sites, separated by commas (e.g., GATC,GANTC) [ ].
-c FILE                               Path to contig_hap_nodes.txt (required for specific Hi-C phasing modes).
-d, --debug                           Enable debug mode, which runs internal test functions.
--hic_scaffold_threshold_ratio FLOAT  Threshold ratio for sequence-based Hi-C scaffolding extensions [0.60].
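The examples given for -n (46 for human, 78 for chicken) and the value used in the rice demo (24) are all diploid (2n) chromosome counts, which suggests that -n expects the total across both haplotypes. That reading is an inference from the examples rather than a documented rule. Under that assumption, the value can be derived from the haploid number as follows; diploid_count is a hypothetical helper.

```shell
# Diploid chromosome count (2n) from the haploid number (n).
# Assumes -n takes the total across both haplotypes, as the README examples suggest.
diploid_count() {
  echo $(( 2 * $1 ))
}

diploid_count 23   # human: prints 46
diploid_count 39   # chicken: prints 78
diploid_count 12   # rice: prints 24
```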

Demo with rice test data

A small rice test dataset is provided in the package to test the scaffolding step of HapFold. The demo data include three GFA files generated from the initial assembly and a precomputed mapping.txt file.

cd rice-test-data

HapFold scaffolding \
  -t 32 \
  -n 24 \
  mapping.txt \
  rice.p_utg.gfa \
  rice_hapfold_out \
  -1 rice.hap1.p_ctg.gfa \
  -2 rice.hap2.p_ctg.gfa

The expected output files include:

rice_hapfold_out/hap_contig.fa
rice_hapfold_out/scaffold.fa
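A quick way to inspect these outputs is to count the records and total bases in each FASTA file. The sketch below assumes nothing about HapFold beyond the two file names listed above; fasta_stats is a hypothetical helper.

```shell
# Print "<number of records><TAB><total bases>" for a FASTA file.
fasta_stats() {
  awk '/^>/ { n++; next } { bases += length($0) } END { printf "%d\t%d\n", n, bases }' "$1"
}

# Example call:
#   fasta_stats rice_hapfold_out/hap_contig.fa
#   fasta_stats rice_hapfold_out/scaffold.fa
```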

To obtain the final complete assembly, merge the unresolved single-chain sequences and phased scaffolds:

cd rice_hapfold_out
cat hap_contig.fa scaffold.fa > all_scaffold.fa

Expected demo run time: the rice test demo is expected to finish within several minutes on a standard Linux workstation or server using 32 CPU threads.

Expected Outputs & Key Notes

After the scaffolding command completes, HapFold writes its sequence files to the <output_dir> you specified.

⚠️ CRITICAL NOTICE on Haplotype Completeness

To obtain the complete genome assembly, you MUST combine both the unresolved single chains and the phased scaffolds.

HapFold splits the output based on graph topologies (Single Chains vs. Bubble Chains):

  1. hap_contig.fa (Single Chains): Contains sequences derived from unresolved single chains in the graph. This typically includes sex chromosomes (X/Y or Z/W) and ultra-conserved homozygous blocks where the two haplotypes cannot be distinguished by graph topology or by Hi-C/Pore-C proximity signal.
  2. scaffold.fa (or phasing_hap1.fa / phasing_hap2.fa) (Bubble Chains): Contains chromosome-scale scaffolds resolved from bubble chains, representing the successfully phased autosomal diploid regions.

Assembly Completeness Formula

Run the following commands to merge both components into the final complete assembly:

cd <output_dir>
cat hap_contig.fa scaffold.fa > all_scaffold.fa
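The merge can also be wrapped with a basic sanity check. One thing worth guarding against is an input file that lacks a trailing newline, in which case cat would glue its last sequence line to the next header and silently drop a record. The sketch below compares record counts before and after merging; merge_assembly is a hypothetical helper, and only the file names given above are assumed.

```shell
# Count FASTA records (header lines) in a file.
count_records() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Merge hap_contig.fa and scaffold.fa in the given output directory into
# all_scaffold.fa, and fail if any record was lost in the merge.
merge_assembly() {
  dir=$1
  for f in "$dir/hap_contig.fa" "$dir/scaffold.fa"; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
  cat "$dir/hap_contig.fa" "$dir/scaffold.fa" > "$dir/all_scaffold.fa"
  n_in=$(( $(count_records "$dir/hap_contig.fa") + $(count_records "$dir/scaffold.fa") ))
  n_out=$(count_records "$dir/all_scaffold.fa")
  [ "$n_in" -eq "$n_out" ]
}

# Example call:
#   merge_assembly rice_hapfold_out && echo "merged"
```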

Citation

If you use HapFold in your research, please cite:

@article{liu2026efficient,
  title={Efficient and accurate near telomere-to-telomere haplotype reconstruction of diploid genomes},
  author={Liu, Yuansheng and Li, Yichen and Xu, Jialu and Tan, Zhongzheng and Zhang, Wenhai and Wang, Long and Xu, Luohao and Zeng, Xiangxiang and Schoenhuth, Alexander and Luo, Xiao},
  journal={bioRxiv},
  pages={2026--05},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
