HapFold is a hybrid scaffolding framework for chromosome-scale, near-T2T haplotype reconstruction of large diploid genomes, including human, animal, and plant genomes.
HapFold integrates assembly graph topology with long-range Hi-C/Pore-C signals to achieve robust haplotype-aware scaffolding. Its performance mainly depends on the quality and completeness of the input unitig graph, and it is particularly effective for graphs containing well-resolved bubble-chain structures.
By combining graph-based haplotype information with sequence-based scaffolding, HapFold reduces chromosomal misassignments and improves assembly continuity, enabling accurate and scalable reconstruction of near-T2T diploid haplotypes.
- g++ (version 9.4 or later)
- zlib
HapFold is officially available on Bioconda. This is the fastest and easiest way to install the tool:
# Create and activate a new environment
conda create -n hapfold
conda activate hapfold
# Install HapFold
conda install -c bioconda hapfold
# Alternatively, you can use mamba for faster dependency resolution:
# mamba install -c bioconda hapfold
Typical installation time: HapFold can usually be deployed within 1 minute via Bioconda on a standard Linux workstation or server with conda/mamba available.
If you prefer to compile from source:
git clone https://github.com/LuoGroup2023/HapFold.git
cd HapFold
# Compile the source code
make -j8
HapFold uses a two-step workflow: mapping (`HapFold mapping`) and resolving (`HapFold scaffolding`).
HapFold is primarily designed for large diploid genome scaffolding, including human, animal, and plant genomes. It is especially suitable for assemblies generated by hifiasm, where the input graph contains informative bubble-chain structures that can be used for robust haplotype-aware scaffolding.
HapFold requires three primary GFA files from the initial assembly:
- Unphased unitig graph (`*.p_utg.gfa`)
- Haplotype 1 contig graph (`*.hap1.p_ctg.gfa`)
- Haplotype 2 contig graph (`*.hap2.p_ctg.gfa`)
The final scaffolding quality is strongly influenced by the quality of the input unitig graph. In general, HapFold performs better and more robustly when the unitig graph is more complete, better connected, and contains clear bubble-chain structures.
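Before starting a run, it can help to confirm that all three graphs are actually present. The following is a minimal sketch, not part of HapFold: the function name `check_hapfold_inputs` is hypothetical, and it assumes the three files share a common prefix (as in the rice demo, `rice.p_utg.gfa`, `rice.hap1.p_ctg.gfa`, `rice.hap2.p_ctg.gfa`).

```shell
# Hypothetical helper (not part of HapFold): verify that the three input
# graphs exist and are non-empty for a given file prefix.
# Assumption: all three files share the same prefix.
check_hapfold_inputs() {
    local prefix="$1"
    local missing=0
    local suffix
    for suffix in p_utg.gfa hap1.p_ctg.gfa hap2.p_ctg.gfa; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing or empty: ${prefix}.${suffix}" >&2
            missing=1
        fi
    done
    # Exit status 0 if every file was found, 1 otherwise.
    return "$missing"
}
```

For example, `check_hapfold_inputs rice && echo "inputs look complete"` prints the message only when all three files are found.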
Usage: HapFold <command> <arguments> <inputs>
Commands:
scaffolding use Hi-C/Pore-C data to resolve haplotypes
mapping map Hi-C/Pore-C data to sequences in the graph
version print version number
Before mapping, you need to extract the node sequences from your hifiasm unitig graph into a FASTA file:
awk '/^S/{print ">"$2;print $3}' hifiasm_p_utg.gfa > hifiasm_p_utg.fa
Then, map the raw Hi-C reads to these node sequences:
HapFold mapping -t 32 -1 hic.R1.fastq.gz -2 hic.R2.fastq.gz -o mapping.txt hifiasm_p_utg.fa
Key Options for mapping:
- `-1 FILE`, `-2 FILE`: (Required) Paths to Hi-C forward (R1) and reverse (R2) reads.
- `-t INT`: Number of worker threads [32].
- `-o FILE`: Output file to save the mapping relationships (e.g., `map.out`).
- `-k INT`: k-mer size [31].
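The FASTA extraction that precedes mapping can be sanity-checked before spending compute on the Hi-C reads. The sketch below wraps the same `awk` command in a function and additionally counts segments whose sequence field is `*`, which the GFA format allows as a placeholder for an omitted sequence; such records would end up in the FASTA without usable bases. The function names `extract_unitigs` and `count_placeholder_segments` are hypothetical and not part of HapFold.

```shell
# Hypothetical helpers (not part of HapFold).

# Count S-lines whose sequence field is "*" (GFA placeholder for an omitted sequence).
count_placeholder_segments() {
    awk '/^S/ && $3 == "*" { n++ } END { print n + 0 }' "$1"
}

# Extract node sequences with the awk command from the step above, then
# report how many records were written and how many lack a real sequence.
extract_unitigs() {
    local gfa="$1"
    local fasta="$2"
    local n_records
    local n_placeholder
    awk '/^S/{print ">"$2; print $3}' "$gfa" > "$fasta"
    n_records=$(grep -c '^>' "$fasta")
    n_placeholder=$(count_placeholder_segments "$gfa")
    echo "records=${n_records} placeholder_sequences=${n_placeholder}"
    # Succeed only if every segment carried an actual sequence.
    [ "$n_placeholder" -eq 0 ]
}
```

A call such as `extract_unitigs hifiasm_p_utg.gfa hifiasm_p_utg.fa` writes the FASTA, prints the two counts, and returns a non-zero exit status if any placeholder sequences were found.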
Once the mapping is complete, use the mapping results alongside the GFA files to resolve haplotypes and build chromosome-scale scaffolds.
Usage: HapFold scaffolding [options] <mapping.txt> <assembly.gfa> <output_dir> -1 *.hap1.p_ctg.gfa -2 *.hap2.p_ctg.gfa
(Positional arguments: <mapping_result> <unitig.gfa> <output_directory>)
Key Options for scaffolding:
| Option | Description |
|---|---|
| `-t INT` | Number of threads [8]. |
| `-n INT` | Expected number of chromosomes (e.g., 46 for human, 78 for chicken) [0]. |
| `-1 FILE` | (Required) Path to haplotype 1 GFA file (`*.hap1.p_ctg.gfa`). |
| `-2 FILE` | (Required) Path to haplotype 2 GFA file (`*.hap2.p_ctg.gfa`). |
| `-u FILE` | Path to utg_to_ctg relationship file. Highly recommended for accurate graph traversing. |
| `-i BOOL` | Enable identity check on contigs (true/false) [false]. |
| `-f FILE` | Precomputed identity file path; if omitted but `-i true`, the check will run automatically. |
| `-e STR` | Restriction enzymes separated by comma (e.g., GATC,GANTC) [ ]. |
| `-c FILE` | Path to `contig_hap_nodes.txt` (required for specific Hi-C phasing modes). |
| `-d`, `--debug` | Enable debug mode to run internal test code functions. |
| `--hic_scaffold_threshold_ratio FLOAT` | Threshold ratio for sequence-based Hi-C scaffolding extensions [0.60]. |
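Because the scaffolding command mixes positional arguments with required `-1`/`-2` options, it can be convenient to assemble it from a few variables and inspect it before running. The sketch below only prints the command (a dry run) and uses no options beyond those listed in the table; the function name `build_scaffold_cmd` and the shared-prefix naming of the GFA files are assumptions for illustration.

```shell
# Hypothetical dry-run helper (not part of HapFold): print the scaffolding
# command for a given mapping file, GFA prefix, chromosome count, and output
# directory. Assumption: the three GFA files share the same prefix.
build_scaffold_cmd() {
    local mapping="$1"
    local prefix="$2"
    local n_chr="$3"
    local outdir="$4"
    local threads="${5:-8}"   # falls back to the documented default of 8
    printf 'HapFold scaffolding -t %s -n %s %s %s %s -1 %s -2 %s\n' \
        "$threads" "$n_chr" "$mapping" "${prefix}.p_utg.gfa" "$outdir" \
        "${prefix}.hap1.p_ctg.gfa" "${prefix}.hap2.p_ctg.gfa"
}
```

For instance, `build_scaffold_cmd mapping.txt rice 24 rice_hapfold_out 32` prints a command line that can be reviewed and then executed.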
A small rice test dataset is provided in the package to test the scaffolding step of HapFold. The demo data include three GFA files generated from the initial assembly and a precomputed mapping.txt file.
cd rice-test-data
HapFold scaffolding \
-t 32 \
-n 24 \
mapping.txt \
rice.p_utg.gfa \
rice_hapfold_out \
-1 rice.hap1.p_ctg.gfa \
-2 rice.hap2.p_ctg.gfa
The expected output files include:
rice_hapfold_out/hap_contig.fa
rice_hapfold_out/scaffold.fa
To obtain the final complete assembly, merge the unresolved single-chain sequences and phased scaffolds:
cd rice_hapfold_out
cat hap_contig.fa scaffold.fa > all_scaffold.fa
Expected demo run time: the rice test demo is expected to finish within several minutes on a standard Linux workstation or server using 32 CPU threads.
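The merge step can be followed by a quick consistency check: the merged file should contain exactly as many FASTA records as the two inputs combined. A mismatch usually means that one input did not end with a newline, so a header was fused onto the preceding sequence line. This is a minimal sketch; the function name `merge_hapfold_outputs` is hypothetical and not part of HapFold.

```shell
# Hypothetical helper (not part of HapFold): merge the two output FASTA files
# and verify that no record was lost or fused during concatenation.
merge_hapfold_outputs() {
    local outdir="$1"
    local n_single
    local n_phased
    local n_merged
    cat "${outdir}/hap_contig.fa" "${outdir}/scaffold.fa" > "${outdir}/all_scaffold.fa"
    # Count FASTA headers (lines starting with ">") in each file.
    n_single=$(grep -c '^>' "${outdir}/hap_contig.fa")
    n_phased=$(grep -c '^>' "${outdir}/scaffold.fa")
    n_merged=$(grep -c '^>' "${outdir}/all_scaffold.fa")
    echo "single_chains=${n_single} phased_scaffolds=${n_phased} merged=${n_merged}"
    # Succeed only if the merged count equals the sum of the inputs.
    [ "$n_merged" -eq $((n_single + n_phased)) ]
}
```

Running `merge_hapfold_outputs rice_hapfold_out` from the demo directory writes `all_scaffold.fa` into the output directory and returns a non-zero exit status if the record counts disagree.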
After successfully running the scaffolding command, HapFold generates several crucial sequence files in your specified <output_dir>.
To obtain the complete genome assembly, you MUST combine both the unresolved single chains and the phased scaffolds.
HapFold splits the output based on graph topologies (Single Chains vs. Bubble Chains):
- `hap_contig.fa` (Single Chains): Contains sequences derived from unresolved single chains in the graph. This typically includes sex chromosomes (X/Y or Z/W) and ultra-conserved homozygous blocks where haplotypes cannot be topologically or proximally differentiated.
- `scaffold.fa` (or `phasing_hap1.fa` / `phasing_hap2.fa`) (Bubble Chains): Contains chromosome-scale scaffolds resolved from bubble chains, representing the successfully phased autosomal diploid regions.
Run the following commands to merge both components into the final complete assembly:
cd <output_dir>
cat hap_contig.fa scaffold.fa > all_scaffold.fa
If you use HapFold in your research, please cite:
@article{liu2026efficient,
title={Efficient and accurate near telomere-to-telomere haplotype reconstruction of diploid genomes},
author={Liu, Yuansheng and Li, Yichen and Xu, Jialu and Tan, Zhongzheng and Zhang, Wenhai and Wang, Long and Xu, Luohao and Zeng, Xiangxiang and Schoenhuth, Alexander and Luo, Xiao},
journal={bioRxiv},
pages={2026--05},
year={2026},
publisher={Cold Spring Harbor Laboratory}
}