HapFold: Efficient and Accurate T2T-level Haplotype Reconstruction

Description

HapFold is a hybrid scaffolding framework for chromosome-scale, near-T2T haplotype reconstruction of large diploid genomes, including human, animal, and plant genomes.

HapFold integrates assembly graph topology with long-range Hi-C/Pore-C signals to achieve robust haplotype-aware scaffolding. Its performance mainly depends on the quality and completeness of the input unitig graph, and it is particularly effective for graphs containing well-resolved bubble-chain structures.

By combining graph-based haplotype information with sequence-based scaffolding, HapFold reduces chromosomal misassignments and improves assembly continuity, enabling accurate and scalable reconstruction of near-T2T diploid haplotypes.

Installation and Dependencies

Prerequisites

g++ (version 9.4 or later)
zlib

1. Install via Bioconda (Recommended)

HapFold is officially available on Bioconda. This is the fastest and easiest way to install the tool:

# Create and activate a new environment
conda create -n hapfold
conda activate hapfold

# Install HapFold
conda install -c bioconda hapfold

# Alternatively, you can use mamba for faster dependency resolution:
# mamba install -c bioconda hapfold

Typical installation time: HapFold can usually be deployed within 1 minute via Bioconda on a standard Linux workstation or server with conda/mamba available.

2. Install from Source Code

If you prefer to compile from source:

git clone https://github.com/LuoGroup2023/HapFold.git
cd HapFold

# Compile the source code
make -j8

🚀 Quick Start & Workflow

HapFold uses a two-step workflow: mapping (the `mapping` command) followed by haplotype resolution (the `scaffolding` command).

🎯 Primary Application

HapFold is primarily designed for large diploid genome scaffolding, including human, animal, and plant genomes. It is especially suitable for assemblies generated by hifiasm, where the input graph contains informative bubble-chain structures that can be used for robust haplotype-aware scaffolding.

HapFold requires three primary GFA files from the initial assembly:

Unphased unitig graph (*.p_utg.gfa)
Haplotype 1 contig graph (*.hap1.p_ctg.gfa)
Haplotype 2 contig graph (*.hap2.p_ctg.gfa)

The final scaffolding quality is strongly influenced by the quality of the input unitig graph. In general, HapFold performs better and more robustly when the unitig graph is more complete, better connected, and contains clear bubble-chain structures.
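A pre-flight check along the following lines can catch a missing input before a long run. This is only a sketch: the function name `check_hapfold_inputs` and the `toy` prefix are hypothetical, and the only thing taken from the text above is the set of three file suffixes.

```shell
# Hypothetical helper: count how many of the three expected GFA files
# are missing (or empty) for a given assembly prefix.
check_hapfold_inputs() {
  prefix=$1
  missing=0
  for suffix in p_utg.gfa hap1.p_ctg.gfa hap2.p_ctg.gfa; do
    if [ ! -s "${prefix}.${suffix}" ]; then
      echo "missing or empty: ${prefix}.${suffix}" >&2
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Demonstration on placeholder files in a scratch directory.
demo=$(mktemp -d)
printf 'S\tutg1\tACGT\n' > "$demo/toy.p_utg.gfa"
printf 'S\tctg1\tACGT\n' > "$demo/toy.hap1.p_ctg.gfa"

# The hap2 graph is absent here, so the helper returns a non-zero status.
status_incomplete=0
check_hapfold_inputs "$demo/toy" || status_incomplete=$?

# After adding the hap2 graph, all three inputs are present.
printf 'S\tctg2\tACGT\n' > "$demo/toy.hap2.p_ctg.gfa"
status_complete=0
check_hapfold_inputs "$demo/toy" || status_complete=$?

echo "incomplete: $status_incomplete, complete: $status_complete"
```

The check only confirms that the files exist and are non-empty; it says nothing about graph quality or bubble-chain structure.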

General Usage

Usage: HapFold <command> <arguments> <inputs>

Commands:
  scaffolding            use Hi-C/Pore-C data to resolve haplotypes
  mapping                map Hi-C/Pore-C data to sequences in the graph
  version                print version number

Step 1: Hi-C Mapping (`mapping`)

Before mapping, you need to extract the node sequences from your hifiasm unitig graph into a FASTA file:

awk '/^S/{print ">"$2;print $3}' hifiasm_p_utg.gfa > hifiasm_p_utg.fa
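To illustrate what this extraction produces, the sketch below runs a slightly stricter variant on a toy GFA. The file names and records are made up for illustration. Matching on the first tab-separated field being exactly `S` is a choice made for this sketch, not the authors' command.

```shell
# Toy GFA with a header, two segment (S) lines and one link (L) line.
workdir=$(mktemp -d)
printf 'H\tVN:Z:1.0\nS\tutg000001l\tACGTACGT\tLN:i:8\nS\tutg000002l\tGGCC\tLN:i:4\nL\tutg000001l\t+\tutg000002l\t-\t0M\n' \
  > "$workdir/toy.p_utg.gfa"

# Emit one FASTA record per segment: field 2 is the node name, field 3 the sequence.
awk -F'\t' '$1 == "S" {print ">"$2; print $3}' "$workdir/toy.p_utg.gfa" \
  > "$workdir/toy.p_utg.fa"

# The number of FASTA headers should equal the number of S lines in the graph.
n_seg=$(awk -F'\t' '$1 == "S" {n++} END {print n + 0}' "$workdir/toy.p_utg.gfa")
n_fa=$(grep -c '^>' "$workdir/toy.p_utg.fa")
echo "segments: $n_seg, FASTA records: $n_fa"
```

Note that GFA allows `*` in the sequence field when sequences are stored elsewhere. A graph written that way would yield FASTA records without usable sequence, so it is worth confirming that field 3 holds real bases.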

Then, map the raw Hi-C reads to these node sequences:

HapFold mapping -t 32 -1 hic.R1.fastq.gz -2 hic.R2.fastq.gz -o mapping.txt hifiasm_p_utg.fa

Key Options for mapping:

-1 FILE, -2 FILE: (Required) Paths to the Hi-C forward (R1) and reverse (R2) reads.
-t INT: Number of worker threads [32]
-o FILE: Output file for the mapping relationships (e.g., mapping.txt)
-k INT: k-mer size [31]
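Paired-end inputs are normally expected to contain the same number of reads in R1 and R2. Whether HapFold checks this itself is not stated here, so a quick count before mapping may be worthwhile. The sketch below uses toy FASTQ files and a hypothetical `count_reads` helper, and assumes standard four-line FASTQ records.

```shell
# Toy paired FASTQ files; read names and bases are illustrative only.
reads=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' | gzip -c > "$reads/hic.R1.fastq.gz"
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' | gzip -c > "$reads/hic.R2.fastq.gz"

# Hypothetical helper: number of reads = number of lines / 4.
count_reads() { gzip -dc "$1" | awk 'END {print NR / 4}'; }

n_r1=$(count_reads "$reads/hic.R1.fastq.gz")
n_r2=$(count_reads "$reads/hic.R2.fastq.gz")

if [ "$n_r1" -eq "$n_r2" ]; then
  echo "R1 and R2 both contain $n_r1 reads"
else
  echo "read count mismatch: R1=$n_r1 R2=$n_r2" >&2
fi
```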

Step 2: Haplotype Resolution (`scaffolding`)

Once the mapping is complete, use the mapping results alongside the GFA files to resolve haplotypes and build chromosome-scale scaffolds.

Usage: HapFold scaffolding [options] <mapping.txt> <unitig.gfa> <output_dir> -1 *.hap1.p_ctg.gfa -2 *.hap2.p_ctg.gfa

(Positional arguments, in order: the mapping result from Step 1, the unitig graph, and the output directory.)

Key Options for scaffolding:

| Option | Description |
| --- | --- |
| `-t INT` | Number of threads [8]. |
| `-n INT` | Expected number of chromosomes (e.g., `46` for human, `78` for chicken) [0]. |
| `-1 FILE` | (Required) Path to haplotype 1 GFA file (`*.hap1.p_ctg.gfa`). |
| `-2 FILE` | (Required) Path to haplotype 2 GFA file (`*.hap2.p_ctg.gfa`). |
| `-u FILE` | Path to `utg_to_ctg` relationship file. Highly recommended for accurate graph traversal. |
| `-i BOOL` | Enable identity check on contigs (`true`/`false`) [false]. |
| `-f FILE` | Precomputed identity file path; if omitted while `-i true` is set, the check runs automatically. |
| `-e STR` | Restriction enzymes separated by commas (e.g., `GATC,GANTC`) [ ]. |
| `-c FILE` | Path to `contig_hap_nodes.txt` (required for specific Hi-C phasing modes). |
| `-d`, `--debug` | Enable debug mode to run internal test code functions. |
| `--hic_scaffold_threshold_ratio FLOAT` | Threshold ratio for sequence-based Hi-C scaffolding extensions [0.60]. |
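As a sketch of how these options combine, the snippet below assembles a scaffolding command for a human sample and prints it without running it. The `sample` prefix and all file names are placeholders, and the argument order simply mirrors the usage line above. Only flags listed in the options above are used.

```shell
# Placeholder settings; substitute your own values and paths.
threads=32
n_chrom=46               # expected chromosome number for human, as listed above
enzymes="GATC,GANTC"     # example enzyme list, as listed above
prefix="sample"          # hypothetical assembly output prefix

# Build the command as a string and print it instead of executing it (dry run).
cmd="HapFold scaffolding -t $threads -n $n_chrom -e $enzymes mapping.txt ${prefix}.p_utg.gfa ${prefix}_hapfold_out -1 ${prefix}.hap1.p_ctg.gfa -2 ${prefix}.hap2.p_ctg.gfa"
echo "$cmd"
```

Printing the command first makes it easy to review the paths and options before committing to a long run.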

Demo with rice test data

A small rice test dataset is provided in the package to test the scaffolding step of HapFold. The demo data include three GFA files generated from the initial assembly and a precomputed mapping.txt file.

cd rice-test-data

HapFold scaffolding \
  -t 32 \
  -n 24 \
  mapping.txt \
  rice.p_utg.gfa \
  rice_hapfold_out \
  -1 rice.hap1.p_ctg.gfa \
  -2 rice.hap2.p_ctg.gfa

The expected output files include:

rice_hapfold_out/hap_contig.fa
rice_hapfold_out/scaffold.fa

To obtain the final complete assembly, merge the unresolved single-chain sequences and phased scaffolds:

cd rice_hapfold_out
cat hap_contig.fa scaffold.fa > all_scaffold.fa

Expected demo run time: the rice test demo is expected to finish within several minutes on a standard Linux workstation or server using 32 CPU threads.

Expected Outputs & Key Notes

After successfully running the scaffolding command, HapFold generates several crucial sequence files in your specified <output_dir>.

⚠️ CRITICAL NOTICE on Haplotype Completeness

To obtain the complete genome assembly, you MUST combine both the unresolved single chains and the phased scaffolds.

HapFold splits the output based on graph topologies (Single Chains vs. Bubble Chains):

hap_contig.fa (Single Chains): Contains sequences derived from unresolved single chains in the graph. This typically includes sex chromosomes (X/Y or Z/W) and ultra-conserved homozygous blocks where haplotypes cannot be topologically or proximally differentiated.
scaffold.fa (or phasing_hap1.fa / phasing_hap2.fa) (Bubble Chains): Contains chromosome-scale scaffolds resolved from bubble chains, representing the successfully phased autosomal diploid regions.

Assembly Completeness Formula

Run the following commands to merge both components into the final complete assembly:

cd <output_dir>
cat hap_contig.fa scaffold.fa > all_scaffold.fa
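A simple way to confirm that nothing was lost in the merge is to compare record counts. The sketch below does this on toy stand-ins for the two output files; the sequence names and contents are illustrative only.

```shell
# Toy stand-ins for the two HapFold output files.
out=$(mktemp -d)
printf '>chain1\nACGT\n' > "$out/hap_contig.fa"
printf '>scaffold1\nGGCC\n>scaffold2\nTTAA\n' > "$out/scaffold.fa"

# Merge the single-chain sequences and the phased scaffolds.
cat "$out/hap_contig.fa" "$out/scaffold.fa" > "$out/all_scaffold.fa"

# The merged record count should equal the sum of the two parts.
n_single=$(grep -c '^>' "$out/hap_contig.fa")
n_phased=$(grep -c '^>' "$out/scaffold.fa")
n_all=$(grep -c '^>' "$out/all_scaffold.fa")

if [ "$n_all" -eq $((n_single + n_phased)) ]; then
  echo "merged $n_all records ($n_single single-chain + $n_phased phased)"
else
  echo "record count mismatch" >&2
fi
```

If the first file does not end with a newline, `cat` joins its last sequence line to the next header, which this count would reveal as a missing record.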

Citation

If you use HapFold in your research, please cite:

@article{liu2026efficient,
  title={Efficient and accurate near telomere-to-telomere haplotype reconstruction of diploid genomes},
  author={Liu, Yuansheng and Li, Yichen and Xu, Jialu and Tan, Zhongzheng and Zhang, Wenhai and Wang, Long and Xu, Luohao and Zeng, Xiangxiang and Schoenhuth, Alexander and Luo, Xiao},
  journal={bioRxiv},
  pages={2026--05},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}
