
---
title: HantaBERT
emoji: 🧬
colorFrom: green
colorTo: yellow
sdk: static
pinned: true
---

HantaBERT

Multi-Task Orthohantavirus classification by fine-tuning DNABERT-2.
One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.

🌐 Web App  ·  ⚡ API Docs  ·  🤗 Model  ·  📊 Dataset  ·  💻 GitHub


What is HantaBERT?

Hantaviruses (genus Orthohantavirus) are segmented, negative-sense, single-stranded RNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with case fatality reaching ~40% for HCPS. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST searches and classical phylogenetic analysis are slow and answer these three questions separately rather than in one step.

HantaBERT fine-tunes DNABERT-2 (117M parameters) into a multi-task model that emits class probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads. Training uses a weighted combined loss, class balancing, automatic mixed precision (fp16), gradient accumulation, and different learning rates for the encoder and the heads.
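The forward pass described above can be written compactly as follows. This is only a sketch: the pooling operation `pool`, the form of the bottleneck map `g`, and the use of a single linear layer per head are assumptions made for illustration, not details confirmed by this README. The class counts (23, 3, 7) come from the results table.

```latex
\begin{aligned}
h &= \operatorname{pool}\bigl(\operatorname{DNABERT2}(x)\bigr)
  && \text{(encoder output for sequence } x\text{; pooling assumed)} \\
z &= g(h) \in \mathbb{R}^{768}
  && \text{(shared bottleneck embedding)} \\
p_t &= \operatorname{softmax}(W_t z + b_t),
  \quad t \in \{\text{species}, \text{host}, \text{geo}\}
  && \text{(one head per task; linear form assumed)} \\
W_{\text{species}} &\in \mathbb{R}^{23 \times 768}, \quad
W_{\text{host}} \in \mathbb{R}^{3 \times 768}, \quad
W_{\text{geo}} \in \mathbb{R}^{7 \times 768}
\end{aligned}
```

Because all three heads read the same vector z, that vector is also what gets exported as the 768-d embedding.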

📈 Headline results

Held-out test set (883 sequences), neural classification heads:

| Task | Classes | Test accuracy |
|---|---|---|
| 🧬 Species / lineage | 23 | 96.7% |
| 🐀 Host (Rodent / Human / Others) | 3 | 91.4% |
| 🌍 Geographic origin | 7 | 80.5% |

Figure (training curves): val accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.

Figure (species confusion matrix): clean diagonal across 23 lineages.

Emergent phylogenetic structure

A UMAP projection of all 8,822 bottleneck embeddings shows clean per-lineage clusters, with further substructure by genome segment (S, M, L), even though the segment was never an explicit training label. The S/M/L separation is consistent with differences in selective pressure: a conserved N protein on S, positive selection on the antigenic Gn/Gc glycoproteins on M, and active RdRp motifs on L.

Figure (UMAP panels): all 8,822 sequences by lineage · Seoul virus (1,391) · Puumala virus (2,709)

🗂️ Project components

The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.

📊 Data pipeline & dataset

Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.
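As an illustration of what "standardized host labels" could involve, here is a minimal sketch that maps a free-text GenBank host field to the three host classes used by the model (Rodent / Human / Others). The keyword sets and the `bucket_host` function are hypothetical; the actual pipeline's rules are not given in this README and may differ, including how missing hosts are handled.

```python
import re

# Hypothetical keyword sets, chosen only for illustration.
HUMAN_TOKENS = {"homo", "human", "patient"}
RODENT_TOKENS = {
    "apodemus", "rattus", "myodes", "peromyscus", "microtus",
    "mouse", "rat", "vole", "rodent",
}


def bucket_host(raw_host):
    """Map a free-text host string to 'Human', 'Rodent', or 'Others'."""
    # Whole-word matching avoids false hits such as "rat" inside "migratory".
    tokens = set(re.findall(r"[a-z]+", (raw_host or "").lower()))
    if tokens & HUMAN_TOKENS:
        return "Human"
    if tokens & RODENT_TOKENS:
        return "Rodent"
    # Assumption: anything else, including a missing host, goes to "Others".
    return "Others"


print(bucket_host("Apodemus agrarius"))  # a rodent reservoir species
print(bucket_host("Homo sapiens"))
print(bucket_host("Sorex araneus"))      # a shrew, so not a rodent
```

A production pipeline would more likely resolve hosts against a taxonomy database than against a fixed keyword list; the sketch only shows the shape of the mapping.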

🧠 Model: training & fine-tuning

The multi-task fine-tuning code: MultiTaskHantaBERT (DNABERT-2 encoder → shared bottleneck → 3 heads), weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
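To make the weighted loss concrete, the following sketch evaluates 1.0·L_species + 0.5·L_host + 0.3·L_geo for a single example. It assumes each per-task loss is a plain cross-entropy on the head's output probabilities, which the README does not state explicitly, and it ignores class balancing. The function names and the toy probabilities are made up for illustration.

```python
import math

# Task weights as stated in the text: 1.0 (species), 0.5 (host), 0.3 (geography).
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class (assumed per-task loss)."""
    return -math.log(probs[target_index])


def combined_loss(task_probs, task_targets, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task losses for one example."""
    return sum(
        weights[task] * cross_entropy(task_probs[task], task_targets[task])
        for task in weights
    )


# Toy probabilities with 3 classes per task for brevity
# (the real heads have 23 / 3 / 7 classes).
probs = {
    "species": [0.90, 0.05, 0.05],
    "host": [0.60, 0.30, 0.10],
    "geo": [0.50, 0.25, 0.25],
}
targets = {"species": 0, "host": 0, "geo": 0}

loss = combined_loss(probs, targets)
print(f"combined loss: {loss:.4f}")
```

With these weights, an error on the species head moves the total loss more than an equally confident error on the host or geography heads, so species is the task the optimizer is pushed hardest to get right.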

⚡ Inference API

FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
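The input handling and output ranking described above could look roughly like the sketch below. It is not the service's actual code: `normalize_sequence` and `top_n` are hypothetical helper names, the input is assumed to contain a single record, and softmax over per-task logits is assumed as the source of the probabilities.

```python
import math


def normalize_sequence(text):
    """Accept raw DNA/RNA or single-record FASTA; return an uppercase DNA string."""
    lines = [line.strip() for line in text.splitlines()]
    # Drop FASTA header lines (those starting with ">") and blank lines.
    body = "".join(line for line in lines if line and not line.startswith(">"))
    # Convert RNA to the DNA alphabet, as the API description states (U -> T).
    return body.upper().replace("U", "T")


def top_n(logits, labels, n=3):
    """Softmax the logits and return the n most probable (label, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    ranked = sorted(
        zip(labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


sequence = normalize_sequence(">example\nacgu\nUUga\n")
print(sequence)

# Made-up logits for the three host classes, for illustration only.
for label, prob in top_n([2.0, 0.0, 1.0], ["Rodent", "Human", "Others"], n=2):
    print(f"{label}: {prob:.3f}")
```

Normalizing before tokenization means RNA and DNA submissions of the same genome segment reach the model as identical input.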

🌐 Web interface

Pure static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.

Figure: HantaBERT web interface

📄 Paper

HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.


🚀 Quick links

| | Web / Hub | Source |
|---|---|---|
| Model | 🤗 HantaBERT/HantaBERT | github.com/HantaBERT/HantaBERT |
| Dataset | 🤗 Orthohantavirus-Genome-Atlas | github.com/HantaBERT/data-pipeline |
| API | hantabert-api.faizath.com/docs | github.com/HantaBERT/HantaBERT-API |
| Web | hantabert.faizath.com | github.com/HantaBERT/HantaBERT-Web |
| Paper | n/a | github.com/HantaBERT/paper |

👥 Authors

Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.

Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.

Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).
