GitHub - HantaBERT/.github: HantaBERT organization profile and shared community health files.

title	HantaBERT
emoji	🧬
colorFrom	green
colorTo	yellow
sdk	static
pinned	true

Multi-Task Orthohantavirus classification by fine-tuning DNABERT-2.
One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.

🌐 Web App · ⚡ API Docs · 🤗 Model · 📊 Dataset · 💻 GitHub

What is HantaBERT?

Hantaviruses (genus Orthohantavirus) are segmented, negative-sense ssRNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with mortality reaching ~40% in HCPS cases. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST searches and classical phylogenetic analysis are slow and address each attribute separately rather than jointly.

HantaBERT fine-tunes DNABERT-2 (117M parameters) into a multi-task model that emits class probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads. Training uses a weighted combined loss, class balancing, fp16 automatic mixed precision (AMP), gradient accumulation, and different learning rates for the encoder and the heads.
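To make the "shared bottleneck, three heads" idea concrete, here is a minimal, dependency-free sketch. It is not the project's training code: the weights are random placeholders, the embedding stands in for the real encoder output, and the helper names (`make_head`, `apply_head`) are hypothetical. It assumes each head is a single linear layer followed by a softmax, which the description above does not confirm. Only the bottleneck width (768) and the class counts (23 / 3 / 7) come from this README.

```python
import math
import random

BOTTLENECK_DIM = 768                                 # shared embedding width (from the README)
NUM_CLASSES = {"species": 23, "host": 3, "geo": 7}   # class counts (from the results table)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_head(n_classes, dim, rng):
    """Hypothetical single-linear-layer head with random placeholder weights."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_classes)]
    biases = [0.0] * n_classes
    return weights, biases


def apply_head(head, embedding):
    """Map one shared embedding to class probabilities for one task."""
    weights, biases = head
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


rng = random.Random(0)
heads = {task: make_head(n, BOTTLENECK_DIM, rng) for task, n in NUM_CLASSES.items()}

# Stand-in for the encoder + bottleneck output of a single sequence.
embedding = [rng.gauss(0.0, 1.0) for _ in range(BOTTLENECK_DIM)]

# One embedding is computed once and reused by all three heads.
predictions = {task: apply_head(head, embedding) for task, head in heads.items()}
```

The point of the sketch is the data flow: the expensive encoder step runs once per sequence, and each task adds only a small head on top of the same 768-d vector.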

📈 Headline results

Held-out test set (883 sequences), neural classification heads:

Task	Classes	Test accuracy
🧬 Species / lineage	23	96.7%
🐀 Host (Rodent / Human / Others)	3	91.4%
🌍 Geographic origin	7	80.5%


Figure: Training progression. Validation accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.
Figure: Species confusion matrix, with a clean diagonal across 23 lineages.

Emergent phylogenetic structure

A UMAP projection of all 8,822 bottleneck embeddings reveals clean per-lineage clusters, with substructure per genome segment (S, M, L), even though the model receives no explicit segment supervision. The S/M/L separation tracks differences in selective pressure: a conserved N protein on S, positive selection on the antigenic Gn/Gc glycoproteins on M, and active RdRp motifs on L.


Figure: UMAP panels of the bottleneck embeddings: all 8,822 sequences by lineage, Seoul virus (1,391), and Puumala virus (2,709).

🗂️ Project components

The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.

📊 Data pipeline & dataset

Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.

🤗 Dataset: HantaBERT/Orthohantavirus-Genome-Atlas (raw: 9,950; interim: 9,846; default processed: 9,846)
💻 Code: github.com/HantaBERT/data-pipeline

🧠 Model: training & fine-tuning

The multi-task fine-tuning code. It includes the MultiTaskHantaBERT model (DNABERT-2 encoder → shared bottleneck → 3 heads), the weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
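As a rough illustration of the weighted loss, the sketch below combines three per-task losses with the 1.0 / 0.5 / 0.3 weights stated above. It assumes plain cross-entropy per task for a single example; the actual training code may differ (for instance through class weighting or batching). The function names and the toy probabilities are hypothetical, and the toy class counts are deliberately smaller than the real ones.

```python
import math

# Task weights as stated in the README: 1.0·L_species + 0.5·L_host + 0.3·L_geo.
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(probs, target_index):
    """Cross-entropy of one example: negative log-probability of the true class."""
    return -math.log(probs[target_index])


def combined_loss(probs_by_task, target_by_task):
    """Weighted sum of the per-task losses (hypothetical helper)."""
    return sum(
        weight * cross_entropy(probs_by_task[task], target_by_task[task])
        for task, weight in TASK_WEIGHTS.items()
    )


# Toy predicted probabilities and true class indices for one sequence.
probs = {
    "species": [0.7, 0.2, 0.1],
    "host": [0.5, 0.5],
    "geo": [0.25, 0.75],
}
targets = {"species": 0, "host": 1, "geo": 1}

loss = combined_loss(probs, targets)
print(f"combined loss: {loss:.4f}")
```

With these weights, an error on the species task moves the total loss twice as much as the same error on the host task and more than three times as much as one on the geography task.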

🤗 Model: HantaBERT/HantaBERT
💻 Code: github.com/HantaBERT/HantaBERT

⚡ Inference API

FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
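The input handling described above can be sketched in a few lines. This is not the service's code: `normalize_sequence` and `top_n` are hypothetical helpers, the sketch assumes a single-record FASTA input, and the probabilities in the example are made up for illustration (only the host labels come from this README).

```python
def normalize_sequence(text):
    """Accept raw DNA/RNA or single-record FASTA and return an uppercase DNA string."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # FASTA header lines start with ">" and carry no residues.
    residues = "".join(line for line in lines if not line.startswith(">"))
    # RNA input is mapped onto the DNA alphabet: U becomes T.
    return residues.upper().replace("U", "T")


def top_n(probabilities, n=3):
    """Return the n most probable (label, probability) pairs, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


fasta_input = ">example record\nauggcuacg\nUUAGC\n"
print(normalize_sequence(fasta_input))  # ATGGCTACGTTAGC

# Illustrative host probabilities, not real model output.
host_probs = {"Rodent": 0.81, "Human": 0.15, "Others": 0.04}
print(top_n(host_probs, n=2))  # [('Rodent', 0.81), ('Human', 0.15)]
```

Normalizing before tokenization means RNA and DNA spellings of the same sequence reach the model as identical input.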

🚀 Live: hantabert-api.faizath.com/docs
💻 Code: github.com/HantaBERT/HantaBERT-API

🌐 Web interface

Pure static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.

🌍 Live: hantabert.faizath.com
💻 Code: github.com/HantaBERT/HantaBERT-Web

📄 Paper

HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.

💻 Source & PDFs: github.com/HantaBERT/paper

🚀 Quick links

	Web / Hub	Source
Model	🤗 HantaBERT/HantaBERT	github.com/HantaBERT/HantaBERT
Dataset	🤗 Orthohantavirus-Genome-Atlas	github.com/HantaBERT/data-pipeline
API	hantabert-api.faizath.com/docs	github.com/HantaBERT/HantaBERT-API
Web	hantabert.faizath.com	github.com/HantaBERT/HantaBERT-Web
Paper	n/a	github.com/HantaBERT/paper

👥 Authors

Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.

Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.

Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).
