| title | HantaBERT |
|---|---|
| emoji | 🧬 |
| colorFrom | green |
| colorTo | yellow |
| sdk | static |
| pinned | true |
Multi-Task Orthohantavirus classification by fine-tuning DNABERT-2.
One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.
Hantaviruses (genus Orthohantavirus) are segmented negative-sense ssRNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with mortality reaching ~40% in HCPS cases. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST and classical phylogenetic analysis are slow and do not predict these attributes jointly.
HantaBERT fine-tunes DNABERT-2 (117M parameters) into a multi-task model that emits class probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads, trained with a weighted combined loss, class balancing, fp16 automatic mixed precision (AMP), gradient accumulation, and separate learning rates for the encoder and the heads.
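The text does not say whether class balancing is done through loss weights or through resampling. As one possible reading, the sketch below computes inverse-frequency class weights (the common "balanced" heuristic, weight = N / (K · count)). The function name and the toy labels are hypothetical and are not taken from the HantaBERT code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = N / (K * count), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy host labels, made up for illustration (not the real class distribution).
toy_hosts = ["Rodent"] * 6 + ["Human"] * 3 + ["Others"] * 1
weights = balanced_class_weights(toy_hosts)
```

With weights of this form, each class contributes the same total weight to the loss regardless of how many samples it has.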
Held-out test set (883 sequences), neural classification heads:
| Task | Classes | Test accuracy |
|---|---|---|
| 🧬 Species / lineage | 23 | 96.7% |
| 🐀 Host (Rodent / Human / Others) | 3 | 91.4% |
| 🌍 Geographic origin | 7 | 80.5% |
[Figure: training progression. Validation accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.]
[Figure: species confusion matrix. Clean diagonal across 23 lineages.]
A UMAP projection of all 8,822 bottleneck embeddings reveals clean per-lineage clusters, with substructure per genome segment (S, M, L) even though the segment is never used as a training label. The S/M/L separation is consistent with differences in selective pressure: a conserved N protein on S, antigenic positive selection on Gn/Gc in M, and active RdRp motifs on L.
[Figure panels: all 8,822 sequences by lineage · Seoul virus (1,391) · Puumala virus (2,709)]
The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.
Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.
- 🤗 Dataset: HantaBERT/Orthohantavirus-Genome-Atlas: `raw` (9,950), `interim` (9,846), `processed` (default; 9,846)
- 💻 Code: github.com/HantaBERT/data-pipeline
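To make the host standardization step concrete, here is a minimal sketch of how free-text GenBank host annotations might be collapsed into the three host classes (Rodent / Human / Others). The genus list, the function name, and the choice to send missing hosts to "Others" are all assumptions for illustration; the pipeline's actual rules are not described here.

```python
# Illustrative genus list only; the pipeline's real mapping rules are not given in the text.
RODENT_GENERA = {"apodemus", "rattus", "myodes", "peromyscus", "microtus", "oligoryzomys"}


def standardize_host(raw_host):
    """Collapse a free-text host annotation into Rodent / Human / Others."""
    text = (raw_host or "").strip().lower()
    if not text:
        return "Others"  # assumption: missing hosts fall into the catch-all class
    if "homo sapiens" in text or text in {"human", "patient"}:
        return "Human"
    genus = text.split()[0]  # first token of a binomial name, e.g. "apodemus"
    return "Rodent" if genus in RODENT_GENERA else "Others"


examples = ["Homo sapiens; male", "Apodemus agrarius", "Sorex araneus", None]
standardized = [standardize_host(h) for h in examples]
```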
The multi-task fine-tuning code: MultiTaskHantaBERT (DNABERT-2 encoder → shared bottleneck → 3 heads), weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
- 🤗 Model: HantaBERT/HantaBERT
- 💻 Code: github.com/HantaBERT/HantaBERT
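The weighted loss quoted above can be illustrated in plain Python. This sketch assumes each per-task loss is a standard cross-entropy, which the text does not state explicitly. The function names and toy logits are hypothetical, and the real training code presumably operates on batched tensors rather than lists.

```python
import math

# Task weights quoted in the text: 1.0 * L_species + 0.5 * L_host + 0.3 * L_geo.
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(logits, target):
    """Cross-entropy of one example: log-sum-exp(logits) - logits[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def combined_loss(logits_by_task, target_by_task):
    """Weighted sum of the per-task cross-entropy losses."""
    return sum(
        weight * cross_entropy(logits_by_task[task], target_by_task[task])
        for task, weight in TASK_WEIGHTS.items()
    )


# Toy logits with 3 classes per task (the real heads have 23 / 3 / 7 outputs).
toy_logits = {"species": [2.0, 0.0, 0.0], "host": [0.0, 0.0, 0.0], "geo": [0.0, 1.0, 0.0]}
toy_targets = {"species": 0, "host": 1, "geo": 1}
loss = combined_loss(toy_logits, toy_targets)
```

Because the species term carries the largest weight, a species error moves the combined loss more than an equally large error on host or geography.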
FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
- 🚀 Live: hantabert-api.faizath.com/docs
- 💻 Code: github.com/HantaBERT/HantaBERT-API
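The input handling and top-N ranking described for the API could look roughly like the sketch below. It is not the service's actual code: `read_sequence` and `top_n` are hypothetical helpers, the parser treats the input as a single record, and the logits are made-up numbers standing in for one classification head's output.

```python
import math


def read_sequence(text):
    """Accept raw DNA/RNA or single-record FASTA text; return uppercase DNA (U -> T)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Drop FASTA header lines (they start with '>') and blank lines.
    residues = "".join(line for line in lines if line and not line.startswith(">"))
    return residues.upper().replace("U", "T")


def top_n(logits, labels, n=3):
    """Softmax the logits and return the n most probable (label, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(
        zip(labels, (e / total for e in exps)), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:n]


sequence = read_sequence(">example header\nauggcu\nACGU\n")
host_ranking = top_n([2.0, 0.5, 1.0], ["Rodent", "Human", "Others"], n=2)
```

Converting U to T before tokenization lets RNA input share the DNA vocabulary that a DNABERT-2-style tokenizer expects.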
Pure static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.
- 🌍 Live: hantabert.faizath.com
- 💻 Code: github.com/HantaBERT/HantaBERT-Web
HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.
- 💻 Source & PDFs: github.com/HantaBERT/paper
Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.
Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.
Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).






