
HantaBERT

Multi-task Orthohantavirus classification by fine-tuning DNABERT-2 — species, host, and geographic origin in one forward pass.

One forward pass → species, host, and geographic origin, plus a 768-d phylogenetic embedding.

🌐 Web App  ·  ⚡ API Docs  ·  🤗 Model  ·  📊 Dataset  ·  💻 GitHub


What is HantaBERT?

Hantaviruses (genus Orthohantavirus) are segmented, negative-sense single-stranded RNA viruses that cause hemorrhagic fever with renal syndrome (HFRS) in Eurasia and hantavirus cardiopulmonary syndrome (HCPS) in the Americas, with case fatality reaching ~40% for HCPS. Rapidly identifying the species, reservoir host, and geographic origin of a sequence is essential for surveillance, but BLAST searches and classical phylogenetic analysis are slow and answer each of these questions separately rather than jointly.

HantaBERT fine-tunes DNABERT-2 (117M parameters) into a multi-task model that emits probabilities for all three tasks in a single forward pass. A shared 768-d bottleneck feeds three independent classification heads, trained with a weighted combined loss, balanced classes, fp16 automatic mixed precision (AMP), gradient accumulation, and separate learning rates for the encoder and the heads.

📈 Headline results

Held-out test set (883 sequences), neural classification heads:

| Task | Classes | Test accuracy |
|---|---|---|
| 🧬 Species / lineage | 23 | 96.7% |
| 🐀 Host (Rodent / Human / Others) | 3 | 91.4% |
| 🌍 Geographic origin | 7 | 80.5% |

Figure: Training curves. Validation accuracy climbs to 96.7% / 91.4% / 80.5% over 10 epochs.

Figure: Species confusion matrix. Clean diagonal across 23 lineages.

Emergent phylogenetic structure

A UMAP projection of all 8,822 bottleneck embeddings shows clean per-lineage clusters, with further substructure by genome segment (S, M, L), even though the segment is never used as a training label. The S/M/L separation tracks differences in selective pressure across segments (conserved N protein on S, antigenic positive selection on Gn/Gc in M, active RdRp motifs on L).

Figure: UMAP of all species. All 8,822 sequences by lineage.

Figure: UMAP of Seoul virus (1,391 sequences).

Figure: UMAP of Puumala virus (2,709 sequences).

🗂️ Project components

The HantaBERT stack spans data collection, modeling, a public API, and a web interface, each in its own repository.

📊 Data pipeline & dataset

Automated extraction, cleaning, multi-task labeling, and geocoding of Orthohantavirus genomic records from NCBI GenBank (Biopython + Nominatim). Produces the ready-to-train dataset of S/M/L RNA segments with standardized host, species, and geography labels.
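As a rough illustration of what label standardization can look like, the sketch below maps a free-text GenBank host annotation onto the three host classes used by the model (Rodent / Human / Others). The function name `standardize_host` and both keyword lists are hypothetical and chosen only for this example; the actual pipeline's rules are not described here and may well differ.

```python
# Hypothetical keyword lists for illustration; the real pipeline's rules may differ.
RODENT_GENERA = ("apodemus", "rattus", "myodes", "peromyscus", "oligoryzomys")
HUMAN_TERMS = ("homo sapiens", "human", "patient")


def standardize_host(raw_host):
    """Map a free-text host annotation to one of Rodent / Human / Others."""
    text = (raw_host or "").strip().lower()
    # Check human terms first so clinical isolates are never labeled as rodent.
    if any(term in text for term in HUMAN_TERMS):
        return "Human"
    if any(genus in text for genus in RODENT_GENERA):
        return "Rodent"
    # Missing annotations and every other host fall into the catch-all class.
    return "Others"


labels = [standardize_host(h) for h in ("Homo sapiens", "Apodemus agrarius", "Sorex araneus", None)]
```

A substring match like this is easy to audit but brittle; a production pipeline would more likely resolve hosts against a taxonomy database.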

🧠 Model: training & fine-tuning

The multi-task fine-tuning code: MultiTaskHantaBERT (DNABERT-2 encoder → shared bottleneck → 3 heads), weighted loss 1.0·L_species + 0.5·L_host + 0.3·L_geo, full train / evaluate / visualize scripts, and an SVM-on-embeddings baseline.
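To make the weighted loss concrete, here is a minimal pure-Python sketch of 1.0·L_species + 0.5·L_host + 0.3·L_geo for a single example. It assumes each per-task loss is plain cross-entropy over that head's logits and leaves out class weighting and batching; the names `TASK_WEIGHTS`, `cross_entropy`, and `combined_loss` are illustrative and are not taken from the repository.

```python
import math

# Task weights quoted in the text: 1.0 (species), 0.5 (host), 0.3 (geography).
TASK_WEIGHTS = {"species": 1.0, "host": 0.5, "geo": 0.3}


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(logits_by_task, targets_by_task, weights=TASK_WEIGHTS):
    """Weighted sum of the per-task cross-entropy losses for a single example."""
    return sum(
        weights[task] * cross_entropy(logits_by_task[task], targets_by_task[task])
        for task in weights
    )


# Toy logits sized to the reported class counts (23 species, 3 hosts, 7 regions).
logits = {"species": [0.0] * 23, "host": [0.0] * 3, "geo": [0.0] * 7}
targets = {"species": 4, "host": 0, "geo": 2}
loss = combined_loss(logits, targets)
```

With all-zero logits every head predicts a uniform distribution, so the result reduces to ln 23 + 0.5·ln 3 + 0.3·ln 7. This shows how the weights let the species task dominate the gradient signal relative to host and geography.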

⚡ Inference API

FastAPI + Uvicorn service, packaged with Docker. Accepts raw DNA/RNA or FASTA, auto-converts U→T, and returns top-N probabilistic predictions per task.
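The input handling described here can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration rather than the service's actual code: `normalize_sequence` and `top_n` are hypothetical helper names, and the sketch treats any FASTA input as a single record.

```python
def normalize_sequence(text):
    """Strip FASTA header lines and whitespace, uppercase, and map RNA U to DNA T."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Assumption: one record per request; lines of multiple records would be concatenated.
    seq = "".join(ln for ln in lines if ln and not ln.startswith(">"))
    return seq.upper().replace("U", "T")


def top_n(probabilities, n=3):
    """Return the n (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


fasta = ">example_record\nauggcuacg\nGGAUCC\n"
clean = normalize_sequence(fasta)  # "ATGGCTACGGGATCC"

host_probs = {"Rodent": 0.7, "Human": 0.2, "Others": 0.1}
best_two = top_n(host_probs, n=2)  # [("Rodent", 0.7), ("Human", 0.2)]
```

Normalizing to the DNA alphabet before tokenization matters because DNABERT-2 was pretrained on DNA, so a raw RNA sequence containing U would otherwise fall outside the vocabulary the encoder has seen.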

🌐 Web interface

Purely static HTML/CSS/JS frontend with an interactive world map (D3 + TopoJSON). Paste a sequence or upload a FASTA file and explore ranked predictions across all three tasks.

HantaBERT web interface

📄 Paper

HantaBERT: Multi-Task Hantavirus Classification with DNABERT-2 Fine-Tuning, an IEEE-style conference paper (English & Indonesian), written for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.


🚀 Quick links

| | Web / Hub | Source |
|---|---|---|
| Model | 🤗 HantaBERT/HantaBERT | github.com/HantaBERT/HantaBERT |
| Dataset | 🤗 Orthohantavirus-Genome-Atlas | github.com/HantaBERT/data-pipeline |
| API | hantabert-api.faizath.com/docs | github.com/HantaBERT/HantaBERT-API |
| Web | hantabert.faizath.com | github.com/HantaBERT/HantaBERT-Web |
| Paper | n/a | github.com/HantaBERT/paper |

👥 Authors

Muhammad Faiz Atharrahman · Muhammad Rafi Dhiyaulhaq · Lydia Gracia, School of Electrical Engineering and Informatics (STEI), Institut Teknologi Bandung.

Developed as the final project for the IF3211 Domain-Specific Computation (Bioinformatics) course at STEI ITB.

Released under the Apache-2.0 license, consistent with the DNABERT-2 backbone. If you use HantaBERT, please also cite DNABERT-2 (Zhou et al., 2023).
