This repository contains the processed Nullsettes benchmark and reference inference scripts used to evaluate whether genomic language models can distinguish functional synthetic expression cassettes from matched virtual loss-of-function mutants. The benchmark focuses on zero-shot mutation effect prediction in sequences that are functional but evolutionarily implausible.
Each benchmark file contains 1,500 sequences. For every dataset, nonmutant.txt stores the original active expression cassettes and each mutant_translocation*.txt file stores the matched Nullsette mutants generated by translocating one key control element. Within a dataset, the FASTA headers and sequence order are preserved across files, so row i in a mutant file is paired with row i in nonmutant.txt for statistical testing.
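The index-based pairing convention can be checked with a few lines of Python. The `read_fasta` helper below is a minimal stand-in parser written for this sketch, not a repository function:

```python
def read_fasta(path):
    """Parse a FASTA file into an ordered list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Row i of a mutant file pairs with row i of nonmutant.txt:
# pairs = list(zip(read_fasta("nonmutant.txt"),
#                  read_fasta("mutant_translocation1.txt")))
```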
The current release contains five processed datasets:
| Dataset folder | System | Upstream source | Nonmutant cassettes | Nullsette mutant types | Notes |
|---|---|---|---|---|---|
| `deboer/Abf1TATA` | Yeast | deBoer MPRA library | 1500 | 11 | Abf1TATA is designed by embedding conserved transcription factor binding sites such as Abf1 and a canonical TATA box |
| `deboer/pTpA` | Yeast | deBoer MPRA library | 1500 | 11 | pTpA consists of synthetic promoters constructed with a poly-T–poly-A architecture |
| `kosuri` | E. coli | Kosuri promoter-RBS library | 1500 | 19 | Rationally designed promoter-RBS pairs upstream of sfGFP |
| `lagator` | E. coli | Lagator promoter library | 1500 | 19 | Random promoter sequences upstream of sfGFP |
| `zahm` | Mammalian | Zahm TRE/minimal-promoter library | 1500 | 11 | Human/mouse TREs combined with minimal promoters |
For the benchmark logic itself:
- Eukaryotic cassettes have 11 valid single-component Nullsette translocations.
- Prokaryotic cassettes have 19 valid single-component Nullsette translocations.
- The translocation definitions are summarized in database/README.md.
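As an illustration only (the authoritative translocation definitions live in `database/README.md`), one generic way to enumerate single-component translocations is to move each cassette component to every other slot in the component order. The `components` list below is a hypothetical example, not a repository data structure:

```python
def single_component_translocations(components):
    """Enumerate orderings obtained by moving exactly one component to a
    different position in the cassette (an illustrative definition, not
    necessarily the one used to build the benchmark)."""
    results = []
    n = len(components)
    for i in range(n):
        rest = components[:i] + components[i + 1:]
        for j in range(n):
            if j == i:
                continue  # reinserting at the original slot is a no-op
            moved = rest[:j] + [components[i]] + rest[j:]
            if moved != components and moved not in results:
                results.append(moved)
    return results

# e.g. single_component_translocations(["promoter", "CDS", "terminator"])
```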
```
GLM-Nullsette-Benchmark/
|-- README.md
|-- environment.yml
|-- data/
|   |-- processed_data.zip
|   `-- example.txt
|-- database/
|   |-- README.md
|   |-- __init__.py
|   |-- ecoli_kosuri.py
|   |-- ecoli_lagator.py
|   |-- yeast_deboer.py
|   |-- mammlian_zahm.py
|   `-- pkls/
|-- model/
|   |-- alphagenome_infer.py
|   |-- evo1.py
|   |-- genslm.py
|   |-- hyenadna.py
|   |-- metagene1.py
|   |-- NT.py
|   |-- dnabert2.py
|   |-- generator.py
|   `-- utils/
|       |-- ll_calculation.py
|       `-- paired_mutation_test.py
`-- resource/
    `-- schematic.png
```
Useful starting points:
- `database/` stores the curated cassette architectures and promoter dictionaries used to build the benchmark.
- `model/hyenadna.py`, `model/genslm.py`, and `model/alphagenome_infer.py` are the most configurable scoring scripts in the current repository.
- `model/NT.py`, `model/dnabert2.py`, and `model/generator.py` are lightweight legacy examples that currently use hard-coded paths for a single dataset slice.
The shortest end-to-end reproduction path is:

- Create the environment.
- Extract the processed benchmark inputs.
- Run one inference script on the extracted `processed_data/` directory.
- Compare `nonmutant` and mutant score files with the paired permutation test described below.
Example:

```shell
conda env create -f environment.yml
conda activate glm_eval
unzip data/processed_data.zip -d .
python model/hyenadna.py \
    --input-root "$PWD/processed_data" \
    --output-root "$PWD/inference_out_csv/HyenaDNA" \
    --model-name LongSafari/hyenadna-large-1m-seqlen-hf \
    --batch-size 1 \
    --device cuda
```

To set up from scratch:

```shell
git clone https://github.com/cellethology/GLM-Nullsette-Benchmark.git
cd GLM-Nullsette-Benchmark
conda env create -f environment.yml
conda activate glm_eval
```

The processed benchmark sequences used for scoring are distributed as a zip archive to keep the repository lightweight.

```shell
unzip data/processed_data.zip -d .
```

After extraction, the expected input layout is:
```
processed_data/
|-- deboer/
|   |-- Abf1TATA/
|   |   |-- nonmutant.txt
|   |   |-- mutant_translocation1.txt
|   |   |-- ...
|   |   `-- mutant_translocation11.txt
|   `-- pTpA/
|       |-- nonmutant.txt
|       |-- mutant_translocation1.txt
|       |-- ...
|       `-- mutant_translocation11.txt
|-- kosuri/
|   |-- nonmutant.txt
|   |-- mutant_translocation1.txt
|   |-- ...
|   `-- mutant_translocation19.txt
|-- lagator/
|   |-- nonmutant.txt
|   |-- mutant_translocation1.txt
|   |-- ...
|   `-- mutant_translocation19.txt
`-- zahm/
    |-- nonmutant.txt
    |-- mutant_translocation1.txt
    |-- ...
    `-- mutant_translocation11.txt
```
Important format details:

- Although the files end in `.txt`, they are standard FASTA files.
- Each file contains exactly 1,500 sequences in the current release.
- Parameterized scoring scripts should be pointed to the extracted `processed_data/` directory via `--input-root`.
- Sequence pairing across `nonmutant` and mutant files is preserved by FASTA header and file order.
The curated cassette definitions can be imported directly from the `database` package:

```python
from database import deboer_database, zahm_database, kosuri_database, lagator_database

print(deboer_database["deBoer_cassette"].keys())
print(kosuri_database["kosuri_cassette"].keys())
print(zahm_database["zahm_cassette"].keys())
print(lagator_database["lagator_cassette"].keys())
```

These dictionaries expose the fixed cassette components such as the promoter, RBS, start codon, CDS, stop codon, and terminator.
General notes for all scoring scripts:

- For CLM- and MLM-style models, the output columns are usually `seqs` and `scores`.
- `alphagenome_infer.py` additionally writes `score_region_start` and `score_region_end` when `--score-subsequence` is used.
- Scores are model-specific and should be compared within a model family, not across different model families.
`model/hyenadna.py` is a fully parameterized script for zero-shot scoring with Hugging Face HyenaDNA checkpoints.

```shell
python model/hyenadna.py \
    --input-root "$PWD/processed_data" \
    --output-root "$PWD/inference_out_csv/HyenaDNA" \
    --model-name LongSafari/hyenadna-large-1m-seqlen-hf \
    --batch-size 1 \
    --device cuda
```

`model/genslm.py` expects a local GenSLM source tree and a downloaded checkpoint.
Requirements specific to this script:

- The GenSLM source tree is expected at `./genslm/genslm`.
- A checkpoint must be provided either with `--weights-path` or via `--model-cache-dir`.
Example:

```shell
python model/genslm.py \
    --input-root "$PWD/processed_data" \
    --output-root "$PWD/inference_out_csv/GenSLM" \
    --model-id genslm_2.5B_patric \
    --weights-path /absolute/path/to/patric_2.5b_epoch00_val_los_0.29_bias_removed.pt \
    --batch-size 1 \
    --device cuda
```

`model/alphagenome_infer.py` scores sequences with AlphaGenome by averaging predicted RNA-seq activity over a selected region of the input sequence.
Requirements specific to this script:

- The local AlphaGenome research repository is expected at `./alphagenome_research`.
- If `--checkpoint-path` is not supplied, the script downloads the requested checkpoint from Hugging Face.
- If authentication is needed for the checkpoint download, set `HF_TOKEN` in the environment before running the script.
- `--score-subsequence` is used to locate the CDS region for scoring.
Basic example:

```shell
python model/alphagenome_infer.py \
    --input-root "$PWD/processed_data" \
    --output-root "$PWD/inference_out_csv/AlphaGenome" \
    --organism homo_sapiens \
    --use-all-rna-tracks \
    --device gpu \
    --score-subsequence ATCG...
```

Paper-aligned note:

- In the manuscript, AlphaGenome scores are computed by averaging RNA-seq predictions over the CDS positions of each cassette.
- To reproduce that setting exactly, pass the dataset-specific CDS sequence with `--score-subsequence`.
- The CDS sequences are stored in the database definitions under `database/`.
- If `--score-subsequence` is omitted, the script averages over the full inserted sequence instead.
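One way to wire this up is to read the CDS out of the imported cassette dictionary and interpolate it into the command line. The sketch below uses a stand-in dictionary because the key names inside each real cassette definition should first be checked with the `database` import example above; the inner key `"CDS"` is an assumption:

```python
import shlex

# Stand-in for e.g. kosuri_database["kosuri_cassette"]; the real dictionary
# comes from the database package, and its key names should be verified first.
cassette = {"promoter": "TTGACA...", "CDS": "ATGAGCAAAGGAGAA", "terminator": "..."}

cmd = [
    "python", "model/alphagenome_infer.py",
    "--input-root", "processed_data",
    "--output-root", "inference_out_csv/AlphaGenome",
    "--score-subsequence", cassette["CDS"],
]
print(shlex.join(cmd))
```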
`model/evo1.py` scores each FASTA file in an input directory and writes one tab-separated output file per input file.

```shell
python model/evo1.py \
    --input_dir "$PWD/processed_data/kosuri" \
    --output_dir "$PWD/inference_out_csv/Evo1/kosuri" \
    --model_name evo-1-131k-base \
    --batch_size 1 \
    --device cuda
```

This script assumes the upstream Evo package is already installed and importable.
`model/metagene1.py` runs causal language model scoring on every FASTA file in a dataset directory.

```shell
python model/metagene1.py \
    --input_dir "$PWD/processed_data/deboer/Abf1TATA" \
    --output_dir "$PWD/inference_out_csv/METAGENE-1"
```

The following scripts were kept as minimal single-dataset examples: `model/NT.py`, `model/dnabert2.py`, and `model/generator.py`. They require manual editing of `dir_paths` and `out_root_path` if you want to sweep additional datasets.
The repository uses different scoring rules for different model classes:

- Causal language models use the mean next-token log-likelihood, implemented in `model/utils/ll_calculation.py` as `compute_ll_clm`.
- Masked language models use the mean token log-probability from the unmasked MLM logits, implemented as `compute_llr_mlm`.
- AlphaGenome uses the mean predicted RNA-seq activity over the selected scoring region.
Because these scores come from different objectives and inference procedures, their absolute values should not be interpreted on a common numerical scale across model families. The benchmark should therefore be evaluated within each model family by comparing mutant scores against their matched nonmutant scores.
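To make the CLM rule concrete, here is a numpy-only sketch of mean next-token log-likelihood. It mirrors the idea behind `compute_ll_clm` but is not the repository implementation:

```python
import numpy as np

def mean_next_token_ll(logits, token_ids):
    """Mean log-probability a causal LM assigns to each observed next token.

    logits: array of shape (L, V) - per-position vocabulary logits
    token_ids: array of shape (L,) - the observed token ids
    Position t's logits predict token t+1, so the last logit row is unused
    and the first token has no prediction.
    """
    logits = np.asarray(logits, dtype=float)
    # log-softmax over the vocabulary axis, numerically stabilized
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    targets = np.asarray(token_ids)[1:]          # tokens 1..L-1
    per_token = log_probs[np.arange(len(targets)), targets]
    return per_token.mean()
```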
The one-sided paired permutation test used in the manuscript is implemented in `model/utils/paired_mutation_test.py`. For a given model and dataset:

- Load `nonmutant.csv`.
- Load one matched `mutant_translocation*.csv`.
- Run the test with `alternative="less"`, because mutants are expected to receive lower scores than nonmutants.
- Repeat for all mutation types in the dataset.
- Apply multiple-testing correction across the tested mutation types for that model and dataset.
- Compute the success rate as the fraction of mutation types with corrected `p < 0.05`.
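For reference, a one-sided paired (sign-flip) permutation test of this kind can be sketched in a few lines of numpy. This illustrates the procedure, not the exact repository implementation:

```python
import numpy as np

def paired_permutation_p(wt, mut, num_permutations=10000, seed=0):
    """One-sided paired permutation test (alternative: mut scores are lower).

    Randomly flips the sign of each paired score difference and counts how
    often the permuted mean difference is at least as low as the observed one.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(mut, dtype=float) - np.asarray(wt, dtype=float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(num_permutations, diffs.size))
    permuted = (signs * diffs).mean(axis=1)
    # add-one smoothing so the p-value is never exactly zero
    return (1 + np.sum(permuted <= observed)) / (num_permutations + 1)
```

The per-mutation-type p-values produced this way would then be corrected for multiple testing before computing the success rate.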
Minimal example:

```python
from pathlib import Path

import pandas as pd

from model.utils.paired_mutation_test import paired_permutation_test

root = Path("inference_out_csv/HyenaDNA/hyenadna-large-1m-seqlen-hf/kosuri")
wt = pd.read_csv(root / "nonmutant.csv", sep="\t")["scores"].to_numpy()

for mutant_file in sorted(root.glob("mutant_translocation*.csv")):
    mut = pd.read_csv(mutant_file, sep="\t")["scores"].to_numpy()
    p_value = paired_permutation_test(
        wt,
        mut,
        num_permutations=10000,
        alternative="less",
        seed=0,
    )
    print(mutant_file.name, p_value)
```

This repository distributes processed benchmark-ready inputs. The original upstream resources used to curate the benchmark are:
- Kosuri expression cassette backbone: Addgene 47441
- Lagator promoter dataset: Thermoters repository
- deBoer promoter library: GEO GSE104878
- Zahm promoter library: GEO GSE271608
The manuscript describes the dataset-specific filtering used to select the 1,500 nonmutant cassettes included in this release.
[1] Kosuri, Sriram, et al. "Composability of regulatory sequences controlling transcription and translation in Escherichia coli." Proceedings of the National Academy of Sciences 110.34 (2013): 14024-14029.
[2] Vaishnav, Eeshit Dhaval, et al. "The evolution, evolvability and engineering of gene regulatory DNA." Nature 603.7901 (2022): 455-463.
[3] Zahm, Adam M., et al. "A massively parallel reporter assay library to screen short synthetic promoters in mammalian cells." Nature Communications 15.1 (2024): 10353.
[4] de Boer, Carl G., et al. "Deciphering eukaryotic gene-regulatory logic with 100 million random promoters." Nature Biotechnology 38.1 (2020): 56-65.
We acknowledge the valuable contributions to genomic language modeling made by the authors of the following repositories: Evo1, Evo2, Nucleotide Transformer, DNABERT-2, GENERator, METAGENE-1, Caduceus, GPN, GENA-LM, gLM2, PDLLM, GENERanno, GPN-Promoter, GenSLM, HyenaDNA, AlphaGenome.
