gLM2 Framework

gLM2 (Genomic Language Model 2) is a mixed-modality deep learning framework for extracting embeddings from biological sequences (amino acids and nucleotides). Given a FASTA file, it produces high-dimensional embeddings using pre-trained gLM2 models from HuggingFace.

Overview

The gLM2 framework provides tools to:

  • Extract embeddings from FASTA sequences using pre-trained gLM2 models (gLM2_650M, gLM2_150M, etc.)
  • Process sequences in batches for efficient inference
  • Reduce embedding dimensionality using GPU-accelerated PCA (cuML)
  • Support both padded and variable-length embeddings
  • Handle mixed-modality sequences (amino acids and nucleotides)

Prerequisites

  • Python 3.10 or higher
  • uv package manager (install from https://github.com/astral-sh/uv)
  • CUDA-capable GPU (optional for model inference, required for cuML-based PCA)
  • CUDA 12.x (for cuML GPU acceleration)

Environment Setup

1. Install uv (if not already installed)

# Using pip
pip install uv

2. Create Virtual Environment and Install Dependencies

Run the environment setup commands:

uv sync
source .venv/bin/activate

This will:

  • Create a virtual environment (.venv/)
  • Install all required dependencies including:
    • PyTorch and Transformers (for model inference)
    • cuML and CuPy (for GPU-accelerated PCA)
    • BioPython (for FASTA file handling)
    • NumPy and other utilities

Note: The cuml-cu12 package is installed from the RAPIDS PyPI index, which requires CUDA 12.x. If you don't have CUDA or want CPU-only PCA, you may need to modify the dependencies.

3. Install Package in Editable Mode (Recommended)

For development and easier imports:

pip install -e .

Or with uv:

uv pip install -e .

Usage

Extract Embeddings from FASTA Sequences

The main module for extracting embeddings is retrieve_embeddings.

Basic Usage

python -m retrieve_embeddings.retrieve_embeddings \
    test_files/test.fasta \
    embeddings.npz

Full Command with All Options

python -m retrieve_embeddings.retrieve_embeddings \
    <path-to-input.fasta> \
    <path-to-output.npz> \
    --model-name tattabio/gLM2_650M \
    --batch-size 1 \
    --device cuda \
    --no-average

Command-Line Arguments

  • input_file (positional, required): Path to input FASTA file containing biological sequences
  • output_file (positional, required): Path to output .npz file where embeddings will be saved
  • --model-name (optional): HuggingFace model identifier (default: tattabio/gLM2_650M)
    • Available models: tattabio/gLM2_650M, tattabio/gLM2_150M
  • --batch-size (optional): Batch size for processing sequences (default: 1)
  • --device (optional): Device to run inference on - cuda, cpu, or auto (default: auto)
  • --no-average (optional): Keep per-token embeddings (default is mean pooled)
  • --no-validate (optional): Skip sequence validation (not recommended)

Output Format

The script outputs a compressed NumPy archive (.npz) file containing:

  • ids: Array of sequence IDs from the FASTA file
  • embeddings:
    • If mean pooled (default): Array with shape (num_sequences, hidden_dim)
    • If per-token (--no-average): Object array of variable-length embeddings
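
For example, you can tell the two cases apart after loading (a minimal sketch, assuming the per-token case is stored as a 1-D object array as described above):

import numpy as np

data = np.load("embeddings.npz", allow_pickle=True)
emb = data["embeddings"]

if emb.ndim == 2:
    # Mean pooled: one fixed-size vector per sequence
    print("pooled:", emb.shape)  # (num_sequences, hidden_dim)
else:
    # Per-token (--no-average): each element is a (seq_len_i, hidden_dim) array
    print("per-token:", len(emb), "sequences; first:", emb[0].shape)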

Example Commands

# Extract embeddings with default settings (mean pooled, GPU if available)
python -m retrieve_embeddings.retrieve_embeddings \
    test_files/test.fasta \
    embeddings.npz

# Extract per-token embeddings
python -m retrieve_embeddings.retrieve_embeddings \
    test_files/test.fasta \
    embeddings.npz \
    --no-average

# Use CPU and smaller batch size
python -m retrieve_embeddings.retrieve_embeddings \
    test_files/test.fasta \
    embeddings.npz \
    --device cpu \
    --batch-size 4

# Use a smaller model
python -m retrieve_embeddings.retrieve_embeddings \
    test_files/test.fasta \
    embeddings.npz \
    --model-name tattabio/gLM2_150M

Python API Usage

You can also use the functions programmatically:

from pathlib import Path
from retrieve_embeddings import process_fasta_and_save_embeddings

# Extract and save embeddings
process_fasta_and_save_embeddings(
    fasta_path=Path("test_files/test.fasta"),
    output_path=Path("embeddings.npz"),
    model_name="tattabio/gLM2_650M",
    batch_size=1,
    average_embeddings=True,
    device="cuda"
)

Reduce Embedding Dimensionality with PCA

After extracting embeddings, you can reduce their dimensionality using GPU-accelerated PCA:

Basic Usage

python -m pca.pca \
    --input-file embeddings.npz \
    --output-file embeddings_pca512.npz \
    --n-components 512

Full Command with All Options

python -m pca.pca \
    --input-file <path-to-input.npz> \
    --output-file <path-to-output.npz> \
    --n-components 512 \
    --random-state 42

Command-Line Arguments

  • --input-file (required): Path to input .npz file containing embeddings
  • --output-file (required): Path to output .npz file for reduced embeddings
  • --n-components (optional): Number of PCA components (default: 512)
  • --random-state (optional): Random seed for reproducibility (default: 42)

Supported Input Formats

The PCA module supports:

  • Padded embeddings: 3D array (batch_size, seq_len, dimension), automatically flattened to 2D before PCA
  • Variable-length embeddings: object array of per-sequence embeddings, each flattened individually (see the sketch below)
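
As an illustration of the flattening step (a hedged sketch with made-up shapes, not the module's exact code), a padded 3D array is reshaped so that each sequence contributes a single row to the PCA input matrix:

import numpy as np

# Hypothetical padded embeddings: 4 sequences, 10 tokens each, 640-dim hidden states
padded = np.random.rand(4, 10, 640).astype(np.float32)

# Flatten each sequence's (seq_len, dimension) block into one row: (4, 6400)
flattened = padded.reshape(padded.shape[0], -1)
print(flattened.shape)  # (4, 6400)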

Example

# Reduce to 256 dimensions
python -m pca.pca \
    --input-file embeddings.npz \
    --output-file embeddings_pca256.npz \
    --n-components 256

# Reduce with custom random seed
python -m pca.pca \
    --input-file embeddings.npz \
    --output-file embeddings_pca512.npz \
    --n-components 512 \
    --random-state 123

Python API for PCA

from pathlib import Path
from pca import reduce_embeddings_pca

# Reduce embedding dimensionality
reduce_embeddings_pca(
    input_file=Path("embeddings.npz"),
    output_file=Path("embeddings_pca512.npz"),
    n_components=512,
    random_state=42
)

Loading Embeddings

You can load the saved embeddings in Python:

import numpy as np
from pca import load_embeddings

# Load embeddings
embeddings, ids = load_embeddings("embeddings.npz")

print(f"Loaded {len(ids)} sequences")
print(f"Embeddings shape: {embeddings.shape}")

# Or load directly with numpy
data = np.load("embeddings.npz", allow_pickle=True)
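# allow_pickle=True is needed when embeddings were saved per-token (stored as a NumPy object array)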
sequence_ids = data["ids"]
embeddings = data["embeddings"]

Sequence Format

gLM2 supports mixed-modality sequences containing:

  • Amino acids (uppercase): A-Z (the standard 20 plus the extended codes B, J, O, U, X, Z)
  • Nucleotides (lowercase): a, t, g, c, and n (ambiguous base)
  • Special strand tokens: <+> and <-> to indicate positive/negative strand

Example Sequences

>seq1
<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK<+>aattttttaaggaa<->MLGIDNIERVKPGGLEKGGRAFGFSAIVVVGNED

>seq2
<+>aatttaaaaaggaa
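
For reference, here is a minimal validation sketch based on the rules above (an illustrative helper, not the framework's built-in validator):

import re

# Illustrative pattern built from the rules above: strand tokens <+>/<->,
# uppercase amino-acid letters, and lowercase nucleotides a/t/g/c/n
VALID_SEQUENCE = re.compile(r"^(?:<[+-]>|[A-Z]|[atgcn])+$")

def looks_valid(sequence: str) -> bool:
    """Return True if the sequence uses only characters gLM2 accepts."""
    return bool(VALID_SEQUENCE.match(sequence))

print(looks_valid("<+>MALTK<->atgcn"))   # True
print(looks_valid("<+>MALTK<->atgcn!"))  # False ('!' is not a valid character)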

Project Structure

gLM2/
├── model/                    # gLM2 model architecture
│   ├── __init__.py
│   ├── configuration_glm2.py  # Model configuration
│   └── modeling_glm2.py      # Model implementation
├── retrieve_embeddings/       # Embedding extraction module
│   ├── __init__.py
│   ├── retrieve_embeddings.py # Main embedding extraction script
│   └── util.py                # Utility functions (FASTA reading, validation, saving)
├── pca/                       # PCA dimensionality reduction
│   ├── __init__.py
│   └── pca.py                 # PCA reduction script
├── test_files/                # Test data
│   └── test.fasta            # Example FASTA file
├── tests/                     # Unit tests
│   └── test_saving_loading.py
├── docs/                      # Documentation
│   └── images/
├── pyproject.toml             # Project dependencies and configuration
├── uv.lock                    # Locked dependency versions
└── README.md                  # This file

Testing

Run the test suite to verify your installation:

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_saving_loading.py -v

# Run with coverage
pytest tests/ --cov=retrieve_embeddings --cov=pca

Troubleshooting

Model Download Issues

The gLM2 models are automatically downloaded from HuggingFace on first use. If you encounter download issues:

  1. Check your internet connection
  2. Verify HuggingFace access (models are public, but you may need to accept terms)
  3. Try downloading manually from: https://huggingface.co/tattabio/gLM2_650M

CUDA/cuML Issues

If you encounter issues with cuML (GPU-accelerated PCA):

  1. CUDA not available: The framework will automatically fall back to CPU for model inference, but PCA requires cuML, which needs CUDA 12.x
  2. cuML installation fails: If the pip wheel from the RAPIDS PyPI index fails to install, cuML can also be installed via conda:
    conda install -c rapidsai -c conda-forge -c nvidia cuml
  3. CPU-only PCA: You can modify the code to use scikit-learn's PCA instead of cuML for CPU-only operation, as sketched below
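
A CPU-only replacement could look like this (a sketch using scikit-learn's PCA on mean-pooled embeddings; scikit-learn is not among the dependencies listed above, so install it first with pip install scikit-learn):

import numpy as np
from sklearn.decomposition import PCA

# Load mean-pooled embeddings: shape (num_sequences, hidden_dim)
data = np.load("embeddings.npz", allow_pickle=True)
embeddings = data["embeddings"]

# Fit PCA on the CPU and project down to 512 dimensions
pca = PCA(n_components=512, random_state=42)
reduced = pca.fit_transform(embeddings)

np.savez_compressed("embeddings_pca512.npz", ids=data["ids"], embeddings=reduced)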

Import Errors

If you encounter ModuleNotFoundError:

  1. Make sure the package is installed in editable mode:
    pip install -e .
  2. Verify the virtual environment is activated
  3. Check that pyproject.toml includes all necessary packages
