ai4nucleome/GreS

Semantic-Aware Spatial Representation Learning for Spatial Domain Identification

Overview

GreS is a spatial domain identification framework that incorporates gene-level semantic priors into spatial representation learning. GreS models spatial domain organization from three complementary perspectives: physical spatial proximity, transcriptomic similarity, and semantic similarity derived from gene function. To capture these relationships, GreS constructs a spatial graph, a feature graph, and a semantic graph, which are encoded by three parallel GCN branches and fused into a unified spot representation.
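
As a rough illustration of how a spatial graph and a feature graph could be built, the sketch below links each spot to its k nearest neighbours, using Euclidean distance on coordinates and cosine distance on expression vectors. The function names, the toy data, the value of k, and the choice of metrics are all assumptions made for this example; the actual construction lives in preprocess/generate_data.py and may differ.

```python
import math

def knn_edges(points, k, dist):
    """Return undirected edges (i, j) with i < j, linking each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by its distance to point i.
        ranked = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def euclidean(a, b):
    return math.dist(a, b)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)

# Toy data (invented): four spots with 2D coordinates and two-gene expression.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
expr = [(5.0, 0.0), (0.0, 3.0), (4.0, 0.1), (0.2, 6.0)]

spatial_graph = knn_edges(coords, k=1, dist=euclidean)
feature_graph = knn_edges(expr, k=1, dist=cosine_distance)
print(sorted(spatial_graph))  # neighbours by physical proximity
print(sorted(feature_graph))  # neighbours by expression similarity
```

In this toy case the two graphs disagree: spots that sit next to each other are not the ones with the most similar expression, which is the kind of complementary signal the separate branches are meant to capture.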

GreS Framework

Key features:

  • 🧠 Gene-Level Semantic Priors: Builds spot-level semantic descriptors by encoding gene descriptions and aggregating gene semantics according to each spot's expression profile.
  • 🕸️ Three Complementary Graphs: Constructs a spatial graph (physical proximity), a feature graph (transcriptomic similarity), and a semantic graph (functional similarity).
  • 🔀 Three Parallel GCN Branches: Encodes the three views with separate GCN encoders and integrates them via a branch-weighted fusion mechanism.
  • 📉 ZINB Autoencoding: Reconstructs input gene expression with a zero-inflated negative binomial (ZINB) decoder to handle sparsity and noise.
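
For readers unfamiliar with the ZINB objective mentioned above, here is a minimal sketch of the per-count negative log-likelihood in plain Python. It uses the standard mean/dispersion parameterization (mu, theta) plus a dropout probability pi; these symbol names are conventional choices rather than identifiers taken from the GreS code, and the repository's PyTorch implementation may be organized differently.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu and dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    Assumes mu > 0, theta > 0 and 0 <= pi < 1.
    """
    if x == 0:
        # A zero is either a dropout (probability pi) or a genuine NB zero.
        p_zero = pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta))
        return -math.log(p_zero)
    # A non-zero count can only come from the NB component.
    return -(math.log(1.0 - pi) + nb_log_pmf(x, mu, theta))

# Zero inflation makes an observed zero less costly than under a plain NB.
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.3))
print(zinb_nll(0, mu=2.0, theta=1.5, pi=0.0))
```

A training loss would typically average this quantity over all spots and genes, with mu, theta, and pi predicted by the decoder.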

A runnable end-to-end walkthrough is also available in tutorial.ipynb.

Installation

Quick Setup (Using Requirements File)

We provide a requirements file for quick environment setup:

pip install -r environment/requirements_sc.txt

Download Resources (Required)

GreS requires pretrained gene embeddings and their vocabulary. Please download them from our Hugging Face repository and place them under embedding/text_embedd_large/:

GreS/
├── embedding/
│   └── text_embedd_large/
│       ├── pretrained_gene_embeddings.pt
│       └── vocab.json
└── ...
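
Before running the pipeline, it can help to confirm that both files are in place. The helper below is a small sketch for that; the function name and the project_root parameter are hypothetical and not part of GreS.

```python
from pathlib import Path

# File names taken from the layout above.
REQUIRED_EMBEDDING_FILES = ("pretrained_gene_embeddings.pt", "vocab.json")

def missing_embedding_files(project_root="."):
    """Return the required embedding files not found under embedding/text_embedd_large/."""
    emb_dir = Path(project_root) / "embedding" / "text_embedd_large"
    return [name for name in REQUIRED_EMBEDDING_FILES
            if not (emb_dir / name).is_file()]

missing = missing_embedding_files()
if missing:
    print("Missing embedding resources:", ", ".join(missing))
else:
    print("All embedding resources found.")
```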

Data Preparation

Each dataset is identified by a dataset_id. Place the raw data under data/raw_h5ad/<dataset_id>/.

For 10x Visium data, the directory should contain:

  • filtered_feature_bc_matrix.h5: raw gene expression counts.
  • spatial/: spatial metadata (tissue positions, scale factors, images).
  • metadata.tsv: per-spot annotations, including the ground-truth label column.

Example (DLPFC sample 151672):

data/raw_h5ad/151672/
├── filtered_feature_bc_matrix.h5
├── spatial/
└── metadata.tsv

Alternatively, a single .h5ad file containing adata.X (raw counts), adata.obsm['spatial'], and a label column can be used as input.
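
Since preprocessing depends on the label column being present, a quick header check on metadata.tsv can catch a wrong column name early. The sketch below assumes a tab-separated file with a single header row; the helper name and the demo contents are invented for illustration.

```python
import csv
import io

def label_column_present(tsv_file, label_column):
    """Return True if the header row of a tab-separated file contains label_column."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader, [])  # empty file -> empty header
    return label_column in header

# Toy stand-in for data/raw_h5ad/<dataset_id>/metadata.tsv (contents invented).
demo = io.StringIO("barcode\tlayer_guess_reordered\nSPOT-1\tLayer1\n")
print(label_column_present(demo, "layer_guess_reordered"))
```

For a real file, the same function can be called on an open handle, for example `with open(path, newline="") as fh: label_column_present(fh, "layer_guess_reordered")`.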

Usage

The full pipeline has three steps. All commands are run from the project root, and we use DLPFC sample 151672 as an example.

Step 1: Data Preprocessing

Generate a graph-augmented data.h5ad from the raw data with preprocess/generate_data.py.

python preprocess/generate_data.py --dataset_id 151672 --label_column layer_guess_reordered

| Argument | Description | Default |
| --- | --- | --- |
| --dataset_id | Dataset ID; locates the input and names the output | (required) |
| --label_column | Label column in metadata.tsv (DLPFC uses layer_guess_reordered) | ground_truth |

Step 2: Generate Semantic Embeddings

Build a per-spot semantic embedding from data.h5ad and the pretrained gene embeddings with preprocess/generate_raw_gene_concat_spot_embedding.py. Results are written to data/npys_grn_raw_concat/.

python preprocess/generate_raw_gene_concat_spot_embedding.py \
    --dataset_id 151672 \
    --embedding embedding/text_embedd_large/pretrained_gene_embeddings.pt \
    --vocab embedding/text_embedd_large/vocab.json

This produces embeddings_<dataset_id>.npy (the spot semantic embeddings used during training), together with an _attribution.npz and a _stats.json file.
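
To make the idea of a spot-level semantic embedding concrete, the sketch below computes an expression-weighted mean of gene embeddings for a single spot. The weighting scheme, the function name, and the toy gene names are assumptions for illustration only; the actual script may normalize, select, or combine genes differently.

```python
def spot_semantic_embedding(expression, gene_embeddings):
    """Expression-weighted mean of gene embeddings for one spot.

    expression: dict mapping gene name -> count for this spot
    gene_embeddings: dict mapping gene name -> list of floats (same length for all genes)
    Genes without an embedding, or with non-positive counts, are skipped.
    """
    dim = len(next(iter(gene_embeddings.values())))
    weighted_sum = [0.0] * dim
    total = 0.0
    for gene, count in expression.items():
        vec = gene_embeddings.get(gene)
        if vec is None or count <= 0:
            continue
        total += count
        for d in range(dim):
            weighted_sum[d] += count * vec[d]
    if total == 0:
        return weighted_sum  # no usable genes: return the zero vector
    return [v / total for v in weighted_sum]

# Toy inputs (invented): two genes with embeddings, one without.
gene_embeddings = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0]}
spot = {"GENE_A": 3, "GENE_B": 1, "GENE_C": 7}
print(spot_semantic_embedding(spot, gene_embeddings))  # [0.75, 0.25]
```

Here GENE_C is ignored because it has no embedding, so the result reflects only the 3:1 ratio between GENE_A and GENE_B.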

Step 3: Training & Clustering

Train the model and cluster the spots with tools/train.py. Hyper-parameters are read from config/<config_name>.ini (default: DLPFC).

python tools/train.py --dataset_id 151672
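
When processing several samples, the three steps can be assembled programmatically. The sketch below only builds and prints the commands shown in this README; the helper name and its parameters are hypothetical, and nothing is executed.

```python
import shlex

def pipeline_commands(dataset_id, label_column="ground_truth",
                      embedding_dir="embedding/text_embedd_large"):
    """Return the three pipeline commands for one dataset as argument lists."""
    return [
        # Step 1: preprocessing
        ["python", "preprocess/generate_data.py",
         "--dataset_id", dataset_id, "--label_column", label_column],
        # Step 2: spot semantic embeddings
        ["python", "preprocess/generate_raw_gene_concat_spot_embedding.py",
         "--dataset_id", dataset_id,
         "--embedding", f"{embedding_dir}/pretrained_gene_embeddings.pt",
         "--vocab", f"{embedding_dir}/vocab.json"],
        # Step 3: training and clustering
        ["python", "tools/train.py", "--dataset_id", dataset_id],
    ]

for cmd in pipeline_commands("151672", label_column="layer_guess_reordered"):
    print(shlex.join(cmd))
```

To actually run the steps, each argument list could be passed to `subprocess.run(cmd, check=True)` from the project root, so that a failure in one step stops the sequence.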

Output

Results are saved in data/results/<dataset_id>/:

  • <dataset_id>.h5ad: AnnData with clustering results; cluster labels are stored in obs['idx'].
  • <dataset_id>_clusters.png: spatial cluster plot.
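
The saved results can be inspected with the anndata package. The following sketch locates the output files and counts spots per cluster from obs['idx']; the helper names are hypothetical, and anndata is imported lazily so the path helper works even where it is not installed.

```python
from collections import Counter
from pathlib import Path

def result_paths(dataset_id, project_root="."):
    """Return (h5ad_path, plot_path) for a dataset under data/results/<dataset_id>/."""
    out_dir = Path(project_root) / "data" / "results" / dataset_id
    return out_dir / f"{dataset_id}.h5ad", out_dir / f"{dataset_id}_clusters.png"

def cluster_sizes(dataset_id, project_root="."):
    """Count spots per cluster label stored in obs['idx'] (requires anndata)."""
    import anndata  # imported here so result_paths has no extra dependency
    h5ad_path, _ = result_paths(dataset_id, project_root)
    adata = anndata.read_h5ad(h5ad_path)
    return Counter(adata.obs["idx"])

h5ad_path, plot_path = result_paths("151672")
print(h5ad_path.as_posix())
print(plot_path.as_posix())
```

After training has finished, `cluster_sizes("151672")` would return a mapping from each cluster label to its number of spots.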

Repository Structure

GreS/
├── config/                 # Configuration files 
├── data/
│   ├── raw_h5ad/           # Input data, one folder per dataset_id
│   ├── generated/          # Preprocessing output (data.h5ad, graphs)
│   ├── npys_grn_raw_concat/# Generated spot semantic embeddings
│   └── results/            # Training results (h5ad + cluster plots)
├── embedding/
│   └── text_embedd_large/  # Pretrained gene embeddings and vocabulary
├── preprocess/
│   ├── generate_data.py                          # Step 1: build data.h5ad + graphs
│   ├── generate_raw_gene_concat_spot_embedding.py# Step 2: build spot semantic embeddings
│   └── config.py
├── tools/
│   ├── model.py            # GreS model architecture
│   ├── train.py            # Step 3: training & clustering
│   └── utils.py
├── fig/                    # Figure assets
├── tutorial.ipynb          # End-to-end tutorial
└── README.md
