Semantic-Aware Spatial Representation Learning for Spatial Domain Identification

Overview

GreS is a spatial domain identification framework that incorporates gene-level semantic priors into spatial representation learning. GreS models spatial domain organization from three complementary perspectives: physical spatial proximity, transcriptomic similarity, and semantic similarity derived from gene function. To capture these relationships, GreS constructs a spatial graph, a feature graph, and a semantic graph, which are encoded by three parallel GCN branches and fused into a unified spot representation.
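The three-view design can be illustrated with a small NumPy sketch. It is not the GreS implementation: the kNN construction, the single-layer GCN, the softmax-weighted fusion, the use of the expression matrix as input to every branch, and all names (`knn_adjacency`, `gcn_layer`, `fuse`) are assumptions chosen for illustration, and the data is random.

```python
import numpy as np


def knn_adjacency(features: np.ndarray, k: int) -> np.ndarray:
    """Row-normalised kNN adjacency (symmetrised, with self-loops); dense, small n only."""
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # a spot is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    n = features.shape[0]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), neighbours.ravel()] = 1.0
    adj = np.maximum(adj, adj.T) + np.eye(n)  # symmetrise, then add self-loops
    return adj / adj.sum(axis=1, keepdims=True)


def gcn_layer(adj: np.ndarray, x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adj @ x @ weight, 0.0)


def fuse(branches, logits: np.ndarray) -> np.ndarray:
    """Weighted sum of branch outputs; weights are a softmax over per-branch logits."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * z for wi, z in zip(w, branches))


rng = np.random.default_rng(0)
n_spots, n_genes, sem_dim, hidden = 50, 30, 16, 8
coords = rng.uniform(size=(n_spots, 2))                          # physical positions
expr = rng.poisson(2.0, size=(n_spots, n_genes)).astype(float)   # expression profiles
semantic = rng.normal(size=(n_spots, sem_dim))                   # spot semantic descriptors

# One graph per view: spatial, feature, semantic.
graphs = [knn_adjacency(view, k=6) for view in (coords, expr, semantic)]
weights = [rng.normal(scale=0.1, size=(n_genes, hidden)) for _ in graphs]
branches = [gcn_layer(a, expr, w) for a, w in zip(graphs, weights)]
fused = fuse(branches, logits=np.zeros(3))  # equal logits give a plain average
print(fused.shape)  # (50, 8)
```

In a trained model the branch logits and projection weights would be learned; here they are fixed only so the data flow is visible.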

Key features:

🧠 Gene-Level Semantic Priors: Builds spot-level semantic descriptors by encoding gene descriptions and aggregating gene semantics according to each spot's expression profile.
🕸️ Three Complementary Graphs: Constructs a spatial graph (physical proximity), a feature graph (transcriptomic similarity), and a semantic graph (functional similarity).
🔀 Three Parallel GCN Branches: Encodes the three views with separate GCN encoders and integrates them via a branch-weighted fusion mechanism.
📉 ZINB Autoencoding: Reconstructs input gene expression with a zero-inflated negative binomial (ZINB) decoder to handle sparsity and noise.
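For readers unfamiliar with the ZINB objective, the following is a minimal sketch of the standard zero-inflated negative binomial negative log-likelihood. The function name `zinb_nll`, the parameterisation (mean `mu`, dispersion `theta`, dropout probability `pi`), and the `eps` stabiliser are assumptions for illustration; the loss used in `tools/model.py` may differ in detail.

```python
import math

import numpy as np

# Element-wise log-gamma without requiring SciPy.
_lgamma = np.vectorize(math.lgamma, otypes=[float])


def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Mean negative log-likelihood of counts x under ZINB(mu, theta, pi)."""
    x, mu, theta, pi = (np.asarray(a, dtype=float) for a in (x, mu, theta, pi))
    # Log-probability of x under the negative binomial component.
    log_nb = (
        _lgamma(x + theta) - _lgamma(theta) - _lgamma(x + 1.0)
        + theta * np.log(theta / (theta + mu) + eps)
        + x * np.log(mu / (theta + mu) + eps)
    )
    # A zero is either a structural dropout (prob. pi) or a genuine NB zero.
    log_zero = np.log(pi + (1.0 - pi) * np.exp(log_nb) + eps)
    # A non-zero count can only come from the NB component.
    log_nonzero = np.log(1.0 - pi + eps) + log_nb
    return float(-np.mean(np.where(x == 0, log_zero, log_nonzero)))


counts = np.array([[0, 0, 3], [1, 0, 0]])
loss = zinb_nll(counts, mu=1.5, theta=2.0, pi=0.3)
print(f"ZINB NLL: {loss:.4f}")
```

The extra mass `pi` on zeros is what lets the decoder explain dropout-heavy spatial transcriptomics counts without pulling the estimated mean towards zero.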

Installation

Quick Setup (Using Requirements File)

We provide a requirements file for quick environment setup:

pip install -r environment/requirements_sc.txt

Download Resources (Required)

GreS requires pretrained gene embeddings and their vocabulary. Please download them from our Hugging Face repository and place them under embedding/text_embedd_large/:

GreS/
├── embedding/
│   └── text_embedd_large/
│       ├── pretrained_gene_embeddings.pt
│       └── vocab.json
└── ...
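Before running the pipeline, it can be useful to confirm that both files are in place. The helper below is a small convenience sketch, not part of the repository; the function name `missing_resources` is hypothetical, and the paths are taken from the layout above.

```python
from pathlib import Path

EMBEDDING_DIR = Path("embedding/text_embedd_large")
REQUIRED_FILES = ("pretrained_gene_embeddings.pt", "vocab.json")


def missing_resources(root="."):
    """Return the required resource paths that do not exist under the project root."""
    base = Path(root) / EMBEDDING_DIR
    return [str(base / name) for name in REQUIRED_FILES if not (base / name).is_file()]


# Run from the project root; prints the files that still need to be downloaded.
missing = missing_resources()
print("Missing resources:", missing if missing else "none")
```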

Data Preparation

Each dataset is identified by a dataset_id. Place the raw data under data/raw_h5ad/<dataset_id>/.

For 10x Visium data, the directory should contain:

filtered_feature_bc_matrix.h5: raw gene expression counts.
spatial/: spatial metadata (tissue positions, scale factors, images).
metadata.tsv: per-spot annotations, including the ground-truth label column.

Example (DLPFC sample 151672):

data/raw_h5ad/151672/
├── filtered_feature_bc_matrix.h5
├── spatial/
└── metadata.tsv

An .h5ad file containing adata.X (raw counts), adata.obsm['spatial'], and a label column is also supported.
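The three requirements on an .h5ad input can be checked up front. The sketch below works on plain arrays so that it stays independent of any particular loader; `check_inputs` is a hypothetical helper, and the specific checks (integer-valued counts, at least two coordinate columns) are assumptions about what "raw counts" and "spatial" should look like.

```python
import numpy as np


def check_inputs(counts, spatial, obs_columns, label_column="ground_truth"):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    counts = np.asarray(counts)
    spatial = np.asarray(spatial)
    if counts.ndim != 2:
        problems.append("X must be a 2-D spots x genes matrix")
    elif (counts < 0).any() or not np.allclose(counts, np.round(counts)):
        problems.append("X should hold raw, non-negative integer counts")
    if spatial.ndim != 2 or spatial.shape[1] < 2:
        problems.append("obsm['spatial'] needs at least two coordinate columns")
    elif counts.ndim == 2 and spatial.shape[0] != counts.shape[0]:
        problems.append("obsm['spatial'] must have one row per spot")
    if label_column not in obs_columns:
        problems.append(f"label column {label_column!r} not found in obs")
    return problems


counts = np.array([[0, 3, 1], [2, 0, 0]])
spatial = np.array([[10.0, 20.0], [11.0, 20.5]])
print(check_inputs(counts, spatial, ["ground_truth"]))  # []
```

With an AnnData object, this would be called with `adata.X` (converted to a dense array if it is sparse), `adata.obsm['spatial']`, and `adata.obs.columns`.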

Usage

The full pipeline has three steps. All commands are run from the project root, and we use DLPFC sample 151672 as an example.

Step 1: Data Preprocessing

Generate a graph-augmented data.h5ad from the raw data with preprocess/generate_data.py.

python preprocess/generate_data.py --dataset_id 151672 --label_column layer_guess_reordered

| Argument | Description | Default |
|---|---|---|
| `--dataset_id` | Dataset ID; locates the input and names the output | (required) |
| `--label_column` | Label column in `metadata.tsv` (DLPFC uses `layer_guess_reordered`) | `ground_truth` |

Step 2: Generate Semantic Embeddings

Build a per-spot semantic embedding from data.h5ad and the pretrained gene embeddings with preprocess/generate_raw_gene_concat_spot_embedding.py. Results are written to data/npys_grn_raw_concat/.

python preprocess/generate_raw_gene_concat_spot_embedding.py \
    --dataset_id 151672 \
    --embedding embedding/text_embedd_large/pretrained_gene_embeddings.pt \
    --vocab embedding/text_embedd_large/vocab.json

This produces embeddings_<dataset_id>.npy (the spot semantic embeddings used during training), together with an _attribution.npz and a _stats.json file.
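To give an intuition for how gene-level embeddings can become spot-level descriptors, here is a minimal sketch of an expression-weighted average. This is an assumption for illustration only: the script name suggests the actual aggregation involves concatenation, and the `vocab` format (gene symbol to embedding row index), the function name `spot_semantic_embeddings`, and the toy data are all hypothetical.

```python
import numpy as np


def spot_semantic_embeddings(counts, gene_names, vocab, gene_embeddings):
    """Average gene embeddings per spot, weighted by that spot's expression."""
    # Keep only genes that have a pretrained embedding.
    cols = [j for j, gene in enumerate(gene_names) if gene in vocab]
    rows = [vocab[gene_names[j]] for j in cols]
    weights = np.asarray(counts, dtype=float)[:, cols]
    totals = weights.sum(axis=1, keepdims=True)
    # Normalise each spot's weights to sum to 1; all-zero spots stay at zero.
    weights = np.divide(weights, totals, out=np.zeros_like(weights), where=totals > 0)
    return weights @ np.asarray(gene_embeddings, dtype=float)[rows]


vocab = {"GENE_A": 0, "GENE_B": 1, "GENE_C": 2}        # toy vocabulary
gene_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gene_names = ["GENE_A", "GENE_B", "GENE_X"]            # GENE_X has no embedding
counts = np.array([[1, 3, 5], [0, 0, 2]])

spot_emb = spot_semantic_embeddings(counts, gene_names, vocab, gene_embeddings)
print(spot_emb.shape)  # (2, 2)
```

The real output for the example dataset should be loadable with `np.load("data/npys_grn_raw_concat/embeddings_151672.npy")`, with one row per spot.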

Step 3: Training & Clustering

Train the model and cluster the spots with tools/train.py. Hyperparameters are read from config/<config_name>.ini (default: DLPFC).

python tools/train.py --dataset_id 151672
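When processing several samples, the three commands above can be chained from Python. The wrapper below is a convenience sketch rather than part of the repository: `pipeline_commands` and `run_pipeline` are hypothetical names, and only the scripts and flags shown in this README are used. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess
import sys

EMBEDDING = "embedding/text_embedd_large/pretrained_gene_embeddings.pt"
VOCAB = "embedding/text_embedd_large/vocab.json"


def pipeline_commands(dataset_id, label_column):
    """Build the three pipeline commands (Steps 1 to 3) for one dataset."""
    py = sys.executable
    return [
        [py, "preprocess/generate_data.py",
         "--dataset_id", dataset_id, "--label_column", label_column],
        [py, "preprocess/generate_raw_gene_concat_spot_embedding.py",
         "--dataset_id", dataset_id, "--embedding", EMBEDDING, "--vocab", VOCAB],
        [py, "tools/train.py", "--dataset_id", dataset_id],
    ]


def run_pipeline(dataset_id, label_column, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(dataset_id, label_column):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# Dry run for the example sample; run from the project root.
run_pipeline("151672", "layer_guess_reordered")
```

Calling `run_pipeline(..., dry_run=False)` inside a loop over dataset IDs would process the samples one after another.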

Output

Results are saved in data/results/<dataset_id>/:

<dataset_id>.h5ad: AnnData with clustering results; cluster labels are stored in obs['idx'].
<dataset_id>_clusters.png: spatial cluster plot.
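A quick way to inspect the result is to cross-tabulate the predicted clusters against reference annotations. The sketch below uses toy label lists and only the standard library; `cross_tab` is a hypothetical helper, not a function provided by GreS.

```python
from collections import Counter


def cross_tab(cluster_labels, reference_labels):
    """Count spots per (cluster, reference label) pair as a nested dict."""
    pairs = Counter(zip(cluster_labels, reference_labels))
    table = {}
    for (cluster, reference), n in pairs.items():
        table.setdefault(cluster, {})[reference] = n
    return table


# Toy data standing in for obs['idx'] and a ground-truth annotation column.
clusters = [0, 0, 1, 1, 1, 2]
layers = ["L1", "L1", "L2", "L2", "L3", "L3"]
for cluster, row in sorted(cross_tab(clusters, layers).items()):
    print(cluster, row)
```

For real output, the cluster labels could be read with `anndata.read_h5ad("data/results/151672/151672.h5ad").obs["idx"]`. Whether the ground-truth column is carried over into the results file is an assumption to verify; if it is not, take the reference labels from the original metadata.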

Repository Structure

GreS/
├── config/                 # Configuration files 
├── data/
│   ├── raw_h5ad/           # Input data, one folder per dataset_id
│   ├── generated/          # Preprocessing output (data.h5ad, graphs)
│   ├── npys_grn_raw_concat/ # Generated spot semantic embeddings
│   └── results/            # Training results (h5ad + cluster plots)
├── embedding/
│   └── text_embedd_large/  # Pretrained gene embeddings and vocabulary
├── preprocess/
│   ├── generate_data.py                          # Step 1: build data.h5ad + graphs
│   ├── generate_raw_gene_concat_spot_embedding.py # Step 2: build spot semantic embeddings
│   └── config.py
├── tools/
│   ├── model.py            # GreS model architecture
│   ├── train.py            # Step 3: training & clustering
│   └── utils.py
├── fig/                    # Figure assets
├── tutorial.ipynb          # End-to-end tutorial
└── README.md
