
KevinBian107/MOSAIC



This project explores a simple idea: instead of asking generative models to recover motif-level structure implicitly, we encode motifs directly into the representation used for graph generation, independent of the sequence transformer's processing. We are interested in constructing this representation in the form of flat tokens.

Core Approach

Generating graphs from tokens that carry hierarchical structure requires three steps:

  1. Create the input H-graph: Build a hierarchical representation of the graph using coarsening strategies (HAC, Spectral Clustering, Motif Community).

  2. Tokenize the input H-graph: Convert the hierarchy to a token sequence using H-SENT (Vanilla HiGen) or HDT (DFS-based). Note that the tokens must preserve enough information (leaf edge connections) for the inverse problem of flattening the H-graph.

  3. Flatten the generated H-graph: Reconstruct the flat graph from tokens via bipartite edge union for H-SENT, or union of back edges for HDT.
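Steps 1 and 3 can be sketched in plain Python. Everything below (function names, the dict-based H-graph) is illustrative only and is not MOSAIC's actual API; it just shows why the leaf edges must travel with the coarse graph for flattening to work:

```python
def coarsen(edges, assignment):
    """Step 1 (simplified): collapse leaf nodes into super-nodes
    according to a precomputed cluster assignment (e.g. from HAC,
    spectral clustering, or motif communities)."""
    leaf_edges = set(map(tuple, edges))
    super_edges = {
        (assignment[u], assignment[v])
        for u, v in leaf_edges
        if assignment[u] != assignment[v]
    }
    # Keep the leaf edges alongside the coarse graph: the tokenizer
    # must preserve them for the inverse (flattening) problem.
    return {"super_edges": super_edges, "leaf_edges": leaf_edges}

def flatten(hgraph):
    """Step 3 (simplified): recover the flat graph as the union of
    the preserved leaf-level edges."""
    return sorted(hgraph["leaf_edges"])

# Toy graph: two triangle motifs joined by one bridge edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
assignment = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

h = coarsen(edges, assignment)
print(h["super_edges"])                          # {('A', 'B')}
print(flatten(h) == sorted(map(tuple, edges)))   # True
```

The key design point mirrored here is that the coarse graph alone is lossy; the hierarchy is only invertible because leaf-level connectivity is carried through tokenization.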


Quick Start

Installation

# Create training environment
conda env create -f environment.yaml
conda activate mosaic

# Create evaluation/test environment (required for full metrics)
conda env create -f environment_eval.yaml
conda activate mosaic-eval

Training

The configs in configs/ are the default hyperparameters used for our experiments. Training uses Hydra for configuration — experiment-specific overrides (dataset size, LR, steps) are in configs/experiment/, and tokenizer defaults in configs/tokenizer/. View sample training runs on WandB.
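As an illustration of what an experiment override file in such a Hydra layout might contain (the file name and keys below are hypothetical; consult configs/experiment/ for the real schema):

```yaml
# Hypothetical configs/experiment/example.yaml — key names are
# illustrative, not the repo's actual schema.
# @package _global_
defaults:
  - override /tokenizer: hdtc   # select a tokenizer config group

data:
  subset_size: 10000            # dataset size override
train:
  lr: 3.0e-4                    # learning rate
  max_steps: 50000              # training steps
```

Any of these keys can also be overridden directly on the command line using Hydra's dotted syntax, as the commands below do for `tokenizer=` and `experiment=`.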

# Train HDTC on MOSES (default)
python scripts/train.py

# Train with different tokenizers
python scripts/train.py tokenizer=sent     # Flat SENT (baseline)
python scripts/train.py tokenizer=hsent    # Hierarchical SENT
python scripts/train.py tokenizer=hdt      # Hierarchical DFS
python scripts/train.py tokenizer=hdtc     # Compositional (default)

# Train on COCONUT dataset
python scripts/train.py experiment=coconut

# Evaluate trained model
# (Use the eval environment: conda activate mosaic-eval)
python scripts/test.py model.checkpoint_path=outputs/train/.../best.ckpt
python scripts/realistic_gen.py model.checkpoint_path=outputs/train/.../best.ckpt

# Create comparison table
python scripts/comparison/compare_results.py

Trained Checkpoints

Download our trained checkpoints from Google Drive.

Batch Benchmark Scripts

The bash_scripts/ directory automates the full benchmark pipeline. See bash_scripts/README.md for details.

# Train all tokenizer variants
./bash_scripts/train/train_benchmarks.sh
./bash_scripts/train/train_benchmarks.sh --coconut

# Evaluate all trained models
./bash_scripts/eval/eval_benchmarks.sh
./bash_scripts/eval/eval_benchmarks.sh --coconut

# Faster eval flow: GPU sequential + CPU-parallel motif phase
./bash_scripts/eval/eval_benchmarks_2phase.sh

Project Structure

MOSAIC/
├── src/
│   ├── data/              # Data loading, generation, and motif detection
│   ├── tokenizers/        # Graph tokenization (SENT, H-SENT, HDT, HDTC)
│   │   ├── coarsening/    # Coarsening strategies (spectral, motif-aware)
│   │   └── motif/         # Motif detection and patterns
│   ├── models/            # Transformer models
│   ├── evaluation/        # Standard and motif metrics
│   └── realistic_gen/     # Generation quality analysis
├── configs/               # Hydra configuration
├── scripts/               # Training, evaluation, and visualization scripts
│   ├── preprocess/        # Data preprocessing and caching
│   ├── comparison/        # Result comparison and benchmarking
│   └── visualization/     # Visualization and demo scripts
├── bash_scripts/          # Batch benchmark automation scripts
├── tests/                 # Test suite
└── docs/                  # Documentation

Documentation

See the docs/ directory for setup guides (docs/setups/) and design docs (docs/designs/).

Acknowledgement

This codebase was developed based on insights from:

About

Motif-preserving Graph Tokenization for Biological Structure Generation using Sequence Transformer
