This project explores a simple idea: instead of asking generative models to recover motif-level structure implicitly, we encode motifs directly into the representation used for graph generation, independent of the sequence transformer's processing. We are interested in constructing this representation as a flat token sequence.
To generate graphs from tokens that carry hierarchical structure, we need three things:

1. Create the input H-graph: Build a hierarchical representation of the graph using coarsening strategies (HAC, Spectral Clustering, Motif Community).
2. Tokenize the input H-graph: Convert the hierarchy to a token sequence using H-SENT (Vanilla HiGen) or HDT (DFS-based). Note that we must preserve enough information (leaf edge connections) for the inverse problem of flattening the H-graph.
3. Flatten the generated H-graph: Reconstruct the flat graph from tokens via bipartite edge union for H-SENT, or union of back edges for HDT.
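As a toy illustration of step 3 (and of why step 2 must keep leaf-level edge connections), here is a minimal sketch in plain Python. The H-graph layout and the names `hgraph` and `flatten` are invented for illustration; they are not the repo's actual data structures or API:

```python
# Toy 2-level H-graph: leaf nodes are grouped into clusters ("motifs"),
# each cluster stores its internal leaf-level edges, and cross-cluster
# connectivity is kept as bipartite leaf-edge lists so the inverse
# problem -- recovering the flat graph -- stays solvable.
hgraph = {
    "clusters": {
        "c0": {"nodes": [0, 1, 2], "edges": [(0, 1), (1, 2), (0, 2)]},  # triangle motif
        "c1": {"nodes": [3, 4], "edges": [(3, 4)]},
    },
    # bipartite edges between clusters, stored at leaf granularity
    "cross": {("c0", "c1"): [(2, 3)]},
}

def flatten(hg):
    """Reconstruct the flat edge set as the union of intra-cluster
    leaf edges and inter-cluster bipartite leaf edges."""
    edges = set()
    for cluster in hg["clusters"].values():
        edges.update(tuple(sorted(e)) for e in cluster["edges"])
    for pair_edges in hg["cross"].values():
        edges.update(tuple(sorted(e)) for e in pair_edges)
    return edges

print(sorted(flatten(hgraph)))  # [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
```

If the tokenizer dropped the leaf endpoints of cross-cluster edges (keeping only "c0 connects to c1"), the union above would be ambiguous, which is why step 2 preserves them.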
```bash
# Create training environment
conda env create -f environment.yaml
conda activate mosaic

# Create evaluation/test environment (required for full metrics)
conda env create -f environment_eval.yaml
conda activate mosaic-eval
```

The configs in configs/ are the default hyperparameters used for our experiments. Training uses Hydra for configuration — experiment-specific overrides (dataset size, LR, steps) are in configs/experiment/, and tokenizer defaults in configs/tokenizer/. View sample training runs on WandB.
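The `tokenizer=...` and `experiment=...` arguments in the training commands are Hydra config-group overrides. As a rough stdlib sketch of the override semantics (the config values and the `apply_overrides` helper are invented for illustration; Hydra itself does considerably more, including config-group file loading):

```python
import copy

# Invented stand-in for the composed defaults (not the repo's real values)
defaults = {
    "tokenizer": "hdtc",    # would come from configs/tokenizer/
    "experiment": "moses",  # would come from configs/experiment/
    "trainer": {"lr": 1e-3, "steps": 100_000},
}

def apply_overrides(cfg, overrides):
    """Merge Hydra-style 'key=value' CLI overrides into a config dict,
    supporting dotted paths such as 'trainer.lr=3e-4'."""
    cfg = copy.deepcopy(cfg)
    for ov in overrides:
        key, raw = ov.split("=", 1)
        *path, leaf = key.split(".")
        node = cfg
        for part in path:
            node = node[part]
        try:
            node[leaf] = int(raw)
        except ValueError:
            try:
                node[leaf] = float(raw)
            except ValueError:
                node[leaf] = raw  # keep strings like 'hdt' as-is
    return cfg

cfg = apply_overrides(defaults, ["tokenizer=hdt", "trainer.lr=3e-4"])
print(cfg["tokenizer"], cfg["trainer"]["lr"])  # hdt 0.0003
```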
```bash
# Train HDTC on MOSES (default)
python scripts/train.py

# Train with different tokenizers
python scripts/train.py tokenizer=sent   # Flat SENT (baseline)
python scripts/train.py tokenizer=hsent  # Hierarchical SENT
python scripts/train.py tokenizer=hdt    # Hierarchical DFS
python scripts/train.py tokenizer=hdtc   # Compositional (default)

# Train on COCONUT dataset
python scripts/train.py experiment=coconut

# Evaluate trained model
# (Use the eval environment: conda activate mosaic-eval)
python scripts/test.py model.checkpoint_path=outputs/train/.../best.ckpt
python scripts/realistic_gen.py model.checkpoint_path=outputs/train/.../best.ckpt

# Create comparison table
python scripts/comparison/compare_results.py
```

Download our trained checkpoints from Google Drive.
The bash_scripts/ directory automates the full benchmark pipeline. See bash_scripts/README.md for details.
```bash
# Train all tokenizer variants
./bash_scripts/train/train_benchmarks.sh
./bash_scripts/train/train_benchmarks.sh --coconut

# Evaluate all trained models
./bash_scripts/eval/eval_benchmarks.sh
./bash_scripts/eval/eval_benchmarks.sh --coconut

# Faster eval flow: GPU sequential + CPU-parallel motif phase
./bash_scripts/eval/eval_benchmarks_2phase.sh
```

```
MOSAIC/
├── src/
│   ├── data/            # Data loading, generation, and motif detection
│   ├── tokenizers/      # Graph tokenization (SENT, H-SENT, HDT, HDTC)
│   │   ├── coarsening/  # Coarsening strategies (spectral, motif-aware)
│   │   └── motif/       # Motif detection and patterns
│   ├── models/          # Transformer models
│   ├── evaluation/      # Standard and motif metrics
│   └── realistic_gen/   # Generation quality analysis
├── configs/             # Hydra configuration
├── scripts/             # Training, evaluation, and visualization scripts
│   ├── preprocess/      # Data preprocessing and caching
│   ├── comparison/      # Result comparison and benchmarking
│   └── visualization/   # Visualization and demo scripts
├── bash_scripts/        # Batch benchmark automation scripts
├── tests/               # Test suite
└── docs/                # Documentation
```
See the docs/ directory for:

- Setup guides (docs/setups/)
- Design docs (docs/designs/)
This codebase was developed based on insights from:


