This project explores a simple idea: instead of asking generative models to recover motif-level structure implicitly, we encode motifs directly into the representation used for graph generation, independent of the sequence transformer's processing. We are interested in constructing this representation as a flat token sequence.
To generate graphs from tokens that carry hierarchical structure, we need three things:

1. Create the input H-graph: Build a hierarchical representation of the graph using coarsening strategies (HAC, Spectral Clustering, Motif Community).
2. Tokenize the input H-graph: Convert the hierarchy to a token sequence using H-SENT (Vanilla HiGen) or HDT (DFS-based). Note that we must preserve enough information (leaf edge connections) for the inverse problem of flattening the H-graph.
3. Flatten the generated H-graph: Reconstruct the flat graph from tokens via bipartite edge union for H-SENT, or union of back edges for HDT.
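As a toy illustration of step 3 (and of why step 2 must keep leaf-level edge connections), here is a minimal sketch in plain Python. The H-graph layout and the names `hgraph` and `flatten` are invented for illustration; they are not the repo's actual data structures or API:

```python
# Toy 2-level H-graph: leaf nodes are grouped into clusters ("motifs"),
# each cluster stores its internal leaf-level edges, and cross-cluster
# connectivity is kept as bipartite leaf-edge lists so the inverse
# problem -- recovering the flat graph -- stays solvable.
hgraph = {
    "clusters": {
        "c0": {"nodes": [0, 1, 2], "edges": [(0, 1), (1, 2), (0, 2)]},  # triangle motif
        "c1": {"nodes": [3, 4], "edges": [(3, 4)]},
    },
    # bipartite edges between clusters, stored at leaf granularity
    "cross": {("c0", "c1"): [(2, 3)]},
}

def flatten(hg):
    """Reconstruct the flat edge set as the union of intra-cluster
    leaf edges and inter-cluster bipartite leaf edges."""
    edges = set()
    for cluster in hg["clusters"].values():
        edges.update(tuple(sorted(e)) for e in cluster["edges"])
    for pair_edges in hg["cross"].values():
        edges.update(tuple(sorted(e)) for e in pair_edges)
    return edges

print(sorted(flatten(hgraph)))  # [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
```

If the tokenizer dropped the leaf endpoints of cross-cluster edges (keeping only "c0 connects to c1"), the union above would be ambiguous, which is why step 2 preserves them.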
```bash
# Create training environment
conda env create -f environment.yaml
conda activate mosaic

# Create evaluation/test environment (required for full metrics)
conda env create -f environment_eval.yaml
conda activate mosaic-eval
```

The configs in configs/ are the default hyperparameters used for our experiments. Training uses Hydra for configuration — experiment-specific overrides (dataset size, LR, steps) are in configs/experiment/, and tokenizer defaults in configs/tokenizer/. View sample training runs on WandB.
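The `tokenizer=...` and `experiment=...` arguments in the training commands are Hydra config-group overrides. As a rough stdlib sketch of the override semantics (the config values and the `apply_overrides` helper are invented for illustration; Hydra itself does considerably more, including config-group file loading):

```python
import copy

# Invented stand-in for the composed defaults (not the repo's real values)
defaults = {
    "tokenizer": "hdtc",    # would come from configs/tokenizer/
    "experiment": "moses",  # would come from configs/experiment/
    "trainer": {"lr": 1e-3, "steps": 100_000},
}

def apply_overrides(cfg, overrides):
    """Merge Hydra-style 'key=value' CLI overrides into a config dict,
    supporting dotted paths such as 'trainer.lr=3e-4'."""
    cfg = copy.deepcopy(cfg)
    for ov in overrides:
        key, raw = ov.split("=", 1)
        *path, leaf = key.split(".")
        node = cfg
        for part in path:
            node = node[part]
        try:
            node[leaf] = int(raw)
        except ValueError:
            try:
                node[leaf] = float(raw)
            except ValueError:
                node[leaf] = raw  # keep strings like 'hdt' as-is
    return cfg

cfg = apply_overrides(defaults, ["tokenizer=hdt", "trainer.lr=3e-4"])
print(cfg["tokenizer"], cfg["trainer"]["lr"])  # hdt 0.0003
```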
```bash
# Train HDTC on MOSES (default)
python scripts/train.py

# Train with different tokenizers
python scripts/train.py tokenizer=sent   # Flat SENT (baseline)
python scripts/train.py tokenizer=hsent  # Hierarchical SENT
python scripts/train.py tokenizer=hdt    # Hierarchical DFS
python scripts/train.py tokenizer=hdtc   # Compositional (default)

# Train on COCONUT dataset
python scripts/train.py experiment=coconut

# Evaluate trained model
# (Use the eval environment: conda activate mosaic-eval)
python scripts/test.py model.checkpoint_path=outputs/train/.../best.ckpt
python scripts/realistic_gen.py model.checkpoint_path=outputs/train/.../best.ckpt

# Create comparison table
python scripts/comparison/compare_results.py
```

Download our trained checkpoints from Google Drive.
The bash_scripts/ directory automates the full benchmark pipeline. See bash_scripts/README.md for details.
```bash
# Train all tokenizer variants
./bash_scripts/train/train_benchmarks.sh
./bash_scripts/train/train_benchmarks.sh --coconut

# Evaluate all trained models
./bash_scripts/eval/eval_benchmarks.sh
./bash_scripts/eval/eval_benchmarks.sh --coconut

# Faster eval flow: GPU sequential + CPU-parallel motif phase
./bash_scripts/eval/eval_benchmarks_2phase.sh
```

```
MOSAIC/
├── src/
│   ├── data/            # Data loading, generation, and motif detection
│   ├── tokenizers/      # Graph tokenization (SENT, H-SENT, HDT, HDTC)
│   │   ├── coarsening/  # Coarsening strategies (spectral, motif-aware)
│   │   └── motif/       # Motif detection and patterns
│   ├── models/          # Transformer models
│   ├── evaluation/      # Standard and motif metrics
│   └── realistic_gen/   # Generation quality analysis
├── configs/             # Hydra configuration
├── scripts/             # Training, evaluation, and visualization scripts
│   ├── preprocess/      # Data preprocessing and caching
│   ├── comparison/      # Result comparison and benchmarking
│   └── visualization/   # Visualization and demo scripts
├── bash_scripts/        # Batch benchmark automation scripts
├── tests/               # Test suite
└── docs/                # Documentation
```
See the docs/ directory for:

- Setup guides (docs/setups/)
- Design docs (docs/designs/)
This codebase was developed based on insights from:


