reimandlab/CAMM

CAMM: Chromatin Accessibility to Metastatic Mutagenesis

Code for the manuscript "Chromatin accessibility of primary cancers informs regional mutagenesis in metastases through multi-scale deep learning".

CAMM is a hierarchical, multi-scale, multi-task neural network that jointly predicts SNV and indel density at 1 Mb, 100 kb, and 10 kb resolution from chromatin accessibility (CA) and replication timing (RT) profiles. It is trained on metastatic whole-genome data (HMF) and externally validated on primary tumors (PCAWG).

Data

  • Mutations: PCAWG primary tumors are used for validation. Input WGS data and metadata annotations for metastatic cancer samples come from the Hartwig Medical Foundation (HMF) and are controlled-access datasets. Access can be requested from the HMF and is subject to scientific review and completion of the required data access or material transfer agreements. Intermediate files derived from HMF controlled-access datasets are not publicly shared because of data-use restrictions.
  • Epigenomes: 796 TCGA ATAC-seq CA profiles + 96 ENCODE Repli-seq RT profiles.
  • Windows: non-overlapping 10 kb / 100 kb / 1 Mb after mappability and blacklist filtering.
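
As an illustration of the windowing described above, the sketch below bins mutation positions into non-overlapping windows and looks up the nested 10 kb, 100 kb, and 1 Mb windows that contain a position. It is not taken from the repository: the function names, the 0-based coordinate convention, and the example positions are assumptions, and the mappability and blacklist filtering is not shown.

```python
# Window sizes named in this README; everything else here is illustrative.
WINDOW_SIZES = {"1Mb": 1_000_000, "100kb": 100_000, "10kb": 10_000}


def window_start(position, size):
    """Start (0-based, assumed) of the non-overlapping window containing position."""
    return (position // size) * size


def count_per_window(positions, size):
    """Count mutations per window start for a single chromosome."""
    counts = {}
    for pos in positions:
        start = window_start(pos, size)
        counts[start] = counts.get(start, 0) + 1
    return counts


def parent_windows(position):
    """Starts of the nested 1 Mb, 100 kb, and 10 kb windows for one position."""
    return {name: window_start(position, size) for name, size in WINDOW_SIZES.items()}


# Hypothetical mutation positions on one chromosome.
snv_positions = [5_000, 9_999, 10_000, 123_456, 1_050_000]
counts_10kb = count_per_window(snv_positions, WINDOW_SIZES["10kb"])
```

Because the window sizes divide each other evenly, every 10 kb window falls inside exactly one 100 kb window and one 1 Mb window, which is what makes a coarse-to-fine hierarchy straightforward to index.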

Repository layout

Code/
  step1/   # Hyperparameter search + main hierarchical MS-MT model
  step2/   # Cross-validation, ablations, baselines, PCAWG validation
  step3/   # Feature importance (permutation + SHAP)
  step4/   # Mutation-enriched windows and cancer-gene annotation
Data/
  CA_RT/   # CA + RT feature matrices
    atac_with_repliseq_10kb/   # Chromosome-split 10 kb CA + RT matrix
  PCAWG/   # PCAWG validation mutation-density tables
Model/     # Trained cancer-specific model checkpoints
Figure_script/   # Python/R scripts for Figures 1–4

Data/ — bundled data

  • CA_RT/: TCGA ATAC-seq CA profiles with ENCODE Repli-seq RT features at 1 Mb, 100 kb, and 10 kb resolution.
  • CA_RT/atac_with_repliseq_10kb/: the 10 kb CA / RT matrix is split by chromosome due to file size. Rebuild the combined matrix with:
    python Data/CA_RT/atac_with_repliseq_10kb/combine_chr_tsv.py Data/CA_RT/atac_with_repliseq_10kb -o Data/CA_RT/atac_with_repliseq.10kb.tsv.gz
  • PCAWG/: PCAWG SNV and indel validation tables at 1 Mb, 100 kb, and 10 kb resolution.
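
For readers who want to see what recombining chromosome-split tables involves, here is a minimal sketch. It is not the repository's combine_chr_tsv.py, whose behavior and options are not documented here. It assumes each part is a gzipped TSV with an identical header line, and the function name is hypothetical.

```python
import gzip


def combine_tsv_parts(part_paths, out_path):
    """Concatenate gzipped TSV parts, writing the shared header only once.

    Returns the number of data rows written. Assumes every part starts
    with the same header line (an assumption, not a documented format).
    """
    header = None
    rows = 0
    with gzip.open(out_path, "wt") as out:
        for path in part_paths:
            with gzip.open(path, "rt") as part:
                first = part.readline()
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                for line in part:
                    out.write(line)
                    rows += 1
    return rows
```

Streaming line by line keeps memory use flat, which matters for a genome-wide 10 kb matrix with hundreds of feature columns.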

Model/ — trained models

  • Cancer-specific trained PyTorch checkpoints for breast, colorectal, esophagus, lung, prostate, and skin cancers.

Code/step1 — model training

  • run_model_hier_multi.py: hierarchical multi-scale (1 Mb → 100 kb → 10 kb), multi-task (SNV + indel) MLP with adaptive feature gating, coarse-to-fine context flow, and uncertainty-weighted loss.
  • optuna_hier_multi_tcga_rt.py: Optuna hyperparameter search (learning rate, batch size, hidden dim, dropout).
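
The three ingredients named for run_model_hier_multi.py can be illustrated in a few lines of plain Python. This is a simplified sketch, not the repository's PyTorch code. It assumes a per-feature sigmoid gate with fixed logits (an adaptive gate would typically compute them from the input), context passing by concatenation, and one common form of uncertainty weighting, exp(-s_i) * L_i + s_i with a learnable log-variance s_i per task. Whether the model uses exactly these forms is an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_features(features, gate_logits):
    """Scale each input feature by a sigmoid gate in (0, 1)."""
    return [x * sigmoid(g) for x, g in zip(features, gate_logits)]


def with_coarse_context(fine_features, coarse_hidden):
    """Append the parent (coarser) window's hidden vector to the finer window's inputs."""
    return list(fine_features) + list(coarse_hidden)


def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks (for example SNV and indel)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# Hypothetical values: two task losses with both log-variances at zero,
# in which case the weighted loss reduces to the plain sum.
total = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

Raising s_i down-weights a noisy task's loss, while the additive s_i term penalizes the trivial solution of inflating every task's uncertainty.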

Code/step2 — evaluation and baselines

Code/step3 — interpretation

  • feature_importance.py: permutation importance (1,000 permutations, empirical p-values) and SHAP attributions (Captum ShapleyValueSampling) for 10 kb SNV predictions.
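
A minimal sketch of permutation importance with an empirical p-value follows. It is not the code in feature_importance.py. It assumes mean squared error as the loss and defines the p-value as (1 + number of permutations with loss no worse than baseline) / (1 + number of permutations). The function names and the toy model are hypothetical.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, feature, n_perm=1000, seed=0):
    """Return (mean loss increase, empirical p-value) for one feature column."""
    rng = random.Random(seed)
    baseline = mse(y_true, [predict(r) for r in rows])
    column = [r[feature] for r in rows]
    total_increase = 0.0
    no_worse = 0
    for _ in range(n_perm):
        shuffled = column[:]
        rng.shuffle(shuffled)
        # Replace only the chosen column; all other features stay in place.
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, shuffled)]
        loss = mse(y_true, [predict(r) for r in permuted])
        total_increase += loss - baseline
        if loss <= baseline:
            no_worse += 1
    return total_increase / n_perm, (1 + no_worse) / (1 + n_perm)


# Toy model that uses feature 0 and ignores feature 1.
rows = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * r[0] for r in rows]
model = lambda r: 2.0 * r[0]
imp_used, p_used = permutation_importance(model, rows, y, feature=0, n_perm=200)
imp_unused, p_unused = permutation_importance(model, rows, y, feature=1, n_perm=200)
```

With 1,000 permutations, the smallest p-value this definition can produce is 1/1001, so the permutation count sets the resolution of the test.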

Code/step4 — mutation-enriched windows

  • underestimated_windows.py: build combined tables of underestimated 10 kb windows (z > 4 baseline and optional Tukey-thresholded inputs), add window-end coordinates, intersect with hg19_genes_gff.bed, flag OncoKB / CGC cancer genes, and emit per-prefix step1–step6 tables plus downstream-compatible aliases (*_cancer_genes_only.tsv, *_cancer_types_by_gene.tsv).
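
The thresholding and annotation steps can be sketched as below. This is an illustration, not underestimated_windows.py. It assumes that "underestimated" means the observed minus predicted residual is unusually large, uses the population standard deviation for z-scores and a 1.5 × IQR upper Tukey fence, and treats intervals as half-open. The gene names and coordinates are made up.

```python
import statistics


def zscores(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def tukey_upper_fence(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a


WINDOW_SIZE = 10_000
window_starts = [i * WINDOW_SIZE for i in range(31)]
residuals = [0.0] * 15 + [1.0] * 15 + [50.0]  # hypothetical observed - predicted
genes = [("GENE_A", 295_000, 305_000), ("GENE_B", 10_000, 20_000)]  # hypothetical
cancer_genes = {"GENE_A"}  # stand-in for an OncoKB / CGC gene list

hits = []
for start, score in zip(window_starts, zscores(residuals)):
    if score > 4:
        end = start + WINDOW_SIZE  # add the window-end coordinate
        names = [g for g, gs, ge in genes if overlaps(start, end, gs, ge)]
        hits.append((start, end, names, [g in cancer_genes for g in names]))

fence = tukey_upper_fence(residuals)
tukey_hits = [s for s, r in zip(window_starts, residuals) if r > fence]
```

The z-score rule depends on the mean and standard deviation, which a few extreme windows can inflate, whereas the Tukey fence is based on quartiles and is less sensitive to those outliers.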

Figure_script/

Plotting scripts grouped by figure:

Pipeline

  1. Tune: step1/optuna_hier_multi_tcga_rt.py per cancer type.
  2. Train: step1/run_model_hier_multi.py with the best config.
  3. Evaluate: step2/ for CV, ablations, tree/linear baselines, residual extraction, and PCAWG transfer.
  4. Interpret: step3/feature_importance.py for permutation + SHAP.
  5. Annotate: step4/underestimated_windows.py to map mutation-enriched windows to genes and OncoKB / CGC cancer genes.

Requirements

Python 3.9+ with torch, numpy, pandas, scikit-learn, optuna, captum, shap; R with tidyverse / ggplot2 for the R figure scripts.
