Code for the manuscript "Chromatin accessibility of primary cancers informs regional mutagenesis in metastases through multi-scale deep learning".
A hierarchical, multi-scale, multi-task neural network that jointly predicts SNV and indel density at 1 Mb, 100 kb, and 10 kb resolution from chromatin accessibility (CA) and replication timing (RT) profiles, trained on metastatic whole-genome data (HMF) and externally validated on primary tumors (PCAWG).
- Mutations: HMF metastatic tumors (training) and PCAWG primary tumors (validation). Input WGS data and metadata annotations for metastatic cancer samples from the Hartwig Medical Foundation (HMF) are controlled-access datasets. Access can be requested from the HMF and is subject to scientific review and completion of the required data access or material transfer agreements. Intermediate files derived from HMF controlled-access datasets are not publicly shared because of data-use restrictions.
- Epigenomes: 796 TCGA ATAC-seq CA profiles + 96 ENCODE Repli-seq RT profiles.
- Windows: non-overlapping 10 kb / 100 kb / 1 Mb after mappability and blacklist filtering.
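As a rough illustration of the windowing scheme, the sketch below counts mutations in non-overlapping windows on one chromosome and sums 10 kb counts into coarser windows. The function names `window_counts` and `coarsen` are hypothetical, positions are assumed to be 0-based, and the mappability and blacklist filtering is deliberately left out, so this is not the repository's preprocessing code.

```python
import numpy as np

def window_counts(positions, chrom_length, window_size):
    """Count mutations per non-overlapping window on a single chromosome.

    Assumes 0-based positions that are all smaller than chrom_length.
    """
    positions = np.asarray(positions, dtype=np.int64)
    n_windows = -(-chrom_length // window_size)  # ceiling division
    window_index = positions // window_size
    return np.bincount(window_index, minlength=n_windows)

def coarsen(counts, factor):
    """Sum consecutive fine windows into coarser ones (e.g. 10 x 10 kb -> 100 kb)."""
    counts = np.asarray(counts)
    pad = (-len(counts)) % factor  # zero-pad the last partial coarse window
    padded = np.pad(counts, (0, pad))
    return padded.reshape(-1, factor).sum(axis=1)

# Toy usage: four mutations on a 300 kb chromosome.
counts_10kb = window_counts([5, 9_999, 10_000, 250_000], 300_000, 10_000)
counts_100kb = coarsen(counts_10kb, 10)
```

Because the windows do not overlap, each coarse window is an exact sum of its fine windows, which keeps the three resolutions consistent with one another.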
Code/
  step1/                       # Hyperparameter search + main hierarchical MS-MT model
  step2/                       # Cross-validation, ablations, baselines, PCAWG validation
  step3/                       # Feature importance (permutation + SHAP)
  step4/                       # Mutation-enriched windows and cancer-gene annotation
Data/
  CA_RT/                       # CA + RT feature matrices
    atac_with_repliseq_10kb/   # Chromosome-split 10 kb CA + RT matrix
  PCAWG/                       # PCAWG validation mutation-density tables
Model/                         # Trained cancer-specific model checkpoints
Figure_script/                 # Python/R scripts for Figures 1–4
CA_RT/: TCGA ATAC-seq CA profiles with ENCODE Repli-seq RT features at 1 Mb, 100 kb, and 10 kb resolution.
CA_RT/atac_with_repliseq_10kb/: the 10 kb CA / RT matrix is split by chromosome due to file size. Rebuild the combined matrix with:
    python Data/CA_RT/atac_with_repliseq_10kb/combine_chr_tsv.py Data/CA_RT/atac_with_repliseq_10kb -o Data/CA_RT/atac_with_repliseq.10kb.tsv.gz
PCAWG/: PCAWG SNV and indel validation tables at 1 Mb, 100 kb, and 10 kb resolution.
- Model/: cancer-specific trained PyTorch checkpoints for breast, colorectal, esophagus, lung, prostate, and skin cancers.
- run_model_hier_multi.py: hierarchical multi-scale (1 Mb → 100 kb → 10 kb), multi-task (SNV + indel) MLP with adaptive feature gating, coarse-to-fine context flow, and uncertainty-weighted loss.
- optuna_hier_multi_tcga_rt.py: Optuna hyperparameter search (learning rate, batch size, hidden dim, dropout).
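To make the uncertainty-weighted loss concrete, here is a minimal sketch of one common formulation, in which each task loss L_i is scaled by exp(-s_i) and the log-variance s_i is added as a penalty. Whether run_model_hier_multi.py uses exactly this form is an assumption, and the function name is hypothetical. In training, the s_i would be learnable parameters; plain floats stand in for them here.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i.

    A larger log-variance s_i down-weights task i, while the additive s_i
    term keeps the weights from collapsing to zero.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance per task is required")
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))

# Toy usage with an SNV loss and an indel loss (illustrative values only).
snv_loss, indel_loss = 2.0, 4.0
equal_weights = uncertainty_weighted_loss([snv_loss, indel_loss], [0.0, 0.0])
indel_downweighted = uncertainty_weighted_loss(
    [snv_loss, indel_loss], [0.0, math.log(2.0)]
)
```

With both log-variances at zero the result is the plain sum of the task losses; raising the indel log-variance to log 2 halves the indel contribution and adds log 2 as the penalty.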
- run_model_hier_multi_cv.py: repeated K-fold CV for the full MS-MT model.
- run_model_hier_single_task_cv.py, run_model_multitask_single_scale_cv.py: single-task and single-scale ablations.
- cv_compare_mtl_ms_vs_ms_st.py, cv_compare_mtl_ms_vs_ss.py: paired comparisons (MS-MT vs MS-ST and vs MT-SS).
- run_baseline_tree_models_randomsplit_independent.py: Random Forest / Elastic Net baselines, independent per (task, scale).
- eval_best_model_snv10_all.py, eval_best_model_indel10_all.py: genome-wide 10 kb predictions and residual outlier (z > 3 / 4) tables.
- validate_pcawg_kfold_linear_calib.py: external validation on PCAWG with linear / per-chromosome calibration.
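The K-fold linear calibration used for external validation can be sketched as follows: a slope and intercept are fitted on K-1 folds and applied to the held-out fold, so every calibrated value is out-of-fold. This is a sketch under assumptions, not the logic of validate_pcawg_kfold_linear_calib.py; the function names, the default of five folds, and the random fold assignment are all hypothetical. A per-chromosome variant would run the same fit separately within each chromosome.

```python
import numpy as np

def fit_linear_calibration(pred, observed):
    """Least-squares slope and intercept mapping predictions to observed values."""
    slope, intercept = np.polyfit(pred, observed, deg=1)
    return slope, intercept

def kfold_calibrated(pred, observed, n_splits=5, seed=0):
    """Out-of-fold calibrated predictions: fit on K-1 folds, apply to the rest."""
    pred = np.asarray(pred, dtype=float)
    observed = np.asarray(observed, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(pred))
    calibrated = np.empty_like(pred)
    for held_out in np.array_split(order, n_splits):
        train = np.setdiff1d(order, held_out)
        slope, intercept = fit_linear_calibration(pred[train], observed[train])
        calibrated[held_out] = slope * pred[held_out] + intercept
    return calibrated

# Toy usage: predictions that are off by a fixed scale and offset.
toy_pred = np.arange(20.0)
toy_observed = 3.0 * toy_pred + 2.0
toy_calibrated = kfold_calibrated(toy_pred, toy_observed)
```

Fitting only on the training folds keeps the calibration from seeing the windows it is later scored on.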
- feature_importance.py: permutation importance (1,000 permutations, empirical p-values) and SHAP attributions (Captum ShapleyValueSampling) for 10 kb SNV predictions.
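Permutation importance with an empirical p-value can be illustrated with the sketch below. It assumes importance is the drop in R² after shuffling one feature column, and that the p-value is (1 + number of permutations scoring at least as well as the unshuffled baseline) / (1 + number of permutations). Both choices, along with the function names, are assumptions and may differ from what feature_importance.py computes.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, feature, n_perm=1000, seed=0):
    """Return (mean R^2 drop, empirical p-value) for one feature column."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    permuted_scores = np.empty(n_perm)
    X_perm = X.copy()
    for i in range(n_perm):
        # Shuffle only the column under test; all other columns stay intact.
        X_perm[:, feature] = rng.permutation(X[:, feature])
        permuted_scores[i] = r2_score(y, predict(X_perm))
    importance = baseline - permuted_scores.mean()
    p_value = (1 + np.sum(permuted_scores >= baseline)) / (n_perm + 1)
    return importance, p_value

# Toy usage: the target depends on column 0 only, so column 1 is irrelevant.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = 2.0 * X_toy[:, 0]
toy_model = lambda X: 2.0 * X[:, 0]
imp_used, p_used = permutation_importance(toy_model, X_toy, y_toy, feature=0, n_perm=200)
imp_unused, p_unused = permutation_importance(toy_model, X_toy, y_toy, feature=1, n_perm=200)
```

The added 1 in the numerator and denominator keeps the empirical p-value strictly positive, so its smallest attainable value with 1,000 permutations is 1/1001.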
- underestimated_windows.py: build combined tables of underestimated 10 kb windows (z > 4 baseline and optional Tukey-thresholded inputs), add window-end coordinates, intersect with hg19_genes_gff.bed, flag OncoKB / CGC cancer genes, and emit per-prefix step1–step6 tables plus downstream-compatible aliases (*_cancer_genes_only.tsv, *_cancer_types_by_gene.tsv).
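The core of the annotation step (flag underestimated windows, add window-end coordinates, intersect with genes) can be sketched roughly as below. It assumes residuals are observed minus predicted density, z-scores use the genome-wide mean and standard deviation of those residuals, and coordinates are half-open as in BED files. The function names and the gene tuples are hypothetical placeholders, not records from hg19_genes_gff.bed.

```python
import numpy as np

WINDOW_SIZE = 10_000  # 10 kb windows

def residual_zscores(observed, predicted):
    """Z-score of (observed - predicted); positive means underestimated."""
    residuals = np.asarray(observed, float) - np.asarray(predicted, float)
    return (residuals - residuals.mean()) / residuals.std()

def underestimated_windows(chrom, starts, observed, predicted, z_cutoff=4.0):
    """Return (chrom, start, end, z) for windows whose residual z exceeds the cutoff."""
    z = residual_zscores(observed, predicted)
    return [(chrom, int(s), int(s) + WINDOW_SIZE, float(zi))
            for s, zi in zip(starts, z) if zi > z_cutoff]

def overlapping_genes(window, genes):
    """Names of genes whose half-open interval overlaps the window."""
    chrom, start, end, _ = window
    return [name for g_chrom, g_start, g_end, name in genes
            if g_chrom == chrom and g_start < end and g_end > start]

# Toy usage: 100 windows, one of which has far more mutations than predicted.
starts = np.arange(100) * WINDOW_SIZE
predicted = np.full(100, 10.0)
observed = predicted.copy()
observed[7] += 50.0
hits = underestimated_windows("chr1", starts, observed, predicted)
toy_genes = [("chr1", 75_000, 90_000, "GENE_A"),   # overlaps the flagged window
             ("chr1", 80_000, 85_000, "GENE_B"),   # starts where the window ends
             ("chr2", 75_000, 76_000, "GENE_C")]   # different chromosome
hit_genes = overlapping_genes(hits[0], toy_genes)
```

Flagging cancer genes would then be a set-membership check of the overlapping gene names against the OncoKB and CGC gene lists.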
Plotting scripts grouped by figure:
- Fig. 1 — CA / RT vs mutation density heatmaps (plot_fig1_matched_atacseq_heatmap.py, plot_fig1_matched_repliseq_heatmaps.py).
- Fig. 2 — MS-MT vs ablations / baselines and PCAWG transfer (plot_fig2_mtms_dots_bar.R, plot_fig2_legend_only.R).
- Fig. 3 — SHAP-based feature importance bars and pies (plot_fig3_bar_pie.py, plot_fig3_all_shap.R).
- Fig. 4 — Residual analysis and mutation-enriched windows (plot_fig4_residual_violin_outliers.py, plot_fig4_bar_stacked_per_cancer_z.py, plot_fig4_gene_windows.py, plot_fig4_zscores.R).
- Tune: step1/optuna_hier_multi_tcga_rt.py per cancer type.
- Train: step1/run_model_hier_multi.py with the best config.
- Evaluate: step2/ for CV, ablations, tree/linear baselines, residual extraction, and PCAWG transfer.
- Interpret: step3/feature_importance.py for permutation + SHAP.
- Annotate: step4/underestimated_windows.py to map mutation-enriched windows to genes and OncoKB / CGC cancer genes.
Python 3.9+ with torch, numpy, pandas, scikit-learn, optuna, captum, shap; R with tidyverse / ggplot2 for the R figure scripts.