From 217a248ed34ae4ff0c61af7d278878b2732703fd Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 1 May 2026 14:11:11 +0000
Subject: [PATCH 1/2] Initial plan


From 364276bf78e82395e7de85baa9a3e8e60c0de823 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 1 May 2026 14:21:17 +0000
Subject: [PATCH 2/2] Expand pipeline documentation details

Agent-Logs-Url: https://github.com/bbglab/deepCSA/sessions/c4fac4e8-2063-4a96-9fcb-5ce5b5f93ea3

Co-authored-by: FerriolCalvet <38539786+FerriolCalvet@users.noreply.github.com>
---
 CITATIONS.md            | 50 +++++++++++++++++++---
 docs/file_formatting.md | 68 ++++++++++++++++++++++++++++++
 docs/output.md          | 29 +++++++++++++
 docs/tools.md           | 69 +++++++++++++++++++++++++++++++
 docs/usage.md           | 92 +++++++++++++++++++++++++++++++++++++----
 5 files changed, 294 insertions(+), 14 deletions(-)

diff --git a/CITATIONS.md b/CITATIONS.md
index 4aaab46f..f62f13cc 100644
--- a/CITATIONS.md
+++ b/CITATIONS.md
@@ -10,6 +10,10 @@
 
 ## Sources of data and tools
 
+- **Ensembl VEP**
+
+  > McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
+
 - Nanoseq masks
 
   > Abascal, F., Harvey, L.M.R., Mitchell, E. et al. Somatic mutation landscapes at single-molecule resolution. Nature 593, 405–410 (2021). https://doi.org/10.1038/s41586-021-03477-4
@@ -34,15 +38,51 @@
 
   > Stefano Pellegrini, Olivia Dove-Estrella, Ferran Muiños, Nuria Lopez-Bigas, Abel Gonzalez-Perez, Oncodrive3D: fast and accurate detection of structural clusters of somatic mutations under positive selection, Nucleic Acids Research, Volume 53, Issue 15, 28 August 2025, gkaf776, https://doi.org/10.1093/nar/gkaf776
 
+- **dNdScv (tool)**
+
+  > Martincorena I, Raine KM, Gerstung M, et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 2017;171(5):1029-1041.e21. doi: 10.1016/j.cell.2017.09.042.
+
+- **Omega (dN/dS)**
+
+  > Repository: https://github.com/bbglab/omega (see repository for citation details)
+
+- **OncodriveFML**
+
+  > Repository: https://github.com/bbglab/oncodrivefml (see repository for citation details)
+
+- **OncodriveCLUSTL**
+
+  > Repository: https://github.com/bbglab/oncodriveclustl (see repository for citation details)
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
   > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
-- Python
-- SigProfilerAssignment, MatrixGenerator
-- HDP
-- OncodriveFML
-- OncodriveCLUSTL
+- **SigProfilerAssignment / SigProfilerMatrixGenerator**
+
+  > Repository: https://github.com/AlexandrovLab/SigProfilerAssignment  
+  > Repository: https://github.com/AlexandrovLab/SigProfilerMatrixGenerator
+
+- **HDP / mSigHdp**
+
+  > Repository: https://github.com/Nik-Zainal-Group/msigHdp (see repository for citation details)
+
+- **bgreference / bgdata**
+
+  > Repository: https://github.com/bbglab/bgreference  
+  > Repository: https://github.com/bbglab/bgdata
+
+- **SAMtools**
+
+  > Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-9. doi: 10.1093/bioinformatics/btp352.
+
+- **BEDTools**
+
+  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-842. doi: 10.1093/bioinformatics/btq033.
+
+- **HTSlib / Tabix**
+
+  > Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718-719. doi: 10.1093/bioinformatics/btq671.
 
 ## Software packaging/containerisation tools
 
diff --git a/docs/file_formatting.md b/docs/file_formatting.md
index 98458877..e730ae09 100644
--- a/docs/file_formatting.md
+++ b/docs/file_formatting.md
@@ -216,16 +216,84 @@ params {
 
 ### cosmic_ref_signatures
 
+Path to the COSMIC SBS signature reference file used by SigProfilerAssignment. Use the SBS 96 context file for your genome build (e.g., `COSMIC_v3.4_SBS_GRCh38.txt`). The file is a tab-delimited matrix where the first column encodes mutation context and each additional column corresponds to a signature.
+
 ### wgs_trinuc_counts
 
+Tab-delimited file with two columns:
+
+```text
+CONTEXT	COUNT
+ACA	118979126
+ACC	67570313
+...
+```
+
+The file represents the **total number of occurrences of each trinucleotide** in the reference genome. The pipeline provides a default example in `assets/trinucleotide_counts/`.
+
 ### cadd_scores
 
+Path to the CADD "All possible SNVs" file (BGZIP-compressed TSV). This file is used for OncodriveFML scoring.
+
+Recommended download: [CADD downloads](https://cadd.gs.washington.edu/download) → "All possible SNVs of GRCh38/hg38".
+
 ### cadd_scores_ind
 
+Tabix index (`.tbi`) for the `cadd_scores` file. If you need to generate it:
+
+```bash
+bgzip -c whole_genome_SNVs.tsv > whole_genome_SNVs.tsv.gz
+tabix -s 1 -b 2 -e 2 whole_genome_SNVs.tsv.gz
+```
+
 ### dnds_ref_transcripts
 
+Reference transcript annotation for dNdScv. For human, this is typically `RefCDS_human_latest_intogen.rda` from the dNdScv reference bundle (IntOGen mirror).
+
 ### dnds_covariates
 
+dNdScv covariates file, usually `covariates_hg19_hg38_epigenome_pcawg.rda`. It provides the per-gene covariates that dNdScv uses in its regression model of the background mutation rate.
+
 ### datasets3d
 
+Directory containing precomputed Oncodrive3D datasets (structure and mutation mapping information). Build using the [Oncodrive3D dataset builder](https://github.com/bbglab/oncodrive3d?tab=readme-ov-file#building-datasets).
+
 ### annotations3d
+
+Directory containing Oncodrive3D annotation datasets (protein annotations, stability data, etc.). Use the same build process as `datasets3d` to ensure compatibility.
+
+### gff3_file
+
+Optional local GFF3 file used by the DNA2PROTEINMAPPING step. If not provided, the pipeline downloads the GFF3 from Ensembl. If provided, it must match the Ensembl release, species, and genome build you are using (compressed `.gff3.gz` files are supported).
+
+## Examples
+
+### Blacklist mutations
+
+```text
+chr1:11107296_C>CA
+chr1:11107450_C>A
+chr1:11108379_T>A
+```
+
+### Gene grouping
+
+```text
+chr15q  chr15q  IDH2    SIN3A
+chr17p  chr17p  MAP2K4  NCOR1   TP53    USP6
+```
+
+### Custom annotation
+
+See `assets/example_inputs/custom_regions.example.tsv` for a full example of a custom region annotation file.
+
+### Omega hotspots / subgenic regions
+
+Provide a BED file with 3 or 4 columns (`CHROM`, `START`, `END`, optional `NAME`):
+
+```text
+chr7    55191765    55191840    EGFR_L858R_region
+chr12   25245300    25245380    KRAS_G12_region
+```
+
+You can expand these regions with `hotspot_expansion` and optionally generate complements with `subgenic_regions_complement`.
diff --git a/docs/output.md b/docs/output.md
index 7847ef71..0e8cb239 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -108,6 +108,29 @@ work/
 .nextflow.log
 ```
 
+### Output directory cheat sheet
+
+| Output directory/file | Description |
+| --- | --- |
+| `sumannotation/` | Aggregated mutation annotations (one row per mutation) after VEP annotation and preprocessing. |
+| `germline_somatic/` | Mutations labeled as germline vs somatic before strict cohort filtering. |
+| `clean_somatic/` | Filtered somatic mutations used in downstream analyses. |
+| `clean_germline_somatic/` | Filtered mutations retaining germline/somatic labels. |
+| `annotatedepths/` | Depth tables per genomic position (used for depth-aware metrics). |
+| `depthssummary/` | Cohort depth summaries (TSV + PDF plots). |
+| `computeprofile/` | Mutational profiles, proportions, and `*.profile_stability.tsv` metrics. |
+| `mutrate/` | Mutation density tables (per sample/group, depth-normalized). |
+| `omega/` | dN/dS selection results using per-sample profiles. |
+| `omegagloballoc/` | dN/dS selection results using global cohort profiles. |
+| `absolutemutabilities/` | Expected mutability per site for selection analyses. |
+| `sitecomparison/` | Observed vs expected mutability comparisons per site/residue. |
+| `oncodrivefmlsnvs/` | OncodriveFML results. |
+| `oncodrive3d/` | Oncodrive3D clustering results and plots. |
+| `signatures_hdp/`, `sigprofilerassignment/`, `sigprobs/` | Mutational signature extraction/assignment outputs. |
+| `plotmaf/`, `plotneedles/`, `plotselection/`, `plotsomaticmaf/` | Standard plotting outputs for mutation and selection summaries. |
+| `qc/metrics_vs_depth/` | QC plots/tables comparing depth vs mutation density and omega. |
+| `pipeline_info/` | Pipeline metadata and software versions. |
+
 ## Input and configuration
 
 See Usage docs for extensive explanation on required inputs and format. Including documentation on parameters to run on for 4 different suggested running modes.
@@ -177,6 +200,8 @@ Optional:
 - clean_somatic
 - clean_germline_somatic
 
+**PMEAN/PSTD fields:** if the input VCF contains read-position statistics (PMN/PST from deepUMIcaller), deepCSA stores them as `PMEAN` and `PSTD` in the mutation tables. When not available, these columns are set to `-1`.
+
 ## Basic analysis
 
 ### Key role
@@ -193,6 +218,8 @@ Optional:
 - computeprofile
 - mutrate
 
+`computeprofile` also emits `*.profile_stability.tsv` files, which quantify how sensitive each mutational profile is to the addition of a single mutation per channel (see [Tools](tools.md#mutational-profile-stability)).
+
 ## Intermediate outputs
 
 ### Key role
@@ -287,6 +314,8 @@ Optional:
 
 - Optionally think on adding more plots.
 
+Plotting scope can be controlled with `plot_only_allsamples`: when `true`, only cohort-level plots are generated; when `false`, plots are also produced for each defined subgroup.
+
 ### Outputs
 
 - plotmaf
diff --git a/docs/tools.md b/docs/tools.md
index b74a4805..bbed3d6b 100644
--- a/docs/tools.md
+++ b/docs/tools.md
@@ -8,6 +8,51 @@
 
 Here, you can find an explanation of the different computations, tools or metrics implemented in deepCSA.
 
+## Interpreting outputs (sanity checks and key metrics)
+
+### Sanity checks / QC
+
+Use these outputs to assess overall data quality before interpreting biological signals:
+
+- **Depth summaries** (`depthssummary/`): verify consistent coverage across samples and genes.
+- **Mutation density vs depth** (`qc/metrics_vs_depth/`): check that mutation density does not collapse in low-depth samples.
+- **Omega QC** (`qc/metrics_vs_depth/` + `qc/annotated_omegas`): look for genes/samples with unstable omega estimates.
+- **Mutational profile stability** (`computeprofile/*.profile_stability.tsv`): higher deviations indicate unstable mutational profiles (see below).
+
+### Omega vs omegagloballoc
+
+- **`omega/`** uses **per-sample mutational profiles** and per-sample synonymous rates to estimate selection.
+- **`omegagloballoc/`** uses a **global cohort mutational profile** and global synonymous rates (shared across samples), which stabilizes estimates in low-burden samples and facilitates cohort-level comparisons.
+
+Use `omega` for sample-specific selection signals and `omegagloballoc` for conservative cohort-level estimates.
+
+### Site selection values
+
+Outputs in `sitecomparison/` and `sitecomparisongloballoc/` compare observed vs expected mutations per site or residue:
+
+- `OBSERVED_MUTS`: number of observed mutations.
+- `EXPECTED_MUTS`: expected mutations from mutability models.
+- `OBS/EXP`: ratio of observed to expected mutations (values above 1 indicate more mutations than expected).
+- `p_value`: Poisson p-value for observing at least `OBSERVED_MUTS` given `EXPECTED_MUTS`.
+
+The resolution is controlled by `site_comparison_grouping` (`site`, `aminoacid`, or `aminoacid_change`).
+
+### Mutational signatures
+
+- **`sigprofilerassignment/`**: assignments of known COSMIC signatures; includes activity tables and plots.
+- **`signatures_hdp/`**: extracted signatures using a hierarchical Dirichlet process.
+- **`sigprobs/` / `muts2sigs/`**: per-mutation signature probabilities (useful for downstream stratification).
+
+Interpret signature results alongside mutation counts and profile stability to avoid over-interpreting low-burden samples.
+
+### Mutational profile stability
+
+The file `*.profile_stability.tsv` is generated by adding a single mutation to each of the 96 SBS channels and measuring the L1 deviation from the original profile. Reported statistics include:
+
+- `mean_deviation`, `min_deviation`, `max_deviation`, `std_deviation`
+
+Lower deviations indicate a more stable (less noisy) profile.
+
 ## Publications with detailed explanation
 
 We are in the process of completing the documentation, but in the meantime you can check the recently published [paper and its supplementary material for more details](https://www.nature.com/articles/s41586-025-09521-x).
@@ -123,3 +168,27 @@ We provide two different strategies for signature analysis.
 - Using a Hierarchical Dirichlet Process algorithm developed by Nicola Robets and compacted by the McGranahan lab into a wrapped version.
 
 Additionally one could run SigProfilerExtractor on the data but this needs to be done externally.
+
+## Containers and reproducibility
+
+deepCSA defines container images directly in module files and `conf/modules.config`. For bbglab-maintained images (`bbglab/*`), Dockerfile recipes are tracked in the lab's [containers-recipes repository](https://github.com/bbglab/containers-recipes). External images (e.g., `ferriolcalvet/*`, `rblancomi/*`, `biocontainers/*`) should be mirrored locally if strict reproducibility is required.
+
+Key images used by the pipeline:
+
+| Component | Image |
+| --- | --- |
+| Core utilities | `docker.io/bbglab/deepcsa-core:0.1.0` |
+| Panel BED tools | `docker.io/bbglab/deepcsa_bed:latest` |
+| Omega | `docker.io/bbglab/omega:0.2.1` |
+| Oncodrive3D | `docker.io/bbglab/oncodrive3d:1.0.5` |
+| Oncodrive3D (ChimeraX plots) | `docker.io/spellegrini87/oncodrive3d_chimerax:latest` |
+| OncodriveFML | `docker.io/ferriolcalvet/oncodrivefml:latest` |
+| OncodriveCLUSTL | `docker.io/ferriolcalvet/oncodriveclustl:latest` |
+| SigProfilerAssignment | `docker.io/ferriolcalvet/sigprofiler_assignment:1.1.3` |
+| SigProfilerMatrixGenerator | `docker.io/ferriolcalvet/sigprofilermatrixgenerator:1.3.5` |
+| mSigHdp (HDP) | `docker.io/ferriolcalvet/msighdp:latest` |
+| bbgregressions | `docker.io/rblancomi/bbgregressions:dev` |
+| Ensembl VEP | `biocontainers/ensembl-vep:111.0--pl5321h2a3209d_0` (version depends on `vep_cache_version`) |
+| SAMtools | `biocontainers/samtools:1.18--h50ea8bc_1` |
+
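+To override an image, set `container` inside a `withName:` (or `withLabel:`) selector in the `process` scope of your `nextflow.config`. A generic `process.container` setting has lower priority than a container declared in a module, so it will not replace images that modules define directly.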
diff --git a/docs/usage.md b/docs/usage.md
index a2c69a1d..9692b8d4 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -56,6 +56,18 @@ sample2,sample2.high.filtered.vcf,sample2.sorted.bam
 
 An [example samplesheet](../assets/example_inputs/input_example.csv) has been provided with the pipeline.
 
+### Input files from deepUMIcaller
+
+If your mutations were called with [deepUMIcaller](https://github.com/bbglab/deepUMIcaller), use the **final** per-sample outputs produced by that pipeline:
+
+- **VCF**: the filtered duplex VCF for each sample (commonly named `*.duplex.filtered.vcf` or `*.high.filtered.vcf`), typically found under the `mutations_vcf/` output directory. The VCF should be uncompressed.
+- **BAM**: the **duplex-consensus** BAM used for calling those variants (commonly named `*.sorted.bam` in `sortbamduplexcons/`). This BAM must be aligned to the same reference genome as the VCF.
+
+Batch/sample guidance:
+
+- **One row per library/run**: if you have multiple sequencing libraries for the same biological sample and want to **aggregate** them, keep the same `sample` name across rows and list each matching VCF/BAM pair.
+- **Separate batches**: if you want to compare batches or runs, keep distinct sample names and (optionally) add a `BATCH` or `RUN_ID` column in the feature groups table to stratify the analysis.
+
 ## Available genomes
 
 deepCSA pipeline heavily relies on bgreference and bgdata tools so the use of this pipeline is limited to those genomes available in these packages. In particular, the default containers that are being used already have the hg38 and mm39 genomes cached, if you want to use any other genome, open an issue and we will address it as soon as we can.
@@ -183,19 +195,48 @@ params {
 }
 ```
 
+### Feature toggles (turning steps on/off)
+
+All pipeline parameters are defined in `nextflow_schema.json` and exposed via `--help`. Most analysis steps can be enabled/disabled with boolean flags such as `mutationdensity`, `omega`, `oncodrive3d`, `signatures`, and `regressions`. If a step is turned off, its output directories are not produced. For a full list of parameters run:
+
+```bash
+nextflow run bbglab/deepCSA --help
+```
+
 ## Definition of structural parameters
 
-- Container pulling (either prior to running the pipeline or directly as the pipeline runs)
-- Generation of Oncodrive3D datasets (see: [Oncodrive3D repo datasets building process](https://github.com/bbglab/oncodrive3d?tab=readme-ov-file#building-datasets))
+Before running, ensure container images can be pulled (or are already available) and that all reference datasets are accessible from your execution environment.
 
-- Download of additional specific datasets
-  - Ensembl VEP (see: [Ensembl VEP docs](https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache)). Modify accordingly your `nextflow.config` vep parameters, `vep_cache`, `vep_cache_version`, etc.
-  <!-- TODO we should revise if we can provide more specific information on how to download the cache -->
-  - CADD scores (see: [CADD downloads page](https://cadd.gs.washington.edu/download) "All possible SNVs of GRCh38/hg38" file)
-  - COSMIC signatures (i.e. [COSMIC signatures downloads page](https://cancer.sanger.ac.uk/signatures/downloads/) (select context size = 96 and your desired species of interest))
+### Data sources and reference downloads
 
-- Provide custom domain definition file.
-  <!-- * dNdScv datasets (see: ) -->
+- **Ensembl VEP cache**  
+  Download from the [Ensembl VEP cache](https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache) page and set `vep_cache`, `vep_cache_version`, `vep_species`, and `vep_genome` accordingly.
+
+- **CADD scores**  
+  Use the ["All possible SNVs" GRCh38/hg38 file](https://cadd.gs.washington.edu/download). Provide both the compressed TSV (`cadd_scores`) and its tabix index (`cadd_scores_ind`).
+
+- **COSMIC signatures**  
+  Download the SBS signatures (context size = 96) for your genome build from the [COSMIC signatures downloads page](https://cancer.sanger.ac.uk/signatures/downloads/). Set `cosmic_ref_signatures`.
+
+- **dNdScv reference data**  
+  Download `RefCDS_human_latest_intogen.rda` and `covariates_hg19_hg38_epigenome_pcawg.rda` from the [dNdScv](https://github.com/im3sanger/dndscv) reference data (also mirrored by [IntOGen](https://intogen.org/download)). Set `dnds_ref_transcripts` and `dnds_covariates`.
+
+- **Oncodrive3D datasets**  
+  Build datasets and annotations following the [Oncodrive3D dataset instructions](https://github.com/bbglab/oncodrive3d?tab=readme-ov-file#building-datasets). Provide the resulting `datasets3d` and `annotations3d` directories.
+
+- **Trinucleotide counts**  
+  Provide a `wgs_trinuc_counts` file with the total count of each trinucleotide in your reference genome (see `assets/trinucleotide_counts/` for the expected format).
+
+- **DNA2PROTEINMAPPING GFF3 (optional)**  
+  By default, the pipeline fetches a GFF3 file from Ensembl FTP at runtime. For local/offline use, download the matching file from  
+  `https://ftp.ensembl.org/pub/release-<release>/gff3/<species>/`  
+  (e.g., `Homo_sapiens.GRCh38.111.gff3.gz`) and provide it via `gff3_file`.
+
+- **Domain definitions**  
+  Supply a Pfam/InterPro domain file (see [file formatting](file_formatting.md#domain-definition-file)).
+
+- **NanoSeq masks (optional)**  
+  See [Nanoseq genomic masks](#nanoseq-genomic-masks) below.
 
 ### Mandatory parameter configuration
 
@@ -275,6 +316,15 @@ These files identify sites overlapping common SNPs and noisy or variable genomic
 - Nanoseq SNP: Common SNP positions that should be excluded from analysis
 - Nanoseq Noise: Regions with high noise or variability
 
+Enable them with:
+
+```console
+params {
+    nanoseq_snp   = "SNP_GRCh38.wgns.bed.gz"
+    nanoseq_noise = "NOISE_GRCh38.wgns.bed.gz"
+}
+```
+
 Both files are available for GRCh38 at the [shared folder](https://drive.google.com/drive/folders/1wqkgpRTuf4EUhqCGSLA4fIg9qEEw3ZcL) from Iñigo Martincorena's group, at the Wellcome Sanger Institute.
 
 ## Additional customizable parameters
@@ -322,6 +372,30 @@ Notes and requirements:
   - If your input.csv file contains `sample`, vcf and bam columns, the columns of the depths table have to be the same as the name of the BAM files of each sample in the input.csv file.
 - Make sure that you remove the column CONTEXT from the table in case you are starting with the all_samples individual depths table that is outputted by deepCSA. Check out the assets/useful_scripts/downsample_depths.ipynb file for an example on how to prepare the input for this parameter.
 
+### Plotting controls
+
+If you are running with sample groups (see [Feature groups](file_formatting.md#feature-groups)), you can control whether plotting steps generate cohort-only outputs or include all group-level plots:
+
+```console
+params {
+    plot_only_allsamples = true  // only cohort-level plots
+}
+```
+
+Set `plot_only_allsamples = false` to generate per-group plots alongside the cohort summaries.
+
+### Positive selection with non-protein-affecting profiles
+
+deepCSA can compute positive selection metrics using a **non-protein-affecting** mutational profile (synonymous + intronic + intergenic mutations). This is useful to compare selection metrics against a background that excludes protein-altering events.
+
+```console
+params {
+    positive_selection_non_protein_affecting = true
+}
+```
+
+When enabled, outputs will be labeled with the `.non_prot_aff` suffix in the corresponding selection directories (e.g., `omega/`, `omegagloballoc/`).
+
 ## Custom mutation calls -- option 1 (building input VCFs and providing them via normal input)
 
 If you want to run deepCSA with your own mutation calls, this is also possible. Reasons behind this would be: