diff --git a/README.md b/README.md index a254570a..72e4ca48 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,8 @@ sample2,sample2.filtered.vcf,sample2.sorted.bam Each row represents a single sample with a single-sample VCF containing the mutations called in that sample and the BAM file that was used for getting those variant calls. The mutations will be obtained from the VCF and the BAM file will be used for computing the sequencing depth at each position and using this for the downstream analysis. +Two alternative input modes are also supported: a samplesheet with only `sample,vcf` columns combined with a precomputed depths table, or a single cohort-level MAF file passed via `--input_maf` together with a precomputed depths table. See [Input scenarios](docs/input_scenarios.md) for details. + **Make sure that you do not use any '.' in your sample names, and also use text-like names for the samples, try to avoid having only numbers.** This second case should be handled properly but using string-like names will ensure consistency. **There are specific datasets that need to be prepared before running deepCSA. You can find a list of those, and instructions for downloading them in [the documentation section of the repo](docs/usage.md#mandatory-parameter-configuration).** @@ -80,6 +82,8 @@ We are working to provide the biggest possible detail on the [usage](docs/usage. > > *these authors contributed equally and the order was decided randomly +& + > **DeepClone, an end-to-end protocol to study somatic mutagenesis and selection at high resolution** > > Ferriol Calvet, Morena Pinheiro-Santin, Erika Lopez, Raquel Blanco Martinez-Illescas, Núria Samper, Miguel L. Grau, Ferran Muiños, Rocío Chamorro González, Maria Andrianova, Federica Brando, Stefano Pellegrini, Marta Huertas, Elisabet Figuerola-Bou, Coohleen Coombes, Brendan F. 
Kohrn, Jeanne Fredrickson, Rosa Ana Risques, Nuria Lopez-Bigas, Abel Gonzalez-Perez diff --git a/docs/README.md b/docs/README.md index 8e0c0b90..19103546 100644 --- a/docs/README.md +++ b/docs/README.md @@ -4,9 +4,15 @@ The bbglab/deepCSA documentation is split into the following pages: - [Usage](usage.md) - An overview of how the pipeline works and how to run it. +- [Input scenarios](input_scenarios.md) + - The three supported input modes (VCF + BAM, VCF + precomputed depths, cohort MAF + precomputed depths) and when to use each. - [File formatting](file_formatting.md) - An overview of the specific formats required for each of the custom mandatory or optional files. - [Output](output.md) - An overview of the different results produced by the pipeline and how to interpret them. - [Tools](tools.md) - An overview of the explanation of the tools used in deepCSA and the rationale behind some of the decisions or computations. +- [Test data](test_data.md) + - Where the test data lives, what it contains, and how it is consumed by the nf-test suite. +- [Issue resolution](issue_resolution.md) + - Known issues encountered during development and how they were resolved. diff --git a/docs/input_scenarios.md b/docs/input_scenarios.md new file mode 100644 index 00000000..0ddeda63 --- /dev/null +++ b/docs/input_scenarios.md @@ -0,0 +1,96 @@ +# bbglab/deepCSA: Input scenarios + +deepCSA supports three input scenarios depending on what you already have available (BAMs, mutations as VCFs, or a cohort-level MAF together with a precomputed depths table). All scenarios still require the standard samplesheet CSV passed via `--input`. + +Sample naming rules apply to every scenario: avoid `.` in sample names and prefer text-like names instead of purely numeric ones. See [File formatting](file_formatting.md) for details on each file. + +## Scenario summary + +| Scenario | `--input` columns | Depth source | Extra flags | +|---|---|---|---| +| 1. 
VCF + BAM (default) | `sample,vcf,bam` | Computed from BAMs | — | +| 2. VCF + precomputed depths | `sample,vcf` | `--custom_depths_table` | `--use_custom_depths true` | +| 3. Cohort MAF + precomputed depths | `sample,vcf` (metadata only) | `--custom_depths_table` | `--input_maf ` + `--use_custom_depths true` | + +The pipeline validates these combinations at start-up and stops with an explicit error if `--input_maf` is set without `--use_custom_depths true` (see [workflows/deepcsa.nf](../workflows/deepcsa.nf)). + +## Scenario 1 — VCF + BAM (default) + +Use this scenario when you have per-sample variant calls and the BAM files that were used to produce them. + +```csv +sample,vcf,bam +sample1,sample1.filtered.vcf,sample1.sorted.bam +sample2,sample2.filtered.vcf,sample2.sorted.bam +``` + +The pipeline derives per-position sequencing depth directly from the BAMs (subworkflow `depthanalysis`). No extra flag is needed. + +## Scenario 2 — VCF + precomputed depths + +Use this scenario when you already have a depths table (for example produced by a previous deepCSA run, or by an external tool) and you want to skip BAM-based pileup. + +```csv +sample,vcf +sample1,sample1.filtered.vcf +sample2,sample2.filtered.vcf +``` + +```console +params { + use_custom_depths = true + custom_depths_table = '/path/to/precomputed_depths_table.tsv' +} +``` + +Notes: + +- The depths-table column names must match the sample names declared in the `sample` column of the input CSV. +- `custom_depths_table` may be TSV or CSV but must follow the per-position depth layout that deepCSA expects. +- If the file is missing or unreadable the pipeline fails immediately. + +See [Usage — Using a precomputed depths table](usage.md#using-a-precomputed-depths-table) for additional notes on how columns are matched and on preparing the file from a previous deepCSA run. 
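
The column-matching note for Scenario 2 can be checked before launching a run. What follows is a minimal Python sketch, not part of deepCSA: the helper name `check_samples_against_depths` is hypothetical, and it assumes only that the depths table has a delimited header row in which every sample appears as a column name (the `CHROM`/`POS` columns in the toy data are likewise an assumption about the layout). It also flags sample names that contain `.` or are purely numeric, following the naming rules stated at the top of this page.

```python
import csv
import io


def check_samples_against_depths(samplesheet, depths_table, delimiter="\t"):
    """Return a list of human-readable problems (an empty list means OK).

    Both arguments are open text handles. Hypothetical helper for
    illustration only; deepCSA performs its own validation at start-up.
    """
    problems = []
    samples = [row["sample"] for row in csv.DictReader(samplesheet)]
    # Only the header row of the depths table is needed for this check.
    header = next(csv.reader(depths_table, delimiter=delimiter), [])
    depth_columns = set(header)

    for name in samples:
        if "." in name:
            problems.append(f"{name}: contains '.'")
        if name.isdigit():
            problems.append(f"{name}: purely numeric")
        if name not in depth_columns:
            problems.append(f"{name}: no matching depths column")
    return problems


# Toy inputs; the CHROM/POS column names are an assumption about the layout.
sheet = io.StringIO("sample,vcf\nsample1,sample1.filtered.vcf\n42,s42.vcf\n")
depths = io.StringIO("CHROM\tPOS\tsample1\tsample2\nchr1\t100\t950\t1020\n")
for problem in check_samples_against_depths(sheet, depths):
    print(problem)
```

Reading only the header keeps the check cheap even for large depths tables; pass `delimiter=","` for a CSV depths table.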
+ +## Scenario 3 — Cohort MAF + precomputed depths + +Use this scenario when all mutations for the cohort are already consolidated in a single MAF/TSV file and you also have the matching precomputed depths table. + +```console +params { + input = "samplesheet.csv" + input_maf = "cohort_mutations.maf" + use_custom_depths = true + custom_depths_table = "precomputed_depths.tsv" +} +``` + +```bash +nextflow run bbglab/deepCSA \ + --input samplesheet.csv \ + --outdir results/ \ + --input_maf cohort_mutations.maf \ + --use_custom_depths true \ + --custom_depths_table precomputed_depths.tsv \ + -profile +``` + +What happens under the hood: + +1. The MAF file is split into one VCF per unique `SAMPLE_ID` by `INPUTMAF2VCF` (script [assets/useful_scripts/deepcsa_maf2samplevcfs.py](../assets/useful_scripts/deepcsa_maf2samplevcfs.py)). +2. The per-sample VCFs are published under `/processing_files/input_vcfs/`. +3. The rest of the pipeline runs as in Scenario 2. + +The standard `--input` samplesheet is still required, because it provides the sample metadata used by other pipeline steps. The `SAMPLE_ID` values in the MAF must match the `sample` column of the samplesheet. + +For the expected MAF columns (deepCSA-generated MAF vs external MAF) see [Usage — MAF file format](usage.md#maf-file-format). + +## Related parameters + +| Parameter | Purpose | +|---|---| +| `input` | Samplesheet CSV with `sample,vcf[,bam]` columns. Always required. | +| `input_maf` | Cohort-level MAF file (Scenario 3). Requires `use_custom_depths = true`. | +| `use_custom_depths` | Skip BAM-based depth computation. Required for Scenarios 2 and 3. | +| `custom_depths_table` | Path to the precomputed per-position depths table. Required when `use_custom_depths = true`. | + +Custom-mutation workflows (e.g. forcing your own filter list) layer on top of these scenarios. 
See [Usage — Custom mutation calls](usage.md#custom-mutation-calls----option-1-building-input-vcfs-and-providing-them-via-normal-input) for the advanced options. diff --git a/docs/output.md b/docs/output.md index 7847ef71..ceef4600 100644 --- a/docs/output.md +++ b/docs/output.md @@ -19,239 +19,245 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [Additional clonal structure metrics](#additional-clonal-structure-metrics) - [Mutational signatures](#mutational-signatures) - [Plotting functionalities](#plotting-functionalities) +- [QC outputs](#qc-outputs) - [Additional outputs](#additional-outputs) ## Directory Structure -The directory structure listed below will be created in the results directory after the pipeline has finished. -The structure captures the maximum diversity of created outputs, but when only certain run options are turned on, not all directories will be generated. -All paths are relative to the top-level results directory. +The directory tree below shows the maximum diversity of outputs the pipeline can publish. When only some run options are turned on, only the corresponding subdirectories will be generated. All paths are relative to the top-level results directory. ```{console} {outdir} -├──absolutemutabilities -├──absolutemutabilitiesgloballoc -├──annotatedepths -├──clean_germline_somatic -├──clean_somatic -├──computematrix -├──computeprofile -├──createpanels -│ ├── consensus -│ │ └── .consensus.bed -│ │ └── .consensus.tsv -│ ├── captured -│ │ └── .captured.bed -│ │ └── .captured.tsv -│ └── sample -│ └── ..bed -│ └── ..tsv -├──customannotation -├──customprocessing -├──customprocessingrich -├──depthssummary -├──dna2proteinmapping -├──domainannotation -├──expandregions -├──filterexons -├──germline_somatic -├──groupgenes -├──indels -├──matrixconcatwgs -├──multiqc -├──mutability -├──mutatedcellsfromvafam -├──mutatedgenomesfromvafam -├──mutrate -├──muts2sigs -├──omega -│ ├── preprocessing -│ │ └── syn_muts. 
-│ │ └── mutabilities. -│ └── output_mle..tsv -├──omegagloballoc -│ ├── preprocessing -│ │ └── syn_muts. -│ │ └── mutabilities. -│ └── output_mle..tsv -├──oncodrive3d -│ ├── run -│ └── -│ └── plot -│ └── -├──oncodrivefmlsnvs -├──pipeline_info -├──plotmaf -├──plotneedles -│ └── -│ └── -├──plotselection -├──plotsomaticmaf -├──postprocessveppanel -├──signatures_hdp -│ └── output. -│ └── -├──sigprobs -├──sigprofilerassignment -│ └── output. -│ └── -├──sitecomparison -├──sitecomparisongloballoc -├──sitecomparisongloballocmulti -├──sitecomparisonmulti -├──sitesfrompositions -├──sumannotation -├──synmutrate -├──synmutreadsdensity -└──table2group -work/ -.nextflow.log +├── depths +│ ├── individual # per-sample depth tables +│ ├── plots_per_group # depth plots split by sample groupings +│ └── summary # exons / exons_cons / all_cons depth summaries +├── group_definition +│ ├── genes +│ └── samples +├── mutations +│ ├── germline_somatic # all calls labelled germline + somatic +│ ├── clean_somatic # somatic calls after filtering +│ └── clean_germline_somatic # cleaned germline + somatic +├── mutational_profile # trinucleotide profiles (all / exons / introns / non-prot / synonymous) +├── mutdensity +│ └── individual_vals # flat mutation density per sample/group +├── mutdensity_adjusted +│ └── individual_vals # trinucleotide-adjusted mutation density +├── regions +│ ├── allsites # captured positions ready for VEP +│ ├── annotations # panel annotation tables and plots +│ ├── capturedpanels # per-region captured panels +│ ├── consensuspanels # consensus panels (cohort-level) +│ ├── samplepanels # per-sample panels per region type +│ │ ├── createsamplepanelsall +│ │ ├── createsamplepanelsexons +│ │ ├── createsamplepanelsintrons +│ │ ├── createsamplepanelsnonprotaffect +│ │ ├── createsamplepanelsprotaffect +│ │ └── createsamplepanelssynonymous +│ ├── expandedregions # subgenic / domain / exon expansions +│ ├── panelannotation +│ └── dndscv # biomart filtered by panel BED (dynamic 
RefCDS input) +├── selection +│ ├── omega +│ │ ├── preprocessing # syn_muts., mutabilities. +│ │ └── estimator # all_omegas.tsv, output_mle..tsv +│ ├── omegagloballoc +│ │ ├── preprocessing +│ │ └── estimator +│ ├── sitecomparison # background × count combinations +│ │ ├── bckg_single_count_single +│ │ ├── bckg_single_count_multi +│ │ ├── bckg_multi_count_single +│ │ ├── bckg_multi_count_multi +│ │ ├── bckg_glocsingle_count_single +│ │ ├── bckg_glocsingle_count_multi +│ │ ├── bckg_glocmulti_count_single +│ │ └── bckg_glocmulti_count_multi +│ ├── oncodrivefml +│ ├── oncodrive3d +│ │ └── run # per-sample Oncodrive3D results +│ ├── dndscv # dNdScv (R) outputs +│ │ ├── cv # *.cv.tsv +│ │ ├── persample # *.globaldnds.tsv +│ │ └── local # *.loc.tsv +│ └── dndsproxy # dN/dS proxy from adjusted vs synonymous densities +├── signatures +│ ├── sigprofilerassignment +│ ├── sigprofilerassignment_indels +│ ├── sigprofilermatrixgenerator +│ ├── signatures_hdp +│ └── hdp_decomposition_spa +├── plots +│ ├── mutations_summary # plot_maf / plot_somatic_maf +│ ├── needle_plots # per-sample, per-gene needles +│ ├── selection_summary +│ ├── selection +│ │ ├── omega +│ │ ├── omegagloballoc +│ │ └── oncodrive3d +│ │ └── chimerax +│ ├── gene_subgenic_selection +│ ├── saturation_proportions +│ └── interindividual_variability +├── qc +│ ├── trinucleotide_proportions +│ ├── mutational_profiles_comparison +│ ├── mutdensityqc +│ ├── metrics_vs_depth # depth-vs-metric scatter PDFs + status TSVs +│ ├── mutationspecific +│ ├── omega_flagged +│ ├── evaluate_omega_globalloc +│ └── contamination +├── processing_files +│ ├── input_vcfs # per-sample VCFs (when --input_maf is used) +│ ├── all_possible_sites +│ ├── sumannotation +│ ├── synmutdensity +│ ├── synmutreadsdensity +│ ├── mutations_matrix +│ │ └── per_sample # SBS matrices for signature analysis +│ ├── relativemutability +│ ├── flagged_positions +│ └── multiqc +├── regressions +├── pipeline_info +└── multiqc ``` ## Input and configuration -See 
Usage docs for extensive explanation on required inputs and format. Including documentation on parameters to run on for 4 different suggested running modes. +See [Usage](usage.md) and [Input scenarios](input_scenarios.md) for an explanation of the required inputs, the three supported input modes, and the parameter presets for the four suggested run profiles. ## Depth analysis ### Key role -- Computation of depth per sample for each specific position - Most analysis may be influenced by sequencing depth, it is essential to correct for these values. +- Computation of depth per sample for each specific position. + Most analyses are influenced by sequencing depth, so it is essential to correct for these values. -- Definition of regions to analyze - Only genomic areas that have been properly covered across samples will be used for the analysis. +- Definition of regions to analyse. + Only genomic areas that have been properly covered across samples will be used. -**Note 1:** There is a depth difference between the depth reported in the files in the annotated depths directory and the values of depth reported in each of the mutations. This difference is because we do not count Ns when computing th depth of specific mutations. This means that the values of VAF are computed with N-discounted depth, while other metrics are not. +**Note 1:** There is a depth difference between the depth reported in the files under `depths/individual/` and the values reported per mutation. This difference is because Ns are not counted when computing the depth at the specific mutation position. Therefore VAF values are computed with N-discounted depth while other metrics are not. -### Detailed explanation of depthssummary depths versions +### Detailed explanation of `depths/summary/` versions -In this directory you will find different versions of TSVs and PDFs summarizing the depths of the samples/genes sequenced. 
+In this directory you will find different versions of TSVs and PDFs summarising the depths of the samples/genes sequenced. -Each of the versions provides slightly different information, as you can see in the image below: +Each version provides slightly different information, as shown below: ![depths summary slide](images/deepCSA_depths_summary.png) -- exons contains the average depth in all the exonic regions sequenced in the genome no matter which minimum consensus coverage was reached. -- exons_cons contains the average depth in the exonic regions sequenced in the genome to a minimum consensus depth threshold. (only exons in the well covered regions) -- all_cons contains the average depth of all sequenced regions of the genome that are well covered across the samples in the cohort, without any distinction of exons/introns/others. - -We will work on a better representation of the different metrics of depth so that is it more understandable, but for now we include this schematic and brief explanations. - -Reach out if you have more questions! +- `exons` — average depth in all the exonic regions sequenced, regardless of consensus coverage. +- `exons_cons` — average depth in exonic regions reaching the minimum consensus depth threshold (i.e. exons within the well-covered regions). +- `all_cons` — average depth of all well-covered sequenced regions across the cohort, with no exonic/intronic distinction. 
### Outputs -- sitesfrompositions -- postprocessveppanel -- createpanels -- annotatedepths -- depthssummary +- `depths/` (individual, summary, plots_per_group) +- `regions/allsites/`, `regions/annotations/`, `regions/panelannotation/` +- `regions/capturedpanels/`, `regions/consensuspanels/`, `regions/samplepanels/` -Optional: +Optional (subgenic / domain expansion): -- dna2proteinmapping -- domainannotation -- customprocessing -- customprocessingrich +- `regions/expandedregions/` +- `regions/annotations/` (domain and DNA-to-protein mapping outputs) ## Mutation preprocessing ### Key role -- VCF annotation: Annotate mutations with Ensembl VEP. -- VCF to MAF conversion: Convert VCFs to MAF, define VAF, and merge with annotation. -- Custom region annotation: Allow user to define different consequence types for specific regions. -- Hotspot annotation: Add known hotspots to mutation annotation. +- VCF annotation with Ensembl VEP. +- VCF → MAF conversion, VAF computation, merge with annotation. +- Custom region annotation: user-defined consequence types for specific regions. +- Hotspot annotation: add known hotspots to the mutation table. - Filtering: - - Filter mutations at the sample level (e.g., VAF distortion). - - Filter at the cohort level (e.g., other_sample_SNP, repetitive_variant, not_covered, not_in_exons). -- Blacklist mutations if activated (see assets for example). -- Downsample mutations if activated. + - Sample-level filters (e.g. VAF distortion via `vaf_distortion_threshold`). + - Cohort-level filters (e.g. `other_sample_SNP`, `repetitive_variant`, `not_covered`, `not_in_exons`). +- Optional blacklist of mutations (see assets for example). +- Optional downsampling of mutations. 
### Outputs -- sumannotation -- customannotation -- germline_somatic -- clean_somatic -- clean_germline_somatic +- `mutations/germline_somatic/` +- `mutations/clean_somatic/` +- `mutations/clean_germline_somatic/` +- `processing_files/sumannotation/`, `processing_files/flagged_positions/` ## Basic analysis ### Key role -- Mutation density computation - Correct the number of mutations observed by the number of sequenced nucleotides. - -- Mutational profile computation - Capture the mutation probability of each trinucleotide. Represent it in three different normalization conditions. +- Mutation density computation — corrects the number of observed mutations by the number of sequenced nucleotides. +- Mutational profile computation — captures the mutation probability of each trinucleotide, in three different normalisation conditions. ### Outputs -- computematrix -- computeprofile -- mutrate +- `mutdensity/individual_vals/` +- `mutdensity_adjusted/individual_vals/` (trinucleotide-adjusted; see [Tools — Adjusted mutation density](tools.md#adjusted-mutation-density)) +- `mutational_profile/` +- `processing_files/mutations_matrix/` (per-sample SBS matrix) ## Intermediate outputs ### Key role -- Matrix concatenation - Combine WGS-renomralized matrices for mutational signature analysis. - -- Mutability calculation - Compute relative mutabilities using depths and mutational profile. - -- Choose synonymous mutation rates for downstream analysis. +- Matrix concatenation — combine WGS-renormalised matrices for mutational signature analysis. +- Mutability calculation — compute relative mutabilities using depths and the mutational profile. +- Selection of the synonymous mutation rate used downstream. 
### Outputs -- matrixconcatwgs -- mutability -- synmutrate -- synmutreadsdensity +- `processing_files/mutations_matrix/` (cohort-level concatenated matrix) +- `processing_files/relativemutability/` +- `processing_files/synmutdensity/` +- `processing_files/synmutreadsdensity/` ## Positive selection ### Key role -- Compute multiple positive selection metrics - This is done at the cohort-level, but also for each sample or group of samples. - -- OncodriveFML: Detects functional impact bias in observed mutations. - -- Oncodrive3D: Identifies 3D protein regions with mutation clustering, using relative mutabilities and raw VEP annotation. - -- Omega: dN/dS-based, quantifies selection pressure in defined regions (genes, exons, domains, hotspots, etc.). - -- Indels: Analysis of indel selection. +- Compute several positive selection metrics at the cohort level and per sample/group: + - **OncodriveFML** — functional-impact bias. + - **Oncodrive3D** — 3D protein clustering, optionally on raw VEP annotation. + - **Omega** — dN/dS-based selection in defined regions (genes, exons, domains, hotspots, ...). + - **dNdScv** — R implementation, run with a per-run RefCDS built dynamically from the panel BED + a biomart export (`dnds_biomart_ref`) + the genome FASTA. See [Tools — dNdScv](tools.md#dndscv). + - **dN/dS proxy** — quick ratio of adjusted vs synonymous mutation densities, output as `*.gene_mutdensities_n_dnds.tsv`. + - **Indels** — indel selection analysis. ### Outputs -- omega -- omegagloballoc -- oncodrive3d -- oncodrivefmlsnvs -- indels +- `selection/omega/{preprocessing,estimator}/` +- `selection/omegagloballoc/{preprocessing,estimator}/` +- `selection/oncodrive3d/run/` +- `selection/oncodrivefml/` +- `selection/dndscv/{cv,persample,local}/` +- `selection/dndsproxy/` ## Site selection metrics ### Key role - Compute absolute mutabilities for each position. 
- -- Compare the observed number of mutations per site to the expected number of mutations and estimate a site selection value. +- Compare the observed number of mutations per site to the expected number and estimate a site-selection value. ### Outputs -- absolutemutabilities -- absolutemutabilitiesgloballoc +- Recommended outputs: + + - For reporting selection at the cohort level: -- sitecomparison -- sitecomparisongloballoc -- sitecomparisongloballocmulti -- sitecomparisonmulti + `selection/sitecomparison/bckg_single_count_single` + + - For estimating selection while accounting for expansions or multiple occurrences of specific mutations: + + `selection/sitecomparison/bckg_single_count_multi` + + `selection/sitecomparison/bckg_multi_count_multi` + +- All combinations are published under `selection/sitecomparison/` (8 background × count combinations: `bckg_{single,multi,glocsingle,glocmulti}_count_{single,multi}/`). ## Additional clonal structure metrics @@ -261,51 +267,69 @@ Optional: ### Outputs -- mutatedcellsfromvafam -- mutatedgenomesfromvafam +- The mutated-cells analyses are published under `selection/` and `mutations/` according to the configured grouping; the corresponding processes are `mutated_cells_from_vaf` and `mutated_genomes_from_vaf` (controlled by `params.mutated_cells_vaf`). ## Mutational signatures ### Key role -- Signature assignment: Use SigProfilerAssignment with optional custom signatures. -- HDP: Hierarchical Dirichlet Process for signature extraction. -- (Pending) Signature extraction: SigProfilerExtractor support. +- Signature assignment with SigProfilerAssignment (optional custom signatures). +- HDP — Hierarchical Dirichlet Process signature extraction. +- SigProfilerExtractor is not run by the pipeline and must be run externally. 
### Outputs -- signatures_hdp -- sigprofilerassignment -- sigprobs -- muts2sigs +- `signatures/sigprofilerassignment/` +- `signatures/sigprofilerassignment_indels/` +- `signatures/sigprofilermatrixgenerator/` +- `signatures/signatures_hdp/` +- `signatures/hdp_decomposition_spa/` ## Plotting functionalities ### Key role -- Plotting basic statistics of numbers and distribution of mutations in genes. +- Plot basic statistics on numbers and distribution of mutations in genes. +- Plot selection results (omega, OncodriveFML, Oncodrive3D, gene/subgenic saturation, interindividual variability). + +### Outputs + +- `plots/mutations_summary/` +- `plots/needle_plots/` +- `plots/selection_summary/` +- `plots/selection/{omega,omegagloballoc,oncodrive3d}/` +- `plots/gene_subgenic_selection/` +- `plots/saturation_proportions/` +- `plots/interindividual_variability/` + +## QC outputs + +### Key role -- Optionally think on adding more plots. +A `qc/` umbrella collects all the quality-control views; `qc/metrics_vs_depth/` always runs and produces depth-vs-metric scatter plots for raw and adjusted mutation densities and omega-globalloc. ### Outputs -- plotmaf -- plotneedles -- plotselection -- plotsomaticmaf -- qc/metrics_vs_depth (depth-vs-mutdensity/omega QC scatterplots and TSV summaries) +- `qc/trinucleotide_proportions/` +- `qc/mutational_profiles_comparison/` +- `qc/mutdensityqc/` +- `qc/metrics_vs_depth/` +- `qc/mutationspecific/` +- `qc/omega_flagged/` +- `qc/evaluate_omega_globalloc/` +- `qc/contamination/` ## Additional outputs ### Key role -- Definition of groups, expanded regions and other metrics related with the full pipeline execution. +- Definition of sample/gene groups, expanded regions, regression configs, and pipeline-level reports. 
### Outputs -- table2group -- groupgenes -- expandregions -- filterexons -- multiqc -- pipeline_info +- `group_definition/{samples,genes}/` +- `regions/expandedregions/` +- `regressions/` +- `multiqc/` +- `pipeline_info/` +- `processing_files/input_vcfs/` (when `--input_maf` is used) diff --git a/docs/test_data.md b/docs/test_data.md new file mode 100644 index 00000000..f07b81f8 --- /dev/null +++ b/docs/test_data.md @@ -0,0 +1,77 @@ +# bbglab/deepCSA: Test data + +deepCSA ships a minimal nf-test suite ([tests/deepcsa.nf.test](../tests/deepcsa.nf.test)) that exercises the main input scenarios and validation paths. This document describes where the test data lives, what it contains, and how it is consumed by the tests. + +## Where the test data lives + +The reference test datasets are hosted in the [bbglab/DeepClone_protocol](https://github.com/bbglab/DeepClone_protocol) repository, under `test_datasets/deepCSA/testdata/`: + +``` +test_datasets/deepCSA/testdata/ +├── maf/ +│ └── all_samples.somatic.mutations.maf # cohort-level MAF (3 samples) +├── depth/ +│ └── all_samples_indv.depths.tsv.gz # precomputed per-position depths table +└── input_vcfs/ + ├── P19_0002_BDO_01.vcf + ├── P19_0002_BTR_01.vcf + └── P19_0003_BDO_01.vcf +``` + +The three test samples (`P19_0002_BDO_01`, `P19_0002_BTR_01`, `P19_0003_BDO_01`) come from a bladder duplex-sequencing experiment and are large enough to exercise the panel, mutational-profile, depth, and omega code paths while keeping runtimes short. + +Locally committed inputs under [tests/test_data/](../tests/test_data/) only contain the small CSV samplesheets and one toy MAF used by the validation-failure tests: + +| File | Purpose | +|---|---| +| `input.csv` | Samplesheet with `sample,vcf,bam` columns referring to internal IRB paths (not used by the public CI tests). | +| `input_maf.csv` | Samplesheet with `sample,vcf` columns pointing to remote VCFs from `bbglab/DeepClone_protocol`. Used by the MAF-input test. 
| +| `input_no_bam.csv` | Same as `input_maf.csv`, used by the VCF-without-BAM tests. | +| `test_mutations.maf` | Tiny MAF used only by the parameter-validation failure tests (3, 4, 5). | + +## Remote-fetching convention + +Following the convention used by nf-core pipelines (e.g. [nf-core/fastquorum](https://github.com/nf-core/fastquorum)), the MAF and depths files are fetched **at runtime** directly from `bbglab/DeepClone_protocol` rather than pre-downloaded: + +```groovy +input_maf = 'https://raw.githubusercontent.com/bbglab/DeepClone_protocol/main/test_datasets/deepCSA/testdata/maf/all_samples.somatic.mutations.maf' +use_custom_depths = true +custom_depths_table = 'https://raw.githubusercontent.com/bbglab/DeepClone_protocol/main/test_datasets/deepCSA/testdata/depth/all_samples_indv.depths.tsv.gz' +``` + +> ⚠️ The `bbglab/DeepClone_protocol` repository must remain **publicly accessible** for Nextflow to fetch these files at runtime. If access is restricted the tests fail with a "No such file or directory" error. 
+ +Because nf-schema 2.x validates `file-path` parameters for local existence, [tests/nextflow.config](../tests/nextflow.config) excludes the remote-URL parameters from that check: + +```groovy +validation { + ignoreParams = ['input_maf', 'custom_depths_table'] +} +``` + +## How tests map to input scenarios + +The nf-test cases listed below cover the two precomputed-depths [input scenarios](input_scenarios.md) (Scenarios 2 and 3) plus three validation-failure paths; Scenario 1 (VCF + BAM) is not exercised by the tests in this table: + +| Test | Scenario covered | Inputs | +|---|---|---| +| TEST 1 — basic MAF processing | Scenario 3 (cohort MAF + depths) | `input_maf.csv` + remote MAF + remote depths | +| TEST 1b — VCF + depths | Scenario 2 (VCF + precomputed depths) | `input_no_bam.csv` + remote depths | +| TEST 2 — omega run | Scenario 3 with `omega = true` | same as TEST 1 | +| TEST 3 — `--input_maf` without `--use_custom_depths` | Validation failure | `input_maf.csv` + local toy MAF | +| TEST 4 — VCF samplesheet without BAMs and `use_custom_depths = false` | Validation failure | `input_no_bam.csv` | +| TEST 5 — `use_custom_depths = true` without a depths table | Validation failure | `input_no_bam.csv` | + +Snapshots (MD5 of selected outputs) are stored in [tests/deepcsa.nf.test.snap](../tests/deepcsa.nf.test.snap). + +## Running and updating the tests + +For the full execution-environment notes (SLURM, Singularity, `DEEPCSA_TEST_WORKDIR`, configuring for a non-IRB site) see [tests/README.md](../tests/README.md). The short version: + +```bash +nf-test test tests/deepcsa.nf.test # run the whole suite +nf-test test tests/deepcsa.nf.test --tag omega # run a single test +nf-test test tests/deepcsa.nf.test --update-snapshot # regenerate snapshots +``` + +The tests must be run on a SLURM cluster with Singularity; local execution is not supported because the resource requirements will not be met. Snapshots must be regenerated whenever default pipeline parameters change. 
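
The three validation-failure paths exercised by TEST 3, TEST 4 and TEST 5 can be summarised as a small rule check. This is a Python sketch for illustration only: it is not the Nextflow code in `workflows/deepcsa.nf`, the function name `validate_input_params` is hypothetical, and the error strings are invented rather than copied from the pipeline.

```python
def validate_input_params(has_bam_column, input_maf=None,
                          use_custom_depths=False, custom_depths_table=None):
    """Return an error message for an invalid combination, or None if valid.

    Hypothetical helper mirroring the documented rules; messages are illustrative.
    """
    # TEST 3: a cohort MAF needs a precomputed depths table.
    if input_maf and not use_custom_depths:
        return "--input_maf requires --use_custom_depths true"
    # TEST 5: custom depths were requested but no table was given.
    if use_custom_depths and not custom_depths_table:
        return "--use_custom_depths true requires --custom_depths_table"
    # TEST 4: without BAMs there is nothing to compute depths from.
    if not use_custom_depths and not has_bam_column:
        return "samplesheet has no bam column and --use_custom_depths is false"
    return None


# Scenario 1 (VCF + BAM) and Scenario 3 (MAF + depths) are both accepted.
print(validate_input_params(has_bam_column=True))
print(validate_input_params(has_bam_column=False, input_maf="cohort.maf",
                            use_custom_depths=True,
                            custom_depths_table="depths.tsv"))
```

The order of the checks is a design choice of this sketch: the MAF rule is tested first so that a MAF run without depths reports the MAF-specific message rather than the generic missing-BAM one.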
diff --git a/docs/tools.md b/docs/tools.md
index b74a4805..647b9890 100644
--- a/docs/tools.md
+++ b/docs/tools.md
@@ -114,6 +114,34 @@ For more explanations on omega go to the [corresponding repo](https://github.com
 The site comparison step takes advantage of the computation of mutabilities in [omega](https://github.com/bbglab/omega), and then compares these mutabilities either by residue, residue change or nucleotide change.
+## dNdScv
+
+deepCSA wraps the [dNdScv](https://github.com/im3sanger/dndscv) R package and runs it with a **dynamically built** `RefCDS` reference instead of relying on a pre-baked `.rda` transcripts file.
+
+The `dnds` subworkflow performs three steps for every run:
+
+1. `ADAPT_PANEL_REFCDS` (`dNdScv_panel_prep.py`): filter the biomart export referenced by `params.dnds_biomart_ref` to the transcripts overlapping the panel BED.
+2. `BUILD_REFCDS`: call `dndscv::buildref` using `params.fasta` to produce a fresh `RefCDS_custom.rda`.
+3. `DNDSRUN` (`dNdS_run.R`): run dNdScv on the cohort, producing `*.cv.tsv`, `*.globaldnds.tsv` and `*.loc.tsv` under `selection/dndscv/{cv,persample,local}/`.
+
+Instructions for regenerating the biomart TSV are in [assets/build_datasets/dndscv/instructions.txt](../assets/build_datasets/dndscv/instructions.txt). The previously required `dnds_ref_transcripts` parameter has been removed.
+
+## dN/dS proxy
+
+When mutation density and the all-regions profile are computed, deepCSA also generates a quick **dN/dS proxy** per gene by taking the ratio of non-synonymous to synonymous adjusted mutation densities. The implementation is in `mut_density_adjusted_dnds.py` and the results are published to `selection/dndsproxy/` as `*.gene_mutdensities_n_dnds.tsv`.
+
+This metric is intended as a fast sanity check and is independent of the R-based dNdScv run and of omega, both of which provide dN/dS estimates with significance testing. It is gated by `run_mutdensity` (which itself is enabled by either `mutationdensity` or `omega`) combined with `profileall`.
+
+## Depth-vs-metric QC
+
+The `qc/metrics_vs_depth/` directory is produced by `PLOT_METRICS_VS_DEPTH_QC` (in the `plotting_qc` subworkflow) and is always generated. It joins per-gene/sample average depth (from the `PLOTDEPTHSEXONSCONS` step) with:
+
+- raw mutation densities (`mutdensity/`)
+- adjusted mutation densities (`mutdensity_adjusted/`)
+- omega-globalloc estimates (`selection/omegagloballoc/`)
+
+Each combination yields a scatter PDF and a status TSV under `*.metrics_depth_qc/`, used to flag samples/genes whose metric values may be confounded by sequencing depth.
+
 ## Mutational signatures
 We provide two different strategies for signature analysis.
@@ -122,4 +150,8 @@ We provide two different strategies for signature analysis.
 - Using a Hierarchical Dirichlet Process algorithm developed by Nicola Roberts and compacted by the McGranahan lab into a wrapped version.
+  - The outputs of the signature extraction process are then further processed downstream using SigProfilerAssignment to decompose the de novo signatures and reassign mutational processes to samples.
+
+- We also output mutation count matrices that are ready to be run through [MSA](https://gitlab.com/s.senkin/MSA), another method for mutational signature attribution.
+
 Additionally one could run SigProfilerExtractor on the data but this needs to be done externally.
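
For the dN/dS proxy described in the `dN/dS proxy` section of the `docs/tools.md` changes above, the core arithmetic is a per-gene ratio. The Python sketch below is a hedged illustration only: the function `dnds_proxy`, the gene names, the input layout and the choice to return `None` when the synonymous density is zero are all assumptions, not the behaviour of `mut_density_adjusted_dnds.py`.

```python
from typing import Dict, Optional, Tuple


def dnds_proxy(nonsyn_density: float, syn_density: float) -> Optional[float]:
    """Ratio of non-synonymous to synonymous adjusted mutation density."""
    # Assumption of this sketch: without synonymous signal the ratio
    # is undefined, so None is returned instead of dividing by zero.
    if syn_density <= 0:
        return None
    return nonsyn_density / syn_density


# Hypothetical per-gene adjusted densities: (non-synonymous, synonymous).
densities: Dict[str, Tuple[float, float]] = {
    "GENE_A": (3.0, 1.5),
    "GENE_B": (0.8, 0.0),
}
proxies = {gene: dnds_proxy(n, s) for gene, (n, s) in densities.items()}
print(proxies)
```

Because this is a plain ratio with no significance test, it is only a quick check, as the text notes; dNdScv and omega remain the sources of tested estimates.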
diff --git a/docs/usage.md b/docs/usage.md
index 2b0110b8..9c9ec525 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -6,6 +6,7 @@
 - [Introduction](#introduction)
 - [How to run the pipeline](#how-to-run-the-pipeline)
+- [Input scenarios](#input-scenarios)
 - [Samplesheet input](#samplesheet-input)
 - [Available genomes](#available-genomes)
 - [Proposed run modes](#proposed-run-modes)
@@ -32,6 +33,10 @@ nextflow run bbglab/deepCSA --outdir -profile --input
 For more information on how to run Nextflow pipelines, see the more detailed explanation [below](#running-the-pipeline) in this same document or check the [Nextflow](https://www.nextflow.io/docs/latest/index.html) or [nf-core](https://nf-co.re) community documentation.
+## Input scenarios
+
+deepCSA accepts three different input combinations: per-sample VCF + BAM (default), per-sample VCF + a precomputed depths table, or a cohort-level MAF + a precomputed depths table. The sections below describe each piece in detail; for a concise summary of the three modes and when to use each, see [Input scenarios](input_scenarios.md).
+
 ## Samplesheet input
 You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
@@ -214,9 +219,16 @@ params {
     cadd_scores_ind = "CADD/v1.7/hg38/whole_genome_SNVs.tsv.gz.tbi"
     // dnds
-    dnds_biomart_ref = "RefCDS_human_latest_intogen.rda"
+    // dnds_biomart_ref is a biomart TSV; deepCSA dynamically builds a per-run RefCDS_custom.rda
+    // by intersecting it with the panel BED (replaces the previously required static
+    // RefCDS_*.rda transcripts file). See assets/build_datasets/dndscv/instructions.txt
+    // for how to regenerate the biomart export.
+    dnds_biomart_ref = "biomart_export.tsv"
     dnds_covariates = "covariates_hg19_hg38_epigenome_pcawg.rda"
+    // GFF3 annotation for the genome assembly, consumed when building exon/domain panels
+    gff3_file = "Homo_sapiens.GRCh38.111.gff3.gz"
+
     // oncodrive3d + fancy plots
     datasets3d = "oncodrive3d/datasets"
     annotations3d = "oncodrive3d/annotations"
@@ -301,6 +313,12 @@ This value is used for filtering the mutations by depth. Meaning that if a mutat
 This value is the less stringent depth threshold and is used in the first step of computing the positions that may be part of the so-called "panels". It sets the minimum average depth a position needs in order to be kept for the subsequent depth analysis and the definition of panels. Its main use is to reduce the size of the files that are processed afterwards. It can safely be set to 20 or more.
+### VAF-distortion filter
+
+- vaf_distortion_threshold = 3
+
+Mutations whose ratio `VAF_AM / VAF` (all-molecules VAF over duplex VAF) exceeds this threshold are flagged as VAF-distorted during mutation filtering. Lower values flag more mutations and are therefore more conservative.
+
 ### Using a precomputed depths table
 If you already have a precomputed table with per-position depths for your cohort (for example produced by a previous run or an external tool), you can instruct the pipeline to use that table instead of recomputing depths from the BAM files. This saves time and compute resources when depths have already been computed once and can be reused.
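
As an illustration of the VAF-distortion rule described in the `VAF-distortion filter` section above, the Python sketch below applies the stated ratio test. It is not the pipeline's code: the function name `is_vaf_distorted` and the handling of a zero duplex VAF are assumptions made for this example, while the default of 3 comes from `vaf_distortion_threshold`.

```python
def is_vaf_distorted(vaf_am: float, vaf: float, threshold: float = 3.0) -> bool:
    """Flag a mutation when VAF_AM / VAF exceeds the threshold."""
    # Assumption of this sketch: a duplex VAF of zero makes the ratio
    # undefined, so such a mutation is not flagged here.
    if vaf <= 0:
        return False
    return vaf_am / vaf > threshold


# All-molecules VAF six times the duplex VAF: flagged at the default threshold.
print(is_vaf_distorted(0.30, 0.05))
# Ratio of 2: below the default threshold of 3, so not flagged.
print(is_vaf_distorted(0.10, 0.05))
# The same mutation is flagged once the threshold is lowered to 1.5.
print(is_vaf_distorted(0.10, 0.05, threshold=1.5))
```

The last call shows why lower thresholds are stricter: the same ratio that passes at 3 is flagged at 1.5.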