phipsonlab · eos-jin · Mar 27, 2026
diff --git a/README.md b/README.md
@@ -16,6 +16,206 @@ It is heavily optimised for usage in high-performance computing (HPC) platforms.
 
 For instructions on how to use *NextClone*, please visit the [user guide](https://phipsonlab.github.io/NextClone/).
 
+## Modes
+
+### Whitelist mode (default)
+
+Provide a list of known barcode sequences. Flexiplex maps all reads against the whitelist.
+
+```bash
+nextflow run main.nf --clone_barcodes_reference /path/to/barcodes.txt
+```
+
+### Discovery mode
+
+NextClone supports **discovery mode**, which identifies barcodes directly from the data without a pre-defined whitelist. This is useful when:
+
+- The exact barcode sequences are unknown
+- You are working with a new or custom clonal barcoding system
+- You want to validate or supplement a known barcode list
+
+Discovery mode uses a two-pass approach powered by [Flexiplex](https://github.com/DavidsonGroup/flexiplex):
+
+1. **Pass 1 (Discovery):** Run Flexiplex without a barcode list (that is, omitting the `-k` flag) and with strict flanking sequence matching (`-f 0`) to identify candidate barcodes.
+2. **Pass 2 (Mapping):** Run Flexiplex with the discovered barcode list using standard edit distance parameters.
+
+```bash
+nextflow run main.nf --discovery_mode true
+```
+
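As a rough illustration of how the two passes differ, the Python sketch below builds one argument list per pass. Only `-k` and `-f 0` come from the description above. The `-e` barcode edit distance flag, the helper name `flexiplex_args`, and the file names are assumptions made for illustration, and the flank and barcode patterns a real invocation needs are left out. The Pass 2 values mirror the `adapter_edit_distance` and `barcode_edit_distance` defaults in the parameter table. This is not NextClone's actual process definition.

```python
def flexiplex_args(reads, barcode_list=None, flank_edit=0, barcode_edit=2):
    """Build an illustrative flexiplex argument list for one pass.

    Hypothetical helper: real runs also need flank/barcode pattern options.
    """
    args = ["flexiplex"]
    if barcode_list is None:
        # Pass 1: no -k, so barcodes are discovered; -f 0 demands exact flanks.
        args += ["-f", str(flank_edit)]
    else:
        # Pass 2: map against the discovered list with the usual tolerances.
        # The -e flag (barcode edit distance) is an assumption, see lead-in.
        args += ["-k", barcode_list, "-f", str(flank_edit), "-e", str(barcode_edit)]
    return args + [reads]


# File names below are placeholders, not NextClone outputs.
pass1 = flexiplex_args("reads.fastq")
pass2 = flexiplex_args("reads.fastq", barcode_list="discovered_barcodes.txt",
                       flank_edit=6, barcode_edit=2)
print(" ".join(pass1))
print(" ".join(pass2))
```
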
+#### Barcode filtering in discovery mode
+
+By default (`filter_discovered_barcodes = false`), **all barcodes discovered in Pass 1 are passed to Pass 2**, including singletons. This is recommended for lineage tracing experiments where rare clones are biologically meaningful.
+
+Setting `filter_discovered_barcodes = true` applies `flexiplex-filter` knee-plot inflection filtering, which removes low-count barcodes. Use this only for noisy datasets — **it will discard singleton and low-count clones**:
+
+```bash
+nextflow run main.nf --discovery_mode true --filter_discovered_barcodes true
+```
+
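To make the effect of knee-plot filtering concrete, here is a small Python sketch that ranks barcodes by count and cuts at the point that rises highest above the straight line joining the two ends of the log-log rank curve. This is a generic knee heuristic chosen for illustration, not the algorithm `flexiplex-filter` implements. The function name and the toy counts are hypothetical.

```python
import math


def knee_threshold(counts):
    """Return the count at the knee of the ranked log-log abundance curve."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 3:
        # Too few points to define a knee: keep everything.
        return ranked[-1] if ranked else 0
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    def height_above_chord(i):
        # Signed (unnormalised) distance of point i above the end-to-end line.
        return dx * (ys[i] - ys[0]) - dy * (xs[i] - xs[0])

    knee = max(range(len(ranked)), key=height_above_chord)
    return ranked[knee]


# Toy data: four abundant clones followed by a low-count tail.
counts = {"AAAA": 1000, "CCCC": 900, "GGGG": 850, "TTTT": 800,
          "ACGT": 2, "TGCA": 1, "AACC": 1, "GGTT": 1, "CCAA": 1, "TTGG": 1}
threshold = knee_threshold(counts.values())
kept = {bc: n for bc, n in counts.items() if n >= threshold}
print(threshold, sorted(kept))
```

In this toy case every singleton falls below the threshold, which is why filtering is off by default for lineage tracing.
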
+## Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `mode` | `"scRNAseq"` | Workflow mode: `"scRNAseq"` or `"DNAseq"` |
+| `clone_barcodes_reference` | — | Path to known barcode whitelist (required when `discovery_mode = false`) |
+| `discovery_mode` | `false` | Enable two-pass barcode discovery mode |
+| `filter_discovered_barcodes` | `false` | Apply knee-plot filtering to discovered barcodes (see above) |
+| `barcode_edit_distance` | `2` | Maximum edit distance for barcode matching |
+| `adapter_edit_distance` | `6` | Maximum edit distance for flanking adapter matching |
+| `adapter_5prime` | — | 5′ flanking adapter sequence |
+| `adapter_3prime` | — | 3′ flanking adapter sequence |
+| `barcode_length` | `20` | Expected barcode length (bp) |
+| `n_chunks` | `2` | Number of read chunks for parallel processing |
+| `publish_dir` | `output/` | Output directory |
+| `report_title` | — | Custom title for the HTML report (defaults to date-stamped title) |
+
+## Output Files
+
+NextClone generates the following files in your `publish_dir`:
+
+| File | Description |
+|------|-------------|
+| `all_barcodes.txt` | **All discovered barcodes** with counts (no filtering). Header: `#barcode\tcount` |
+| `filtered_barcodes.txt` | Barcodes after filtering. Same as `all_barcodes.txt` if `filter_discovered_barcodes=false` |
+| `clone_barcodes.csv` | Final clone assignments to cells (for downstream analysis) |
+| `nextclone_qc_report.html` | Interactive QC dashboard |
+| `run_log.txt` | Run parameters and command line (for reproducibility) |
+
+**Note:** `all_barcodes.txt` contains ALL barcodes discovered in Pass 1, including singletons. This is useful for debugging and QC.
+
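For quick QC, `all_barcodes.txt` can be inspected with a few lines of Python. The sketch below assumes the file has exactly two tab-separated columns, as the `#barcode\tcount` header suggests. The function name and the example rows are made up.

```python
import csv
import io


def read_barcode_counts(handle):
    """Parse barcode/count rows, skipping blank lines and the '#' header."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        counts[row[0]] = int(row[1])
    return counts


# In practice, pass open("all_barcodes.txt") instead of this in-memory example.
example = io.StringIO(
    "#barcode\tcount\n"
    "ACGTACGTAC\t42\n"
    "TTGCATTGCA\t1\n"
    "GGCCAAGGCC\t1\n"
)
counts = read_barcode_counts(example)
singletons = [bc for bc, n in counts.items() if n == 1]
print(len(counts), "barcodes,", len(singletons), "singletons")
```
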
+## HTML Reports
+
+### Standard report (auto-generated)
+
+NextClone automatically generates an interactive HTML dashboard at the end of every run, saved to your `publish_dir` as `nextclone_qc_report.html`.
+
+**New in v2 (2026-04-09):**
+- **Clone overlap table** — shared clones across samples at different thresholds (≥5, 10, 15, 20, 50, 100 cells)
+- **Heterogeneity metrics** — Gini coefficient and Shannon index for each sample
+- **Clone size density plot** — KDE-style curve showing clone size distribution
+- **Reversed top 20 clones** — largest clones now at top (easier to read)
+
+**All charts included:**
+- Sample overview table (reads, cells, clones, Gini, Shannon)
+- Clone overlap across samples (new!)
+- Heterogeneity metrics summary (new!)
+- Ranked clone abundance (log scale, top 3 annotated)
+- Clone size density curve (new!)
+- Top 20 clones (horizontal bar, reversed, with % labels)
+- Edit distance QC (FlankEditDist & BarcodeEditDist)
+- Cross-sample clonality comparison
+
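The two heterogeneity metrics can be reproduced from a list of clone sizes (cells per clone). The report's exact definitions are not stated here, so this sketch assumes the common rank-based Gini coefficient and the natural-log Shannon index. The function names are illustrative.

```python
import math


def gini(sizes):
    """Gini coefficient: 0 for equal clone sizes, near 1 when one clone dominates."""
    xs = sorted(sizes)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def shannon(sizes):
    """Shannon index H = -sum(p * ln p) over clone proportions p."""
    total = sum(sizes)
    return -sum((x / total) * math.log(x / total) for x in sizes if x > 0)


even = [5, 5, 5, 5]       # four equally sized clones
skewed = [1, 1, 1, 97]    # one dominant clone
print(round(gini(even), 3), round(shannon(even), 3))
print(round(gini(skewed), 3), round(shannon(skewed), 3))
```

A higher Gini together with a lower Shannon index points to a sample dominated by a few large clones.
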
+To set a custom title:
+```bash
+nextflow run main.nf --report_title "My Experiment — ZR751 2026"
+```
+
+### Manual report generation (CLI)
+
+You can also generate reports manually from any `clone_barcodes.csv` file:
+
+```bash
+# Basic usage
+cd /path/to/nextclone/output
+python3 /path/to/NextClone/reports/generate_report.py clone_barcodes.csv
+
+# Custom output and title
+python3 /path/to/NextClone/reports/generate_report.py clone_barcodes.csv \
+  --output my_report.html \
+  --title "ZR751 Clonal Analysis — 2026-04-09"
+```
+
+**Command-line options:**
+```
+python3 generate_report.py <input_csv> [OPTIONS]
+
+Positional:
+  input_csv              Path to clone_barcodes.csv from NextClone output
+
+Options:
+  --output FILE          Output HTML file (default: report.html)
+  --title TEXT           Report title (default: "NextClone Report")
+  --help                 Show help message
+```
+
+For full documentation, see [`reports/README.md`](reports/README.md).
+
+## Output Management
+
+### Recommended Usage
+
+**Always use timestamped output directories** to prevent overwriting previous runs:
+
+```bash
+# DNA-seq mode
+nextflow run main.nf \
+    --mode DNAseq \
+    --dnaseq_fastq_files /path/to/fastq \
+    --discovery_mode true \
+    --filter_discovered_barcodes false \
+    --publish_dir "results_DNAseq_$(date +%Y-%m-%d_%H-%M-%S)"
+
+# scRNA-seq mode
+nextflow run main.nf \
+    --mode scRNAseq \
+    --scrnaseq_bam_files /path/to/bams \
+    --discovery_mode true \
+    --filter_discovered_barcodes false \
+    --publish_dir "results_scRNAseq_$(date +%Y-%m-%d_%H-%M-%S)"
+```
+
+**Example output:**
+```
+results_DNAseq_2026-04-10_11-45-22/
+├── all_barcodes.txt          # All discovered barcodes
+├── filtered_barcodes.txt     # Filtered barcodes (same as above if filter=false)
+├── clone_barcodes.csv        # Final clone assignments
+├── nextclone_qc_report.html  # Interactive QC dashboard
+└── run_log.txt               # Run parameters + software versions
+```
+
+### When to Clear Work Directory
+
+**Clear the `work/` directory only when:**
+- You have updated the NextClone code (to avoid reusing stale cached results)
+- Conda environments are corrupted
+- You are debugging unexpected behavior
+
+```bash
+# Clear work directory
+rm -rf work/
+
+# Clear conda cache (if needed)
+rm -rf /path/to/nextflow_local/conda_cache/
+```
+
+**For routine runs:** Keep `work/` and rerun with `-resume` so that Nextflow reuses cached task results and saves compute time.
+
+### Comparison report (manual)
+
+To compare two runs side by side (e.g. reference mode vs discovery mode), use the comparison script after both runs are complete:
+
+```bash
+python3 reports/generate_comparison_report.py \
+    /path/to/run_a/clone_barcodes.csv \
+    /path/to/run_b/clone_barcodes.csv \
+    --label-a "Reference" \
+    --label-b "Discovery" \
+    --output comparison_report.html \
+    --title "Reference vs Discovery — My Experiment"
+```
+
+The comparison report shows:
+- Δ reads, cells, and clones between the two runs
+- Per-sample ranked abundance overlay (both modes, log-scale)
+- Clone size distribution side by side
+- Top clone overlap (concordance between modes)
+- Clonality metrics comparison (top1%, top3%, top10%)
+- Cell recovery validation across samples
+
+> **No pip installs required.** Both report scripts use Python stdlib only, with Chart.js loaded via CDN.
+
 <!-- ## Citation -->
 
 <!-- If you use NextClone in your study, please kindly cite our preprint on bioRxiv. -->
diff --git a/conda_env/extract_dnaseq_env.yaml b/conda_env/extract_dnaseq_env.yaml
@@ -2,7 +2,6 @@ name: extract_dnaseq_env
 channels:
   - conda-forge
   - bioconda
-  - defaults
 dependencies:
   - python=3.8
   - Biopython

diff --git a/conda_env/extract_sc_env.yaml b/conda_env/extract_sc_env.yaml
@@ -2,11 +2,12 @@ name: extract_sc_env
 channels:
   - conda-forge
   - bioconda
-  - defaults
 dependencies:
   - python=3.8
   - pysam
   - pandas
   - numpy
   - Biopython
-
+  - pip
+  - pip:
+    - flexiplex-filter