diff --git a/README.md b/README.md
index 0ecf73b..87471e4 100644
--- a/README.md
+++ b/README.md
@@ -16,6 +16,206 @@ It is heavily optimised for usage in high-performance computing (HPC) platforms.
 
 For instructions on how to use *NextClone*, please visit the [user guide](https://phipsonlab.github.io/NextClone/).
 
+## Modes
+
+### Whitelist mode (default)
+
+Provide a list of known barcode sequences. Flexiplex maps all reads against the whitelist.
+
+```bash
+nextflow run main.nf --clone_barcodes_reference /path/to/barcodes.txt
+```
+
+### Discovery mode
+
+NextClone supports **discovery mode**, which identifies barcodes directly from the data without a pre-defined whitelist. This is useful when:
+
+- The exact barcode sequences are unknown
+- You are working with a new or custom clonal barcoding system
+- You want to validate or supplement a known barcode list
+
+Discovery mode uses a two-pass approach powered by [Flexiplex](https://github.com/DavidsonGroup/flexiplex):
+
+1. **Pass 1 (Discovery):** Run Flexiplex without a barcode list (`-k` flag) using strict flanking sequence matching (`-f 0`) to identify candidate barcodes.
+2. **Pass 2 (Mapping):** Run Flexiplex with the discovered barcode list using standard edit distance parameters.
+
+```bash
+nextflow run main.nf --discovery_mode true
+```
+
+#### Barcode filtering in discovery mode
+
+By default (`filter_discovered_barcodes = false`), **all barcodes discovered in Pass 1 are passed to Pass 2**, including singletons. This is recommended for lineage tracing experiments where rare clones are biologically meaningful.
+
+Setting `filter_discovered_barcodes = true` applies `flexiplex-filter` knee-plot inflection filtering, which removes low-count barcodes. Use this only for noisy datasets — **it will discard singleton and low-count clones**:
+
+```bash
+nextflow run main.nf --discovery_mode true --filter_discovered_barcodes true
+```
+
+## Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `mode` | `"scRNAseq"` | Workflow mode: `"scRNAseq"` or `"DNAseq"` |
+| `clone_barcodes_reference` | — | Path to known barcode whitelist (required when `discovery_mode = false`) |
+| `discovery_mode` | `false` | Enable two-pass barcode discovery mode |
+| `filter_discovered_barcodes` | `false` | Apply knee-plot filtering to discovered barcodes (see above) |
+| `barcode_edit_distance` | `2` | Maximum edit distance for barcode matching |
+| `adapter_edit_distance` | `6` | Maximum edit distance for flanking adapter matching |
+| `adapter_5prime` | — | 5′ flanking adapter sequence |
+| `adapter_3prime` | — | 3′ flanking adapter sequence |
+| `barcode_length` | `20` | Expected barcode length (bp) |
+| `n_chunks` | `2` | Number of read chunks for parallel processing |
+| `publish_dir` | `output/` | Output directory |
+| `report_title` | — | Custom title for the HTML report (defaults to date-stamped title) |
+
+## Output Files
+
+NextClone generates the following files in your `publish_dir`:
+
+| File | Description |
+|------|-------------|
+| `all_barcodes.txt` | **All discovered barcodes** with counts (no filtering). Header: `#barcode\tcount` |
+| `filtered_barcodes.txt` | Barcodes after filtering. Same as `all_barcodes.txt` if `filter_discovered_barcodes=false` |
+| `clone_barcodes.csv` | Final clone assignments to cells (for downstream analysis) |
+| `nextclone_qc_report.html` | Interactive QC dashboard |
+| `run_log.txt` | Run parameters and command line (for reproducibility) |
+
+**Note:** `all_barcodes.txt` contains ALL barcodes discovered in Pass 1, including singletons. This is useful for debugging and QC.
+
+## HTML Reports
+
+### Standard report (auto-generated)
+
+NextClone automatically generates an interactive HTML dashboard at the end of every run, saved to your `publish_dir` as `nextclone_qc_report.html`.
+
+**New in v2 (2026-04-09):**
+- **Clone overlap table** — shared clones across samples at different thresholds (≥5, 10, 15, 20, 50, 100 cells)
+- **Heterogeneity metrics** — Gini coefficient and Shannon index for each sample
+- **Clone size density plot** — KDE-style curve showing clone size distribution
+- **Reversed top 20 clones** — largest clones now at top (easier to read)
+
+**All charts included:**
+- Sample overview table (reads, cells, clones, Gini, Shannon)
+- Clone overlap across samples (new!)
+- Heterogeneity metrics summary (new!)
+- Ranked clone abundance (log scale, top 3 annotated)
+- Clone size density curve (new!)
+- Top 20 clones (horizontal bar, reversed, with % labels)
+- Edit distance QC (FlankEditDist & BarcodeEditDist)
+- Cross-sample clonality comparison
+
+To set a custom title:
+```bash
+nextflow run main.nf --report_title "My Experiment — ZR751 2026"
+```
+
+### Manual report generation (CLI)
+
+You can also generate reports manually from any `clone_barcodes.csv` file:
+
+```bash
+# Basic usage
+cd /path/to/nextclone/output
+python3 /path/to/NextClone/reports/generate_report.py clone_barcodes.csv
+
+# Custom output and title
+python3 reports/generate_report.py clone_barcodes.csv \
+  --output my_report.html \
+  --title "ZR751 Clonal Analysis — 2026-04-09"
+```
+
+**Command-line options:**
+```bash
+python3 generate_report.py <input_csv> [OPTIONS]
+
+Positional:
+  input_csv              Path to clone_barcodes.csv from NextClone output
+
+Options:
+  --output FILE          Output HTML file (default: report.html)
+  --title TEXT           Report title (default: "NextClone Report")
+  --help                 Show help message
+```
+
+For full documentation, see [`reports/README.md`](reports/README.md).
+
+## Output Management
+
+### Recommended Usage
+
+**Always use timestamped output directories** to prevent overwriting previous runs:
+
+```bash
+# DNA-seq mode
+nextflow run main.nf \\
+    --mode DNAseq \\
+    --dnaseq_fastq_files /path/to/fastq \\
+    --discovery_mode true \\
+    --filter_discovered_barcodes false \\
+    --publish_dir "results_DNAseq_$(date +%Y-%m-%d_%H-%M-%S)"
+
+# scRNA-seq mode
+nextflow run main.nf \\
+    --mode scRNAseq \\
+    --scrnaseq_bam_files /path/to/bams \\
+    --discovery_mode true \\
+    --filter_discovered_barcodes false \\
+    --publish_dir "results_scRNAseq_$(date +%Y-%m-%d_%H-%M-%S)"
+```
+
+**Example output:**
+```
+results_DNAseq_2026-04-10_11-45-22/
+├── all_barcodes.txt          # All discovered barcodes
+├── filtered_barcodes.txt     # Filtered barcodes (same as above if filter=false)
+├── clone_barcodes.csv        # Final clone assignments
+├── nextclone_qc_report.html  # Interactive QC dashboard
+└── run_log.txt               # Run parameters + software versions
+```
+
+### When to Clear Work Directory
+
+**Clear `work/` directory only when:**
+- Updating NextClone code (to avoid cached old results)
+- Conda environments are corrupted
+- Debugging unexpected behavior
+
+```bash
+# Clear work directory
+rm -rf work/
+
+# Clear conda cache (if needed)
+rm -rf /path/to/nextflow_local/conda_cache/
+```
+
+**For routine runs:** Keep `work/` to save compute time (Nextflow caches task results).
+
+### Comparison report (manual)
+
+To compare two runs side by side (e.g. reference mode vs discovery mode), use the comparison script after both runs are complete:
+
+```bash
+python3 reports/generate_comparison_report.py \
+    /path/to/run_a/clone_barcodes.csv \
+    /path/to/run_b/clone_barcodes.csv \
+    --label-a "Reference" \
+    --label-b "Discovery" \
+    --output comparison_report.html \
+    --title "Reference vs Discovery — My Experiment"
+```
+
+The comparison report shows:
+- Δ reads, cells, and clones between the two runs
+- Per-sample ranked abundance overlay (both modes, log-scale)
+- Clone size distribution side by side
+- Top clone overlap (concordance between modes)
+- Clonality metrics comparison (top1%, top3%, top10%)
+- Cell recovery validation across samples
+
+> **No pip installs required.** Both report scripts use Python stdlib only, with Chart.js loaded via CDN.
+
 <!-- ## Citation -->
 
 <!-- If you use NextClone in your study, please kindly cite our preprint on bioRxiv. -->
diff --git a/conda_env/extract_dnaseq_env.yaml b/conda_env/extract_dnaseq_env.yaml
index 5a00c0e..3a25afa 100644
--- a/conda_env/extract_dnaseq_env.yaml
+++ b/conda_env/extract_dnaseq_env.yaml
@@ -2,7 +2,6 @@ name: extract_dnaseq_env
 channels:
   - conda-forge
   - bioconda
-  - defaults
 dependencies:
   - python=3.8
   - Biopython
diff --git a/conda_env/extract_sc_env.yaml b/conda_env/extract_sc_env.yaml
index 95dd7bd..071cf1e 100644
--- a/conda_env/extract_sc_env.yaml
+++ b/conda_env/extract_sc_env.yaml
@@ -2,11 +2,12 @@ name: extract_sc_env
 channels:
   - conda-forge
   - bioconda
-  - defaults
 dependencies:
   - python=3.8
   - pysam
   - pandas
   - numpy
   - Biopython
-  
\ No newline at end of file
+  - pip
+  - pip:
+    - flexiplex-filter
diff --git a/main.nf b/main.nf
index 0ef467d..f58c43a 100644
--- a/main.nf
+++ b/main.nf
@@ -1,47 +1,173 @@
 #!/bin/bash nextflow
 
+// =============================================================================
+// NextClone - Clonal barcode extraction pipeline
+// Supports both DNAseq and scRNAseq modes
+// 
+// Two barcode identification approaches:
+// 1. Whitelist mode (default): Use known barcode reference (clone_barcodes_reference)
+// 2. Discovery mode: Two-pass approach to discover barcodes from data
+//    - Pass 1: Run Flexiplex without -k to discover barcodes (-f 0 for strict match)
+//    - Filter: Use flexiplex-filter (knee-plot method)
+//    - Pass 2: Run Flexiplex with the discovered/filtered barcode list
+// =============================================================================
+
 params.barcode_length_chr = '?' * params.barcode_length
 
+// Import DNAseq processes
 include { 
     dnaseq_trim_reads;
     dnaseq_filter_reads;
     dnaseq_count_reads;
     dnaseq_split_reads_to_chunks;
     dnaseq_map_barcodes;
+    dnaseq_discover_barcodes;
+    dnaseq_filter_discovered_barcodes;
+    dnaseq_map_with_discovered_barcodes;
     dnaseq_collapse_barcodes
 } from "./modules/extract_dnaseq_barcodes"
 
+// Import scRNAseq processes
 include { 
     sc_get_unmapped_reads;
     sc_remove_low_qual_reads;
     sc_retain_reads_with_CB_tag;
     sc_split_unmapped_reads;
     sc_map_unmapped_reads;
-    sc_merge_barcodes 
+    sc_discover_barcodes;
+    sc_merge_discovered_barcodes;
+    sc_map_with_discovered_barcodes;
+    sc_merge_barcodes;
+    generate_report;
+    generate_run_log
 } from "./modules/extract_sc_clone_barcodes"
 
 workflow {
 
+    // =============================================================================
+    // Parameter validation
+    // =============================================================================
+    
+    // Validate: discovery_mode = false requires clone_barcodes_reference
+    if (!params.discovery_mode && !params.clone_barcodes_reference) {
+        error """
+        ERROR: Parameter 'clone_barcodes_reference' is required when 'discovery_mode = false'.
+        
+        Either:
+        1. Provide a barcode whitelist: --clone_barcodes_reference /path/to/barcodes.txt
+        2. Enable discovery mode: --discovery_mode true
+        
+        See documentation for details: https://phipsonlab.github.io/NextClone/
+        """
+    }
+    
+    // Validate: discovery_mode = true should not use clone_barcodes_reference (warn if provided)
+    if (params.discovery_mode && params.clone_barcodes_reference) {
+        log.warn """
+        WARNING: 'clone_barcodes_reference' is ignored when 'discovery_mode = true'.
+        Barcodes will be discovered from the data instead.
+        """
+    }
+
+
+
     if (params.mode == 'DNAseq') {
-        ch_barcode_chunks = Channel.fromPath("${params.dnaseq_fastq_files}/*.fastq.gz") | 
-            dnaseq_trim_reads |
-            dnaseq_filter_reads |
-            dnaseq_count_reads |
-            dnaseq_split_reads_to_chunks
         
-        ch_barcode_mappings = dnaseq_map_barcodes(ch_barcode_chunks.flatten())
-        dnaseq_collapse_barcodes(ch_barcode_mappings.collect())
+        if (params.discovery_mode) {
+            // =========================================
+            // Discovery mode workflow for DNAseq
+            // =========================================
+            
+            // Preprocessing: trim and filter reads
+            ch_filtered_reads = Channel.fromPath("${params.dnaseq_fastq_files}/*.fastq.gz") | 
+                dnaseq_trim_reads |
+                dnaseq_filter_reads
+            
+            // Pass 1: Discover barcodes from filtered reads
+            ch_discovered = dnaseq_discover_barcodes(ch_filtered_reads)
+            
+            // Combine all discovered barcode counts and filter using knee-plot method
+            ch_filtered_barcodes = dnaseq_filter_discovered_barcodes(
+                ch_discovered.collectFile(name: 'combined_barcodes_counts.txt')
+            )
+            
+            // Pass 2: Re-read files, preprocess, split, and map with discovered barcodes
+            ch_barcode_chunks = Channel.fromPath("${params.dnaseq_fastq_files}/*.fastq.gz") |
+                dnaseq_count_reads |
+                dnaseq_split_reads_to_chunks
+            
+            ch_barcode_mappings = dnaseq_map_with_discovered_barcodes(
+                ch_barcode_chunks.flatten(),
+                ch_filtered_barcodes.first()
+            )
+            
+            dnaseq_collapse_barcodes(ch_barcode_mappings.collect())
+            
+        } else {
+            // =========================================
+            // Whitelist mode workflow (original behavior)
+            // =========================================
+            
+            ch_barcode_chunks = Channel.fromPath("${params.dnaseq_fastq_files}/*.fastq.gz") | 
+                dnaseq_trim_reads |
+                dnaseq_filter_reads |
+                dnaseq_count_reads |
+                dnaseq_split_reads_to_chunks
+            
+            ch_barcode_mappings = dnaseq_map_barcodes(ch_barcode_chunks.flatten())
+            dnaseq_collapse_barcodes(ch_barcode_mappings.collect())
+        }
 
     } 
     
     if (params.mode == 'scRNAseq') {
+        
+        // Initial preprocessing: get unmapped reads with cell barcodes
         ch_unmapped_fastas = Channel.fromPath("${params.scrnaseq_bam_files}/*.bam") | 
             sc_get_unmapped_reads |
             sc_remove_low_qual_reads |
             sc_retain_reads_with_CB_tag |
             sc_split_unmapped_reads
         
-        ch_mapped_fastas = sc_map_unmapped_reads(ch_unmapped_fastas[0].flatten())
-        sc_merge_barcodes(ch_mapped_fastas.collect())
+        if (params.discovery_mode) {
+            // =========================================
+            // Discovery mode workflow for scRNAseq
+            // =========================================
+            
+            // Pass 1: Discover barcodes from each chunk
+            ch_discovered = sc_discover_barcodes(ch_unmapped_fastas[0].flatten())
+            
+            // Combine and optionally filter discovered barcodes
+            // sc_merge_discovered_barcodes handles both cases via params.filter_discovered_barcodes:
+            // - false (default): --no-inflection keeps ALL discovered barcodes
+            // - true: knee-plot filtering removes low-count barcodes
+            // sc_merge_discovered_barcodes outputs TWO channels:
+            //   all_barcodes = all discovered barcodes (for QC/debugging)
+            //   filtered_barcodes = barcodes to use for Pass 2 mapping
+            ch_merged = sc_merge_discovered_barcodes(
+                ch_discovered.collect()
+            )
+            
+            // Pass 2: Map reads using FILTERED discovered barcode list
+            // Use named emit to be explicit about which file goes to mapping
+            ch_mapped_fastas = sc_map_with_discovered_barcodes(
+                ch_unmapped_fastas[0].flatten(),
+                ch_merged.filtered_barcodes.first()
+            )
+            
+            ch_clone_barcodes = sc_merge_barcodes(ch_mapped_fastas.collect())
+            generate_report(ch_clone_barcodes)
+            generate_run_log(ch_clone_barcodes)
+            
+        } else {
+            // =========================================
+            // Whitelist mode workflow (original behavior)
+            // =========================================
+            
+            ch_mapped_fastas = sc_map_unmapped_reads(ch_unmapped_fastas[0].flatten())
+            ch_clone_barcodes = sc_merge_barcodes(ch_mapped_fastas.collect())
+            generate_report(ch_clone_barcodes)
+            generate_run_log(ch_clone_barcodes)
+        }
     }
-}
\ No newline at end of file
+}
diff --git a/modules/extract_dnaseq_barcodes.nf b/modules/extract_dnaseq_barcodes.nf
index f36b628..a9155fd 100644
--- a/modules/extract_dnaseq_barcodes.nf
+++ b/modules/extract_dnaseq_barcodes.nf
@@ -1,5 +1,12 @@
 #!/usr/bin/env nextflow
 
+// =============================================================================
+// DNA-seq clone barcode extraction module
+// Supports two modes:
+// 1. Whitelist mode (default): Use known barcode reference
+// 2. Discovery mode: Two-pass approach to discover barcodes from data
+// =============================================================================
+
 process dnaseq_trim_reads {
     label 'medium'
     conda "${projectDir}/conda_env/extract_dnaseq_env.yaml"
@@ -96,7 +103,73 @@ process dnaseq_split_reads_to_chunks {
     """
 }
 
+// =============================================================================
+// Discovery mode processes for DNA-seq
+// =============================================================================
+
+process dnaseq_discover_barcodes {
+    // Pass 1: Run flexiplex WITHOUT -k to discover barcodes from data
+    // Uses -f 0 for strict flanking sequence match
+    label 'small'
+
+    input:
+        path fastq_file
+
+    output:
+        path "${sample_name}_barcodes_counts.txt"
+
+    script:
+    sample_name = fastq_file.getSimpleName()
+    fastq_w_adapter = sample_name + "_wDummyAdaptor.fq"
+    """
+    zcat $fastq_file | sed 's/^/START/g' | sed 's/START@/@/g' > ${fastq_w_adapter}
+    
+    # Run flexiplex in discovery mode (no -k flag)
+    flexiplex \
+        -x "START" \
+        -b ${params.barcode_length_chr} \
+        -u "" \
+        -x "" \
+        -f 0 \
+        -n $sample_name \
+        -p ${task.cpus} \
+        ${fastq_w_adapter}
+    
+    """
+}
+
+process dnaseq_filter_discovered_barcodes {
+    // Optionally filter discovered barcodes using flexiplex-filter knee-plot method
+    // When params.filter_discovered_barcodes = false (default), all discovered
+    // barcodes are kept using --no-inflection.
+    // When params.filter_discovered_barcodes = true, knee-plot filtering is applied.
+    label 'small'
+    
+    input:
+        path barcode_counts
+
+    output:
+        path "filtered_barcodes.txt"
+
+    """
+    #!/usr/bin/bash
+    
+    # Run flexiplex-filter:
+    # - filter_discovered_barcodes = false: --no-inflection keeps ALL discovered barcodes
+    # - filter_discovered_barcodes = true:  knee-plot filtering removes low-count barcodes
+    flexiplex-filter \
+        ${params.filter_discovered_barcodes ? '' : '--no-inflection'} \
+        --outfile filtered_barcodes.txt \
+        ${barcode_counts}
+    """
+}
+
+// =============================================================================
+// Mapping processes (whitelist and discovery mode)
+// =============================================================================
+
 process dnaseq_map_barcodes {
+    // Map barcodes using known reference (whitelist mode)
     // Ran flexiplex per fasta chunk
     // Then combine the counting of read (flexiplex discovery)
     // and the mapped barcode
@@ -133,6 +206,42 @@ process dnaseq_map_barcodes {
     """
 }
 
+process dnaseq_map_with_discovered_barcodes {
+    // Pass 2: Map barcodes using discovered/filtered barcode list
+    label "${params.mapping_process_profile}"
+    conda "${projectDir}/conda_env/extract_dnaseq_env.yaml"
+
+    input:
+        path unmapped_fasta
+        path discovered_barcodes
+
+    output:
+        path "${out_file}"
+
+    script:
+    sample_name = unmapped_fasta.getSimpleName()
+    mapped_chunk = sample_name + "_reads_barcodes.txt"
+    out_file = sample_name + "_mapped.csv"
+    """
+
+    flexiplex \
+        -x "START" \
+        -b ${params.barcode_length_chr} \
+        -u "" \
+        -x "" \
+        -f 0 \
+        -n ${sample_name} \
+        -k ${discovered_barcodes} \
+        -e ${params.barcode_edit_distance} \
+        -p ${task.cpus} \
+        ${unmapped_fasta}
+
+    dnaseq_combine_read_cnt_map.py --unmapped_chunk ${unmapped_fasta} \
+                                --mapped_chunk ${mapped_chunk} \
+                                --out_file ${out_file}
+    """
+}
+
 process dnaseq_collapse_barcodes {
     label 'small'
     conda "${projectDir}/conda_env/extract_dnaseq_env.yaml"
@@ -147,4 +256,4 @@ process dnaseq_collapse_barcodes {
     """
     dnaseq_count_barcodes.py . ${mapped_reads}
     """
-}
\ No newline at end of file
+}
diff --git a/modules/extract_sc_clone_barcodes.nf b/modules/extract_sc_clone_barcodes.nf
index ca74cec..87ff99d 100644
--- a/modules/extract_sc_clone_barcodes.nf
+++ b/modules/extract_sc_clone_barcodes.nf
@@ -1,5 +1,12 @@
 #!/usr/bin/env nextflow
 
+// =============================================================================
+// Single-cell RNA-seq clone barcode extraction module
+// Supports two modes:
+// 1. Whitelist mode (default): Use known barcode reference
+// 2. Discovery mode: Two-pass approach to discover barcodes from data
+// =============================================================================
+
 process sc_get_unmapped_reads {
     // Using sambamba
     module 'sambamba'
@@ -83,7 +90,118 @@ process sc_split_unmapped_reads {
     """
 }
 
+// =============================================================================
+// Discovery mode processes (Pass 1 and filtering)
+// =============================================================================
+
+process sc_discover_barcodes {
+    // Pass 1: Run flexiplex WITHOUT -k to discover barcodes from data
+    // Uses -f 0 for strict flanking sequence match to reduce errors
+    label "${params.mapping_process_profile}"
+    
+    input:
+        path unmapped_fasta
+
+    output:
+        path "${unmapped_fasta.baseName}_barcodes_counts.txt"
+
+    script:
+    """
+    #!/usr/bin/bash
+    
+    # Run flexiplex in discovery mode (no -k flag)
+    # -f 0: strict flanking sequence match (reduces barcode errors)
+    flexiplex \
+        -x "${params.adapter_5prime}" \
+        -b ${params.barcode_length_chr} \
+        -u "" \
+        -x "${params.adapter_3prime}" \
+        -f 0 \
+        -n ${unmapped_fasta.baseName} \
+        -p ${task.cpus} \
+        ${unmapped_fasta}
+    """
+}
+
+process sc_merge_discovered_barcodes {
+    // Merge barcode counts from all chunks and optionally filter using knee-plot
+    // When params.filter_discovered_barcodes = false (default), all discovered
+    // barcodes are kept (no filtering). This is recommended for lineage tracing 
+    // where singleton clones are biologically meaningful and should not be discarded.
+    // When params.filter_discovered_barcodes = true, the knee-plot inflection point
+    // method is used to remove low-count/noisy barcodes.
+    label 'small'
+    
+    input:
+        path barcode_counts_files
+
+    output:
+        path "all_barcodes.txt", emit: all_barcodes
+        path "filtered_barcodes.txt", emit: filtered_barcodes
+
+    """
+    #!/usr/bin/bash
+    set -e  # Exit immediately on any error
+    
+    echo "[SC_MERGE] ========================================" >&2
+    echo "[SC_MERGE] Starting sc_merge_discovered_barcodes" >&2
+    echo "[SC_MERGE] filter_discovered_barcodes=${params.filter_discovered_barcodes}" >&2
+    
+    # Count input files
+    n_chunks=0
+    for f in ${barcode_counts_files}; do
+        n_chunks=\$((n_chunks + 1))
+        echo "[SC_MERGE]   Chunk \$n_chunks: \$f (\$(wc -l < "\$f") lines)" >&2
+    done
+    echo "[SC_MERGE] Total chunks: \$n_chunks" >&2
+    
+    # Combine all barcode counts
+    echo "[SC_MERGE] Combining barcode counts..." >&2
+    cat ${barcode_counts_files} | awk '{counts[\$1] += \$2} END {for (bc in counts) print bc "\t" counts[bc]}' | sort -k2 -nr > combined_barcodes_counts.txt
+    
+    n_combined=\$(wc -l < combined_barcodes_counts.txt)
+    echo "[SC_MERGE] combined_barcodes_counts.txt: \$n_combined barcodes" >&2
+    head -5 combined_barcodes_counts.txt >&2
+    
+    if [ ! -s combined_barcodes_counts.txt ]; then
+        echo "[SC_MERGE ERROR] combined_barcodes_counts.txt is EMPTY!" >&2
+        exit 1
+    fi
+    
+    # Create all_barcodes.txt
+    echo "[SC_MERGE] Creating all_barcodes.txt..." >&2
+    echo -e "#barcode\tcount" > all_barcodes.txt
+    echo "# barcode: lineage tracing barcode sequence" >> all_barcodes.txt
+    echo "# count: number of reads supporting this barcode" >> all_barcodes.txt
+    cat combined_barcodes_counts.txt >> all_barcodes.txt
+    echo "[SC_MERGE] all_barcodes.txt: \$(wc -l < all_barcodes.txt) lines" >&2
+    
+    # Create filtered_barcodes.txt
+    if [ "${params.filter_discovered_barcodes}" = "true" ]; then
+        echo "[SC_MERGE] Running flexiplex-filter..." >&2
+        flexiplex-filter --outfile filtered_barcodes.txt.tmp combined_barcodes_counts.txt
+        # flexiplex-filter output is raw barcodes, use directly (no comment headers)
+        mv filtered_barcodes.txt.tmp filtered_barcodes.txt
+        echo "[SC_MERGE] filtered_barcodes.txt: \$(wc -l < filtered_barcodes.txt) lines" >&2
+    else
+        echo "[SC_MERGE] filter_discovered_barcodes=false - creating filtered_barcodes.txt (no headers)" >&2
+        # filtered_barcodes.txt must NOT have comment headers - flexiplex reads it as -k reference
+        cat combined_barcodes_counts.txt > filtered_barcodes.txt
+        echo "[SC_MERGE] filtered_barcodes.txt: \$(wc -l < filtered_barcodes.txt) lines" >&2
+    fi
+    
+    echo "[SC_MERGE] COMPLETED" >&2
+    ls -lh all_barcodes.txt filtered_barcodes.txt >&2
+    """
+
+}
+
+// =============================================================================
+// Mapping processes (Pass 2 for discovery mode, or single pass for whitelist mode)
+// =============================================================================
+
 process sc_map_unmapped_reads {
+    // Map reads to known barcode reference (whitelist mode)
     label "${params.mapping_process_profile}"
 
     input:
@@ -110,6 +228,35 @@ process sc_map_unmapped_reads {
     """
 }
 
+process sc_map_with_discovered_barcodes {
+    // Pass 2: Map reads using discovered/filtered barcode list
+    label "${params.mapping_process_profile}"
+
+    input:
+        path unmapped_fasta
+        path discovered_barcodes
+
+    output:
+        path "${unmapped_fasta.baseName}_reads_barcodes.txt"
+
+    """
+    #!/usr/bin/bash
+
+    flexiplex \
+            -x "${params.adapter_5prime}" \
+            -b ${params.barcode_length_chr} \
+            -u "" \
+            -x "${params.adapter_3prime}" \
+            -f ${params.adapter_edit_distance} \
+            -e ${params.barcode_edit_distance} \
+            -n ${unmapped_fasta.baseName} \
+            -k ${discovered_barcodes} \
+            -p ${task.cpus} \
+            ${unmapped_fasta}
+    
+    """
+}
+
 process sc_merge_barcodes {
     label 'small_mem'
     conda "${projectDir}/conda_env/extract_sc_env.yaml"
@@ -126,4 +273,99 @@ process sc_merge_barcodes {
     """
     sc_merge_clone_barcodes.py ${mapped_reads} ${outfile}
     """
-}
\ No newline at end of file
+}
+
+process generate_run_log {
+    // Generate run log with parameters and command line
+    // Saved to publish_dir for reproducibility
+    label 'small'
+    
+    publishDir params.publish_dir, mode: params.publish_dir_mode
+    
+    input:
+        path clone_barcodes
+    
+    output:
+        path "run_log.txt"
+    
+    script:
+        timestamp = new Date().format('yyyy-MM-dd HH:mm:ss')
+    """
+    # Get software versions
+    NF_VERSION=\$(nextflow -version 2>&1 | head -1 || echo "unknown")
+    FLEXIPLEX_VERSION=\$(flexiplex --version 2>&1 | head -1 || echo "unknown")
+    PYTHON_VERSION=\$(python3 --version 2>&1 || echo "unknown")
+    
+    # Get git info if available
+    GIT_COMMIT=\$(git rev-parse HEAD 2>/dev/null || echo "Not a git repo")
+    GIT_BRANCH=\$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown")
+    
+    cat > run_log.txt << EOF
+# NextClone Run Log
+# Generated: ${timestamp}
+
+## Software Versions
+Nextflow: \${NF_VERSION}
+Flexiplex: \${FLEXIPLEX_VERSION}
+Python: \${PYTHON_VERSION}
+
+## Code Version
+Git commit: \${GIT_COMMIT}
+Git branch: \${GIT_BRANCH}
+
+## Command
+nextflow run ${projectDir}/main.nf \\
+    --mode ${params.mode} \\
+    --discovery_mode ${params.discovery_mode} \\
+    --filter_discovered_barcodes ${params.filter_discovered_barcodes} \\
+    --barcode_edit_distance ${params.barcode_edit_distance} \\
+    --adapter_edit_distance ${params.adapter_edit_distance} \\
+    --n_chunks ${params.n_chunks} \\
+    --publish_dir ${params.publish_dir}
+
+## Parameters
+mode = ${params.mode}
+discovery_mode = ${params.discovery_mode}
+filter_discovered_barcodes = ${params.filter_discovered_barcodes}
+barcode_edit_distance = ${params.barcode_edit_distance}
+adapter_edit_distance = ${params.adapter_edit_distance}
+barcode_length = ${params.barcode_length}
+n_chunks = ${params.n_chunks}
+publish_dir = ${params.publish_dir}
+
+## Output Files
+- all_barcodes.txt: All discovered barcodes (no filtering)
+- filtered_barcodes.txt: Barcodes after filtering (same as all_barcodes.txt if filter_discovered_barcodes=false)
+- clone_barcodes.csv: Final clone assignments to cells
+- nextclone_qc_report.html: Interactive QC dashboard
+
+## Notes
+- all_barcodes.txt contains ALL barcodes discovered in Pass 1, including singletons
+- filtered_barcodes.txt applies knee-plot filtering only if filter_discovered_barcodes=true
+- For lineage tracing, we recommend filter_discovered_barcodes=false to retain rare clones
+EOF
+    """
+}
+
+process generate_report {
+    // Generate interactive HTML dashboard from clone_barcodes.csv
+    // Uses reports/generate_report.py (pure Python stdlib, no pip installs)
+    label 'small'
+
+    publishDir params.publish_dir, mode: params.publish_dir_mode
+
+    input:
+        path clone_barcodes
+
+    output:
+        path "nextclone_qc_report.html"
+
+    script:
+        title = params.report_title ?: "NextClone QC Report — ${new Date().format('yyyy-MM-dd')}"
+    """
+    python3 ${projectDir}/reports/generate_report.py \
+        ${clone_barcodes} \
+        --output nextclone_qc_report.html \
+        --title "${title}"
+    """
+}
diff --git a/nextflow.config b/nextflow.config
index 6b601ac..6492279 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -16,6 +16,25 @@ params {
     // mapping may need long time, so use either long_mapping or regular_mapping
     mapping_process_profile = "regular_mapping"
     
+    // Discovery mode: when true, barcodes are discovered from data using a two-pass approach
+    // Pass 1: Run Flexiplex without -k to discover barcodes (uses -f 0 for strict match)
+    // Filter: Use flexiplex-filter to get high-quality barcodes (knee-plot method)
+    // Pass 2: Run Flexiplex with the discovered/filtered barcode list
+    // When false (default), requires clone_barcodes_reference to be provided
+    discovery_mode = false
+
+    // filter_discovered_barcodes: applies knee-plot inflection filtering to discovered barcodes
+    // Set to false to keep ALL discovered barcodes (recommended when singletons matter,
+    // e.g. lineage tracing where rare clones are biologically meaningful)
+    // Set to true to apply knee-plot filtering (removes low-count barcodes)
+    filter_discovered_barcodes = false
+
+    // Title for the auto-generated HTML report (optional)
+    // Defaults to "NextClone Run — YYYY-MM-DD" if not set
+    report_title = ""
+    
+
+    
 
     // for DNA-seq data
     dnaseq_fastq_files = "${projectDir}/data/dnaseq_fastq_files"
@@ -33,7 +52,7 @@ params {
 
 conda {
     enabled = true
-    useMamba = false
+    useMamba = true  // Faster and more reliable than conda
     useMicromamba = false
     createOptions = '--yes'
 }
diff --git a/reports/README.md b/reports/README.md
new file mode 100644
index 0000000..1bb9c6b
--- /dev/null
+++ b/reports/README.md
@@ -0,0 +1,91 @@
+# NextClone Report Generator
+
+Self-contained Python scripts to generate interactive HTML dashboards from NextClone output. No external dependencies — pure Python stdlib + Chart.js via CDN.
+
+## Single-run report (v2)
+
+Generates a per-sample HTML dashboard from a single `clone_barcodes.csv`.
+
+### Quick Start
+
+```bash
+# Basic usage (outputs report.html)
+python3 generate_report.py clone_barcodes.csv
+
+# Custom output filename and title
+python3 generate_report.py clone_barcodes.csv \
+  --output my_report.html \
+  --title "ZR751 Clonal Analysis — 2026-04-09"
+```
+
+### From NextClone Output
+
+After running NextClone, generate the report from your results directory:
+
+```bash
+# If NextClone output is in results_discoverymode_260331/
+cd /path/to/nextclone/results_discoverymode_260331
+python3 /path/to/NextClone/reports/generate_report.py clone_barcodes.csv \
+  --output nextclone_qc_report.html \
+  --title "Discovery Mode — ZR751"
+```
+
+### Command-Line Options
+
+```bash
+python3 generate_report.py <input_csv> [OPTIONS]
+
+Positional:
+  input_csv              Path to clone_barcodes.csv from NextClone output
+
+Options:
+  --output FILE          Output HTML file (default: report.html)
+  --title TEXT           Report title (default: "NextClone Report")
+  --help                 Show help message and exit
+
+Examples:
+  # Default output (report.html)
+  python3 generate_report.py clone_barcodes.csv
+  
+  # Custom output and title
+  python3 generate_report.py clone_barcodes.csv \
+    --output qc_report.html \
+    --title "Sample ABC — Discovery Mode"
+```
+
+**New in v2 (2026-04-09):**
+- **Clone overlap table** — shows how many clones are shared across samples at different cell thresholds (≥5, 10, 15, 20, 50, 100 cells)
+- **Heterogeneity metrics** — Gini coefficient and Shannon index for each sample
+- **Clone size density plot** — KDE-style curve showing clone size distribution
+- **Reversed top 20 clones** — largest clones now at top of chart (easier to read)
+
+**All charts:**
+- Sample overview table (reads, cells, clones, Gini, Shannon)
+- Clone overlap across samples (new!)
+- Heterogeneity metrics summary (new!)
+- Ranked clone abundance (log scale, top 3 annotated)
+- Clone size density curve (new!)
+- Top 20 clones (horizontal bar, reversed, with % labels)
+- Edit distance QC (FlankEditDist + BarcodeEditDist)
+- Cross-sample clonality comparison
+
+## Comparison report
+
+Compares two runs side by side (e.g. reference mode vs discovery mode).
+
+```bash
+python3 generate_comparison_report.py run_a.csv run_b.csv \
+  --label-a "Reference" \
+  --label-b "Discovery" \
+  --output comparison.html \
+  --title "Reference vs Discovery — ZR751"
+```
+
+**Charts included:**
+- Summary header with Δ metrics (reads, cells, clones)
+- Per-sample delta table (click row to drill in)
+- Ranked abundance overlay (both modes on one log-scale plot)
+- Clone size distribution side by side
+- Top clone overlap (are the same clones found in both modes?)
+- Clonality metrics comparison (top1%, top3%, top10%)
+- Cross-sample clone count and cell recovery comparison
diff --git a/reports/generate_comparison_report.py b/reports/generate_comparison_report.py
new file mode 100644
index 0000000..a142939
--- /dev/null
+++ b/reports/generate_comparison_report.py
@@ -0,0 +1,881 @@
+#!/usr/bin/env python3
+"""
+NextClone Comparison Report Generator
+Reads two clone_barcodes.csv files and generates a self-contained HTML comparison dashboard.
+
+Usage:
+    python3 generate_comparison_report.py <csv_a> <csv_b> \
+        --label-a "Reference" --label-b "Discovery (No Filter)" \
+        --output report_comparison.html \
+        --title "NextClone: Reference vs Discovery Mode — ZR751"
+"""
+
+import argparse
+import csv
+import json
+import os
+import sys
+from collections import defaultdict
+from datetime import datetime
+
+
+# ---------------------------------------------------------------------------
+# Data loading & stats computation
+# ---------------------------------------------------------------------------
+
+def load_data(csv_path):
+    """Parse the CSV and return a dict of per-sample data structures."""
+    samples = defaultdict(lambda: {
+        "reads": 0,
+        "cells": set(),
+        "clone_cells": defaultdict(set),  # clone_barcode -> set of cell barcodes
+    })
+
+    with open(csv_path, newline="", encoding="utf-8") as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            sample = row["SourceBAMFile"]
+            cell = row["CellBarcode"]
+            clone = row["CloneBarcode"]
+
+            s = samples[sample]
+            s["reads"] += 1
+            s["cells"].add(cell)
+            s["clone_cells"][clone].add(cell)
+
+    return dict(samples)
+
+
+def compute_sample_stats(sample_data):
+    """Compute derived stats for a single sample dict."""
+    clone_cells = sample_data["clone_cells"]
+    total_cells = len(sample_data["cells"])
+
+    # Clone sizes: number of unique cells per clone
+    clone_sizes = {clone: len(cells) for clone, cells in clone_cells.items()}
+    sorted_clones = sorted(clone_sizes.items(), key=lambda x: -x[1])
+
+    n_clones = len(sorted_clones)
+    n_cells = total_cells
+
+    # Top1, Top3, Top10 %
+    def pct_top_n(n):
+        if n_cells == 0:
+            return 0.0
+        top_cells = sum(sz for _, sz in sorted_clones[:n])
+        return round(100.0 * top_cells / n_cells, 2)
+
+    top1_pct = pct_top_n(1)
+    top3_pct = pct_top_n(3)
+    top10_pct = pct_top_n(10)
+
+    # Ranked sizes for top 100 (log abundance plot)
+    ranked_sizes = [sz for _, sz in sorted_clones[:100]]
+
+    # Size buckets
+    buckets = {"singleton": 0, "small": 0, "medium": 0, "large": 0, "dominant": 0}
+    for _, sz in sorted_clones:
+        if sz == 1:
+            buckets["singleton"] += 1
+        elif sz <= 5:
+            buckets["small"] += 1
+        elif sz <= 20:
+            buckets["medium"] += 1
+        elif sz <= 100:
+            buckets["large"] += 1
+        else:
+            buckets["dominant"] += 1
+
+    # Top 20 clones
+    top20 = []
+    for clone, sz in sorted_clones[:20]:
+        pct = round(100.0 * sz / n_cells, 2) if n_cells > 0 else 0.0
+        top20.append({
+            "barcode": clone[:12] + "…" if len(clone) > 12 else clone,
+            "barcode_full": clone,
+            "n_cells": sz,
+            "pct": pct,
+        })
+
+    return {
+        "reads": sample_data["reads"],
+        "cells": n_cells,
+        "clones": n_clones,
+        "top1_pct": top1_pct,
+        "top3_pct": top3_pct,
+        "top10_pct": top10_pct,
+        "ranked_sizes": ranked_sizes,
+        "buckets": buckets,
+        "top20": top20,
+        "clone_sizes": clone_sizes,  # full dict for cross-run lookup
+    }
+
+
+def build_comparison_data(data_a, data_b, label_a, label_b):
+    """Build the full comparison dataset for the HTML template."""
+    samples_a = {name: compute_sample_stats(sd) for name, sd in data_a.items()}
+    samples_b = {name: compute_sample_stats(sd) for name, sd in data_b.items()}
+
+    all_samples = sorted(set(list(samples_a.keys()) + list(samples_b.keys())))
+
+    # Per-sample comparison rows
+    sample_rows = []
+    for sample in all_samples:
+        sa = samples_a.get(sample)
+        sb = samples_b.get(sample)
+
+        def delta_pct(a, b):
+            if a is None or b is None or a == 0:
+                return None
+            return round(100.0 * (b - a) / a, 1)
+
+        row = {
+            "sample": sample,
+            "reads_a": sa["reads"] if sa else 0,
+            "reads_b": sb["reads"] if sb else 0,
+            "delta_reads": delta_pct(sa["reads"] if sa else None, sb["reads"] if sb else None),
+            "cells_a": sa["cells"] if sa else 0,
+            "cells_b": sb["cells"] if sb else 0,
+            "delta_cells": delta_pct(sa["cells"] if sa else None, sb["cells"] if sb else None),
+            "clones_a": sa["clones"] if sa else 0,
+            "clones_b": sb["clones"] if sb else 0,
+            "delta_clones": delta_pct(sa["clones"] if sa else None, sb["clones"] if sb else None),
+        }
+        sample_rows.append(row)
+
+    # Per-sample detail data for charts
+    sample_detail = {}
+    for sample in all_samples:
+        sa = samples_a.get(sample, {})
+        sb = samples_b.get(sample, {})
+
+        # Top clones overlap: top 10 from A, look up in B
+        top10_clones_a = sa.get("top20", [])[:10]
+        clone_sizes_b = sb.get("clone_sizes", {})
+
+        overlap = []
+        for cl in top10_clones_a:
+            full_bc = cl["barcode_full"]
+            cells_b = clone_sizes_b.get(full_bc, 0)
+            overlap.append({
+                "label": cl["barcode"],
+                "cells_a": cl["n_cells"],
+                "cells_b": cells_b,
+            })
+
+        sample_detail[sample] = {
+            "ranked_a": sa.get("ranked_sizes", []),
+            "ranked_b": sb.get("ranked_sizes", []),
+            "clones_a": sa.get("clones", 0),
+            "clones_b": sb.get("clones", 0),
+            "buckets_a": sa.get("buckets", {}),
+            "buckets_b": sb.get("buckets", {}),
+            "overlap": overlap,
+            "top1_a": sa.get("top1_pct", 0),
+            "top3_a": sa.get("top3_pct", 0),
+            "top10_a": sa.get("top10_pct", 0),
+            "top1_b": sb.get("top1_pct", 0),
+            "top3_b": sb.get("top3_pct", 0),
+            "top10_b": sb.get("top10_pct", 0),
+        }
+
+    # Global summary totals
+    total_reads_a = sum(s["reads"] for s in samples_a.values())
+    total_reads_b = sum(s["reads"] for s in samples_b.values())
+    total_cells_a = sum(s["cells"] for s in samples_a.values())
+    total_cells_b = sum(s["cells"] for s in samples_b.values())
+    total_clones_a = sum(s["clones"] for s in samples_a.values())
+    total_clones_b = sum(s["clones"] for s in samples_b.values())
+
+    def fmt_delta(a, b):
+        if a == 0:
+            return "N/A"
+        d = 100.0 * (b - a) / a
+        sign = "+" if d > 0 else ""
+        return f"{sign}{d:.1f}%"
+
+    summary = {
+        "total_reads_a": total_reads_a,
+        "total_reads_b": total_reads_b,
+        "delta_reads": fmt_delta(total_reads_a, total_reads_b),
+        "total_cells_a": total_cells_a,
+        "total_cells_b": total_cells_b,
+        "delta_cells": fmt_delta(total_cells_a, total_cells_b),
+        "total_clones_a": total_clones_a,
+        "total_clones_b": total_clones_b,
+        "delta_clones": fmt_delta(total_clones_a, total_clones_b),
+        "samples_a": len(samples_a),
+        "samples_b": len(samples_b),
+    }
+
+    # Section 3 cross-sample chart data (sorted by clones_a desc)
+    cross_sorted = sorted(sample_rows, key=lambda r: -r["clones_a"])
+
+    return {
+        "summary": summary,
+        "sample_rows": sample_rows,
+        "sample_detail": sample_detail,
+        "all_samples": all_samples,
+        "cross_sorted": cross_sorted,
+        "label_a": label_a,
+        "label_b": label_b,
+    }
+
+
+# ---------------------------------------------------------------------------
+# HTML generation
+# ---------------------------------------------------------------------------
+
+HTML_TEMPLATE = r"""<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8"/>
+<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
+<title>{title}</title>
+<link rel="preconnect" href="https://fonts.googleapis.com"/>
+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet"/>
+<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
+<style>
+* {{ box-sizing: border-box; margin: 0; padding: 0; }}
+body {{
+  font-family: 'Inter', sans-serif;
+  background: #F1F5F9;
+  color: #1E293B;
+  font-size: 14px;
+  line-height: 1.5;
+}}
+a {{ color: inherit; text-decoration: none; }}
+
+/* ── Layout ── */
+.container {{ max-width: 1280px; margin: 0 auto; padding: 24px 20px; }}
+.card {{
+  background: #fff;
+  border-radius: 12px;
+  box-shadow: 0 1px 4px rgba(0,0,0,.06), 0 4px 16px rgba(0,0,0,.04);
+  padding: 24px;
+  margin-bottom: 24px;
+}}
+
+/* ── Header ── */
+.header {{
+  background: linear-gradient(135deg, #1E3A5F 0%, #1E40AF 100%);
+  color: #fff;
+  border-radius: 12px;
+  padding: 28px 28px 20px;
+  margin-bottom: 24px;
+}}
+.header h1 {{ font-size: 22px; font-weight: 700; margin-bottom: 6px; }}
+.header .meta {{ font-size: 12px; opacity: .75; margin-bottom: 14px; }}
+.badges {{ display: flex; gap: 10px; margin-bottom: 18px; flex-wrap: wrap; }}
+.badge {{
+  display: inline-flex; align-items: center; gap: 6px;
+  padding: 5px 12px; border-radius: 99px; font-size: 12px; font-weight: 600;
+}}
+.badge-a {{ background: #2563EB; color: #fff; }}
+.badge-b {{ background: #16A34A; color: #fff; }}
+
+/* ── Summary bar ── */
+.summary-bar {{
+  display: grid; grid-template-columns: repeat(4, 1fr); gap: 12px;
+}}
+@media (max-width: 700px) {{ .summary-bar {{ grid-template-columns: repeat(2, 1fr); }} }}
+.summary-metric {{
+  background: rgba(255,255,255,.12);
+  border-radius: 10px;
+  padding: 12px 14px;
+  border: 1px solid rgba(255,255,255,.18);
+}}
+.summary-metric .label {{ font-size: 11px; opacity: .8; text-transform: uppercase; letter-spacing: .04em; margin-bottom: 4px; }}
+.summary-metric .vals {{ font-size: 13px; font-weight: 600; margin-bottom: 2px; }}
+.summary-metric .delta {{ font-size: 12px; font-weight: 500; }}
+.delta-red {{ color: #FCA5A5; }}
+.delta-green {{ color: #86EFAC; }}
+.delta-gray {{ color: rgba(255,255,255,.6); }}
+
+/* ── Section title ── */
+.section-title {{
+  font-size: 16px; font-weight: 700; color: #1E293B;
+  margin-bottom: 16px; padding-bottom: 10px;
+  border-bottom: 2px solid #E2E8F0;
+  display: flex; align-items: center; gap: 8px;
+}}
+.section-num {{
+  background: #2563EB; color: #fff;
+  width: 24px; height: 24px; border-radius: 50%;
+  display: inline-flex; align-items: center; justify-content: center;
+  font-size: 12px; font-weight: 700; flex-shrink: 0;
+}}
+
+/* ── Table ── */
+table {{ width: 100%; border-collapse: collapse; font-size: 13px; }}
+thead th {{
+  background: #F8FAFC; color: #64748B; font-weight: 600;
+  text-transform: uppercase; font-size: 11px; letter-spacing: .04em;
+  padding: 10px 12px; text-align: left; border-bottom: 2px solid #E2E8F0;
+}}
+tbody tr {{ cursor: pointer; transition: background .15s; }}
+tbody tr:hover {{ background: #EFF6FF; }}
+tbody tr.selected {{ background: #DBEAFE; }}
+tbody td {{ padding: 10px 12px; border-bottom: 1px solid #F1F5F9; }}
+.num {{ text-align: right; font-variant-numeric: tabular-nums; }}
+
+/* ── Delta pills ── */
+.pill {{
+  display: inline-block; padding: 2px 8px; border-radius: 99px;
+  font-size: 11px; font-weight: 600; white-space: nowrap;
+}}
+.pill-red {{ background: #FEE2E2; color: #DC2626; }}
+.pill-green {{ background: #DCFCE7; color: #16A34A; }}
+.pill-gray {{ background: #F1F5F9; color: #64748B; }}
+.pill-large-red {{ background: #FEE2E2; color: #9B1C1C; font-size: 12px; font-weight: 700; }}
+
+/* ── Per-sample detail ── */
+#sample-detail {{ display: none; }}
+#sample-detail.visible {{ display: block; }}
+.sample-heading {{
+  font-size: 18px; font-weight: 700; color: #1E293B; margin-bottom: 20px;
+  display: flex; align-items: center; gap: 10px;
+}}
+.sample-heading-tag {{
+  background: #EFF6FF; color: #2563EB;
+  border-radius: 8px; padding: 4px 12px;
+  font-size: 14px; font-weight: 600;
+}}
+.chart-grid {{
+  display: grid; grid-template-columns: 1fr 1fr; gap: 20px;
+}}
+@media (max-width: 900px) {{ .chart-grid {{ grid-template-columns: 1fr; }} }}
+.chart-card {{
+  background: #FAFAFA; border-radius: 10px;
+  border: 1px solid #E2E8F0; padding: 16px;
+}}
+.chart-card h4 {{ font-size: 13px; font-weight: 600; color: #475569; margin-bottom: 12px; }}
+.chart-wrap {{ position: relative; height: 260px; }}
+
+/* ── Section 3 ── */
+.cross-grid {{ display: grid; grid-template-columns: 1fr 1fr; gap: 20px; }}
+@media (max-width: 900px) {{ .cross-grid {{ grid-template-columns: 1fr; }} }}
+.cross-note {{
+  font-size: 12px; color: #64748B; background: #F8FAFC;
+  border-left: 3px solid #2563EB; padding: 8px 12px;
+  border-radius: 0 6px 6px 0; margin-top: 10px; line-height: 1.6;
+}}
+
+/* ── Legend chips ── */
+.legend {{ display: flex; gap: 14px; margin-bottom: 10px; flex-wrap: wrap; }}
+.legend-item {{ display: flex; align-items: center; gap: 5px; font-size: 12px; color: #475569; }}
+.legend-dot {{ width: 12px; height: 12px; border-radius: 3px; flex-shrink: 0; }}
+</style>
+</head>
+<body>
+<div class="container">
+
+<!-- ── HEADER ── -->
+<div class="header">
+  <h1>{title}</h1>
+  <div class="meta">
+    Run A: <strong>{file_a}</strong> &nbsp;·&nbsp;
+    Run B: <strong>{file_b}</strong> &nbsp;·&nbsp;
+    Generated: <strong>{date}</strong>
+  </div>
+  <div class="badges">
+    <span class="badge badge-a">● {label_a}</span>
+    <span class="badge badge-b">● {label_b}</span>
+  </div>
+  <div class="summary-bar" id="summary-bar"></div>
+</div>
+
+<!-- ── SECTION 1: Overview Table ── -->
+<div class="card">
+  <div class="section-title"><span class="section-num">1</span> Sample Overview Comparison</div>
+  <div class="legend">
+    <div class="legend-item"><div class="legend-dot" style="background:#2563EB"></div> {label_a}</div>
+    <div class="legend-item"><div class="legend-dot" style="background:#16A34A"></div> {label_b}</div>
+  </div>
+  <table id="overview-table">
+    <thead>
+      <tr>
+        <th>Sample</th>
+        <th class="num">Reads A</th><th class="num">Reads B</th><th class="num">Δ Reads</th>
+        <th class="num">Cells A</th><th class="num">Cells B</th><th class="num">Δ Cells</th>
+        <th class="num">Clones A</th><th class="num">Clones B</th><th class="num">Δ Clones</th>
+      </tr>
+    </thead>
+    <tbody id="overview-tbody"></tbody>
+  </table>
+  <p style="font-size:12px;color:#94A3B8;margin-top:10px;">Click a row or use the dropdown below to view per-sample detail.</p>
+</div>
+
+<!-- ── SECTION 2: Per-sample detail ── -->
+<div class="card" id="sample-detail">
+  <div class="section-title"><span class="section-num">2</span> Per-Sample Detail</div>
+  <div style="margin-bottom:16px;">
+    <label for="sample-selector" style="font-size:13px;font-weight:600;color:#475569;margin-right:10px;">Sample:</label>
+    <select id="sample-selector" onchange="onSelectorChange(this.value)" style="font-family:inherit;font-size:13px;padding:6px 12px;border:1px solid #CBD5E1;border-radius:6px;background:#fff;color:#1E293B;cursor:pointer;">
+      <option value="">— select a sample —</option>
+    </select>
+  </div>
+  <span id="selected-sample-name" style="display:none;"></span>
+  <div class="legend" style="margin-bottom:16px;">
+    <div class="legend-item"><div class="legend-dot" style="background:#2563EB"></div> {label_a}</div>
+    <div class="legend-item"><div class="legend-dot" style="background:#16A34A"></div> {label_b}</div>
+  </div>
+  <div class="chart-grid">
+    <div class="chart-card">
+      <h4>A) Ranked Clone Abundance (log scale)</h4>
+      <div class="chart-wrap"><canvas id="chart-abundance"></canvas></div>
+    </div>
+    <div class="chart-card">
+      <h4>B) Clone Size Distribution</h4>
+      <div class="chart-wrap"><canvas id="chart-size-dist"></canvas></div>
+    </div>
+    <div class="chart-card">
+      <h4>C) Top 10 Clones Overlap</h4>
+      <div class="chart-wrap"><canvas id="chart-overlap"></canvas></div>
+    </div>
+    <div class="chart-card">
+      <h4>D) Clonality Metrics</h4>
+      <div class="chart-wrap"><canvas id="chart-clonality"></canvas></div>
+    </div>
+  </div>
+</div>
+
+<!-- ── SECTION 3: Cross-sample summary ── -->
+<div class="card">
+  <div class="section-title"><span class="section-num">3</span> Cross-Sample Summary</div>
+  <div class="legend" style="margin-bottom:16px;">
+    <div class="legend-item"><div class="legend-dot" style="background:#2563EB"></div> {label_a}</div>
+    <div class="legend-item"><div class="legend-dot" style="background:#16A34A"></div> {label_b}</div>
+  </div>
+  <div class="cross-grid">
+    <div>
+      <div style="font-size:13px;font-weight:600;color:#475569;margin-bottom:10px;">E) Clone Count per Sample</div>
+      <div style="position:relative;height:340px;"><canvas id="chart-cross-clones"></canvas></div>
+      <div class="cross-note">
+        Reference mode uses a complete barcode library whitelist; Discovery mode identifies barcodes de novo from data above a detection threshold.
+      </div>
+    </div>
+    <div>
+      <div style="font-size:13px;font-weight:600;color:#475569;margin-bottom:10px;">F) Cell Recovery per Sample</div>
+      <div style="position:relative;height:340px;"><canvas id="chart-cross-cells"></canvas></div>
+      <div class="cross-note">
+        Cell counts are similar across modes (~90% recovery), validating that the core clonal architecture is preserved even in discovery mode.
+      </div>
+    </div>
+  </div>
+</div>
+
+</div><!-- /container -->
+
+<script>
+// ── Injected data ──
+const DATA = {data_json};
+const LABEL_A = DATA.label_a;
+const LABEL_B = DATA.label_b;
+const COLOR_A = '#2563EB';
+const COLOR_B = '#16A34A';
+const COLOR_A_ALPHA = 'rgba(37,99,235,0.15)';
+const COLOR_B_ALPHA = 'rgba(22,163,74,0.15)';
+
+// ── Utility ──
+function fmt(n) {{
+  if (n === null || n === undefined) return '—';
+  return n.toLocaleString();
+}}
+
+function deltaClass(d) {{
+  if (d === null) return 'gray';
+  if (d <= -5) return 'red';
+  if (d >= 5) return 'green';
+  return 'gray';
+}}
+
+function deltaPill(d, large) {{
+  if (d === null) return '<span class="pill pill-gray">—</span>';
+  const sign = d > 0 ? '+' : '';
+  const cls = d <= -5 ? (large ? 'pill-large-red' : 'pill-red') :
+              d >= 5  ? 'pill-green' : 'pill-gray';
+  return `<span class="pill ${{cls}}">${{sign}}${{d}}%</span>`;
+}}
+
+// ── Summary bar ──
+(function() {{
+  const s = DATA.summary;
+  function metricHTML(label, va, vb, delta) {{
+    const dc = delta.startsWith('+') ? 'delta-green' :
+               delta.startsWith('-') ? 'delta-red' : 'delta-gray';
+    return `<div class="summary-metric">
+      <div class="label">${{label}}</div>
+      <div class="vals">${{va}} vs ${{vb}}</div>
+      <div class="delta ${{dc}}">${{delta}}</div>
+    </div>`;
+  }}
+  const bar = document.getElementById('summary-bar');
+  bar.innerHTML =
+    metricHTML('Total Reads', fmt(s.total_reads_a), fmt(s.total_reads_b), s.delta_reads) +
+    metricHTML('Total Cells', fmt(s.total_cells_a), fmt(s.total_cells_b), s.delta_cells) +
+    metricHTML('Total Clones', fmt(s.total_clones_a), fmt(s.total_clones_b), s.delta_clones) +
+    metricHTML('Samples', s.samples_a, s.samples_b, s.samples_a === s.samples_b ? '=' : '≠');
+}})();
+
+// ── Populate sample selector dropdown ──
+(function() {{
+  const sel = document.getElementById('sample-selector');
+  DATA.sample_rows.forEach(row => {{
+    const opt = document.createElement('option');
+    opt.value = row.sample;
+    opt.textContent = row.sample;
+    sel.appendChild(opt);
+  }});
+}})();
+
+function onSelectorChange(sample) {{
+  if (!sample) return;
+  const tr = document.querySelector(`#overview-tbody tr[data-sample="${{sample}}"]`);
+  selectSample(sample, tr);
+}}
+
+// ── Overview table ──
+(function() {{
+  const tbody = document.getElementById('overview-tbody');
+  DATA.sample_rows.forEach((row, i) => {{
+    const tr = document.createElement('tr');
+    tr.dataset.sample = row.sample;
+    tr.innerHTML = `
+      <td><strong>${{row.sample}}</strong></td>
+      <td class="num">${{fmt(row.reads_a)}}</td>
+      <td class="num">${{fmt(row.reads_b)}}</td>
+      <td class="num">${{deltaPill(row.delta_reads, false)}}</td>
+      <td class="num">${{fmt(row.cells_a)}}</td>
+      <td class="num">${{fmt(row.cells_b)}}</td>
+      <td class="num">${{deltaPill(row.delta_cells, false)}}</td>
+      <td class="num">${{fmt(row.clones_a)}}</td>
+      <td class="num">${{fmt(row.clones_b)}}</td>
+      <td class="num">${{deltaPill(row.delta_clones, true)}}</td>
+    `;
+    tr.addEventListener('click', () => {{
+      document.getElementById('sample-selector').value = row.sample;
+      selectSample(row.sample, tr);
+    }});
+    tbody.appendChild(tr);
+  }});
+}})();
+
+// Auto-select first sample after page fully loads (ensures Chart.js is ready
+// and canvases have non-zero dimensions)
+window.addEventListener('load', function() {{
+  if (DATA.sample_rows.length > 0) {{
+    const first = DATA.sample_rows[0];
+    const sel = document.getElementById('sample-selector');
+    sel.value = first.sample;
+    const firstTr = document.querySelector('#overview-tbody tr');
+    // Select without scrolling on initial load
+    if (firstTr) firstTr.classList.add('selected');
+    document.getElementById('sample-selector').value = first.sample;
+    document.getElementById('selected-sample-name').textContent = first.sample;
+    
+    document.getElementById('sample-detail').classList.add('visible');
+    renderSampleCharts(first.sample);
+  }}
+}});
+
+// ── Chart instances ──
+let chartAbundance = null, chartSizeDist = null, chartOverlap = null, chartClonality = null;
+
+function destroyCharts() {{
+  [chartAbundance, chartSizeDist, chartOverlap, chartClonality].forEach(c => {{ if (c) c.destroy(); }});
+}}
+
+function selectSample(sample, tr, scroll=true) {{
+  // Highlight row
+  document.querySelectorAll('#overview-tbody tr').forEach(r => r.classList.remove('selected'));
+  if (tr) tr.classList.add('selected');
+
+  // Sync dropdown
+  document.getElementById('sample-selector').value = sample;
+
+  // Show sample name heading
+  document.getElementById('selected-sample-name').textContent = sample;
+  
+
+  const detail = document.getElementById('sample-detail');
+  detail.classList.add('visible');
+  if (scroll) detail.scrollIntoView({{ behavior: 'smooth', block: 'start' }});
+
+  destroyCharts();
+  renderSampleCharts(sample);
+}}
+
+function renderSampleCharts(sample) {{
+  const d = DATA.sample_detail[sample];
+  if (!d) return;
+
+  // ── A) Ranked Clone Abundance ──
+  const ranksA = d.ranked_a.map((v, i) => ({{ x: i+1, y: v }}));
+  const ranksB = d.ranked_b.map((v, i) => ({{ x: i+1, y: v }}));
+  chartAbundance = new Chart(document.getElementById('chart-abundance'), {{
+    type: 'line',
+    data: {{
+      datasets: [
+        {{
+          label: `${{LABEL_A}} (${{d.clones_a}} clones)`,
+          data: ranksA,
+          borderColor: COLOR_A,
+          backgroundColor: COLOR_A_ALPHA,
+          pointRadius: 0,
+          borderWidth: 2.5,
+          tension: 0.1,
+          fill: false,
+        }},
+        {{
+          label: `${{LABEL_B}} (${{d.clones_b}} clones)`,
+          data: ranksB,
+          borderColor: COLOR_B,
+          backgroundColor: COLOR_B_ALPHA,
+          pointRadius: 0,
+          borderWidth: 2.5,
+          tension: 0.1,
+          fill: false,
+        }},
+      ]
+    }},
+    options: {{
+      responsive: true, maintainAspectRatio: false,
+      scales: {{
+        x: {{ type: 'linear', title: {{ display: true, text: 'Clone Rank', font: {{ size: 11 }} }} }},
+        y: {{
+          type: 'logarithmic',
+          title: {{ display: true, text: 'Cells (log)', font: {{ size: 11 }} }},
+          ticks: {{
+            callback: v => Number.isInteger(Math.log10(v)) ? v : ''
+          }}
+        }}
+      }},
+      plugins: {{
+        legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }},
+        tooltip: {{ mode: 'index', intersect: false }},
+      }}
+    }}
+  }});
+
+  // ── B) Clone Size Distribution ──
+  const bucketLabels = ['Singleton\n(=1)', 'Small\n(2-5)', 'Medium\n(6-20)', 'Large\n(21-100)', 'Dominant\n(>100)'];
+  const bucketKeys = ['singleton', 'small', 'medium', 'large', 'dominant'];
+  chartSizeDist = new Chart(document.getElementById('chart-size-dist'), {{
+    type: 'bar',
+    data: {{
+      labels: ['Singleton (=1)', 'Small (2-5)', 'Medium (6-20)', 'Large (21-100)', 'Dominant (>100)'],
+      datasets: [
+        {{
+          label: LABEL_A,
+          data: bucketKeys.map(k => d.buckets_a[k] || 0),
+          backgroundColor: COLOR_A,
+          borderRadius: 4,
+        }},
+        {{
+          label: LABEL_B,
+          data: bucketKeys.map(k => d.buckets_b[k] || 0),
+          backgroundColor: COLOR_B,
+          borderRadius: 4,
+        }},
+      ]
+    }},
+    options: {{
+      responsive: true, maintainAspectRatio: false,
+      plugins: {{ legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }} }},
+      scales: {{
+        x: {{ ticks: {{ font: {{ size: 10 }}, maxRotation: 0 }} }},
+        y: {{ title: {{ display: true, text: '# Clones', font: {{ size: 11 }} }} }}
+      }}
+    }}
+  }});
+
+  // ── C) Top Clones Overlap ──
+  const overlapLabels = d.overlap.map(o => o.label);
+  chartOverlap = new Chart(document.getElementById('chart-overlap'), {{
+    type: 'bar',
+    data: {{
+      labels: overlapLabels,
+      datasets: [
+        {{
+          label: LABEL_A,
+          data: d.overlap.map(o => o.cells_a),
+          backgroundColor: COLOR_A,
+          borderRadius: 4,
+        }},
+        {{
+          label: LABEL_B,
+          data: d.overlap.map(o => o.cells_b),
+          backgroundColor: COLOR_B,
+          borderRadius: 4,
+        }},
+      ]
+    }},
+    options: {{
+      indexAxis: 'y',
+      responsive: true, maintainAspectRatio: false,
+      plugins: {{ legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }} }},
+      scales: {{
+        x: {{ title: {{ display: true, text: 'Cells', font: {{ size: 11 }} }} }},
+        y: {{ ticks: {{ font: {{ size: 10 }} }} }}
+      }}
+    }}
+  }});
+
+  // ── D) Clonality Metrics ──
+  chartClonality = new Chart(document.getElementById('chart-clonality'), {{
+    type: 'bar',
+    data: {{
+      labels: ['Top 1%', 'Top 3%', 'Top 10%'],
+      datasets: [
+        {{
+          label: LABEL_A,
+          data: [d.top1_a, d.top3_a, d.top10_a],
+          backgroundColor: COLOR_A,
+          borderRadius: 4,
+        }},
+        {{
+          label: LABEL_B,
+          data: [d.top1_b, d.top3_b, d.top10_b],
+          backgroundColor: COLOR_B,
+          borderRadius: 4,
+        }},
+      ]
+    }},
+    options: {{
+      responsive: true, maintainAspectRatio: false,
+      plugins: {{ legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }} }},
+      scales: {{
+        x: {{ ticks: {{ font: {{ size: 11 }} }} }},
+        y: {{
+          title: {{ display: true, text: '% Cells in Top Clones', font: {{ size: 11 }} }},
+          max: 100,
+        }}
+      }}
+    }}
+  }});
+}}
+
+// ── Section 3: Cross-sample charts ──
+(function() {{
+  const cs = DATA.cross_sorted;
+  const labels = cs.map(r => r.sample);
+
+  // E) Clone counts
+  new Chart(document.getElementById('chart-cross-clones'), {{
+    type: 'bar',
+    data: {{
+      labels: labels,
+      datasets: [
+        {{
+          label: LABEL_A,
+          data: cs.map(r => r.clones_a),
+          backgroundColor: COLOR_A,
+          borderRadius: 4,
+        }},
+        {{
+          label: LABEL_B,
+          data: cs.map(r => r.clones_b),
+          backgroundColor: COLOR_B,
+          borderRadius: 4,
+        }},
+      ]
+    }},
+    options: {{
+      indexAxis: 'y',
+      responsive: true, maintainAspectRatio: false,
+      plugins: {{ legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }} }},
+      scales: {{
+        x: {{ title: {{ display: true, text: '# Clones', font: {{ size: 11 }} }} }},
+        y: {{ ticks: {{ font: {{ size: 10 }} }} }}
+      }}
+    }}
+  }});
+
+  // F) Cell counts
+  new Chart(document.getElementById('chart-cross-cells'), {{
+    type: 'bar',
+    data: {{
+      labels: labels,
+      datasets: [
+        {{
+          label: LABEL_A,
+          data: cs.map(r => r.cells_a),
+          backgroundColor: COLOR_A,
+          borderRadius: 4,
+        }},
+        {{
+          label: LABEL_B,
+          data: cs.map(r => r.cells_b),
+          backgroundColor: COLOR_B,
+          borderRadius: 4,
+        }},
+      ]
+    }},
+    options: {{
+      indexAxis: 'y',
+      responsive: true, maintainAspectRatio: false,
+      plugins: {{ legend: {{ position: 'top', labels: {{ font: {{ size: 11 }} }} }} }},
+      scales: {{
+        x: {{ title: {{ display: true, text: '# Cells', font: {{ size: 11 }} }} }},
+        y: {{ ticks: {{ font: {{ size: 10 }} }} }}
+      }}
+    }}
+  }});
+}})();
+</script>
+</body>
+</html>
+"""
+
+
+def generate_html(comparison, title, file_a, file_b):
+    data_json = json.dumps(comparison, separators=(',', ':'))
+    return HTML_TEMPLATE.format(
+        title=title,
+        file_a=os.path.basename(file_a),
+        file_b=os.path.basename(file_b),
+        label_a=comparison["label_a"],
+        label_b=comparison["label_b"],
+        date=datetime.now().strftime("%Y-%m-%d %H:%M"),
+        data_json=data_json,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(description="NextClone Comparison Report Generator")
+    parser.add_argument("csv_a", help="Input CSV file A (e.g. reference mode)")
+    parser.add_argument("csv_b", help="Input CSV file B (e.g. discovery mode)")
+    parser.add_argument("--label-a", default="Run A", help="Label for run A")
+    parser.add_argument("--label-b", default="Run B", help="Label for run B")
+    parser.add_argument("--output", "-o", default="report_comparison.html", help="Output HTML file")
+    parser.add_argument("--title", default="NextClone: Run A vs Run B", help="Report title")
+    args = parser.parse_args()
+
+    print(f"Loading {args.csv_a} …")
+    data_a = load_data(args.csv_a)
+    print(f"  → {len(data_a)} samples, {sum(s['reads'] for s in data_a.values()):,} reads")
+
+    print(f"Loading {args.csv_b} …")
+    data_b = load_data(args.csv_b)
+    print(f"  → {len(data_b)} samples, {sum(s['reads'] for s in data_b.values()):,} reads")
+
+    print("Computing comparison stats …")
+    comparison = build_comparison_data(data_a, data_b, args.label_a, args.label_b)
+
+    print("Generating HTML …")
+    html = generate_html(comparison, args.title, args.csv_a, args.csv_b)
+
+    with open(args.output, "w", encoding="utf-8") as f:
+        f.write(html)
+
+    size_kb = os.path.getsize(args.output) / 1024
+    print(f"✓ Report written to: {args.output}  ({size_kb:.1f} KB)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/reports/generate_report.py b/reports/generate_report.py
new file mode 100644
index 0000000..6f73b1a
--- /dev/null
+++ b/reports/generate_report.py
@@ -0,0 +1,1117 @@
+#!/usr/bin/env python3
+"""
+NextClone Report Generator v2
+Reads clone_barcodes.csv and generates a self-contained HTML dashboard.
+
+New features (v2):
+- Clone overlap table (shared clones across samples at different thresholds)
+- Heterogeneity metrics (Gini coefficient, Shannon index)
+- Clone size density plot (KDE curve)
+- Reversed top 20 clones (largest at top)
+
+Usage:
+    python3 generate_report.py <input_csv> [--output report.html] [--title "My Run"]
+"""
+
+import argparse
+import csv
+import json
+import math
+import os
+import sys
+from collections import defaultdict
+from datetime import datetime
+
+
+# ---------------------------------------------------------------------------
+# Data loading & stats computation
+# ---------------------------------------------------------------------------
+
+def load_data(csv_path):
+    """
+    Parse the CSV and return a dict of per-sample data structures.
+    Also extracts run information from header lines starting with #.
+    """
+    samples = defaultdict(lambda: {
+        "reads": 0,
+        "cells": set(),
+        "clone_cells": defaultdict(set),  # clone_barcode -> set of cell barcodes
+        "flank_edit": defaultdict(int),
+        "barcode_edit": defaultdict(int),
+    })
+    
+    run_info = {
+        "mode": None,
+        "command": None,
+        "parameters": {},
+    }
+
+    with open(csv_path, newline="", encoding="utf-8") as f:
+        # First pass: read header comments for run information
+        header_lines = []
+        for line in f:
+            if line.startswith('#'):
+                header_lines.append(line.strip())
+            else:
+                # Found first data line, reset file pointer
+                f.seek(0)
+                break
+        
+        # Parse header comments
+        for line in header_lines:
+            line = line.lstrip('#').strip()
+            if line.startswith('mode:'):
+                run_info["mode"] = line.split(':', 1)[1].strip()
+            elif line.startswith('command:'):
+                run_info["command"] = line.split(':', 1)[1].strip()
+            elif ':' in line:
+                key, val = line.split(':', 1)
+                run_info["parameters"][key.strip()] = val.strip()
+        
+        # Second pass: read CSV data
+        f.seek(0)
+        reader = csv.DictReader(f)
+        for row in reader:
+            sample = row["SourceBAMFile"]
+            cell = row["CellBarcode"]
+            clone = row["CloneBarcode"]
+            try:
+                fed = int(row["FlankEditDist"])
+            except (ValueError, KeyError):
+                fed = -1
+            try:
+                bed = int(row["BarcodeEditDist"])
+            except (ValueError, KeyError):
+                bed = -1
+
+            s = samples[sample]
+            s["reads"] += 1
+            s["cells"].add(cell)
+            s["clone_cells"][clone].add(cell)
+            if fed >= 0:
+                s["flank_edit"][min(fed, 5)] += 1
+            if bed >= 0:
+                s["barcode_edit"][min(bed, 5)] += 1
+
+    return samples, run_info
+
+
+def compute_gini(values):
+    """
+    Calculate Gini coefficient for clone size distribution.
+    0 = perfect equality (all clones same size)
+    1 = perfect inequality (one clone has all cells)
+    
+    Formula: G = sum(|xi - xj|) / (2 * n * sum(x))
+    """
+    if not values or sum(values) == 0:
+        return 0.0
+    
+    n = len(values)
+    if n == 1:
+        return 0.0
+    
+    # Sort values
+    sorted_vals = sorted(values)
+    
+    # Calculate Gini using the efficient formula
+    # G = (2 * sum(i * x_i) - (n + 1) * sum(x_i)) / (n * sum(x_i))
+    total = sum(sorted_vals)
+    weighted_sum = sum((i + 1) * val for i, val in enumerate(sorted_vals))
+    
+    gini = (2 * weighted_sum - (n + 1) * total) / (n * total)
+    return round(gini, 4)
+
+
+def compute_shannon(values):
+    """
+    Calculate Shannon diversity index for clone distribution.
+    Higher = more diverse (many clones with similar sizes)
+    Lower = less diverse (few dominant clones)
+    
+    Formula: H = -sum(pi * ln(pi))
+    """
+    if not values or sum(values) == 0:
+        return 0.0
+    
+    total = sum(values)
+    h = 0.0
+    
+    for val in values:
+        if val > 0:
+            pi = val / total
+            h -= pi * math.log(pi)
+    
+    return round(h, 4)
+
+
+def compute_clone_overlap(samples):
+    """
+    Compute clone overlap across samples at different cell thresholds.
+    Returns a dict with thresholds as keys and per-sample counts + "in_all" count.
+    """
+    thresholds = [5, 10, 15, 20, 50, 100]
+    sample_names = sorted(samples.keys())
+    
+    # For each sample, get clones with >= threshold cells
+    sample_clone_sets = {}
+    for sample, raw in samples.items():
+        clone_sizes = {clone: len(cells) for clone, cells in raw["clone_cells"].items()}
+        sample_clone_sets[sample] = clone_sizes
+    
+    # Compute overlap for each threshold
+    overlap_data = {}
+    for thresh in thresholds:
+        overlap_data[thresh] = {
+            "per_sample": {},
+            "in_all": 0
+        }
+        
+        # Get clones meeting threshold for each sample
+        clones_above_thresh = {}
+        for sample in sample_names:
+            clones_above = [
+                clone for clone, size in sample_clone_sets[sample].items()
+                if size >= thresh
+            ]
+            clones_above_thresh[sample] = set(clones_above)
+            overlap_data[thresh]["per_sample"][sample] = len(clones_above)
+        
+        # Clones present in ALL samples above threshold
+        if len(sample_names) > 0:
+            common_clones = set.intersection(*clones_above_thresh.values())
+            overlap_data[thresh]["in_all"] = len(common_clones)
+    
+    return overlap_data
+
+
+def compute_stats(samples):
+    """Turn raw per-sample data into serialisable stats dicts."""
+    result = {}
+    for sample, raw in sorted(samples.items()):
+        n_reads = raw["reads"]
+        n_cells = len(raw["cells"])
+
+        # Clone sizes (by unique cells per clone)
+        clone_sizes = {clone: len(cells) for clone, cells in raw["clone_cells"].items()}
+        n_clones = len(clone_sizes)
+        clone_size_values = list(clone_sizes.values())
+
+        # Ranked sizes (descending)
+        ranked = sorted(clone_size_values, reverse=True)
+
+        # Clone size distribution buckets
+        buckets = {"Singleton": 0, "Small (2-5)": 0, "Medium (6-20)": 0,
+                   "Large (21-100)": 0, "Dominant (>100)": 0}
+        for sz in ranked:
+            if sz == 1:
+                buckets["Singleton"] += 1
+            elif sz <= 5:
+                buckets["Small (2-5)"] += 1
+            elif sz <= 20:
+                buckets["Medium (6-20)"] += 1
+            elif sz <= 100:
+                buckets["Large (21-100)"] += 1
+            else:
+                buckets["Dominant (>100)"] += 1
+
+        # Clone size density (for KDE plot)
+        # Create binned density for log-transformed clone sizes
+        density_data = []
+        if clone_size_values:
+            # Use log scale for better visualization
+            log_sizes = [math.log10(sz) for sz in clone_size_values if sz > 0]
+            if log_sizes:
+                min_log = min(log_sizes)
+                max_log = max(log_sizes)
+                n_bins = min(30, len(log_sizes))
+                if n_bins > 1:
+                    bin_width = (max_log - min_log) / n_bins
+                    for i in range(n_bins):
+                        bin_start = min_log + i * bin_width
+                        bin_count = sum(1 for ls in log_sizes if bin_start <= ls < bin_start + bin_width)
+                        density_data.append({
+                            "x": round(10 ** bin_start, 2),
+                            "y": bin_count
+                        })
+
+        # Top 20 clones (already sorted descending - largest first)
+        top_clones_raw = sorted(clone_sizes.items(), key=lambda x: x[1], reverse=True)[:20]
+        top_clones = [
+            {
+                "barcode": bc[:20],
+                "n_cells": cnt,
+                "pct": round(cnt / n_cells * 100, 2) if n_cells else 0,
+            }
+            for bc, cnt in top_clones_raw
+        ]
+
+        # Clonality metrics
+        def top_n_pct(n):
+            if n_cells == 0:
+                return 0.0
+            top_cells = sum(ranked[:n])
+            return round(top_cells / n_cells * 100, 2)
+
+        # Edit distance distributions (keys 0-5)
+        def ed_dist(d):
+            return [d.get(i, 0) for i in range(6)]
+
+        # Heterogeneity metrics
+        gini = compute_gini(clone_size_values)
+        shannon = compute_shannon(clone_size_values)
+
+        result[sample] = {
+            "reads": n_reads,
+            "cells": n_cells,
+            "clones": n_clones,
+            "ranked_sizes": ranked,
+            "clone_size_buckets": buckets,
+            "clone_size_density": density_data,
+            "top_clones": top_clones,
+            "top1_pct": top_n_pct(1),
+            "top3_pct": top_n_pct(3),
+            "top10_pct": top_n_pct(10),
+            "flank_edit_dist": ed_dist(raw["flank_edit"]),
+            "barcode_edit_dist": ed_dist(raw["barcode_edit"]),
+            "gini": gini,
+            "shannon": shannon,
+        }
+
+    return result
+
+
+def compute_global_overlap(samples):
+    """Compute clone overlap data for all samples."""
+    return compute_clone_overlap(samples)
+
+
+def global_stats(stats):
+    total_reads = sum(s["reads"] for s in stats.values())
+    total_cells = sum(s["cells"] for s in stats.values())
+    total_samples = len(stats)
+    total_clones = sum(s["clones"] for s in stats.values())
+    
+    return {
+        "total_reads": total_reads,
+        "total_cells": total_cells,
+        "total_samples": total_samples,
+        "total_clones": total_clones,
+    }
+
+
+# ---------------------------------------------------------------------------
+# HTML template
+# ---------------------------------------------------------------------------
+
+HTML_TEMPLATE = r"""<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8"/>
+<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
+<title>{{TITLE}}</title>
+<link rel="preconnect" href="https://fonts.googleapis.com"/>
+<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet"/>
+<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+<style>
+  *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+  body { font-family: 'Inter', system-ui, sans-serif; background: #F8FAFC; color: #1E293B; font-size: 14px; line-height: 1.6; }
+  a { color: #2563EB; text-decoration: none; }
+  a:hover { text-decoration: underline; }
+
+  /* Layout */
+  .container { max-width: 1400px; margin: 0 auto; padding: 0 24px; }
+
+  /* Header */
+  .header { background: linear-gradient(135deg, #1E3A5F 0%, #2563EB 100%); color: white; padding: 32px 0 28px; }
+  .header h1 { font-size: 26px; font-weight: 700; letter-spacing: -0.3px; }
+  .header-meta { margin-top: 8px; opacity: 0.8; font-size: 13px; display: flex; gap: 20px; flex-wrap: wrap; }
+  .run-mode-badge { background: rgba(255,255,255,0.2); border-radius: 99px; padding: 2px 12px; font-size: 12px; font-weight: 500; }
+
+  /* Summary bar */
+  .summary-bar { background: white; border-bottom: 1px solid #E2E8F0; padding: 16px 0; }
+  .summary-stats { display: flex; gap: 0; }
+  .stat-item { flex: 1; text-align: center; padding: 8px 16px; border-right: 1px solid #E2E8F0; }
+  .stat-item:last-child { border-right: none; }
+  .stat-value { font-size: 28px; font-weight: 700; color: #2563EB; }
+  .stat-label { font-size: 11px; text-transform: uppercase; letter-spacing: 0.05em; color: #64748B; margin-top: 2px; }
+
+  /* Sections */
+  .section { padding: 28px 0; }
+  .section-title { font-size: 18px; font-weight: 600; color: #1E293B; margin-bottom: 16px; display: flex; align-items: center; gap: 8px; }
+  .section-title::before { content: ''; display: block; width: 4px; height: 20px; background: #2563EB; border-radius: 2px; }
+
+  /* Card */
+  .card { background: white; border-radius: 12px; box-shadow: 0 1px 3px rgba(0,0,0,0.08), 0 1px 2px rgba(0,0,0,0.04); padding: 20px; }
+
+  /* Table */
+  .table-wrapper { overflow-x: auto; border-radius: 12px; box-shadow: 0 1px 3px rgba(0,0,0,0.08); }
+  table { width: 100%; border-collapse: collapse; background: white; }
+  thead tr { background: #F8FAFC; }
+  th { padding: 12px 16px; text-align: left; font-size: 12px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: #64748B; border-bottom: 2px solid #E2E8F0; cursor: pointer; user-select: none; white-space: nowrap; }
+  th:hover { background: #EFF6FF; color: #2563EB; }
+  th .sort-icon { display: inline-block; margin-left: 4px; opacity: 0.4; }
+  th.sort-asc .sort-icon::after { content: ' ↑'; opacity: 1; }
+  th.sort-desc .sort-icon::after { content: ' ↓'; opacity: 1; }
+  tbody tr { border-bottom: 1px solid #F1F5F9; cursor: pointer; transition: background 0.1s; }
+  tbody tr:last-child { border-bottom: none; }
+  tbody tr:hover { background: #EFF6FF; }
+  tbody tr.selected { background: #DBEAFE; }
+  td { padding: 12px 16px; }
+  .num-cell { text-align: right; font-variant-numeric: tabular-nums; }
+
+  /* Clonality pill */
+  .pill { display: inline-block; padding: 2px 10px; border-radius: 99px; font-size: 12px; font-weight: 500; }
+  .pill-green { background: #DCFCE7; color: #16A34A; }
+  .pill-amber { background: #FEF3C7; color: #D97706; }
+  .pill-red { background: #FEE2E2; color: #DC2626; }
+
+  /* Heterogeneity badge */
+  .het-badge { display: inline-block; padding: 2px 8px; border-radius: 6px; font-size: 11px; font-weight: 600; margin-left: 6px; }
+  .het-low { background: #DCFCE7; color: #16A34A; }
+  .het-med { background: #FEF3C7; color: #D97706; }
+  .het-high { background: #FEE2E2; color: #DC2626; }
+
+  /* Sample detail */
+  .detail-header { display: flex; align-items: center; justify-content: space-between; margin-bottom: 16px; flex-wrap: wrap; gap: 12px; }
+  .detail-select { padding: 8px 12px; border: 1px solid #CBD5E1; border-radius: 8px; font-family: inherit; font-size: 14px; background: white; cursor: pointer; }
+  .detail-select:focus { outline: none; border-color: #2563EB; box-shadow: 0 0 0 3px rgba(37,99,235,0.1); }
+  .charts-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 20px; }
+  @media (max-width: 900px) { .charts-grid { grid-template-columns: 1fr; } }
+  .chart-card { background: white; border-radius: 12px; box-shadow: 0 1px 3px rgba(0,0,0,0.08); padding: 20px; }
+  .chart-title { font-size: 14px; font-weight: 600; color: #374151; margin-bottom: 12px; }
+  .chart-container { position: relative; }
+
+  /* Cross-sample */
+  .comparison-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 20px; }
+  @media (max-width: 900px) { .comparison-grid { grid-template-columns: 1fr; } }
+  
+  /* Overlap table */
+  .overlap-table-wrapper { overflow-x: auto; border-radius: 12px; box-shadow: 0 1px 3px rgba(0,0,0,0.08); margin-top: 16px; }
+  .overlap-table { width: 100%; border-collapse: collapse; }
+  .overlap-table th { background: #F1F5F9; font-size: 11px; padding: 10px 12px; text-align: right; border: 1px solid #E2E8F0; }
+  .overlap-table th:first-child { text-align: left; background: #F8FAFC; }
+  .overlap-table td { padding: 10px 12px; text-align: right; border: 1px solid #E2E8F0; font-variant-numeric: tabular-nums; }
+  .overlap-table tr:hover { background: #F8FAFC; }
+  .overlap-table th:first-child, .overlap-table td:first-child { text-align: left; background: #F8FAFC; font-weight: 600; color: #1E293B; }
+  .overlap-table .in-all-col { background: #DBEAFE; font-weight: 600; color: #1E40AF; }
+
+  /* Footer */
+  .footer { background: #1E293B; color: #94A3B8; text-align: center; padding: 20px; font-size: 12px; margin-top: 20px; }
+  .footer a { color: #60A5FA; }
+
+  /* Divider */
+  .divider { height: 1px; background: #E2E8F0; margin: 0; }
+
+  /* Placeholder */
+  .placeholder { text-align: center; color: #94A3B8; padding: 40px; font-size: 14px; }
+</style>
+</head>
+<body>
+
+<!-- Header -->
+<div class="header">
+  <div class="container">
+    <h1>{{TITLE}}</h1>
+    <div class="header-meta">
+      <span>📄 {{INPUT_FILE}}</span>
+      <span>📅 Generated {{TIMESTAMP}}</span>
+      <span class="run-mode-badge">{{RUN_MODE}}</span>
+    </div>
+  </div>
+</div>
+
+<!-- Summary bar -->
+<div class="summary-bar">
+  <div class="container">
+    <div class="summary-stats" id="summary-stats"></div>
+  </div>
+</div>
+
+<!-- Main content -->
+<div class="container">
+
+  <!-- Section 1: Sample Overview -->
+  <div class="section">
+    <div class="section-title">Sample Overview</div>
+    <div class="table-wrapper">
+      <table id="sample-table">
+        <thead>
+          <tr>
+            <th data-col="sample" data-type="str">Sample<span class="sort-icon"></span></th>
+            <th data-col="reads" data-type="num" class="num-cell">Reads<span class="sort-icon"></span></th>
+            <th data-col="cells" data-type="num" class="num-cell">Cells<span class="sort-icon"></span></th>
+            <th data-col="clones" data-type="num" class="num-cell">Clones<span class="sort-icon"></span></th>
+            <th data-col="top1_pct" data-type="num" class="num-cell">Top Clone %<span class="sort-icon"></span></th>
+            <th data-col="gini" data-type="num" class="num-cell">Gini<span class="sort-icon"></span></th>
+            <th data-col="shannon" data-type="num" class="num-cell">Shannon<span class="sort-icon"></span></th>
+          </tr>
+        </thead>
+        <tbody id="sample-tbody"></tbody>
+      </table>
+    </div>
+  </div>
+
+  <div class="divider"></div>
+
+  <!-- Section 2: Clone Overlap Across Samples -->
+  <div class="section">
+    <div class="section-title">Clone Overlap Across Samples</div>
+    <div class="card">
+      <p style="color:#64748B;font-size:13px;margin-bottom:12px;">
+        Number of clones detected in each sample (and in ALL samples) at different cell count thresholds.
+        Higher overlap indicates consistent clone detection across samples.
+      </p>
+      <div class="overlap-table-wrapper">
+        <table class="overlap-table" id="overlap-table">
+          <thead>
+            <tr id="overlap-header"></tr>
+          </thead>
+          <tbody id="overlap-tbody"></tbody>
+        </table>
+      </div>
+    </div>
+  </div>
+
+  <div class="divider"></div>
+
+  <!-- Section 3: Sample Detail -->
+  <div class="section">
+    <div class="detail-header">
+      <div class="section-title" style="margin-bottom:0">Sample Detail</div>
+      <select class="detail-select" id="sample-select">
+        <option value="">— Select a sample —</option>
+      </select>
+    </div>
+    <div id="detail-placeholder" class="placeholder card">Click a row in the table above or select a sample from the dropdown to view detailed charts.</div>
+    <div id="detail-charts" style="display:none;">
+      <div class="charts-grid">
+        <div class="chart-card">
+          <div class="chart-title">A) Ranked Clone Abundance</div>
+          <div class="chart-container" style="height:300px"><canvas id="chartAbundance"></canvas></div>
+        </div>
+        <div class="chart-card">
+          <div class="chart-title">B) Clone Size Density</div>
+          <div class="chart-container" style="height:300px"><canvas id="chartSizeDensity"></canvas></div>
+        </div>
+        <div class="chart-card">
+          <div class="chart-title">C) Top 20 Clones</div>
+          <div class="chart-container" style="height:300px"><canvas id="chartTop20"></canvas></div>
+        </div>
+        <div class="chart-card">
+          <div class="chart-title">D) Edit Distance Quality</div>
+          <div class="chart-container" style="height:300px"><canvas id="chartEditDist"></canvas></div>
+        </div>
+      </div>
+    </div>
+  </div>
+
+  <div class="divider"></div>
+
+  <!-- Section 5: Cross-Sample Comparison -->
+  <div class="section">
+    <div class="section-title">Cross-Sample Comparison</div>
+    <div class="comparison-grid">
+      <div class="chart-card">
+        <div class="chart-title">E) Cells per Sample</div>
+        <div class="chart-container" style="height:320px"><canvas id="chartCellsPerSample"></canvas></div>
+      </div>
+      <div class="chart-card">
+        <div class="chart-title">F) Clonality Comparison</div>
+        <div class="chart-container" style="height:320px"><canvas id="chartClonality"></canvas></div>
+      </div>
+    </div>
+  </div>
+
+</div>
+
+<!-- Footer -->
+<div class="footer">
+  Generated by <a href="https://github.com/phipsonlab/NextClone" target="_blank">NextClone</a> report generator &mdash; {{TIMESTAMP}}
+</div>
+
+<script>
+// ============================================================
+// Embedded data
+// ============================================================
+const DATA = {{DATA_JSON}};
+const GLOBAL = {{GLOBAL_JSON}};
+const OVERLAP = {{OVERLAP_JSON}};
+const SAMPLE_NAMES = Object.keys(DATA);
+
+// ============================================================
+// Utilities
+// ============================================================
+function fmt(n) {
+  if (n === undefined || n === null) return '—';
+  return Number(n).toLocaleString();
+}
+function pct(v) { return v.toFixed(1) + '%'; }
+function fmt2(v) { return v.toFixed(2); }
+
+// ============================================================
+// Summary bar
+// ============================================================
+function renderSummary() {
+  const el = document.getElementById('summary-stats');
+  const items = [
+    { label: 'Total Reads', value: fmt(GLOBAL.total_reads) },
+    { label: 'Total Cells', value: fmt(GLOBAL.total_cells) },
+    { label: 'Samples', value: fmt(GLOBAL.total_samples) },
+    { label: 'Total Clone Assignments', value: fmt(GLOBAL.total_clones) },
+  ];
+  el.innerHTML = items.map(i =>
+    `<div class="stat-item"><div class="stat-value">${i.value}</div><div class="stat-label">${i.label}</div></div>`
+  ).join('');
+}
+
+// ============================================================
+// Sample table
+// ============================================================
+let sortCol = null, sortDir = 1;
+
+function giniBadge(v) {
+  if (v < 0.3) return `<span class="het-badge het-low">Low</span>`;
+  if (v < 0.6) return `<span class="het-badge het-med">Med</span>`;
+  return `<span class="het-badge het-high">High</span>`;
+}
+
+function renderTable(names) {
+  const tbody = document.getElementById('sample-tbody');
+  tbody.innerHTML = names.map(name => {
+    const s = DATA[name];
+    return `<tr data-sample="${name}">
+      <td>${name}</td>
+      <td class="num-cell">${fmt(s.reads)}</td>
+      <td class="num-cell">${fmt(s.cells)}</td>
+      <td class="num-cell">${fmt(s.clones)}</td>
+      <td class="num-cell">${pct(s.top1_pct)}</td>
+      <td class="num-cell">${fmt2(s.gini)} ${giniBadge(s.gini)}</td>
+      <td class="num-cell">${fmt2(s.shannon)}</td>
+    </tr>`;
+  }).join('');
+
+  tbody.querySelectorAll('tr').forEach(row => {
+    row.addEventListener('click', () => selectSample(row.dataset.sample));
+  });
+}
+
+function sortTable(col, type) {
+  if (sortCol === col) sortDir *= -1;
+  else { sortCol = col; sortDir = 1; }
+
+  document.querySelectorAll('th').forEach(th => {
+    th.classList.remove('sort-asc', 'sort-desc');
+    if (th.dataset.col === col) th.classList.add(sortDir === 1 ? 'sort-asc' : 'sort-desc');
+  });
+
+  const sorted = [...SAMPLE_NAMES].sort((a, b) => {
+    let va = col === 'sample' ? a : DATA[a][col];
+    let vb = col === 'sample' ? b : DATA[b][col];
+    if (type === 'num') return (va - vb) * sortDir;
+    return va.localeCompare(vb) * sortDir;
+  });
+  renderTable(sorted);
+  if (currentSample) {
+    document.querySelectorAll('#sample-tbody tr').forEach(r => {
+      if (r.dataset.sample === currentSample) r.classList.add('selected');
+    });
+  }
+}
+
+document.querySelectorAll('th[data-col]').forEach(th => {
+  th.addEventListener('click', () => sortTable(th.dataset.col, th.dataset.type));
+});
+
+// ============================================================
+// Dropdown
+// ============================================================
+function populateDropdown() {
+  const sel = document.getElementById('sample-select');
+  SAMPLE_NAMES.forEach(name => {
+    const opt = document.createElement('option');
+    opt.value = name; opt.textContent = name;
+    sel.appendChild(opt);
+  });
+  sel.addEventListener('change', e => {
+    if (e.target.value) selectSample(e.target.value);
+  });
+}
+
+// ============================================================
+// Overlap table
+// ============================================================
+function renderOverlapTable() {
+  const header = document.getElementById('overlap-header');
+  const tbody = document.getElementById('overlap-tbody');
+  
+  // Header row
+  const thresholds = Object.keys(OVERLAP).map(Number);
+  header.innerHTML = '<th>Threshold</th>' + 
+    SAMPLE_NAMES.map(s => `<th>${s}</th>`).join('') +
+    '<th class="in-all-col">In ALL Samples</th>';
+  
+  // Data rows
+  tbody.innerHTML = thresholds.map(thresh => {
+    const row = OVERLAP[thresh];
+    return '<tr>' +
+      `<td>≥${thresh} cells</td>` +
+      SAMPLE_NAMES.map(s => `<td>${fmt(row.per_sample[s])}</td>`).join('') +
+      `<td class="in-all-col">${fmt(row.in_all)}</td>` +
+      '</tr>';
+  }).join('');
+}
+
+// ============================================================
+// Chart instances
+// ============================================================
+let charts = {};
+function destroyChart(id) {
+  if (charts[id]) { charts[id].destroy(); delete charts[id]; }
+}
+
+// ============================================================
+// Sample selection & detail charts
+// ============================================================
+let currentSample = null;
+
+function selectSample(name) {
+  currentSample = name;
+  document.querySelectorAll('#sample-tbody tr').forEach(r => {
+    r.classList.toggle('selected', r.dataset.sample === name);
+  });
+  document.getElementById('sample-select').value = name;
+  document.getElementById('detail-placeholder').style.display = 'none';
+  document.getElementById('detail-charts').style.display = 'block';
+
+  renderAbundance(name);
+  renderSizeDensity(name);
+  renderTop20(name);
+  renderEditDist(name);
+}
+
+// Chart A: Ranked Clone Abundance
+function renderAbundance(name) {
+  destroyChart('abundance');
+  const s = DATA[name];
+  const ranked = s.ranked_sizes;
+  const labels = ranked.map((_, i) => i + 1);
+
+  const ctx = document.getElementById('chartAbundance').getContext('2d');
+  charts['abundance'] = new Chart(ctx, {
+    type: 'line',
+    data: {
+      labels,
+      datasets: [{
+        label: 'Cells per Clone',
+        data: ranked,
+        borderColor: '#2563EB',
+        backgroundColor: 'rgba(37,99,235,0.05)',
+        borderWidth: 1.5,
+        pointRadius: 0,
+        fill: true,
+        tension: 0.1,
+      }]
+    },
+    options: {
+      responsive: true,
+      maintainAspectRatio: false,
+      scales: {
+        y: {
+          type: 'logarithmic',
+          title: { display: true, text: 'Cells (log scale)', font: { size: 11 } },
+          ticks: { callback: v => v },
+        },
+        x: {
+          title: { display: true, text: 'Clone Rank', font: { size: 11 } },
+          ticks: { maxTicksLimit: 10 },
+        }
+      },
+      plugins: {
+        legend: { display: false },
+        tooltip: {
+          callbacks: {
+            title: ctx => `Rank #${ctx[0].label}`,
+            label: ctx => `${fmt(ctx.raw)} cells`,
+            afterLabel: ctx => {
+              const i = ctx.dataIndex;
+              if (i < 3 && s.top_clones[i]) return `Barcode: ${s.top_clones[i].barcode}`;
+              return '';
+            }
+          }
+        },
+      }
+    },
+    plugins: [{
+      id: 'topAnnotations',
+      afterDatasetsDraw(chart) {
+        const { ctx, scales: { x, y } } = chart;
+        const ds = chart.data.datasets[0];
+        [0, 1, 2].forEach(i => {
+          if (!s.top_clones[i] || ranked[i] === undefined) return;
+          const xPx = x.getPixelForValue(i + 1);
+          const yPx = y.getPixelForValue(ranked[i]);
+          ctx.save();
+          ctx.fillStyle = '#DC2626';
+          ctx.font = '10px Inter, sans-serif';
+          ctx.textAlign = 'left';
+          ctx.fillText(s.top_clones[i].barcode, xPx + 4, yPx - 4);
+          ctx.beginPath();
+          ctx.arc(xPx, yPx, 3, 0, 2 * Math.PI);
+          ctx.fillStyle = '#DC2626';
+          ctx.fill();
+          ctx.restore();
+        });
+      }
+    }]
+  });
+}
+
+// Chart B: Clone Size Density (KDE-style)
+function renderSizeDensity(name) {
+  destroyChart('sizedensity');
+  const s = DATA[name];
+  const densityData = s.clone_size_density;
+  
+  if (!densityData || densityData.length === 0) {
+    const ctx = document.getElementById('chartSizeDensity').getContext('2d');
+    ctx.canvas.parentNode.innerHTML = '<div class="placeholder">No density data available</div>';
+    return;
+  }
+  
+  const labels = densityData.map(d => fmt(d.x));
+  const values = densityData.map(d => d.y);
+
+  const ctx = document.getElementById('chartSizeDensity').getContext('2d');
+  charts['sizedensity'] = new Chart(ctx, {
+    type: 'line',
+    data: {
+      labels,
+      datasets: [{
+        label: 'Clone Count',
+        data: values,
+        borderColor: '#16A34A',
+        backgroundColor: 'rgba(22,163,74,0.1)',
+        borderWidth: 2,
+        pointRadius: 0,
+        fill: true,
+        tension: 0.3,
+      }]
+    },
+    options: {
+      responsive: true,
+      maintainAspectRatio: false,
+      scales: {
+        y: {
+          title: { display: true, text: 'Number of Clones', font: { size: 11 } },
+          beginAtZero: true,
+        },
+        x: {
+          title: { display: true, text: 'Clone Size (cells, log scale)', font: { size: 11 } },
+          min: 0,
+          ticks: { maxTicksLimit: 10 }
+        }
+      },
+      plugins: {
+        legend: { display: false },
+        tooltip: {
+          callbacks: {
+            title: ctx => `Clone size: ~${ctx[0].label} cells`,
+            label: ctx => `${fmt(ctx.raw)} clones`,
+          }
+        }
+      }
+    }
+  });
+}
+
+// Chart C: Top 20 Clones (reversed - largest at top)
+function renderTop20(name) {
+  destroyChart('top20');
+  const s = DATA[name];
+  const top = s.top_clones;
+  
+  // Reverse so largest is at top (index 0)
+  const labels = top.map(c => c.barcode);
+  const values = top.map(c => c.n_cells);
+  const pcts = top.map(c => c.pct);
+  const colors = top.map((_, i) => {
+    if (i < 3) return '#DC2626';
+    if (i < 10) return '#D97706';
+    return '#2563EB';
+  });
+
+  const ctx = document.getElementById('chartTop20').getContext('2d');
+  charts['top20'] = new Chart(ctx, {
+    type: 'bar',
+    data: {
+      labels,
+      datasets: [{ data: values, backgroundColor: colors, borderRadius: 3 }]
+    },
+    options: {
+      responsive: true,
+      maintainAspectRatio: false,
+      indexAxis: 'y',
+      plugins: {
+        legend: { display: false },
+        tooltip: {
+          callbacks: {
+            label: ctx => `${fmt(ctx.raw)} cells (${pcts[ctx.dataIndex]}%)`
+          }
+        },
+      },
+      scales: {
+        x: { title: { display: true, text: 'Number of Cells', font: { size: 11 } } },
+        y: { 
+          ticks: { font: { size: 10 } },
+          reverse: false  // Largest at top
+        }
+      }
+    },
+    plugins: [{
+      id: 'barPctLabels',
+      afterDatasetsDraw(chart) {
+        const { ctx: c } = chart;
+        chart.data.datasets[0].data.forEach((val, i) => {
+          const meta = chart.getDatasetMeta(0);
+          const bar = meta.data[i];
+          const pctVal = pcts[i];
+          c.save();
+          c.font = '10px Inter, sans-serif';
+          c.fillStyle = '#374151';
+          c.textAlign = 'left';
+          c.textBaseline = 'middle';
+          c.fillText(`${pctVal}%`, bar.x + 4, bar.y);
+          c.restore();
+        });
+      }
+    }]
+  });
+}
+
+// Chart D: Edit Distance Quality
+function renderEditDist(name) {
+  destroyChart('editdist');
+  const s = DATA[name];
+  const labels = ['0', '1', '2', '3', '4', '5+'];
+
+  const ctx = document.getElementById('chartEditDist').getContext('2d');
+  charts['editdist'] = new Chart(ctx, {
+    type: 'bar',
+    data: {
+      labels,
+      datasets: [
+        {
+          label: 'FlankEditDist',
+          data: s.flank_edit_dist,
+          backgroundColor: 'rgba(37,99,235,0.7)',
+          borderRadius: 3,
+        },
+        {
+          label: 'BarcodeEditDist',
+          data: s.barcode_edit_dist,
+          backgroundColor: 'rgba(220,38,38,0.6)',
+          borderRadius: 3,
+        }
+      ]
+    },
+    options: {
+      responsive: true,
+      maintainAspectRatio: false,
+      plugins: {
+        legend: { position: 'top', labels: { font: { size: 11 } } },
+        tooltip: { callbacks: { label: ctx => `${fmt(ctx.raw)} reads` } }
+      },
+      scales: {
+        y: { title: { display: true, text: 'Number of Reads', font: { size: 11 } } },
+        x: { title: { display: true, text: 'Edit Distance', font: { size: 11 } } }
+      }
+    }
+  });
+}
+
+// ============================================================
+// Cross-sample charts
+// ============================================================
+function renderCrossCharts() {
+  // Sort samples alphabetically
+  const sorted = [...SAMPLE_NAMES].sort((a, b) => a.localeCompare(b));
+
+  // Chart E: Cells per sample
+  {
+    const ctx = document.getElementById('chartCellsPerSample').getContext('2d');
+    charts['cellsPerSample'] = new Chart(ctx, {
+      type: 'bar',
+      data: {
+        labels: sorted,
+        datasets: [{
+          label: 'Unique Cells',
+          data: sorted.map(n => DATA[n].cells),
+          backgroundColor: '#2563EB',
+          borderRadius: 4,
+        }]
+      },
+      options: {
+        responsive: true,
+        maintainAspectRatio: false,
+        indexAxis: 'y',
+        plugins: {
+          legend: { display: false },
+          tooltip: { callbacks: { label: ctx => `${fmt(ctx.raw)} cells` } }
+        },
+        scales: {
+          x: { title: { display: true, text: 'Unique Cells', font: { size: 11 } } },
+          y: { ticks: { font: { size: 11 } } }
+        }
+      }
+    });
+  }
+
+  // Chart F: Clonality comparison
+  {
+    const ctx = document.getElementById('chartClonality').getContext('2d');
+    charts['clonality'] = new Chart(ctx, {
+      type: 'bar',
+      data: {
+        labels: SAMPLE_NAMES,
+        datasets: [
+          {
+            label: 'Top 1%',
+            data: SAMPLE_NAMES.map(n => DATA[n].top1_pct),
+            backgroundColor: '#DC2626',
+            borderRadius: 3,
+          },
+          {
+            label: 'Top 3%',
+            data: SAMPLE_NAMES.map(n => DATA[n].top3_pct),
+            backgroundColor: '#D97706',
+            borderRadius: 3,
+          },
+          {
+            label: 'Top 10%',
+            data: SAMPLE_NAMES.map(n => DATA[n].top10_pct),
+            backgroundColor: '#16A34A',
+            borderRadius: 3,
+          }
+        ]
+      },
+      options: {
+        responsive: true,
+        maintainAspectRatio: false,
+        plugins: {
+          legend: { position: 'top', labels: { font: { size: 11 } } },
+          tooltip: { callbacks: { label: ctx => `${ctx.dataset.label}: ${ctx.raw.toFixed(1)}%` } }
+        },
+        scales: {
+          y: {
+            title: { display: true, text: '% of Cells', font: { size: 11 } },
+            max: 100,
+          },
+          x: { ticks: { font: { size: 10 }, maxRotation: 30 } }
+        }
+      }
+    });
+  }
+}
+
+// ============================================================
+// Init
+// ============================================================
+renderSummary();
+renderTable(SAMPLE_NAMES);
+populateDropdown();
+renderOverlapTable();
+renderCrossCharts();
+
+if (SAMPLE_NAMES.length > 0) selectSample(SAMPLE_NAMES[0]);
+</script>
+</body>
+</html>
+"""
+
+
+# ---------------------------------------------------------------------------
+# Report generation
+# ---------------------------------------------------------------------------
+
+def detect_run_mode(run_info):
+    """Determine run mode from run_info or return 'Unknown'."""
+    if run_info.get("mode"):
+        mode = run_info["mode"]
+        # Capitalize appropriately
+        if mode.lower() == "discovery":
+            return "Discovery Mode"
+        elif mode.lower() == "whitelist" or mode.lower() == "reference":
+            return "Whitelist Mode"
+        else:
+            return mode
+    return "Run Mode Unknown"
+
+
+def generate_report(csv_path, output_path, title):
+    print(f"[1/6] Loading data from {csv_path}...")
+    raw, run_info = load_data(csv_path)
+
+    print(f"[2/6] Computing stats for {len(raw)} samples...")
+    stats = compute_stats(raw)
+    glob = global_stats(stats)
+    overlap = compute_global_overlap(raw)
+
+    run_mode = detect_run_mode(run_info)
+    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+    input_filename = os.path.basename(csv_path)
+
+    print(f"[3/6] Building HTML report...")
+    data_json = json.dumps(stats, separators=(",", ":"))
+    global_json = json.dumps(glob, separators=(",", ":"))
+    overlap_json = json.dumps(overlap, separators=(",", ":"))
+
+    html = HTML_TEMPLATE
+    html = html.replace("{{TITLE}}", title)
+    html = html.replace("{{INPUT_FILE}}", input_filename)
+    html = html.replace("{{TIMESTAMP}}", timestamp)
+    html = html.replace("{{RUN_MODE}}", run_mode)
+    html = html.replace("{{DATA_JSON}}", data_json)
+    html = html.replace("{{GLOBAL_JSON}}", global_json)
+    html = html.replace("{{OVERLAP_JSON}}", overlap_json)
+
+    print(f"[4/6] Writing to {output_path}...")
+    with open(output_path, "w", encoding="utf-8") as f:
+        f.write(html)
+
+    print(f"[5/6] Complete!")
+    print(f"[6/6] Summary:")
+    size_kb = os.path.getsize(output_path) / 1024
+    print(f"\n✅ Report generated: {output_path} ({size_kb:.1f} KB)")
+    print(f"   Samples: {glob['total_samples']}")
+    print(f"   Reads:   {glob['total_reads']:,}")
+    print(f"   Cells:   {glob['total_cells']:,}")
+    print(f"   Clones:  {glob['total_clones']:,}")
+    if run_info.get("mode"):
+        print(f"   Mode:    {run_mode}")
+    if run_info.get("command"):
+        print(f"   Command: {run_info['command'][:80]}...")
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Generate a NextClone HTML report from clone_barcodes.csv (v2)"
+    )
+    parser.add_argument("input_csv", help="Path to clone_barcodes.csv")
+    parser.add_argument("--output", default="report.html", help="Output HTML file (default: report.html)")
+    parser.add_argument("--title", default="NextClone Report", help="Report title")
+    args = parser.parse_args()
+
+    if not os.path.isfile(args.input_csv):
+        print(f"Error: input file not found: {args.input_csv}", file=sys.stderr)
+        sys.exit(1)
+
+    generate_report(args.input_csv, args.output, args.title)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/data/README.md b/tests/data/README.md
new file mode 100644
index 0000000..21e8863
--- /dev/null
+++ b/tests/data/README.md
@@ -0,0 +1,31 @@
+# NextClone Test Data
+
+## Files
+
+### whitelist_test.fastq.gz
+Synthetic FASTQ with known barcodes (matching data/known_barcodes_subset.txt).
+Use for testing whitelist mode (discovery_mode=false).
+
+### discovery_test.fastq.gz  
+Synthetic FASTQ with mixed barcodes - some known, some novel.
+Use for testing discovery mode (discovery_mode=true).
+
+### expected_discovered_barcodes.txt
+List of all barcodes present in discovery_test.fastq.gz.
+Use to validate discovery mode output.
+
+## Barcode Format
+
+- Barcode length: 20bp
+- 3' adapter: GTTTCAGAGCTATGCTGGAAACAGC
+- Read structure: [BARCODE][3' adapter]
+
+## Running Tests
+
+```bash
+# From repository root
+python tests/test_discovery_mode.py
+
+# Or run specific test
+python -c "from tests.test_discovery_mode import test_flexiplex_discovery; test_flexiplex_discovery()"
+```
diff --git a/tests/data/discovery_test.fastq.gz b/tests/data/discovery_test.fastq.gz
new file mode 100644
index 0000000..584f78b
Binary files /dev/null and b/tests/data/discovery_test.fastq.gz differ
diff --git a/tests/data/expected_discovered_barcodes.txt b/tests/data/expected_discovered_barcodes.txt
new file mode 100644
index 0000000..02bd7f2
--- /dev/null
+++ b/tests/data/expected_discovered_barcodes.txt
@@ -0,0 +1,7 @@
+CGGAGTAATACATTTTGCCT
+TCGGAGTTGGCTGTCGTTTC
+GTTGTCTCGGGGGGTGGAGA
+CCATGATAAGGGAGTTCCGG
+AAAAAAAAAAAAAAAAAAAA
+TTTTTTTTTTTTTTTTTTTT
+GGGGGGGGGGGGGGGGGGGG
diff --git a/tests/data/whitelist_test.fastq.gz b/tests/data/whitelist_test.fastq.gz
new file mode 100644
index 0000000..19f11b2
Binary files /dev/null and b/tests/data/whitelist_test.fastq.gz differ
diff --git a/tests/test_discovery_mode.py b/tests/test_discovery_mode.py
new file mode 100644
index 0000000..7261984
--- /dev/null
+++ b/tests/test_discovery_mode.py
@@ -0,0 +1,407 @@
+#!/usr/bin/env python3
+"""
+Test suite for NextClone discovery mode and parameter validation.
+
+This creates synthetic test data and validates:
+1. Parameter validation logic (error/warning messages)
+2. Discovery mode barcode extraction
+3. Backward compatibility with whitelist mode
+
+Run with: python tests/test_discovery_mode.py
+"""
+
+import os
+import sys
+import gzip
+import tempfile
+import subprocess
+import shutil
+from pathlib import Path
+
+# Test configuration matching nextflow.config defaults
+ADAPTER_5PRIME = "ATCTTGTGGAAAGGACGAAACACCG"
+ADAPTER_3PRIME = "GTTTCAGAGCTATGCTGGAAACAGC"
+BARCODE_LENGTH = 20
+
+# Known test barcodes (matching data/known_barcodes_subset.txt)
+TEST_BARCODES = [
+    "CGGAGTAATACATTTTGCCT",
+    "TCGGAGTTGGCTGTCGTTTC",
+    "GTTGTCTCGGGGGGTGGAGA",
+    "CCATGATAAGGGAGTTCCGG",
+    "AGGGGAGTCGCGTGGTAGGC",
+    "TGTCTAATGGGGGTGTCACT",
+]
+
+# Additional barcodes for discovery mode testing (not in whitelist)
+DISCOVERY_BARCODES = [
+    "AAAAAAAAAAAAAAAAAAAA",
+    "TTTTTTTTTTTTTTTTTTTT",
+    "GGGGGGGGGGGGGGGGGGGG",
+]
+
+
+def generate_quality_string(length):
+    """Generate a fake quality string of given length."""
+    return "I" * length
+
+
+def generate_fastq_read(read_id, barcode):
+    """
+    Generate a single FASTQ read with the barcode flanked by BOTH adapters.
+    
+    Format: [5' adapter][BARCODE][3' adapter]
+    
+    This matches the NextClone expected input structure where flexiplex
+    searches for: -x 5'adapter -b barcode -x 3'adapter
+    """
+    sequence = f"{ADAPTER_5PRIME}{barcode}{ADAPTER_3PRIME}"
+    quality = generate_quality_string(len(sequence))
+    
+    return f"@{read_id}\n{sequence}\n+\n{quality}\n"
+
+
+def create_synthetic_fastq(output_path, barcodes, reads_per_barcode=10):
+    """
+    Create a synthetic FASTQ file with known barcodes.
+    
+    Args:
+        output_path: Path to output .fastq.gz file
+        barcodes: List of barcode sequences to include
+        reads_per_barcode: Number of reads to generate per barcode
+    
+    Read structure: [5' adapter][BARCODE][3' adapter]
+    This matches NextClone's expected input format.
+    """
+    read_count = 0
+    
+    with gzip.open(output_path, 'wt') as f:
+        for barcode in barcodes:
+            for i in range(reads_per_barcode):
+                read_id = f"TEST_READ_{read_count}:1:1:1:{read_count} 1:N:0:AACTTGAC"
+                f.write(generate_fastq_read(read_id, barcode))
+                read_count += 1
+    
+    print(f"Created {output_path} with {read_count} reads ({len(barcodes)} barcodes × {reads_per_barcode} reads)")
+    return read_count
+
+
+def create_barcode_whitelist(output_path, barcodes):
+    """Create a barcode whitelist file."""
+    with open(output_path, 'w') as f:
+        for barcode in barcodes:
+            f.write(f"{barcode}\n")
+    print(f"Created whitelist {output_path} with {len(barcodes)} barcodes")
+
+
+def test_parameter_validation():
+    """
+    Test that parameter validation and workflow modes work correctly.
+    
+    Expected behavior:
+    - discovery_mode=true with whitelist → WARNING (proceeds anyway)
+    - All 4 workflow modes should validate (DNAseq/scRNAseq × discovery/whitelist)
+    
+    Note: The validation for missing whitelist is not tested here because
+    the nextflow.config has a default whitelist path. In production, users
+    would need to explicitly set clone_barcodes_reference or enable discovery_mode.
+    """
+    print("\n" + "="*60)
+    print("TEST: Parameter Validation & Workflow Modes")
+    print("="*60)
+    
+    # Check if nextflow is available
+    result = subprocess.run(["which", "nextflow"], capture_output=True, text=True)
+    if result.returncode != 0:
+        print("⚠️  SKIP: Nextflow not installed - cannot run validation tests")
+        print("   Install with: curl -s https://get.nextflow.io | bash")
+        return None
+    
+    project_dir = Path(__file__).parent.parent
+    all_passed = True
+    
+    # Test 1: discovery_mode=true should show warning about ignored whitelist
+    print("\n[Test 1] discovery_mode=true with whitelist (should warn)...")
+    result = subprocess.run(
+        ["nextflow", "run", str(project_dir / "main.nf"),
+         "--discovery_mode", "true",
+         "--mode", "DNAseq",
+         "-preview"],
+        capture_output=True,
+        text=True,
+        cwd=project_dir,
+        env={**os.environ, "JAVA_HOME": os.environ.get("JAVA_HOME", "")}
+    )
+    
+    if "WARNING" in result.stderr or "ignored" in result.stderr.lower():
+        print("   ✅ PASS: Warning raised as expected")
+    elif result.returncode == 0:
+        print("   ⚠️  Warning may not appear in -preview mode, but workflow validated")
+    else:
+        print(f"   ❌ FAIL: Nextflow failed: {result.stderr[:300]}")
+        all_passed = False
+    
+    # Test 2-5: Validate all 4 workflow combinations
+    modes = [
+        ("DNAseq", "false", "whitelist"),
+        ("DNAseq", "true", "discovery"),
+        ("scRNAseq", "false", "whitelist"),
+        ("scRNAseq", "true", "discovery"),
+    ]
+    
+    for i, (data_mode, discovery, desc) in enumerate(modes, start=2):
+        print(f"\n[Test {i}] {data_mode} + {desc} mode...")
+        result = subprocess.run(
+            ["nextflow", "run", str(project_dir / "main.nf"),
+             "--mode", data_mode,
+             "--discovery_mode", discovery,
+             "-preview"],
+            capture_output=True,
+            text=True,
+            cwd=project_dir,
+            env={**os.environ, "JAVA_HOME": os.environ.get("JAVA_HOME", "")}
+        )
+        
+        if result.returncode == 0:
+            print(f"   ✅ PASS: Workflow validated successfully")
+        else:
+            print(f"   ❌ FAIL: {result.stderr[:200]}")
+            all_passed = False
+    
+    return all_passed
+
+
+def test_synthetic_data_structure():
+    """
+    Test that synthetic data matches expected format.
+    """
+    print("\n" + "="*60)
+    print("TEST: Synthetic Data Structure")
+    print("="*60)
+    
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Create test FASTQ
+        fastq_path = Path(tmpdir) / "test_synthetic.fastq.gz"
+        create_synthetic_fastq(fastq_path, TEST_BARCODES[:3], reads_per_barcode=5)
+        
+        # Verify structure
+        with gzip.open(fastq_path, 'rt') as f:
+            lines = f.readlines()
+        
+        # Should have 4 lines per read × 3 barcodes × 5 reads = 60 lines
+        expected_lines = 4 * 3 * 5
+        if len(lines) == expected_lines:
+            print(f"   ✅ PASS: Correct number of lines ({expected_lines})")
+        else:
+            print(f"   ❌ FAIL: Expected {expected_lines} lines, got {len(lines)}")
+            return False
+        
+        # Check first read has correct structure (5' adapter + barcode + 3' adapter)
+        first_seq = lines[1].strip()
+        if ADAPTER_5PRIME in first_seq and ADAPTER_3PRIME in first_seq:
+            print(f"   ✅ PASS: Both 5' and 3' adapters found in sequence")
+        else:
+            print(f"   ❌ FAIL: Expected both adapters in sequence")
+            print(f"   Sequence: {first_seq}")
+            return False
+        
+        # Check barcode is correct length (between the adapters)
+        barcode_start = len(ADAPTER_5PRIME)
+        barcode_region = first_seq[barcode_start:barcode_start + BARCODE_LENGTH]
+        if barcode_region in TEST_BARCODES:
+            print(f"   ✅ PASS: Valid barcode found: {barcode_region}")
+        else:
+            print(f"   ❌ FAIL: Barcode not in expected list: {barcode_region}")
+            return False
+    
+    return True
+
+
+def test_flexiplex_discovery():
+    """
+    Test Flexiplex discovery mode on synthetic data (if flexiplex is installed).
+    """
+    print("\n" + "="*60)
+    print("TEST: Flexiplex Discovery Mode")
+    print("="*60)
+    
+    # Check if flexiplex is available
+    result = subprocess.run(["which", "flexiplex"], capture_output=True, text=True)
+    if result.returncode != 0:
+        print("⚠️  SKIP: Flexiplex not installed")
+        print("   Install from: https://github.com/DavidsonGroup/flexiplex")
+        return None
+    
+    with tempfile.TemporaryDirectory() as tmpdir:
+        tmpdir = Path(tmpdir)
+        
+        # Create test data with mixed barcodes (known + novel)
+        all_barcodes = TEST_BARCODES[:3] + DISCOVERY_BARCODES[:2]
+        fastq_path = tmpdir / "discovery_test.fastq.gz"
+        create_synthetic_fastq(fastq_path, all_barcodes, reads_per_barcode=20)
+        
+        # Decompress for flexiplex
+        fastq_unzipped = tmpdir / "discovery_test.fastq"
+        with gzip.open(fastq_path, 'rb') as f_in:
+            with open(fastq_unzipped, 'wb') as f_out:
+                f_out.write(f_in.read())
+        
+        # Run flexiplex in discovery mode (no -k flag, -f 0 for strict match)
+        # Must include BOTH 5' and 3' adapters in order: -x 5' -b barcode -x 3'
+        print("\n   Running Flexiplex discovery mode...")
+        result = subprocess.run(
+            ["flexiplex",
+             "-x", ADAPTER_5PRIME,
+             "-b", "?" * BARCODE_LENGTH,
+             "-u", "",
+             "-x", ADAPTER_3PRIME,
+             "-f", "0",  # Strict flanking match for discovery
+             "-n", "discovery_test",
+             str(fastq_unzipped)],
+            capture_output=True,
+            text=True,
+            cwd=tmpdir
+        )
+        
+        if result.returncode != 0:
+            print(f"   ❌ FAIL: Flexiplex failed with: {result.stderr}")
+            return False
+        
+        # Check discovery output
+        counts_file = tmpdir / "discovery_test_barcodes_counts.txt"
+        if not counts_file.exists():
+            # Try alternate naming
+            counts_file = tmpdir / "flexiplex_barcodes_counts.txt"
+        
+        if counts_file.exists():
+            with open(counts_file) as f:
+                discovered = f.readlines()
+            
+            discovered_barcodes = [line.split()[0] for line in discovered if line.strip()]
+            
+            print(f"   Discovered {len(discovered_barcodes)} unique barcodes")
+            
+            # Check that our test barcodes were found
+            found_count = sum(1 for b in all_barcodes if b in discovered_barcodes)
+            if found_count >= len(all_barcodes) * 0.8:  # Allow some tolerance
+                print(f"   ✅ PASS: Found {found_count}/{len(all_barcodes)} expected barcodes")
+                return True
+            else:
+                print(f"   ❌ FAIL: Only found {found_count}/{len(all_barcodes)} expected barcodes")
+                return False
+        else:
+            print(f"   ❌ FAIL: Barcode counts file not created")
+            print(f"   Files in tmpdir: {list(tmpdir.iterdir())}")
+            return False
+
+
+def create_test_data_for_repo():
+    """
+    Create test data files to include in the repository.
+    """
+    print("\n" + "="*60)
+    print("Creating test data for repository")
+    print("="*60)
+    
+    project_dir = Path(__file__).parent.parent
+    test_data_dir = project_dir / "tests" / "data"
+    test_data_dir.mkdir(parents=True, exist_ok=True)
+    
+    # 1. Whitelist mode test data (uses known barcodes)
+    whitelist_fastq = test_data_dir / "whitelist_test.fastq.gz"
+    create_synthetic_fastq(whitelist_fastq, TEST_BARCODES, reads_per_barcode=10)
+    
+    # 2. Discovery mode test data (includes novel barcodes)
+    discovery_fastq = test_data_dir / "discovery_test.fastq.gz"
+    mixed_barcodes = TEST_BARCODES[:4] + DISCOVERY_BARCODES
+    create_synthetic_fastq(discovery_fastq, mixed_barcodes, reads_per_barcode=15)
+    
+    # 3. Expected barcodes for discovery mode
+    expected_barcodes = test_data_dir / "expected_discovered_barcodes.txt"
+    create_barcode_whitelist(expected_barcodes, mixed_barcodes)
+    
+    # 4. Create a README for the test data
+    readme_content = """# NextClone Test Data
+
+## Files
+
+### whitelist_test.fastq.gz
+Synthetic FASTQ with known barcodes (matching data/known_barcodes_subset.txt).
+Use for testing whitelist mode (discovery_mode=false).
+
+### discovery_test.fastq.gz  
+Synthetic FASTQ with mixed barcodes - some known, some novel.
+Use for testing discovery mode (discovery_mode=true).
+
+### expected_discovered_barcodes.txt
+List of all barcodes present in discovery_test.fastq.gz.
+Use to validate discovery mode output.
+
+## Barcode Format
+
+- Barcode length: 20bp
+- 3' adapter: GTTTCAGAGCTATGCTGGAAACAGC
+- Read structure: [BARCODE][3' adapter]
+
+## Running Tests
+
+```bash
+# From repository root
+python tests/test_discovery_mode.py
+
+# Or run specific test
+python -c "from tests.test_discovery_mode import test_flexiplex_discovery; test_flexiplex_discovery()"
+```
+"""
+    
+    with open(test_data_dir / "README.md", 'w') as f:
+        f.write(readme_content)
+    
+    print(f"\nTest data created in {test_data_dir}")
+    print("Files:")
+    for f in test_data_dir.iterdir():
+        print(f"  - {f.name}")
+
+
+def main():
+    """Run all tests."""
+    print("="*60)
+    print("NextClone Discovery Mode Test Suite")
+    print("="*60)
+    
+    results = {}
+    
+    # Test 1: Synthetic data structure
+    results['synthetic_data'] = test_synthetic_data_structure()
+    
+    # Test 2: Parameter validation (requires nextflow)
+    results['param_validation'] = test_parameter_validation()
+    
+    # Test 3: Flexiplex discovery (requires flexiplex)
+    results['flexiplex_discovery'] = test_flexiplex_discovery()
+    
+    # Summary
+    print("\n" + "="*60)
+    print("TEST SUMMARY")
+    print("="*60)
+    
+    for test_name, result in results.items():
+        if result is True:
+            status = "✅ PASS"
+        elif result is False:
+            status = "❌ FAIL"
+        else:
+            status = "⚠️  SKIP"
+        print(f"  {test_name}: {status}")
+    
+    # Create test data for repo
+    print("\n")
+    create_test_data_for_repo()
+    
+    # Return exit code
+    failures = sum(1 for r in results.values() if r is False)
+    return failures
+
+
+if __name__ == "__main__":
+    sys.exit(main())