Add dna-playground — k-mer offset histogram DNA alignment playground #1

Draft
Copilot wants to merge 1 commit into main from copilot/build-rust-dna-alignment-playground

Copilot AI commented May 12, 2026

Introduces a new dna-playground workspace crate for experimenting with k-mer-based DNA alignment against the hg38 reference FASTA. It generates synthetic reads with configurable mutations, aligns them by offset-histogram voting across multiple k values, refines each candidate with Needleman-Wunsch alignment over a local reference window, and evaluates the results against ground truth.
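To make the voting step concrete, here is a minimal sketch of offset-histogram voting over 2-bit packed k-mers. It is not the crate's code: the function names (`pack_kmer`, `build_index`, `vote_offsets`, `best_offset`) are hypothetical, the index is a plain `HashMap` rebuilt per k, there is no frequency filter or clustering, and only k ≤ 31 is handled. The offset definition (ref_pos − query_pos) and the weight (k) are taken from the description above.

```rust
use std::collections::HashMap;

/// Pack a k-mer (1 <= k <= 31) into a u64 using 2 bits per base.
/// Returns None if the k-mer contains a byte other than A/C/G/T (e.g. N).
fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    let mut packed = 0u64;
    for &b in kmer {
        let code = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Map each packed reference k-mer to the positions where it starts (stride 1).
fn build_index(reference: &[u8], k: usize) -> HashMap<u64, Vec<usize>> {
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (pos, window) in reference.windows(k).enumerate() {
        if let Some(key) = pack_kmer(window) {
            index.entry(key).or_default().push(pos);
        }
    }
    index
}

/// Accumulate votes per offset (ref_pos - query_pos) across several k values.
/// Each k-mer hit adds a vote of weight k, so longer matches count for more.
fn vote_offsets(reference: &[u8], query: &[u8], ks: &[usize]) -> HashMap<i64, u64> {
    let mut votes: HashMap<i64, u64> = HashMap::new();
    for &k in ks {
        let index = build_index(reference, k);
        for (qpos, window) in query.windows(k).enumerate() {
            let Some(key) = pack_kmer(window) else { continue };
            if let Some(positions) = index.get(&key) {
                for &rpos in positions {
                    *votes.entry(rpos as i64 - qpos as i64).or_insert(0) += k as u64;
                }
            }
        }
    }
    votes
}

/// Pick the offset with the most votes; ties go to the smaller offset.
fn best_offset(votes: &HashMap<i64, u64>) -> Option<(i64, u64)> {
    votes
        .iter()
        .map(|(&offset, &v)| (offset, v))
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
}
```

For a query copied verbatim from the reference, every k-mer votes for the same offset, so the histogram has a single sharp peak at the read's true start position. SNPs and indels remove or shift some votes, which is presumably why the PR clusters nearby offsets before scoring.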

New crate: dna-playground

Core modules

| Module | Purpose |
| --- | --- |
| fasta.rs | needletail FASTA reader (plain + gz), batch and streaming |
| dna.rs | Complement/RC, hard/soft masking; Orientation + SoftmaskMode enums |
| kmer.rs | 2-bit packed k-mers; PackedKmer::U64 (k≤31) / PackedKmer::U128 (k≥32); overlapping KmerIter (stride 1) |
| index.rs | Per-chromosome k-mer index; frequency filter drops high-occurrence k-mers (repeat noise) |
| generate.rs | Synthetic reads: SNPs, indels, hard/soft masking, optional RC; serde ground-truth JSONL |
| align.rs | Multi-k offset voting (offset = ref_pos − query_pos), clustering, weighted scoring (weight(k) = k) |
| local_align.rs | Needleman-Wunsch over a reference window extracted around the histogram peak |
| cigar.rs | CIGAR string from aligned strings (SAM M/I/D) |
| metrics.rs | Identity, query/reference coverage, MAPQ-like confidence, ground-truth error |
| report.rs | Aligned FASTA pairs, JSONL alignment report, debug offset-vote CSV, text visual alignment |
| main.rs | CLI: generate, align, evaluate, index subcommands |
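As an illustration of what the cigar.rs and metrics.rs rows describe, the sketch below derives a SAM-style CIGAR string and an identity value from two gapped alignment rows. The function names and signatures are hypothetical, not the crate's API. It assumes `-` marks a gap, emits only the M/I/D operations named in the table, and defines identity as matching columns divided by all alignment columns, which may differ from the definition used in the PR.

```rust
/// Build a SAM-style CIGAR string from two gapped alignment rows of equal length.
/// A gap in the reference row is an insertion (I), a gap in the query row is a
/// deletion (D), and every other column is M (match or mismatch).
/// Returns None for rows of unequal length or a column with gaps in both rows.
fn cigar_from_aligned(ref_row: &str, query_row: &str) -> Option<String> {
    if ref_row.len() != query_row.len() {
        return None;
    }
    let mut cigar = String::new();
    let mut run: Option<(char, usize)> = None; // current operation and its run length
    for (r, q) in ref_row.bytes().zip(query_row.bytes()) {
        let op = match (r, q) {
            (b'-', b'-') => return None,
            (b'-', _) => 'I',
            (_, b'-') => 'D',
            _ => 'M',
        };
        run = match run {
            // Same operation as before: extend the run.
            Some((prev, n)) if prev == op => Some((prev, n + 1)),
            // Operation changed: flush the finished run and start a new one.
            Some((prev, n)) => {
                cigar.push_str(&format!("{n}{prev}"));
                Some((op, 1))
            }
            None => Some((op, 1)),
        };
    }
    if let Some((op, n)) = run {
        cigar.push_str(&format!("{n}{op}"));
    }
    Some(cigar)
}

/// Fraction of alignment columns in which both rows hold the same base.
/// Gap columns count toward the denominator but never as identical.
fn identity(ref_row: &str, query_row: &str) -> f64 {
    let cols = ref_row.len().min(query_row.len());
    if cols == 0 {
        return 0.0;
    }
    let same = ref_row
        .bytes()
        .zip(query_row.bytes())
        .filter(|(r, q)| r == q && *r != b'-')
        .count();
    same as f64 / cols as f64
}
```

With the rows `ACGT-ACGT` (reference) and `ACGTTAC-T` (query), the columns read M M M M I M M D M, giving the CIGAR `4M1I2M1D1M` and an identity of 7/9.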

CLI usage

# Generate synthetic reads with ground truth
dna-playground generate \
  --reference hg38/hg38.fa \
  --read-count 1000 --read-len 150 \
  --snp-rate 0.001 --insertion-rate 0.0005 --deletion-rate 0.0005 \
  --out reads.fa --truth reads.truth.jsonl

# Align reads (index built in memory)
dna-playground align \
  --reference hg38/hg38.fa \
  --reads reads.fa \
  --k 17 --k 21 --k 25 \
  --out-aligned aligned.fa \
  --out-report aligned.jsonl

# Evaluate predicted vs ground truth
dna-playground evaluate \
  --truth reads.truth.jsonl \
  --report aligned.jsonl

Workspace additions

  • rand = "0.10.1", serde = "1.0.228", serde_json = "1.0.149" added to [workspace.dependencies]
  • "dna-playground" added to workspace members

Verification

  • cargo build --workspace
  • cargo clippy -p dna-playground ✅ (zero warnings)
  • cargo test -p dna-playground ✅ (10/10 tests pass)

Copilot AI requested a review from sunsided May 12, 2026 16:33
@sunsided sunsided changed the title feat: add dna-playground — k-mer offset histogram DNA alignment playground Add dna-playground — k-mer offset histogram DNA alignment playground May 12, 2026