In-silico off-target triage with seed-mismatch scoring
- Overview
- Project Layout
- Installation
- Quick Start
- Step-by-Step Tutorial
- Understanding the Scripts
- Output Format
- Troubleshooting
- FAQ
This repository provides tools to extend CRISPR-Sheriff with local sequence homology scoring. It helps identify and prioritize potential off-target sites by:
- Seed-aware alignment: Counts mismatches in the critical seed region (positions 1-8)
- PAM detection: Identifies canonical NGG/NAG PAMs near candidate sites
- Strand-specific scoring: Tests both DNA strands for best alignment
- Integration-ready: Outputs merge with Sheriff's results and other off-target predictors
sheriff call -i dedup.bam --min_cells 1 --no_blacklist -o loose.tsvThe workflow consists of three main scripts:
sheriff2homology.py- Converts Sheriff's relaxed TSV output to candidate formatedit_sites2loose.py- Alternative converter for Sheriff's BED output (legacy)pairwise_homology.py- Performs guide-to-genome alignment and scoring
homology-extension/
├── environment.yml # Conda environment specification
├── README.md # This file
├── scripts/
│ ├── sheriff2homology.py # Sheriff TSV → candidates.tsv (PRIMARY)
│ ├── edit_sites2loose.py # Sheriff BED → loose.tsv (legacy)
│ └── pairwise_homology.py # Main homology scoring script
├── data/ # Example data files
│ ├── guides.example.fa
│ ├── sheriff_edit_sites.example.bed
│ └── test_genome.fa
├── tests/
│ └── smoke_test.sh # Basic functionality test
└── .gitignore
- Conda or Mamba package manager
- Access to a reference genome (e.g., hg38.fa)
- Sheriff output files (edit_sites.bed)
# Clone the repository
git clone https://github.com/YOUR-LAB/homology-extension.git
cd homology-extension
# Create and activate the conda environment
conda env create -f environment.yml
conda activate sheriff-homology
# Run the smoke test to verify installation
bash tests/smoke_test.sh# 1. Run Sheriff in RELAXED mode (critical!)
sheriff call -i dedup.bam --min_cells 1 --no_blacklist -o loose.tsv
# 2. Convert Sheriff output to candidates format
python scripts/sheriff2homology.py \
--sheriff loose.tsv \
--out candidates.tsv
# 3. Score homology for your guides
python scripts/pairwise_homology.py \
--guides guides.fa \
--candidates candidates.tsv \
--fasta /path/to/hg38.fa \
--out homology_scores.tsvBefore using these tools, you need Sheriff's output files. CRITICAL: Run Sheriff in relaxed mode to capture all potential sites for homology scoring.
| Mode | Command | Use Case |
|---|---|---|
| Strict (default) | sheriff call -i dedup.bam -o strict.tsv |
Publication figures; drops low-cell-count events |
| Relaxed ✅ | sheriff call -i dedup.bam --min_cells 1 --no_blacklist -o loose.tsv |
Homology scoring; keeps ALL sites for re-ranking |
# For homology scoring, ALWAYS use relaxed mode:
sheriff call -i dedup.bam --min_cells 1 --no_blacklist -o loose.tsvWhy relaxed mode?
--min_cells 1: Keeps single-cell events that might be real off-targets--no_blacklist: Retains repeat-masked regions for comprehensive scoring- We apply our own filtering AFTER attaching homology scores
Sheriff produces:
loose.tsv- All candidate sites (relaxed mode output)edit_site_info.txt- Detailed metadata per site- Various per-cell edit count files
The pairwise_homology.py script expects a specific 5-column format. Use the appropriate converter:
python scripts/sheriff2homology.py \
--sheriff loose.tsv \
--out candidates.tsvpython scripts/edit_sites2loose.py \
--bed results/edit_sites.bed \
--out candidates.tsvWhat this does:
- Centers each BED interval to a single base pair (the predicted cut site)
- Removes genome build prefixes (e.g.,
hg38_chr1→chr1) - Adds a strand column (set to
.for unknown) - Reorders columns to:
chr start end strand site_id
Example transformation:
# Input (Sheriff BED):
hg38_chr1 1000 1100 site_001
# Output (loose.tsv):
chr1 1050 1051 . site_001
Now score each guide against each candidate site:
python scripts/pairwise_homology.py \
--guides data/guides.fa \
--candidates results/loose.tsv \
--fasta /path/to/hg38.fa \
--out results/homology_scores.tsvWhat this does for each site × guide combination:
- Extracts genomic context: Pulls ±100bp around the cut site
- Aligns on both strands: Uses Smith-Waterman local alignment
- Counts mismatches: Total and seed region (positions 1-8)
- Checks PAM presence: Looks for NGG/NAG at the expected position
- Reports best match: Keeps highest-scoring alignment
Runtime: ~0.2 seconds per 100 sites per guide (pure Python implementation)
Combine the homology scores with Sheriff's cell-level data and other off-target predictors:
import pandas as pd
# Load all data sources
homology = pd.read_csv("results/homology_scores.tsv", sep="\t")
sheriff = pd.read_csv("results/edit_site_info.txt", sep="\t")
offinder = pd.read_csv("results/cas_offinder.txt", sep="\t") # If available
# Merge on site_id
combined = homology.merge(sheriff, on="site_id", how="left")
combined = combined.merge(offinder, on="site_id", how="left")
# Filter by homology criteria
filtered = combined[
(combined["mm_total"] <= 4) & # Max 4 total mismatches
(combined["mm_seed"] <= 1) & # Max 1 seed mismatch
(combined["pam_present"] == 1) # Canonical PAM required
]
# Sort by evidence strength
filtered = filtered.sort_values(
["score", "cell_count"], # Homology score, then cell support
ascending=[False, False]
)
filtered.to_csv("results/prioritized_offtargets.tsv", sep="\t", index=False)This adapter script handles format conversion between Sheriff and the homology scorer.
Key operations:
# Collapse window to center point
centre = ((df["start"] + df["end"]) // 2).astype(int)
df["start"] = centre
df["end"] = centre + 1 # BED is half-open
# Clean chromosome names
df["chr"] = df["chr"].str.replace(r"^hg\d+_", "", regex=True)
# Add unknown strand
df["strand"] = "."The main scoring engine that performs local alignment.
Core algorithm:
# For each candidate site
for site in candidates:
# Extract ±100bp window
window = genome[chr][center-100:center+100]
# Try each guide
for guide in guides:
# Check both strands
for strand in ["+", "-"]:
# Slide 20nt window
for offset in range(len(seq) - 20):
# Local alignment
alignment = pairwise2.align.localms(
guide_seq, genomic_seq,
match=1, mismatch=-1,
gap_open=-0.8, gap_extend=-0.5
)
# Count mismatches
mm_total = count_mismatches(alignment)
mm_seed = count_mismatches(alignment[:8])
# Keep best scoring
if score > best_score:
best_hit = current_hitScoring parameters:
- Match: +1
- Mismatch: -1
- Gap open: -0.8
- Gap extend: -0.5
The homology_scores.tsv file contains:
| Column | Description |
|---|---|
site_id |
Sheriff's site identifier |
guide_id |
Guide name from FASTA header |
strand |
Best-scoring strand (+/-) |
genome_offset |
Position relative to cut site (-100 to +100) |
mm_total |
Total mismatches across 20nt |
mm_seed |
Mismatches in seed region (pos 1-8) |
score |
Smith-Waterman alignment score |
pam_present |
1 if NGG/NAG found, 0 otherwise |
aln_guide |
Aligned guide sequence |
aln_locus |
Aligned genomic sequence |
Example row:
site_001 gRNA_1 + -5 2 0 17.2 1 GAGTCCGAGCAGAAGAAGAA GAGTCCGAGCAGAAGAAGAA
Issue: "Chromosome not found in reference"
- Check that chromosome names match between your files
- The converter removes
hg38_prefixes - ensure your reference uses standard names (chr1, not hg38_chr1)
Issue: "No PAM detected for on-target site"
- Verify your guides include only the 20nt protospacer (not the PAM)
- Check that the reference genome matches your experimental organism
Issue: Script runs slowly
- Consider parallelizing by splitting the candidate file
- For >10k sites, consider using a cluster with GNU parallel
| Check | Expected Result |
|---|---|
| On-target alignment | mm_total ≤ 1, mm_seed = 0, pam_present = 1 |
| Known off-target | Appropriate mm_total/mm_seed for validated site |
| Runtime | <1 second per 1000 site-guide pairs |
Q: My guides are 19nt or 21nt, not 20nt. What should I do?
A: Modify line containing range(len(seq) - 20) in pairwise_homology.py:
# Change 20 to your guide length
for offset in range(len(seq) - 19): # For 19nt guidesQ: Can I use a different scoring matrix?
A: Yes, modify the pairwise2.align.localms() parameters:
# Current: match=1, mismatch=-1, gap_open=-0.8, gap_extend=-0.5
aln = pairwise2.align.localms(
guide_seq, sub,
2, -3, -5, -2, # Your custom scores
one_alignment_only=True
)Q: How do I filter results?
A: Common filtering criteria:
# Stringent (likely off-targets)
df[(df["mm_total"] <= 3) & (df["mm_seed"] == 0) & (df["pam_present"] == 1)]
# Permissive (possible off-targets)
df[(df["mm_total"] <= 5) & (df["mm_seed"] <= 2)]
# Exclude non-canonical PAMs
df[df["pam_present"] == 1]Q: Can I run this on mouse (mm10) or other genomes?
A: Yes! Just provide the appropriate reference FASTA. The edit_sites2loose.py script already handles mm10 prefixes.
Q: How does this compare to Cas-OFFinder or CRISPOR?
A: This tool is complementary:
- Cas-OFFinder: Fast genome-wide search with mismatches/bulges
- CRISPOR: Aggregates multiple off-target predictors
- This tool: Focuses on Sheriff-identified sites with detailed alignment
Use all three for comprehensive off-target assessment.
If you use this tool, please cite:
- The Sheriff paper (when published)
- BioPython: Cock et al., 2009
- This repository
MIT License - See LICENSE file for details
For issues or questions:
- Check the FAQ and Troubleshooting sections
- Open an issue on GitHub
- Contact: your-email@institution.edu
Happy off-target hunting! 🧬🔬