Skip to content

RILAB/argprep

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

287 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ARGprep Pipeline

This repository provides a Snakemake workflow for processing AnchorWave MAFs directly into per-contig site outputs. The workflow emits all-sites VCFs, variant-only VCFs, and BED masks from the alignments. Written using Codex. Note that v1.0 is a major rewrite from v0.4, and no longer uses Tassel or GATK.

Requirements

  • Conda
  • The environment defined in argprep.yml

Setup

conda env create -f argprep.yml
conda activate argprep

Configure

Create or edit a config file such as options.yaml and set:

  • maf_dir: directory containing *.maf or *.maf.gz
  • reference_fasta: reference FASTA path
  • results_dir: output directory

Optional controls:

  • max_missing_count
  • max_missing_fraction
  • mask_indels
  • mask_indel_adjacent_snps
  • treat_n_as_missing
  • allow_multiallelic_snps

Advanced override:

  • contigs: restrict the run to specific contigs instead of using the shared MAF/reference contigs automatically
  • samples: restrict the run to specific sample basenames instead of using all *.maf / *.maf.gz files in maf_dir

Example CLI override:

snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'

Missingness thresholds:

  • max_missing_count is an absolute number of missing samples allowed at a retained site.
  • max_missing_fraction is a fraction of samples allowed to be missing.
  • If both are set, the workflow uses the stricter threshold.
  • The fraction is converted to a count with downward truncation. For example, with 10 samples, 0.15 allows 1 missing sample.

Indel masking behavior:

  • mask_indels: true masks reference positions directly overlapped by deletions.
  • mask_indel_adjacent_snps: true additionally masks SNPs immediately adjacent to an insertion or deletion.
  • mask_indels: false disables indel-based masking entirely, so indel-overlapped and indel-adjacent sites are judged only by the remaining filters such as missingness.
  • mask_indel_adjacent_snps only has an effect when mask_indels: true.

Run

Local:

snakemake -j 8 --configfile options.yaml

SLURM:

snakemake --profile profiles/slurm --configfile options.yaml

Note that Slurm will default to using resources (memory, time) defined in profiles/slurm/config.yaml. Parsing the maf file is the most computationally expensive step in the pipeline, and those resources can be modified in the main options.yaml.

Outputs

Outputs are written under results/sites/ by default:

  • combined.<contig>.all_sites.vcf
  • combined.<contig>.variants.vcf
  • combined.<contig>.masked.bed
  • combined.<contig>.site_summary.tsv
  • summary.html

The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.

Testing

pytest -q

Simulation Helper

The repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to give a reasonable example data.

Example:

python scripts/simulate_msprime_indels.py \
  --sequence-length 1000000 \  # ancestral sequence length in bp
  --num-samples 8 \  # number of haploid samples
  --theta 0.01 \  # scaled mutation parameter, 4Ne*mu
  --rho 0.01 \  # scaled recombination parameter, 4Ne*r
  --ne 10000 \  # effective population size used to convert theta and rho
  --indel-rate 1e-8 \  # indel events per bp per generation on branches
  --indel-lambda 0.001 \  # exponential indel size rate; mean size = 1/lambda = 1000 bp
  --seed 8675309 \  # RNG seed for reproducibility
  --out-prefix example_data/example  # output prefix for FASTA, MAF, TSV files

Outputs:

  • <prefix>.reference.fa
  • <prefix>.samples.fa
  • <prefix>.indels.tsv
  • <prefix>.summary.tsv
  • <prefix>.maf/

Summary fields include:

  • seed
  • sequence_length
  • ancestral_bp_with_indel_in_ge1_sample
  • total_snps
  • snps_without_overlapping_indel

About

Snakemake pipeline for generating SINGER input files from whole genome alignment .maf files.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages