ARGprep Pipeline

This repository provides a Snakemake workflow for processing AnchorWave MAFs directly into per-contig site outputs. The workflow emits all-sites VCFs, variant-only VCFs, and BED masks from the alignments. Written using Codex. Note that v1.0 is a major rewrite from v0.4, and no longer uses Tassel or GATK.

Requirements

Conda
The environment defined in argprep.yml

Setup

conda env create -f argprep.yml
conda activate argprep

Configure

Create or edit a config file such as options.yaml and set:

maf_dir: directory containing *.maf or *.maf.gz
reference_fasta: reference FASTA path
results_dir: output directory

Optional controls:

max_missing_count
max_missing_fraction
mask_indels
mask_indel_adjacent_snps
treat_n_as_missing
allow_multiallelic_snps

Advanced override:

contigs: restrict the run to specific contigs instead of using the shared MAF/reference contigs automatically
samples: restrict the run to specific sample basenames instead of using all *.maf / *.maf.gz files in maf_dir

Example CLI override:

snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'

Missingness thresholds:

max_missing_count is an absolute number of missing samples allowed at a retained site.
max_missing_fraction is a fraction of samples allowed to be missing.
If both are set, the workflow uses the stricter threshold.
The fraction is converted to a count with downward truncation. For example, with 10 samples, 0.15 allows 1 missing sample.

Indel masking behavior:

mask_indels: true masks reference positions directly overlapped by deletions.
mask_indel_adjacent_snps: true additionally masks SNPs immediately adjacent to an insertion or deletion.
mask_indels: false disables indel-based masking entirely, so indel-overlapped and indel-adjacent sites are judged only by the remaining filters such as missingness.
mask_indel_adjacent_snps only has an effect when mask_indels: true.

Run

Local:

snakemake -j 8 --configfile options.yaml

SLURM:

snakemake --profile profiles/slurm --configfile options.yaml

Note that Slurm will default to using resources (memory, time) defined in profiles/slurm/config.yaml. Parsing the maf file is the most computationally expensive step in the pipeline, and those resources can be modified in the main options.yaml.

Outputs

Outputs are written under results/sites/ by default:

combined.<contig>.all_sites.vcf
combined.<contig>.variants.vcf
combined.<contig>.masked.bed
combined.<contig>.site_summary.tsv
summary.html

The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.

Testing

pytest -q

Simulation Helper

The repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to give a reasonable example data.

Example:

python scripts/simulate_msprime_indels.py \
  --sequence-length 1000000 \  # ancestral sequence length in bp
  --num-samples 8 \  # number of haploid samples
  --theta 0.01 \  # scaled mutation parameter, 4Ne*mu
  --rho 0.01 \  # scaled recombination parameter, 4Ne*r
  --ne 10000 \  # effective population size used to convert theta and rho
  --indel-rate 1e-8 \  # indel events per bp per generation on branches
  --indel-lambda 0.001 \  # exponential indel size rate; mean size = 1/lambda = 1000 bp
  --seed 8675309 \  # RNG seed for reproducibility
  --out-prefix example_data/example  # output prefix for FASTA, MAF, TSV files

Outputs:

<prefix>.reference.fa
<prefix>.samples.fa
<prefix>.indels.tsv
<prefix>.summary.tsv
<prefix>.maf/

Summary fields include:

seed
sequence_length
ancestral_bp_with_indel_in_ge1_sample
total_snps
snps_without_overlapping_indel

Name		Name	Last commit message	Last commit date
Latest commit History 287 Commits
example_data		example_data
profiles/slurm		profiles/slurm
scripts		scripts
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
argprep.yml		argprep.yml
changelog.md		changelog.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ARGprep Pipeline

Requirements

Setup

Configure

Run

Outputs

Testing

Simulation Helper

About

Uh oh!

Releases 2

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ARGprep Pipeline

Requirements

Setup

Configure

Run

Outputs

Testing

Simulation Helper

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages