This repository provides a Snakemake workflow for processing AnchorWave MAFs directly into per-contig site outputs. The workflow emits all-sites VCFs, variant-only VCFs, and BED masks from the alignments. Written using Codex. Note that v1.0 is a major rewrite from v0.4, and no longer uses Tassel or GATK.
- Conda
- The environment defined in
argprep.yml
conda env create -f argprep.yml
conda activate argprepCreate or edit a config file such as options.yaml and set:
maf_dir: directory containing*.mafor*.maf.gzreference_fasta: reference FASTA pathresults_dir: output directory
Optional controls:
max_missing_countmax_missing_fractionmask_indelsmask_indel_adjacent_snpstreat_n_as_missingallow_multiallelic_snps
Advanced override:
contigs: restrict the run to specific contigs instead of using the shared MAF/reference contigs automaticallysamples: restrict the run to specific sample basenames instead of using all*.maf/*.maf.gzfiles inmaf_dir
Example CLI override:
snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'Missingness thresholds:
max_missing_countis an absolute number of missing samples allowed at a retained site.max_missing_fractionis a fraction of samples allowed to be missing.- If both are set, the workflow uses the stricter threshold.
- The fraction is converted to a count with downward truncation. For example, with 10 samples,
0.15allows1missing sample.
Indel masking behavior:
mask_indels: truemasks reference positions directly overlapped by deletions.mask_indel_adjacent_snps: trueadditionally masks SNPs immediately adjacent to an insertion or deletion.mask_indels: falsedisables indel-based masking entirely, so indel-overlapped and indel-adjacent sites are judged only by the remaining filters such as missingness.mask_indel_adjacent_snpsonly has an effect whenmask_indels: true.
Local:
snakemake -j 8 --configfile options.yamlSLURM:
snakemake --profile profiles/slurm --configfile options.yamlNote that Slurm will default to using resources (memory, time) defined in profiles/slurm/config.yaml. Parsing the maf file is the most computationally expensive step in the pipeline, and those resources can be modified in the main options.yaml.
Outputs are written under results/sites/ by default:
combined.<contig>.all_sites.vcfcombined.<contig>.variants.vcfcombined.<contig>.masked.bedcombined.<contig>.site_summary.tsvsummary.html
The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.
pytest -qThe repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to give a reasonable example data.
Example:
python scripts/simulate_msprime_indels.py \
--sequence-length 1000000 \ # ancestral sequence length in bp
--num-samples 8 \ # number of haploid samples
--theta 0.01 \ # scaled mutation parameter, 4Ne*mu
--rho 0.01 \ # scaled recombination parameter, 4Ne*r
--ne 10000 \ # effective population size used to convert theta and rho
--indel-rate 1e-8 \ # indel events per bp per generation on branches
--indel-lambda 0.001 \ # exponential indel size rate; mean size = 1/lambda = 1000 bp
--seed 8675309 \ # RNG seed for reproducibility
--out-prefix example_data/example # output prefix for FASTA, MAF, TSV filesOutputs:
<prefix>.reference.fa<prefix>.samples.fa<prefix>.indels.tsv<prefix>.summary.tsv<prefix>.maf/
Summary fields include:
seedsequence_lengthancestral_bp_with_indel_in_ge1_sampletotal_snpssnps_without_overlapping_indel