A technical-scientific guide to genomes_analyzer.py
- Abstract
- Introduction
- What's new
- Genomes Analyzer Pipeline
- How to Use the Genomes Analyzer
- Background execution & monitoring
- Conclusion
- Appendix 1 – Command Line Tools & Typical Usage
- Appendix 2 – Additional Important Modules
High-throughput DNA sequencing has transformed biological and clinical research, but turning raw reads into actionable variant calls still requires a reliable, explainable, and resource-efficient pipeline. Genomes Analyzer is a Python-driven workflow designed to take one or more whole-genome or exome samples from raw FASTQ (or pre-aligned BAM/CRAM) to compressed, indexed VCF. It emphasizes clear provenance, conservative defaults, and transparent logging while remaining pragmatic about compute and memory on commodity Linux workstations. The pipeline integrates widely used open-source tools: FastQC for read quality control, cutadapt for adapter/quality trimming, bwa-mem2 for alignment, samtools for sorting/indexing/duplicate marking, and a selectable variant-calling backend, either GATK (HaplotypeCaller → GenotypeGVCFs) or BCFtools (mpileup → call → norm). To keep runtimes reasonable, Genomes Analyzer splits the reference into shards (BED intervals) and executes them in parallel with safe concatenation and indexing, while preserving chromosome order. It prints human-readable progress dashboards (including the shard BED preview and per-shard heartbeats) so researchers can monitor long runs without guesswork.
This document introduces the pipeline to readers without assuming prior genomics expertise. We define essential terms (e.g., FASTQ, BAM/CRAM, VCF, read mapping, duplicate marking, genotyping, normalization), explain each processing step and its rationale, and show representative snippets of intermediate and final outputs. We also document configuration via YAML, including paths, sample declarations, quality/coverage thresholds, and parallelization parameters. Finally, we provide an appendix detailing the external genomics tools used, typical command patterns, and commonly tuned parameters. The goal is to enable reproducible analyses with sensible defaults while making it straightforward to adapt the pipeline to different datasets, hardware budgets, and scientific objectives.
Genomes Analyzer is a modular, script-based pipeline for small teams or single investigators who need trustworthy variant calls from short-read sequencing. The script genomes_analyzer.py orchestrates a complete workflow from raw sequencing reads to an indexed Variant Call Format (VCF) file. The tool is particularly useful when:
- You want a transparent pipeline using standard tools with well-understood behavior.
- You need to choose between GATK and BCFtools callers without rewriting your workflow.
- You want robust parallel sharding of the reference genome to utilize multi-core CPUs.
- You value clear logging: what is running, on which intervals, and how progress looks over time.
Recent updates expanded the pipeline, improved resilience, and enriched the YAML configuration. Highlights:
- Neural Module: AI-powered DNA analysis using Google DeepMind's AlphaGenome for functional predictions (gene expression, epigenetics, chromatin accessibility, variant effects). See neural_module/docs/NEURAL_MODULE.md.
- Paternity analysis: likelihood-based SNP evaluation with configurable thresholds and optional use of VEP allele frequencies.
- Ancestry (ADMIXTURE supervised): supervised ADMIXTURE using HGDP+1KG reference, with QC, pruning and category collapsing.
- Idempotent ancestry pipeline: reuses existing outputs, checks for prepared references and intermediate PLINK files.
- More robust bcftools execution: all heavy commands run via `conda run -n genomics bash -lc` for consistent environments.
- Background-friendly logging: optional wider logs, emojis, and colors when running detached.
- New config profiles: low memory, latest reference (GENCODE r46), and a "monster" profile for 128 cores/256 GB.
- Universal environment bootstrap: `start_genomics_universal.sh` auto-detects Conda locations and activates the `genomics` environment.
| Type | Description | Examples |
|---|---|---|
| Inputs | Paired FASTQ files per sample (or a pre-aligned BAM/CRAM) and a reference FASTA with its index and dictionary. | fastq/NA12878_R1.fastq.gz, fastq/NA12878_R2.fastq.gz, refs/GRCh38.d1.vd1.fa |
| Primary outputs | Aligned BAM/BAI, per-shard VCFs, final VCF/VCF index, and quality-control reports. | bam/NA12878.mkdup.bam, vcf/NA12878.vcf.gz, qc/NA12878_fastqc.html |
Before running the workflow, make sure the environment is prepared as described in How to Use the Genomes Analyzer. The workflow transforms raw sequencing reads (unaltered sequences produced by high-throughput instruments) into variant calls: structured records that capture single-nucleotide changes, small insertions/deletions, and other deviations from a reference genome, typically stored in the text-based Variant Call Format (VCF). These stages are well-defined, and each acronym is introduced before use so that readers new to genomics can follow along.
```text
FASTQ
├── Quality control (FastQC)
├── Adapter trimming (cutadapt)
├── Alignment (BWA-MEM2)
├── Sort & index (samtools)
├── Mark duplicates (samtools markdup)
├── [Optional] Base Quality Score Recalibration (GATK)
├── CRAM conversion & coverage (mosdepth)
├── Shard reference genome (BED intervals)
├── Variant calling
│   ├── GATK: HaplotypeCaller → GenotypeGVCFs
│   └── BCFtools: mpileup → call → norm
├── Concatenate shards (preserve order)
├── Final VCF quality control
├── Gene list & per-sample coverage
├── Pairwise & trio comparisons
├── Paternity analysis (likelihood-based)
├── Ancestry (ADMIXTURE supervised)
└── [Optional] RNA-seq module
```
1. Read Quality Control – FastQC
FASTQ files store sequencing reads and their per-base quality scores. FastQC scans these files and produces HTML reports summarizing quality metrics such as Phred scores, nucleotide composition, and overrepresented sequences.
Why it matters: Early detection of poor-quality cycles or adapter contamination prevents misleading alignments and spares compute time.
Representative snippet (fastqc_data.txt):
```text
>>Per base sequence quality  pass
#Base  Mean  Median  Lower  Upper
1      33.8  34      31     36
...
>>Overrepresented sequences  warn
```
Outputs: qc/<sample>_R1_fastqc.html, qc/<sample>_R2_fastqc.html plus compressed archives containing raw metrics.
2. Adapter and Quality Trimming – cutadapt
Sequencing libraries often carry leftover adapter sequences and low-quality ends. cutadapt removes these artifacts and can filter short reads.
Why it matters: Adapter sequences and low-quality bases reduce mapping accuracy and inflate false variant calls.
Representative snippet (log extract):
```text
=== Summary ===
Total read pairs processed:       374,102,311
Pairs written (passing filters):  372,918,421 (99.7%)
Total basepairs processed:        112.2 Gbp
Quality-trimmed:                    1.6 Gbp (1.4%)
```
Outputs: trimmed/<sample>_R1.fastq.gz, trimmed/<sample>_R2.fastq.gz.
3. Alignment – BWA-MEM2
bwa-mem2 maps each read pair to the reference genome. The output is a SAM (Sequence Alignment/Map) stream that records candidate genomic coordinates, alignment scores, and flags. Alignments are typically converted on the fly to the binary BAM format.
Why it matters: Accurate mapping is a prerequisite to reliable variant detection. Each read receives a mapping quality score indicating confidence in its genomic position.
Representative SAM header:
```text
@SQ  SN:chr1  LN:248956422
@SQ  SN:chr2  LN:242193529
@PG  ID:bwa-mem2  PN:bwa-mem2  VN:2.2.1  CL:bwa-mem2 mem -t 16 ...
```
Outputs: aligned/<sample>.sam (usually streamed downstream).
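Because the SAM stream is large, alignment output is usually piped straight into sorting rather than written to disk. A minimal sketch of that pattern; sample name, read-group fields, thread counts, and paths are illustrative placeholders, not the script's exact invocation:

```bash
# Align paired reads and coordinate-sort in one stream; no intermediate SAM file.
bwa-mem2 mem -t 16 \
  -R "@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA" \
  refs/GRCh38.d1.vd1.fa \
  trimmed/NA12878_R1.fastq.gz trimmed/NA12878_R2.fastq.gz \
| samtools sort -@ 8 -m 512M -o bam/NA12878.sorted.bam -
samtools index bam/NA12878.sorted.bam
```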
4. Sorting & Indexing – samtools
samtools sort arranges BAM records by genomic coordinate and writes an index (.bai) to enable random access.
Why it matters: Variant callers expect coordinate-sorted and indexed BAM files to quickly fetch reads overlapping regions of interest.
Outputs: bam/<sample>.sorted.bam, bam/<sample>.sorted.bam.bai.
5. Duplicate Marking – samtools markdup
Polymerase chain reaction (PCR) amplification and optical artifacts can yield duplicate reads: multiple observations of the same original DNA fragment. samtools markdup flags these duplicates so that callers can ignore them.
Why it matters: Treating duplicates as independent evidence biases allele counts and may cause false positives.
Outputs: bam/<sample>.mkdup.bam, bam/<sample>.mkdup.bam.bai.
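samtools markdup relies on mate tags added by samtools fixmate on name-collated input, so the usual chain looks roughly like this. This is a sketch of the documented samtools workflow; file names are placeholders and the pipeline's exact invocation may differ:

```bash
# collate groups reads by name; fixmate -m adds the ms/MC tags markdup needs;
# sort restores coordinate order before duplicates are flagged.
samtools collate -@ 8 -O -u bam/NA12878.sorted.bam \
| samtools fixmate -m -u - - \
| samtools sort -@ 8 -u - \
| samtools markdup -@ 8 - bam/NA12878.mkdup.bam
samtools index bam/NA12878.mkdup.bam
```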
6. Optional Base Quality Score Recalibration – GATK BQSR
Sequencing machines sometimes misestimate base quality scores. Base Quality Score Recalibration (BQSR) uses known variant sites to adjust these scores.
Why it matters: More accurate base qualities improve probabilistic models in downstream callers. This step is optional and can be skipped to save time.
Outputs: recalibrated BAM (if applied) plus BQSR reports.
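BQSR is a two-pass procedure. A sketch of the standard GATK pattern; the known-sites resource and all paths are illustrative, and the pipeline may wire these differently:

```bash
# Pass 1: learn a recalibration model from mismatches outside known variant sites.
gatk --java-options "-Xmx24g" BaseRecalibrator \
  -R refs/GRCh38.d1.vd1.fa -I bam/NA12878.mkdup.bam \
  --known-sites refs/dbsnp.vcf.gz \
  -O bam/NA12878.recal.table
# Pass 2: rewrite base qualities using the learned model.
gatk --java-options "-Xmx24g" ApplyBQSR \
  -R refs/GRCh38.d1.vd1.fa -I bam/NA12878.mkdup.bam \
  --bqsr-recal-file bam/NA12878.recal.table \
  -O bam/NA12878.recal.bam
```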
7. CRAM Conversion & Coverage – samtools + mosdepth
After duplicate marking (and optional BQSR) the pipeline compresses BAM files to CRAM and runs mosdepth to summarize per-base coverage.
Why it matters: CRAM greatly reduces disk usage while coverage metrics reveal under-covered regions and overall depth.
Outputs: bam/<sample>.mkdup.cram, bam/<sample>.mkdup.cram.crai, bam/<sample>.mosdepth.summary.txt.
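A sketch of the two commands involved (paths illustrative; note that CRAM reading and writing both need the reference FASTA):

```bash
# Convert BAM to reference-based CRAM and index it.
samtools view -@ 8 -C -T refs/GRCh38.d1.vd1.fa \
  -o bam/NA12878.mkdup.cram bam/NA12878.mkdup.bam
samtools index bam/NA12878.mkdup.cram
# Summarize coverage; writes NA12878.mosdepth.summary.txt among other files.
mosdepth -t 4 --fasta refs/GRCh38.d1.vd1.fa bam/NA12878 bam/NA12878.mkdup.cram
```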
8. Reference Sharding (BED Intervals)
Large genomes are divided into manageable BED intervals ("shards"). Each shard is processed independently to leverage parallelism.
Why it matters: Short-read variant callers parallelize poorly within a single region; sharding provides coarse-grained parallelism across chromosomes/blocks and reduces wall-clock time.
Outputs: vcf/shards/<sample>/<sample>.part_XX.vcf.gz and corresponding .tbi indexes.
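One simple way to build such shards is to bin whole chromosomes from the FASTA index (`.fai`) so that each of N BED files covers roughly the same number of bases. This is an illustrative sketch, not necessarily the script's exact sharding logic:

```bash
# Greedy binning of chromosomes by length into N shard BED files.
N=16
awk -v n="$N" 'BEGIN{OFS="\t"}
  {
    # pick the currently lightest shard for this chromosome
    best = 1
    for (b = 2; b <= n; b++) if (load[b] < load[best]) best = b
    load[best] += $2
    f = sprintf("shard_%02d.bed", best)
    print $1, 0, $2 > f
  }' refs/GRCh38.d1.vd1.fa.fai
```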
9. Variant Calling – GATK or BCFtools
At this stage, aligned reads are converted into genomic variants. Two backends are available:
A. GATK HaplotypeCaller → GenotypeGVCFs
- HaplotypeCaller analyzes each shard, performing local re-assembly to produce per-sample genomic VCFs (gVCFs).
- GenotypeGVCFs merges gVCFs and emits a final VCF with genotypes.
Outputs: vcf/<sample>.g.vcf.gz, vcf/<sample>.vcf.gz and their indexes.
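A per-shard sketch of this backend; the interval file, heap size, and paths are illustrative:

```bash
# Call one shard into a gVCF, then genotype it; the pipeline repeats this
# per shard and merges the results afterwards.
gatk --java-options "-Xmx24g" HaplotypeCaller \
  -R refs/GRCh38.d1.vd1.fa -I bam/NA12878.mkdup.bam \
  -L shard_01.bed -ERC GVCF \
  -O vcf/shards/NA12878/NA12878.part_01.g.vcf.gz
gatk GenotypeGVCFs \
  -R refs/GRCh38.d1.vd1.fa \
  -V vcf/shards/NA12878/NA12878.part_01.g.vcf.gz \
  -O vcf/shards/NA12878/NA12878.part_01.vcf.gz
```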
B. BCFtools mpileup → call → norm
- mpileup summarizes read evidence at each position, applying mapping-quality (MAPQ) and base-quality (BASEQ) filters.
- call infers SNPs and indels using a multiallelic model.
- norm left-aligns indels and splits multiallelic records for compatibility.
Outputs: sharded VCFs (part_XX.vcf.gz) and final merged VCF (vcf/<sample>.vcf.gz).
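A per-shard sketch of this backend, using the default thresholds from the YAML tables later in this guide (paths illustrative):

```bash
# mpileup filters by MAPQ/BASEQ and caps depth; call uses the multiallelic
# model (-m) and emits variants only (-v); norm left-aligns and splits records.
bcftools mpileup -f refs/GRCh38.d1.vd1.fa \
    -q 20 -Q 20 --max-depth 500 -R shard_01.bed -Ou bam/NA12878.mkdup.bam \
| bcftools call -m -v -Ou \
| bcftools norm -f refs/GRCh38.d1.vd1.fa -m -any -Oz \
    -o vcf/shards/NA12878/NA12878.part_01.vcf.gz
bcftools index -t vcf/shards/NA12878/NA12878.part_01.vcf.gz
```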
10. Shard Concatenation & Indexing
Per-shard VCFs are concatenated in chromosome order and re-indexed so that downstream tools see a seamless, genome-wide VCF.
Outputs: vcf/<sample>.vcf.gz, vcf/<sample>.vcf.gz.tbi.
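A sketch of the merge step; the zero-padded shard names keep the glob in genomic order as long as shards were created in chromosome order:

```bash
bcftools concat -Oz -o vcf/NA12878.vcf.gz \
  vcf/shards/NA12878/NA12878.part_*.vcf.gz
bcftools index -t vcf/NA12878.vcf.gz    # writes the .tbi index
```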
11. Final VCF Quality Control – bcftools stats
bcftools stats and related utilities generate summary metrics and plots to assess callset quality.
Outputs: qc/<sample>.bcftools.stats.txt and optional graphical summaries via plot-vcfstats or MultiQC.
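For example (per-sample breakdowns enabled with `-s -`; the plotting step is optional and has its own plotting dependencies):

```bash
bcftools stats -s - vcf/NA12878.vcf.gz > qc/NA12878.bcftools.stats.txt
plot-vcfstats -p qc/NA12878_plots/ qc/NA12878.bcftools.stats.txt
```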
12. Gene List & Per-Sample Coverage
genomes_analyzer.py extracts gene coordinates from the reference GTF and, using mosdepth, calculates breadth and depth per gene for each sample.
Why it matters: Gene-level summaries highlight targets with insufficient coverage and facilitate downstream presence/absence analyses.
Outputs: genes/genes.bed, genes/<sample>_gene_presence.tsv.
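Summaries of this kind can be produced with mosdepth's `--by` mode; a sketch assuming `genes/genes.bed` carries gene names in its fourth column:

```bash
# Mean depth per gene lands in genes/NA12878.regions.bed.gz; the thresholds
# file counts bases covered at >=1x, from which breadth at 1x is derived.
mosdepth -t 4 --fasta refs/GRCh38.d1.vd1.fa \
  --by genes/genes.bed --thresholds 1 \
  genes/NA12878 bam/NA12878.mkdup.cram
```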
13. Pairwise & Trio Comparisons
For cohorts or family trios, the pipeline can compare VCFs pairwise and flag candidate de novo variants absent in the parents.
Why it matters: Automated comparison streamlines interpretation and quality control across samples.
Outputs: reports in comparisons/ and trio/ directories.
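Pairwise overlaps of this kind are commonly computed with `bcftools isec`; a sketch with illustrative sample names (the output layout is bcftools' standard):

```bash
# Writes 0000.vcf (private to the first file), 0001.vcf (private to the
# second), and 0002/0003.vcf (records shared by both callsets).
bcftools isec -p comparisons/NA12878_vs_NA12891 \
  vcf/NA12878.vcf.gz vcf/NA12891.vcf.gz
```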
14. Paternity Analysis (Likelihood-Based)
Calculates per-candidate paternity likelihoods from trios of genotypes (child, mother, alleged parent). Applies coverage (DP) and genotype-quality heuristics, plus optional VEP allele frequencies, to compute a per-site Paternity Index and an overall likelihood ratio.
Outputs: TSVs per pair in paternity/ and a summary paternity/paternity_summary.md.
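The genotype evidence feeding this computation can be pulled from a merged trio VCF with `bcftools query`; a sketch assuming samples NA12878 (child), NA12892 (mother), and NA12891 (alleged parent) live in one file:

```bash
# One row per PASS site: position, alleles, then GT:DP:GQ for each trio member.
bcftools query -s NA12878,NA12892,NA12891 \
  -i 'FILTER="PASS"' \
  -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT:%DP:%GQ]\n' \
  vcf/trio.vcf.gz | head
```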
15. Ancestry (ADMIXTURE Supervised)
Runs supervised ADMIXTURE with the HGDP+1KG reference panel, applying QC filters (MAF, missingness), LD pruning, and optional category collapsing. Fully idempotent: it reuses prepared PLINK references and skips the stage if the final summary already exists.
Outputs: results in ancestry/, including ancestry_summary_K{K}.tsv and optional ancestry_summary_collapsed.tsv.
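The core PLINK/ADMIXTURE commands behind this stage look roughly as follows. File names are illustrative; ADMIXTURE's supervised mode additionally expects a `.pop` file alongside the `.bed` that labels reference samples by population and leaves study samples as `-`:

```bash
# QC + LD pruning with the thresholds from the ancestry table below.
plink --bfile ancestry/merged --maf 0.01 --geno 0.05 \
      --indep-pairwise 200 50 0.2 --out ancestry/pruned
plink --bfile ancestry/merged --extract ancestry/pruned.prune.in \
      --make-bed --out ancestry/merged.pruned
# Supervised ADMIXTURE with K=4; reads ancestry/merged.pruned.pop for labels.
admixture --supervised -j32 ancestry/merged.pruned.bed 4
```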
16. Optional RNA-seq Module – HISAT2/StringTie
If RNA-seq samples are defined in the YAML, a lightweight expression pipeline (HISAT2 → StringTie → gffcompare) is executed after DNA analysis.
Outputs: transcript assemblies and expression tables under rna/.
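A minimal sketch of that expression pipeline; the index and annotation paths are illustrative, and `--dta` tunes HISAT2 alignments for downstream transcript assembly:

```bash
hisat2 -p 8 --dta -x refs/grch38_hisat2_index \
  -1 rna/S1_R1.fastq.gz -2 rna/S1_R2.fastq.gz \
| samtools sort -@ 4 -o rna/S1.sorted.bam -
stringtie rna/S1.sorted.bam -G refs/gencode.gtf \
  -o rna/S1.gtf -A rna/S1_abundance.tsv
gffcompare -r refs/gencode.gtf -o rna/S1_cmp rna/S1.gtf
```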
| Path | Type | Description |
|---|---|---|
| `bam/<sample>.mkdup.bam` | BAM | Coordinate-sorted, duplicate-marked alignments |
| `bam/<sample>.mkdup.cram` | CRAM | Compressed alignments with index |
| `bam/<sample>.mosdepth.summary.txt` | TXT | Coverage summary (mosdepth) |
| `vcf/shards/<sample>/part_XX.vcf.gz` | VCF | Per-shard variant calls |
| `vcf/<sample>.vcf.gz` | VCF | Final, genome-wide variant calls |
| `genes/<sample>_gene_presence.tsv` | TSV | Per-gene coverage report |
| `qc/<sample>_R1_fastqc.html` | HTML | Read quality report |
Remove any previous `genomics` environment first (the bootstrap below assumes Conda lives in `/home/lume2/miniforge3`; adjust `CONDA_BASE` to your installation):

```bash
CONDA_BASE="/home/lume2/miniforge3"
if ! command -v mamba >/dev/null 2>&1; then
  conda install -n base -c conda-forge -y mamba
fi
conda activate
conda env remove --name genomics
conda deactivate
```

Then create the environment and install VEP:

```bash
CONDA_BASE="/home/lume2/miniforge3"
if ! command -v mamba >/dev/null 2>&1; then
  conda install -n base -c conda-forge -y mamba
fi
conda activate
./scripts/install_genomics_env.sh
# VEP: choose one installer
# Default, resilient option:
source scripts/vep_install_smart.sh
# Alternatives:
# source scripts/vep_install_latest.sh
# source scripts/vep_install_fixed.sh
```

Leave any active conda environment and initialize the session:

```bash
conda deactivate
# Universal method (auto-detects conda):
source scripts/start_genomics_universal.sh
# Simple alternative, if your conda lives in ~/miniforge3:
# source scripts/start_genomics.sh
```

Execute the workflow by pointing the script to your YAML file:

```bash
conda deactivate
source scripts/start_genomics.sh
./genomes_analyzer.py --config configs/config_human_30x_low_memory.yaml
# Ready-made profiles:
# - configs/config_human_30x.yaml (30× trio, ENA/1000G)
# - configs/config_human_30x_low_memory.yaml (25% downsample, reduced footprint)
# - configs/config_human_30x_latest_ref.yaml (GENCODE r46, GRCh38 primary)
# - configs/config_human_30x_monster.yaml (128 cores / 256 GB, K=4 ancestry, fast VEP)
# - configs/config_human_30x_filtered.yaml (example with stricter filters)
```

The console prints progress panels, including per-shard heartbeats when variant calling is parallelized.
genomes_analyzer.py is driven by a YAML file. The examples provided target human trios at ~30× and include multiple profiles. Important sections are summarized below and extended with the new analysis modules.
| Field | Description | Example |
|---|---|---|
| `name` | Project identifier used in output paths. | `human_30x_trio_demo` |
| `organism` | Latin name of the species. | `homo_sapiens` |
| `reference.name` | Identifier for the reference genome build. | `GRCh38.d1.vd1` |
| `reference.fasta_url` | URL to a gzipped FASTA that will be downloaded and unpacked. | `https://api.gdc.cancer.gov/...834` |
| `reference.bwa_index_url` | Pre-built BWA index (saves RAM during alignment). | `https://api.gdc.cancer.gov/...7225` |
| Field | Purpose | Typical value |
|---|---|---|
| `force_indexes` | Rebuild alignment indexes even if present. | `false` |
| `sort_mem_mb` | Memory (MiB) per thread for `samtools sort`. | `512` |
| `bwa_batch_k` | Reads per batch for BWA; smaller uses less RAM. | `20000000` |
| `aln_threads` | Threads for alignment. | `16` |
| `gene_presence_min_mean_cov` | Minimum average coverage to consider a gene present. | `5.0` |
| `gene_presence_min_breadth_1x` | Minimum breadth at 1× to consider a gene present. | `0.8` |
| `trio_child_id` / `trio_parent_ids` | Trio IDs used by trio/paternity analyses. | `NA12878` / `[NA12891, NA12892]` |
| `trio_min_dp_child` / `trio_min_dp_parents` | Minimum depth for trio filters. | `15` / `15` |
| `trio_min_gq` | Minimum genotype quality for trio filters. | `30` |
| `trio_min_ab_het` / `trio_max_ab_het` | Allelic balance range for heterozygous calls. | `0.25` / `0.75` |
| `trio_min_ab_hom` | Minimum alt fraction for hom-alt calls. | `0.90` |
| `trio_max_parent_alt_frac` | Maximum alt fraction tolerated in parents at de novo sites. | `0.02` |
| `paternity_prior` | Prior probability of paternity before evidence. | `0.5` |
| `paternity_epsilon` | Small error rate to avoid degenerate likelihoods. | `0.001` |
| `paternity_require_pass` / `paternity_force_pass` | Use only FILTER=PASS variants; force if the tag is missing. | `true` / `true` |
| `paternity_use_vep_af` | Use VEP allele frequencies in likelihoods. | `true`/`false` |
| `paternity_skip_all_hets` | Skip sites where the whole trio is heterozygous. | `true`/`false` |
Alignment and variant-calling options.
| Field | Description | Example |
|---|---|---|
| `aligner` | Choose `bwa` or `bwa-mem2`. | `bwa` |
| `variant_caller` | `gatk` or `bcftools`. | `bcftools` |
| `bcf_mapq` / `bcf_baseq` / `bcf_max_depth` | `bcftools mpileup` minimum mapping/base quality and depth cap. | `20` / `20` / `500` |
| `bcf_scatter_parts` / `bcf_max_parallel` | Number of BED shards and maximum shards called in parallel. | `16..64` / N cores |
| `hc_java_mem_gb` | Java heap size (GB) for GATK HaplotypeCaller. | `24` |
| `bwa_index_*` | BWA index tuning for high-RAM hosts. | see monster config |
| `vep_*` / `annotate_with_vep` | VEP tuning and cache settings. | see monster config |
| Field | Description |
|---|---|
| `base_dir` | Root directory for results. |
| `work_dir` | Working directory for temporary files. |
| `temp_dir` | High-speed disk for intermediates. |
Defines how sequencing data are retrieved from the Sequence Read Archive (SRA) or European Nucleotide Archive (ENA).
| Field | Description | Example |
|---|---|---|
| `tool` | Download utility (`sra_toolkit`). | `sra_toolkit` |
| `use_ascp` | Use Aspera for faster transfers if available. | `false` |
| `threads` | Threads for download/conversion. | `8` |
Run-time behaviour and resilience.
| Field | Description | Example |
|---|---|---|
| `verbose` | Verbose logging. | `true` |
| `resume` | Skip completed steps. | `true` |
| `progress_interval_sec` | Interval (seconds) between progress reports. | `60` |
Controls read downsampling to limit compute and disk usage.
| Field | Description | Example |
|---|---|---|
| `downsample.enabled` | Whether to subsample reads. | `true` |
| `downsample.fraction` | Fraction of reads to retain (e.g., 0.25 ≈ 7.5× from a 30× input). | `0.25` |
Defines the biological samples to process.
| Field | Description |
|---|---|
| `sample_id` | Unique identifier (e.g., NA12878). |
| `study` | Accession of the sequencing study. |
| `runs` | List of SRA/ENA run IDs containing the reads. |
Two equivalent ways are supported:
- Ordered list (legacy):

```yaml
steps:
  - fetch_fastqs
  - qc_reads
  - align_and_sort
  - mark_duplicates
  - bqsr
  - call_genes
  - summarize
```

- Boolean map (recommended, clearer with new modules):

```yaml
steps:
  refs_and_indexes: true
  fetch_fastqs: true
  estimate_space: true
  downsample: true
  qc_and_trimming: true
  align_and_sort: true
  cram_and_coverage: true
  variants_and_vep: true
  gene_list: true
  gene_presence: true
  pairwise: true
  trio_denovo: true
  paternity: true
  ancestry: true
  rnaseq: false
```

If both forms are present, the boolean map takes precedence.
Controls the supervised ADMIXTURE module.
| Field | Description | Example |
|---|---|---|
| `method` | Currently `admixture_supervised`. | `admixture_supervised` |
| `k` | Number of ancestral populations (K). | `4` |
| `threads` | Threads for ADMIXTURE/PLINK. | `32` |
| `tools.plink` / `tools.plink2` / `tools.admixture` | Executable names. | `plink` / `plink2` / `admixture` |
| `qc.maf` | Minor allele frequency filter. | `0.01` |
| `qc.geno_missing` | Maximum per-SNP genotype missingness. | `0.05` |
| `qc.mind` | Sample missingness (remove if above threshold). | `0.99999` |
| `qc.indep_pairwise` | LD pruning window/step/r². | `[200, 50, 0.2]` |
| `reference.plink_tar_url` | HGDP+1KG PLINK tarball. | URL |
| `reference.sample_info_url` | Sample metadata for the reference panel. | URL |
| `categories` | Map from labels to superpopulation codes. | `{europeu: [eur], ...}` |
| `collapse` | Optional collapsing of categories. | `{europeu: [europeu], ...}` |
Run detached and monitor long jobs on shared servers or high-core workstations:

```bash
# Run in the background (logs with optimized formatting)
./scripts/run_in_background.sh --config configs/config_human_30x_latest_ref.yaml

# Dedicated profiles
./scripts/run_monster_background.sh --config configs/config_human_30x_monster.yaml
./scripts/run_atena_background.sh --config configs/config_human_30x_atena.yaml

# Auxiliary monitors
./scripts/monitor_monster.sh
./scripts/monitor_bwa_index.sh
```

Diagnostics and recovery:
- `scripts/diagnose_bcftools_error.sh`: troubleshooting for zero-variant situations; reproduces bcftools pipelines via `conda run -n genomics` with extra logging.
- `scripts/fix_reference_mismatch.sh`: safeguards and guidance for reference read-group mismatches.
- Steps are idempotent; ancestry and heavy bcftools/ADMIXTURE stages check for expected outputs before recomputing.
Genomes Analyzer provides a clear, reproducible path from raw reads to variant calls using a dependable stack of open-source tools. By letting users switch between GATK and BCFtools and by exposing explicit sharding and parallelization controls, it adapts to many datasets and machines. The detailed documentation and YAML configuration aim to demystify genomics processing for newcomers while remaining efficient for experienced practitioners.
The pipeline wraps several established command-line programs. Table 1 summarizes their roles and highlights common parameters.
| Tool | Role | Typical command | Key parameters |
|---|---|---|---|
| FastQC | Read quality assessment | `fastqc -t 8 fastq/*_R{1,2}.fastq.gz -o qc/` | `-t` threads |
| cutadapt | Adapter and quality trimming | `cutadapt -j 8 -q 20,20 -m 30 -a ADAPT1 -A ADAPT2 -o out_R1.fastq.gz -p out_R2.fastq.gz in_R1.fastq.gz in_R2.fastq.gz` | `-q` quality cutoff, `-m` min length, `-a`/`-A` adapters |
| BWA-MEM2 | Short-read alignment | `bwa-mem2 mem -t 16 -R "@RG\tID:S\tSM:S" refs.fa R1.fq.gz R2.fq.gz` | `-t` threads, `-R` read group |
| samtools | BAM/CRAM processing | `samtools sort -@8 -o out.bam in.bam` | `-@` threads |
| BCFtools | Variant calling and manipulation | `bcftools mpileup -f refs.fa -q20 -Q20 \| bcftools call -m -v -Oz -o out.vcf.gz` | `-q`/`-Q` minimum qualities, `-m` multiallelic caller |
| GATK | Variant calling (Java) | `gatk --java-options "-Xmx24g" HaplotypeCaller -R refs.fa -I in.bam -O out.g.vcf.gz` | `--java-options` memory, `-L` intervals |
Each tool offers many more options; consult the official manuals for advanced usage. The examples above match how genomes_analyzer.py invokes them under default settings.
FastQC evaluates base-call quality, sequence content and other metrics for raw reads. The example command above scans paired FASTQ files and writes an HTML report and a compressed archive to qc/. Successful runs produce files named *_fastqc.html and *_fastqc.zip.
cutadapt removes adapter sequences and trims low-quality bases. Running the sample command yields trimmed FASTQ files (out_R1.fastq.gz, out_R2.fastq.gz) and a log describing how many reads were modified or discarded.
bwa-mem2 aligns reads to a reference genome. The output is typically piped to samtools to create BAM files. Expect alignment statistics on stderr and an alignment stream on stdout.
samtools manipulates SAM/BAM/CRAM files. The sorting example writes a coordinate-sorted BAM (out.bam) and reports progress percentages while running. Additional subcommands handle indexing, duplicate marking and more.
BCFtools calls and filters variants. The example pipeline (mpileup β call) emits a compressed VCF (out.vcf.gz) and writes call statistics to stderr. The norm step (not shown) normalizes indels and splits multiallelic records.
The Java-based GATK suite provides sophisticated variant calling. HaplotypeCaller generates per-sample gVCFs, and tools like GenotypeGVCFs merge them. Outputs include out.g.vcf.gz and corresponding indexes, with detailed progress logs in the console.
This section describes specialized modules available in the repository that extend the core functionality of the Genomes Analyzer pipeline for specific use cases such as AI-powered DNA analysis, dataset generation for machine learning, and ancestry inference.
In addition to traditional variant calling, this repository includes Neural Module (in neural_module/), an AI-powered toolkit that uses AlphaGenome from Google DeepMind to predict functional genomic features directly from DNA sequences.
Neural Module leverages deep learning to predict:
- Gene Expression (RNA-seq, CAGE, PRO-cap)
- Chromatin Accessibility (ATAC-seq, DNase-seq)
- Epigenetic Markers (histone modifications: H3K27AC, H3K4ME3, H3K27ME3, etc.)
- Transcription Factors (CTCF and other binding sites)
- 3D Structure (contact maps)
- Splicing (junction sites, site usage)
- 11 Analysis Types supported by AlphaGenome
- Advanced Visualizations (heatmaps, dashboards, multi-output comparison)
- Variant Effect Prediction with 3-panel comparison
- Ontology Metadata Export (tissue/cell-type information in CSV/JSON)
- Real Sequence Download Guide (Ensembl, UCSC, NCBI, samtools)
- Complete Documentation in Portuguese and English
```bash
# Download a real genomic sequence (HBB gene, 2048 bp)
curl 'https://rest.ensembl.org/sequence/region/human/11:5227002..5229049?coord_system_version=GRCh38' \
  -H 'Content-type:text/x-fasta' > HBB_gene.fasta

# Analyze with AlphaGenome (from neural_module directory)
cd neural_module
python neural_module.py \
  -i ../HBB_gene.fasta \
  -k YOUR_API_KEY \
  -o results/

# Analyze variant (e.g., Sickle Cell Anemia mutation)
python neural_module.py \
  -i ../HBB_gene.fasta \
  -k YOUR_API_KEY \
  -o sickle_cell/ \
  --variant 1024 A T
```

Complete Neural Module Documentation: neural_module/README.md
Key guides:
- Installation Guide
- Download Real Sequences
- Usage Guide
- Interpreting Results
- Quick Start
Neural Module can be used standalone or integrated with the main pipeline to analyze specific genomic regions identified by variant calling.
Complete Integration Guide: neural_module/docs/INTEGRATION.md
The integration tool (neural_module/neural_integration.py) provides:
- Automated extraction of sequences from VCF, BED, or gene lists
- Neural analysis of variants and genomic regions
- Correlation of traditional variant calls with AI predictions
- 4 operation modes: integrated analysis, VCF extraction, BED extraction, gene extraction
Quick example:
```bash
# Extract variants from pipeline VCF and analyze with AlphaGenome
cd neural_module
python neural_integration.py \
  --integrated \
  --vcf ../vcf/sample.vcf.gz \
  --ref ../refs/GRCh38.fa \
  --api-key YOUR_API_KEY \
  --output integrated_analysis/
```

Location: this module lives in `neural_longevity_dataset/`.
The Neural Longevity Dataset Builder automates the creation of machine learning datasets for longevity research by integrating genomic data from the 1000 Genomes Project with AI-powered functional predictions from AlphaGenome.
- Automated Download – 1000 Genomes High Coverage VCF data
- Variant Processing – calls variants with bcftools, selects central points
- Window Extraction – FASTA windows centered on variants with the ALT allele applied
- AlphaGenome Integration – feature extraction for each sequence
- PyTorch Datasets – balanced train/validation/test splits ready for ML
- Complete Pipeline – from raw genomic data to ML-ready features
```bash
# Build a longevity marker dataset
cd neural_longevity_dataset
python neural_longevity_dataset.py --config configs/default.yaml

# Train a model
python longevity_train.py --config configs/train.yaml
```

Complete Guide: neural_longevity_dataset/README.md
Quick Start: neural_longevity_dataset/QUICKSTART.md
Project Details: neural_longevity_dataset/docs/PROJECT.md
Note: Run the script from /dados/GENOMICS_DATA/top3/ so that downloads, caches, and dataset artifacts stay organized per cohort.
build_non_longevous_dataset is a modular pipeline for building genomic datasets from non-longevous individuals in the 1000 Genomes Project. It analyzes metadata CSV files, selects samples based on configurable criteria (superpopulation, population, sex), and automatically runs build_window_and_predict.py (included in the module) with AlphaGenome predictions for each selected individual.
- Automated Sample Selection – configure by superpopulation or population with flexible filters
- Metadata Analysis – comprehensive statistics about sample distribution and demographics
- Idempotent Execution – built-in checkpoint system to resume interrupted runs
- AlphaGenome Integration – direct integration with AI-powered genomic predictions
- Organized Structure – professional module layout with configs and scripts
```bash
cd build_non_longevous_dataset
# Analyze available samples
python3 build_non_longevous_dataset.py --config configs/default.yaml

# Configure selection criteria in configs/default.yaml,
# enable additional steps, and run the full pipeline
python3 build_non_longevous_dataset.py --config configs/default.yaml
```

Complete Documentation: build_non_longevous_dataset/README.md
Additional guides:
- Quick Start Guide
- Implementation Details
- Module Structure
Location: this module lives in `neural_ancestry_predictor/`.
Neural Ancestry Predictor is a PyTorch-based neural network that predicts genetic ancestry (superpopulation, population, or FROG likelihood) from AlphaGenome predictions stored in PyTorch datasets. The module is fully configurable via YAML and integrates with Weights & Biases for experiment tracking and visualization.
- Fully Configurable – architecture, training, and data processing controlled via YAML
- Multiple Prediction Targets – superpopulation (5 classes), population (26 classes), or FROG likelihood (150 values)
- Flexible Data Processing – configurable window extraction, downsampling, and haplotype combination
- Automatic Normalization – computes and caches z-score parameters for consistent preprocessing
- Checkpointing – saves best models (by loss and accuracy) with full training state
- Comprehensive Metrics – accuracy, precision, recall, F1-score, confusion matrix
- W&B Integration – professional experiment tracking with graphs and visualizations
The network consists of:
- Input Layer: AlphaGenome predictions (ATAC, RNA-seq, etc.) from genomic windows
- Hidden Layers: Configurable number and size with ReLU/tanh/sigmoid activation
- Dropout: Regularization to prevent overfitting
- Output Layer: Softmax for classification or linear for regression
```bash
cd neural_ancestry_predictor
# Train model
python3 neural_ancestry_predictor.py --config configs/default.yaml
# Test model
python3 neural_ancestry_predictor.py --config configs/default.yaml --mode test
```

The YAML configuration is organized into 8 sections:
A) Dataset Input Parameters – which AlphaGenome outputs to use, haplotype mode (H1/H2/H1+H2), window size, downsampling
B) Output Parameters – prediction target (superpopulation/population/frog_likelihood)
C) Model Architecture – hidden layers, activation function, dropout rate
D) Training Parameters – optimizer, learning rate, batch size, epochs
E) Data Split – train/validation/test split ratios
F) Weights & Biases – experiment tracking configuration
G) Checkpointing – model saving frequency and loading
H) Mode – train or test operation
```yaml
dataset_input:
  dataset_dir: "/dados/GENOMICS_DATA/top3/non_longevous_results"
  alphagenome_outputs: ["ATAC"]
  haplotype_mode: "H1+H2"
  window_center_size: 100   # Extract center bases
  downsample_factor: 1      # No downsampling

output:
  prediction_target: "superpopulation"  # 5 classes: AFR, AMR, EAS, EUR, SAS

model:
  hidden_layers: [128, 64]
  activation: "relu"
  dropout_rate: 0.2

training:
  optimizer: "adam"
  learning_rate: 0.001
  batch_size: 16
  num_epochs: 100
```

Neural Ancestry Predictor is designed to work seamlessly with PyTorch datasets generated by the Non-Longevous Dataset Builder:
```bash
# Step 1: Build dataset (from build_non_longevous_dataset/)
python3 build_non_longevous_dataset.py --config configs/small.yaml

# Step 2: Train ancestry predictor (from neural_ancestry_predictor/)
python3 neural_ancestry_predictor.py --config configs/default.yaml
```

The dataset provides:
- AlphaGenome predictions for each individual's genomic windows
- Metadata (population, superpopulation, sex)
- FROG ancestry likelihood values
- Organized structure ready for neural network training
Complete Documentation: neural_ancestry_predictor/README.md
Key topics covered:
- Installation and dependencies
- Configuration details for all parameters
- Architecture and data processing pipeline
- Training, validation, and testing workflows
- Metrics interpretation and model evaluation
- Hyperparameter tuning guide
- Comprehensive FAQ
- Start small: use `window_center_size: 100` and a single output (ATAC) for quick experiments
- Scale up: increase window size and add outputs (RNA_SEQ, CAGE) for better performance
- Regularization: adjust dropout (0.2-0.5) if validation loss diverges from training loss
- GPU recommended: ~50-100× faster than CPU for medium/large datasets
Location: this module lives in `FROGAncestryCalc/`.
FROGAncestryCalc (FROG-kb Ancestry Inference Batch Likelihood Computation Tool) is a tool for ancestry inference based on Ancestry Informative SNPs (AISNPs). The modified version in this repository supports pipe delimiters (|) and includes tools to extract SNPs from genomic data in various formats.
- Multiple AISNP Panels – supports 5 panels: 55AI (KiddLab), 128AI (Seldin), 34plex (SNPforID), combined (192 SNPs), precision (165 SNPs)
- Automated Extraction – scripts to extract SNPs from VCF, BAM, FASTQ, and the 1000 Genomes Project
- 155 Populations – calculates ancestry likelihoods for 155 worldwide populations
- Flexible Formats – converts VCF/BAM/FASTQ to FROGAncestryCalc format
- Detailed Reports – generates likelihood, order-of-magnitude, and population-ranking files
```bash
cd FROGAncestryCalc
# Extract SNPs from a VCF file
python3 tools/vcf_to_frog.py \
  sample.vcf.gz \
  tools/aisnps_55_list.txt \
  input/sample_data.txt
# Run ancestry analysis
./run.sh
```

The module includes three tools to extract AISNPs from genomic data:
| Tool | Data Source |
|---|---|
| `vcf_to_frog.py` | VCF files (from any source) |
| `extract_snps_from_1000genomes.sh` | Direct download from 1000 Genomes Project Phase 3 |
| `extract_snps_from_wgs.sh` | Whole-genome sequencing data (FASTQ/BAM/VCF) |
Complete Documentation: FROGAncestryCalc/README.md
Additional guides:
- 55 AISNPs List
- Modification Details
FROGAncestryCalc can be used independently or integrated with the main pipeline for ancestry analysis of processed samples:
```bash
# Extract SNPs from pipeline-generated VCF
cd FROGAncestryCalc
python3 tools/vcf_to_frog.py \
  ../vcf/NA12878.vcf.gz \
  tools/aisnps_55_list.txt \
  input/NA12878_aisnps.txt
# Run analysis
./run.sh
```

Note: The main pipeline also includes ancestry analysis via supervised ADMIXTURE (step 15), which uses a different approach based on PLINK and HGDP+1KG references. FROGAncestryCalc offers an alternative specifically focused on validated AISNP panels for forensic and clinical use.
Location: this module lives in `genes_difference_count/`.
Genes Difference Count is a high-performance C++ tool for comparing gene sequences between family members (trio analysis: father, mother, child) using parallel processing and optimized algorithms. It generates comprehensive pairwise comparison statistics for genetic differences.
- Parallel Processing – OpenMP-based parallelization for maximum performance
- IUPAC Compatibility – full support for IUPAC nucleotide ambiguity codes
- Smart Alignment – Hirschberg algorithm for short sequences, fast estimation for long ones
- Comprehensive Statistics – detailed per-gene metrics (matches, substitutions, indels)
- CSV Output – structured results for easy downstream analysis
- Optimized – aggressive compiler optimizations for ultra-fast execution
```bash
cd genes_difference_count

# Step 1: Generate gene FASTA files from VCF data
# Edit scripts/generate_gene_fastas.sh to configure paths (lines 17-34)
bash scripts/generate_gene_fastas.sh

# Step 2: Configure input/output paths in genes_difference_count.cpp
# Edit lines 16-24 to point to the generated FASTA files

# Step 3: Compile with optimizations
make

# Step 4: Run the analysis
./genes_difference_count
```

The tool generates a CSV file with detailed statistics for each gene across all three pairwise comparisons:
- Father vs Mother
- Father vs Child
- Mother vs Child
Each comparison includes: aligned columns, matches, substitutions, indels, and ambiguous bases.
On a 16-core system:
- Speed: ~100-150 comparisons/second
- Typical Runtime: 8-12 minutes for ~60,000 comparisons (20,000 genes × 3 pairs)
- Parallelization: Scales well with available CPU cores
Complete Documentation: genes_difference_count/README.md
Includes:
- Configuration guide
- Compilation instructions
- Output format details
- Performance optimization tips
- Algorithm details (IUPAC compatibility, alignment strategies)