Kari-Genomics-Lab/Extreme_Env_2

Life at the extremes

This repository contains the official implementation of the study "Life at the extremes: Maximally divergent microbes with similar genomic signatures linked to extreme environments" (Safari et al., 2025; preprint).

In that work, we showed that extremophiles — despite belonging to maximally divergent lineages — can converge toward highly similar genomic k-mer signatures when adapting to extreme environments (temperature, pH, and beyond). These convergent patterns highlight the role of large-scale mutational and selective pressures in shaping microbial genomes under stress.


Installation

# From repo root
pip install -e .

Download Data

Download the extremophile genome assemblies and metadata from Zenodo DOI: link

Data Layout

Place the downloaded assemblies (FASTA .fna) and the metadata file under data/.

All results will be written to outputs/ automatically.

The results of the experiments reported in this study are available in the results/ folder.
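Before launching a pipeline, it can help to confirm that the assemblies are where the scripts expect them. The snippet below is a minimal sketch, not part of the repository: `list_assemblies` is a hypothetical helper, and it assumes assemblies may sit anywhere under data/ (including subfolders) with the .fna extension mentioned above.

```python
from pathlib import Path


def list_assemblies(data_root="data"):
    """Return sorted paths of FASTA assemblies (*.fna) found under data_root.

    Hypothetical helper: searches recursively, since the exact folder layout
    inside data/ is an assumption here.
    """
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.fna"))


if __name__ == "__main__":
    files = list_assemblies("data")
    print(f"Found {len(files)} .fna assemblies under data/")
```

An empty list usually means the download was unpacked somewhere other than data/.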

Supervised Learning Experiments

1) Effect of Genome Proxy Selection (Multiple Runs)

Tests whether random genome proxy choice changes classification accuracy.

python3 src/extprime/pipelines/pipeline_supervised.py \
  --exp_type exp1 --max_k 6 --data_root data --output_root outputs

2) Accuracy vs. Genome Proxy Length (Single Run)

Compares accuracy across proxy lengths (k set by --max_k).

python3 src/extprime/pipelines/pipeline_supervised.py \
  --exp_type exp2 --max_k 6 --data_root data --output_root outputs

3) Effect of Number of Subfragments (n)

Varies n in the composite genome proxy to measure its impact.

python3 src/extprime/pipelines/pipeline_supervised.py \
  --exp_type exp3 --max_k 6 --data_root data --output_root outputs

Optional: Add --whole_genome to use entire genomes instead of proxies.
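To make the idea of a composite genome proxy with n subfragments more concrete, here is a minimal sketch. It assumes a proxy is formed by concatenating n non-overlapping subfragments, one drawn at a random position from each of n equal windows of the genome. The function name, the windowing scheme, and the `seed` parameter are illustrative assumptions; the sampling used by the pipeline may differ.

```python
import random


def composite_proxy(genome, n, sub_len, seed=0):
    """Concatenate n non-overlapping subfragments of length sub_len.

    Hypothetical helper: the genome is split into n equal windows and one
    subfragment is drawn at a random offset inside each window, so the
    subfragments can never overlap.
    """
    if n <= 0 or sub_len <= 0:
        raise ValueError("n and sub_len must be positive")
    window = len(genome) // n
    if window < sub_len:
        raise ValueError("genome too short for the requested proxy")
    rng = random.Random(seed)  # fixed seed keeps the proxy reproducible
    parts = []
    for i in range(n):
        start = i * window + rng.randint(0, window - sub_len)
        parts.append(genome[start:start + sub_len])
    return "".join(parts)
```

With this scheme the proxy length is always n × sub_len, so varying n at a fixed total length means shrinking each subfragment accordingly.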

Outputs

Results are written under:

outputs/{exp_type}/{env}/fragments_{length}/...

Each folder contains the generated FASTA for that environment and the model outputs produced by the pipeline.
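The directory pattern above can be walked programmatically to see which runs have finished. This is a small sketch, assuming only the outputs/{exp_type}/{env}/fragments_{length} layout described here; `describe_outputs` is a hypothetical helper and makes no assumptions about the files inside each folder.

```python
from pathlib import Path


def describe_outputs(output_root="outputs"):
    """Yield (exp_type, env, fragment_length) for each fragments_* directory."""
    root = Path(output_root)
    for frag_dir in sorted(root.glob("*/*/fragments_*")):
        if not frag_dir.is_dir():
            continue
        # Path components: .../<exp_type>/<env>/fragments_<length>
        exp_type, env = frag_dir.parts[-3], frag_dir.parts[-2]
        length = frag_dir.name.split("_", 1)[1]
        yield exp_type, env, length


if __name__ == "__main__":
    for exp_type, env, length in describe_outputs("outputs"):
        print(exp_type, env, length)
```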

Command-Line Flags

  • --exp_type {exp1,exp2,exp3,tuning} – Choose the experiment
  • --max_k INT – Maximum k-mer length considered by the models
  • --data_root PATH – Input root (default: data)
  • --output_root PATH – Results root (default: outputs)
  • --whole_genome – Use entire genomes instead of proxies (optional)
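To run the three supervised experiments back to back, the commands can be assembled from the flags listed above. The wrapper below is a sketch and not part of the repository: `build_command` and `run_all` are hypothetical names, and it assumes the script is invoked from the repository root as in the examples.

```python
import subprocess
import sys

PIPELINE = "src/extprime/pipelines/pipeline_supervised.py"


def build_command(exp_type, max_k=6, data_root="data",
                  output_root="outputs", whole_genome=False):
    """Build the argument list for one supervised experiment."""
    cmd = [
        sys.executable, PIPELINE,
        "--exp_type", exp_type,
        "--max_k", str(max_k),
        "--data_root", data_root,
        "--output_root", output_root,
    ]
    if whole_genome:
        cmd.append("--whole_genome")
    return cmd


def run_all(exp_types=("exp1", "exp2", "exp3")):
    """Run each experiment in turn; stop at the first failure."""
    for exp_type in exp_types:
        subprocess.run(build_command(exp_type), check=True)
```

Calling `run_all()` from the repository root would launch exp1, exp2, and exp3 in sequence with the default settings shown in the examples.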

Unsupervised Learning Experiments

4) Non-parametric clustering and candidate identification

python3 src/extprime/pipelines/pipeline_unsupervised.py \
  --exp_type non-parametric --k_mer 6 --data_root path_to_the_subfragments --output_root outputs \
  --fragement_length 100000 --n_clusters 4 --env Temperature

Command-Line Flags

  • --exp_type {parametric, non-parametric} – Choose the experiment
  • --k_mer INT – K-mer length
  • --fragement_length INT – Fragment length
  • --n_clusters INT – Number of clusters (default: 4) - not needed for non-parametric
  • --output_root PATH – Output results directory
  • --env {pH,Temperature} – Environment type
  • --fragment_path PATH – Path to a fragment FASTA file
  • --data_root PATH – Input data directory

FCGR distance calculation and filtering

python3 src/extprime/analysis/distance_calculator.py --data_root path_to_the_subfragments
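An FCGR (frequency chaos game representation) arranges the k-mer counts of a sequence in a 2^k × 2^k grid, so comparing two FCGRs amounts to comparing k-mer frequency vectors. The sketch below illustrates that idea, assuming normalized frequencies and a Euclidean distance; both helpers are hypothetical, and the metric and filtering used by distance_calculator.py may differ.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Normalized counts of all 4**k k-mers over A, C, G, T.

    Windows containing any other character (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTACGT", 2)
    b = kmer_frequencies("AAAAAACCCCCC", 2)
    print(euclidean(a, b))
```

Because the vectors are normalized, fragments of different lengths can be compared directly.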
