Description

This is a library for estimating decoding probabilities from a query nucleotide sequence based on a set of reference amino acid kmers.

The library is written in C and is meant to be used via the provided Python wrapper. The full list of functions and their descriptions can be found in kaci.h.

Installation

The library can be built on Linux (or WSL).

Install the dependencies for the library:

sudo apt update
sudo apt install build-essential
sudo apt install zlib1g-dev

Clone this repo:

git clone https://github.com/artmeln/lib-kaci.git

Install python dependencies (these are used only by the tests and examples, not by the library itself):

pip install -r requirements.txt

Build the library:

cd src
make libkaci
cd ..

Getting started

The best way to start is to run the provided examples (both examples will take a few seconds to execute). First build a reference:

cd examples
python3 build_reference.py

To process queries run:

mkdir results
wget -P genomes/ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
python3 process_query.py

This will produce typical KACI output files for the downloaded genome in the results directory.

You should look at these examples to familiarize yourself with the inputs that control KACI.

Inputs

The following inputs should be given to the library:

amino acid kmer length
translation table - a string of single-letter amino acid decodings for the 64 possible codons. The decodings are listed in ascending codon order, where T<C<A<G. The translation table for the standard genetic code is FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
maximum query count - how many times an amino acid kmer that is found repeatedly in the query and matches a reference kmer will be used in estimating decoding probabilities
the name of the file containing query nucleotide sequences in fasta format (*.fna or *.fna.gz) OR a list of such file names when working in the batch mode
the name of the file containing the reference kmers (the format of this file is described below)
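
The codon ordering used by the translation table can be made concrete with a short Python sketch. It is illustrative only and does not call the KACI wrapper; the names CODON_ORDER, build_codon_table and translate are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code, as given above (codons ordered with T < C < A < G).
STANDARD_TABLE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_ORDER = "TCAG"  # ascending base order assumed by the table


def build_codon_table(table: str) -> dict[str, str]:
    """Map each of the 64 codons to its single-letter decoding."""
    if len(table) != 64:
        raise ValueError("translation table must have exactly 64 characters")
    # product() varies the last base fastest: TTT, TTC, TTA, TTG, TCT, ...
    codons = ("".join(bases) for bases in product(CODON_ORDER, repeat=3))
    return dict(zip(codons, table))


def translate(seq: str, codon_table: dict[str, str]) -> str:
    """Translate a nucleotide sequence in frame 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        codon_table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


codon_table = build_codon_table(STANDARD_TABLE)
print(codon_table["ATG"], codon_table["TGG"], codon_table["TGA"])  # M W *
```

With this ordering, codon TGA sits at index 14 of the string, so a table in which TGA decodes as tryptophan would differ from the standard one only in that character.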

If the reference file is not available, it should be constructed first. The library settings that influence this process are:

maximum allowed count for any single amino acid in a family reference kmer
minimum allowed number of proteins in a family
minimum allowed link number - the number of kmers in one family that are linked by a single amino acid substitution at a specific position
name of the file containing representative sequences of protein domains for all available protein families in fasta format. One example of such a file is Pfam-A.fasta.gz, which accompanies releases of the Pfam domain database.
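
The first and third settings can be illustrated with a small Python sketch. This is one possible reading of the descriptions above, not the library's implementation: the function names and the toy kmers are hypothetical, and counting a kmer as linked to itself is an assumption.

```python
from collections import Counter, defaultdict


def passes_max_aa_count(kmer: str, max_count: int) -> bool:
    """True if no single amino acid occurs more than max_count times in kmer."""
    return max(Counter(kmer).values()) <= max_count


def link_numbers(kmers: set[str], position: int) -> dict[str, int]:
    """For each kmer, count the kmers in the set (itself included, by
    assumption) that match it everywhere except possibly at `position`."""
    def mask(kmer: str) -> str:
        return kmer[:position] + "_" + kmer[position + 1:]

    groups = defaultdict(set)
    for kmer in kmers:
        groups[mask(kmer)].add(kmer)
    return {kmer: len(groups[mask(kmer)]) for kmer in kmers}


# Toy family of 5-mers: three differ only at position 2, one differs elsewhere.
family = {"MKVLA", "MKILA", "MKLLA", "MRVLA"}
print(passes_max_aa_count("AAAAK", 3))
print(link_numbers(family, 2))
```

Under this reading, a minimum link number would keep only kmers whose count in `link_numbers` reaches the threshold at the position of interest.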

Outputs

a json file containing the decoding probabilities for every codon and the number of times each codon contributed to the decoding probability calculation
optionally a csv file listing all query kmers that contributed to the calculation
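
The layout of the json file is not described here, so the following Python sketch only shows a schema-agnostic way to inspect an output file. The function name summarize_json is hypothetical, and the path passed to it should be replaced with an actual file from your results directory.

```python
import json


def summarize_json(path: str) -> dict[str, str]:
    """Load a JSON file and report the type of each top-level entry."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        # Not an object at the top level: report the root type only.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}
```

Printing the returned dictionary is a quick way to see which top-level keys a given output file contains before writing any downstream analysis.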

Tests

If you make modifications to the library, you may want to check that everything is still working as expected. To run the tests:

Download genomes from NCBI:

wget -P test/genomes/ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
wget -P test/genomes/ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/661/245/GCA_001661245.1_Pacta1_2/GCA_001661245.1_Pacta1_2_genomic.fna.gz
wget -P test/genomes/ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/182/965/GCF_000182965.3_ASM18296v3/GCF_000182965.3_ASM18296v3_genomic.fna.gz

Run the tests:

python3 tests.py

Note: the last test requires seqkit.

Acknowledgments

This project was made possible by a very fast hash table C library

In addition, it utilizes a C implementation of logaddexp

Citation

Artem V Melnykov. "New genetic codes in bacteria and archaea identified with a fast k-mer based algorithm" (preprint)

If you want to reproduce results from this paper, follow these steps:

Make sure you are running on a machine with at least 16 GB of available memory and at least 30 GB of free disk space.
Build the library as described above.
Make sure that your examples directory has the following structure:

examples/
 ├── genomes/
 ├── ref/
 └── results/

Download the reference, unzip it and place into examples/ref/:

cd examples
wget -P ref/ https://zenodo.org/records/19318166/files/ref_k11_link20.tar.gz
tar -xvzf ref/ref_k11_link20.tar.gz -C ref/

Download genomes of interest and place them into examples/genomes/ (you don't have to unzip them).
To process all files located in examples/genomes/, use 1_batch_process_genomes.py (you may want to edit this file and change the number of threads to be used). There will be an initial delay of a couple of minutes while the reference is being loaded; after that you should start seeing output files as they are written to disk.

python3 1_batch_process_genomes.py

To extract the inferences for all files located in examples/results and save them as summary_genomic.csv, run:

python3 2_evaluate_results.py
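
Before starting a long batch run, it can be useful to confirm that every file in examples/genomes/ is a readable fasta file. The Python sketch below is not part of the repository; it uses only the standard library, and the helper names count_fasta_records and list_genomes are hypothetical.

```python
import gzip
from pathlib import Path


def count_fasta_records(path: Path) -> int:
    """Count header lines ('>') in a plain or gzip-compressed fasta file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def list_genomes(directory: str) -> dict[str, int]:
    """Map each *.fna or *.fna.gz file in `directory` to its record count."""
    root = Path(directory)
    files = sorted(list(root.glob("*.fna")) + list(root.glob("*.fna.gz")))
    return {f.name: count_fasta_records(f) for f in files}


if __name__ == "__main__":
    # Assumes the script is run from inside the examples directory.
    for name, n_records in list_genomes("genomes").items():
        print(f"{name}: {n_records} sequence(s)")
```

A truncated or corrupted download will typically raise an error while the gzip stream is being read, which is cheaper to discover here than partway through a batch run.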
