Skip to content

BilkentCompGen/FUSOR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FUSOR: Space-efficient Taxonomic Classification of Using Hybrid Hierarchical Interleaved Binary Fuse Filters

Tool Configuration Disk (GiB) Peak Memory (GiB) Time (h:m:s)
FUSOR (3-wise) Taxor defaults 38 98.0 9:09:10
FUSOR (4-wise) Taxor defaults 22 97.5 9:09:57
FUSOR (4-wise) ganon2 defaults 35 108.5 35:14:53
FUSOR (4-wise) no downsampling (k=22) 239 856.3 46:45:46
Taxor Taxor defaults 42 56.4 9:32:00
ganon2 ganon2 defaults 57 58.6 4:57:35
ganon2 no downsampling (k=22, max FP rate=2-8) 333 502.6 53:30:42
Kraken2 Kraken2 defaults 78 86.8 19:08:39

NOTE! The "no downsampling" in the Taxor row in the paper is a mistake, it should be "Taxor defaults"

Citation

TODO add own paper here

Table of contents

Description

FUSOR is a taxonomic classification and profiling tool that efficiently classifies DNA sequences against large sets of genomic reference sequences. FUSOR stores k-mers in an optimized hierarchical interleaved Binary Fuse filter (HIBFF) index and combines k-mer similarity and genome coverage information for precise taxonomic classification and profiling. It features:

  • Low false positive rates for k-mer matching
  • NCBI taxonomy integration
  • Open canonical syncmers as k-mer selection scheme for improved downsampling
  • classification with binning and taxonomic profiling
  • read reassignment EM algorithm for multi-matching reads
  • advanced filtration of search results
  • taxonomic and sequence abundance reports with genome size correction

Benchmarking results based on simulated and real long-read data sets demonstrate that FUSOR enables more precise taxonomic classification and profiling of microbial populations while having a smaller memory footprint than other tools.

Installation

You can build FUSOR from source using the following commands. Ensure you have installed CMake (>=3.16) and GCC (>= 10) installed.

# 1. Clone the repository and initialize submodules
git clone https://github.com/tunaokcu/FUSOR.git
cd FUSOR
git submodule update --init --recursive

# 2. Create a clean build directory
mkdir build
cd build

# 3. Configure CMake targeting the src directory
cmake ../src

# 4. Compile FUSOR components using all available CPU cores
make -j$(nproc)
make -j$(nproc) ganon-classify

Pre-built databases

Users can easily build custom databases as described below or use the following pre-built database index files

Kingdom Source Parameters Size Link MD5 Hash
Viruses Genbank Release 258 k=22, s=12 373 MB download 0e8edd19a6314450f88f556dcf6b7c95
Archaea, Bacteria, Fungi, Viruses RefSeq Release 216 k=22, s=12 9.9 GB download 768ee0320dcf41f5b15efafa028ba836

Commands

Subcommand / Executable Function
build Construct HIBFF index from fasta reference files
search / ganon-classify Search sequences/classify against a database index
profile Generate the taxonomic profile and abundance from results

Database Building (fusor build)

FUSOR supports two primary build profiles: the Taxor profile and the Ganon profile.

  • Taxor Defaults (--build-taxor): Optimized for precise taxonomic classification using syncmers. Automatically sets kmer_size=20, syncmer_size=10, and use_syncmer=true.
  • Ganon Defaults (--build-ganon): Optimized for fast classification using minimizers. Automatically sets kmer_size=19, window_size=31, and use_syncmer=false.

You can override these defaults by explicitly specifying --kmer-size, --window-size, or --syncmer-size.

Example (Ganon Defaults):

fusor build --build-ganon --input-file refseq_accessions_taxonomy.csv --input-sequence-dir refseq/files --output-filename database.hixf --threads 6

Example (Taxor Defaults):

fusor build --build-taxor --input-file refseq_accessions_taxonomy.csv --input-sequence-dir refseq/files --output-filename database.hixf --threads 6
fusor build - Creates and HIBFF index of a given set of fasta files
==================================================================

DESCRIPTION
    Creates an HIBFF index using either k-mers or syncmers

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --input-file (std::string)
          tab-separated-value file containing taxonomy information and reference file names
    --input-sequence-dir (std::string)
          directory containing the fasta reference files Default: .
    --output-filename (std::string)
          A file name for the resulting index. Default: .
    --kmer-size (signed 32 bit integer)
          size of kmers used for index construction Default: 20. Value must be in range [1,30].
    --syncmer-size (signed 32 bit integer)
          size of syncmer used for index construction Default: 10. Value must be in range [1,26].
    --threads (signed 32 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --use-syncmer
          enable using syncmers for smaller index size
    --fast-layout
          disables estimate_union to accelerate chopper layout calculation for large databases. Approximates layout linearly.

input-file
This file contains all relevant information about the organisms in the database, which will be indexed. All values are tab-separated and the file should have following columns:

  • Column 1: Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly.
  • Column 2: Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource: https://www.ncbi.nlm.nih.gov/taxonomy/
  • Column 3: FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded
  • Column 4: Organism name
  • Column 5: Taxonomy string
  • Column 6: Taxonomy ID string

A two-line example of such a file is provided below. You can easily create such a file by following the preprocessing steps described in the Usage section.

GCF_000002495.2	318829	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/495/GCF_000002495.2_MG8	Pyricularia oryzae	k__Eukaryota;p__Ascomycota;c__Sordariomycetes;o__Magnaporthales;f__Pyriculariaceae;g__Pyricularia;s__Pyricularia oryzae	2759;4890;147550;639021;2528436;48558;318829
GCF_000002515.2	28985	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1	Kluyveromyces lactis	k__Eukaryota;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Kluyveromyces;s__Kluyveromyces lactis	2759;4890;4891;4892;4893;4910;28985

input-sequence-dir
Path to the directory containing fasta files (compressed) of organisms listed in the tab-separated file explained above. The file stem of the fasta files needs to match the last directory path string of the FTP path in column 3 of the input file (e.g. GCF_000002495.2_MG8)

output-filename
Path to the output file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

kmer-size
Size of k-length-substrings used for pseudo-mapping. When using syncmers for downsampling, the kmer-size has to be even-numbered because of using open canonical syncmers. The maximum supported k-mer size is 30.

syncmer-size
Size of the substrings used for selecting a k-mer for pseudo-mapping. The syncmer-size also has to be even-numbered because of the usage of open canonical syncmers. This number needs to be smaller than the k-mer size and the maximum supported size is 26.

use-syncmer
Switch that enables the usage of syncmers for downsampling of k-mers.

fast-layout
Disables dynamic programming intersect evaluation (which calculates false-positive rates explicitly). Required for large assembly level datasets to bypass O(N^2) dynamic programming scaling issues when compiling the hierarchical index graph.

threads
Number of threads used for computing the hierarchical structure and building the HIBFF index.

Classification (fusor search and ganon-classify)

Once you have a built database, you can classify sequences using either ganon-classify or fusor search.

Using ganon-classify

Ideal when using the Ganon build profile (--build-ganon).

ganon-classify --hixf database.hixf --rel-cutoff 0.05 --rel-filter 0 --output-prefix ganon_classified --output-all -r reads.fna

Note: The --rel-filter 0 flag restricts output to the best unique match per read. --rel-filter 1 will output all equivalently tied matching candidates passing the cutoff.

Using fusor search

Ideal when using the Taxor build profile (--build-taxor). If the build process generated a stash map (.stashmap), you should pass it during search:

fusor search --index-file database.hixf --stashmap database.stashmap --query-file reads.fna --output-file taxor_classified.tsv --threads 6
fusor search - Queries a file of DNA sequences against an HIBFF index
====================================================================

DESCRIPTION
    Query sequences against the FUSOR HIBFF index structure

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --index-file (std::string)
          FUSOR index file containing HIBFF index and reference sequence information
    --query-file (std::string)
          file containing sequences to query against the index Default: .
    --output-file (std::string)
          A file name for the resulting output. Default: .
    --threads (unsigned 8 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --percentage (double)
          If set, this threshold is used instead of the k-mer/syncmer models. Default: -1. Value must be in range
          [0,1].
    --error-rate (double)
          Expected error rate of reads that will be queried Default: 0.04. Value must be in range [0,1].

index-file
Path to the file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

query-file
Path to a fast(a/q) file containing sequenced reads of a sample, which shall be taxonomically classified. This file can be gzip compressed.

output-file
Path to the output file containing results of the classification step. This file is tab-separated with 10 columns per line.

  • Column 1 : read identifier
  • Column 2 : Assembly Accession ID of the matching reference
  • Column 3 : Organism name of the matching reference
  • Column 4 : Taxonomy ID of the matching reference
  • Column 5 : Cumulative length of the matching reference sequence
  • Column 6 : Sequence length of the queried read
  • Column 7 : Overall number of k-mers (syncmers) generated from the queried read
  • Column 8 : Number of k-mers (syncmers) of the query read that match with the reference sequence
  • Column 9 : Taxonomy string of the matched reference
  • Column 10 : Taxonomy ID string of the matched reference
#QUERY_NAME	ACCESSION	REFERENCE_NAME	TAXID	REF_LEN	QUERY_LEN	QHASH_COUNT	QHASH_MATCH TAX_STR TAX_ID_STR
f3a88e6f-9873-445a-b0c7-2406f3a360fc|1613       GCF_022819245.1 Limosilactobacillus fermentum   1613    2016236 1139    98      56      k__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Limosilactobacillus;s__Limosilactobacillus fermentum      2;1239;91061;186826;33958;2742598;1613
9d3a345f-3bf7-4304-977c-ee8b2b3c2b17|96241      GCF_000959025.1 Bacillus intestinalis   1963032 4031727 2589    224     104     k__Bacteria;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus intestinalis   2;1239;91061;1385;186817;1386;1963032
9d3a345f-3bf7-4304-977c-ee8b2b3c2b17|96241      GCF_006094475.1 Bacillus spizizenii     96241   4045538 2589    224     104     k__Bacteria;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus spizizenii     2;1239;91061;1385;186817;1386;96241
4e845c81-7397-4956-bdb2-5893b450aa3e|1613       GCF_022819245.1 Limosilactobacillus fermentum   1613    2016236 1790    158     38      k__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Limosilactobacillus;s__Limosilactobacillus fermentum      2;1239;91061;186826;33958;2742598;1613
acc4abca-4daf-4af1-a3fc-edc189c37095|2807625    GCF_018622975.1 Staphylococcus sp. MZ9  2836367 2860948 3971    344     158     k__Bacteria;p__Bacillota;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus sp. MZ9      2;1239;91061;1385;90964;1279;2836367

threads
Number of threads used for querying the sequences in the input file against the HIBFF index.

percentage
Can be used to define the minimum percentage of k-mers/syncmers that need to match a reference. By default we use the k-mer model from Blanca et al. or empircally computed values for determining the thesholds for reporting a match.

error-rate
For more accurate classification of reads we are calculating the expected number of mutated k-mers for each read prefix based on the expected sequencing error rate. Than a confidence interval for the mutated k-mers is calculated as described by Blanca et al. and the minimum number of matching k-mers is calculated based on the upper bound of the confidence interval. The significance level of the confidence interval is set to 95% by default. When using synmcers, we are using emprically calculated minimum numbers of matching syncmers for a given error rate and k-mer length.

Abundance Estimation (fusor profile)

Use fusor profile to generate taxonomic profiling and abundance estimation from search results.

fusor profile --search-file taxor_classified.tsv --cami-report-file SAMPLE.report --seq-abundance-file SAMPLE.abundance --binning-file SAMPLE.binning --sample-id SAMPLE --threads 6
fusor profile - Taxonomic profiling of a sample by giving read matching results of FUSOR search
===============================================================================================

DESCRIPTION
    Taxonomic profiling of the given read set

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --search-file (std::string)
          FUSOR search file containing results of read querying against the HIBFF index
    --cami-report-file (std::string)
          output file reporting genomic abundances in CAMI profiling format. This is the relative genome abundance in
          terms of the genome copy number for the respective TAXID in the overall sample. Note that this is not
          identical to the relative abundance in terms of assigned base pairs.
    --seq-abundance-file (std::string)
          output file reporting sequence abundance in CAMI profiling format (including unclassified reads). This is
          the relative sequence abundance in terms of sequenced base pairs for the respective TAXID in the overall
          sample. Note that this is not identical to the genomic abundance in terms of genome copy number for the
          respective TAXID. Default: .
    --binning-file (std::string)
          output file reporting read to taxa assignments in CAMI binning format
    --sample-id (std::string)
          Identifier of the analyzed sample
    --threads (unsigned 8 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].

search-file
Path to the output file of the search step containing results of the classification. This file is tab-separated with 10 columns per line as described above.

cami-report-file
Output file reporting genomic abundances in CAMI profiling format. This is the relative genome abundance or taxonomic abundance in terms of the genome copy number for the respective TAXID in the overall sample.

seq-abundance-file
Output file reporting sequence abundance in CAMI profiling format (including unclassified reads). This is the relative sequence abundance in terms of sequenced base pairs for the respective TAXID in the overall sample.

binning-file
Output file reporting read to taxon assignments in CAMI binning format.

sample-id
String that identifies the analyzed sample.

threads
Number of threads used for taxonomic profiling.

Usage

First download the reference sequences and taxonomy dump of the sequences from the NCBI using genome_updater.

genome_updater.sh \
    -d "refseq"\
    -g "archaea,bacteria,fungi,viral" \
    -c "all" \
    -l "complete genome,chromosome" \
    -f "genomic.fna.gz" \
    -o "refseq-abv" \
    -t 12 \
    -A "species:1" \
    -m -a -p

Then, unpack the taxonomy dump and create a tab-separated-values file using the Linux command cut and taxonkit.

# cd to 2023-03-15_12-56-12

# taxdump
mkdir -p taxdump
tar -zxvf taxdump.tar.gz -C taxdump

cut -f 1,7,20 assembly_summary.txt \
| taxonkit lineage -i 2 -r -n -L --data-dir taxdump \
| taxonkit reformat -I 2 -P -t --data-dir taxdump \
| cut -f 1,2,3,4,6,7 > refseq_accessions_taxonomy.csv

Now we can build the hierarchical interleaved binary fuse filter (HIBFF) index of the reference sequences and the NCBI taxonomy.

fusor build --input-file refseq_accessions_taxonomy.csv --input-sequence-dir refseq/2023-03-15_12-56-12/files \
--output-filename refseq-abfv-k22-s12.hixf --threads 6 --kmer-size 22 --syncmer-size 12 --use-syncmer

Then, we query the sample fastq file against the index allowing in this case a sequencing error rate of 15%.

fusor search --index-file refseq-abfv-k22-s12.hixf --query-file SAMPLE.fq.gz \
--output-file SAMPLE.search.txt --error-rate 0.15 --threads 6 

Finally, the query result file is used as input for taxonomic profiling, which has three output files containing taxonomic abundances and sequence abundances in CAMI report format as well as a binning file with final read to reference assignments.

fusor profile --search-file SAMPLE.search.txt --cami-report-file SAMPLE.report \
--seq-abundance-file SAMPLE.abundance --binning-file SAMPLE.binning --sample-id SAMPLE

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages