(Practical Heuristic ON Incremental matching statistics computation)
This framework supports the currently memory-friendliest way to compute the matching statistics of a pattern on highly-repetitive texts, given that the input text is precomputed with the MONI index (more precisely, we need all ingredients of MONI except the thresholds).
We require the pattern and the text to be available in form of sequences stored in the .fa (FASTA) format.
To use our solution, you need to have recent cmake, g++, zsh, and python 3 installed.
We need the following python 3 packages for extracting and concatenating .fa files:
pip3 install biopython
pip3 install fastaparser
pip3 install psutilgit clone --branch phoni https://github.com/koeppl/phonimkdir build
cd build; cmake ..
makeTo build the index we use the command phoni build from the build directory.
phoni build \
-r <filename of the reference> \
-t <number of threads> \
-g <grammar format> \
-f <input file is a fasta file> \For example, to build the phoni index for the file yeast.fasta using 4 threads and the plain grammar we run from the build folder:
python3 phoni build -r ../data/yeast.fasta -f -t 4 -g plain
This command will produce yeast.fasta.phoni and yeast.fasta.plain.slp in the data folder, which represent the phoni index.
To query the index we use the command phoni query from the build directory.
phoni ms \
-r <filename of the reference> \
-p <number of threads> \
-g <grammar format> \For example, to query the phoni index for the file yeast.fasta using the plain grammar with the pattern samples.fastq we run from the build folder:
python3 phoni build -r ../data/yeast.fasta -p ../data/samples.fa -g plain
This command will produce samples.fa.positions and samples.fa.lengths in the data folder, which represent the matching staistics positions and lengths of samples.fa against yeast.fasta, respectively.
To perform the queries using the old query command we first have to split the samples.fa file using the python tool splitpattern.py:
mkdir data/samples.fa.dir
python3 splitpattern.py data/samples.fa data/samples.fa.dirThen we can query samples.fa against the yeast.fasta index with the plain grammar using the following command:
./build/test/src/phoni_compatibility data/yeast.fasta -p data/samples.fa -g plainprefixpattern.py: takes the x% prefix of each pattern stored in a.fafile and outputs a new.fafile to stdoutsplitpattern.py: splits a.fafile into individual sequences stored in a directory given as program parameter (we need this formsfastas it does not support reading.fafiles)
We provide a script and benchmark files to evaluate PHONI in the setting as described in the paper
Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, In proceedings of 2021 Data Compression Conference, pp. 193-202, 2021.
Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, arXiv:2011.05610, 11 Nov 2020.
In our experiments we used the file
- chr19.1000.fa.xz as our text dataset, and
- chr19.10.fa.xz as our pattern dataset.
We have a shell script benchmark.sh for an automatic benchmark.
For this to work, some variables in it has to be set, as this project does not ship with the other matching statistic algorithms, namely
meaning it is necessary to download and compile those projects individually, and the set the corresponding variables in benchmark.sh manually
(more precisely: in the switch-case statement for the hostname in the beginning).
Finally, the output of benchmark.sh can be processed by sqlplots to generate the plots shown in the paper.
To compute the naive PHONI variant evaluated in the paper, simple exchange lceToRBounded with lceToR_NaiveBounded in the file include/ms/phoni.hpp.