Use a greedy approximation algorithm to find the shortest common superstring of given strings in succinct space. The algorithm is described in the arXiv paper. The implementation assumes a byte alphabet.
- Reasonably new compilers for C and C++, e.g. GCC 6 or Clang 3.7. C++14 support is required.
- GNU gengetopt 2.22.6.
On Linux also the following libraries are required to build and run a tool to verify the constructed superstring:
git clone https://github.com/tsnorri/compact-superstring.gitcd compact-superstringgit submodule update --inittouch local.mkmake -j4 BUILD_STYLE=release
- Clone the repository with
git clone https://github.com/tsnorri/compact-superstring.git. - Change the working directory with
cd compact-superstring. - Run
git submodule update --init. This clones the missing submodules and updates their working tree. - Edit
local.mkin the repository root to override build variables. Useful variables includeCCandCXXfor C and C++ compilers respectively,LOCAL_CXXFLAGSandLOCAL_LDFLAGSfor C++ and linker flags respectively andBOOST_IOSTREAMS_LIBfor the command line options to link Boost's Iostreams library. Seecommon.mkfor additional variables. - Run make with a suitable numer of parallel jobs and the variable
BUILD_STYLEdefined, e.g.make -j4 BUILD_STYLE=release
Useful make targets include:
- all
- Build everything
- clean
- Remove build products except for dependencies (in the `lib` folder).
- clean-all
- Remove all build products.
Please see run.sh for examples.
The tool tribble/find-superstring/find-superstring takes a FASTA file as input and generates an index. The index may then be used to generate the superstring.
The tool tribble/verify-superstring/verify-superstring takes the superstring generated by find-superstring as input and builds another index. This index may then be used to check that all the reads in the original FASTA input file are substrings of the superstring.
See tribble/find-superstring/find-superstring --help and tribble/verify-superstring/verify-superstring --help for command line options and examples.
The implementation differs from the one described in the arXiv paper in the preprocessing stage where it sorts the input strings and removes duplicates. For simplicity, we used the C++ standard library sorting. If there is a huge number of duplicates in the data, this might take O(n log n) bits of space. If your dataset contains a huge number of duplicates, we suggest you remove those before running the algorithm.