Add RNAcentral databases and download caching to mmseqs databases by antonvnv · Pull Request #5 · pskvins/MMseqs2

antonvnv · 2026-04-13T22:21:30Z

Add RNAcentral databases and download caching to mmseqs databases

Based on PR #4, which adds local nucleotide database indexing via `mmseqs databases ./file.fasta.gz db tmp`.

Changes

--download-dir parameter (c7c8f0e)

Adds a persistent download cache directory (default: downloads/) so that large files are not re-downloaded across runs. When the file already exists in the cache, the download is skipped. The cache is only active for named databases; local file paths do not create the directory.
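The skip-if-cached behavior can be sketched roughly as follows. This is an illustration and not the code from the PR: the function name `cached_download`, the use of `curl`, and the `.part` temporary file are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): fetch a URL into a cache
# directory unless a copy of the file is already there.
cached_download() {
    url="$1"        # source URL (placeholder in this sketch)
    cache_dir="$2"  # e.g. downloads/RNAcentral_26_0
    target="${cache_dir}/$(basename "$url")"

    if [ -f "$target" ]; then
        # Cache hit: reuse the existing file and skip the network entirely.
        echo "cached: $target"
        return 0
    fi

    mkdir -p "$cache_dir" || return 1
    # Assumption: download to a temporary name first so an interrupted
    # transfer never leaves a partial file that looks like a cache entry.
    curl -fsSL -o "${target}.part" "$url" || return 1
    mv "${target}.part" "$target" || return 1
    echo "downloaded: $target"
}
```

Checking for the file before creating the directory means a cache hit touches nothing on disk, which matches the "download is skipped" description above.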

RNAcentral databases (7d98645)

Adds two new named databases:

RNAcentral_current — latest release from current_release/ (non-deterministic, contents change with each RNAcentral release)
RNAcentral_26_0 — pinned to release 26.0 (deterministic)

Both download rnacentral_active.fasta.gz and index it using the nucleotide pipeline (createdb → splitsequence → makepaddedseqdb → createindex). Downloads are stored under <download-dir>/<dbname>/ subdirectories.
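To make the order of the four steps concrete, here is a dry-run sketch that only prints the commands. It assumes each step takes an input and an output database as positional arguments; the intermediate names (`_raw`, `_split`) are hypothetical, and all flags are deliberately omitted because this description does not list the options the PR passes.

```shell
#!/bin/sh
# Dry run: print the four indexing steps instead of executing them.
# Flags are intentionally left out; intermediate database names are
# hypothetical and chosen only to show how the steps feed each other.
print_nucl_pipeline() {
    fasta="$1"  # downloaded or local FASTA file
    db="$2"     # final database name
    tmp="$3"    # temporary directory
    echo "mmseqs createdb $fasta ${db}_raw"
    echo "mmseqs splitsequence ${db}_raw ${db}_split"
    echo "mmseqs makepaddedseqdb ${db}_split $db"
    echo "mmseqs createindex $db $tmp"
}

print_nucl_pipeline downloads/RNAcentral_26_0/rnacentral_active.fasta.gz db tmp
```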

Usage

  # Download and index RNAcentral (pinned release)
  mmseqs databases RNAcentral_26_0 db tmp

  # Use a custom download cache
  mmseqs databases RNAcentral_26_0 db tmp --download-dir /data/downloads

  # Skip download if file is already cached
  ls downloads/RNAcentral_26_0/rnacentral_active.fasta.gz  # exists
  mmseqs databases RNAcentral_26_0 db tmp                   # skips download

Building a nucleotide database from a FASTA file previously required manually chaining createdb, splitsequence, makepaddedseqdb, and createindex with the right flags. Now a single command does it:

  mmseqs databases ./input.fasta.gz outdb tmp

Both relative (./...) and absolute (/...) paths work: any argument containing '/' that isn't a known database name is treated as a local file. Protein inputs are rejected with a clear error since the indexing pipeline is nucleotide-specific. This keeps `databases` as the single entry point responsible for indexing requirements, and makes it suitable for reindexing external or manually downloaded databases.
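The local-path rule can be illustrated with a small shell sketch. It is not the PR's implementation: `classify_selection` and the `KNOWN_DBS` list are hypothetical names, and the list holds only the two RNAcentral entries for brevity.

```shell
#!/bin/sh
# Hypothetical list of named databases for this sketch only.
KNOWN_DBS="RNAcentral_current RNAcentral_26_0"

# Print "named", "local", or "unknown" for a selection argument.
classify_selection() {
    sel="$1"
    # Known names are checked first, so they win over the slash rule.
    for name in $KNOWN_DBS; do
        if [ "$sel" = "$name" ]; then
            echo "named"
            return 0
        fi
    done
    case "$sel" in
        */*) echo "local" ;;    # ./input.fasta.gz or /data/input.fasta.gz
        *)   echo "unknown" ;;  # no slash and not a known name
    esac
}
```

Under this rule a bare `input.fasta.gz` is not treated as a local file, which is why the examples write the path as `./input.fasta.gz`.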

Add a --download-dir parameter (default: "downloads") that provides a persistent cache directory for downloaded files. This avoids re-downloading large files across runs. The directory is resolved to an absolute path and created if it does not exist. The shell script receives it as the DOWNLOAD_DIR environment variable.
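One way to get an absolute, existing directory in plain shell is sketched below. The function name `resolve_download_dir` is hypothetical, and where the PR actually performs this step is not stated here; only the DOWNLOAD_DIR variable name comes from the commit message.

```shell
#!/bin/sh
# Hypothetical helper: create the directory if needed and print its
# absolute path. Falls back to "downloads" when no argument is given.
resolve_download_dir() {
    dir="${1:-downloads}"
    mkdir -p "$dir" || return 1
    # The subshell keeps the caller's working directory unchanged;
    # pwd -P prints the absolute path with symlinks resolved.
    (cd "$dir" && pwd -P)
}
```

A caller could then run `DOWNLOAD_DIR="$(resolve_download_dir /data/downloads)"` followed by `export DOWNLOAD_DIR`, so that a script started afterwards sees the same absolute path regardless of its own working directory.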

Add RNAcentral active sequences as downloadable databases:

- RNAcentral_current: latest release (non-deterministic)
- RNAcentral_26_0: pinned release 26.0 (deterministic)

Both use the LOCAL_FASTA pipeline for nucleotide indexing and store downloads under <download-dir>/<dbname>/ subdirectories.

antonvnv added 3 commits April 13, 2026 21:45

antonvnv wants to merge 3 commits into pskvins:master from antonvnv:rnacentral
