Add RNAcentral databases and download caching to mmseqs databases #5

Open
antonvnv wants to merge 3 commits into pskvins:master from antonvnv:rnacentral

Add RNAcentral databases and download caching to mmseqs databases

Based on PR #4, which adds local nucleotide database indexing via `mmseqs databases ./file.fasta.gz db tmp`.

Changes

  • --download-dir parameter (c7c8f0e)

Adds a persistent download cache directory (default: downloads/) so that large files are not re-downloaded across runs. When the file already exists in the cache, the download is skipped. The cache is only active for named databases; local file paths do not create the directory.
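The skip-if-cached behaviour could look roughly like the sketch below. This is not the code from the PR: `fetch_cached`, its arguments, the `.part` temporary name, and the use of `curl` are assumptions made for illustration. Only the rule "skip the download when the file already exists" comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): download URL to DEST unless DEST
# already exists in the cache.
fetch_cached() {
    url="$1"
    dest="$2"
    if [ -e "$dest" ]; then
        # Cache hit: the file is already present, so no download happens.
        echo "cached: $dest"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    # Assumption: download to a temporary name first, so an interrupted
    # transfer is never mistaken for a complete cached file.
    if curl -fsSL -o "$dest.part" "$url"; then
        mv "$dest.part" "$dest"
        echo "downloaded: $dest"
    else
        rm -f "$dest.part"
        return 1
    fi
}
```

Writing to a temporary name and renaming on success is a design choice of this sketch; the PR description does not say how partial downloads are handled.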

Adds two new named databases:

  • RNAcentral_current — latest release from current_release/ (non-deterministic, contents change with each RNAcentral release)
  • RNAcentral_26_0 — pinned to release 26.0 (deterministic)

Both download rnacentral_active.fasta.gz and index it using the nucleotide pipeline (createdb → splitsequence → makepaddedseqdb → createindex). Downloads are stored under `<download-dir>/<dbname>/` subdirectories.
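To make the order of the four steps concrete, here is a dry-run sketch that only prints the commands instead of running them. The helper name `print_nucl_pipeline`, the intermediate database names (`_split`, `_padded`), and the way each step's output feeds the next are assumptions; the flags the PR actually passes are not shown in the description and are left out.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the nucleotide indexing steps in order.
# Intermediate names and chaining are illustrative; real flags are omitted.
print_nucl_pipeline() {
    fasta="$1"
    db="$2"
    tmp="$3"
    echo "mmseqs createdb $fasta $db"
    echo "mmseqs splitsequence $db ${db}_split"
    echo "mmseqs makepaddedseqdb ${db}_split ${db}_padded"
    echo "mmseqs createindex ${db}_padded $tmp"
}

print_nucl_pipeline rnacentral_active.fasta.gz db tmp
```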

Usage

  # Download and index RNAcentral (pinned release)
  mmseqs databases RNAcentral_26_0 db tmp

  # Use a custom download cache
  mmseqs databases RNAcentral_26_0 db tmp --download-dir /data/downloads

  # Skip download if file is already cached
  ls downloads/RNAcentral_26_0/rnacentral_active.fasta.gz  # exists
  mmseqs databases RNAcentral_26_0 db tmp                   # skips download

Building a nucleotide database from a FASTA file previously
required manually chaining createdb, splitsequence, makepaddedseqdb,
and createindex with the right flags. Now a single command does it:

  mmseqs databases ./input.fasta.gz outdb tmp

Both relative (./...) and absolute (/...) paths work — any argument
containing '/' that isn't a known database name is treated as a local
file. Protein inputs are rejected with a clear error since the
indexing pipeline is nucleotide-specific.
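The path rule described above can be sketched as a small classifier. The function name `classify_selection` and the shortened `KNOWN_DATABASES` list are hypothetical; the real list of named databases lives in the tool itself.

```shell
#!/bin/sh
# Illustrative, shortened list; not the tool's actual table of databases.
KNOWN_DATABASES="RNAcentral_current RNAcentral_26_0"

# Hypothetical helper: decide whether an argument names a known database
# or should be treated as a local file.
classify_selection() {
    sel="$1"
    # Known names win, even if a name were to contain a slash.
    for name in $KNOWN_DATABASES; do
        if [ "$sel" = "$name" ]; then
            echo "named"
            return 0
        fi
    done
    case "$sel" in
        */*) echo "local" ;;    # relative (./...) or absolute (/...) path
        *)   echo "unknown" ;;  # neither a known name nor a path
    esac
}
```

Under this rule a bare file name such as `input.fasta.gz` is not recognised as local, which is why the usage example writes `./input.fasta.gz`.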

This keeps `databases` as the single entry point where indexing
requirements are maintained, and makes it suitable for reindexing
external or already manually downloaded databases.

Add a --download-dir parameter (default: "downloads") that provides
a persistent cache directory for downloaded files. This avoids
re-downloading large files across runs. The directory is resolved
to an absolute path and created if it does not exist. The shell
script receives it as the DOWNLOAD_DIR environment variable.

Add RNAcentral active sequences as downloadable databases:
- RNAcentral_current: latest release (non-deterministic)
- RNAcentral_26_0: pinned release 26.0 (deterministic)

Both use the LOCAL_FASTA pipeline for nucleotide indexing
and store downloads under <download-dir>/<dbname>/ subdirectories.
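A minimal sketch of how the cache directory and the per-database subdirectory might be derived. The commit message says the directory is resolved and created before the shell script sees `DOWNLOAD_DIR`; this sketch does the same work in shell purely for illustration, and the helper names `resolve_download_dir` and `db_cache_dir` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create the directory if needed and print its
# absolute path (symlinks resolved).
resolve_download_dir() {
    mkdir -p "$1" || return 1
    (cd "$1" && pwd -P)
}

# Hypothetical helper: cache location for one named database,
# following the <download-dir>/<dbname>/ layout.
db_cache_dir() {
    printf '%s/%s\n' "$DOWNLOAD_DIR" "$1"
}
```

For example, `DOWNLOAD_DIR=$(resolve_download_dir downloads)` followed by `db_cache_dir RNAcentral_26_0` yields the directory in which `rnacentral_active.fasta.gz` would be cached.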