scbirlab/nf-ggi is a Nextflow pipeline to screen gene-gene interactions within an organism or between organisms (in the case of host-pathogen or phage-bacterium interactions).
Table of contents
- Processing steps
- Requirements
- Quick start
- Inputs
- Outputs
- Credit
- Issues, problems, suggestions
- Further help
scbirlab/nf-ggi carries out the following steps:
- [Optional] Download Rhea DB (of metabolites) in preparation for searching.
For each protein or proteome in the sample sheet:
- [Optional] Download its STRING database and tidy up the data.
- Download FASTA sequences of proteins from UniProt
- If multiple proteomes are available, choose according to this priority: "Reference and representative", "Reference", "Representative", "Other"
- Find reactions in Rhea DB and connect products with reactants between enzymes in the proteome.
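The proteome priority rule above can be sketched in a few lines of shell. This is only an illustration of the ordering, not the pipeline's own code: the `rank_of` helper and the candidate IDs (`UP_A`, `UP_B`, `UP_C`) are hypothetical.

```shell
# Map a proteome type to its rank in the priority list (lower is better).
# rank_of is a hypothetical helper, not part of nf-ggi.
rank_of() {
  case "$1" in
    "Reference and representative") echo 0 ;;
    "Reference") echo 1 ;;
    "Representative") echo 2 ;;
    *) echo 3 ;;  # "Other" and anything unrecognised
  esac
}

# Hypothetical candidates, one per line as "proteome_id|type".
candidates="UP_A|Other
UP_B|Representative
UP_C|Reference"

# Prefix each candidate with its rank, sort numerically, keep the best one.
best=$(
  printf '%s\n' "$candidates" |
    while IFS='|' read -r id type; do
      printf '%s|%s\n' "$(rank_of "$type")" "$id"
    done |
    sort -t '|' -k1,1n |
    head -n 1 |
    cut -d '|' -f 2
)
echo "$best"  # prints UP_C
```

Here the "Reference" proteome wins because no "Reference and representative" candidate is present.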
For each FASTA sequence:
- Generate a multiple sequence alignment with hhblits.
For method == "self":
- Within each organism, generate all unique pairs of proteins.
For method == "bait":
- All unique pairs of proteins between the organism and listed baits.
For method == "custom":
- All unique pairs of proteins listed.
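As a rough illustration of how the "self" and "bait" pair lists differ, the sketch below enumerates both for a toy example. The protein and bait IDs are hypothetical and the pipeline's actual implementation may differ. In "custom" mode no enumeration is needed, because the pairs are read directly from the supplied file.

```shell
proteome="P1 P2 P3"   # hypothetical proteome
baits="B1 B2"         # hypothetical baits

# "self": pair each protein with every protein listed after it,
# so each unordered pair appears once and no protein pairs with itself.
self_pairs=$(
  set -- $proteome
  while [ "$#" -gt 1 ]; do
    a=$1; shift
    for b in "$@"; do echo "$a,$b"; done
  done
)

# "bait": pair every protein in the proteome with every bait.
bait_pairs=$(
  for a in $proteome; do
    for b in $baits; do echo "$a,$b"; done
  done
)

echo "self: $(echo "$self_pairs" | tr '\n' ' ')"
echo "bait: $(echo "$bait_pairs" | tr '\n' ' ')"
```

For n proteins, "self" yields n(n-1)/2 pairs, while "bait" yields n times the number of baits.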
Then for each protein pair, optionally:
- with --dca: Calculate the co-evolutionary signal with DCA, optionally generating plots of contact maps.
- with --rf2t: Predict the interface contact map with yunta rf2t (RosettaFold-2track), optionally generating plots of contact maps.
- with --af2: Predict the protein-protein complex structure map with yunta af2 (AlphaFold2), optionally generating plots of contact maps.
You need access to the UniClust and BFD databases, and you need Nextflow and either conda, Singularity, or Docker to be installed.
To generate multiple-sequence alignments (MSAs) for co-evolutionary analysis, hhblits databases of
pre-clustered sequences are required. Unfortunately, these are extremely large, so cannot be downloaded as
part of the pipeline. You should download the UniClust and BFD
databases, then set the --uniclust and --bfd parameters of the pipeline (see below).
If you're at the Crick, these databases already reside on NEMO, and there is no need to download them.
You need to have Nextflow and either Singularity, Docker, or conda installed on your system.
If you're at the Crick or your shared cluster has Nextflow and Singularity already installed, try:
module load Nextflow Singularity
Otherwise, if it's your first time using Nextflow on your system, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
There are three run modes for the pipeline:
- "self": run all protein-protein interactions within an organism
- "bait": run interactions between all proteins from an organism and either one protein or another organism's proteome
- "custom": run specified protein pairs from a file
The easiest way to get going is by specifying parameters on the command-line:
bfd=path/to/your/bfd
uniclust=path/to/your/uniclust
nextflow run scbirlab/nf-ggi \
--bfd "$bfd" --uniclust "$uniclust" \
--organism_id 243273 \
--dca --rf2t --plots
Here's what the flags mean:
- --organism_id: The Taxon ID of the organism, which you can find at NCBI or UniProt
- --dca, --rf2t: Run direct-coupling analysis and RosettaFold-2track
- --plots: Generate amino acid contact maps. This takes about 10MB per protein-protein interaction, so be sure you have enough disk space for the number of protein-protein pairs you're testing!
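To get a feel for the disk space that --plots may need, here is a back-of-the-envelope estimate for an all-vs-all ("self") run. It assumes the approximate 10MB-per-pair figure quoted above and a hypothetical proteome of 500 proteins; substitute your own protein count.

```shell
n_proteins=500   # hypothetical proteome size, not a measured value
mb_per_pair=10   # approximate figure quoted above

# All-vs-all within one organism gives n(n-1)/2 unique pairs.
n_pairs=$(( n_proteins * (n_proteins - 1) / 2 ))

# Integer arithmetic, taking 1 GB as 1000 MB; good enough for an estimate.
total_gb=$(( n_pairs * mb_per_pair / 1000 ))

echo "$n_pairs pairs, roughly $total_gb GB of plots"
```

With these numbers the estimate is 124750 pairs and about 1247 GB, which suggests reserving --plots for smaller runs or a short list of pairs.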
You can also run --metabolites to get the metabolic network, and --string to get the STRING co-expression network.
Because only --organism_id was provided, the pipeline assumes "self" mode (i.e. all-vs-all within taxon 243273).
You can run bait mode by providing --bait <UniProt ID>:
nextflow run scbirlab/nf-ggi --bfd "$bfd" --uniclust "$uniclust" \
--organism_id 559292 --bait P00931 \
--dca
The bait can be another organism's proteome. In this case, we need to specify both --bait_is_taxon and --interspecies:
nextflow run scbirlab/nf-ggi --bfd "$bfd" --uniclust "$uniclust" \
--organism_id 559292 --bait 1773 \
--bait_is_taxon --interspecies \
--rf2t
scbirlab/nf-ggi runs on a Singularity container engine by default to ensure software versions are consistent. If you have
docker installed, you can run using -with-docker to use it instead, or if you have Conda you can run -with-conda.
Make a sample sheet (see below) with columns representing the flags above, and, optionally, a nextflow.config file in the
directory where you want the pipeline to run. Then simply run:
nextflow run scbirlab/nf-ggi
Each time you run the pipeline after the first time, Nextflow will use a locally-cached version which
will not be automatically updated. If you want to ensure that you're using the very latest version of the
pipeline, use the -latest flag.
nextflow run scbirlab/nf-ggi -latest
If you want to run a particular tagged version of the pipeline, such as v0.0.5, you can do so using
nextflow run scbirlab/nf-ggi -r v0.0.5
For help, use nextflow run scbirlab/nf-ggi --help.
The first time you run the pipeline on your system, the software dependencies in environment.yml will be installed.
This may take several minutes.
The pipeline can be run with command-line arguments:
# intra-species all-vs-all:
nextflow run scbirlab/nf-ggi --uniclust <path> --bfd <path> \
--organism_id <taxon ID>
# intra-species all-vs-1:
nextflow run scbirlab/nf-ggi --uniclust <path> --bfd <path> \
--organism_id <taxon ID> --bait <UniProtID>
# inter-species all-vs-all:
nextflow run scbirlab/nf-ggi --uniclust <path> --bfd <path> \
--organism_id <taxon ID> --bait <taxon ID> \
--bait_is_taxon --interspecies
# custom list of pairs:
nextflow run scbirlab/nf-ggi --uniclust <path> --bfd <path> \
--organism_id <taxon ID> \
--filename <path> --column1 <gene-col1> --column2 <gene-col2> \
[--interspecies --organism_id2 <taxon ID>] [--format <gene-name-type>]
The following parameters are required:
--organism_id Taxon ID for organism
Bait mode:
--bait UniProt ID for bait protein, or Taxon ID for bait organism
Custom mode:
--filename Filename to get custom protein pairs
--column1, --column2 Column names from --filename to get protein IDs
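The per-mode requirements listed above can be turned into a small pre-flight check before launching a long run. The sketch below is a hypothetical helper (`check_params` is not part of nf-ggi) and it assumes the mode is known in advance; it only mirrors the required parameters listed above.

```shell
# Hypothetical pre-flight check; arguments are positional:
#   mode organism_id bait filename column1 column2
check_params() {
  mode=$1 organism_id=$2 bait=$3 filename=$4 column1=$5 column2=$6
  # --organism_id is required in every mode.
  [ -n "$organism_id" ] || { echo "missing --organism_id"; return 1; }
  case "$mode" in
    self) ;;  # nothing further needed
    bait)
      [ -n "$bait" ] || { echo "missing --bait"; return 1; } ;;
    custom)
      [ -n "$filename" ] && [ -n "$column1" ] && [ -n "$column2" ] ||
        { echo "missing --filename, --column1 or --column2"; return 1; } ;;
    *)
      echo "unknown mode: $mode"; return 1 ;;
  esac
  echo "ok"
}

check_params self 243273
check_params bait 559292 P00931
check_params custom 559292 "" combos.csv query_orf array_orf
check_params bait 559292 || echo "bait run would be rejected"
```

The first three calls print "ok"; the last one reports the missing --bait value.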
The following parameters are optional. They have default values which can be overridden if necessary.
--reviewed Only pull SwissProt reviewed proteins from proteome
--isoforms Additionally pull isoform sequences from proteome
--proteome_opts Additional filters for pulling from proteome. Check
https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/search for options.
--bait_is_taxon Indicate that bait is an organism ID
--interspecies Run analysis between interacting species proteomes
--plots Generate contact map plots
--organism_id2 When providing a file of pairs, the Taxon ID of the second protein's (--column2) organism, if it is from another organism than the first
--format Type of gene identifier in --column1, --column2. Default: "Gene_Name"
--test Whether to run in test mode. Default: false.
--outputs Output folder. Default: "outputs".
--batch_size What size to batch protein-protein interactions into. Default: 100.
You can run multiple combinations in one command using a sample sheet. The sample sheet is a CSV file with one row per combination of parameters to run. The column headings have the same names as the required flags for command-line usage. The optional flags are still given on the command line, and are applied to everything in the run. Here, the mode needs to be specified with --mode:
nextflow run scbirlab/nf-ggi --mode self --rf2t --metabolites --sample_sheet path/to/sample-sheet.csv
Here is an example of the sample sheet for mode = "self", to find all the mycoplasma protein-protein interactions:
| organism_id | proteome_name |
|---|---|
| 243273 | "Mycoplasma genitalium" |
The proteome_name column is not necessary, but you can add extra columns with human-readable annotations for your own sanity. We recommend:
- proteome_name
- (if using a bait) bait_name
If running with mode = "bait", to do a pulldown against a single bait protein, add another column with the bait UniProt ID.
| organism_id | proteome_name | bait | bait_name |
|---|---|---|---|
| 243273 | Mycoplasma genitalium | P47259 | FolD |
If running with mode = "custom", to run a specified list of protein pairs, add columns giving the file of pairs (filename), the names of its two identifier columns (column1, column2), and the identifier type (format).
| organism_id | proteome_name | format | filename | column1 | column2 |
|---|---|---|---|---|---|
| 559292 | Saccharomyces cerevisiae | Gene_Name | combos.csv | query_orf | array_orf |
In this case, combos.csv must be in the inputs folder defined above. It would look like:
| query_orf | query_gene_name | array_orf | array_gene_name |
|---|---|---|---|
| YAL058W | CNE1 | YAL068C | PAU8 |
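The tables above are rendered for readability; on disk both files are plain CSV. The sketch below writes the two example files to a temporary directory and checks that the columns named in the sample sheet really exist in combos.csv. The check is a hypothetical convenience, not something the pipeline is documented to do, and it assumes simple CSV headers without quoted commas.

```shell
workdir=$(mktemp -d)

# Sample sheet for mode = "custom", matching the table above.
cat > "$workdir/sample-sheet.csv" <<'EOF'
organism_id,proteome_name,format,filename,column1,column2
559292,Saccharomyces cerevisiae,Gene_Name,combos.csv,query_orf,array_orf
EOF

# The file of protein pairs referenced by the sample sheet.
cat > "$workdir/combos.csv" <<'EOF'
query_orf,query_gene_name,array_orf,array_gene_name
YAL058W,CNE1,YAL068C,PAU8
EOF

# Check that both identifier columns appear in the combos.csv header.
header=$(head -n 1 "$workdir/combos.csv")
missing=0
for col in query_orf array_orf; do
  case ",$header," in
    *",$col,"*) ;;                              # column found
    *) echo "missing column: $col"; missing=1 ;;
  esac
done
echo "missing=$missing"
```

A result of `missing=0` means the column1 and column2 values in the sample sheet match the header of the pairs file.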
Further examples are in the test directory of this repository.
For reproducibility, self-documentation, and to save typing, parameters with the same names as the command line flags above can be provided in a nextflow.config file in the working directory. For example:
params {
    organism_id = "243273"
    dca = true
    rf2t = true
}
Or with a sample sheet:
params {
    sample_sheet = "path/to/sample-sheet.csv"
    mode = "self"
    dca = true
    rf2t = true
    metabolites = true
    string = true
}
Outputs are saved in the output folder defined above. They include these directories:
- string: STRING co-expression values
- metabolites: Reconstructed metabolic network
- msa: All MSA files
- ppi: All protein-protein interaction data
- sequences: Protein sequences
The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
scbirlab/nf-ggi applies these algorithms in a Nextflow pipeline to allow easy scaling, and enables inter-species interactions. It also reconstructs metabolic networks, and pulls known interactions from the STRING database.
Add to the issue tracker.
Here are the pages of the software and databases used by this pipeline.
Databases:
- STRING for co-expression
- Rhea for enzyme reactions
- UniProt for protein sequences
- NCBI Genbank for taxonomy
Software: