gSpreadComp Usage Tutorial

Welcome to the gSpreadComp usage tutorial! This guide is designed to help you understand how to use the tool effectively through practical examples and step-by-step instructions.

You can use your own genomes, but if you need some genomes for testing, you can use the set provided here.

Introduction

gSpreadComp is designed to work with genomes in fasta format and requires a metadata table in CSV format containing a target variable. For example tables, please refer to example_metadata_table_link and example_genome_fasta_link. To recover prokaryotic genomes from metagenomic samples, you can use tools like MuDoGeR.

Prerequisites

Before proceeding with the tutorial, make sure you have:

Genomes in fasta format.
A metadata table in CSV format with a target variable. The table must include a column named "Library" to identify the source sample, a column named "Genome" to identify the genome, and a column named "Target" to identify your target variable. Refer to the example here.
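
As a quick sanity check before running any module, the required column names can be verified from the shell. The sketch below is not part of gSpreadComp: the function name `check_metadata_header` and the two-line demo table are hypothetical, and it assumes a plain comma-separated header line without quoted fields.

```shell
# Hypothetical helper (not part of gSpreadComp): verify that the header of a
# metadata CSV contains the Library, Genome, and Target columns.
# Assumes a simple comma-separated header without quoted fields.
check_metadata_header() {
    # $1 = path to the metadata CSV
    header=$(head -n 1 "$1" | tr -d '\r')
    missing=0
    for col in Library Genome Target; do
        # Split the header into one column name per line and look for an exact match
        if ! printf '%s\n' "$header" | tr ',' '\n' | grep -qx "$col"; then
            echo "missing column: $col" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo with a made-up two-line table
demo_meta=$(mktemp)
printf 'Library,Genome,Target\nsample_1,genome_name_1,groupA\n' > "$demo_meta"
check_metadata_header "$demo_meta" && echo "metadata header OK"
```

The function returns a non-zero status and names each missing column on standard error, so it can be chained in front of a longer run.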

Modules and Steps

In this tutorial, we will go through the following modules and steps of gSpreadComp:

Taxonomy Assignment
Genome Quality Estimation
ARG Annotation
Plasmid Identification
Virulence Factors (VFs) Annotation
Downstream Analysis

We will inspect the main outputs of each module to ensure a comprehensive understanding of the tool's functionalities and results.

Let's get started!

Taxonomy Assignment Module

The taxonomy module in gSpreadComp uses GTDB-Tk for taxonomy assignment. To run this module, use the gspreadcomp taxonomy command. Below are the available options for this module:

gspreadcomp taxonomy --help

Usage: gspreadcomp taxonomy [options] --genome_dir genome_folder -o output_dir
Options:
    --genome_dir STR    folder with the bins to be classified (in fasta format)
    --extension STR     fasta file extension (e.g. fa or fasta) [default: fa]
    -o STR              output directory
    -t INT              number of threads

Running the Taxonomy Module

Create a folder for the test run, for example, test_gspread_run.
Place your genomes in a subfolder within the test run folder, for example, 01_input_genomes.
Create an output folder within the test run folder, for example, 03_gspread_gtdb_taxonomy.
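
The three setup steps above can be sketched in the shell as follows. The folder names are the ones used in this tutorial; the source path in the commented `cp` line is a placeholder to replace with the location of your own genomes.

```shell
# Create the test run folder with the input and output subfolders used in this tutorial
mkdir -p test_gspread_run/01_input_genomes test_gspread_run/03_gspread_gtdb_taxonomy

# Copy your genomes into the input folder (placeholder path, adjust before use)
# cp /path/to/your/genomes/*.fa test_gspread_run/01_input_genomes/

# Work from inside the test run folder so the relative paths in later commands resolve
cd test_gspread_run
```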

Assuming you have placed your genomes in 01_input_genomes, your fasta files have the fa extension, and your output folder is 03_gspread_gtdb_taxonomy, your command will look like:

$ gspreadcomp taxonomy --genome_dir ./01_input_genomes/ --extension fa -o ./03_gspread_gtdb_taxonomy/ -t 25
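
Before launching a long run, it can help to confirm that the genome folder actually contains files with the extension passed to --extension. The helper below is a hypothetical sketch, not a gSpreadComp command, and it assumes the genomes sit directly in the folder rather than in nested subfolders.

```shell
# Hypothetical helper: count the files in a folder that end with a given extension.
count_genomes() {
    # $1 = genome folder, $2 = extension without the dot (e.g. fa)
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Demo on a throwaway folder with two .fa files and one unrelated file
demo_dir=$(mktemp -d)
touch "$demo_dir/genome_name_1.fa" "$demo_dir/genome_name_2.fa" "$demo_dir/notes.txt"
count_genomes "$demo_dir" fa   # prints 2
```

In the tutorial layout, the equivalent call would be `count_genomes ./01_input_genomes fa`; a result of 0 usually means the extension does not match the files.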

Output of the Taxonomy Module

After running the Taxonomy Module, you will find the output in the specified output directory, structured as follows:

03_gspread_gtdb_taxonomy/
├── align
├── classify
├── gtdb_df_format_gSpread.csv
├── gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
├── gtdbtk.log
├── gtdbtk_result.tsv
├── gtdbtk.warnings.log
└── identify

Main Output File: gtdb_df_format_gSpread.csv. Understanding the Output

The user can find an example of the expected gtdb_df_format_gSpread.csv here.

The gtdb_df_format_gSpread.csv file contains taxonomy information for each genome in a format that is ready for integration into subsequent gSpreadComp modules. This file is crucial for the downstream analysis and should be retained. The gtdbtk_result.tsv file consolidates the results from GTDB-Tk, providing comprehensive information on taxonomy assignments in the GTDB-Tk format. For a detailed description of the other files, the user can go to the GTDB-Tk page.

Genome Quality Estimation using CheckM

The quality module in gSpreadComp uses CheckM to estimate the quality of prokaryotic genomes. To run this module, use the gspreadcomp quality command. Below are the available options for this module:

gspreadcomp quality --help

Usage: gspreadcomp quality [options] --genome_dir genome_folder -o output_dir
Options:
    --genome_dir STR    folder with the genomes to estimate quality (in fasta format)
    --extension STR     fasta file extension (e.g. fa or fasta) [default: fa]
    -o STR              output directory
    -t INT              number of threads [default: 1]

Running the Quality Module

Ensure you are in the test_gspread_run folder created in the previous step.
Place your genomes in the 01_input_genomes subfolder within the test run folder.
Create an output folder within the test run folder, for example, 04_gspread_checkm_quality.

Assuming you have placed your genomes in 01_input_genomes and your output folder is 04_gspread_checkm_quality, your command will look like this:

$ gspreadcomp quality --genome_dir ./01_input_genomes/ --extension fa -o ./04_gspread_checkm_quality/ -t 25

Run this command, and once it's completed, you can proceed to inspect the output in the 04_gspread_checkm_quality folder.

Exploring the Output of the Quality Module

After running the Quality Module, you will find the output in the specified output directory, structured as follows:

04_gspread_checkm_quality/
├── bins
├── checkm_df_format_gSpread.csv
├── checkm.log
├── lineage.ms
├── outputcheckm.tsv
└── storage

Main Output File: checkm_df_format_gSpread.csv. Understanding the Output

The user can find an example of the expected checkm_df_format_gSpread.csv here.

The checkm_df_format_gSpread.csv file contains quality information for each genome in a format that is ready for integration into subsequent gSpreadComp modules. This file is crucial for the downstream analysis and should be retained. The outputcheckm.tsv is the main output file from CheckM itself, consolidating the results and providing comprehensive information on genome quality.

For a detailed description of the other files, the user can go to the CheckM page.

ARGs Annotation using DeepARG

The ARGs module in gSpreadComp uses DeepARG to predict the Antimicrobial Resistance Genes (ARGs) in a genome. To run this module, use the gspreadcomp args command. Below are the available options for this module:

gspreadcomp args --help

Usage: gspreadcomp args [options] --genome_dir genome_folder -o output_dir
Options:
    --genome_dir STR    folder with the genomes to be classified (in fasta format)
    --extension STR     fasta file extension (e.g. fa or fasta) [default: fa]
    --min_prob NUM      Minimum probability cutoff for DeepARG [Default: 0.8]
    --arg_alignment_identity NUM   Identity cutoff for sequence alignment for DeepARG [Default: 35]
    --arg_alignment_evalue NUM     Evalue cutoff for DeepARG [Default: 1e-10]
    --arg_alignment_overlap NUM    Alignment read overlap for DeepARG [Default: 0.8]
    --arg_num_alignments_per_entry NUM   Diamond, minimum number of alignments per entry [Default: 1000]
    -o STR              output directory

Running the ARGs Module

Ensure you are in the test_gspread_run folder created in the previous steps.
Place your genomes in the 01_input_genomes subfolder within the test run folder.
Create an output folder within the test run folder, for example, 05_gspread_deeparg_args.

Assuming you have placed your genomes in 01_input_genomes and your output folder is 05_gspread_deeparg_args, your command will look like this:

$ gspreadcomp args --genome_dir ./01_input_genomes/ --extension fa -o ./05_gspread_deeparg_args/

Run this command, and once it's completed, you can proceed to inspect the output in the 05_gspread_deeparg_args folder.

Inspecting the Output of the ARGs Module

After running the ARGs Module, you will find the output in the specified output directory, structured as follows:

05_gspread_deeparg_args/
├── deeparg_df_format_gSpread.csv
├── deeparg_df_combined_raw.csv
├── genome_name_1.fa
│   ├── genome_name_1.fa_deeparg_out.align.daa
│   ├── genome_name_1.fa_deeparg_out.align.daa.tsv
│   ├── genome_name_1.fa_deeparg_out.mapping.ARG
│   └── genome_name_1.fa_deeparg_out.mapping.potential.ARG
├── genome_name_2.fa
│   ├── genome_name_2.fa_deeparg_out.align.daa
│   ├── genome_name_2.fa_deeparg_out.align.daa.tsv
│   ├── genome_name_2.fa_deeparg_out.mapping.ARG
│   └── genome_name_2.fa_deeparg_out.mapping.potential.ARG
└── genomes_with_no_found_deeparg.csv

Understanding the Output Files and Directories

The user can find an example of the expected deeparg_df_format_gSpread.csv here.

deeparg_df_format_gSpread.csv: This is the format-ready main output file, containing formatted ARGs annotation information per genome. It's ready for integration into subsequent gSpreadComp modules.
deeparg_df_combined_raw.csv: This file combines the raw output from DeepARG for all genomes analyzed.
genome_name.fa Directories: For each genome analyzed, a separate directory is created, named after the genome. For a detailed description of its contents, the user can read the DeepARG documentation.
genomes_with_no_found_deeparg.csv: This file lists the genomes for which no ARGs were found by DeepARG.

Plasmid Identification using PlasFlow

The Plasmid module in gSpreadComp uses PlasFlow to predict if a sequence within a fasta file is a chromosome, plasmid, or undetermined. To run this module, use the gspreadcomp plasmid command. Below are the available options for this module:

gspreadcomp plasmid --help

Usage: gspreadcomp plasmid [options] --genome_dir genome_folder -o output_dir
Options:
    --genome_dir STR    folder with the genomes to be classified (in fasta format)
    --extension STR     fasta file extension (e.g. fa or fasta) [default: fa]
    --threshold NUM     threshold for probability filtering [default: 0.7]
    -o STR              output directory

Running the Plasmid Identification Module

Ensure you are in the test_gspread_run folder created in the previous steps.
Place your genomes in the 01_input_genomes subfolder within the test run folder.
Create an output folder within the test run folder, for example, 06_gspread_plasmids.

Assuming you have placed your genomes in 01_input_genomes and your output folder is 06_gspread_plasmids, your command will look like this:

$ gspreadcomp plasmid --genome_dir ./01_input_genomes/ --extension fa -o ./06_gspread_plasmids/

Run this command, and once it's completed, you can proceed to inspect the output in the 06_gspread_plasmids folder. Below is the expected output structure and explanation of each output file.

Inspecting the Output of the Plasmid Module

After running the Plasmid Module, you will find the output in the specified output directory, structured as follows:

06_gspread_plasmids/
├── genome_name_1.fa
│   ├── genome_name_1.fa_plasflow_out.tsv
│   ├── genome_name_1.fa_plasflow_out.tsv_chromosomes.fasta
│   ├── genome_name_1.fa_plasflow_out.tsv_plasmids.fasta
│   └── genome_name_1.fa_plasflow_out.tsv_unclassified.fasta
├── genome_name_2.fa
│   ├── genome_name_2.fa_plasflow_out.tsv
│   ├── genome_name_2.fa_plasflow_out.tsv_chromosomes.fasta
│   ├── genome_name_2.fa_plasflow_out.tsv_plasmids.fasta
│   └── genome_name_2.fa_plasflow_out.tsv_unclassified.fasta
├── genomes_with_no_found_plasflow.csv
└── plasflow_combined_format_gSpread.csv

Understanding the Output Files and Directories

The user can find an example of the expected plasflow_combined_format_gSpread.csv here.

genome_name.fa Directories: For each genome analyzed, a separate directory is created, named after the genome. It contains the following files:
- genome_name.fa_plasflow_out.tsv: This file contains the PlasFlow results in tab-separated values format.
- genome_name.fa_plasflow_out.tsv_chromosomes.fasta: This file contains sequences predicted to be chromosomes.
- genome_name.fa_plasflow_out.tsv_plasmids.fasta: This file contains sequences predicted to be plasmids.
- genome_name.fa_plasflow_out.tsv_unclassified.fasta: This file contains sequences that could not be classified as either plasmids or chromosomes.
For a detailed description of these files, the user can read the PlasFlow documentation.
genomes_with_no_found_plasflow.csv: This file lists the genomes for which no sequences were found by PlasFlow. Hopefully, it will be empty.
plasflow_combined_format_gSpread.csv: This is the format-ready main output file containing formatted PlasFlow results per genome. It's ready for integration into subsequent gSpreadComp modules.

Virulence Factor Annotation

The pathogens module in gSpreadComp uses BLAST to align the provided genomes against the selected virulence factor databases (Victors, VFDB, or both) and formats the output. Here, the user can find the Victors database and the VFDB database.

To run this module, use the gspreadcomp pathogens command. Below are the available options for this module:

gspreadcomp pathogens --help

Usage: gspreadcomp pathogens [options] --genome_dir genome_folder -o output_dir
Options:
    --genome_dir STR    folder with the genomes to be aligned against Virulence factors (in fasta format)
    --extension STR     fasta file extension (e.g. fa or fasta) [default: fa]
    --evalue NUM        evalue, expect value, threshold as defined by NCBI-BLAST [default: 1e-50]
    --vf STR            select the virulence factors database to be used (e.g. victors, vfdb or both) [default: both]
    -t INT              number of threads
    -o STR              output directory

Running the Pathogens Module

Ensure you are in the test_gspread_run folder created in the previous steps.
Place your genomes in the 01_input_genomes subfolder within the test run folder.
Create an output folder within the test run folder, for example, 07_gspread_pathogens.

Assuming you have placed your genomes in 01_input_genomes and your output folder is 07_gspread_pathogens, your command will look like this:

$ gspreadcomp pathogens --genome_dir ./01_input_genomes/ --extension fa -o ./07_gspread_pathogens/ --vf both -t 25

Run this command, and once it's completed, you can proceed to inspect the output in the 07_gspread_pathogens folder.

Inspecting the Output of the Pathogens Module

After running the Pathogens Module, you will find the output in the specified output directory. Below is the expected output structure and explanation of each output file.

07_gspread_pathogens/
├── genome_name_1.fa
│   ├── vfdb_genome_name_1.out
│   └── victors_genome_name_1.out
├── genome_name_2.fa
│   ├── vfdb_genome_name_2.out
│   └── victors_genome_name_2.out
├── vfdb_format_gSpread.csv
├── vfdb_headers.txt
├── vfdb_merged.out
├── vfdb_per_genome_unique_count.csv
├── victors_format_gSpread.csv
├── victors_db_headers.txt
├── victors_merged.out
└── victors_per_genome_unique_count.csv

Understanding the Output Files and Directories

The user can find an example of the expected victors_format_gSpread.csv here. An equivalent output is generated if the user uses the VFDB database instead of the Victors database.

vfdb_format_gSpread.csv & victors_format_gSpread.csv: These are the format-ready main output files containing formatted virulence factors results per genome. They're ready for integration into subsequent gSpreadComp modules.
vfdb_headers.txt & victors_db_headers.txt: These files contain the headers for the VFDB and Victors databases, respectively.
vfdb_merged.out & victors_merged.out: These files contain the merged results of the alignments against the VFDB and Victors databases, respectively.
vfdb_per_genome_unique_count.csv & victors_per_genome_unique_count.csv: These files contain the count of unique virulence factors per genome for the VFDB and Victors databases, respectively.

Main Output Files: vfdb_format_gSpread.csv & victors_format_gSpread.csv

The vfdb_format_gSpread.csv and victors_format_gSpread.csv files contain virulence factors results for each genome in a format that is ready for integration into subsequent gSpreadComp modules. These files are crucial for downstream analysis and should be retained.

Using custom files not generated with gSpreadComp

Tip

If the user wants to generate custom quality, taxonomy, gene annotation, plasmid identification, or virulence factor annotation files outside gSpreadComp, it is important to maintain the table formatting.

All the examples of input tables used by the gSpread module are here.

Most importantly, keep the column names exactly as they appear in the example tables.
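
One way to check the column naming of a custom table is to compare its header line against the header of the corresponding example table. The helper below is a hypothetical sketch, not part of gSpreadComp. It assumes simple comma-separated headers without quoted commas, and the column names in the demo (col_a, col_b, col_c) are made up for illustration only.

```shell
# Hypothetical helper: list column names that appear in the example table
# but are absent from a custom table.
missing_columns() {
    # $1 = example CSV, $2 = custom CSV
    mc_example=$(mktemp)
    mc_custom=$(mktemp)
    # Turn each header into a sorted list with one column name per line
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | sort -u > "$mc_example"
    head -n 1 "$2" | tr -d '\r' | tr ',' '\n' | sort -u > "$mc_custom"
    # comm -23 keeps only the lines unique to the first (example) list
    comm -23 "$mc_example" "$mc_custom"
    rm -f "$mc_example" "$mc_custom"
}

# Demo with made-up headers
example_csv=$(mktemp)
custom_csv=$(mktemp)
printf 'col_a,col_b,col_c\n' > "$example_csv"
printf 'col_a,col_c\n' > "$custom_csv"
missing_columns "$example_csv" "$custom_csv"   # prints col_b
```

An empty result means every column name from the example table is present in the custom table; it does not check column order or cell contents.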

gSpread Module: Main Analysis and Downstream Processing

The gspread module is the final step in the gSpreadComp pipeline, integrating the previous modules' outputs to comprehensively analyze gene spread, potential plasmid-mediated horizontal gene transfer, and resistance-virulence ranking.

Usage:

To view the available options for the gspread module, use the following command:

gspreadcomp gspread --help

This will display the available parameters and their descriptions.

Running the Module:

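If you've followed our tutorial steps sequentially, you should have the required inputs ready for the gspread module: the formatted output of each previous module plus the metadata table described in the Prerequisites (passed with --meta, here ./02_metadata_gspread_sample.csv). Here's how to execute the module with the processed outputs: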

gspreadcomp gspread \
    --gtdbtk ./03_gspread_gtdb_taxonomy/gtdb_df_format_gSpread.csv \
    --checkm ./04_gspread_checkm_quality/checkm_df_format_gSpread.csv \
    --gene ./05_gspread_deeparg_args/deeparg_df_format_gSpread.csv \
    --meta ./02_metadata_gspread_sample.csv \
    --plasmid ./06_gspread_plasmids/plasflow_combined_format_gSpread.csv \
    --vf ./07_gspread_pathogens/victors_format_gSpread.csv \
    -t 25 \
    -o ./08_gspread_results/ \
    --target_gene_col Gene_id
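
Because the command takes six input tables, a short pre-flight check can catch a mistyped path before the run starts. The function below is a hypothetical sketch, not a gSpreadComp feature; it only tests that each file exists and is non-empty, using the paths from the command above.

```shell
# Hypothetical pre-flight check: confirm each expected input file exists and is non-empty.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths as used in this tutorial; run from inside test_gspread_run
check_inputs \
    ./03_gspread_gtdb_taxonomy/gtdb_df_format_gSpread.csv \
    ./04_gspread_checkm_quality/checkm_df_format_gSpread.csv \
    ./05_gspread_deeparg_args/deeparg_df_format_gSpread.csv \
    ./02_metadata_gspread_sample.csv \
    ./06_gspread_plasmids/plasflow_combined_format_gSpread.csv \
    ./07_gspread_pathogens/victors_format_gSpread.csv \
    && echo "all inputs found" \
    || echo "fix the paths listed above before running gspread"
```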

Inspecting the Output of the gSpread Module

After running the command, the gspread module will produce several output files in the specified output directory (./08_gspread_results/ in our example):

08_gspread_results/
├── common_tax_target.csv
├── gSpread_report.html
├── gene_pairwise_comp_results
├── gene_spread_results
├── genome_quality_norm
├── hgt_events_results
├── mags_complete_annotation.csv
├── mags_summary_results.csv
├── network_vis_files
└── pathogens_results

Among these, gSpread_report.html is the HTML report of the analysis, while the other files and directories contain detailed results from the various analyses performed by the module.

Several of the generated files are used in the report. However, a detailed inspection of all the outputs may also help the user.

The main files are the mags_complete_annotation.csv and the mags_summary_results.csv, which contain a result compilation and a ranking based on the resistance-virulence metric for every given genome.
