Welcome to the gSpreadComp usage tutorial! This guide is designed to help you understand how to use the tool effectively through practical examples and step-by-step instructions.
You can use genomes of your own, but if you need some genomes for testing, you can use the one here.
gSpreadComp is designed to work with genomes in fasta format and requires a metadata table in CSV format containing a target variable. For example tables, please refer to example_metadata_table_link and example_genome_fasta_link. To recover prokaryotic genomes from metagenomic samples, you can use tools like MuDoGeR.
Before proceeding with the tutorial, make sure you have:
- Genomes in fasta format.
- A metadata table in CSV format with a target variable. Your metadata table must be formatted correctly, i.e., it must include a column named "Library" to identify the source sample, a column named "Genome" to identify the genome, and a column named "Target" to identify your target variable. Refer to the example here.
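Before running any module, it can help to confirm that the metadata header carries the three required columns. The following is a minimal sketch, not part of gSpreadComp: the function name `missing_metadata_columns` and the sample rows are hypothetical, and only the column names "Library", "Genome", and "Target" come from the requirements above.

```python
import csv
import io

# Column names required by the tutorial's metadata description.
REQUIRED_COLUMNS = ("Library", "Genome", "Target")


def missing_metadata_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical two-row table; the sample, genome, and group names are made up.
example = (
    "Library,Genome,Target\n"
    "sample_A,genome_name_1,group_1\n"
    "sample_B,genome_name_2,group_2\n"
)
print(missing_metadata_columns(example))            # []
print(missing_metadata_columns("Genome,Target\n"))  # ['Library']
```

To check your own table, pass the file content instead of the example string, for instance `missing_metadata_columns(open("my_metadata.csv").read())`, where `my_metadata.csv` stands for your file.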
In this tutorial, we will go through the following modules and steps of gSpreadComp:
- Taxonomy Assignment
- Genome Quality Estimation
- ARG Annotation
- Plasmid Identification
- Virulence Factors (VFs) Annotation
- Downstream Analysis
We will inspect the main outputs of each module to ensure a comprehensive understanding of the tool's functionalities and results.
Let's get started!
The taxonomy module in gSpreadComp uses GTDB-Tk for taxonomy assignment. To run this module, use the gspreadcomp taxonomy command. Below are the available options for this module:
gspreadcomp taxonomy --help
Usage: gspreadcomp taxonomy [options] --genome_dir genome_folder -o output_dir
Options:
--genome_dir STR folder with the bins to be classified (in fasta format)
--extension STR fasta file extension (e.g. fa or fasta) [default: fa]
-o STR output directory
-t INT number of threads
- Create a folder for the test run, for example, test_gspread_run.
- Place your genomes in a subfolder within the test run folder, for example, 01_input_genomes.
- Create an output folder within the test run folder, for example, 03_gspread_gtdb_taxonomy.
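The folder setup can also be scripted. This is a minimal sketch that uses the example folder names from the steps above; the helper names `create_run_layout` and `list_genomes` are hypothetical and are not gSpreadComp commands.

```python
from pathlib import Path


def create_run_layout(base_dir):
    """Create the test run folder with its input and taxonomy output subfolders."""
    run_dir = Path(base_dir) / "test_gspread_run"
    genomes_dir = run_dir / "01_input_genomes"
    taxonomy_dir = run_dir / "03_gspread_gtdb_taxonomy"
    for folder in (genomes_dir, taxonomy_dir):
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return run_dir


def list_genomes(genomes_dir, extension="fa"):
    """Return the sorted names of files ending with the given fasta extension."""
    return sorted(p.name for p in Path(genomes_dir).glob(f"*.{extension}"))


if __name__ == "__main__":
    run_dir = create_run_layout(".")
    # An empty list means no genomes with this extension were copied in yet.
    print(list_genomes(run_dir / "01_input_genomes", extension="fa"))
```

Listing the files before launching a module is a quick way to confirm that the value passed to --extension matches the files actually present.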
Assuming you have placed your genomes in 01_input_genomes, your fasta files have the fa extension, and your output folder is 03_gspread_gtdb_taxonomy, your command will look like:
$ gspreadcomp taxonomy --genome_dir ./01_input_genomes/ --extension fa -o ./03_gspread_gtdb_taxonomy/ -t 25
After running the Taxonomy Module, you will find the output in the specified output directory, structured as follows:
03_gspread_gtdb_taxonomy/
├── align
├── classify
├── gtdb_df_format_gSpread.csv
├── gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
├── gtdbtk.log
├── gtdbtk_result.tsv
├── gtdbtk.warnings.log
└── identify
The user can find an example of the expected gtdb_df_format_gSpread.csv here.
The gtdb_df_format_gSpread.csv file contains taxonomy information for each genome in a format that is ready for integration into subsequent gSpreadComp modules. This file is crucial for the downstream analysis and should be retained.
The gtdbtk_result.tsv file consolidates the results from GTDB-Tk, providing comprehensive information on taxonomy assignments in the GTDB-Tk format.
For a detailed description of the other files, the user can go to the GTDB-Tk page.
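GTDB-Tk reports each lineage as a single semicolon-separated string with rank prefixes (d__, p__, c__, o__, f__, g__, s__). The sketch below splits such a string into named ranks. It makes no assumption about the column layout of gtdb_df_format_gSpread.csv; the function name `split_gtdb_classification` is hypothetical and the lineage is only an illustrative value.

```python
# Rank prefixes used in GTDB lineage strings.
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def split_gtdb_classification(classification):
    """Split a GTDB lineage string into a {rank: name} dictionary."""
    ranks = {}
    for field in classification.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_PREFIXES:
            # An unassigned rank such as "s__" is kept as an empty string.
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


# Illustrative lineage string, not taken from the tutorial data.
lineage = ("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
           "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
           "s__Escherichia coli")
ranks = split_gtdb_classification(lineage)
print(ranks["genus"], "|", ranks["species"])  # Escherichia | Escherichia coli
```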
The quality module in gSpreadComp uses CheckM to estimate the quality of prokaryotic genomes. To run this module, use the gspreadcomp quality command. Below are the available options for this module:
gspreadcomp quality --help
Usage: gspreadcomp quality [options] --genome_dir genome_folder -o output_dir
Options:
--genome_dir STR folder with the genomes to estimate quality (in fasta format)
--extension STR fasta file extension (e.g. fa or fasta) [default: fa]
-o STR output directory
-t INT number of threads [default: 1]
- Ensure you are in the test_gspread_run folder created in the previous step.
- Place your genomes in the 01_input_genomes subfolder within the test run folder.
- Create an output folder within the test run folder, for example, 04_gspread_checkm_quality.
Assuming you have placed your genomes in 01_input_genomes and your output folder is 04_gspread_checkm_quality, your command will look like this:
$ gspreadcomp quality --genome_dir ./01_input_genomes/ --extension fa -o ./04_gspread_checkm_quality/ -t 25
Run this command, and once it's completed, you can proceed to inspect the output in the 04_gspread_checkm_quality folder.
After running the Quality Module, you will find the output in the specified output directory, structured as follows:
04_gspread_checkm_quality/
├── bins
├── checkm_df_format_gSpread.csv
├── checkm.log
├── lineage.ms
├── outputcheckm.tsv
└── storage
The user can find an example of the expected checkm_df_format_gSpread.csv here.
The checkm_df_format_gSpread.csv file contains quality information for each genome in a format that is ready for integration into subsequent gSpreadComp modules. This file is crucial for the downstream analysis and should be retained.
The outputcheckm.tsv is the main output file from CheckM itself, consolidating the results and providing comprehensive information on genome quality.
For a detailed description of the other files, the user can go to the CheckM page.
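A common follow-up is to list the genomes that pass a completeness and contamination cutoff. The sketch below assumes that outputcheckm.tsv is a tab-separated CheckM table with the "Bin Id", "Completeness", and "Contamination" columns that CheckM's tabular output uses; verify this against your own file. The function name `passing_genomes` is hypothetical, and the default thresholds (at least 50% completeness, below 10% contamination) are illustrative values, not gSpreadComp settings.

```python
import csv
import io


def passing_genomes(tsv_text, min_completeness=50.0, max_contamination=10.0):
    """Return the Bin Id of every row that meets both quality thresholds."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness >= min_completeness and contamination < max_contamination:
            kept.append(row["Bin Id"])
    return kept


# Made-up values for two hypothetical genomes, reduced to three columns.
example = (
    "Bin Id\tCompleteness\tContamination\n"
    "genome_name_1\t98.5\t1.2\n"
    "genome_name_2\t42.0\t0.5\n"
)
print(passing_genomes(example))  # ['genome_name_1']
```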
The ARGs module in gSpreadComp uses DeepARG to predict the Antimicrobial Resistance Genes (ARGs) in a genome. To run this module, use the gspreadcomp args command. Below are the available options for this module:
gspreadcomp args --help
Usage: gspreadcomp args [options] --genome_dir genome_folder -o output_dir
Options:
--genome_dir STR folder with the genomes to be classified (in fasta format)
--extension STR fasta file extension (e.g. fa or fasta) [default: fa]
--min_prob NUM Minimum probability cutoff for DeepARG [Default: 0.8]
--arg_alignment_identity NUM Identity cutoff for sequence alignment for DeepARG [Default: 35]
--arg_alignment_evalue NUM Evalue cutoff for DeepARG [Default: 1e-10]
--arg_alignment_overlap NUM Alignment read overlap for DeepARG [Default: 0.8]
--arg_num_alignments_per_entry NUM Diamond, minimum number of alignments per entry [Default: 1000]
-o STR output directory
- Ensure you are in the test_gspread_run folder created in the previous steps.
- Place your genomes in the 01_input_genomes subfolder within the test run folder.
- Create an output folder within the test run folder, for example, 05_gspread_deeparg_args.
Assuming you have placed your genomes in 01_input_genomes and your output folder is 05_gspread_deeparg_args, your command will look like this:
$ gspreadcomp args --genome_dir ./01_input_genomes/ --extension fa -o ./05_gspread_deeparg_args/
Run this command, and once it's completed, you can proceed to inspect the output in the 05_gspread_deeparg_args folder.
After running the ARGs Module, you will find the output in the specified output directory, structured as follows:
05_gspread_deeparg_args/
├── deeparg_df_format_gSpread.csv
├── deeparg_df_combined_raw.csv
├── genome_name_1.fa
│ ├── genome_name_1.fa_deeparg_out.align.daa
│ ├── genome_name_1.fa_deeparg_out.align.daa.tsv
│ ├── genome_name_1.fa_deeparg_out.mapping.ARG
│ └── genome_name_1.fa_deeparg_out.mapping.potential.ARG
├── genome_name_2.fa
│ ├── genome_name_2.fa_deeparg_out.align.daa
│ ├── genome_name_2.fa_deeparg_out.align.daa.tsv
│ ├── genome_name_2.fa_deeparg_out.mapping.ARG
│ └── genome_name_2.fa_deeparg_out.mapping.potential.ARG
└── genomes_with_no_found_deeparg.csv
The user can find an example of the expected deeparg_df_format_gSpread.csv here.
- deeparg_df_format_gSpread.csv: This is the format-ready main output file, containing formatted ARGs annotation information per genome. It's ready for integration into subsequent gSpreadComp modules.
- deeparg_df_combined_raw.csv: This file combines the raw output from DeepARG for all genomes analyzed.
- genome_name.fa Directories: For each genome analyzed, a separate directory is created, named after the genome. To get a detailed description of its content, the user can read the DeepARG documentation.
- genomes_with_no_found_deeparg.csv: This file lists the genomes for which no ARGs were found by DeepARG.
The Plasmid module in gSpreadComp uses PlasFlow to predict if a sequence within a fasta file is a chromosome, plasmid, or undetermined. To run this module, use the gspreadcomp plasmid command. Below are the available options for this module:
gspreadcomp plasmid --help
Usage: gspreadcomp plasmid [options] --genome_dir genome_folder -o output_dir
Options:
--genome_dir STR folder with the genomes to be classified (in fasta format)
--extension STR fasta file extension (e.g. fa or fasta) [default: fa]
--threshold NUM threshold for probability filtering [default: 0.7]
-o STR output directory
- Ensure you are in the test_gspread_run folder created in the previous steps.
- Place your genomes in the 01_input_genomes subfolder within the test run folder.
- Create an output folder within the test run folder, for example, 06_gspread_plasmids.
Assuming you have placed your genomes in 01_input_genomes and your output folder is 06_gspread_plasmids, your command will look like this:
$ gspreadcomp plasmid --genome_dir ./01_input_genomes/ --extension fa -o ./06_gspread_plasmids/
Run this command, and once it's completed, you can proceed to inspect the output in the 06_gspread_plasmids folder. Below is the expected output structure and explanation of each output file.
After running the Plasmid Module, you will find the output in the specified output directory, structured as follows:
06_gspread_plasmids/
├── genome_name_1.fa
│ ├── genome_name_1.fa_plasflow_out.tsv
│ ├── genome_name_1.fa_plasflow_out.tsv_chromosomes.fasta
│ ├── genome_name_1.fa_plasflow_out.tsv_plasmids.fasta
│ └── genome_name_1.fa_plasflow_out.tsv_unclassified.fasta
├── genome_name_2.fa
│ ├── genome_name_2.fa_plasflow_out.tsv
│ ├── genome_name_2.fa_plasflow_out.tsv_chromosomes.fasta
│ ├── genome_name_2.fa_plasflow_out.tsv_plasmids.fasta
│ └── genome_name_2.fa_plasflow_out.tsv_unclassified.fasta
├── genomes_with_no_found_plasflow.csv
└── plasflow_combined_format_gSpread.csv
The user can find an example of the expected plasflow_combined_format_gSpread.csv here.
- genome_name.fa Directories: For each genome analyzed, a separate directory is created, named after the genome. It contains the following files:
  - genome_name.fa_plasflow_out.tsv: This file contains the PlasFlow results in tab-separated values format.
  - genome_name.fa_plasflow_out.tsv_chromosomes.fasta: This file contains sequences predicted to be chromosomes.
  - genome_name.fa_plasflow_out.tsv_plasmids.fasta: This file contains sequences predicted to be plasmids.
  - genome_name.fa_plasflow_out.tsv_unclassified.fasta: This file contains sequences that could not be classified as either plasmids or chromosomes.
  For a detailed description of the files, the user can read the PlasFlow documentation.
- genomes_with_no_found_plasflow.csv: This file lists the genomes for which no sequences were found by PlasFlow. Hopefully, it will be empty.
- plasflow_combined_format_gSpread.csv: This is the format-ready main output file containing formatted PlasFlow results per genome. It's ready for integration into subsequent gSpreadComp modules.
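For a quick look at how PlasFlow split one genome, you can count the sequences in the three per-class fasta files. This is a minimal sketch that relies only on the file naming shown in the output tree above; the function names `count_fasta_records` and `summarize_plasflow_dir` are hypothetical.

```python
from pathlib import Path

# Class suffixes of the per-genome fasta files shown in the output tree.
SUFFIXES = ("chromosomes", "plasmids", "unclassified")


def count_fasta_records(fasta_path):
    """Count sequences in a fasta file by counting header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_plasflow_dir(genome_dir):
    """Return {class: sequence count} for one per-genome PlasFlow directory."""
    genome_dir = Path(genome_dir)
    counts = {}
    for suffix in SUFFIXES:
        # Expected name: <directory name>_plasflow_out.tsv_<class>.fasta
        fasta = genome_dir / f"{genome_dir.name}_plasflow_out.tsv_{suffix}.fasta"
        # A missing file is reported as zero sequences for that class.
        counts[suffix] = count_fasta_records(fasta) if fasta.exists() else 0
    return counts
```

For example, `summarize_plasflow_dir("./06_gspread_plasmids/genome_name_1.fa")` returns a dictionary with one count per class for that genome.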
The Pathogens module in gSpreadComp aligns the provided genomes against selected Virulence Factors databases and formats the output.
The pathogens module essentially uses BLAST to align your genomes with defined Virulence Factors databases. Here, the user can find the Victors database and the VFDB database.
To run this module, use the gspreadcomp pathogens command. Below are the available options for this module:
gspreadcomp pathogens --help
Usage: gspreadcomp pathogens [options] --genome_dir genome_folder -o output_dir
Options:
--genome_dir STR folder with the genomes to be aligned against Virulence factors (in fasta format)
--extension STR fasta file extension (e.g. fa or fasta) [default: fa]
--evalue NUM evalue, expect value, threshold as defined by NCBI-BLAST [default: 1e-50]
--vf STR select the virulence factors database to be used (e.g. victors, vfdb or both) [default: both]
-t INT number of threads
-o STR output directory
- Ensure you are in the test_gspread_run folder created in the previous steps.
- Place your genomes in the 01_input_genomes subfolder within the test run folder.
- Create an output folder within the test run folder, for example, 07_gspread_pathogens.
Assuming you have placed your genomes in 01_input_genomes and your output folder is 07_gspread_pathogens, your command will look like this:
$ gspreadcomp pathogens --genome_dir ./01_input_genomes/ --extension fa -o ./07_gspread_pathogens/ --vf both -t 25
Run this command, and once it's completed, you can proceed to inspect the output in the 07_gspread_pathogens folder.
After running the Pathogens Module, you will find the output in the specified output directory. Below is the expected output structure and explanation of each output file.
07_gspread_pathogens/
├── genome_name_1.fa
│ ├── vfdb_genome_name_1.out
│ └── victors_genome_name_1.out
├── genome_name_2.fa
│ ├── vfdb_genome_name_2.out
│ └── victors_genome_name_2.out
├── vfdb_format_gSpread.csv
├── vfdb_headers.txt
├── vfdb_merged.out
├── vfdb_per_genome_unique_count.csv
├── victors_format_gSpread.csv
├── victors_db_headers.txt
├── victors_merged.out
└── victors_per_genome_unique_count.csv
The user can find an example of the expected victors_format_gSpread.csv here. An equivalent output is generated if the user uses the VFDB instead of Victors database.
- vfdb_format_gSpread.csv & victors_format_gSpread.csv: These are the format-ready main output files containing formatted virulence factors results per genome. They're ready for integration into subsequent gSpreadComp modules.
- vfdb_headers.txt & victors_db_headers.txt: These files contain the headers for the VFDB and Victors databases respectively.
- vfdb_merged.out & victors_merged.out: These files contain the merged results of the alignments against the VFDB and Victors databases, respectively.
- vfdb_per_genome_unique_count.csv & victors_per_genome_unique_count.csv: These files contain the count of unique virulence factors per genome for the VFDB and Victors databases respectively.
The vfdb_format_gSpread.csv and victors_format_gSpread.csv files contain virulence factors results for each genome in a format that is ready for integration into subsequent gSpreadComp modules. These files are crucial for downstream analysis and should be retained.
Tip
If the user wants to generate custom quality, taxonomy, gene annotation, plasmid identification, or Virulence Factors annotation files outside gSpreadComp, it is important to maintain the table formatting.
All the examples of input tables used by the gSpread module are here.
Most importantly, keep the column names exactly as they appear in the example tables.
The gspread module is the final step in the gSpreadComp pipeline, integrating the previous modules' outputs to comprehensively analyze gene spread, potential plasmid-mediated horizontal gene transfer, and resistance-virulence ranking.
To view the available options for the gspread module, use the following command:
gspreadcomp gspread --help
This will display the available parameters and their descriptions.
If you've followed our tutorial steps sequentially, you should have the required inputs ready for the gspread module. Here's how to execute the module with the processed outputs:
gspreadcomp gspread --gtdbtk ./03_gspread_gtdb_taxonomy/gtdb_df_format_gSpread.csv --checkm ./04_gspread_checkm_quality/checkm_df_format_gSpread.csv --gene ./05_gspread_deeparg_args/deeparg_df_format_gSpread.csv --meta ./02_metadata_gspread_sample.csv --plasmid ./06_gspread_plasmids/plasflow_combined_format_gSpread.csv --vf ./07_gspread_pathogens/victors_format_gSpread.csv -t 25 -o ./08_gspread_results/ --target_gene_col Gene_id
After running the command, the gspread module will produce several output files in the specified output directory (./08_gspread_results/ in our example):
08_gspread_results/
├── common_tax_target.csv
├── gSpread_report.html
├── gene_pairwise_comp_results
├── gene_spread_results
├── genome_quality_norm
├── hgt_events_results
├── mags_complete_annotation.csv
├── mags_summary_results.csv
├── network_vis_files
└── pathogens_results
Among these, gSpread_report.html is the report file, while the other files and directories contain detailed results from the various analyses performed by the module.
Several of the generated files are used in the report. However, a detailed inspection of all the outputs may also help the user.
The main files are the mags_complete_annotation.csv and the mags_summary_results.csv, which contain a result compilation and a ranking based on the resistance-virulence metric for every given genome.
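Because the gspread command takes six input tables, it can be worth checking that all of them exist before launching it. The sketch below assembles the same command from the tutorial's example paths and reports missing inputs. It only builds the argument list and does not run gSpreadComp; the helper name `build_gspread_command` is hypothetical.

```python
from pathlib import Path

# Flag -> input table, copied from the tutorial's example gspread command.
INPUTS = {
    "--gtdbtk": "03_gspread_gtdb_taxonomy/gtdb_df_format_gSpread.csv",
    "--checkm": "04_gspread_checkm_quality/checkm_df_format_gSpread.csv",
    "--gene": "05_gspread_deeparg_args/deeparg_df_format_gSpread.csv",
    "--meta": "02_metadata_gspread_sample.csv",
    "--plasmid": "06_gspread_plasmids/plasflow_combined_format_gSpread.csv",
    "--vf": "07_gspread_pathogens/victors_format_gSpread.csv",
}


def build_gspread_command(run_dir=".", threads=25):
    """Return (argv, missing): the command as a list and any absent input files."""
    run_dir = Path(run_dir)
    argv = ["gspreadcomp", "gspread"]
    missing = []
    for flag, relative in INPUTS.items():
        path = run_dir / relative
        if not path.is_file():
            missing.append(str(path))
        argv += [flag, str(path)]
    argv += ["-t", str(threads), "-o", str(run_dir / "08_gspread_results"),
             "--target_gene_col", "Gene_id"]
    return argv, missing


if __name__ == "__main__":
    argv, missing = build_gspread_command(".")
    if missing:
        print("Missing inputs:", *missing, sep="\n  ")
    else:
        # To execute, pass the list to subprocess.run(argv, check=True).
        print(" ".join(argv))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if any folder name contains spaces.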