Skip to content

Tutorial

Victoria Offord edited this page Jun 22, 2023 · 5 revisions

Introduction

pyQUEST is a Python package which was designed for the quantification of Multiplexed Assays of Variant Effects (MAVEs), in particular Saturation Genome Editing or SGE.

pyQUEST has three primary functions:

  • To generate library-independent counts which represent the abundance of each unique read sequence in a user-defined FASTQ or CRAM
  • To generate library-dependent counts which represent the abundance of each template sequence in a user-defined library
  • To generate *summary statistics (these will vary depending on whether library-dependent counts are required)

This tutorial aims to walk the user through quantification with pyQUEST using data from a published SGE experiment.

Note: this tutorial was tested using Python version 3.11.2.


Tutorial overview

  1. Installation of pyQUEST and dependencies
  2. Overview of the tutorial dataset
  3. Preparing the library for pyQUEST
  4. Preparing the sequencing data for pyQUEST
  5. Using pyQUEST to generate library-independent counts
  6. Using pyQUEST to generate library-dependent counts

Installation of pyQUEST and dependencies

Installing pyQUEST

First, we clone the repository and change into the directory:

git clone https://github.com/cancerit/pyQUEST.git
cd pyQUEST

Next, we create a Python virtual environment and activate it (see here for more information):

python3 -m venv .venv
source .venv/bin/activate

Finally, we install pyQUEST's dependencies and then pyQUEST itself:

pip3 install -r requirements.txt
pip3 install .

Installing cutadapt

pyQUEST uses exact string matching such that for a read to be called as a match to an oligo library sequence, both sequences must be the same strand orientation, the same length and the same sequence content. For the tutorial dataset, this means we need to trim the read sequences to make them correspond. For read trimming, we are using cutadapt version 4.4 - but you could use a different tool for this if you wanted to - there's many out there!

To install cutadapt:

pip3 install cutadapt

Check everything is installed

To check everything has installed properly, check that you can see the tool versions:

pyquest --version
cutadapt --version

Getting help

For pyQUEST usage:

pyquest --help

For cutadapt usage:

cutadapt --help

Overview of the tutorial dataset

For the purpose of this tutorial, we will use data from the following preprint which describes a saturation genome editing (SGE) experiment on DDX3X, a gene related to neurodevelopmental disorders and a somatically-mutated cancer driver gene.

Radford E, Tan H, Andersson M, et al.
Saturation genome editing of DDX3X clarifies pathogenicity of germline and somatic variation.
medRxiv 2022
[DOI: 10.1101/2022.06.10.22276179](DOI: 10.1101/2022.06.10.22276179)

We won't be using the full dataset, instead taking a single plasmid library (DDX3XE2sg1) which should have roughly even coverage across all of the sequences and a set of corresponding reads from the ENA.

Please note, the counts generated here will differ from those in the publication as different tools and versions were used.


Preparing the library for pyQUEST

The pyQUEST library has a set format which is detailed in the pyQUEST README.md. The library used in this tutorial comes from Supplementary Table 1 of the publication and requires reformatting before it can be used with pyQUEST. The library in Supplementary Table 1 is a FASTA-formatted list of template sequences for all of the DDX3X exons. For this tutorial, we only need those for Exon 2 (E2) guide 1 (sg1).

In order to subset the library and reformat it for pyQUEST, we need to:

  1. Downloads Supplementary Table 1 (FASTA library sequences)
  2. Extracts only the FASTA sequences for DDX3X exon 2 guide 1 (DDX3XE2sg1)
  3. Reformats each sequence ready for pyQUEST

We can use the following command to do this:

curl 'https://europepmc.org/api/fulltextRepo?pprId=PPR506119&type=FILE&fileName=EMS146001-supplement-Supp_File1.txt&mimeType=text/plain' | grep -A 1 DDX3XE2sg1 | awk 'BEGIN{RS=">"}NR > 1{print $1"\t"$1"\t"$2;}' > DDX3XE2sg1_pyquest.tsv

The final file DDX3XE2sg1_pyquest.tsv has no header and three columns for each template: id, name and sequence. The id must be unique. The id and name may be the same as long as the name is unique (otherwise a unique id per line is required - this can be as simple as 1, 2, 3, 4.....).

However, these sequences contain 5' and 3' sequences (such as primers) which need removing in order for our exact matching to the read sequences. We don't match the sequences with the primers in place as there may be sequencing errors in the primer sequences which would mean a potential match (i.e. perfect sequence without primer) may be missed. This can be particularly important in screens where we are expecting templates to drop out and have low counts.

The primer sequences were retrieved from Supplementary Table 4:

DDX3X E2 ILL1 F ACACTCTTTCCCTACACGACGCTCTTCCGATCT TGAATGCAGTACTAGCAAATG DDX3X E2 ILL1 R TCGGCATTCCTGCTGAACCGCTCTTCCGATCT AATTGAAAGGCAAATTTCTTAAATTTTAC

And so, the relevant sequences to remove from the library sequences are:

  • TGAATGCAGTACTAGCAAATG
  • GTAAAATTTAAGAAATTTGCCTTTCAATT (e.g. the reverse complement of DDX3X E2 ILL1 R as the primers are provided in 5' -> 3')

Below is a basic representation of what we are doing:

DDX3X E2 ILL1 F         TGAATGCAGTACTAGCAAATG
TEMPLATE SEQUENCE ...CGATGAATGCAGTACTAGCAAATG CTAGAGGGCAG.....CATTGGGCTTAT GTAAAATTTAAGAAATTTGCCTTTCAATTTCGT...
DDX3X E2 ILL1 R                                                            GTAAAATTTAAGAAATTTGCCTTTCAATT

To remove the primer sequences:

awk -F"\t" '{sub(/.*TGAATGCAGTACTAGCAAATG/,"", $3); sub(/GTAAAATTTAAGAAATTTGCCTTTCAATT.*/,"", $3); print $1"\t"$2"\t"$3}' DDX3XE2sg1_pyquest.tsv > DDX3XE2sg1_pyquest.trimmed.tsv

Thus, the library we will be comparing to our reads is DDX3XE2sg1_pyquest.trimmed.tsv.

So we we have gone from the untrimmed library sequence (e.g. for 3intron_Change_position_41337466_to_A) in DDX3XE2sg1_pyquest.tsv:

AATGATACGGCGACCACCGATGAATGCAGTACTAGCAAATGCTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAATAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTATGTAAAATTTAAGAAATTTGCCTTTCAATTTCGTATGCCGTCTTCTGCTTG

To the trimmed sequence without the primers (e.g. for 3intron_Change_position_41337466_to_A) in DDX3XE2sg1_pyquest.t5rimmed.tsv:

CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAATAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT

Preparing the sequencing data for pyQUEST

We have our library of expected sequences. Now we need the sequencing data to compare it to.

Downloading the sequence data from ENA

We're going to download the sequencing data in FASTQ format for a single replicate from the plasmid sequencing of DDX3X exon 2.

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR974/ERR9741180/Plasmid_sg1_Rep1_E2_R1.fq.gz

Removing primers and adapters from the read sequences (trimming)

The read sequences require trimming to remove any adapter and primer sequences. This is because pyQUEST requires both the library sequence and the read sequence to be identical in content, orientation and length for it to be match.

For this, we are using cutadapt version 4.4, specifying the primer sequences (-a TGAATGCAGTACTAGCAAATG...GTAAAATTTAAGAAATTTGCCTTTCAATT), our input (raw) reads (Plasmid_sg1_Rep1_E2_R1.fq.gz), our output (trimmed) reads (Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz) and that we want a minimal report of trimming on the command line (--report=minimal).

cutadapt -a TGAATGCAGTACTAGCAAATG...GTAAAATTTAAGAAATTTGCCTTTCAATT -o  Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz --report=minimal Plasmid_sg1_Rep1_E2_R1.fq.gz | column -t

Report shows we read in 271,080 reads of which all matched our sequences for trimming.

Done           00:00:05       271,080 reads @  19.3 µs/read;   3.12 M reads/minute
status  in_reads  in_bp     too_short  too_long  too_many_n  out_reads  w/adapters  qualtrim_bp  out_bp
OK      271080    71836200  0          0         0           271080     271080      0            54116995

Example read before trimming:

@M00523:1015:000000000-CWY3R:1:1101:10008:5613 1:N:0:CCTTGCGACAG
TGAATGCAGTACTAGCAAATGCTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGTTGGTCAGGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAGTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTATGTAAAATTTAAGAAATTTGCCTTTCAATTAGATCGGAAGAGCG
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGGGGGGGGGGGGGGGGFFGGGGFGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGF@FFGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGFEGGGGGGGGFGFGGGGGGGGGFAEGGGGGGGGGGGGGGFGGFGFFGFAEEFGGGGEGCDD

Example read after trimming:

@M00523:1015:000000000-CWY3R:1:1101:10008:5613 1:N:0:CCTTGCGACAG
CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGTTGGTCAGGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAGTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT
+
GGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGGGGGGGGGGGGGGGGFFGGGGFGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGF@FFGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGFEGGGGGGGGFGFGGGGG

Using pyQUEST to generate library-independent counts

There are two types of counts generated by pyQUEST:

  • library-independent counts - the abundance of each unique read sequence
  • library-dependent counts - the abundance of each template sequence in a user-defined library

Library-independent counts can be generated by pyQUEST without providing a library:

pyquest -o LibIndep -s LibIndep Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz

Notes:

  • An output file prefix is always required (-o or --output).
  • When providing a FASTQ file, it is also necessary to provide a sample name (-s or --sample) - this is not required when using a CRAM file as input.

Quantification should be very quick on this dataset (<30 second) and produces two files:

  • LibIndep.query_counts.tsv.gz - abundance of each unique read sequence
  • LibIndep.stats.json - basic sequencing statistics and information about the run time environment.

To look at the read abundances ({output}.query_counts.tsv.gz):

zcat LibIndep.query_counts.tsv.gz | head -5

Here, you can see header lines starting with ## which show the command and version of pyQUEST used to generate the file. Beneath are the column headers in a line starting with #.

Read sequence abundances are found in four columns:

  • SEQUENCE the unique sequence found in the input reads
  • LENGTH the length of the unique sequence found in the input reads
  • COUNT the number of times the sequence was found in the input read file
##Command: pyquest -o LibIndep -s LibIndep Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz
##Version: 1.0.0
#SEQUENCE       LENGTH  COUNT
AACTCTCCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAGTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT    108     1
AATCAGAGTGGAGGAAGTACAGCCAGCAGTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT        96      1

Finally, we look at the statistics which are in a JSON formatted file ({output}.stats.json):

python -m json.tool LibIndep.stats.json

Which gives:

{
    "version": "1.0.0",
    "command": "pyquest -o LibIndep -s LibIndep Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz",
    "sample_name": "LibIndep",
    "input_reads": 271080,
    "total_reads": 270954,
    "discarded_reads": 126,
    "vendor_failed_reads": 0,
    "length_excluded_reads": 0,
    "ambiguous_nt_reads": 126,
    "masked_reads": 0
}

These are basic sequencing statistics such as the number of reads in the input file (input_reads), the number of reads processed once discarded_reads (e.g. vendor_failed_reads, length_excluded_reads, ambiguous_nt_reads and masked_reads) are removed. See README.md for a more detailed description of each statistic.

Read length filter

Sometimes, after trimming there might be short reads you want to exclude from the analysis. This can be done by using the --min-length option. Here, we are asking to quantify all reads which are less than 30 nt long:

pyquest -o LibIndep30 -s LibIndep30 --min-length 30 Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz

If we look at the statistics:

python -m json.tool LibIndep30.stats.json

We can now see that 132 reads were discarded, 126 which contained ambiguous characters (e.g. Ns) and 6 reads which were shorted than 30 nt.

{
    "version": "1.0.0",
    "command": "pyquest -o LibIndep30 -s LibIndep30 --min-length 30 Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz",
    "sample_name": "LibIndep30",
    "input_reads": 271080,
    "total_reads": 270948,
    "discarded_reads": 132,
    "vendor_failed_reads": 0,
    "length_excluded_reads": 6,
    "ambiguous_nt_reads": 126,
    "masked_reads": 0
}

Using pyQUEST to generate library-dependent counts

In pyQUEST, the library-dependent counts represent the frequency of each sequence in a library which is provided by the user. The number of times a library sequence is found is determined by the library sequence to the sequences in the library-dependent counts. If found, the library-dependent count for a given sequence is the same as for that sequence in the library-independent counts.

Here, we provide the trimmed DDX3X exon 2 guide 1 library we prepared earlier to pyQUEST:

pyquest -o LibDep -s LibDep -l DDX3XE2sg1_pyquest.trimmed.tsv Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz

This time, pyQUEST takes a little longer to run (but still less than 10-15 seconds) and produces three files:

  • LibDep.query_counts.tsv.gz - abundance of each unique read sequence
  • LibDep.lib_counts.tsv.gz - abundance of each library sequence
  • LibDep.stats.json - basic sequencing statistics, library-specific statistics and information about the run time environment.

To look at the library sequences abundances ({output}.lib_counts.tsv.gz):

zcat LibDep.lib_counts.tsv.gz | head -7

Here, you can see header lines starting with ## which show the command and version of pyQUEST used to generate the file. Beneath are the column headers in a line starting with #.

Library sequence abundances are found in four columns:

  • ID the unique identifier for the library sequence
  • NAME the name of the of the library sequence
  • SEQUENCE the library sequence
  • LENGTH the length of the oligo template sequence
  • COUNT the number of times the library sequence was found in the input read file (derived from the library-independent counts)
  • UNIQUE whether the sequence is unique to the library (1 = unique, 0 = not unique)
  • SAMPLE the name of the sample
##Command: pyquest -o LibDep -s LibDep -l DDX3XE2sg1_pyquest.trimmed.tsv Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz
##Version: 1.0.0
#ID     NAME    SEQUENCE        LENGTH  COUNT   UNIQUE  SAMPLE
DDX3XE2sg1.3intron_Change_position_41337466_to_A        DDX3XE2sg1.3intron_Change_position_41337466_to_A        CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAATAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT     201     246  1LibDep
DDX3XE2sg1.3intron_Change_position_41337466_to_T        DDX3XE2sg1.3intron_Change_position_41337466_to_T        CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCATTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT     201     618  1LibDep
DDX3XE2sg1.3intron_Change_position_41337466_to_C        DDX3XE2sg1.3intron_Change_position_41337466_to_C        CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCACTAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT     201     419  1LibDep
DDX3XE2sg1.3intron_Change_position_41337467_to_A        DDX3XE2sg1.3intron_Change_position_41337467_to_A        CTAGAGGGCAGGACTAGTTTTCTCTTAATGTAGTATTTGAGCACATGGGGTGCTAACCATCTCACTCTCTTCTAGTTTGCTGGTCTAGACCTGAACTCTTCAGATAATCAGAGTGGAGGAAGTACAGCCAGCAGAAAGTACAACATCTTGTGGGTTTATTGAATATTAGAGCTTAACATCTTAAGATTTCATTGGGCTTAT     201     425  1LibDep

Here we can see that the sequence corresponding to the library template DDX3XE2sg1.5intron_Change_position_41337383_to_C was found 334 times in our read sequences.

Finally, we look at the statistics which are in a JSON formatted file ({output}.stats.json):

python -m json.tool LibDep.stats.json

Which gives:

{
    "version": "1.0.0",
    "command": "pyquest -o LibDep -s LibDep -l DDX3XE2sg1_pyquest.trimmed.tsv Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz",
    "sample_name": "LibDep",
    "input_reads": 271080,
    "total_reads": 270954,
    "discarded_reads": 126,
    "vendor_failed_reads": 0,
    "length_excluded_reads": 0,
    "ambiguous_nt_reads": 126,
    "masked_reads": 0,
    "mapped_to_template_reads": 198268,
    "mean_count_per_template": 438.65,
    "median_count_per_template": 402.5,
    "multimap_reads": 0,
    "unmapped_reads": 72686,
    "total_templates": 452,
    "total_unique_templates": 452,
    "length_excluded_templates": 0,
    "zero_count_templates": 2,
    "low_count_templates_lt_15": 3,
    "low_count_templates_lt_30": 3,
    "low_count_templates_user": null,
    "gini_coefficient": 0.17
}

The library-dependent statistics contain the same basic sequences as the library-independent counts statistics, but also include library-specific statistics such as the number of reads mapped to library sequences (mapped_to_template_reads), mean and median coverage across the library (mean_count_per_template and median_count_per_template respectively), number of library sequences which did not match any reads (zero_count_templates) and the Gini coefficient (gini_coefficient). See README.md for a more detailed description of each statistic.

Read length filter

The read length filter can also be applied when generating library-dependent counts:

pyquest -o LibDep200 -s LibDep200 --min-length 200 -l DDX3XE2sg1_pyquest.trimmed.tsv Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz

Notice the warning which tells us 35 unique library sequences were below our minimum length threshold of 200nt.

WARNING: 35 unique library sequences are below the minimum length (200)!

If we look at the statistics, we can tell this corresponds to 30538 reads (the sum of the abundances of library sequences < 200nt):

python -m json.tool LibDep200.stats.json

We can now see that 30538 reads were discarded (length_excluded_reads).

{
    "version": "1.0.0",
    "command": "pyquest -o LibDep200 -s LibDep200 --min-length 200 -l DDX3XE2sg1_pyquest.trimmed.tsv Plasmid_sg1_Rep1_E2_R1.trimmed.fq.gz",
    "sample_name": "LibDep200",
    "input_reads": 271080,
    "total_reads": 240425,
    "discarded_reads": 30655,
    "vendor_failed_reads": 0,
    "length_excluded_reads": 30538,
    "ambiguous_nt_reads": 126,
    "masked_reads": 0,
    "mapped_to_template_reads": 182710,
    "mean_count_per_template": 438.15,
    "median_count_per_template": 401.0,
    "multimap_reads": 0,
    "unmapped_reads": 57715,
    "total_templates": 452,
    "total_unique_templates": 452,
    "length_excluded_templates": 35,
    "zero_count_templates": 2,
    "low_count_templates_lt_15": 3,
    "low_count_templates_lt_30": 3,
    "low_count_templates_user": null,
    "gini_coefficient": 0.17
}

Note: all library sequences will be present in the library-dependent counts when the minimum length filter is applied, but they will have 0 reads assigned.


Congratulations! You have reached the end of the tutorial!