High rate of non-canonical splice site predictions #213

@EleanorEveCarr

Hello, thank you for creating Helixer.

We've found an issue when running Helixer on the Boeremia exigua RefSeq genome assembly GCF_020726555.1 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_020726555.1/).
Helixer predicts non-canonical splice sites in ~31% of genes, whereas the original RefSeq annotation of the same assembly has non-canonical splice sites in only 0.9% of genes (which is within the normal range for fungi).

  • Helixer version: v0.3.3
  • Model: fungi_v0.3_a_0100.h5

We also noticed that this assembly was included in Helixer's training data for model fungi_v0.3_a_0100.h5.
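For anyone who wants to reproduce rates like the ones quoted above, here is a minimal sketch of one way to classify introns and count affected genes. It is not necessarily how the numbers in this report were computed. It assumes "canonical" means GT-AG only (some pipelines also accept GC-AG and AT-AC), and the function names, the toy sequence, and the gene IDs are all made up for illustration.

```python
# Hypothetical helpers, not part of Helixer.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def splice_dinucleotides(seq, intron_start, intron_end, strand="+"):
    """Return (donor, acceptor) dinucleotides of an intron.

    Coordinates are 1-based, inclusive, on the forward strand (GFF3 style).
    """
    intron = seq[intron_start - 1:intron_end].upper()
    if strand == "-":
        # Reverse-complement so the donor is read in transcript orientation.
        intron = intron.translate(COMPLEMENT)[::-1]
    return intron[:2], intron[-2:]


def is_canonical(donor, acceptor, allowed=(("GT", "AG"),)):
    """Assumption: only GT-AG counts as canonical unless 'allowed' is extended."""
    return (donor, acceptor) in allowed


def fraction_noncanonical(introns_by_gene):
    """Fraction of genes with at least one non-canonical intron.

    introns_by_gene maps a gene ID to a list of (donor, acceptor) pairs.
    """
    if not introns_by_gene:
        return 0.0
    flagged = sum(
        any(not is_canonical(d, a) for d, a in pairs)
        for pairs in introns_by_gene.values()
    )
    return flagged / len(introns_by_gene)


# Toy sequence (not from the assembly): a 12 bp GT...AG intron at positions 6..17.
toy = "AAACC" + "GTAAGTTTTCAG" + "GGGTT"

exact = splice_dinucleotides(toy, 6, 17)    # intron boundaries as annotated
shifted = splice_dinucleotides(toy, 6, 20)  # acceptor moved 3 bp downstream

genes = {"geneA": [exact], "geneB": [shifted]}
print(exact, shifted, fraction_noncanonical(genes))
```

In the toy case, moving the 3' intron boundary by 3 bp is enough to turn a GT-AG intron into a non-canonical one, even though the donor is unchanged.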

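Here is an example gene where Helixer predicts non-canonical splice sites for both introns, even though the RefSeq annotation has canonical sites very close by. In both introns the upstream exon ends at the same position as in RefSeq (1287905 and 1288365), but the downstream exon starts 3 bp later in the Helixer prediction (1287957 vs. 1287954, and 1288417 vs. 1288414):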

Gene: gene-C7974DRAFT_127213
RefSeq GFF (canonical):

NW_025763520.1  RefSeq  gene    1287542 1288678 .       +       .       ID=gene-C7974DRAFT_127213;Dbxref=GeneID:70164704;Name=C7974DRAFT_127213;gbkey=Gene;gene_biotype=protein_coding;locus_tag=C7974DRAFT_127213
NW_025763520.1  RefSeq  mRNA    1287542 1288678 .       +       .       ID=rna-XM_046135284.1;Parent=gene-C7974DRAFT_127213;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;Name=XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1  RefSeq  exon    1287542 1287905 .       +       .       ID=exon-XM_046135284.1-1;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1  RefSeq  exon    1287954 1288365 .       +       .       ID=exon-XM_046135284.1-2;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1  RefSeq  exon    1288414 1288678 .       +       .       ID=exon-XM_046135284.1-3;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1  RefSeq  CDS     1287670 1287905 .       +       0       ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1
NW_025763520.1  RefSeq  CDS     1287954 1288365 .       +       1       ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1
NW_025763520.1  RefSeq  CDS     1288414 1288485 .       +       0       ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1

Helixer prediction (non-canonical):

NW_025763520.1  Helixer gene    1287446 1288663 .       +       .       ID=_NW_025763520.1_000229
NW_025763520.1  Helixer mRNA    1287446 1288663 .       +       .       ID=_NW_025763520.1_000229.1;Parent=_NW_025763520.1_000229
NW_025763520.1  Helixer exon    1287446 1287905 .       +       .       ID=_NW_025763520.1_000229.1.exon.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer five_prime_UTR  1287446 1287669 .       +       .       ID=_NW_025763520.1_000229.1.five_prime_UTR.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer CDS     1287670 1287905 .       +       0       ID=_NW_025763520.1_000229.1.CDS.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer exon    1287957 1288365 .       +       .       ID=_NW_025763520.1_000229.1.exon.2;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer CDS     1287957 1288365 .       +       1       ID=_NW_025763520.1_000229.1.CDS.2;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer exon    1288417 1288663 .       +       .       ID=_NW_025763520.1_000229.1.exon.3;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer CDS     1288417 1288485 .       +       0       ID=_NW_025763520.1_000229.1.CDS.3;Parent=_NW_025763520.1_000229.1
NW_025763520.1  Helixer three_prime_UTR 1288486 1288663 .       +       .       ID=_NW_025763520.1_000229.1.three_prime_UTR.1;Parent=_NW_025763520.1_000229.1
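To make the difference between the two annotations explicit, here is a small sketch that derives intron coordinates from the exon rows above and reports the offset at each intron boundary. The exon coordinates are copied from the two GFF excerpts. The helper name `introns_from_exons` is made up for illustration, and the comparison assumes both annotations have the same number of introns in the same order, which holds for this gene.

```python
def introns_from_exons(exons):
    """Given (start, end) exon pairs (1-based, inclusive), return the
    (start, end) pairs of the introns between consecutive exons."""
    exons = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


# Exon coordinates copied from the RefSeq and Helixer records above (+ strand).
refseq_exons = [(1287542, 1287905), (1287954, 1288365), (1288414, 1288678)]
helixer_exons = [(1287446, 1287905), (1287957, 1288365), (1288417, 1288663)]

refseq_introns = introns_from_exons(refseq_exons)
helixer_introns = introns_from_exons(helixer_exons)

# On the + strand the intron start is the donor side and the end is the acceptor side.
shifts = [
    (h_start - r_start, h_end - r_end)
    for (r_start, r_end), (h_start, h_end) in zip(refseq_introns, helixer_introns)
]

for i, (donor_shift, acceptor_shift) in enumerate(shifts, start=1):
    print(f"intron {i}: donor shift {donor_shift:+d} bp, acceptor shift {acceptor_shift:+d} bp")
```

For this gene both donor positions agree with RefSeq and both acceptor positions are 3 bp further downstream in the Helixer prediction, so the reading frame is preserved while the acceptor dinucleotide changes.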

Are there any recommended parameters to use for cases like this? Any suggestions for troubleshooting this would be greatly appreciated!

Thanks in advance.

Labels: enhancement (New feature or request)
