Hello, thank you for creating Helixer.
We've discovered an issue when running Helixer on the Boeremia exigua RefSeq genome assembly GCF_020726555.1 (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_020726555.1/).
Helixer is predicting non-canonical splice sites for ~31% of genes, while the original RefSeq annotation for this same assembly contains only 0.9% of genes with non-canonical splice sites (which is within the normal range for fungi).
- Helixer version: v0.3.3
- Model: fungi_v0.3_a_0100.h5
We also noticed that this assembly was included in Helixer's training data for model fungi_v0.3_a_0100.h5.
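For reference, here is a minimal sketch of how an intron can be classified as canonical or non-canonical from the contig sequence. It is an illustration under stated assumptions, not necessarily the exact procedure behind the percentages above: only GT-AG is treated as canonical, only plus-strand introns are handled, and the function names and the toy sequence are hypothetical.

```python
def splice_dinucleotides(seq, intron_start, intron_end):
    """Return (donor, acceptor) dinucleotides of a plus-strand intron.

    seq is the contig sequence as a string; intron_start and intron_end
    are 1-based inclusive coordinates, as in GFF.
    """
    donor = seq[intron_start - 1:intron_start + 1].upper()
    acceptor = seq[intron_end - 2:intron_end].upper()
    return donor, acceptor


def is_canonical(seq, intron_start, intron_end):
    """Assumption: only GT-AG counts as canonical (GC-AG and AT-AC do not)."""
    return splice_dinucleotides(seq, intron_start, intron_end) == ("GT", "AG")


# Toy sequence (not from the assembly): an intron spanning positions 4..11.
toy = "AAAGTCCCCAGTTT"
print(splice_dinucleotides(toy, 4, 11))  # donor and acceptor of the true intron
print(is_canonical(toy, 4, 11))          # intron ends exactly at the AG
print(is_canonical(toy, 4, 12))          # acceptor shifted 1 bp downstream
```

Minus-strand introns would need the reverse complement of the interval before the same check is applied.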
Here is an example gene where Helixer predicts non-canonical splice sites for both introns, although the RefSeq annotation places canonical sites only 3 bp away (exons 2 and 3 start at 1287957 and 1288417 in the Helixer prediction versus 1287954 and 1288414 in RefSeq):
Gene: gene-C7974DRAFT_127213
RefSeq GFF (canonical):
NW_025763520.1 RefSeq gene 1287542 1288678 . + . ID=gene-C7974DRAFT_127213;Dbxref=GeneID:70164704;Name=C7974DRAFT_127213;gbkey=Gene;gene_biotype=protein_coding;locus_tag=C7974DRAFT_127213
NW_025763520.1 RefSeq mRNA 1287542 1288678 . + . ID=rna-XM_046135284.1;Parent=gene-C7974DRAFT_127213;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;Name=XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1 RefSeq exon 1287542 1287905 . + . ID=exon-XM_046135284.1-1;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1 RefSeq exon 1287954 1288365 . + . ID=exon-XM_046135284.1-2;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1 RefSeq exon 1288414 1288678 . + . ID=exon-XM_046135284.1-3;Parent=rna-XM_046135284.1;Dbxref=GeneID:70164704,GenBank:XM_046135284.1;gbkey=mRNA;locus_tag=C7974DRAFT_127213;orig_protein_id=gnl|WGS:JAHBNH|C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;transcript_id=XM_046135284.1
NW_025763520.1 RefSeq CDS 1287670 1287905 . + 0 ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1
NW_025763520.1 RefSeq CDS 1287954 1288365 . + 1 ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1
NW_025763520.1 RefSeq CDS 1288414 1288485 . + 0 ID=cds-XP_046000208.1;Parent=rna-XM_046135284.1;Dbxref=InterPro:IPR008701,JGIDB:Boeex1_127213,GeneID:70164704,GenBank:XP_046000208.1;Name=XP_046000208.1;gbkey=CDS;locus_tag=C7974DRAFT_127213;orig_transcript_id=gnl|WGS:JAHBNH|C7974DRAFT_mRNA127213;product=necrosis inducing protein-domain-containing protein;protein_id=XP_046000208.1
Helixer prediction (non-canonical):
NW_025763520.1 Helixer gene 1287446 1288663 . + . ID=_NW_025763520.1_000229
NW_025763520.1 Helixer mRNA 1287446 1288663 . + . ID=_NW_025763520.1_000229.1;Parent=_NW_025763520.1_000229
NW_025763520.1 Helixer exon 1287446 1287905 . + . ID=_NW_025763520.1_000229.1.exon.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer five_prime_UTR 1287446 1287669 . + . ID=_NW_025763520.1_000229.1.five_prime_UTR.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer CDS 1287670 1287905 . + 0 ID=_NW_025763520.1_000229.1.CDS.1;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer exon 1287957 1288365 . + . ID=_NW_025763520.1_000229.1.exon.2;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer CDS 1287957 1288365 . + 1 ID=_NW_025763520.1_000229.1.CDS.2;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer exon 1288417 1288663 . + . ID=_NW_025763520.1_000229.1.exon.3;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer CDS 1288417 1288485 . + 0 ID=_NW_025763520.1_000229.1.CDS.3;Parent=_NW_025763520.1_000229.1
NW_025763520.1 Helixer three_prime_UTR 1288486 1288663 . + . ID=_NW_025763520.1_000229.1.three_prime_UTR.1;Parent=_NW_025763520.1_000229.1
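To make the difference between the two annotations easier to see, here is a small sketch that derives the intron intervals from the exon coordinates listed above and reports how far the Helixer boundaries are from the RefSeq ones. The exon coordinates are copied from the two GFF excerpts; the helper name `introns_from_exons` is hypothetical and the code assumes a single plus-strand transcript.

```python
def introns_from_exons(exons):
    """Return 1-based inclusive intron intervals between consecutive exons."""
    exons = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


# Exon coordinates taken from the GFF records above.
refseq_exons = [(1287542, 1287905), (1287954, 1288365), (1288414, 1288678)]
helixer_exons = [(1287446, 1287905), (1287957, 1288365), (1288417, 1288663)]

refseq_introns = introns_from_exons(refseq_exons)
helixer_introns = introns_from_exons(helixer_exons)

for (r_start, r_end), (h_start, h_end) in zip(refseq_introns, helixer_introns):
    print(f"RefSeq {r_start}-{r_end}  Helixer {h_start}-{h_end}  "
          f"donor shift {h_start - r_start} bp, acceptor shift {h_end - r_end} bp")
```

For this gene the donor positions agree, while both Helixer acceptors lie 3 bp downstream of the RefSeq acceptors, so each predicted intron is 51 bp long instead of 48 bp.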
Are there any recommended parameters to use for cases like this? Any suggestions for troubleshooting this would be greatly appreciated!
Thanks in advance.