Fix annotation <-> reference name translation by dkhofer · Pull Request #88 · genhub-bio/gen

dkhofer · 2026-04-02T20:26:55Z

Resolves #85

Chris7 · 2026-04-02T21:15:49Z

 INSERT INTO nodes (id, sequence_hash) values (X'84d6adbd5395281933fe41e877d3a7f02a3b1990a65be1901b2c91fc685e083b', X'84d6adbd5395281933fe41e877d3a7f02a3b1990a65be1901b2c91fc685e083b');
 INSERT INTO nodes (id, sequence_hash) values (X'1c7dfc64977b0838af0762d7333dcb64c175b15e65a70099ec38f46bf1a15ea3', X'1c7dfc64977b0838af0762d7333dcb64c175b15e65a70099ec38f46bf1a15ea3');
+
+INSERT INTO reference_aliases (reference_name, refseq_accession_id, genbank_accession_id) values ('E. coli K-12 MG1655', 'NC_000913.3', 'U00096.3');


Is "reference name" our attempt at a normalized name?

How should we support names that map to something like "chr1" and "1", "I"? A column called misc that is a json array of other names to load?

"Reference name" is there because I wanted a human-readable field so it's clearer what organism the row represents. I'm just using whatever NCBI provides.

For this PR I'm focused on solving the FASTA name <-> GFF name mapping, which is why I've got just two fields beyond reference name. If we want the UCSC reference name or some other field, or a misc field, I'm open to that, but I'm not planning to do that here unless there's evidence of common GFF usage of other reference names for things like E. coli, yeast, mouse, human.

Actually, it looks like while my issue with E. coli is solved by associating RefSeq and GenBank names, Bob's issue requires the UCSC name, so I'll add that in, and maybe a misc field too.

Here's some common stuff we'll hit:

Import a genome fasta from refseq/ncbi -- that will have NC_000001.11 for human's chr1

Use the GFF from GENCODE -- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.annotation.gff3.gz. That will have chr1, like:

##gff-version 3
#description: evidence-based annotation of the human genome (GRCh38), version 49 (Ensembl 115)
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2025-07-08
##sequence-region chr1 1 248956422
chr1 HAVANA gene 11121 24894 . + . ID=ENSG00000290825.2;gene_id=ENSG00000290825.2;gene_type=lncRNA;gene_name=DDX11L16;level=2;tag=overlaps_pseudogene
chr1 HAVANA transcript 11121 14413 . + . ID=ENST00000832824.1;Parent=ENSG00000290825.2;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;level=2;tag=TAGENE
chr1 HAVANA exon 11121 11211 . + . ID=exon:ENST00000832824.1:1;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;exon_number=1;exon_id=ENSE00004248723.1;level=2;tag=TAGENE
chr1 HAVANA exon 12010 12227 . + . ID=exon:ENST00000832824.1:2;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;exon_number=2;exon_id=ENSE00004248735.1;level=2;tag=TAGENE
chr1 HAVANA exon 12613 12721 . + . ID=exon:ENST00000832824.1:3;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;tr

Vcfs will be similar w/ Chr1.

OK, yeah, this is more of an attempt at a normalized name now. The fields I've now got are RefSeq ID, GenBank ID, Ensembl ID, UCSC ID, "misc" (not used yet, can be null), and "chromosome", a nullable int intended to hold the chromosome number. For yeast, everything seems to use Roman numerals, but the VCF can still use regular numbers (e.g. chrX versus chr10), so for yeast I'm manually populating the chromosome field, for instance with 10 if the chromosome is chrX. More details on how I'm using "10" below, but first the high-level explanation.

Basically for each reference name, I'm building a list of strings that could be aliases, plus the reference name. For each block group name, if it's in the alias list, then I add each alias as a key in a hashmap pointing back to the block group name as a value. Then when parsing GFF or VCF or whatever, if the seq/ref name is present as a key in the alias map, we use the value as the block group to apply the annotation or variant to.
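A minimal sketch of the alias map described above, assuming the alias sets have already been computed. The function names `references_by_alias` and `resolve` and the input shapes are hypothetical, chosen for illustration; the PR's actual `get_references_by_alias` takes a database connection and may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Map every alias of a reference to the block group name that is already
/// present in the database. `alias_sets` holds one set of interchangeable
/// names per reference (a hypothetical input shape).
pub fn references_by_alias(
    block_group_names: &[String],
    alias_sets: &[HashSet<String>],
) -> HashMap<String, String> {
    let mut by_alias = HashMap::new();
    for aliases in alias_sets {
        for name in block_group_names {
            if aliases.contains(name) {
                // Every alias in this set now points back to the block group name.
                for alias in aliases {
                    by_alias.insert(alias.clone(), name.clone());
                }
            }
        }
    }
    by_alias
}

/// Look up the sequence name from a GFF or VCF record. Returns the block
/// group name when the sequence name is a known alias, otherwise None.
pub fn resolve<'a>(by_alias: &'a HashMap<String, String>, seq_name: &str) -> Option<&'a str> {
    by_alias.get(seq_name).map(String::as_str)
}
```

With a block group named NC_000001.11 and an alias set containing NC_000001.11, chr1 and 1, a GFF record on chr1 would resolve to NC_000001.11.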

To build the list of aliases, I'm defining a compute_aliases method that uses the reference alias fields and constructs some additional strings in memory that could plausibly also be a name for the same reference contig, and returns it all in a big hashset. (So if the chromosome field is 10, we also add "chr10", "Chr10", "chromosome10", etc.) That contains the messiness to one method. The resulting hashset is what I use to populate the alias -> reference hashmap.
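As a rough illustration of what such a method could look like: the struct below is a hypothetical stand-in for a reference_aliases row, with field names, types, and the prefix list all assumed rather than taken from the PR. Including the bare chromosome number as an alias is also an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of a reference_aliases row; the real struct may differ.
#[derive(Debug, Clone, Default)]
pub struct ReferenceAlias {
    pub reference_name: String,
    pub refseq_id: Option<String>,
    pub genbank_id: Option<String>,
    pub ensembl_id: Option<String>,
    pub ucsc_id: Option<String>,
    pub chromosome: Option<i64>,
}

impl ReferenceAlias {
    /// Collect every string that could plausibly name this reference contig.
    pub fn compute_aliases(&self) -> HashSet<String> {
        let mut aliases = HashSet::new();
        aliases.insert(self.reference_name.clone());

        // Add whichever database identifiers are present.
        let optional_ids = [
            &self.refseq_id,
            &self.genbank_id,
            &self.ensembl_id,
            &self.ucsc_id,
        ];
        for id in optional_ids.iter().copied().flatten() {
            aliases.insert(id.clone());
        }

        // Derive chromosome-style names from the numeric chromosome field.
        if let Some(number) = self.chromosome {
            // Assumption: the bare number is itself a valid alias.
            aliases.insert(number.to_string());
            let prefixes = ["chr", "Chr", "chrom", "Chrom", "chromosome", "Chromosome"];
            for prefix in prefixes.iter() {
                aliases.insert(format!("{}{}", prefix, number));
            }
        }
        aliases
    }
}
```

For a yeast row with the UCSC ID chrX and the chromosome field set to 10, this set contains both chrX and chr10, which is what lets a Roman-numeral GFF and an Arabic-numeral VCF land on the same block group.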

dkhofer · 2026-04-06T19:00:32Z

Seems to work. Commands using Bob's yeast files, unmodified:

gen init
gen import fasta --sample reference S288C_reference_sequence_R64-1-1_20110203.fsa
gen add-annotation-file saccharomyces_cerevisiae_R64-1-1_20110208.gff
gen update vcf --parent-sample reference BRQ_trunk.vcf

dkhofer · 2026-04-06T19:45:32Z

+        for reference_alias in reference_aliases {
+            let aliases = ReferenceAlias::compute_aliases(reference_alias);
+            for reference in &references {
+                if aliases.contains(reference) {


This could be problematic if there is somehow overlap between two sets of reference aliases
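One way to surface that situation, sketched under the assumption that each reference's alias set is available alongside its name, is to invert the sets and report any alias claimed by more than one reference. The function name `find_overlapping_aliases` and its input shape are hypothetical and not part of this PR.

```rust
use std::collections::{HashMap, HashSet};

/// Return every alias that is claimed by more than one reference, together
/// with the names of the references that claim it. Results are sorted by alias.
pub fn find_overlapping_aliases(
    alias_sets: &[(String, HashSet<String>)],
) -> Vec<(String, Vec<String>)> {
    // alias -> names of the references whose alias set contains it
    let mut owners: HashMap<&str, Vec<&str>> = HashMap::new();
    for (reference_name, aliases) in alias_sets {
        for alias in aliases {
            owners
                .entry(alias.as_str())
                .or_default()
                .push(reference_name.as_str());
        }
    }

    let mut conflicts: Vec<(String, Vec<String>)> = owners
        .into_iter()
        .filter(|(_, refs)| refs.len() > 1)
        .map(|(alias, refs)| {
            (
                alias.to_string(),
                refs.iter().map(|r| r.to_string()).collect(),
            )
        })
        .collect();
    conflicts.sort();
    conflicts
}
```

A plausible trigger is a name like chrX, which could mean chromosome ten in a Roman-numeral scheme and the X chromosome elsewhere; a check like this would at least make the ambiguity visible instead of letting the last writer win in the alias map.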

Chris7

I have a few thoughts

Chris7 · 2026-04-07T14:56:34Z

+
+    // Load all reference aliases, to accommodate alternate reference names in the GFF file
+    let references = sample_bgs.keys().cloned().collect::<Vec<String>>();
+    let references_by_alias = ReferenceAlias::get_references_by_alias(conn, references)?;


if we scoped referencealias by blockgroup, we could figure out chr1 is actually NC_00000.1 as well. Then in the lineage code we can copy down aliases to keep it working as usual.

That is a decent idea, and I thought about it for a while, but I'm having trouble figuring out how to make it work, because I'm not sure what assumptions we can make about the schema and data.

I am a bit confused about what you mean by scoping referencealias by blockgroup. I think the referencealias rows should exist independently of block groups. I think the best solution would be a join table between block group and reference alias. But then the question becomes which referencealias fields we put in the join table. I think all the relevant ones could be nullable, except reference name, which may be duplicated (for organisms with multiple chromosomes). I guess we could force reference name to end in something like "chr1" or something. Or we could add an incrementing ID field to referencealias.

To me, the benefit of adding a join table would be to log which reference we've associated a block group with, and slightly faster lookup time for the reference aliases of a block group, although I'm not sure speed will be a huge issue here. I'm not convinced those are important enough for us to nail down that part of the schema right now.

Chris7 · 2026-04-07T14:58:12Z

+    io::{Read, Write},
 };

+use anyhow::Result;


Anyhow is for application code. For errors, we should make a new TranslateBedError in errors.rs using thiserror and specify errors that way.

Oops, fixed
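For readers unfamiliar with the pattern, here is a rough idea of what a dedicated error type provides. The variants are hypothetical, and the trait impls are written by hand so the sketch compiles with only the standard library; with thiserror, a derive macro would generate the Display, Error, and From impls from attributes instead.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type for BED translation; variant names are illustrative.
#[derive(Debug)]
pub enum TranslateBedError {
    /// Reading or writing the BED file failed.
    Io(io::Error),
    /// A record named a reference with no matching block group or alias.
    UnknownReference(String),
}

impl fmt::Display for TranslateBedError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            TranslateBedError::Io(err) => write!(f, "I/O error: {}", err),
            TranslateBedError::UnknownReference(name) => {
                write!(f, "no block group or alias matches reference '{}'", name)
            }
        }
    }
}

impl Error for TranslateBedError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            TranslateBedError::Io(err) => Some(err),
            TranslateBedError::UnknownReference(_) => None,
        }
    }
}

// Lets `?` convert io::Error into TranslateBedError automatically.
impl From<io::Error> for TranslateBedError {
    fn from(err: io::Error) -> Self {
        TranslateBedError::Io(err)
    }
}
```

Callers can then match on specific variants rather than inspecting an opaque error, which is the usual argument for typed errors in library code.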

Chris7 · 2026-04-07T15:00:38Z

 ) STRICT;
 CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id, chromosome_index, phased);

+CREATE TABLE reference_aliases (


I've been adding new migrations these days. Can you make this core/0x-reference/up.sql?

Chris7 · 2026-04-07T15:01:02Z

 ) STRICT;
 CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id, chromosome_index, phased);

+CREATE TABLE reference_aliases (


need a down.sql too

Chris7 · 2026-04-07T15:02:09Z

+        Ok(())
+    }
+
+    pub fn load_all(conn: &GraphConnection) -> Result<Vec<ReferenceAlias>> {


We already get this through "all" in the Query trait.

Oops, did that

Chris7 · 2026-04-07T15:45:27Z

+        }
+        aliases.insert(reference_alias.ucsc_id);
+        aliases.insert(reference_alias.ensembl_id.clone());
+        aliases.insert(format!("chr{}", reference_alias.ensembl_id));


we should make downstream stuff case insensitive vs. this

I'd prefer to just keep it simple for now and not do downstream processing, is that all right?

Chris7 · 2026-04-07T15:49:22Z

+        reference_name: String,
+        /// The refseq accession ID
+        #[arg(long)]
+        refseq_accession_id: String,


Quite a few of these should be optional. It's not guaranteed that an organism is in GenBank/RefSeq/Ensembl/UCSC.

Yep, changed. Also added a uniqueness constraint on the refseq ID. Nothing else seems to be especially unique otherwise, unfortunately

Chris7 · 2026-04-07T15:50:43Z

+        reference_name: String,
+        /// The refseq accession ID
+        #[arg(long)]
+        refseq_accession_id: String,


Do we actually care about the source as well? Could we do this as (reference_name, alias) and just ensure that's a unique pair? With my prior suggestion it would be (bg_id, reference_name, alias).

I guess an advantage of this is that we know UCSC ID X == Ensembl ID Y without defining all the cross pairs.

Yeah, I like having it all in one row

dkhofer · 2026-04-14T19:51:28Z

This is ready for another look

Chris7

One comment. I think at some point we should use something like caseless to store these strings so we can do case-insensitive comparisons w/o a speed hit.
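A standard-library-only sketch of the idea, not the caseless crate itself: normalize each alias once when it is stored, then normalize the query at lookup time, so no per-entry case folding happens during comparison. The type name `CaseInsensitiveAliases` is hypothetical, and ASCII-only lowercasing is an assumption that seems reasonable for accession IDs and chromosome names.

```rust
use std::collections::HashMap;

/// Alias lookup that ignores ASCII case by lowercasing keys on insert
/// and lowercasing the query on lookup.
#[derive(Debug, Default)]
pub struct CaseInsensitiveAliases {
    by_alias: HashMap<String, String>,
}

impl CaseInsensitiveAliases {
    /// Register `alias` as another name for `reference`.
    pub fn insert(&mut self, alias: &str, reference: &str) {
        self.by_alias
            .insert(alias.to_ascii_lowercase(), reference.to_string());
    }

    /// Find the reference for `name`, regardless of letter case.
    pub fn get(&self, name: &str) -> Option<&str> {
        self.by_alias
            .get(&name.to_ascii_lowercase())
            .map(String::as_str)
    }
}
```

This would also remove the need to generate both chr and Chr variants up front. The trade-off is one small allocation per lookup for the lowercased query.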

Chris7 · 2026-04-15T19:55:53Z

+            aliases.insert(format!("Chr{}", custom_id));
+            aliases.insert(format!("chrom{}", custom_id));
+            aliases.insert(format!("Chrom{}", custom_id));
+            aliases.insert(format!("chr{}", custom_id));


should be chromosome. Maybe a helper method to do all these?

Chris7 reviewed Apr 2, 2026

dkhofer commented Apr 6, 2026

Chris7 reviewed Apr 7, 2026

dkhofer force-pushed the reference-name-aliases branch from a30cbe8 to 404cf3b Compare April 13, 2026 19:07

Chris7 approved these changes Apr 15, 2026

dkhofer added 16 commits April 15, 2026 16:05

e1c1df2 Fix annotation <-> reference name translation
ff59650 Minor fixes
b1481a5 One more fix
2bb4576 Fix bed translation too, add unit tests
aa1ca39 Add in more alias types and prepopulated values
8b1693b Fix build
0ed0e33 Use reference aliases in update with vcf
d9b8a3e Remove print statements
4a56c66 Add command to add reference aliases
bf14563 Also translate bed annotations
0231447 Address some comments
3d492e0 Address comments
f087e25 Fix build
363ebc6 Actually get rid of anyhow
013600e Refactor chromosome name code
996c5b3 Fix after rebase

dkhofer force-pushed the reference-name-aliases branch from 5c94548 to 996c5b3 Compare April 15, 2026 20:20

dkhofer merged commit a3a3dbc into main Apr 15, 2026
5 checks passed

dkhofer deleted the reference-name-aliases branch April 15, 2026 20:26
