Skip to content

Fix annotation <-> reference name translation#88

Merged
dkhofer merged 16 commits intomainfrom
reference-name-aliases
Apr 15, 2026
Merged

Fix annotation <-> reference name translation#88
dkhofer merged 16 commits intomainfrom
reference-name-aliases

Conversation

@dkhofer
Copy link
Copy Markdown
Member

@dkhofer dkhofer commented Apr 2, 2026

Resolves #85

INSERT INTO nodes (id, sequence_hash) values (X'84d6adbd5395281933fe41e877d3a7f02a3b1990a65be1901b2c91fc685e083b', X'84d6adbd5395281933fe41e877d3a7f02a3b1990a65be1901b2c91fc685e083b');
INSERT INTO nodes (id, sequence_hash) values (X'1c7dfc64977b0838af0762d7333dcb64c175b15e65a70099ec38f46bf1a15ea3', X'1c7dfc64977b0838af0762d7333dcb64c175b15e65a70099ec38f46bf1a15ea3');

INSERT INTO reference_aliases (reference_name, refseq_accession_id, genbank_accession_id) values ('E. coli K-12 MG1655', 'NC_000913.3', 'U00096.3');
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "reference name" our attempt to do a normalized name?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should we support names that map to something like "chr1" and "1", "I"? A column called misc that is a json array of other names to load?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Reference name" is there because I wanted a human-readable field so it's clearer what organism the row represents. I'm just using whatever ncbi is providing.

For this PR I'm focused on solving the fasta name <-> GFF name mapping, so that's why I've got just two fields beyond reference name. If we want the UCSC reference name or some other field, or a misc field, I'm open to that, but not planning to do that here unless there's evidence of common GFF usage of other reference names for things like e coli, yeast, mouse, human.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looks like while my issue with e coli is solved by associating refseq and genbank names, Bob's issue requires the UCSC name so I'll add that in, and maybe a misc field too

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some common stuff we'll hit:

##gff-version 3
#description: evidence-based annotation of the human genome (GRCh38), version 49 (Ensembl 115)
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2025-07-08
##sequence-region chr1 1 248956422
chr1    HAVANA  gene    11121   24894   .       +       .       ID=ENSG00000290825.2;gene_id=ENSG00000290825.2;gene_type=lncRNA;gene_name=DDX11L16;level=2;tag=overlaps_pseudogene
chr1    HAVANA  transcript      11121   14413   .       +       .       ID=ENST00000832824.1;Parent=ENSG00000290825.2;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;level=2;tag=TAGENE
chr1    HAVANA  exon    11121   11211   .       +       .       ID=exon:ENST00000832824.1:1;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;exon_number=1;exon_id=ENSE00004248723.1;level=2;tag=TAGENE
chr1    HAVANA  exon    12010   12227   .       +       .       ID=exon:ENST00000832824.1:2;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;transcript_name=DDX11L16-260;exon_number=2;exon_id=ENSE00004248735.1;level=2;tag=TAGENE
chr1    HAVANA  exon    12613   12721   .       +       .       ID=exon:ENST00000832824.1:3;Parent=ENST00000832824.1;gene_id=ENSG00000290825.2;transcript_id=ENST00000832824.1;gene_type=lncRNA;gene_name=DDX11L16;transcript_type=lncRNA;tr

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vcfs will be similar w/ Chr1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah, this is more of an attempt to do normalized name now. For fields I've now got refseq ID, genbank ID, ensembl ID, UCSC ID, "misc" (not used yet, can be null), and "chromosome" which is a nullable int intended to be the chromosome number. So for yeast, everything seems to be roman numerals but the vcf can still be regular numbers (eg chrX versus chr10), so for yeast I'm manually populating the chromosome field to be for instance 10 if the chromosome is chrX. More details on how I'm using "10" below, but first the high level explanation.

Basically for each reference name, I'm building a list of strings that could be aliases, plus the reference name. For each block group name, if it's in the alias list, then I add each alias as a key in a hashmap pointing back to the block group name as a value. Then when parsing GFF or VCF or whatever, if the seq/ref name is present as a key in the alias map, we use the value as the block group to apply the annotation or variant to.

To build the list of aliases, I'm defining a compute_aliases method that uses the reference alias fields and constructs some additional strings in memory that could plausibly also be a name for the same reference contig, and returns it all in a big hashset. (So if the chromosome field is 10, we also add "chr10", "Chr10", "chromosome10", etc.) That contains the messiness to one method. The resulting hashset is what I use to populate the alias -> reference hashmap.

@dkhofer
Copy link
Copy Markdown
Member Author

dkhofer commented Apr 6, 2026

Seems to work. Commands using Bob's yeast files, unmodified:

gen init
gen import fasta --sample reference S288C_reference_sequence_R64-1-1_20110203.fsa
gen add-annotation-file saccharomyces_cerevisiae_R64-1-1_20110208.gff
gen update vcf --parent-sample reference BRQ_trunk.vcf

for reference_alias in reference_aliases {
let aliases = ReferenceAlias::compute_aliases(reference_alias);
for reference in &references {
if aliases.contains(reference) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be problematic if there is somehow overlap between two sets of reference aliases

Copy link
Copy Markdown
Member

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few thoughts


// Load all reference aliases, to accommodate alternate reference names in the GFF file
let references = sample_bgs.keys().cloned().collect::<Vec<String>>();
let references_by_alias = ReferenceAlias::get_references_by_alias(conn, references)?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we scoped referencealias by blockgroup, we could figure out chr1 is actually NC_00000.1 as well. Then in the lineage code we can copy down aliases to keep it working as usual.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a decent idea, and I thought about it for a while, but I'm having trouble figuring out how to make it work, because I'm not sure what assumptions we can make about the schema and data.

I am a bit confused about what you mean by scoping referencealias by blockgroup. I think the referencealias rows should exist independently of block groups. I think the best solution would be a join table between block group and reference alias. But then the question becomes which referencealias fields we put in the join table. I think all the relevant ones could be nullable, except reference name, which may be duplicated (for organisms with multiple chromosomes). I guess we could force reference name to end in something like "chr1" or something. Or we could add an incrementing ID field to referencealias.

To me, the benefit of adding a join table would be to log which reference we've associated a block group with, and slightly faster lookup time for the reference aliases of a block group, although I'm not sure speed will be a huge issue here. I'm not convinced those are important enough for us to nail down that part of the schema right now.

Comment thread gen-annotations/src/translate/bed.rs Outdated
io::{Read, Write},
};

use anyhow::Result;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyhow is for application code. For errors, we should make a new TranslateBedError in errors.rs using thiserror and specify errors that way.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed

) STRICT;
CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id, chromosome_index, phased);

CREATE TABLE reference_aliases (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been adding new migrations these days. Can you make this core/0x-reference/up.sql?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, added

) STRICT;
CREATE UNIQUE INDEX block_group_edges_uidx ON block_group_edges(block_group_id, edge_id, chromosome_index, phased);

CREATE TABLE reference_aliases (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a down.sql too

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Comment thread gen-models/src/reference_alias.rs Outdated
Ok(())
}

pub fn load_all(conn: &GraphConnection) -> Result<Vec<ReferenceAlias>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get this through all in the Query trait

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, did that

Comment thread gen-models/src/reference_alias.rs Outdated
}
aliases.insert(reference_alias.ucsc_id);
aliases.insert(reference_alias.ensembl_id.clone());
aliases.insert(format!("chr{}", reference_alias.ensembl_id));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make downstream stuff case insensitive vs. this

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to just keep it simple for now and not do downstream processing, is that all right?

Comment thread src/commands/mod.rs Outdated
reference_name: String,
/// The refseq accession ID
#[arg(long)]
refseq_accession_id: String,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have optional for quite a few of these. It's not guaranteed that an organism is in genbank/refseq/ensembl/ucsc

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, changed. Also added a uniqueness constraint on the refseq ID. Nothing else seems to be especially unique otherwise, unfortunately

Comment thread src/commands/mod.rs Outdated
reference_name: String,
/// The refseq accession ID
#[arg(long)]
refseq_accession_id: String,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually care about the soruce as well? Could we do this as (reference_name, alias) and just ensure that's a unique pair? With my prior suggestion it would be (bg_id, reference_name, alias)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess an advantage of this is we know ucsc id X == ensembl id Y without defining all the cross pairs

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like having it all in one row

@dkhofer dkhofer force-pushed the reference-name-aliases branch from a30cbe8 to 404cf3b Compare April 13, 2026 19:07
@dkhofer
Copy link
Copy Markdown
Member Author

dkhofer commented Apr 14, 2026

This is ready for another look

Copy link
Copy Markdown
Member

@Chris7 Chris7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment. I think at some point we should use something like caseless to store these strings so we can do case-insensitive comparisons w/o a speed hit.

Comment thread gen-models/src/reference_alias.rs Outdated
aliases.insert(format!("Chr{}", custom_id));
aliases.insert(format!("chrom{}", custom_id));
aliases.insert(format!("Chrom{}", custom_id));
aliases.insert(format!("chr{}", custom_id));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be chromosome. Maybe a helper method to do all these?

@dkhofer dkhofer force-pushed the reference-name-aliases branch from 5c94548 to 996c5b3 Compare April 15, 2026 20:20
@dkhofer dkhofer merged commit a3a3dbc into main Apr 15, 2026
5 checks passed
@dkhofer dkhofer deleted the reference-name-aliases branch April 15, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Translating block group names

2 participants