
Conversation

@alinakbase (Collaborator)

No description provided.

import argparse
import json
import os
from pathlib import Path
from typing import Optional

# Clean up the temporary file, ignoring errors if removal fails
if os.path.exists(tmp_path):
    try:
        os.remove(tmp_path)
    except Exception:
        pass
@@ -0,0 +1,734 @@
import json
from pathlib import Path
@ialarmedalien changed the base branch from develop to uniprot-refactor-v2 on January 21, 2026 at 19:07
if identifier.startswith("GCF_"):
return f"insdc.gcf:{identifier}"
Collaborator
Let's also add in:

if identifier.startswith("GCA_"):
    return f"insdc.gca:{identifier}"
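Putting the two branches together, the accession-prefixing helper might end up shaped like the sketch below; the function name prefix_identifier and the pass-through fallback are assumptions for illustration, not part of the PR:

def prefix_identifier(identifier: str) -> str:
    # Map RefSeq (GCF_) and GenBank (GCA_) assembly accessions to
    # CURIE-style prefixed identifiers; pass anything else through unchanged
    if identifier.startswith("GCF_"):
        return f"insdc.gcf:{identifier}"
    if identifier.startswith("GCA_"):
        return f"insdc.gca:{identifier}"
    return identifier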

@@ -0,0 +1,710 @@
import json
Collaborator

Before you start doing any refactoring, can you add in an integration test that checks the results of parsing the JSON data into all 8 CDM tables? Let me know when you have done that so I can take a look.
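A minimal skeleton of the requested test might look like the sketch below. parse_annotation_data, test_data_dir, and the table names are taken from snippets later in this thread; the fixture wiring, the "test_ns" namespace value, and fetching results via spark.table are assumptions (and parse_annotation_data would need to be imported from the module under review):

import json

def test_parse_annotation_data_populates_cdm_tables(spark, test_data_dir) -> None:
    # Load a saved NCBI annotation report from the test data directory
    dataset = json.load(
        (test_data_dir / "refseq" / "annotation_report.json").open()
    )
    parse_annotation_data(spark, [dataset], "test_ns")

    # Every CDM table should have been created and populated
    expected_tables = [
        "contig", "contig_x_contigcollection", "contigcollection_x_feature",
        "contigcollection_x_protein", "feature", "feature_x_protein",
        "identifier", "name",
    ]
    for table in expected_tables:
        df = spark.table(f"test_ns.{table}")
        assert df.count() > 0, f"CDM table {table} is empty"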

@ialarmedalien force-pushed the uniprot-refactor-v2 branch 2 times, most recently from 06c5508 to a84db46, on January 28, 2026 at 17:21
Comment on lines 1 to 13
from typing import List, Optional
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType
from math import isclose


def assertDataFrameSchemaEqual(df1: DataFrame, df2: DataFrame, msg: str = "") -> None:
    # Compare schemas field by field as (name, dataType) pairs
    fields1 = [(f.name, f.dataType) for f in df1.schema.fields]
    fields2 = [(f.name, f.dataType) for f in df2.schema.fields]

    assert fields1 == fields2, f"{msg}\nSchema mismatch:\n{fields1}\n!=\n{fields2}"


@ialarmedalien (Collaborator), Jan 28, 2026

This is very cool, but it already exists in the pyspark codebase:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.testing.html

You can just do this:

at the top of the file:

from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

in your test function:

def test_something() -> None:
    # set up stuff here
    assertDataFrameEqual(result_df, expected_df)

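For the schema-only check the hand-rolled helper performs, pyspark's assertSchemaEqual (shipped in pyspark.testing since Spark 3.5) can be pointed at the two DataFrames' schemas directly; a minimal sketch, with result_df and expected_df assumed to come from the test setup:

from pyspark.testing import assertSchemaEqual

def test_schemas_match(result_df, expected_df) -> None:
    # Compares field names and data types, like assertDataFrameSchemaEqual above
    assertSchemaEqual(result_df.schema, expected_df.schema)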
Comment on lines +736 to +745
expected_tables = [
"contig",
"contig_x_contigcollection",
"contigcollection_x_feature",
"contigcollection_x_protein",
"feature",
"feature_x_protein",
"identifier",
"name",
]
Collaborator

You also need the protein table -- it looks like the parser is not capturing the protein information any more.
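With the protein parsing restored, the expected list would presumably gain a protein entry, e.g.:

expected_tables = [
    "contig",
    "contig_x_contigcollection",
    "contigcollection_x_feature",
    "contigcollection_x_protein",
    "feature",
    "feature_x_protein",
    "identifier",
    "name",
    "protein",
]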

Comment on lines +727 to +732
# Load NCBI dataset from NCBI API
sample_api_response = test_data_dir / "refseq" / "annotation_report.json"
dataset = json.load(sample_api_response.open())

# Run parse function
parse_annotation_data(spark, [dataset], test_ns)
Collaborator

You have lost the lines that load the annotation_report.parsed.json file here -- restore them so that the parsed output is compared against the contents of annotation_report.parsed.json.
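A sketch of what the restored lines might look like, reusing the assertDataFrameEqual import suggested earlier; the variable names and the assumption that the parsed JSON maps table names to lists of rows are illustrative, not taken from the PR:

import json

from pyspark.testing import assertDataFrameEqual

# Load the expected parsed output alongside the raw API response
expected_file = test_data_dir / "refseq" / "annotation_report.parsed.json"
expected_data = json.load(expected_file.open())

# Compare each parsed CDM table against its expected contents
for table_name, expected_rows in expected_data.items():
    result_df = spark.table(f"{test_ns}.{table_name}")
    expected_df = spark.createDataFrame(expected_rows, schema=result_df.schema)
    assertDataFrameEqual(result_df, expected_df)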
