
Conversation

@alinakbase (Collaborator)

No description provided.

import argparse
import json
import os
from pathlib import Path
from typing import Optional

# Clean up the temporary file, ignoring errors if removal fails
if os.path.exists(tmp_path):
    try:
        os.remove(tmp_path)
    except Exception:
        pass
@@ -0,0 +1,734 @@
import json
from pathlib import Path
@ialarmedalien changed the base branch from develop to uniprot-refactor-v2 on January 21, 2026 at 19:07
if identifier.startswith("GCF_"):
return f"insdc.gcf:{identifier}"
Collaborator
Let's also add in:

if identifier.startswith("GCA_"):
    return f"insdc.gca:{identifier}"
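Putting the two branches together, the accession-prefixing helper might end up shaped like the sketch below; the function name prefix_identifier and the pass-through fallback are assumptions for illustration, not part of the PR:

def prefix_identifier(identifier: str) -> str:
    # Map RefSeq (GCF_) and GenBank (GCA_) assembly accessions to
    # CURIE-style prefixed identifiers; pass anything else through unchanged
    if identifier.startswith("GCF_"):
        return f"insdc.gcf:{identifier}"
    if identifier.startswith("GCA_"):
        return f"insdc.gca:{identifier}"
    return identifier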

@@ -0,0 +1,710 @@
import json
Collaborator

Before you start doing any refactoring, can you add in an integration test that checks the results of parsing the JSON data into all 8 CDM tables? Let me know when you have done that so I can take a look.
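A minimal skeleton of the requested test might look like the sketch below. parse_annotation_data, test_data_dir, and the table names are taken from snippets later in this thread; the fixture wiring, the "test_ns" namespace value, and fetching results via spark.table are assumptions (and parse_annotation_data would need to be imported from the module under review):

import json

def test_parse_annotation_data_populates_cdm_tables(spark, test_data_dir) -> None:
    # Load a saved NCBI annotation report from the test data directory
    dataset = json.load(
        (test_data_dir / "refseq" / "annotation_report.json").open()
    )
    parse_annotation_data(spark, [dataset], "test_ns")

    # Every CDM table should have been created and populated
    expected_tables = [
        "contig", "contig_x_contigcollection", "contigcollection_x_feature",
        "contigcollection_x_protein", "feature", "feature_x_protein",
        "identifier", "name",
    ]
    for table in expected_tables:
        df = spark.table(f"test_ns.{table}")
        assert df.count() > 0, f"CDM table {table} is empty"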

@ialarmedalien force-pushed the uniprot-refactor-v2 branch 2 times, most recently from 06c5508 to a84db46, on January 28, 2026 at 17:21
Comment on lines 1 to 13
from typing import List, Optional
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType
from math import isclose


def assertDataFrameSchemaEqual(df1: DataFrame, df2: DataFrame, msg: str = "") -> None:
    # Compare schemas field by field as (name, dataType) pairs
    fields1 = [(f.name, f.dataType) for f in df1.schema.fields]
    fields2 = [(f.name, f.dataType) for f in df2.schema.fields]

    assert fields1 == fields2, f"{msg}\nSchema mismatch:\n{fields1}\n!=\n{fields2}"


@ialarmedalien (Collaborator), Jan 28, 2026

This is very cool, but it already exists in the pyspark codebase:

https://spark.apache.org/docs/latest/api/python/reference/pyspark.testing.html

You can just do this:

at the top of the file:

from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

in your test function:

def test_something() -> None:
    # set up stuff here
    assertDataFrameEqual(result_df, expected_df)

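For the schema-only check the hand-rolled helper performs, pyspark's assertSchemaEqual (shipped in pyspark.testing since Spark 3.5) can be pointed at the two DataFrames' schemas directly; a minimal sketch, with result_df and expected_df assumed to come from the test setup:

from pyspark.testing import assertSchemaEqual

def test_schemas_match(result_df, expected_df) -> None:
    # Compares field names and data types, like assertDataFrameSchemaEqual above
    assertSchemaEqual(result_df.schema, expected_df.schema)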
Comment on lines +736 to +745
expected_tables = [
"contig",
"contig_x_contigcollection",
"contigcollection_x_feature",
"contigcollection_x_protein",
"feature",
"feature_x_protein",
"identifier",
"name",
]
Collaborator

You also need the protein table -- it looks like the parser is not capturing the protein information any more.
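With the protein parsing restored, the expected list would presumably gain a protein entry, e.g.:

expected_tables = [
    "contig",
    "contig_x_contigcollection",
    "contigcollection_x_feature",
    "contigcollection_x_protein",
    "feature",
    "feature_x_protein",
    "identifier",
    "name",
    "protein",
]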

Comment on lines +727 to +732
# Load NCBI dataset from NCBI API
sample_api_response = test_data_dir / "refseq" / "annotation_report.json"
dataset = json.load(sample_api_response.open())

# Run parse function
parse_annotation_data(spark, [dataset], test_ns)
Collaborator

You have lost the lines that load the annotation_report.parsed.json file here -- restore them so that the parsed output is compared against the contents of annotation_report.parsed.json.
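A sketch of what the restored lines might look like, reusing the assertDataFrameEqual import suggested earlier; the variable names and the assumption that the parsed JSON maps table names to lists of rows are illustrative, not taken from the PR:

import json

from pyspark.testing import assertDataFrameEqual

# Load the expected parsed output alongside the raw API response
expected_file = test_data_dir / "refseq" / "annotation_report.parsed.json"
expected_data = json.load(expected_file.open())

# Compare each parsed CDM table against its expected contents
for table_name, expected_rows in expected_data.items():
    result_df = spark.table(f"{test_ns}.{table_name}")
    expected_df = spark.createDataFrame(expected_rows, schema=result_df.schema)
    assertDataFrameEqual(result_df, expected_df)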
