Refseq annotation #71
base: uniprot-refactor-v2
Conversation
if identifier.startswith("GCF_"):
    return f"insdc.gcf:{identifier}"
let's also add in:

if identifier.startswith("GCA_"):
    return f"insdc.gca:{identifier}"
@@ -0,0 +1,710 @@
import json
Before you start doing any refactoring, can you add in an integration test that checks the results of parsing the JSON data into all 8 CDM tables? Let me know when you have done that so I can take a look.
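A rough sketch of what such a test could look like, assuming pytest fixtures spark, test_data_dir and test_ns and the parse_annotation_data(spark, datasets, namespace) entry point quoted later in this conversation; the idea that the parser registers each output table as {namespace}.{table} is an assumption:

import json

# parse_annotation_data is the PR's parser entry point; its import path is omitted here.
def test_parse_annotation_data_populates_cdm_tables(spark, test_data_dir, test_ns) -> None:
    sample_api_response = test_data_dir / "refseq" / "annotation_report.json"
    dataset = json.load(sample_api_response.open())

    parse_annotation_data(spark, [dataset], test_ns)

    cdm_tables = [
        "contig", "contig_x_contigcollection", "contigcollection_x_feature",
        "contigcollection_x_protein", "feature", "feature_x_protein",
        "identifier", "name",
    ]
    for table in cdm_tables:
        df = spark.table(f"{test_ns}.{table}")
        assert df.count() > 0, f"CDM table {table} is missing or empty"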
force-pushed from 16aa4cf to eab27c9
force-pushed from 06c5508 to a84db46
tests/validation/assertions.py (Outdated)
from typing import List, Optional
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType
from math import isclose


def assertDataFrameSchemaEqual(df1: DataFrame, df2: DataFrame, msg: str = "") -> None:
    fields1 = [(f.name, f.dataType) for f in df1.schema.fields]
    fields2 = [(f.name, f.dataType) for f in df2.schema.fields]

    assert fields1 == fields2, f"{msg}\nSchema mismatch:\n{fields1}\n!=\n{fields2}"
This is very cool, but it already exists in the pyspark codebase:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.testing.html
You can just do this:
at the top of the file:
from pyspark.testing import assertDataFrameEqual, assertSchemaEqual
in your test function:
def test_something() -> None:
    # set up stuff here
    assertDataFrameEqual(result_df, expected_df)
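For context, a self-contained sketch of those helpers in use; it assumes pyspark >= 3.5 (where pyspark.testing ships assertDataFrameEqual and assertSchemaEqual), and the DataFrames here are invented:

from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

spark = SparkSession.builder.master("local[1]").getOrCreate()

result_df = spark.createDataFrame([("GCF_000005845.2", "contig")], ["identifier", "type"])
expected_df = spark.createDataFrame([("GCF_000005845.2", "contig")], ["identifier", "type"])

# Row-level comparison; row order is ignored by default.
assertDataFrameEqual(result_df, expected_df)

# Schema-only comparison, covering what the hand-rolled helper did.
assertSchemaEqual(result_df.schema, expected_df.schema)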
expected_tables = [
    "contig",
    "contig_x_contigcollection",
    "contigcollection_x_feature",
    "contigcollection_x_protein",
    "feature",
    "feature_x_protein",
    "identifier",
    "name",
]
You also need the protein table -- it looks like the parser is not capturing the protein information any more.
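For reference, the list with the missing table added would look like this, assuming the CDM table is simply named "protein" in keeping with the other names:

expected_tables = [
    "contig",
    "contig_x_contigcollection",
    "contigcollection_x_feature",
    "contigcollection_x_protein",
    "feature",
    "feature_x_protein",
    "identifier",
    "name",
    "protein",
]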
# Load NCBI dataset from NCBI API
sample_api_response = test_data_dir / "refseq" / "annotation_report.json"
dataset = json.load(sample_api_response.open())

# Run parse function
parse_annotation_data(spark, [dataset], test_ns)
You have lost the lines that load the annotation_report.parsed.json file here -- please restore them so that the parser's output is compared against annotation_report.parsed.json.
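A sketch of the restored comparison; the layout of annotation_report.parsed.json (rows keyed by table name) and the way the parser exposes its tables are assumptions here, not the PR's actual code:

from pyspark.testing import assertDataFrameEqual

# Load the expected parsed output that ships alongside the raw API response.
expected_file = test_data_dir / "refseq" / "annotation_report.parsed.json"
expected_data = json.load(expected_file.open())

# Compare each table written by the parser against the expected rows.
for table, rows in expected_data.items():
    result_df = spark.table(f"{test_ns}.{table}")
    expected_df = spark.createDataFrame(rows)
    assertDataFrameEqual(result_df, expected_df)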
No description provided.