PXP-10845: Add utilities for dsv file operations and manipulation #179
MaribelleHGomez wants to merge 3 commits into master from
Conversation
The style in this PR agrees with `black`. This formatting comment was generated automatically by a script in uc-cdis/wool.
```python
output_writer.writerow(record)


def convert_type(file_name: str, new_type: str, has_headers=False):
```
So is this only changing the extension? For a TSV to a CSV, wouldn't we need to replace the tabs with commas?
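One way the reviewer's concern could be addressed is to rewrite the records with the new delimiter instead of only renaming the file. This is a hypothetical standalone sketch, not the PR's actual implementation; it assumes only `csv`/`tsv` types and uses Python's `csv` module:

```python
import csv
import os

# Delimiters for the supported DSV types (assumption: only csv and tsv).
DELIMITERS = {"csv": ",", "tsv": "\t"}


def convert_type(file_name: str, new_type: str, has_headers=False):
    """Convert a DSV file to a new type, rewriting the delimiter too.

    has_headers is accepted for signature parity with the PR, but the
    header row converts like any other row, so it is not needed here.
    """
    base, old_ext = os.path.splitext(file_name)
    old_delimiter = DELIMITERS[old_ext.lstrip(".").lower()]
    new_file_name = f"{base}.{new_type}"
    with open(file_name, newline="") as infile, open(
        new_file_name, "w", newline=""
    ) as outfile:
        reader = csv.reader(infile, delimiter=old_delimiter)
        writer = csv.writer(outfile, delimiter=DELIMITERS[new_type])
        writer.writerows(reader)
    return new_file_name
```

Renaming alone would leave tab-separated content in a `.csv` file, which most downstream readers would parse as a single column.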
```python
write(records, new_file_name)


def merge_files(files: Union[str, list], output_file_name: str, has_headers=False):
```
What if the files have different headers or are different file types? I think we should check for that.
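The check the reviewer is asking for could be sketched like this. The validation itself is hypothetical (not in the PR), and `get_headers` here is a local stand-in for the PR's helper of the same name:

```python
import csv
import os


def get_headers(file_name: str):
    # Stand-in for the PR's get_headers: read the first row,
    # inferring the delimiter from the file extension.
    delimiter = "\t" if file_name.lower().endswith(".tsv") else ","
    with open(file_name, newline="") as f:
        return next(csv.reader(f, delimiter=delimiter))


def validate_merge_inputs(files, has_headers=False):
    """Raise ValueError when files differ in extension or header row."""
    extensions = {os.path.splitext(f)[1].lower() for f in files}
    if len(extensions) > 1:
        raise ValueError(f"cannot merge mixed file types: {extensions}")
    if has_headers:
        headers = {tuple(get_headers(f)) for f in files}
        if len(headers) > 1:
            raise ValueError(f"cannot merge differing headers: {headers}")
```

Failing fast with a clear error seems preferable to silently concatenating rows whose columns do not line up.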
```python
    Args:
        file_name(str):
            Filename or file path
    Returns:
```
Could you add an example in the description here?
```python
    has_headers=False,
):
    """
    Chunk records into files of input size
```
I think this could use a better description, something like "chunk manifests into x different chunks", and maybe the logs could add how many records there will be per manifest.
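The suggested logging might look something like this. `chunk_records` is a hypothetical, self-contained stand-in that works on in-memory records rather than the PR's file-based `chunk`:

```python
import logging

logging.basicConfig(level=logging.INFO)


def chunk_records(records, chunk_size):
    """Split records into chunks of at most chunk_size, logging the plan
    (how many chunks, how many records per chunk) as the review suggests."""
    chunks = [
        records[i : i + chunk_size] for i in range(0, len(records), chunk_size)
    ]
    logging.info(
        "Chunking %d records into %d file(s) of up to %d record(s) each",
        len(records),
        len(chunks),
        chunk_size,
    )
    return chunks
```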
```python
def chunk(
    file_name: str,
    chunk_size: int,
```
Oh, actually a fun idea would be adding another input: number of records in a manifest. So you're required to give either a number of chunks or a number of records per chunk, and we do the math and create chunks according to that. If both are given, maybe do the math of whether or not those chunk and record-per-chunk numbers are compatible.
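The compatibility math described above could be sketched like this. `resolve_chunking` is a hypothetical helper, not part of the PR:

```python
import math


def resolve_chunking(total_records, num_chunks=None, records_per_chunk=None):
    """Given exactly one of num_chunks / records_per_chunk, derive the other;
    if both are given, verify they are compatible."""
    if num_chunks is None and records_per_chunk is None:
        raise ValueError("provide num_chunks or records_per_chunk")
    if records_per_chunk is None:
        # Derive records per chunk from the requested chunk count.
        records_per_chunk = math.ceil(total_records / num_chunks)
    elif num_chunks is None:
        # Derive the chunk count from the requested records per chunk.
        num_chunks = math.ceil(total_records / records_per_chunk)
    elif math.ceil(total_records / records_per_chunk) != num_chunks:
        raise ValueError("num_chunks and records_per_chunk are incompatible")
    return num_chunks, records_per_chunk
```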
```python
if has_headers:
    file_one_headers = get_headers(file_one)
    file_two_headers = get_headers(file_two)
    if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
```
I guess I don't understand what is going on here. If we have headers:

```python
A = ["a", "b", "c"]
B = ["b", "c", "a"]
print(A == B)  # False
print(sorted(A) == sorted(B))  # True
```

So I see you're moving the header. I think you should add logs notifying that the columns aren't arranged correctly and that we're moving the columns.
```python
if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
    file_two_headers
):
    list_two = move_columns(file_one_headers, file_two)
```
This should be a `sorted(A) != sorted(B)`.
```python
else:
    list_two = file_to_list(file_two, has_headers)


if not strict:
```
I think you need to handle cases where `filename` and `file_name` are the same thing, and this code should be smart enough to figure that out. Luckily there are functions in the SDK that already handle that. Check here: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/tools/utils.py
Maybe right after we open up the files we standardize the header formats.
```python
logging.info(f"Taking intersection between {file_one} & {file_two}...")

list_one = file_to_list(file_one, has_headers)
```
Maybe we should add validation for all manifests that the utils process.
```python
return [row for row in list_one if row in list_two]


# TODO
```
Can you create a ticket for this TODO and link it here as well?
Jira Ticket: PXP-10845
New Features
Breaking Changes
Bug Fixes
Improvements
Dependency updates
Deployment changes