Update the legacy data job for processing the --importName#517
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| UnusedCode | 1 medium |
| Documentation | 11 minor |
| ErrorProne | 1 high |
| CodeStyle | 11 minor |
🟢 Metrics 0 complexity · 0 duplication
Metric Results Complexity 0 Duplication 0
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request introduces support for multi-import bulk loads and provenance-based JSON-LD exports, allowing the runner to merge configurations from subdirectories and trigger ingestion workflows with multiple imports. The review feedback highlights several critical and high-severity issues: a path-joining bug in filesystem.py that can cause FileNotFoundError due to leading slashes, potential data loss from skipping observations with empty or NULL provenances, a potential NameError in the configuration loader, and performance bottlenecks related to sequential multiprocessing.Pool recreation and missing database indexes on the provenance column.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for running bulk or selective imports by scanning and merging multiple configuration files from subdirectories, updating the Dockerfile, run script, and runner to handle the new --imports flag. The review feedback identifies several critical issues, including a potential UnboundLocalError when imports are unspecified, incorrect file resolution in filesystem.py due to absolute-style paths, inconsistent indentation and fragile path replacement in _merge_configs, and unintended directory creation when opening non-existent import directories.
Updates the legacy data job to support the --imports flag.
It has backwards compatibility, if --imports is not provided, we will continue looking for the config.json in the input/dir.
Otherwise, we will be looking in the subfolders provided.
CLI is updated in datacommonsorg/datacommons#99
Note that this supports following thepath in GCS to make the provenance name, but it does not yet support per-import feature yet. We currently hack te provenance name of multiple different imports. The work needed to do this blew up the PR, so I am now splitting it up. This is JUST the import job changes for support a --imports flag and fetching the configs.
The follow up PR will add full support for multiple individual imports.