Skip to content

Update the legacy data job for processing the --importName#517

Merged
gmechali merged 14 commits into
masterfrom
feat/cli-import-flag
Jun 3, 2026
Merged

Update the legacy data job for processing the --importName#517
gmechali merged 14 commits into
masterfrom
feat/cli-import-flag

Conversation

@gmechali
Copy link
Copy Markdown
Contributor

@gmechali gmechali commented Jun 1, 2026

Updates the legacy data job to support the --imports flag.
It has backwards compatibility, if --imports is not provided, we will continue looking for the config.json in the input/dir.
Otherwise, we will be looking in the subfolders provided.

CLI is updated in datacommonsorg/datacommons#99

Note that this supports following thepath in GCS to make the provenance name, but it does not yet support per-import feature yet. We currently hack te provenance name of multiple different imports. The work needed to do this blew up the PR, so I am now splitting it up. This is JUST the import job changes for support a --imports flag and fetching the configs.

The follow up PR will add full support for multiple individual imports.

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Jun 1, 2026

Not up to standards ⛔

🔴 Issues 1 high · 1 medium · 22 minor

Alerts:
⚠ 24 issues (≤ 0 issues of at least minor severity)

Results:
24 new issues

Category Results
UnusedCode 1 medium
Documentation 11 minor
ErrorProne 1 high
CodeStyle 11 minor

View in Codacy

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for multi-import bulk loads and provenance-based JSON-LD exports, allowing the runner to merge configurations from subdirectories and trigger ingestion workflows with multiple imports. The review feedback highlights several critical and high-severity issues: a path-joining bug in filesystem.py that can cause FileNotFoundError due to leading slashes, potential data loss from skipping observations with empty or NULL provenances, a potential NameError in the configuration loader, and performance bottlenecks related to sequential multiprocessing.Pool recreation and missing database indexes on the provenance column.

Comment thread simple/util/filesystem.py Outdated
Comment thread simple/stats/runner.py Outdated
Comment thread simple/stats/runner.py
Comment thread simple/stats/runner.py Outdated
Comment thread simple/stats/runner.py Outdated
Comment thread simple/stats/jsonld_exporter.py Outdated
@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Jun 1, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for running bulk or selective imports by scanning and merging multiple configuration files from subdirectories, updating the Dockerfile, run script, and runner to handle the new --imports flag. The review feedback identifies several critical issues, including a potential UnboundLocalError when imports are unspecified, incorrect file resolution in filesystem.py due to absolute-style paths, inconsistent indentation and fragile path replacement in _merge_configs, and unintended directory creation when opening non-existent import directories.

Comment thread simple/stats/runner.py Outdated
Comment thread simple/util/filesystem.py Outdated
Comment thread simple/stats/runner.py
Comment thread simple/stats/runner.py
@gmechali gmechali requested a review from dwnoble June 1, 2026 16:43
Comment thread simple/stats/runner.py
Comment thread simple/stats/runner.py
Comment thread simple/stats/runner.py Outdated
Comment thread simple/stats/main.py Outdated
@gmechali gmechali requested a review from dwnoble June 1, 2026 18:38
@gmechali gmechali merged commit 856db85 into master Jun 3, 2026
9 of 10 checks passed
@gmechali gmechali deleted the feat/cli-import-flag branch June 3, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants