Fix registry incremental provider flag by DhirenMhatre · Pull Request #3 · test-org-codity/airflow1

DhirenMhatre · 2026-05-16T16:22:54Z

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
When adding dependency, check compliance with the ASF 3rd Party License Policy.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

The `--provider` flag was only passed to `extract_metadata.py` but not to `extract_parameters.py` or `extract_connections.py`. This caused incremental builds to scan all 99 providers and 1625 modules instead of just the requested one.

The registry workflow was building the CI image from scratch every run (~24 min) because it lacked the BuildKit mount cache that ci-image-build.yml provides. Inline `breeze ci-image build` with registry cache doesn't help because Docker layer cache invalidates on every commit when the build context changes. Split into two jobs following the established pattern used by ci-amd-arm.yml and update-constraints-on-push.yml: - `build-ci-image`: calls ci-image-build.yml which handles mount cache restore, ghcr.io login, registry cache, and image stashing - `build-and-publish-registry`: restores the stashed image via prepare_breeze_and_image action, then runs the rest unchanged

extract_parameters.py with --provider intentionally skips writing modules.json (only the targeted provider's parameters are extracted). The merge script assumed modules.json always exists, causing a FileNotFoundError during incremental builds. Handle missing new_modules_path the same way missing existing_modules_path is already handled: treat it as an empty list.

The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.

Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.

The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).

… on S3 Eleventy pagination templates emit empty fallback JSON for every provider, even when only one provider's data was extracted. A plain `aws s3 sync` uploads those stubs and overwrites real connection/parameter data. Changes: - Exclude per-provider connections.json and parameters.json from the main S3 sync during incremental builds, then selectively upload only the target provider's API files - Filter connections early in extract_connections.py (before the loop) and support space-separated multi-provider IDs - Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI - Document the Eleventy pagination limitation in README and AGENTS.md

The previous exclude only covered connections.json and parameters.json, but modules.json and versions.json for non-target providers also contain incomplete data (no version info extracted) and would overwrite correct data on S3. Simplify to exclude the entire api/providers/* subtree and selectively upload only the target provider's directory.

Non-target provider pages are rebuilt without connection/parameter data (the version-specific extraction files don't exist locally). Without this exclude, the incremental build overwrites complete HTML pages on S3 with versions missing the connection builder section.

The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.

codity-dm · 2026-05-16T16:23:04Z

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 4 code file(s) changed but no test files added

To merge this PR:

Address the failed checks listed above
Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

codity-dm · 2026-05-16T16:23:11Z

PR Summary

What Changed

The CI workflow now supports incremental provider-specific builds by syncing only the target provider to S3.
Registry extraction now accepts a --provider flag to limit extraction to a single provider.
The merge step safely handles partial builds by preserving existing provider data when new data is missing.

Key Changes by Area

CI Workflow: registry-build.yml uses a reusable job and adds selective S3 sync to avoid overwriting unrelated provider data during incremental builds.

Registry Extraction: extract_data in registry_commands.py now passes --provider to all extraction scripts, enabling targeted extraction.

Registry Merging: merge_registry_data.py handles missing new_modules.json gracefully and retains existing modules for non-updated providers.

CLI Tools: extract_connections.py now accepts space-separated provider IDs for filtering.

Files Changed

File	Changes Summary
`.github/workflows/registry-build.yml`	Added selective S3 sync for incremental builds using `--exclude` and targeted re-upload
`dev/breeze/src/airflow_breeze/commands/registry_commands.py`	Updated `extract_data` to pass `--provider` flag to extraction scripts
`dev/registry/extract_connections.py`	Added support for space-separated provider IDs
`dev/registry/merge_registry_data.py`	Added logic to handle missing `new_modules.json` and preserve existing modules
`dev/registry/tests/test_merge_registry_data.py`	Updated tests to cover partial build scenarios
`registry/pnpm-workspace.yaml`	Updated `liquidjs` and `markdown-it` version overrides
`registry/pnpm-lock.yaml`	Reflects updated dependencies from workspace changes
`registry/package.json`	Updated dependency versions for `liquidjs` and `markdown-it`
`registry/AGENTS.md`, `registry/README.md`	Minor updates to reflect registry behavior changes (no functional impact)

Review Focus Areas

Verify S3 sync logic in registry-build.yml correctly excludes and re-uploads only the target provider
Confirm merge_registry_data.py preserves existing provider data when new_modules.json is absent
Check that extract_connections.py handles empty or invalid provider ID lists

Architecture

Design Decisions: Incremental builds avoid full re-extraction by relying on S3 state and merging logic that tolerates missing files. This trades simplicity for correctness—partial builds assume the target provider’s data is complete and valid.

Scalability & Extensibility: The --provider flag pattern supports future parallelization of provider builds, though parallel safety is not yet implemented.

Risks: If a provider’s extraction fails mid-run, its stub data could overwrite existing data in S3. This is mitigated by the merge step’s handling of missing new_modules.json, but reviewers should confirm the sync logic never uploads partial or incomplete files.

Merge Status

MERGEABLE — PR Score 67/100, above threshold (50). All gates passed.

codity-dm · 2026-05-16T16:31:34Z

+            for pid in ${PROVIDER}; do
+              aws s3 sync "registry/_site/api/providers/${pid}/" \
+                "${S3_BUCKET}api/providers/${pid}/" \
+                --cache-control "${REGISTRY_CACHE_CONTROL}"
+              aws s3 sync "registry/_site/providers/${pid}/" \
+                "${S3_BUCKET}providers/${pid}/" \
+                --cache-control "${REGISTRY_CACHE_CONTROL}"
+            done


The for pid in ${PROVIDER} loop (line 277) is unquoted word-splitting on the input string, which is intentional for space-separated IDs, but if a provider ID contains special shell characters or glob patterns it will expand unexpectedly; use read -ra or quote-safe splitting instead.

Suggested fix

if [[ -n "${PROVIDER}" ]]; then read -ra PROVIDER_IDS <<< "${PROVIDER}" for pid in "${PROVIDER_IDS[@]}"; do aws s3 sync "registry/_site/api/providers/${pid}/" \ "${S3_BUCKET}api/providers/${pid}/" \ --cache-control "${REGISTRY_CACHE_CONTROL}" aws s3 sync "registry/_site/providers/${pid}/" \ "${S3_BUCKET}providers/${pid}/" \ --cache-control "${REGISTRY_CACHE_CONTROL}" done fi

Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert bash developer with deep knowledge of security, performance, and best practices. ### Context File: .github/workflows/registry-build.yml Lines: 277-284 Issue Type: robustness-medium Severity: medium Issue Description: The `for pid in ${PROVIDER}` loop (line 277) is unquoted word-splitting on the input string, which is intentional for space-separated IDs, but if a provider ID contains special shell characters or glob patterns it will expand unexpectedly; use `read -ra` or quote-safe splitting instead. Current Code: if [[ -n "${PROVIDER}" ]]; then for pid in ${PROVIDER}; do aws s3 sync "registry/_site/api/providers/${pid}/" \ "${S3_BUCKET}api/providers/${pid}/" \ --cache-control "${REGISTRY_CACHE_CONTROL}" aws s3 sync "registry/_site/providers/${pid}/" \ "${S3_BUCKET}providers/${pid}/" \ --cache-control "${REGISTRY_CACHE_CONTROL}" done fi --- ### Instructions 1. Fix the issue described above 2. Maintain the exact indentation and code style from the original 3. Follow bash best practices and language-specific idioms 4. Ensure the fix addresses the root cause, not just the symptoms 5. Add brief inline comments explaining the fix if needed ### Constraints - Do not change functionality beyond fixing the identified issue - Preserve existing variable names and function signatures unless they are part of the problem - Ensure the fix is production-ready ---

codity-dm · 2026-05-16T16:31:54Z

Security Scan Summary

Metric	Value
Vulnerabilities	Critical: 0
Overall Risk	Clean
Files Scanned	10

No critical security issues detected

Scan completed in 21.2s

Security scan powered by Codity.ai

codity-dm · 2026-05-16T16:36:09Z

License Compliance Scan

Metric	Value
Packages Scanned	217
High Risk (Strong Copyleft)	0
Medium Risk (Weak Copyleft)	3
Low Risk (Permissive)	211
Unknown License	3

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 3 packages

EPL-2.0 (1 packages):

elkjs 0.11.1

MPL-2.0 (2 packages):

github.com/hashicorp/go-plugin 1.7.0
github.com/hashicorp/yamux 0.1.2

Unknown Licenses - 3 packages

{% if params and params.environ and params.environ 'templated_unit_test' %}
python-dateutil 2.8.1 # including inline comments
com.google.cloud:google-cloud-pubsub ${google.cloud.version}

Powered by Codity.ai · Docs

codity-dm · 2026-05-16T16:36:35Z

Code Quality Report — test-org-codity/airflow1 · PR #3

Scanned: 2026-05-16 16:36 UTC | Score: 49/100 | Provider: github

Executive Summary

Severity	Count
Critical	0
High	1
Medium	4
Low	15

Top Findings

[CQ-LLM-001] `.github/workflows/registry-build.yml:54` (Complexity · HIGH)

Issue: The 'build-and-publish-registry' job has a high cyclomatic complexity due to multiple conditional checks and nested structures.
Suggestion: Consider breaking down the job into smaller, more manageable jobs or steps to reduce complexity.

jobs:
  build-and-publish-registry:
    timeout-minutes: 30
    name: "Build & Publish Registry"

[CQ-LLM-002] `.github/workflows/registry-build.yml:86` (Duplication · MEDIUM)

Issue: The environment variables for 'build-and-publish-registry' and 'build-ci-image' jobs have significant duplication.
Suggestion: Extract common environment variables into a reusable section to adhere to DRY principles.

env:
      EXISTING_REGISTRY_DIR: /tmp/existing-registry
      REGISTRY_DATA_DIR: dev/registry

[CQ-LLM-003] `.github/workflows/registry-build.yml:203` (Error_Handling · MEDIUM)

Issue: The script does not handle potential errors when copying files, which could lead to silent failures.
Suggestion: Add error handling to check if the copy operations succeed and log appropriate messages.

cp -r "${VERSIONS_SRC}/"* "${REGISTRY_SITE_VERSIONS_DIR}/"

[CQ-LLM-005] `.github/workflows/registry-build.yml:203` (Testability · MEDIUM)

Issue: The use of hard-coded paths and values makes the script less testable.
Suggestion: Consider using variables or parameters for paths and values to improve testability.

EXISTING_REGISTRY_DIR: /tmp/existing-registry

[CQ-LLM-006] `dev/registry/extract_connections.py:212` (Maintainability · MEDIUM)

Issue: The variable 'provider_filter' is not clearly named and may lead to confusion.
Suggestion: Rename 'provider_filter' to something more descriptive, such as 'filtered_providers'.

provider_filter: set[str] | None = None

[CQ-LLM-004] `.github/workflows/registry-build.yml:54` (Documentation · LOW)

Issue: Missing documentation for the new 'build-ci-image' job.
Suggestion: Add comments or documentation to explain the purpose and functionality of the 'build-ci-image' job.

build-ci-image:
    name: "Build CI image"

[CQ-008] `.github/workflows/registry-build.yml:82` (Maintainability · LOW)

Issue: Magic number 22 in code
Suggestion: Extract to a named constant

runners: '["ubuntu-22.04"]'

[CQ-008] `.github/workflows/registry-build.yml:96` (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

timeout-minutes: 30

[CQ-008] `.github/workflows/registry-build.yml:110` (Maintainability · LOW)

Issue: Magic number 300 in code
Suggestion: Extract to a named constant

REGISTRY_CACHE_CONTROL: public, max-age=300

[CQ-008] `dev/registry/tests/test_merge_registry_data.py:250` (Maintainability · LOW)

Issue: Magic number 2024 in code
Suggestion: Extract to a named constant

_provider("amazon", "Amazon", "2024-01-01"),

Per-File Breakdown

File	High	Medium	Low	Total
`.github/workflows/registry-build.yml`	1	3	4	8
`dev/registry/extract_connections.py`	0	1	0	1
`dev/registry/tests/test_merge_registry_data.py`	0	0	7	7
`registry/pnpm-lock.yaml`	0	0	4	4

Recommendations

Resolve High severity issues, especially error handling gaps and performance bottlenecks.

Run automated tests after applying fixes to verify no regressions.

greptile-apps · 2026-05-16T16:39:54Z

Greptile Summary

This PR fixes the incremental registry build pipeline by ensuring the --provider flag is passed to all three extraction scripts, adds selective S3 sync to avoid overwriting real provider data with Eleventy's empty stubs, and fixes pnpm workspace configuration so security overrides are applied correctly.

Incremental provider flag propagation: registry_commands.py now passes provider_flag to extract_parameters.py and extract_connections.py; extract_connections.py is updated to handle space-separated provider IDs, but extract_parameters.py is not — this causes multi-provider incremental builds to fail.
Selective S3 sync: The main sync excludes api/providers/* and providers/* for incremental builds, with a follow-up per-provider sync to upload only the target provider's files, preventing empty Eleventy stubs from overwriting live data.
pnpm workspace fix: Adding packages: ['.'] to pnpm-workspace.yaml activates the existing security overrides, downgrading minimatch/brace-expansion/balanced-match to their correct patched versions.

Confidence Score: 3/5

Safe to merge for single-provider incremental builds; multi-provider builds will fail at the extract-data step.

The core fix works correctly when a single provider ID is specified. However, extract_parameters.py was not updated to split space-separated provider IDs the way extract_connections.py was — passing PROVIDER='amazon google' causes extract_parameters.py to exit with a not-found error while extract_connections.py handles it correctly.

dev/registry/extract_parameters.py — the space-separated provider filter added to extract_connections.py was not applied here, leaving multi-provider incremental builds broken.

Important Files Changed

Filename	Overview
.github/workflows/registry-build.yml	Splits the single job into `build-ci-image` + `build-and-publish-registry`, adds selective S3 sync for incremental builds, and improves shell guards for empty directories.
dev/breeze/src/airflow_breeze/commands/registry_commands.py	Adds `provider_flag` to `extract_parameters.py` and `extract_connections.py` invocations. Single-provider builds are fixed; multi-provider builds expose a mismatch with `extract_parameters.py`.
dev/registry/extract_connections.py	Moves provider filtering upstream and adds space-separated ID support. Logic is correct and consistent with extract_metadata.py.
dev/registry/merge_registry_data.py	Adds a guard for a missing `new_modules_path` (skipped in `--provider` mode), treating it as an empty list. Safe and covered by the new test.
dev/registry/tests/test_merge_registry_data.py	Adds `test_missing_new_modules_file` covering the incremental-build edge case where modules.json is absent.
registry/pnpm-workspace.yaml	Adds `packages: ['.']` so pnpm applies the existing security overrides, resolving older patched versions of `minimatch`/`brace-expansion`/`balanced-match`.
registry/package.json	Adds `--project ../dev/registry` to the `uv run` prebuild script so the correct Python project context is used.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[registry-build.yml triggered] --> B{build-ci-image if: workflow_call OR allowlist}
    B -->|skipped| Z[All jobs skipped]
    B -->|runs| C[build-and-publish-registry]
    C --> D[Prepare breeze and CI image]
    D --> E{inputs.provider set?}
    E -->|yes incremental| F[Download existing data from S3]
    E -->|no full build| G[extract-data all providers]
    F --> H[extract-data --provider 'amazon google']
    H --> H1[extract_metadata.py OK]
    H --> H2[extract_parameters.py FAILS on multi-ID]
    H --> H3[extract_connections.py OK space-split]
    H1 & H2 & H3 --> I[merge_registry_data.py]
    G --> J[Copy output to registry src _data]
    I --> J
    J --> K[pnpm build Eleventy]
    K --> L[aws s3 sync main exclude api+providers subtrees]
    L --> M{incremental?}
    M -->|yes| N[per-pid sync api and html]
    M -->|no| O[sync pagefind with delete]
    N --> O
    O --> P[publish-versions]

_{Reviews (1): Last reviewed commit: "Re-include providers/index.html in incre..." | Re-trigger Greptile}

kaxil added 11 commits March 17, 2026 00:26

Fix registry workflow failures due to workspace dependency resolution

b8bc2dd

Fix /mnt not writable when loading stashed CI image

f78dc02

The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.

Regenerate pnpm lockfile for workspace mode

374d89d

Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.

Scope prebuild uv resolution to dev/registry project

014a317

The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).

Re-include providers/index.html in incremental S3 sync

fce97bc

The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.

codity-dm Bot reviewed May 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix registry incremental provider flag#3

Fix registry incremental provider flag#3
DhirenMhatre wants to merge 11 commits into
mainfrom
fix-registry-incremental-provider-flag

DhirenMhatre commented May 16, 2026

Uh oh!

codity-dm Bot commented May 16, 2026

Uh oh!

codity-dm Bot commented May 16, 2026 •

edited

Loading

Uh oh!

codity-dm Bot May 16, 2026

Uh oh!

codity-dm Bot commented May 16, 2026 •

edited

Loading

Uh oh!

codity-dm Bot commented May 16, 2026

Uh oh!

codity-dm Bot commented May 16, 2026

[CQ-LLM-001] `.github/workflows/registry-build.yml:54` (Complexity · HIGH)

[CQ-LLM-002] `.github/workflows/registry-build.yml:86` (Duplication · MEDIUM)

[CQ-LLM-003] `.github/workflows/registry-build.yml:203` (Error_Handling · MEDIUM)

[CQ-LLM-005] `.github/workflows/registry-build.yml:203` (Testability · MEDIUM)

[CQ-LLM-006] `dev/registry/extract_connections.py:212` (Maintainability · MEDIUM)

[CQ-LLM-004] `.github/workflows/registry-build.yml:54` (Documentation · LOW)

[CQ-008] `.github/workflows/registry-build.yml:82` (Maintainability · LOW)

[CQ-008] `.github/workflows/registry-build.yml:96` (Maintainability · LOW)

[CQ-008] `.github/workflows/registry-build.yml:110` (Maintainability · LOW)

[CQ-008] `dev/registry/tests/test_merge_registry_data.py:250` (Maintainability · LOW)

Uh oh!

greptile-apps Bot commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DhirenMhatre commented May 16, 2026

Was generative AI tooling used to co-author this PR?

Uh oh!

codity-dm Bot commented May 16, 2026

Policy Check Failed

Uh oh!

codity-dm Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

What Changed

Key Changes by Area

Files Changed

Review Focus Areas

Architecture

Merge Status

Uh oh!

codity-dm Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codity-dm Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Security Scan Summary

Scan completed in 21.2s

Uh oh!

codity-dm Bot commented May 16, 2026

License Compliance Scan

Uh oh!

codity-dm Bot commented May 16, 2026

Code Quality Report — test-org-codity/airflow1 · PR #3

Executive Summary

[CQ-LLM-001] .github/workflows/registry-build.yml:54 (Complexity · HIGH)

[CQ-LLM-002] .github/workflows/registry-build.yml:86 (Duplication · MEDIUM)

[CQ-LLM-003] .github/workflows/registry-build.yml:203 (Error_Handling · MEDIUM)

[CQ-LLM-005] .github/workflows/registry-build.yml:203 (Testability · MEDIUM)

[CQ-LLM-006] dev/registry/extract_connections.py:212 (Maintainability · MEDIUM)

[CQ-LLM-004] .github/workflows/registry-build.yml:54 (Documentation · LOW)

[CQ-008] .github/workflows/registry-build.yml:82 (Maintainability · LOW)

[CQ-008] .github/workflows/registry-build.yml:96 (Maintainability · LOW)

[CQ-008] .github/workflows/registry-build.yml:110 (Maintainability · LOW)

[CQ-008] dev/registry/tests/test_merge_registry_data.py:250 (Maintainability · LOW)

Per-File Breakdown

Recommendations

Uh oh!

greptile-apps Bot commented May 16, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codity-dm Bot commented May 16, 2026 •

edited

Loading

codity-dm Bot commented May 16, 2026 •

edited

Loading

[CQ-LLM-001] `.github/workflows/registry-build.yml:54` (Complexity · HIGH)

[CQ-LLM-002] `.github/workflows/registry-build.yml:86` (Duplication · MEDIUM)

[CQ-LLM-003] `.github/workflows/registry-build.yml:203` (Error_Handling · MEDIUM)

[CQ-LLM-005] `.github/workflows/registry-build.yml:203` (Testability · MEDIUM)

[CQ-LLM-006] `dev/registry/extract_connections.py:212` (Maintainability · MEDIUM)

[CQ-LLM-004] `.github/workflows/registry-build.yml:54` (Documentation · LOW)

[CQ-008] `.github/workflows/registry-build.yml:82` (Maintainability · LOW)

[CQ-008] `.github/workflows/registry-build.yml:96` (Maintainability · LOW)

[CQ-008] `.github/workflows/registry-build.yml:110` (Maintainability · LOW)

[CQ-008] `dev/registry/tests/test_merge_registry_data.py:250` (Maintainability · LOW)