Skip to content

Fix registry incremental provider flag#3

Open
DhirenMhatre wants to merge 11 commits into
mainfrom
fix-registry-incremental-provider-flag
Open

Fix registry incremental provider flag#3
DhirenMhatre wants to merge 11 commits into
mainfrom
fix-registry-incremental-provider-flag

Conversation

@DhirenMhatre
Copy link
Copy Markdown


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

kaxil added 11 commits March 17, 2026 00:26
The `--provider` flag was only passed to `extract_metadata.py` but not
to `extract_parameters.py` or `extract_connections.py`. This caused
incremental builds to scan all 99 providers and 1625 modules instead
of just the requested one.
The registry workflow was building the CI image from scratch every run
(~24 min) because it lacked the BuildKit mount cache that
ci-image-build.yml provides. Inline `breeze ci-image build` with
registry cache doesn't help because Docker layer cache invalidates
on every commit when the build context changes.

Split into two jobs following the established pattern used by
ci-amd-arm.yml and update-constraints-on-push.yml:

- `build-ci-image`: calls ci-image-build.yml which handles mount cache
  restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via
  prepare_breeze_and_image action, then runs the rest unchanged
extract_parameters.py with --provider intentionally skips writing
modules.json (only the targeted provider's parameters are extracted).
The merge script assumed modules.json always exists, causing a
FileNotFoundError during incremental builds.

Handle missing new_modules_path the same way missing
existing_modules_path is already handled: treat it as an empty list.
The prepare_breeze_and_image action loads the CI image from /mnt, which
requires make_mnt_writeable.sh to run first. Each job gets a fresh
runner, so the writeable /mnt from the build job doesn't carry over.
Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm
processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with
--frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.
The prebuild script ran `uv run` without --project, causing uv to
resolve the full workspace including samba → krb5 which needs
libkrb5-dev (not installed on the CI runner).
… on S3

Eleventy pagination templates emit empty fallback JSON for every provider,
even when only one provider's data was extracted.  A plain `aws s3 sync`
uploads those stubs and overwrites real connection/parameter data.

Changes:
- Exclude per-provider connections.json and parameters.json from the main
  S3 sync during incremental builds, then selectively upload only the
  target provider's API files
- Filter connections early in extract_connections.py (before the loop)
  and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
The previous exclude only covered connections.json and parameters.json,
but modules.json and versions.json for non-target providers also contain
incomplete data (no version info extracted) and would overwrite correct
data on S3.  Simplify to exclude the entire api/providers/* subtree and
selectively upload only the target provider's directory.
Non-target provider pages are rebuilt without connection/parameter data
(the version-specific extraction files don't exist locally). Without
this exclude, the incremental build overwrites complete HTML pages on
S3 with versions missing the connection builder section.
The providers listing page uses merged data (all providers) and must
be updated during incremental builds — especially for new providers.
AWS CLI --include after --exclude re-includes the specific file.
@codity-dm
Copy link
Copy Markdown

codity-dm Bot commented May 16, 2026

Policy Check Failed

✗ 3/3 policy checks failed:

• Need 2 more approval(s) (0/2) — comment LGTM or approve via review
• Missing ticket reference (expected: JIRA-, ENG-, #*)
• 4 code file(s) changed but no test files added


To merge this PR:

  1. Address the failed checks listed above
  2. Ensure branch protection requires the codity/policy-check status

Configure policies in your dashboard

@codity-dm
Copy link
Copy Markdown

codity-dm Bot commented May 16, 2026

PR Summary

What Changed

  • The CI workflow now supports incremental provider-specific builds by syncing only the target provider to S3.
  • Registry extraction now accepts a --provider flag to limit extraction to a single provider.
  • The merge step safely handles partial builds by preserving existing provider data when new data is missing.

Key Changes by Area

CI Workflow: registry-build.yml uses a reusable job and adds selective S3 sync to avoid overwriting unrelated provider data during incremental builds.

Registry Extraction: extract_data in registry_commands.py now passes --provider to all extraction scripts, enabling targeted extraction.

Registry Merging: merge_registry_data.py handles missing new_modules.json gracefully and retains existing modules for non-updated providers.

CLI Tools: extract_connections.py now accepts space-separated provider IDs for filtering.

Files Changed

File Changes Summary
.github/workflows/registry-build.yml Added selective S3 sync for incremental builds using --exclude and targeted re-upload
dev/breeze/src/airflow_breeze/commands/registry_commands.py Updated extract_data to pass --provider flag to extraction scripts
dev/registry/extract_connections.py Added support for space-separated provider IDs
dev/registry/merge_registry_data.py Added logic to handle missing new_modules.json and preserve existing modules
dev/registry/tests/test_merge_registry_data.py Updated tests to cover partial build scenarios
registry/pnpm-workspace.yaml Updated liquidjs and markdown-it version overrides
registry/pnpm-lock.yaml Reflects updated dependencies from workspace changes
registry/package.json Updated dependency versions for liquidjs and markdown-it
registry/AGENTS.md, registry/README.md Minor updates to reflect registry behavior changes (no functional impact)

Review Focus Areas

  • Verify S3 sync logic in registry-build.yml correctly excludes and re-uploads only the target provider
  • Confirm merge_registry_data.py preserves existing provider data when new_modules.json is absent
  • Check that extract_connections.py handles empty or invalid provider ID lists

Architecture

Design Decisions: Incremental builds avoid full re-extraction by relying on S3 state and merging logic that tolerates missing files. This trades simplicity for correctness—partial builds assume the target provider’s data is complete and valid.

Scalability & Extensibility: The --provider flag pattern supports future parallelization of provider builds, though parallel safety is not yet implemented.

Risks: If a provider’s extraction fails mid-run, its stub data could overwrite existing data in S3. This is mitigated by the merge step’s handling of missing new_modules.json, but reviewers should confirm the sync logic never uploads partial or incomplete files.

Merge Status

MERGEABLE — PR Score 67/100, above threshold (50). All gates passed.

Comment on lines +277 to +284
for pid in ${PROVIDER}; do
aws s3 sync "registry/_site/api/providers/${pid}/" \
"${S3_BUCKET}api/providers/${pid}/" \
--cache-control "${REGISTRY_CACHE_CONTROL}"
aws s3 sync "registry/_site/providers/${pid}/" \
"${S3_BUCKET}providers/${pid}/" \
--cache-control "${REGISTRY_CACHE_CONTROL}"
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness Medium

The for pid in ${PROVIDER} loop (line 277) is unquoted word-splitting on the input string, which is intentional for space-separated IDs, but if a provider ID contains special shell characters or glob patterns it will expand unexpectedly; use read -ra or quote-safe splitting instead.

Suggested fix
          if [[ -n "${PROVIDER}" ]]; then
            read -ra PROVIDER_IDS <<< "${PROVIDER}"
            for pid in "${PROVIDER_IDS[@]}"; do
              aws s3 sync "registry/_site/api/providers/${pid}/" \
                "${S3_BUCKET}api/providers/${pid}/" \
                --cache-control "${REGISTRY_CACHE_CONTROL}"
              aws s3 sync "registry/_site/providers/${pid}/" \
                "${S3_BUCKET}providers/${pid}/" \
                --cache-control "${REGISTRY_CACHE_CONTROL}"
            done
          fi
Prompt for AI assistance

Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:

You are an expert bash developer with deep knowledge of security, performance, and best practices.

### Context

File: .github/workflows/registry-build.yml
Lines: 277-284
Issue Type: robustness-medium
Severity: medium

Issue Description:
The `for pid in ${PROVIDER}` loop (line 277) is unquoted word-splitting on the input string, which is intentional for space-separated IDs, but if a provider ID contains special shell characters or glob patterns it will expand unexpectedly; use `read -ra` or quote-safe splitting instead.

Current Code:
          if [[ -n "${PROVIDER}" ]]; then
            for pid in ${PROVIDER}; do
              aws s3 sync "registry/_site/api/providers/${pid}/" \
                "${S3_BUCKET}api/providers/${pid}/" \
                --cache-control "${REGISTRY_CACHE_CONTROL}"
              aws s3 sync "registry/_site/providers/${pid}/" \
                "${S3_BUCKET}providers/${pid}/" \
                --cache-control "${REGISTRY_CACHE_CONTROL}"
            done
          fi

---

### Instructions

1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow bash best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed

### Constraints

- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready

---


Like Dislike Create Issue Jira

@codity-dm
Copy link
Copy Markdown

codity-dm Bot commented May 16, 2026

Security Scan Summary

Metric Value
Vulnerabilities Critical: 0
Overall Risk Clean
Files Scanned 10

No critical security issues detected

Scan completed in 21.2s

Security scan powered by Codity.ai

@codity-dm
Copy link
Copy Markdown

codity-dm Bot commented May 16, 2026

License Compliance Scan

Metric Value
Packages Scanned 217
High Risk (Strong Copyleft) 0
Medium Risk (Weak Copyleft) 3
Low Risk (Permissive) 211
Unknown License 3

Weak copyleft licenses found - verify compatibility

Some packages have unknown licenses - manual review required

Medium Risk Licenses - 3 packages

EPL-2.0 (1 packages):

  • elkjs 0.11.1

MPL-2.0 (2 packages):

  • github.com/hashicorp/go-plugin 1.7.0
  • github.com/hashicorp/yamux 0.1.2
Unknown Licenses - 3 packages
  • {% if params and params.environ and params.environ 'templated_unit_test' %}
  • python-dateutil 2.8.1 # including inline comments
  • com.google.cloud:google-cloud-pubsub ${google.cloud.version}

Powered by Codity.ai · Docs

@codity-dm
Copy link
Copy Markdown

codity-dm Bot commented May 16, 2026

Code Quality Report — test-org-codity/airflow1 · PR #3

Scanned: 2026-05-16 16:36 UTC | Score: 49/100 | Provider: github

Executive Summary

Severity Count
Critical 0
High 1
Medium 4
Low 15
Top Findings

[CQ-LLM-001] .github/workflows/registry-build.yml:54 (Complexity · HIGH)

Issue: The 'build-and-publish-registry' job has a high cyclomatic complexity due to multiple conditional checks and nested structures.
Suggestion: Consider breaking down the job into smaller, more manageable jobs or steps to reduce complexity.

jobs:
  build-and-publish-registry:
    timeout-minutes: 30
    name: "Build & Publish Registry"

[CQ-LLM-002] .github/workflows/registry-build.yml:86 (Duplication · MEDIUM)

Issue: The environment variables for 'build-and-publish-registry' and 'build-ci-image' jobs have significant duplication.
Suggestion: Extract common environment variables into a reusable section to adhere to DRY principles.

env:
      EXISTING_REGISTRY_DIR: /tmp/existing-registry
      REGISTRY_DATA_DIR: dev/registry

[CQ-LLM-003] .github/workflows/registry-build.yml:203 (Error_Handling · MEDIUM)

Issue: The script does not handle potential errors when copying files, which could lead to silent failures.
Suggestion: Add error handling to check if the copy operations succeed and log appropriate messages.

cp -r "${VERSIONS_SRC}/"* "${REGISTRY_SITE_VERSIONS_DIR}/"

[CQ-LLM-005] .github/workflows/registry-build.yml:203 (Testability · MEDIUM)

Issue: The use of hard-coded paths and values makes the script less testable.
Suggestion: Consider using variables or parameters for paths and values to improve testability.

EXISTING_REGISTRY_DIR: /tmp/existing-registry

[CQ-LLM-006] dev/registry/extract_connections.py:212 (Maintainability · MEDIUM)

Issue: The variable 'provider_filter' is not clearly named and may lead to confusion.
Suggestion: Rename 'provider_filter' to something more descriptive, such as 'filtered_providers'.

provider_filter: set[str] | None = None

[CQ-LLM-004] .github/workflows/registry-build.yml:54 (Documentation · LOW)

Issue: Missing documentation for the new 'build-ci-image' job.
Suggestion: Add comments or documentation to explain the purpose and functionality of the 'build-ci-image' job.

build-ci-image:
    name: "Build CI image"

[CQ-008] .github/workflows/registry-build.yml:82 (Maintainability · LOW)

Issue: Magic number 22 in code
Suggestion: Extract to a named constant

runners: '["ubuntu-22.04"]'

[CQ-008] .github/workflows/registry-build.yml:96 (Maintainability · LOW)

Issue: Magic number 30 in code
Suggestion: Extract to a named constant

timeout-minutes: 30

[CQ-008] .github/workflows/registry-build.yml:110 (Maintainability · LOW)

Issue: Magic number 300 in code
Suggestion: Extract to a named constant

REGISTRY_CACHE_CONTROL: public, max-age=300

[CQ-008] dev/registry/tests/test_merge_registry_data.py:250 (Maintainability · LOW)

Issue: Magic number 2024 in code
Suggestion: Extract to a named constant

_provider("amazon", "Amazon", "2024-01-01"),

Per-File Breakdown

File Critical High Medium Low Total
.github/workflows/registry-build.yml 0 1 3 4 8
dev/registry/extract_connections.py 0 0 1 0 1
dev/registry/tests/test_merge_registry_data.py 0 0 0 7 7
registry/pnpm-lock.yaml 0 0 0 4 4

Recommendations

  1. Resolve High severity issues, especially error handling gaps and performance bottlenecks.
  • Run automated tests after applying fixes to verify no regressions.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR fixes the incremental registry build pipeline by ensuring the --provider flag is passed to all three extraction scripts, adds selective S3 sync to avoid overwriting real provider data with Eleventy's empty stubs, and fixes pnpm workspace configuration so security overrides are applied correctly.

  • Incremental provider flag propagation: registry_commands.py now passes provider_flag to extract_parameters.py and extract_connections.py; extract_connections.py is updated to handle space-separated provider IDs, but extract_parameters.py is not — this causes multi-provider incremental builds to fail.
  • Selective S3 sync: The main sync excludes api/providers/* and providers/* for incremental builds, with a follow-up per-provider sync to upload only the target provider's files, preventing empty Eleventy stubs from overwriting live data.
  • pnpm workspace fix: Adding packages: ['.'] to pnpm-workspace.yaml activates the existing security overrides, downgrading minimatch/brace-expansion/balanced-match to their correct patched versions.

Confidence Score: 3/5

Safe to merge for single-provider incremental builds; multi-provider builds will fail at the extract-data step.

The core fix works correctly when a single provider ID is specified. However, extract_parameters.py was not updated to split space-separated provider IDs the way extract_connections.py was — passing PROVIDER='amazon google' causes extract_parameters.py to exit with a not-found error while extract_connections.py handles it correctly.

dev/registry/extract_parameters.py — the space-separated provider filter added to extract_connections.py was not applied here, leaving multi-provider incremental builds broken.

Important Files Changed

Filename Overview
.github/workflows/registry-build.yml Splits the single job into build-ci-image + build-and-publish-registry, adds selective S3 sync for incremental builds, and improves shell guards for empty directories.
dev/breeze/src/airflow_breeze/commands/registry_commands.py Adds provider_flag to extract_parameters.py and extract_connections.py invocations. Single-provider builds are fixed; multi-provider builds expose a mismatch with extract_parameters.py.
dev/registry/extract_connections.py Moves provider filtering upstream and adds space-separated ID support. Logic is correct and consistent with extract_metadata.py.
dev/registry/merge_registry_data.py Adds a guard for a missing new_modules_path (skipped in --provider mode), treating it as an empty list. Safe and covered by the new test.
dev/registry/tests/test_merge_registry_data.py Adds test_missing_new_modules_file covering the incremental-build edge case where modules.json is absent.
registry/pnpm-workspace.yaml Adds packages: ['.'] so pnpm applies the existing security overrides, resolving older patched versions of minimatch/brace-expansion/balanced-match.
registry/package.json Adds --project ../dev/registry to the uv run prebuild script so the correct Python project context is used.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[registry-build.yml triggered] --> B{build-ci-image if: workflow_call OR allowlist}
    B -->|skipped| Z[All jobs skipped]
    B -->|runs| C[build-and-publish-registry]
    C --> D[Prepare breeze and CI image]
    D --> E{inputs.provider set?}
    E -->|yes incremental| F[Download existing data from S3]
    E -->|no full build| G[extract-data all providers]
    F --> H[extract-data --provider 'amazon google']
    H --> H1[extract_metadata.py OK]
    H --> H2[extract_parameters.py FAILS on multi-ID]
    H --> H3[extract_connections.py OK space-split]
    H1 & H2 & H3 --> I[merge_registry_data.py]
    G --> J[Copy output to registry src _data]
    I --> J
    J --> K[pnpm build Eleventy]
    K --> L[aws s3 sync main exclude api+providers subtrees]
    L --> M{incremental?}
    M -->|yes| N[per-pid sync api and html]
    M -->|no| O[sync pagefind with delete]
    N --> O
    O --> P[publish-versions]
Loading

Reviews (1): Last reviewed commit: "Re-include providers/index.html in incre..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants