Fix registry incremental provider flag #3
The `--provider` flag was only passed to `extract_metadata.py` but not to `extract_parameters.py` or `extract_connections.py`. This caused incremental builds to scan all 99 providers and 1625 modules instead of just the requested one.
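A minimal sketch of the fix described above, assuming the three extraction scripts are launched from one place. The helper name `build_extract_commands` and the bare script paths are hypothetical; the real invocation may look different.

```python
import sys

# Script names are taken from the commit message; their location is an assumption.
EXTRACT_SCRIPTS = [
    "extract_metadata.py",
    "extract_parameters.py",
    "extract_connections.py",
]


def build_extract_commands(provider=None):
    """Build one command per extraction script, forwarding --provider to each."""
    commands = []
    for script in EXTRACT_SCRIPTS:
        cmd = [sys.executable, script]
        if provider:
            # Forward the flag to every script, not only extract_metadata.py.
            cmd += ["--provider", provider]
        commands.append(cmd)
    return commands
```

Building the commands in a single loop makes it harder for one script to miss the flag again.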
The registry workflow was building the CI image from scratch every run (~24 min) because it lacked the BuildKit mount cache that ci-image-build.yml provides. Inline `breeze ci-image build` with registry cache doesn't help because Docker layer cache invalidates on every commit when the build context changes.
Split into two jobs following the established pattern used by ci-amd-arm.yml and update-constraints-on-push.yml:
- `build-ci-image`: calls ci-image-build.yml which handles mount cache restore, ghcr.io login, registry cache, and image stashing
- `build-and-publish-registry`: restores the stashed image via prepare_breeze_and_image action, then runs the rest unchanged
extract_parameters.py with --provider intentionally skips writing modules.json (only the targeted provider's parameters are extracted). The merge script assumed modules.json always exists, causing a FileNotFoundError during incremental builds. Handle missing new_modules_path the same way missing existing_modules_path is already handled: treat it as an empty list.
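The "treat it as an empty list" handling could look roughly like the sketch below. The function name `load_modules` and the assumed file shapes (a bare JSON list, or an object with a `modules` key) are illustrative assumptions, not the merge script's actual code.

```python
import json
from pathlib import Path


def load_modules(path: Path) -> list:
    """Return the modules stored at ``path``, or an empty list if the file is absent."""
    if not path.exists():
        # Incremental builds skip writing modules.json, so a missing file is expected.
        return []
    data = json.loads(path.read_text())
    # Assumption: the file holds either a bare list or {"modules": [...]}.
    return data["modules"] if isinstance(data, dict) else data
```

Using the same loader for both the existing and the new path keeps the two cases symmetric.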
The prepare_breeze_and_image action loads the CI image from /mnt, which requires make_mnt_writeable.sh to run first. Each job gets a fresh runner, so the writeable /mnt from the build job doesn't carry over.
Adding `packages: ['.']` to pnpm-workspace.yaml changed how pnpm processes overrides, causing ERR_PNPM_LOCKFILE_CONFIG_MISMATCH with --frozen-lockfile. Regenerate the lockfile with pnpm 9 to match.
The prebuild script ran `uv run` without --project, causing uv to resolve the full workspace including samba → krb5 which needs libkrb5-dev (not installed on the CI runner).
… on S3
Eleventy pagination templates emit empty fallback JSON for every provider, even when only one provider's data was extracted. A plain `aws s3 sync` uploads those stubs and overwrites real connection/parameter data.
Changes:
- Exclude per-provider connections.json and parameters.json from the main S3 sync during incremental builds, then selectively upload only the target provider's API files
- Filter connections early in extract_connections.py (before the loop) and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
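The early, space-separated provider filter mentioned in the changes above might be sketched as follows. The function names and the `provider_id` record key are assumptions for illustration; the real structure in extract_connections.py may differ.

```python
from typing import Optional


def parse_provider_ids(raw: Optional[str]) -> set:
    """Split a space-separated --provider value into a set of IDs."""
    return set(raw.split()) if raw else set()


def filter_connections(connections: list, raw_provider: Optional[str]) -> list:
    """Keep only records for the requested providers; keep all if none requested."""
    wanted = parse_provider_ids(raw_provider)
    if not wanted:
        return list(connections)
    # Assumption: each record carries its provider under the "provider_id" key.
    return [c for c in connections if c["provider_id"] in wanted]
```

Filtering once before the main loop means the loop body never touches non-target providers.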
The previous exclude only covered connections.json and parameters.json, but modules.json and versions.json for non-target providers also contain incomplete data (no version info extracted) and would overwrite correct data on S3. Simplify to exclude the entire api/providers/* subtree and selectively upload only the target provider's directory.
Non-target provider pages are rebuilt without connection/parameter data (the version-specific extraction files don't exist locally). Without this exclude, the incremental build overwrites complete HTML pages on S3 with versions missing the connection builder section.
The providers listing page uses merged data (all providers) and must be updated during incremental builds — especially for new providers. AWS CLI --include after --exclude re-includes the specific file.
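The filter ordering can be approximated with a small simulation. It follows the documented AWS CLI rule that every file starts out included and later filters take precedence over earlier ones. It is not the CLI's implementation, and the `providers/*` exclude pattern is an assumed stand-in for the workflow's actual pattern.

```python
from fnmatch import fnmatchcase


def is_synced(key: str, filters: list) -> bool:
    """Approximate AWS CLI filtering: the last matching filter decides."""
    included = True  # files are included by default
    for action, pattern in filters:
        if fnmatchcase(key, pattern):
            included = action == "include"
    return included


# Hypothetical filter list for an incremental build.
INCREMENTAL_FILTERS = [
    ("exclude", "api/providers/*"),
    ("exclude", "providers/*"),
    ("include", "providers/index.html"),
]
```

With these filters, `providers/index.html` is uploaded while other per-provider pages and API files are skipped, because the `include` comes after the matching `exclude`.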
Policy Check Failed ✗
3/3 policy checks failed:
• Need 2 more approval(s) (0/2): comment LGTM or approve via review
To merge this PR:
PR Summary
Architecture
Design Decisions: Incremental builds avoid full re-extraction by relying on S3 state and merging logic that tolerates missing files. This trades simplicity for correctness: partial builds assume the target provider’s data is complete and valid.
Scalability & Extensibility: The
Risks: If a provider’s extraction fails mid-run, its stub data could overwrite existing data in S3. This is mitigated by the merge step’s handling of missing
Merge Status
MERGEABLE: PR Score 67/100, above threshold (50). All gates passed.
for pid in ${PROVIDER}; do
  aws s3 sync "registry/_site/api/providers/${pid}/" \
    "${S3_BUCKET}api/providers/${pid}/" \
    --cache-control "${REGISTRY_CACHE_CONTROL}"
  aws s3 sync "registry/_site/providers/${pid}/" \
    "${S3_BUCKET}providers/${pid}/" \
    --cache-control "${REGISTRY_CACHE_CONTROL}"
done
The `for pid in ${PROVIDER}` loop (line 277) relies on unquoted word splitting of the input string. That is intentional for space-separated IDs, but an unquoted expansion also undergoes pathname expansion, so a provider ID containing glob characters (`*`, `?`, `[`) would expand unexpectedly. Use `read -ra` to split the string into an array instead.
Suggested fix
if [[ -n "${PROVIDER}" ]]; then
  read -ra PROVIDER_IDS <<< "${PROVIDER}"
  for pid in "${PROVIDER_IDS[@]}"; do
    aws s3 sync "registry/_site/api/providers/${pid}/" \
      "${S3_BUCKET}api/providers/${pid}/" \
      --cache-control "${REGISTRY_CACHE_CONTROL}"
    aws s3 sync "registry/_site/providers/${pid}/" \
      "${S3_BUCKET}providers/${pid}/" \
      --cache-control "${REGISTRY_CACHE_CONTROL}"
  done
fi
Prompt for AI assistance
Copy the prompt below and paste it into ChatGPT, Claude, or any LLM:
You are an expert bash developer with deep knowledge of security, performance, and best practices.
### Context
File: .github/workflows/registry-build.yml
Lines: 277-284
Issue Type: robustness-medium
Severity: medium
Issue Description:
The `for pid in ${PROVIDER}` loop (line 277) is unquoted word-splitting on the input string, which is intentional for space-separated IDs, but if a provider ID contains special shell characters or glob patterns it will expand unexpectedly; use `read -ra` or quote-safe splitting instead.
Current Code:
if [[ -n "${PROVIDER}" ]]; then
  for pid in ${PROVIDER}; do
    aws s3 sync "registry/_site/api/providers/${pid}/" \
      "${S3_BUCKET}api/providers/${pid}/" \
      --cache-control "${REGISTRY_CACHE_CONTROL}"
    aws s3 sync "registry/_site/providers/${pid}/" \
      "${S3_BUCKET}providers/${pid}/" \
      --cache-control "${REGISTRY_CACHE_CONTROL}"
  done
fi
---
### Instructions
1. Fix the issue described above
2. Maintain the exact indentation and code style from the original
3. Follow bash best practices and language-specific idioms
4. Ensure the fix addresses the root cause, not just the symptoms
5. Add brief inline comments explaining the fix if needed
### Constraints
- Do not change functionality beyond fixing the identified issue
- Preserve existing variable names and function signatures unless they are part of the problem
- Ensure the fix is production-ready
---
Security Scan Summary
No critical security issues detected. Scan completed in 21.2s.
Security scan powered by Codity.ai
License Compliance Scan
- Weak copyleft licenses found: verify compatibility
- Some packages have unknown licenses: manual review required
Medium Risk Licenses - 3 packages
EPL-2.0 (1 package):
MPL-2.0 (2 packages):
Unknown Licenses - 3 packages
Code Quality Report: test-org-codity/airflow1 · PR #3
Scanned: 2026-05-16 16:36 UTC | Score: 49/100 | Provider: github
Executive Summary
Top Findings
[CQ-LLM-001]

| File | Critical | High | Medium | Low | Total |
|---|---|---|---|---|---|
| .github/workflows/registry-build.yml | 0 | 1 | 3 | 4 | 8 |
| dev/registry/extract_connections.py | 0 | 0 | 1 | 0 | 1 |
| dev/registry/tests/test_merge_registry_data.py | 0 | 0 | 0 | 7 | 7 |
| registry/pnpm-lock.yaml | 0 | 0 | 0 | 4 | 4 |

Recommendations
- Resolve High severity issues, especially error handling gaps and performance bottlenecks.
- Run automated tests after applying fixes to verify no regressions.
Greptile Summary
This PR fixes the incremental registry build pipeline by ensuring the
Confidence Score: 3/5
Safe to merge for single-provider incremental builds; multi-provider builds will fail at the extract-data step. The core fix works correctly when a single provider ID is specified. However, the space-separated provider filter added to extract_connections.py was not applied to dev/registry/extract_parameters.py, leaving multi-provider incremental builds broken.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[registry-build.yml triggered] --> B{build-ci-image if: workflow_call OR allowlist}
B -->|skipped| Z[All jobs skipped]
B -->|runs| C[build-and-publish-registry]
C --> D[Prepare breeze and CI image]
D --> E{inputs.provider set?}
E -->|yes incremental| F[Download existing data from S3]
E -->|no full build| G[extract-data all providers]
F --> H[extract-data --provider 'amazon google']
H --> H1[extract_metadata.py OK]
H --> H2[extract_parameters.py FAILS on multi-ID]
H --> H3[extract_connections.py OK space-split]
H1 & H2 & H3 --> I[merge_registry_data.py]
G --> J[Copy output to registry src _data]
I --> J
J --> K[pnpm build Eleventy]
K --> L[aws s3 sync main exclude api+providers subtrees]
L --> M{incremental?}
M -->|yes| N[per-pid sync api and html]
M -->|no| O[sync pagefind with delete]
N --> O
O --> P[publish-versions]
Reviews (1): Last reviewed commit: "Re-include providers/index.html in incre..."