Pin Docker Hub test images against K8s system-test rate-limit flakes #66423

Open
potiuk wants to merge 2 commits into apache:main from potiuk:pin-k8s-test-images-against-dockerhub-ratelimit

@potiuk potiuk commented May 5, 2026

Summary

The scheduled K8s system-test job intermittently fails because multiple
test pods pull alpine:latest (xcom sidecar default), busybox:latest,
and ubuntu:latest from Docker Hub anonymously and trip the
100-pulls-per-6h limit. Without a tag, kubelet defaults imagePullPolicy
to Always, so even nodes with the image cached re-pull every run.

Recent example: https://github.com/apache/airflow/actions/runs/25365187430/job/74380551079
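
For reviewers less familiar with the defaulting rule mentioned above, here is a minimal sketch of how Kubernetes picks a pull policy when a container spec omits `imagePullPolicy`. The function name `default_image_pull_policy` is hypothetical and the reference parsing is deliberately simplified; it only mirrors the documented rule (no tag or `:latest` gives `Always`, any other tag or a digest gives `IfNotPresent`).

```python
def default_image_pull_policy(image: str) -> str:
    """Return the policy applied when a container omits imagePullPolicy."""
    # A digest reference (name@sha256:...) is immutable, so it is not re-pulled.
    if "@" in image:
        return "IfNotPresent"
    # Only a colon after the last "/" is a tag; an earlier one is a registry port.
    last_segment = image.rsplit("/", 1)[-1]
    _, sep, tag = last_segment.partition(":")
    if not sep or tag == "latest":
        return "Always"
    return "IfNotPresent"


for ref in ("alpine", "busybox:latest", "alpine:3.23", "localhost:5000/ubuntu"):
    print(f"{ref:24} -> {default_image_pull_policy(ref)}")
```

Under this rule, pinning `alpine:3.23` is enough to stop nodes with a cached copy from going back to Docker Hub on every run.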

What changed

  1. Production default: xcom_sidecar.PodDefaults.SIDECAR_CONTAINER
    now uses alpine:3.23 via a new XCOM_SIDECAR_IMAGE constant.
    Because the image is tagged, imagePullPolicy defaults to
    IfNotPresent. Documented in the cncf-kubernetes provider changelog.
  2. Test pin: every bare image="ubuntu" / "busybox" / "alpine"
    in kubernetes-tests/ and in providers/cncf/kubernetes/tests/...
    is pinned (ubuntu:24.04, busybox:1.37, alpine:3.23).
  3. Pre-load into kind: new _preload_test_images_to_kind() in
    breeze k8s, called from _run_complete_tests after
    _upload_k8s_image. Pulls each image on the runner with
    exponential-backoff retries on Docker Hub 429s, then
    kind load docker-image puts it on every node.
  4. Auto-tracker: scripts/ci/prek/upgrade_important_versions.py
    gains UPGRADE_ALPINE / UPGRADE_BUSYBOX flags, regex patterns for
    alpine: / busybox: / chart ALPINE_VERSION ARGs, plus the
    relevant call-sites added to FILES_TO_UPDATE. The next "Upgrade
    important CI environment" run keeps these pins fresh. Ubuntu is
    intentionally manual (interim releases beat LTS in semver-sort).
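
The pre-load step in item 3 can be sketched roughly as below. This is not the actual `_preload_test_images_to_kind()` code: `preload_images`, `run_command`, and the parameter names are hypothetical, and the sketch retries on any non-zero `docker pull` exit code rather than detecting a 429 specifically. The `run` and `sleep` parameters are injectable only so the retry logic can be exercised without Docker.

```python
import subprocess
import time
from collections.abc import Callable, Sequence

TEST_IMAGES = ("alpine:3.23", "busybox:1.37", "ubuntu:24.04")


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def preload_images(
    cluster_name: str,
    images: Sequence[str] = TEST_IMAGES,
    max_attempts: int = 5,
    base_delay: float = 2.0,
    run: Callable[[Sequence[str]], int] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Pull each image on the host with backoff, then load it into kind."""
    for image in images:
        for attempt in range(max_attempts):
            if run(["docker", "pull", image]) == 0:
                break
            if attempt == max_attempts - 1:
                raise RuntimeError(f"Could not pull {image} after {max_attempts} attempts")
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2**attempt)
        # `kind load docker-image` copies the host image onto every cluster node.
        if run(["kind", "load", "docker-image", image, "--name", cluster_name]) != 0:
            raise RuntimeError(f"Could not load {image} into kind cluster {cluster_name}")
```

Pulling once on the runner means the registry sees one pull per image per job, regardless of how many nodes or test pods use it.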

Drive-by: # type: ignore[no-redef] on the tomli as tomllib fallback
in dev/registry/extract_{metadata,versions}.py so mypy-dev passes on
edits to anything under dev/. The same fix is in PR #66314 — whichever
lands first, the other becomes a no-op rebase.
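
For context, the fallback pattern that needs the ignore looks roughly like this. It is a sketch of the common idiom, assuming a try/except form; the exact shape in `dev/registry/` may differ.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    # Older interpreters use the tomli backport under the same name. mypy
    # reports this second binding as a redefinition unless it is silenced.
    import tomli as tomllib  # type: ignore[no-redef]

print(tomllib.loads('name = "example"')["name"])
```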

Test plan

  • Unit: uv --project providers/cncf/kubernetes run pytest providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py -k xcom_sidecar_container_image_default -xvs — passes locally.
  • CI: K8s scheduled job (the one currently flaking) passes once this lands.
  • CI: provider unit tests still green.
  • Manual: prek run --files <changed files> — clean.

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

@bugraoz93 bugraoz93 left a comment

Small nit on how we define the image name and version: maybe define them once in a more generic place for tests, such as devel-common, although I'm not sure this use case fits exactly. That would also reduce MR size for upgrades, since I see we also added them to the flow of upgrading_important_packages. Even though the upgrade PRs are auto-generated, any definition we can make common would help when reviewing them :)

@jscheffl jscheffl left a comment

Cool!

Also, like @bugraoz93, I would propose using a constant in all the test_* repetitions.
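
One way to read the suggestion from both reviewers is a single shared definition that every test imports. The sketch below is an illustration only: the module location, the constant names, and the `expected_container` helper are all hypothetical and not part of this PR.

```python
# Hypothetical shared module (for example somewhere under devel-common).
# The location and every name here are assumptions.
ALPINE_IMAGE = "alpine:3.23"
BUSYBOX_IMAGE = "busybox:1.37"
UBUNTU_IMAGE = "ubuntu:24.04"

PINNED_TEST_IMAGES = (ALPINE_IMAGE, BUSYBOX_IMAGE, UBUNTU_IMAGE)


def expected_container(name: str, image: str) -> dict[str, str]:
    """Build the container fragment a test would compare against."""
    return {"name": name, "image": image}


# Tests reference the constant instead of repeating the literal.
print(expected_container("base", UBUNTU_IMAGE))
```

With this shape, a version bump touches three lines instead of every test file that mentions an image.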

potiuk added 2 commits May 6, 2026 12:51
The scheduled K8s system-test job has been intermittently red because
multiple test pods pull the unpinned `alpine:latest` (xcom sidecar) and
`busybox:latest` / `ubuntu:latest` (test pods) from Docker Hub
anonymously and trip its 100-pulls-per-6h limit
(https://github.com/apache/airflow/actions/runs/25365187430/job/74380551079).
Without a tag, kubelet defaults `imagePullPolicy` to `Always`, so even
nodes that already cached the image re-pull every run.

Changes
-------

1. **Production default**: `xcom_sidecar.PodDefaults.SIDECAR_CONTAINER`
   now uses `alpine:3.23` via a new module-level `XCOM_SIDECAR_IMAGE`
   constant. Tagged → `imagePullPolicy: IfNotPresent` by default →
   nodes with the image cached do not re-pull.

2. **System / kubernetes-tests pin**: every bare `image="ubuntu"` /
   `"busybox"` / `"alpine"` in `kubernetes-tests/...` and the
   `cncf/kubernetes` system / unit tests is now pinned (ubuntu:24.04,
   busybox:1.37, alpine:3.23). Test assertions in
   `test_pod.py` updated to match the new sidecar default.

3. **Pre-load into kind**: a new `_preload_test_images_to_kind()` helper
   in `breeze k8s` runs after `_upload_k8s_image()` in
   `_run_complete_tests`. It pulls each image on the runner with
   exponential-backoff retries on Docker Hub 429s, then `kind load
   docker-image` puts it on every node — so kubelet never has to reach
   out to the registry once the cluster is ready.

4. **Auto-tracker**: `scripts/ci/prek/upgrade_important_versions.py`
   gains `UPGRADE_ALPINE` / `UPGRADE_BUSYBOX` flags, fetchers using the
   existing Docker Hub `get_latest_image_version()`, regex patterns for
   `alpine:` / `busybox:` literals plus chart `ALPINE_VERSION` ARGs, and
   the relevant call-sites added to `FILES_TO_UPDATE`. The next "Upgrade
   important CI environment" run will keep these pins fresh
   automatically. Ubuntu is intentionally not auto-tracked: the tracker
   would prefer the highest semver, which can be an interim
   (non-LTS) release — system tests want LTS.
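
To make item 4 concrete, here is a rough sketch of regex-based pin bumping and of why a highest-version sort does not suit Ubuntu. `bump_image_pin` and `highest_version` are hypothetical names, not the helpers in `upgrade_important_versions.py`, and the tag list is just an example.

```python
import re


def bump_image_pin(text: str, image: str, new_version: str) -> str:
    """Rewrite every `<image>:<numeric version>` literal in text to new_version."""
    pattern = re.compile(rf"\b{re.escape(image)}:\d+(?:\.\d+)*")
    return pattern.sub(lambda _: f"{image}:{new_version}", text)


def highest_version(tags: list[str]) -> str:
    """Pick the numerically highest dotted version, as a naive tracker would."""
    return max(tags, key=lambda tag: tuple(int(part) for part in tag.split(".")))


source = 'image="alpine:3.22"\nimage="busybox:1.36"\n'
print(bump_image_pin(source, "alpine", "3.23"))

# 25.10 is an interim release, yet it sorts above the 24.04 LTS.
print(highest_version(["22.04", "24.04", "25.10"]))
```

The second call is the reason Ubuntu stays manual: picking an LTS needs knowledge that the version number alone does not carry.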

Drive-by
--------

`# type: ignore[no-redef]` on the standard `import tomli as tomllib`
fallback in `dev/registry/extract_{metadata,versions}.py` so `mypy-dev`
passes on edits to anything else under `dev/`. Identical fix lives in
PR apache#66314 — whichever lands first, the other becomes a no-op rebase.

Three follow-ups to the original commit, surfaced by CI on
apache#66423:

1. The dict-literal `image` keys in `expected_pod` fixtures inside
   `kubernetes-tests/tests/kubernetes_tests/test_kubernetes_pod_operator.py`
   still pointed at the bare names (`"ubuntu"`, `"alpine"`) — only
   the kwarg-style `image=` references were caught by the original
   sed. Pinned them to match the new defaults. Without this, every
   pod-spec equality assertion against `self.expected_pod` failed
   on Python 3.10 K8s system tests.

2. The cncf.kubernetes changelog note used a level-3 `~~~` heading
   directly under `Changelog ---`, which (a) shifted the entire
   version-section hierarchy and produced ~700 cascading docs-build
   errors, and (b) had an underline 1 char shorter than the title,
   triggering a `Title underline too short` warning. Replaced the
   heading with a bold-led paragraph: same content, no hierarchy
   disruption.

3. `kubelet` was missing from `docs/spelling_wordlist.txt`, so the
   sphinx spellcheck flagged it in the new note.
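
The miss described in follow-up 1 is easy to reproduce: a pattern that only looks for `image=` never sees a dict-literal `"image": "ubuntu"`. Below is a sketch of a single pattern covering both forms. It is an illustration in Python rather than the sed command that was actually run, and `pin_bare_images` is a hypothetical name.

```python
import re

BARE_IMAGES = {"ubuntu": "ubuntu:24.04", "busybox": "busybox:1.37", "alpine": "alpine:3.23"}

# Matches both `image="ubuntu"` (kwarg) and `"image": "ubuntu"` (dict literal).
BARE_IMAGE_RE = re.compile(
    r"""(?P<prefix>\bimage=|["']image["']:\s*)(?P<quote>["'])(?P<name>ubuntu|busybox|alpine)(?P=quote)"""
)


def pin_bare_images(source: str) -> str:
    """Replace bare image names with their pinned tags, in either syntax."""

    def replace(match: re.Match) -> str:
        quote = match["quote"]
        return f"{match['prefix']}{quote}{BARE_IMAGES[match['name']]}{quote}"

    return BARE_IMAGE_RE.sub(replace, source)


sample = 'op(image="ubuntu")\nexpected = {"image": "alpine", "name": "base"}\n'
print(pin_bare_images(sample))
```

Already-pinned references such as `image="ubuntu:24.04"` are left alone, because the closing quote must follow the bare name directly.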
@potiuk potiuk force-pushed the pin-k8s-test-images-against-dockerhub-ratelimit branch from 231c6ad to 55d04ae Compare May 6, 2026 10:54