Skip to content

feat(worker): export GMS host/port/protocol; bump chart to 0.0.42#70

Merged
rajatoss merged 3 commits into
mainfrom
fix/export-gms-host-port-protocol
May 19, 2026
Merged

feat(worker): export GMS host/port/protocol; bump chart to 0.0.42#70
rajatoss merged 3 commits into
mainfrom
fix/export-gms-host-port-protocol

Conversation

@rajatoss
Copy link
Copy Markdown
Member

@rajatoss rajatoss commented May 18, 2026

Summary

Export DATAHUB_GMS_HOST / PORT / PROTOCOL env vars from the datahub-executor-worker chart so the executor image's startup readiness gate (wait_for_gms_ready in start.sh) can find GMS in any deployment, with sensible defaults derived from gms.url.

Problem

The executor image's wait_for_gms_ready shell function (the Wolfi-image replacement for dockerize) reads DATAHUB_GMS_HOST / PORT / PROTOCOL. The chart only exported DATAHUB_GMS_URL. With nothing else set, the gate fell back to a hardcoded http://datahub-gms:8080/health and timed out after 240s in any namespace where the actual Service is something like dh-datahub-gms.

Observed on f7700a3ff0-dev07 (pr8856 image, chart 0.0.41): worker pods crashlooping with Waiting for GMS at http://datahub-gms:8080/health (timeout 240s)... while the actual Service is dh-datahub-gms.

Behavior

Precedence per field (host, port, protocol):

  1. Any of the three explicitly set in values → use the explicit values; unset fields fall back to literal internal defaults (datahub-gms / 8080 / http). The trio is treated as "all-or-nothing" — touching one means the user is describing the internal Service, so other defaults are also internal-friendly. This prevents partial-override silent corruption (e.g., setting host: dh-datahub-gms against an external LB URL inheriting port: 443).
  2. None of the three set, but gms.url is set → derive all three by parsing the URL (sprig urlParse). Self-contained installs (gms.url: http://datahub-gms:8080) need no extra config.
  3. Nothing set at all → literal defaults datahub-gms / 8080 / http — matches the script's previous hardcoded fallback, preserving backward compatibility with chart 0.0.41.

Verification

Rendered against 16 edge cases (helm template locally), covering:

  • Backward compat (no URL, no overrides) — literal defaults preserved
  • Self-contained internal URL only — chart parses everything
  • External LB URL only — chart parses (probe likely 404s at LB root; documents the case)
  • LB URL + partial override (only host set) — port/protocol fall back to internal literals, not URL-parsed
  • All 3 overrides explicit — exact values pass through
  • URL with non-standard port — parsed correctly
  • URL with query string / userinfo / path / trailing slash — handled cleanly
  • Empty/missing URL — falls to literal defaults
  • Each explicit value beats URL parsing — precedence verified

Companion change

acryldata/datahub-fork#9666 makes start.sh also derive these from DATAHUB_GMS_URL as a fallback — defense in depth so future regressions can't recur. That change has merged to acryl-main; backport to releases/v1.1.0 is on branch cherry-pick/9666-to-releases-v1.1.0.

rajatoss and others added 2 commits May 18, 2026 14:19
The executor image's wait_for_gms_ready startup gate reads
DATAHUB_GMS_HOST/PORT/PROTOCOL, but the chart only exported
DATAHUB_GMS_URL. With nothing set, the gate fell back to a hardcoded
"datahub-gms:8080" — which fails DNS in any namespace whose GMS
release has a non-empty name prefix (e.g. "dh-datahub-gms").

Export host/port/protocol from values, defaulting to the existing
fallback so current users aren't affected. Deployments with a
prefixed GMS service set the override:

  global:
    datahub:
      gms:
        host: dh-datahub-gms
Removed comments explaining host/port/protocol usage.
@rajatoss rajatoss self-assigned this May 18, 2026
@rajatoss rajatoss requested a review from skrydal May 18, 2026 09:38
Comment thread charts/datahub-executor-worker/templates/workload.yaml Outdated
PR feedback: rather than requiring customers to specify host/port/protocol
explicitly, derive them from gms.url so self-contained installs only need
the URL.

Behavior:
- If none of host/port/protocol are set and gms.url is set, parse the URL
  to fill all three.
- If any of host/port/protocol is explicitly set, the trio falls back to
  literal internal defaults (datahub-gms/8080/http) for unset fields.
  This prevents partial-override silent corruption where, e.g., setting
  host: dh-datahub-gms while leaving port unset would otherwise inherit
  port 443 from an external LB URL.
- If gms.url is also unset, all three fall back to the existing literal
  defaults — preserving backward compatibility with chart 0.0.41.
Copy link
Copy Markdown
Contributor

@skrydal skrydal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had a way to test this

@rajatoss
Copy link
Copy Markdown
Member Author

I did some testing via scripts with different urls

@rajatoss rajatoss merged commit 2e5a3d0 into main May 19, 2026
1 check passed
@rajatoss rajatoss deleted the fix/export-gms-host-port-protocol branch May 19, 2026 05:29
rajatoss added a commit that referenced this pull request May 19, 2026
PR #70 (chart 0.0.42) introduced URL-parsing that always exported
DATAHUB_GMS_HOST / PORT / PROTOCOL derived from gms.url. This broke
the DataHub Python SDK: `_get_config_from_env()` in
datahub.cli.config_utils short-circuits on `if port is not None` and
rebuilds the URL from HOST:PORT, dropping any path component. For
LB-routed URLs like `https://dev07.acryl.io/gms`, the reconstructed
URL became `https://dev07.acryl.io:443`, which hit the frontend and
failed ingestion with "you seem to have connected to the frontend
service instead of the GMS endpoint".

The HOST/PORT/PROTOCOL env vars were only ever a workaround for an
image-side bug (the startup script in pr8856 / v1.1.0rc1 hardcoded
"datahub-gms" instead of reading DATAHUB_GMS_URL). The proper fix
landed in acryldata/datahub-fork#9666 (now backported to
releases/v1.1.0 via #9668) — the startup script now reads
DATAHUB_GMS_URL directly. With that image-side fix in place, the
chart no longer needs to set HOST/PORT/PROTOCOL at all.

This chart change reverts to the 0.0.41 behavior (only emit
DATAHUB_GMS_URL) and bumps the version to 0.0.43. Customers who
deployed 0.0.42 with implicit URL-parsing get fixed ingestion on
upgrade. Customers who had set explicit gms.host/port/protocol in
0.0.42 (unlikely — the chart only published briefly) can use the
existing extraEnvs: escape hatch if they need those env vars.
rajatoss added a commit that referenced this pull request May 19, 2026
PR #70 (chart 0.0.42) introduced URL-parsing that always exported
DATAHUB_GMS_HOST / PORT / PROTOCOL derived from gms.url. This broke
the DataHub Python SDK: `_get_config_from_env()` in
datahub.cli.config_utils short-circuits on `if port is not None` and
rebuilds the URL from HOST:PORT, dropping any path component. For
LB-routed URLs like `https://dev07.acryl.io/gms`, the reconstructed
URL became `https://dev07.acryl.io:443`, which hit the frontend and
failed ingestion with "you seem to have connected to the frontend
service instead of the GMS endpoint".

The HOST/PORT/PROTOCOL env vars were only ever a workaround for an
image-side bug (the startup script in pr8856 / v1.1.0rc1 hardcoded
"datahub-gms" instead of reading DATAHUB_GMS_URL). The proper fix
landed in acryldata/datahub-fork#9666 (now backported to
releases/v1.1.0 via #9668) — the startup script now reads
DATAHUB_GMS_URL directly. With that image-side fix in place, the
chart no longer needs to set HOST/PORT/PROTOCOL at all.

This chart change reverts to the 0.0.41 behavior (only emit
DATAHUB_GMS_URL) and bumps the version to 0.0.43. Customers who
deployed 0.0.42 with implicit URL-parsing get fixed ingestion on
upgrade. Customers who had set explicit gms.host/port/protocol in
0.0.42 (unlikely — the chart only published briefly) can use the
existing extraEnvs: escape hatch if they need those env vars.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants