ci: replace broken CI with orchestrated parallel build system#145
ci: replace broken CI with orchestrated parallel build system#145jackluo923 wants to merge 10 commits intorelease-0.293-clp-connectorfrom
Conversation
Replace broken CI with a unified orchestrator (ci.yml) that coordinates
Java coordinator and C++ native worker builds in parallel.
Job dependency graph:
resolve-config → create-builder-image ─┬─► presto (Java) ────┬─► integration-tests
├─► prestocpp (C++) ──┘
└─► presto-tests
Caching strategy:
- Bake ccache (~2GB) and Maven cache into Docker image layers
- Builder image based on GitHub runner image for self-hosted caching
- Parallel jobs on same host share cached layers with zero network traffic
- Builder image tagged by dependency hash, rebuilt only when deps change
Image tagging (configurable via IMAGE_VERSION_TYPE, set to RELEASE for this branch):
- Immutable: version-RELEASE-timestamp-hash (e.g., 0.293-RELEASE-20250522-abc123)
- Mutable: version-RELEASE-SNAPSHOT for pulling latest
Outputs three Docker images: unified-builder, presto, prestissimo
📝 WalkthroughWalkthroughRestructures CI to a builder-image driven model: adds a unified builder Dockerfile, replaces wget with curl in download helper and introduces MAVEN_REPO, converts many workflows to reusable/workflow_call patterns, consolidates builds/tests (Java/C++), updates branch filters for snapshot releases, and removes redundant workflows. Changes
Sequence Diagram(s)sequenceDiagram
participant Trigger as Developer / Git event
participant Resolve as resolve-config job
participant Registry as GHCR (builder-image)
participant Builder as create-builder-image job
participant CI as CI jobs (presto / prestocpp / integration)
participant MAVEN as MAVEN_REPO (cache)
participant Artifacts as Artifact storage / Actions
Trigger->>Resolve: compute builder-image tag, runtime tags
Resolve->>Registry: check manifest (exists?)
alt image missing
Registry->>Builder: request build
Builder->>Registry: push builder-image
end
Resolve->>CI: provide builder-image & tags
CI->>Registry: pull builder-image (container)
CI->>MAVEN: use MAVEN_REPO from builder-image
CI->>Artifacts: upload build artifacts (presto-server, presto-native-build)
CI->>Registry: push runtime images (when configured)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 14
🤖 Fix all issues with AI agents
In @.github/dockerfiles/yscope-presto-builder.dockerfile:
- Around line 136-137: The Dockerfile currently masks Maven dependency
resolution failures with "|| true" on the "./mvnw dependency:resolve-plugins
dependency:resolve -B --no-transfer-progress -Dmaven.repo.local=${MAVEN_REPO}"
command; remove the unconditional swallow and instead capture the command's exit
code and, if non‑zero, emit a clear warning (including the exit code and a brief
message) to stderr or the build log before continuing so failures are visible
while preserving non-fatal behavior.
- Around line 64-70: Replace the interactive "apt" call with scripted "apt-get":
in the RUN sequence that calls /presto/scripts/setup-ubuntu.sh, change "apt
install -y rpm" to use "apt-get update && apt-get install -y rpm" (and
optionally follow with "apt-get clean && rm -rf /var/lib/apt/lists/*") so the
Dockerfile uses the stable non-interactive package manager; update the RUN line
invoking /presto/velox/scripts/setup-ubuntu.sh and
/presto/scripts/setup-adapters.sh as needed to keep the same ordering but ensure
all package installs use apt-get.
In @.github/workflows/ci.yml:
- Around line 184-185: The current BASE_VERSION extraction using grep on pom.xml
is fragile (it looks for the first <version> tag); update the CI step that sets
BASE_VERSION to reliably obtain the project version instead: either call Maven
to evaluate project.version (e.g., use mvn help:evaluate
-Dexpression=project.version with quiet/forceStdout flags) and strip -SNAPSHOT,
or use an XML parser like xmllint to select /project/version and then remove
-SNAPSHOT; replace the existing grep|head|sed pipeline where BASE_VERSION is
defined with one of these robust approaches (referencing the BASE_VERSION
variable and the pom.xml file in the workflow).
In @.github/workflows/create-builder-image.yml:
- Around line 129-137: Add layer caching to the "Build and push unified builder
image" step that uses docker/build-push-action@v6 by configuring cache-from and
cache-to (or using type=gha) so subsequent runs reuse image layers; update the
step that currently sets context/file/tags/labels to include cache configuration
(cache-from: type=registry,ref=... or cache-to: type=gha,mode=max) and ensure
the referenced tag input (inputs.builder-image) is used as the cache
key/reference.
In @.github/workflows/integration-tests.yml:
- Around line 98-130: The duplicated restore logic for creating dirs, moving
binaries, and setting execute permissions (handling presto_server and
velox_functions_remote_server_main) should be extracted into a reusable step:
implement a composite GitHub Action (e.g.,
.github/actions/restore-native-artifacts) or a shell script and replace the
repeated blocks in each job with a single use step (uses:
./.github/actions/restore-native-artifacts or run:
./scripts/restore-native-artifacts.sh); ensure the action/script performs the
mkdir -p for the two target paths, moves
${GITHUB_WORKSPACE}/presto_cpp/main/presto_server and
${GITHUB_WORKSPACE}/velox/velox/functions/remote/server/velox_functions_remote_server_main
when present, sets chmod +x on the targets, and exits with an error if
presto_server is missing to preserve current behavior.
- Around line 304-313: The current shell discovery using TESTFILES/TESTCLASSES
and a for-loop is brittle (won’t handle special chars, nested/inner classes, or
non-test files); replace it by invoking Maven/Surefire to enumerate tests or, if
keeping shell, use a null-delimited find (find ... -print0) piped to xargs -0
and filter candidate files by presence of the `@Test` annotation (or a regex
matching typical test class patterns) before extracting the top-level class name
into TESTCLASSES, and apply the same change to the second occurrence (lines
referenced 437-446) so filenames with spaces/special chars and non-test files
are safely excluded.
- Line 333: Remove the redundant JVM system property -DLD_LIBRARY_PATH from the
test invocation and rely on the exported environment variable LD_LIBRARY_PATH
instead; locate the occurrence of the "-DLD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}\""
JVM arg in the workflow step and delete it (this also applies to the duplicate
occurrence later), ensuring the environment export of LD_LIBRARY_PATH remains in
place so child processes inherit it.
- Around line 192-194: The export line that sets LD_LIBRARY_PATH appends
${LD_LIBRARY_PATH} unconditionally which can produce a trailing colon if
LD_LIBRARY_PATH is unset; change it to only append the existing value when
present (e.g., check or conditionally expand LD_LIBRARY_PATH) so the export
LD_LIBRARY_PATH line does not end up with an extra colon—locate the export
LD_LIBRARY_PATH="/usr/local/lib:...:${LD_LIBRARY_PATH}" line and modify it to
conditionally include ${LD_LIBRARY_PATH} only when non-empty.
- Around line 147-151: The current extraction logic can silently pick the wrong
tarball or directory when multiple matches exist; update the script around
PRESTO_SERVER_TAR and PRESTO_SERVER_DIR to explicitly expand the glob, count
matches, and fail fast if the count is not exactly one (zero or >1) with an
error message; set PRESTO_SERVER_TAR to that single match, extract it, then
similarly resolve PRESTO_SERVER_DIR by validating that exactly one
presto-server-*/ directory exists before exporting PRESTO_SERVER_DIR to
GITHUB_ENV (fail the job on validation failure).
In @.github/workflows/presto-build.yml:
- Around line 183-188: The version extraction step ("Extract base version", id
"extract-version") can silently use multiple matching files; update the shell in
that run block to detect multiple matches before calling sed: expand the glob
into a list/array, if more than one match then exit with a clear error (or
explicitly pick the first with a documented head -1 choice), otherwise assign
the single filename to VERSION and then run the sed extraction; ensure the step
fails fast on ambiguous matches and still echoes "base-version=${VERSION}" to
$GITHUB_OUTPUT.
In @.github/workflows/prestocpp-linux-build-and-unit-test.yml:
- Around line 166-168: The workflow emits invalid Docker tags when
inputs.runtime-version-tag or inputs.runtime-snapshot-tag are empty; update the
job that sets the tags (the tags: block) to skip or omit empty inputs by adding
a conditional filter so only non-empty inputs are expanded into ghcr.io/${{
github.repository }}/prestissimo:${{ inputs.runtime-version-tag }} and
ghcr.io/${{ github.repository }}/prestissimo:${{ inputs.runtime-snapshot-tag }},
or alternatively make the inputs required (runtime-version-tag and
runtime-snapshot-tag) so empty values cannot be provided; ensure the push/tag
step checks for non-empty tag strings before attempting to push.
In @.github/workflows/tests.yml:
- Around line 39-44: Remove the unused YAML anchor by deleting "&runner" from
the runs-on definition: update the runs-on key (the block that currently starts
with "runs-on: &runner") to simply be "runs-on:" with the same expression ${{
github.repository_owner == 'y-scope' && fromJSON('["self-hosted", "x64",
"ubuntu-noble"]') || 'ubuntu-latest' }} so the anchor &runner is not defined (no
other changes to the runs-on expression or surrounding job needed).
In `@presto-native-execution/scripts/dockerfiles/prestissimo-runtime.dockerfile`:
- Line 18: The Dockerfile uses a lowercase build stage keyword in the line with
"FROM ${BUILDER_IMAGE} as prestissimo-image"; change the lowercase "as" to
uppercase "AS" so the directive reads "FROM ${BUILDER_IMAGE} AS
prestissimo-image" to keep keyword casing consistent with Dockerfile best
practices.
- Around line 26-28: The RUN instruction that invokes ldd on /tmp/presto_server
should use POSIX-safe negation and enable pipefail so failures in pipes are
detected: change the shell invocation to set -o pipefail (or use "set -euo
pipefail" style) before running the ldd/grep/awk pipeline, use a spaced negation
form (! ldd ... | grep -q "not found") or test grep's exit code with -q to avoid
relying on ! attached to a pipeline, and keep the LD_LIBRARY_PATH and the
copy-to-/runtime-libraries logic intact so the
LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64 ldd /tmp/presto_server | awk 'NF
== 4 { system("cp " $3 " /runtime-libraries") }' step still runs only when no
"not found" is detected.
Making runtime-version-tag and runtime-snapshot-tag required inputs prevents invalid Docker tags (e.g., 'ghcr.io/.../prestissimo:') when empty values are passed. Since ci.yml always provides these tags, this is a safe change that enforces the expected contract.
Add validation to ensure exactly one presto-server tarball exists before extraction. This prevents silent failures when glob pattern matches multiple files unexpectedly.
- Use ${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} to avoid trailing colon
when LD_LIBRARY_PATH is unset
- Remove redundant -DLD_LIBRARY_PATH JVM property since the
environment variable is already exported and inherited by child
processes
Add validation to ensure exactly one presto-server tarball exists before extracting version. This prevents silent failures when glob pattern matches multiple files unexpectedly.
- Use uppercase AS for consistency with FROM keyword - Add space after ! for POSIX shell compliance - Set pipefail to catch failures in intermediate pipe commands - Use grep -q for silent matching
- Use apt-get instead of apt for scripted installs (more stable CLI) - Log warning instead of silently ignoring Maven dependency failures
The test was frequently timing out in CI. Double the timeout to provide more headroom for slower CI environments.
Pin all third-party GitHub Actions to their full commit SHAs to prevent supply chain attacks from compromised action tags. Version comments added for maintainability.
YAML comments inside quoted strings are treated as literal text. Move # vX comments outside the quotes so they are parsed as comments.
8319ab5 to
bb0abb3
Compare
Description
Replace the broken CI with a unified orchestrator (
ci.yml) that coordinates Java coordinator and C++ native worker builds in parallel using ephemeral containerized builds.Problems with Current CI
The existing CI has several critical limitations:
Solution
Architecture
Job Dependency Graph:
Workflow Structure:
ci.yml: Main orchestrator calling reusable workflowscreate-builder-image.yml: Builds/pushes builder image if tag doesn't existpresto-build.yml: Java build with artifact upload and Docker image pushprestocpp-linux-build-and-unit-test.yml: C++ build, tests, and prestissimo imageintegration-tests.yml: E2E tests using pre-built artifactstests.yml: Java unit testsKey Design Decisions
1. Ephemeral Containerized Builds
2. Caching Strategy: Bake into Docker Image Layers
3. Image Tagging: Dual Tags
<version>-<TYPE>-<YYYYMMDDHHMMSS>-<short-sha>0.293-RELEASE-20260130053917-ce0b304<version>-<TYPE>-SNAPSHOT0.293-RELEASE-SNAPSHOTRELEASEfor default branch,BETAfor other branchesResults
Outputs
Docker Images (ghcr.io):
unified-builderprestoprestissimoArtifacts (1-day retention):
presto-serverpresto-server-*.tar.gzpresto-clipresto-cli-*-executable.jarpresto-native-buildpresto_server,velox_functions_remote_server_mainChecklist
breaking change.
Validation performed
resolve-config: ✅create-builder-image: ✅presto-java8/presto-java17: ✅prestocppbuild and tests: ✅presto-tests(8 modules × Java 8/17): ✅integration-tests(DWRF, PARQUET, tpch-tpcds, iceberg, general, sidecar): ✅Summary by CodeRabbit