Skip to content

ci: replace broken CI with orchestrated parallel build system#145

Open
jackluo923 wants to merge 10 commits intorelease-0.293-clp-connectorfrom
ci
Open

ci: replace broken CI with orchestrated parallel build system#145
jackluo923 wants to merge 10 commits intorelease-0.293-clp-connectorfrom
ci

Conversation

@jackluo923
Copy link
Copy Markdown
Member

@jackluo923 jackluo923 commented Feb 2, 2026

Description

Replace the broken CI with a unified orchestrator (ci.yml) that coordinates Java coordinator and C++ native worker builds in parallel using ephemeral containerized builds.

Problems with Current CI

The existing CI has several critical limitations:

Problem Impact
Single stateful dedicated runner Only one workflow at a time; everything queues
Stateful = fragile Any broken state persists and breaks subsequent runs
Maintenance nightmare Pre-installed deps and caches require SSH and careful coordination
Branch switching breaks everything Corrupts caches and leaves stale artifacts
No parallelism Can't run matrix jobs or multiple workflows concurrently
Slow ~2+ hours end-to-end

Solution

Problem Solution
Single stateful runner Ephemeral containerized builds on multiple self-hosted runners
Stateful = fragile Every build starts fresh in a container—no state leakage
Maintenance nightmare Dependencies baked into Docker image; update by rebuilding
Branch switching breaks everything Containers are isolated; parallel runs don't interfere
No parallelism Multiple ephemeral runners enable full parallel execution
Slow (~2+ hours) Parallel jobs + pre-warmed caches = 15-30 minutes

Architecture

Job Dependency Graph:

resolve-config ─► create-builder-image ─┬─► presto (Java) ────┬─► integration-tests
                                        ├─► prestocpp (C++) ──┘
                                        └─► presto-tests

Workflow Structure:

  • ci.yml: Main orchestrator calling reusable workflows
  • create-builder-image.yml: Builds/pushes builder image if tag doesn't exist
  • presto-build.yml: Java build with artifact upload and Docker image push
  • prestocpp-linux-build-and-unit-test.yml: C++ build, tests, and prestissimo image
  • integration-tests.yml: E2E tests using pre-built artifacts
  • tests.yml: Java unit tests

Key Design Decisions

1. Ephemeral Containerized Builds

  • Every job runs in a fresh container—no state pollution between runs
  • Multiple jobs can run in parallel across many machines
  • Failed builds can't corrupt future runs
  • Branch switching is safe—each run is isolated

2. Caching Strategy: Bake into Docker Image Layers

  • Bake ccache and Maven cache into Docker image layers
  • Builder image based on GitHub runner image (always pre-cached on self-hosted runners)
  • Parallel jobs on same host share cached layers with zero network traffic
  • Builder image tagged by dependency hash—rebuilt only when dependencies change

3. Image Tagging: Dual Tags

Tag Type Format Example
Immutable <version>-<TYPE>-<YYYYMMDDHHMMSS>-<short-sha> 0.293-RELEASE-20260130053917-ce0b304
Mutable <version>-<TYPE>-SNAPSHOT 0.293-RELEASE-SNAPSHOT
  • Immutable: Pin specific builds for production deployments
  • Mutable SNAPSHOT: Always pull latest for dev/testing
  • TYPE: RELEASE for default branch, BETA for other branches

Results

Metric Old CI New CI
Total CI time ~2+ hours 15-30 min
Parallelism None Full parallel
State isolation None (stateful) Complete (ephemeral)
Maintenance Manual SSH Rebuild Docker image

Outputs

Docker Images (ghcr.io):

Image Description
unified-builder Build environment with all dependencies and pre-warmed caches
presto Java coordinator runtime
prestissimo C++ native worker runtime

Artifacts (1-day retention):

Artifact Contents
presto-server presto-server-*.tar.gz
presto-cli presto-cli-*-executable.jar
presto-native-build presto_server, velox_functions_remote_server_main

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • CI workflow runs successfully: Run #21521096630
  • All 36 jobs pass:
    • resolve-config: ✅
    • create-builder-image: ✅
    • presto-java8 / presto-java17: ✅
    • prestocpp build and tests: ✅
    • presto-tests (8 modules × Java 8/17): ✅
    • integration-tests (DWRF, PARQUET, tpch-tpcds, iceberg, general, sidecar): ✅

Summary by CodeRabbit

  • Chores
    • Optimized CI with a reusable builder image, pre-warmed caches and faster dependency resolution to accelerate builds.
    • Added configurable Maven cache location for consistent dependency handling.
    • Consolidated and extended CI workflows for parallel Java/C++ builds, image creation, and artifact management.
  • New Features
    • New integration-test workflow to run end-to-end tests between Java and native components.

Replace broken CI with a unified orchestrator (ci.yml) that coordinates
Java coordinator and C++ native worker builds in parallel.

Job dependency graph:
  resolve-config → create-builder-image ─┬─► presto (Java) ────┬─► integration-tests
                                         ├─► prestocpp (C++) ──┘
                                         └─► presto-tests

Caching strategy:
- Bake ccache (~2GB) and Maven cache into Docker image layers
- Builder image based on GitHub runner image for self-hosted caching
- Parallel jobs on same host share cached layers with zero network traffic
- Builder image tagged by dependency hash, rebuilt only when deps change

Image tagging (configurable via IMAGE_VERSION_TYPE, set to RELEASE for this branch):
- Immutable: version-RELEASE-timestamp-hash (e.g., 0.293-RELEASE-20250522-abc123)
- Mutable: version-RELEASE-SNAPSHOT for pulling latest

Outputs three Docker images: unified-builder, presto, prestissimo
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 2, 2026

Caution

Review failed

The head commit changed during the review from 8319ab5 to bb0abb3.

📝 Walkthrough

Walkthrough

Restructures CI to a builder-image driven model: adds a unified builder Dockerfile, replaces wget with curl in download helper and introduces MAVEN_REPO, converts many workflows to reusable/workflow_call patterns, consolidates builds/tests (Java/C++), updates branch filters for snapshot releases, and removes redundant workflows.

Changes

Cohort / File(s) Summary
Build tooling scripts
​.github/bin/download_nodejs
Replaced wget_retry() with curl_retry(); added MAVEN_REPO (defaults to ${HOME}/.m2/repository) and updated Node/Yarn/Maven cache paths to use ${MAVEN_REPO}.
Builder image
​.github/dockerfiles/yscope-presto-builder.dockerfile
New builder Dockerfile that pre-warms Maven/ccache/node caches, installs CMake 3.28.3, and resolves Maven dependencies into a MAVEN_REPO path.
Runtime image
presto-native-execution/scripts/dockerfiles/prestissimo-runtime.dockerfile
Switched to using BUILDER_IMAGE as base; simplified runtime layout and library handling; copies prebuilt binary from builder image.
Central CI orchestration
​.github/workflows/ci.yml
New central workflow that computes builder-image/runtime tags, conditionally creates builder image, and orchestrates parallel Java/C++ build, test, and integration jobs.
Builder image provisioning
​.github/workflows/create-builder-image.yml
New workflow to check for builder image via docker manifest inspect and build/push only if absent; fork PRs require explicit environment approval.
Reusable Java build
​.github/workflows/presto-build.yml
New reusable workflow to build Presto Java artifacts (optional image build), using builder-image-provided Maven repo and metadata-driven image tagging.
C++ build & images
​.github/workflows/prestocpp-linux-build-and-unit-test.yml
Converted to workflow_call with builder-image and runtime tag inputs; builds native artifact, uploads presto-native-build, and builds/pushes prestissimo runtime image.
Integration tests
​.github/workflows/integration-tests.yml
New workflow to run end-to-end Java/C++ integration tests, restores prebuilt artifacts, sets MAVEN_REPO, and runs Maven tests with native libs and LD_LIBRARY_PATH.
Tests matrix
​.github/workflows/tests.yml
Moved to workflow_call and containerized builder-image-driven matrix across Java versions and test modules; centralizes MAVEN_REPO usage.
Removed workflows
​.github/workflows/maven-checks.yml, ​.github/workflows/prestissimo-worker-images-build.yml
Deleted older Maven checks and prestissimo-worker image build workflows; responsibilities consolidated into new workflows.
Branch filter updates
​.github/workflows/docs.yml, ​.github/workflows/pr-title-checks.yaml, ​.github/workflows/prestocpp-format-and-header-check.yml
Replaced exact-branch filters with prefix pattern release-0.293-clp-connector-snapshot* for concurrency/cancellation and PR triggers; minor step name casing changes.

Sequence Diagram(s)

sequenceDiagram
    participant Trigger as Developer / Git event
    participant Resolve as resolve-config job
    participant Registry as GHCR (builder-image)
    participant Builder as create-builder-image job
    participant CI as CI jobs (presto / prestocpp / integration)
    participant MAVEN as MAVEN_REPO (cache)
    participant Artifacts as Artifact storage / Actions

    Trigger->>Resolve: compute builder-image tag, runtime tags
    Resolve->>Registry: check manifest (exists?)
    alt image missing
        Registry->>Builder: request build
        Builder->>Registry: push builder-image
    end
    Resolve->>CI: provide builder-image & tags
    CI->>Registry: pull builder-image (container)
    CI->>MAVEN: use MAVEN_REPO from builder-image
    CI->>Artifacts: upload build artifacts (presto-server, presto-native-build)
    CI->>Registry: push runtime images (when configured)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically summarizes the main change: replacing a broken stateful CI system with an orchestrated parallel build system using containerization.
Description check ✅ Passed The PR description is comprehensive and addresses all key template sections: Description, Motivation/Context, Impact, and Test Plan. It includes detailed architecture diagrams, problem/solution tables, design decisions, validation results, and a complete checklist.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ci

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🤖 Fix all issues with AI agents
In @.github/dockerfiles/yscope-presto-builder.dockerfile:
- Around line 136-137: The Dockerfile currently masks Maven dependency
resolution failures with "|| true" on the "./mvnw dependency:resolve-plugins
dependency:resolve -B --no-transfer-progress -Dmaven.repo.local=${MAVEN_REPO}"
command; remove the unconditional swallow and instead capture the command's exit
code and, if non‑zero, emit a clear warning (including the exit code and a brief
message) to stderr or the build log before continuing so failures are visible
while preserving non-fatal behavior.
- Around line 64-70: Replace the interactive "apt" call with scripted "apt-get":
in the RUN sequence that calls /presto/scripts/setup-ubuntu.sh, change "apt
install -y rpm" to use "apt-get update && apt-get install -y rpm" (and
optionally follow with "apt-get clean && rm -rf /var/lib/apt/lists/*") so the
Dockerfile uses the stable non-interactive package manager; update the RUN line
invoking /presto/velox/scripts/setup-ubuntu.sh and
/presto/scripts/setup-adapters.sh as needed to keep the same ordering but ensure
all package installs use apt-get.

In @.github/workflows/ci.yml:
- Around line 184-185: The current BASE_VERSION extraction using grep on pom.xml
is fragile (it looks for the first <version> tag); update the CI step that sets
BASE_VERSION to reliably obtain the project version instead: either call Maven
to evaluate project.version (e.g., use mvn help:evaluate
-Dexpression=project.version with quiet/forceStdout flags) and strip -SNAPSHOT,
or use an XML parser like xmllint to select /project/version and then remove
-SNAPSHOT; replace the existing grep|head|sed pipeline where BASE_VERSION is
defined with one of these robust approaches (referencing the BASE_VERSION
variable and the pom.xml file in the workflow).

In @.github/workflows/create-builder-image.yml:
- Around line 129-137: Add layer caching to the "Build and push unified builder
image" step that uses docker/build-push-action@v6 by configuring cache-from and
cache-to (or using type=gha) so subsequent runs reuse image layers; update the
step that currently sets context/file/tags/labels to include cache configuration
(cache-from: type=registry,ref=... or cache-to: type=gha,mode=max) and ensure
the referenced tag input (inputs.builder-image) is used as the cache
key/reference.

In @.github/workflows/integration-tests.yml:
- Around line 98-130: The duplicated restore logic for creating dirs, moving
binaries, and setting execute permissions (handling presto_server and
velox_functions_remote_server_main) should be extracted into a reusable step:
implement a composite GitHub Action (e.g.,
.github/actions/restore-native-artifacts) or a shell script and replace the
repeated blocks in each job with a single use step (uses:
./.github/actions/restore-native-artifacts or run:
./scripts/restore-native-artifacts.sh); ensure the action/script performs the
mkdir -p for the two target paths, moves
${GITHUB_WORKSPACE}/presto_cpp/main/presto_server and
${GITHUB_WORKSPACE}/velox/velox/functions/remote/server/velox_functions_remote_server_main
when present, sets chmod +x on the targets, and exits with an error if
presto_server is missing to preserve current behavior.
- Around line 304-313: The current shell discovery using TESTFILES/TESTCLASSES
and a for-loop is brittle (won’t handle special chars, nested/inner classes, or
non-test files); replace it by invoking Maven/Surefire to enumerate tests or, if
keeping shell, use a null-delimited find (find ... -print0) piped to xargs -0
and filter candidate files by presence of the `@Test` annotation (or a regex
matching typical test class patterns) before extracting the top-level class name
into TESTCLASSES, and apply the same change to the second occurrence (lines
referenced 437-446) so filenames with spaces/special chars and non-test files
are safely excluded.
- Line 333: Remove the redundant JVM system property -DLD_LIBRARY_PATH from the
test invocation and rely on the exported environment variable LD_LIBRARY_PATH
instead; locate the occurrence of the "-DLD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}\""
JVM arg in the workflow step and delete it (this also applies to the duplicate
occurrence later), ensuring the environment export of LD_LIBRARY_PATH remains in
place so child processes inherit it.
- Around line 192-194: The export line that sets LD_LIBRARY_PATH appends
${LD_LIBRARY_PATH} unconditionally which can produce a trailing colon if
LD_LIBRARY_PATH is unset; change it to only append the existing value when
present (e.g., check or conditionally expand LD_LIBRARY_PATH) so the export
LD_LIBRARY_PATH line does not end up with an extra colon—locate the export
LD_LIBRARY_PATH="/usr/local/lib:...:${LD_LIBRARY_PATH}" line and modify it to
conditionally include ${LD_LIBRARY_PATH} only when non-empty.
- Around line 147-151: The current extraction logic can silently pick the wrong
tarball or directory when multiple matches exist; update the script around
PRESTO_SERVER_TAR and PRESTO_SERVER_DIR to explicitly expand the glob, count
matches, and fail fast if the count is not exactly one (zero or >1) with an
error message; set PRESTO_SERVER_TAR to that single match, extract it, then
similarly resolve PRESTO_SERVER_DIR by validating that exactly one
presto-server-*/ directory exists before exporting PRESTO_SERVER_DIR to
GITHUB_ENV (fail the job on validation failure).

In @.github/workflows/presto-build.yml:
- Around line 183-188: The version extraction step ("Extract base version", id
"extract-version") can silently use multiple matching files; update the shell in
that run block to detect multiple matches before calling sed: expand the glob
into a list/array, if more than one match then exit with a clear error (or
explicitly pick the first with a documented head -1 choice), otherwise assign
the single filename to VERSION and then run the sed extraction; ensure the step
fails fast on ambiguous matches and still echoes "base-version=${VERSION}" to
$GITHUB_OUTPUT.

In @.github/workflows/prestocpp-linux-build-and-unit-test.yml:
- Around line 166-168: The workflow emits invalid Docker tags when
inputs.runtime-version-tag or inputs.runtime-snapshot-tag are empty; update the
job that sets the tags (the tags: block) to skip or omit empty inputs by adding
a conditional filter so only non-empty inputs are expanded into ghcr.io/${{
github.repository }}/prestissimo:${{ inputs.runtime-version-tag }} and
ghcr.io/${{ github.repository }}/prestissimo:${{ inputs.runtime-snapshot-tag }},
or alternatively make the inputs required (runtime-version-tag and
runtime-snapshot-tag) so empty values cannot be provided; ensure the push/tag
step checks for non-empty tag strings before attempting to push.

In @.github/workflows/tests.yml:
- Around line 39-44: Remove the unused YAML anchor by deleting "&runner" from
the runs-on definition: update the runs-on key (the block that currently starts
with "runs-on: &runner") to simply be "runs-on:" with the same expression ${{
github.repository_owner == 'y-scope' && fromJSON('["self-hosted", "x64",
"ubuntu-noble"]') || 'ubuntu-latest' }} so the anchor &runner is not defined (no
other changes to the runs-on expression or surrounding job needed).

In `@presto-native-execution/scripts/dockerfiles/prestissimo-runtime.dockerfile`:
- Line 18: The Dockerfile uses a lowercase build stage keyword in the line with
"FROM ${BUILDER_IMAGE} as prestissimo-image"; change the lowercase "as" to
uppercase "AS" so the directive reads "FROM ${BUILDER_IMAGE} AS
prestissimo-image" to keep keyword casing consistent with Dockerfile best
practices.
- Around line 26-28: The RUN instruction that invokes ldd on /tmp/presto_server
should use POSIX-safe negation and enable pipefail so failures in pipes are
detected: change the shell invocation to set -o pipefail (or use "set -euo
pipefail" style) before running the ldd/grep/awk pipeline, use a spaced negation
form (! ldd ... | grep -q "not found") or test grep's exit code with -q to avoid
relying on ! attached to a pipeline, and keep the LD_LIBRARY_PATH and the
copy-to-/runtime-libraries logic intact so the
LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64 ldd /tmp/presto_server | awk 'NF
== 4 { system("cp " $3 " /runtime-libraries") }' step still runs only when no
"not found" is detected.

Comment thread .github/dockerfiles/yscope-presto-builder.dockerfile
Comment thread .github/dockerfiles/yscope-presto-builder.dockerfile Outdated
Comment thread .github/workflows/ci.yml
Comment thread .github/workflows/create-builder-image.yml
Comment thread .github/workflows/integration-tests.yml
Comment thread .github/workflows/presto-build.yml
Comment thread .github/workflows/prestocpp-linux-build-and-unit-test.yml
Comment thread .github/workflows/tests.yml
Comment thread presto-native-execution/scripts/dockerfiles/prestissimo-runtime.dockerfile Outdated
Making runtime-version-tag and runtime-snapshot-tag required inputs
prevents invalid Docker tags (e.g., 'ghcr.io/.../prestissimo:') when
empty values are passed. Since ci.yml always provides these tags,
this is a safe change that enforces the expected contract.
Add validation to ensure exactly one presto-server tarball exists
before extraction. This prevents silent failures when glob pattern
matches multiple files unexpectedly.
- Use ${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH} to avoid trailing colon
  when LD_LIBRARY_PATH is unset
- Remove redundant -DLD_LIBRARY_PATH JVM property since the
  environment variable is already exported and inherited by child
  processes
Add validation to ensure exactly one presto-server tarball exists
before extracting version. This prevents silent failures when glob
pattern matches multiple files unexpectedly.
- Use uppercase AS for consistency with FROM keyword
- Add space after ! for POSIX shell compliance
- Set pipefail to catch failures in intermediate pipe commands
- Use grep -q for silent matching
- Use apt-get instead of apt for scripted installs (more stable CLI)
- Log warning instead of silently ignoring Maven dependency failures
The test was frequently timing out in CI. Double the timeout to
provide more headroom for slower CI environments.
Pin all third-party GitHub Actions to their full commit SHAs to prevent
supply chain attacks from compromised action tags. Version comments
added for maintainability.
YAML comments inside quoted strings are treated as literal text.
Move # vX comments outside the quotes so they are parsed as comments.
@jackluo923 jackluo923 force-pushed the ci branch 2 times, most recently from 8319ab5 to bb0abb3 Compare February 19, 2026 07:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant