feat(e2e): support running e2e tests on real OCP clusters with Prometheus alert validation #205

Open
rlobillo wants to merge 4 commits into openshift-virtualization:main from rlobillo:e2e-support-real-ocp

@rlobillo rlobillo commented Jun 10, 2026

Contributor

Short Description

Enable running e2e tests against real OCP clusters (not just Kind) and add Prometheus alert tests for VirtPlatformSyncFailed and VirtPlatformDependencyMissing.

More details

The existing e2e suite was designed to run on Kind only. This PR adapts the test lifecycle to be non-destructive on real clusters and adds a new make run-e2e-tests-only target that runs Ginkgo directly against whatever cluster KUBECONFIG points to, producing JUnit XML and JSON reports in _output/.

Two new Prometheus alert tests are added:

  • VirtPlatformSyncFailed (table-driven): creates a ValidatingWebhookConfiguration that blocks SSA Create and Update for each managed asset, then triggers reconciliation so the SSA dry-run fails and the CNV-89450 fix sets compliance_status=0. Asserts the alert fires with correct labels (kind, name, severity: critical, operator: virt-platform-autopilot). Assets whose CRD or gateCRD are missing are skipped automatically.
  • VirtPlatformDependencyMissing (passive): queries the /metrics endpoint for kubevirt_autopilot_missing_dependency == 1, then verifies a warning alert fires for each missing optional CRD individually.
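
The passive check above depends on picking out the series of kubevirt_autopilot_missing_dependency whose value is 1 from the /metrics text output. A minimal sketch of that filtering step follows, using only the standard library. The function name missingDependencies and the label names in the sample (kind, version) are hypothetical and are not taken from this PR; the parser also assumes label values contain no commas or escaped quotes.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// missingDependencies scans Prometheus text-format output and returns the
// label set of every series of the given metric whose sample value is 1.
// Hypothetical helper: only series written as metric{...} are considered,
// and label values are assumed to contain no commas or escaped quotes.
func missingDependencies(metricsText, metric string) []map[string]string {
	var out []map[string]string
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blanks, HELP/TYPE comments, and unrelated metrics.
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		// The first field after the closing brace is the sample value.
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil || value != 1 {
			continue
		}
		labels := map[string]string{}
		for _, pair := range strings.Split(line[len(metric)+1:end], ",") {
			k, v, ok := strings.Cut(pair, "=")
			if !ok {
				continue
			}
			labels[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
		}
		out = append(out, labels)
	}
	return out
}

func main() {
	// Sample input; the label names are assumptions for illustration.
	sample := `# TYPE kubevirt_autopilot_missing_dependency gauge
kubevirt_autopilot_missing_dependency{kind="NodeHealthCheck",version="v1alpha1"} 1
kubevirt_autopilot_missing_dependency{kind="UIPlugin",version="v1alpha1"} 0
`
	for _, labels := range missingDependencies(sample, "kubevirt_autopilot_missing_dependency") {
		fmt.Println("missing dependency:", labels["kind"])
	}
}
```

Returning the full label set per series, rather than a single name, lets the caller build one alert query per missing CRD without the parser needing to know which labels the operator exposes.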

What this PR does / why we need it

Commit 1: feat(e2e): support running e2e tests against real OCP clusters

  • Add make run-e2e-tests-only target (runs Ginkgo with JUnit/JSON reports)
  • Adapt drift_test.go and anti_thrashing_e2e_test.go to use ensureHCOExists() / patchAutopilotAndWait() instead of deleting/recreating HCO (non-destructive on real clusters)
  • crd_lifecycle_test.go skips on OCP (CRD creation/deletion only meaningful on Kind)
  • Add isOpenShiftCluster() helper (detects OCP via ClusterVersion CRD presence)
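
Detecting OCP through ClusterVersion presence can be sketched as below. This is not the PR's isOpenShiftCluster() implementation: the resourceLister type is a hypothetical stand-in for an API discovery call (in client-go, ServerResourcesForGroupVersion on the discovery client plays a similar role), and the names looksLikeOpenShift and kindLister are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceLister is a hypothetical abstraction over API discovery: given a
// group/version, it returns the plural resource names served by the cluster.
type resourceLister func(groupVersion string) ([]string, error)

// looksLikeOpenShift reports whether the cluster serves the ClusterVersion
// resource in config.openshift.io/v1. A discovery error is treated as
// "not OpenShift", which is the safe default for a Kind-only test path.
func looksLikeOpenShift(list resourceLister) bool {
	resources, err := list("config.openshift.io/v1")
	if err != nil {
		return false
	}
	for _, r := range resources {
		if r == "clusterversions" {
			return true
		}
	}
	return false
}

func main() {
	// Simulated Kind cluster: the OpenShift config group is not served.
	kindLister := func(gv string) ([]string, error) {
		return nil, errors.New("group version not found: " + gv)
	}
	fmt.Println("Kind detected as OCP:", looksLikeOpenShift(kindLister))
}
```

A spec such as crd_lifecycle_test.go could then call Ginkgo's Skip when the check returns true, though how the PR wires this in is not shown here.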

Commit 2: feat(e2e): add Prometheus alert tests for OCP

  • Table-driven VirtPlatformSyncFailed tests for all phase-1 assets (MachineConfig, KubeletConfig, NodeHealthCheck, UIPlugin, KubeDescheduler)
  • queryFiringAlert() returns map[string]string labels with PromQL label filters — concise logs ("not firing yet" / "firing — kind=X name=Y severity=Z")
  • queryMetricExists() with clean logs ("not found yet" / "found (N series)")
  • touchHCO() triggers reconciliation before metrics wait (handles idle pods with no recent metrics)
  • getMissingDependenciesFromMetrics() parses /metrics for individual missing dependencies
  • Prometheus querying via thanos-querier route with SA token authentication
  • webhookCreated flag guards AfterEach cleanup (prevents timeouts on skipped/failed tests)
  • PrometheusRule alert "for" durations reduced to 15s for faster test feedback
  • Metrics wait timeout: 10min (handles post-MCO-rollout scenarios, CNV-89454)
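
To illustrate the label-filtered alert query described in this commit, here is a hedged sketch of building a PromQL selector on the built-in ALERTS series and extracting the labels of the first firing alert from a Prometheus HTTP API response. It is not the PR's queryFiringAlert(): the names alertQuery and firingAlertLabels are hypothetical, and the HTTP call to the thanos-querier route with the SA token is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// alertQuery builds a selector on ALERTS that matches only firing instances
// of alertName carrying the given extra labels. Keys are sorted so the query
// string is deterministic, which keeps logs stable between retries.
func alertQuery(alertName string, filters map[string]string) string {
	keys := make([]string, 0, len(filters))
	for k := range filters {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{
		fmt.Sprintf("alertname=%q", alertName),
		`alertstate="firing"`,
	}
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, filters[k]))
	}
	return "ALERTS{" + strings.Join(parts, ",") + "}"
}

// queryResponse mirrors the subset of the /api/v1/query instant-vector
// response that this sketch needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

// firingAlertLabels returns the labels of the first series in the response,
// or nil when the alert is not firing yet.
func firingAlertLabels(body []byte) (map[string]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return nil, nil
	}
	return resp.Data.Result[0].Metric, nil
}

func main() {
	q := alertQuery("VirtPlatformSyncFailed", map[string]string{"kind": "MachineConfig"})
	fmt.Println(q)

	// Hand-written response body for illustration only.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[]}}`)
	labels, err := firingAlertLabels(body)
	fmt.Println(labels == nil, err)
}
```

Returning nil labels for an empty result lets a polling loop distinguish "not firing yet" from a real query error without inspecting the raw response.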

Commit 3: fix(e2e): remove unused namespace param from captureAssetMetrics

  • Fix unparam lint: captureAssetMetrics namespace parameter always received operatorNamespace — use the package-level variable directly

Commit 4: fix(e2e): block Update ops in webhook to trigger SSA dry-run failure

  • The blocking webhook only intercepted Create operations, but SSA dry-run uses Patch (Update). With the CNV-89450 fix now setting compliance_status=0 on dry-run failure, the webhook must also block Update so DetectDrift() errors out and the VirtPlatformSyncFailed alert fires.
  • Removes the old workaround of deleting the resource to bypass dry-run drift detection.
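
The reasoning in this commit can be illustrated with a small model of how admission webhook rules match operations. A server-side apply request against an object that already exists is admitted as an UPDATE, and against a missing object as a CREATE, so a rule that lists only CREATE never sees the dry-run on an existing asset. The sketch below is a simplified model with hypothetical helper names, not the PR's webhook code.

```go
package main

import "fmt"

// admissionOperation models which operation admission sees for a
// server-side apply request, depending on whether the target exists.
func admissionOperation(objectExists bool) string {
	if objectExists {
		return "UPDATE"
	}
	return "CREATE"
}

// ruleMatches reports whether a webhook rule listing ops intercepts op.
// "*" matches every operation, as in webhook rule definitions.
func ruleMatches(ops []string, op string) bool {
	for _, o := range ops {
		if o == op || o == "*" {
			return true
		}
	}
	return false
}

func main() {
	createOnly := []string{"CREATE"}
	createAndUpdate := []string{"CREATE", "UPDATE"}

	// The managed asset already exists, so the apply is admitted as UPDATE.
	op := admissionOperation(true)
	fmt.Println("CREATE-only rule intercepts:", ruleMatches(createOnly, op))
	fmt.Println("CREATE+UPDATE rule intercepts:", ruleMatches(createAndUpdate, op))
}
```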

🤖 Generated with Claude Code

@openshift-ci

openshift-ci Bot commented Jun 10, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tiraboschi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

Contributor

Generated Files Verification Failed

One or more generated files in this PR are out of sync:

  • CRDs: Run make update-crds if CRD verification failed
  • RBAC: Run make generate-rbac if RBAC verification failed

Please regenerate the files locally and commit the changes.

@rlobillo rlobillo force-pushed the e2e-support-real-ocp branch from ddf1ff8 to 2aa2c76 Compare June 10, 2026 14:57

@rlobillo

Contributor Author

/hold

Tests need to be updated as soon as #207 is ready on a build.

rlobillo and others added 3 commits June 11, 2026 09:27
Add `make run-e2e` target that runs ginkgo directly against whatever
cluster KUBECONFIG points to, without the Kind setup/teardown cycle.

Adapt test setup to be non-destructive on real clusters:
- drift_test.go and anti_thrashing_e2e_test.go use ensureHCOExists()
  and patchAutopilotAndWait() instead of deleting/recreating HCO
- crd_lifecycle_test.go skips automatically on OCP (detected via
  ClusterVersion CRD presence) since CRD creation/deletion is only
  meaningful on Kind
- Add isOpenShiftCluster() helper to helpers_test.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, VirtPlatformDependencyMissing)

Table-driven tests for VirtPlatformSyncFailed using blocking webhooks
to trigger SSA failures, with CNV-89450 workaround (delete resource to
bypass dry-run). Passive test for VirtPlatformDependencyMissing when
optional CRDs are absent. Tests skip on Kind (no Prometheus) and when
gate CRDs are missing. Includes helpers for Prometheus querying via
thanos-querier route with SA token, PrometheusRule patching, and
metrics infrastructure readiness check (CNV-89454).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The unparam linter flagged that the namespace parameter always receives
operatorNamespace. Use the package-level variable directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rlobillo rlobillo force-pushed the e2e-support-real-ocp branch from f2e3f9f to c90c024 Compare June 11, 2026 07:30

The blocking webhook only intercepted Create operations, but SSA
dry-run uses Patch (Update). With the CNV-89450 fix now setting
compliance_status=0 on dry-run failure, the webhook must also block
Update so DetectDrift() errors out and the VirtPlatformSyncFailed
alert fires. Also removes the old workaround of deleting the resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rlobillo

Contributor Author

All tests passing on real OCP cluster

$ tail e2e.log 
autogenerated by Ginkgo
[ReportAfterSuite] PASSED [0.013 seconds]
------------------------------

Ran 17 of 27 Specs in 325.317 seconds
SUCCESS! -- 17 Passed | 0 Failed | 0 Pending | 10 Skipped
PASS

Ginkgo ran 1 suite in 5m27.191502866s
Test Suite Passed
