Skip to content

feat: create SharedTrust controller for cert-manager cacerts reconciliation#383

Open
r3v5 wants to merge 6 commits into
kagenti:mainfrom
r3v5:shared-trust-controller
Open

feat: create SharedTrust controller for cert-manager cacerts reconciliation#383
r3v5 wants to merge 6 commits into
kagenti:mainfrom
r3v5:shared-trust-controller

Conversation

@r3v5
Copy link
Copy Markdown
Contributor

@r3v5 r3v5 commented May 28, 2026

Summary

  • Add event-driven SharedTrust controller that watches cert-manager Certificate objects and reconciles cacerts
    secrets for Istio mTLS in istio-system and openshift-ingress
  • On cert data changes, trigger rollout restarts of istiod, istiod-openshift-gateway, and ztunnel so they reload
    the new CA chain
  • Verify root CA fingerprints match intermediate CA issuers; delete stale intermediates to force cert-manager
    re-issuance on mismatch
  • Controller registers conditionally — skips startup if cert-manager CRDs are absent

Jira

RHAIENG-4901

Next steps

When current PR is merged, I will submit this change kagenti/scripts/ocp/setup-kagenti.sh (remove operator-owned steps from _ensure_rhoai_shared_trust; retain cluster-admin/static install steps)

Test plan

  • Unit tests: fingerprint verification, cacerts assembly, all reconcile paths (fake client)
  • E2E tests: Kind cluster with cert-manager — cacerts creation, rebuild after deletion, restart annotations,
    graceful missing workload handling
  • Manual verification on Kind with Istio ambient mode — full mTLS connectivity confirmed
  • Manual verification on OpenShift (ROSA) with OSSM3 — cacerts created, fingerprints match, istiod loads our
    CA chain, ztunnel mTLS with SPIFFE identities, cert rotation triggers rebuild
  • Run: go test ./internal/controller/ -run "TestReconcile_|TestVerifyCA|TestBuildCacerts" -v
  • Run: export KIND_EXPERIMENTAL_PROVIDER=podman && CONTAINER_TOOL=podman KIND_CLUSTER=kagenti go test ./test/e2e/ -v -count=1 -timeout 20m --ginkgo.focus="SharedTrust"
  • Manual verification on OpenShift 4.19.18 (ROSA) with RHOAI 3.4.0, OSSM3 v3.2.2, Istio v1.27.5

r3v5 added 5 commits May 28, 2026 13:56
Add github.com/cert-manager/cert-manager for typed access to
Certificate and ClusterIssuer resources.

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
…embly

Pure functions with no Kubernetes dependency:
- verifyCAFingerprint: SHA256 comparison of root vs intermediate CA
- buildCacertsData: assembles Istio cacerts secret key mapping

Includes unit tests for matching/mismatching fingerprints, invalid
PEM/DER, correct key mapping, and cert-chain concatenation.

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Event-driven controller watching cert-manager Certificate objects to
reconcile cacerts secrets for Istio mTLS. Creates cacerts in
istio-system and openshift-ingress with the key format istiod expects.

Triggers rollout restarts of istiod and ztunnel when cert data changes.
Verifies root CA fingerprints match intermediate issuers; deletes stale
intermediates to force re-issuance on mismatch.

Registers conditionally via CertManagerCRDExists, following the
TektonConfig pattern.

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Tests with fake client covering certificates missing, not ready,
fingerprint mismatch, happy path cacerts creation, data change update,
and no-restart when unchanged.

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
End-to-end tests on Kind cluster with cert-manager covering cacerts
creation, rebuild after intermediate deletion, restart annotations on
istiod and ztunnel, and graceful handling of missing workloads.

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
- Fix gofmt alignment in const block
- Replace deprecated result.Requeue with RequeueAfter
- Suppress deprecated GetEventRecorderFor (repo-wide, not just ours)
- Break long lines in e2e test to stay under 120 chars

Ref: RHAIENG-4901

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ian Miller <milleryan2003@gmail.com>
odh-devops-app Bot pushed a commit to opendatahub-io/agents-operator that referenced this pull request May 29, 2026
…n synthesizePipeline

## Why

When a namespace's `authbridge-runtime-config` ConfigMap has no
`pipeline:` of its own, the webhook synthesizes one from the
namespace's `authbridge-config` env-var contract (NamespaceConfig).
The synthesis passed `keycloak_url` + `keycloak_realm` only to the
outbound token-exchange plugin. jwt-validation got just `issuer`.

kagenti-extensions#383 extends the jwt-validation plugin to accept
these two fields and derive jwks_url from the INTERNAL Keycloak
URL — the sidecar actually reaches the JWKS endpoint from inside
the cluster, and `issuer` is the PUBLIC hostname (required for
`iss`-claim matching but typically unreachable from inside the
pod). Without the keycloak_* hints, jwt-validation falls back to
issuer-derivation and every inbound request fails with
"connection refused" fetching the JWKS → 401.

## Scope

Pure addition to `synthesizePipeline`: the same two fields that
already feed token-exchange now also feed jwt-validation. Both
plugins end up with the same "where is Keycloak internally" hint,
mirroring the pre-PR-kagenti#378 binary behavior where jwks_url was
derived from outbound.token_url via a cross-plugin pass.

Test: extend `TestEnsurePerAgentConfigMap_EmptyBaseYAML_Fallback
FromNsConfig` to assert jwt-validation now receives keycloak_url
and keycloak_realm.

## Compatibility

Requires kagenti-extensions ≥ the tag that includes kagenti#383 — older
authbridge binaries reject the new fields at DisallowUnknownFields
decode. The chart in kagenti/kagenti#1507 bumps both pins together.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
@cwiklik cwiklik self-requested a review May 29, 2026 14:12
Copy link
Copy Markdown
Collaborator

@cwiklik cwiklik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR with clean commit history, good test coverage (unit + e2e), and proper conditional registration pattern. The fingerprint verification logic is correct — it compares the root CA cert against the ca.crt field (issuing CA) stored in the intermediate's secret, validating the trust chain. All CI checks pass.

Areas reviewed: Go (controller logic, helpers, tests), YAML (RBAC)
Commits: 6 commits, all signed-off: yes
CI status: all 15 checks passing

Namespace string
}{
{IstioSystemSecretName, IstioSystemNamespace},
{OpenShiftIngressSecretName, OpenShiftIngressNamespace},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: intermediateConfigs, staleConfigMapNamespaces, and all workload names (istiod, ztunnel, etc.) are hardcoded package-level vars. Consider making these configurable via the reconciler struct or env vars to support non-standard Istio deployments (e.g., different namespace names, custom istiod deployment names).

Not blocking — the hardcoded values match the standard RHOAI/OSSM3 layout — but it would improve portability.

return ctrl.NewControllerManagedBy(mgr).
Named("SharedTrust").
Watches(&cmv1.Certificate{}, mapToCertificateReconcile).
Complete(r)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The controller only watches Certificate objects (line 364: Watches(&cmv1.Certificate{}, ...)). If someone manually modifies or deletes a cacerts secret, the controller won't detect it until the next Certificate event triggers reconciliation.

Consider adding a filtered Secret watch on the managed namespaces (istio-system, openshift-ingress) to self-heal from manual drift.

for attempt := range 3 {
if attempt > 0 {
delay := time.Duration(attempt) * 5 * time.Second
sharedTrustLogger.Info("Retrying cert-manager CRD discovery", "attempt", attempt+1, "delay", delay)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: CertManagerCRDExists blocks operator startup with time.Sleep (up to 15s total across retries). Consider using wait.PollUntilContextTimeout for cleaner cancellation semantics, especially if the operator receives a shutdown signal during startup.

return false, fmt.Errorf("parsing intermediate CA: %w", err)
}

return rootFP == intFP, nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: certFingerprint only decodes the first PEM block. If ca.crt contains a chain (multiple concatenated PEM blocks), only the first cert is fingerprinted. This works for the current cert-manager layout but is worth a one-line comment noting the assumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New /:ToDo

Development

Successfully merging this pull request may close these issues.

3 participants