feat: create SharedTrust controller for cert-manager cacerts reconciliation#383
feat: create SharedTrust controller for cert-manager cacerts reconciliation#383r3v5 wants to merge 6 commits into
Conversation
Add github.com/cert-manager/cert-manager for typed access to Certificate and ClusterIssuer resources. Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
…embly Pure functions with no Kubernetes dependency: - verifyCAFingerprint: SHA256 comparison of root vs intermediate CA - buildCacertsData: assembles Istio cacerts secret key mapping Includes unit tests for matching/mismatching fingerprints, invalid PEM/DER, correct key mapping, and cert-chain concatenation. Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Event-driven controller watching cert-manager Certificate objects to reconcile cacerts secrets for Istio mTLS. Creates cacerts in istio-system and openshift-ingress with the key format istiod expects. Triggers rollout restarts of istiod and ztunnel when cert data changes. Verifies root CA fingerprints match intermediate issuers; deletes stale intermediates to force re-issuance on mismatch. Registers conditionally via CertManagerCRDExists, following the TektonConfig pattern. Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
Tests with fake client covering certificates missing, not ready, fingerprint mismatch, happy path cacerts creation, data change update, and no-restart when unchanged. Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
End-to-end tests on Kind cluster with cert-manager covering cacerts creation, rebuild after intermediate deletion, restart annotations on istiod and ztunnel, and graceful handling of missing workloads. Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
- Fix gofmt alignment in const block - Replace deprecated result.Requeue with RequeueAfter - Suppress deprecated GetEventRecorderFor (repo-wide, not just ours) - Break long lines in e2e test to stay under 120 chars Ref: RHAIENG-4901 Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Ian Miller <milleryan2003@gmail.com>
…n synthesizePipeline ## Why When a namespace's `authbridge-runtime-config` ConfigMap has no `pipeline:` of its own, the webhook synthesizes one from the namespace's `authbridge-config` env-var contract (NamespaceConfig). The synthesis passed `keycloak_url` + `keycloak_realm` only to the outbound token-exchange plugin. jwt-validation got just `issuer`. kagenti-extensions#383 extends the jwt-validation plugin to accept these two fields and derive jwks_url from the INTERNAL Keycloak URL — the sidecar actually reaches the JWKS endpoint from inside the cluster, and `issuer` is the PUBLIC hostname (required for `iss`-claim matching but typically unreachable from inside the pod). Without the keycloak_* hints, jwt-validation falls back to issuer-derivation and every inbound request fails with "connection refused" fetching the JWKS → 401. ## Scope Pure addition to `synthesizePipeline`: the same two fields that already feed token-exchange now also feed jwt-validation. Both plugins end up with the same "where is Keycloak internally" hint, mirroring the pre-PR-kagenti#378 binary behavior where jwks_url was derived from outbound.token_url via a cross-plugin pass. Test: extend `TestEnsurePerAgentConfigMap_EmptyBaseYAML_Fallback FromNsConfig` to assert jwt-validation now receives keycloak_url and keycloak_realm. ## Compatibility Requires kagenti-extensions ≥ the tag that includes kagenti#383 — older authbridge binaries reject the new fields at DisallowUnknownFields decode. The chart in kagenti/kagenti#1507 bumps both pins together. Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Hai Huang <huang195@gmail.com>
cwiklik
left a comment
There was a problem hiding this comment.
Well-structured PR with clean commit history, good test coverage (unit + e2e), and proper conditional registration pattern. The fingerprint verification logic is correct — it compares the root CA cert against the ca.crt field (issuing CA) stored in the intermediate's secret, validating the trust chain. All CI checks pass.
Areas reviewed: Go (controller logic, helpers, tests), YAML (RBAC)
Commits: 6 commits, all signed-off: yes
CI status: all 15 checks passing
| Namespace string | ||
| }{ | ||
| {IstioSystemSecretName, IstioSystemNamespace}, | ||
| {OpenShiftIngressSecretName, OpenShiftIngressNamespace}, |
There was a problem hiding this comment.
suggestion: intermediateConfigs, staleConfigMapNamespaces, and all workload names (istiod, ztunnel, etc.) are hardcoded package-level vars. Consider making these configurable via the reconciler struct or env vars to support non-standard Istio deployments (e.g., different namespace names, custom istiod deployment names).
Not blocking — the hardcoded values match the standard RHOAI/OSSM3 layout — but it would improve portability.
| return ctrl.NewControllerManagedBy(mgr). | ||
| Named("SharedTrust"). | ||
| Watches(&cmv1.Certificate{}, mapToCertificateReconcile). | ||
| Complete(r) |
There was a problem hiding this comment.
suggestion: The controller only watches Certificate objects (line 364: Watches(&cmv1.Certificate{}, ...)). If someone manually modifies or deletes a cacerts secret, the controller won't detect it until the next Certificate event triggers reconciliation.
Consider adding a filtered Secret watch on the managed namespaces (istio-system, openshift-ingress) to self-heal from manual drift.
| for attempt := range 3 { | ||
| if attempt > 0 { | ||
| delay := time.Duration(attempt) * 5 * time.Second | ||
| sharedTrustLogger.Info("Retrying cert-manager CRD discovery", "attempt", attempt+1, "delay", delay) |
There was a problem hiding this comment.
nit: CertManagerCRDExists blocks operator startup with time.Sleep (up to 15s total across retries). Consider using wait.PollUntilContextTimeout for cleaner cancellation semantics, especially if the operator receives a shutdown signal during startup.
| return false, fmt.Errorf("parsing intermediate CA: %w", err) | ||
| } | ||
|
|
||
| return rootFP == intFP, nil |
There was a problem hiding this comment.
nit: certFingerprint only decodes the first PEM block. If ca.crt contains a chain (multiple concatenated PEM blocks), only the first cert is fingerprinted. This works for the current cert-manager layout but is worth a one-line comment noting the assumption.
Summary
cacertssecrets for Istio mTLS in
istio-systemandopenshift-ingressthe new CA chain
re-issuance on mismatch
Jira
RHAIENG-4901
Next steps
When current PR is merged, I will submit this change
kagenti/scripts/ocp/setup-kagenti.sh(remove operator-owned steps from _ensure_rhoai_shared_trust; retain cluster-admin/static install steps)Test plan
graceful missing workload handling
CA chain, ztunnel mTLS with SPIFFE identities, cert rotation triggers rebuild
go test ./internal/controller/ -run "TestReconcile_|TestVerifyCA|TestBuildCacerts" -vexport KIND_EXPERIMENTAL_PROVIDER=podman && CONTAINER_TOOL=podman KIND_CLUSTER=kagenti go test ./test/e2e/ -v -count=1 -timeout 20m --ginkgo.focus="SharedTrust"