feat: MLflow operand controller for CR lifecycle and OTEL RBAC #378
Draft
Bobbins228 wants to merge 2 commits into
Adds MLflowOperandReconciler that watches DataScienceCluster and:
- Creates the MLflow CR in redhat-ods-applications when DSC mlflowoperator is Managed
- Waits for MLflow Service/Endpoints readiness via requeue
- Creates per-agent-namespace RoleBindings granting otel-collector SA access to MLflow ClusterRoles for trace export
- Re-reconciles when DSC MLflow component state changes

RBAC updates:
- Expand mlflows verbs to include create/update
- Add datascienceclusters get/list/watch
- Add clusterroles get and bind (scoped to mlflow-operator roles)
- Add endpoints to core resource watches

Registered under --enable-mlflow gate alongside existing MLflowReconciler. Uses DSC v2 API (mlflowoperator field is v2-only).

Ref: RHAIENG-4902
Assisted-By: cursor
Signed-off-by: Bobbins228 <mcampbel@redhat.com>
- Set service-ca.crt as ca_file for MLflow TLS verification
- Add mlflow.workspace and mlflow.experimentName Helm values for RHOAI workspace/experiment header configuration
- Create MLflow experiment via API at bootstrap using configured name
- Wire --mlflow-workspace and --mlflow-experiment-name CLI flags
- Prefer mlflow-operator-mlflow-integration ClusterRole for gatewayendpoints/use permission

Signed-off-by: Bobbins228 <mcampbel@redhat.com>
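The "create or retrieve the experiment, then set the headers" step from this commit could look roughly like the sketch below. It assumes the standard MLflow REST endpoints `experiments/get-by-name` and `experiments/create` are reachable under the base URL; the PR does not show the actual URL layout on RHOAI. The function names `ensureExperiment`, `exporterHeaders`, and `fakeMLflow` are hypothetical, and the fake server exists only so the example runs without a cluster.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// ensureExperiment returns the ID of the named experiment, creating it if the
// lookup returns 404. Every request carries the x-mlflow-workspace header.
func ensureExperiment(c *http.Client, baseURL, workspace, name string) (string, error) {
	get, err := http.NewRequest(http.MethodGet,
		baseURL+"/api/2.0/mlflow/experiments/get-by-name?experiment_name="+url.QueryEscape(name), nil)
	if err != nil {
		return "", err
	}
	get.Header.Set("x-mlflow-workspace", workspace)
	resp, err := c.Do(get)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		var found struct {
			Experiment struct {
				ID string `json:"experiment_id"`
			} `json:"experiment"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&found); err != nil {
			return "", err
		}
		return found.Experiment.ID, nil
	}
	if resp.StatusCode != http.StatusNotFound {
		return "", fmt.Errorf("get-by-name: unexpected status %d", resp.StatusCode)
	}

	body, err := json.Marshal(map[string]string{"name": name})
	if err != nil {
		return "", err
	}
	post, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/2.0/mlflow/experiments/create", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	post.Header.Set("Content-Type", "application/json")
	post.Header.Set("x-mlflow-workspace", workspace)
	created, err := c.Do(post)
	if err != nil {
		return "", err
	}
	defer created.Body.Close()
	if created.StatusCode != http.StatusOK {
		return "", fmt.Errorf("create: unexpected status %d", created.StatusCode)
	}
	var out struct {
		ID string `json:"experiment_id"`
	}
	if err := json.NewDecoder(created.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

// exporterHeaders builds the two headers the commit says RHOAI requires.
func exporterHeaders(workspace, experimentID string) map[string]string {
	return map[string]string{
		"x-mlflow-workspace":     workspace,
		"x-mlflow-experiment-id": experimentID,
	}
}

// fakeMLflow is an in-memory stand-in for the MLflow API, for this example
// only. It is not safe for concurrent requests.
func fakeMLflow() *httptest.Server {
	experiments := map[string]string{}
	mux := http.NewServeMux()
	mux.HandleFunc("/api/2.0/mlflow/experiments/get-by-name", func(w http.ResponseWriter, r *http.Request) {
		id, ok := experiments[r.URL.Query().Get("experiment_name")]
		if !ok {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		_ = json.NewEncoder(w).Encode(map[string]any{"experiment": map[string]string{"experiment_id": id}})
	})
	mux.HandleFunc("/api/2.0/mlflow/experiments/create", func(w http.ResponseWriter, r *http.Request) {
		var in struct {
			Name string `json:"name"`
		}
		_ = json.NewDecoder(r.Body).Decode(&in)
		id := fmt.Sprint(len(experiments) + 1)
		experiments[in.Name] = id
		_ = json.NewEncoder(w).Encode(map[string]string{"experiment_id": id})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeMLflow()
	defer srv.Close()
	id, err := ensureExperiment(srv.Client(), srv.URL, "team1", "kagenti-traces")
	if err != nil {
		panic(err)
	}
	fmt.Println(exporterHeaders("team1", id))
}
```

Looking the experiment up before creating it keeps the bootstrap idempotent across operator restarts: a second run finds the existing experiment and reuses its ID.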
odh-devops-app Bot pushed a commit to opendatahub-io/agents-operator that referenced this pull request on May 29, 2026
Follow-up to kagenti-extensions PR kagenti#378, which replaced authbridge's top-level `inbound:` / `outbound:` / `identity:` / `bypass:` / `routes:` YAML blocks with per-plugin config under `pipeline.*.plugins[].config`. The new binary rejects any config whose pipeline is empty (which is what yaml.v3's silent drop of unknown keys produces when the top-level blocks are removed), so the webhook had to stop synthesizing the old shape.

What changed in pod_mutator.go:

ensurePerAgentConfigMap's fallback-synthesis branch is rewritten.

Previously: if baseYAML was missing any of `inbound`, `outbound`, `identity`, or `bypass`, synthesize each from NamespaceConfig values.

Now: if baseYAML has no `pipeline:` section at all, synthesize one with jwt-validation (inbound) and token-exchange (outbound) entries whose configs carry Issuer / Keycloak URL / realm / default policy / identity type from NamespaceConfig. File-path defaults (audience_file, client_id_file, client_secret_file, jwt_svid_path) are no longer emitted by the webhook; the authbridge plugin applies them from its own convention layer at Configure time (see authbridge/authlib/plugins/CONVENTIONS.md). This keeps the webhook schema-agnostic about paths and reduces duplication.

When baseYAML already has a `pipeline:` section (the post-migration Kagenti Helm chart emits it), synthesis is skipped entirely. Only mode + listener overrides layer on top. The Helm chart owns plugin config contents; the webhook owns per-agent mode selection.

Tests:
- TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig: asserts the synthesized pipeline has the correct plugin names and config values. Navigates pipeline.<direction>.plugins[<name>] via a new pluginConfigAt helper to keep assertions compact.
- TestEnsurePerAgentConfigMap_BaseYAML_PreservesExistingFields: baseYAML rewritten to new shape; asserts plugin config is preserved verbatim.
- TestEnsurePerAgentConfigMap_ListenerOverrides_Merged: baseYAML rewritten to new shape; listener assertions unchanged (listener is top-level in both schemas).
- TestEnsurePerAgentConfigMap_FederatedJWT_MapsToSpiffe: asserts identity.type = spiffe under token-exchange. Dropped assertions for file-path defaults (those moved into the plugin).

E2E fixtures:
- test/e2e/fixtures.go: two hardcoded ConfigMaps with old-shape YAML rewritten to new per-plugin shape.

Image tags: none bumped. The operator references authbridge images as `:latest` throughout; once PR kagenti#378 merges and CI publishes new `:latest`, operator CI picks it up. The version-pinning (and atomic authbridge+operator cutover) lives in the kagenti Helm chart PR that comes next.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
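The fallback rule in that commit message (synthesize a `pipeline:` only when the base config has none) might be sketched as follows. This is an illustration under assumptions, not the webhook's code: `ensurePipeline`, the `Plugin` type, and the `NamespaceConfig` field names are hypothetical, the nesting of `identity.type` under the token-exchange config is inferred from the test description, and the default policy field is omitted because its key is not named in the text.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NamespaceConfig carries per-namespace values (hypothetical field names).
type NamespaceConfig struct {
	Issuer        string
	KeycloakURL   string
	KeycloakRealm string
	IdentityType  string
}

// Plugin mirrors one entry of pipeline.<direction>.plugins[].
type Plugin struct {
	Name   string         `json:"name"`
	Config map[string]any `json:"config"`
}

// ensurePipeline returns base unchanged when it already has a "pipeline" key;
// otherwise it returns a copy with a synthesized pipeline added.
func ensurePipeline(base map[string]any, ns NamespaceConfig) map[string]any {
	if _, ok := base["pipeline"]; ok {
		return base // the Helm chart owns plugin config contents
	}
	out := make(map[string]any, len(base)+1)
	for k, v := range base {
		out[k] = v
	}
	out["pipeline"] = map[string]any{
		"inbound": map[string]any{"plugins": []Plugin{{
			Name:   "jwt-validation",
			Config: map[string]any{"issuer": ns.Issuer},
		}}},
		"outbound": map[string]any{"plugins": []Plugin{{
			Name: "token-exchange",
			Config: map[string]any{
				"keycloak_url":   ns.KeycloakURL,
				"keycloak_realm": ns.KeycloakRealm,
				"identity":       map[string]any{"type": ns.IdentityType}, // assumed nesting
			},
		}}},
	}
	return out
}

func main() {
	ns := NamespaceConfig{
		Issuer:        "https://keycloak.example.com/realms/demo", // placeholder values
		KeycloakURL:   "http://keycloak.keycloak.svc:8080",
		KeycloakRealm: "demo",
		IdentityType:  "spiffe",
	}
	cfg := ensurePipeline(map[string]any{"mode": "envoy-sidecar"}, ns)
	b, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(b))
}
```

No file-path keys appear in the synthesized configs, in line with the commit's statement that those defaults now come from the plugin's own convention layer.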
odh-devops-app Bot pushed a commit to opendatahub-io/agents-operator that referenced this pull request on May 29, 2026
…n synthesizePipeline

## Why

When a namespace's `authbridge-runtime-config` ConfigMap has no `pipeline:` of its own, the webhook synthesizes one from the namespace's `authbridge-config` env-var contract (NamespaceConfig). The synthesis passed `keycloak_url` + `keycloak_realm` only to the outbound token-exchange plugin. jwt-validation got just `issuer`.

kagenti-extensions#383 extends the jwt-validation plugin to accept these two fields and derive jwks_url from the INTERNAL Keycloak URL: the sidecar actually reaches the JWKS endpoint from inside the cluster, and `issuer` is the PUBLIC hostname (required for `iss`-claim matching but typically unreachable from inside the pod). Without the keycloak_* hints, jwt-validation falls back to issuer-derivation and every inbound request fails with "connection refused" fetching the JWKS → 401.

## Scope

Pure addition to `synthesizePipeline`: the same two fields that already feed token-exchange now also feed jwt-validation. Both plugins end up with the same "where is Keycloak internally" hint, mirroring the pre-PR-kagenti#378 binary behavior where jwks_url was derived from outbound.token_url via a cross-plugin pass.

Test: extend `TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig` to assert jwt-validation now receives keycloak_url and keycloak_realm.

## Compatibility

Requires kagenti-extensions ≥ the tag that includes kagenti#383; older authbridge binaries reject the new fields at DisallowUnknownFields decode. The chart in kagenti/kagenti#1507 bumps both pins together.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
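The JWKS derivation this commit relies on can be illustrated with a short sketch. It assumes Keycloak's usual realm layout, where the certs endpoint sits at `/realms/<realm>/protocol/openid-connect/certs` and the issuer URL already ends in `/realms/<realm>`. The `JWTValidationConfig` type and `jwksURL` function are hypothetical names, not the plugin's actual API, and the hostnames are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// JWTValidationConfig holds the fields the commit message says the
// jwt-validation plugin receives (hypothetical struct for illustration).
type JWTValidationConfig struct {
	Issuer        string // public URL, matched against the token's iss claim
	KeycloakURL   string // in-cluster base URL, may be empty
	KeycloakRealm string // may be empty
}

// certsPath is Keycloak's OpenID Connect JWKS path relative to a realm URL.
const certsPath = "/protocol/openid-connect/certs"

// jwksURL prefers the internal Keycloak URL when both hints are present and
// falls back to deriving the endpoint from the public issuer otherwise.
func jwksURL(c JWTValidationConfig) string {
	if c.KeycloakURL != "" && c.KeycloakRealm != "" {
		return strings.TrimRight(c.KeycloakURL, "/") + "/realms/" + c.KeycloakRealm + certsPath
	}
	return strings.TrimRight(c.Issuer, "/") + certsPath
}

func main() {
	withHints := JWTValidationConfig{
		Issuer:        "https://keycloak.apps.example.com/realms/demo",
		KeycloakURL:   "http://keycloak.keycloak.svc:8080",
		KeycloakRealm: "demo",
	}
	fmt.Println(jwksURL(withHints))
	fmt.Println(jwksURL(JWTValidationConfig{Issuer: withHints.Issuer}))
}
```

With the hints present, the sidecar fetches keys from the in-cluster service while still validating `iss` against the public hostname. Without them, the fetch targets the public hostname, which the commit reports is typically unreachable from inside the pod.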
Summary

Adds `MLflowOperandReconciler` to the kagenti-operator, moving MLflow CR creation and OTEL collector RBAC bootstrap from the setup shell script into the operator. Also fixes several bugs in the OTel bootstrap that prevented traces from reaching MLflow on RHOAI clusters.

Changes
MLflow Operand Controller (original)
- `internal/controller/mlflow_operand_controller.go` (new): `MLflowOperandReconciler` that watches DSC, creates MLflow CR, waits for readiness via requeue, creates OTEL RoleBindings in agent namespaces
- `internal/controller/mlflow_operand_controller_test.go` (new): unit tests covering DSC absence, management state handling, MLflow CR creation, readiness checks, OTEL RoleBinding creation, ClusterRole fallback, and idempotency
- `internal/mlflow/types.go`: added `MLflowSpec` to MLflow type, added minimal `DataScienceCluster` types with v2 scheme registration
- `cmd/main.go`: registered `MLflowOperandReconciler` under `--enable-mlflow` gate
- `internal/controller/mlflow_controller.go`: updated kubebuilder RBAC markers for expanded `mlflows` verbs
- `charts/kagenti-operator/templates/rbac/role.yaml`: added `datascienceclusters` get/list/watch, expanded `mlflows` to create/update, added `clusterroles` get + `bind` (scoped to mlflow-operator roles)

OTel→MLflow Trace Export Fixes
- `otel.go`: Set `ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt` on the MLflow exporter so the collector trusts RHOAI's service-serving CA. Without this, exports fail with `x509: certificate signed by unknown authority`.
- `mlflow.workspace` Helm value (wired as `--mlflow-workspace` CLI flag). RHOAI requires `x-mlflow-workspace` header on every request; without it MLflow returns 500. The workspace maps 1:1 to a Kubernetes namespace and is now an explicit configuration input rather than a runtime discovery.
- `mlflow.experimentName` Helm value (default `kagenti-traces`, wired as `--mlflow-experiment-name`). The bootstrap calls the MLflow API at startup to create or retrieve the experiment, then sets the correct `x-mlflow-experiment-id` header. Without a valid experiment ID, MLflow returns 422.
- `mlflow_operand_controller.go`: Flipped `resolveMLflowClusterRole` to prefer `mlflow-operator-mlflow-integration` over `mlflow-operator-mlflow-edit`, ensuring `gatewayendpoints/use` permission is present.
- `presets.go`: Removed hardcoded `x-mlflow-experiment-id: "0"` and `tls.insecure: true` from YAML presets since these are now set programmatically.

Files changed
- `charts/kagenti-operator/values.yaml`: `mlflow.workspace` and `mlflow.experimentName`
- `charts/kagenti-operator/templates/manager/manager.yaml`
- `kagenti-operator/cmd/main.go`: `--mlflow-workspace` and `--mlflow-experiment-name` flags
- `kagenti-operator/internal/bootstrap/otel.go`
- `kagenti-operator/internal/bootstrap/otel_test.go`
- `kagenti-operator/internal/bootstrap/presets.go`
- `kagenti-operator/internal/controller/mlflow_operand_controller.go`

Manual Test Plan (OpenShift + RHOAI)
Prerequisites
- `kagenti-system` namespace with kagenti-deps deployed

Step 1: Deploy the operator
Step 2: Create agent namespace and LLM secret
Step 3: Restart the operator (so bootstrap discovers the new namespace)
Step 4: Verify bootstrap logs
Expected output:
- `MLflow CRD detected, discovering MLflow CR`
- `Found available MLflow CR` with correct `tracesURL`
- `Created/found MLflow experiment` with `name: kagenti-traces`, `workspace: team1`, and an `experimentID`
- `OTel collector bootstrap complete`

Step 5: Verify the OTel collector ConfigMap
`kubectl get configmap otel-collector-config -n kagenti-system -o jsonpath='{.data.base\.yaml}'`

Verify:
- `x-mlflow-workspace: team1`
- `x-mlflow-experiment-id: "<id>"` (matches bootstrap log)
- `ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt`
- `bearertokenauth/mlflow` extension present

Step 6: Deploy a test agent (e.g. weather-service)
Deploy the weather agent and tool from `agent-examples/` into `team1`, adjusting LLM env vars and removing `runAsUser: 1000` from securityContext for OpenShift SCC compatibility.

Step 7: Send a request and verify traces
Step 8: Confirm no export errors
Expected: No errors. Traces should flow cleanly to MLflow.
Test Results
- `go test ./internal/bootstrap/ -v`
- `go build ./...` clean