feat: MLflow operand controller for CR lifecycle and OTEL RBAC by Bobbins228 · Pull Request #378 · kagenti/kagenti-operator

Bobbins228 · 2026-05-27T08:36:31Z

Summary

Adds MLflowOperandReconciler to the kagenti-operator, moving MLflow CR creation and OTEL collector RBAC bootstrap from the setup shell script into the operator. Also fixes several bugs in the OTel bootstrap that prevented traces from reaching MLflow on RHOAI clusters.

Changes

MLflow Operand Controller (original)

internal/controller/mlflow_operand_controller.go (new) — MLflowOperandReconciler that watches DSC, creates MLflow CR, waits for readiness via requeue, creates OTEL RoleBindings in agent namespaces
internal/controller/mlflow_operand_controller_test.go (new) — unit tests covering DSC absence, management state handling, MLflow CR creation, readiness checks, OTEL RoleBinding creation, ClusterRole fallback, and idempotency
internal/mlflow/types.go — added MLflowSpec to MLflow type, added minimal DataScienceCluster types with v2 scheme registration
cmd/main.go — registered MLflowOperandReconciler under --enable-mlflow gate
internal/controller/mlflow_controller.go — updated kubebuilder RBAC markers for expanded mlflows verbs
charts/kagenti-operator/templates/rbac/role.yaml — added datascienceclusters get/list/watch, expanded mlflows to create/update, added clusterroles get + bind (scoped to mlflow-operator roles)
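
The reconcile order described above can be sketched as a pure decision function. This is a minimal illustration, not the code in mlflow_operand_controller.go: the `clusterState` and `step` types, the `nextStep` name, and the 10-second requeue delay are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// clusterState is a hypothetical snapshot of what the reconciler observes.
type clusterState struct {
	DSCFound        bool     // a DataScienceCluster exists
	ManagementState string   // DSC mlflowoperator state, e.g. "Managed"
	MLflowCRExists  bool     // the MLflow CR has been created
	MLflowReady     bool     // MLflow Service/Endpoints are ready
	AgentNamespaces []string // namespaces that need an OTEL RoleBinding
}

// step is the single next action plus an optional requeue delay.
type step struct {
	Action       string
	RequeueAfter time.Duration
}

// nextStep picks the next action in the order the PR description lists them.
// The requeue delay is an assumed value, not taken from the PR.
func nextStep(s clusterState) step {
	switch {
	case !s.DSCFound:
		return step{Action: "skip: no DataScienceCluster"}
	case s.ManagementState != "Managed":
		return step{Action: "skip: mlflowoperator not Managed"}
	case !s.MLflowCRExists:
		return step{Action: "create MLflow CR", RequeueAfter: 10 * time.Second}
	case !s.MLflowReady:
		return step{Action: "wait for MLflow readiness", RequeueAfter: 10 * time.Second}
	default:
		return step{Action: fmt.Sprintf("ensure RoleBindings in %d namespace(s)", len(s.AgentNamespaces))}
	}
}

func main() {
	s := clusterState{DSCFound: true, ManagementState: "Managed", MLflowCRExists: true}
	st := nextStep(s)
	fmt.Println(st.Action, st.RequeueAfter)
}
```

Because every branch returns the same action for the same observed state, running it repeatedly is idempotent, which is the property the unit tests listed above check for the real reconciler.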

OTel→MLflow Trace Export Fixes

TLS verification — otel.go: Set ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt on the MLflow exporter so the collector trusts RHOAI's service-serving CA. Without this, exports fail with x509: certificate signed by unknown authority.
Workspace header — Added mlflow.workspace Helm value (wired as --mlflow-workspace CLI flag). RHOAI requires x-mlflow-workspace header on every request; without it MLflow returns 500. The workspace maps 1:1 to a Kubernetes namespace and is now an explicit configuration input rather than a runtime discovery.
Experiment ID — Added mlflow.experimentName Helm value (default kagenti-traces, wired as --mlflow-experiment-name). The bootstrap calls the MLflow API at startup to create or retrieve the experiment, then sets the correct x-mlflow-experiment-id header. Without a valid experiment ID, MLflow returns 422.
ClusterRole preference — mlflow_operand_controller.go: Flipped resolveMLflowClusterRole to prefer mlflow-operator-mlflow-integration over mlflow-operator-mlflow-edit, ensuring gatewayendpoints/use permission is present.
Removed hardcoded preset values — presets.go: Removed hardcoded x-mlflow-experiment-id: "0" and tls.insecure: true from YAML presets since these are now set programmatically.
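
The "create or retrieve the experiment" step from the Experiment ID fix is a get-or-create pattern. The sketch below shows only that logic behind a hypothetical `experimentAPI` interface; the interface, the `errNotFound` sentinel, and the in-memory stand-in are assumptions for illustration and do not reflect the actual client in otel.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel meaning "no experiment with that name".
var errNotFound = errors.New("experiment not found")

// experimentAPI is a hypothetical abstraction over the MLflow experiment calls.
type experimentAPI interface {
	GetByName(name string) (id string, err error)
	Create(name string) (id string, err error)
}

// ensureExperiment returns the ID of the named experiment, creating it if absent.
func ensureExperiment(api experimentAPI, name string) (string, error) {
	id, err := api.GetByName(name)
	if err == nil {
		return id, nil
	}
	if !errors.Is(err, errNotFound) {
		return "", fmt.Errorf("look up experiment %q: %w", name, err)
	}
	id, err = api.Create(name)
	if err != nil {
		return "", fmt.Errorf("create experiment %q: %w", name, err)
	}
	return id, nil
}

// memoryAPI is an in-memory stand-in used only to exercise the sketch.
type memoryAPI struct {
	byName map[string]string
	nextID int
}

func (m *memoryAPI) GetByName(name string) (string, error) {
	if id, ok := m.byName[name]; ok {
		return id, nil
	}
	return "", errNotFound
}

func (m *memoryAPI) Create(name string) (string, error) {
	m.nextID++
	id := fmt.Sprint(m.nextID)
	m.byName[name] = id
	return id, nil
}

func main() {
	api := &memoryAPI{byName: map[string]string{}}
	first, _ := ensureExperiment(api, "kagenti-traces")
	second, _ := ensureExperiment(api, "kagenti-traces")
	fmt.Println(first, second) // both calls yield the same ID
}
```

The returned ID is what would be placed in the x-mlflow-experiment-id header, so a restart of the operator reuses the existing experiment rather than creating a duplicate.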

Files changed

| File | What |
| --- | --- |
| `charts/kagenti-operator/values.yaml` | Added `mlflow.workspace` and `mlflow.experimentName` |
| `charts/kagenti-operator/templates/manager/manager.yaml` | Wire new CLI flags from Helm values |
| `kagenti-operator/cmd/main.go` | Added `--mlflow-workspace` and `--mlflow-experiment-name` flags |
| `kagenti-operator/internal/bootstrap/otel.go` | TLS fix, config-driven workspace/experiment, MLflow API client for experiment creation |
| `kagenti-operator/internal/bootstrap/otel_test.go` | Updated tests for config-driven workspace, added workspace header test |
| `kagenti-operator/internal/bootstrap/presets.go` | Removed hardcoded experiment-id and insecure TLS |
| `kagenti-operator/internal/controller/mlflow_operand_controller.go` | ClusterRole preference fix |

Manual Test Plan (OpenShift + RHOAI)

Prerequisites

OpenShift cluster with RHOAI installed (MLflow operator available)
kagenti-system namespace with kagenti-deps deployed
Access to an LLM endpoint (vLLM or similar)

Step 1: Deploy the operator

helm upgrade --install kagenti-operator charts/kagenti-operator/ \
  -n kagenti-system \
  --set controllerManager.container.image.repository=<your-registry>/kagenti-operator \
  --set controllerManager.container.image.tag=<your-tag> \
  --set controllerManager.container.imagePullPolicy=Always \
  --set mlflow.enable=true \
  --set mlflow.workspace=team1 \
  --set mlflow.experimentName=kagenti-traces \
  --set otelBootstrap.enable=true

Step 2: Create agent namespace and LLM secret

kubectl create namespace team1

kubectl create secret generic llama-stack-inference-model-secret \
  --from-literal INFERENCE_MODEL="<model>" \
  --from-literal VLLM_URL="<url>" \
  --from-literal VLLM_TLS_VERIFY="<true|false>" \
  --from-literal VLLM_API_TOKEN="<token>" \
  -n team1

Step 3: Restart the operator (so bootstrap discovers the new namespace)

kubectl rollout restart deployment/kagenti-controller-manager -n kagenti-system
kubectl rollout status deployment/kagenti-controller-manager -n kagenti-system

Step 4: Verify bootstrap logs

kubectl logs deployment/kagenti-controller-manager -n kagenti-system | grep -E "bootstrap|MLflow|workspace|experiment"

Expected output:

MLflow CRD detected, discovering MLflow CR
Found available MLflow CR with correct tracesURL
Created/found MLflow experiment with name: kagenti-traces, workspace: team1, and an experimentID
OTel collector bootstrap complete

Step 5: Verify the OTel collector ConfigMap

kubectl get configmap otel-collector-config -n kagenti-system -o jsonpath='{.data.base\.yaml}'

Verify:

x-mlflow-workspace: team1
x-mlflow-experiment-id: "<id>" (matches bootstrap log)
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
bearertokenauth/mlflow extension present
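
The four items above can be pictured as one exporter fragment assembled programmatically. The sketch below builds it as nested Go maps; the `mlflowExporter` function and the `endpoint` key are assumptions for illustration, and the exact structure the bootstrap writes into base.yaml may differ.

```go
package main

import "fmt"

// serviceCAFile is the CA path named in the PR description.
const serviceCAFile = "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"

// mlflowExporter assembles a hypothetical exporter fragment as nested maps.
// The "endpoint" key is an assumption; the header names, ca_file path, and
// authenticator name are the ones listed in the verification step.
func mlflowExporter(tracesURL, workspace, experimentID string) map[string]any {
	return map[string]any{
		"endpoint": tracesURL,
		"headers": map[string]string{
			"x-mlflow-workspace":     workspace,
			"x-mlflow-experiment-id": experimentID,
		},
		"tls":  map[string]string{"ca_file": serviceCAFile},
		"auth": map[string]string{"authenticator": "bearertokenauth/mlflow"},
	}
}

func main() {
	// The URL and experiment ID here are placeholder values.
	cfg := mlflowExporter("https://mlflow.example/v1/traces", "team1", "7")
	fmt.Println(cfg["headers"])
	fmt.Println(cfg["tls"])
}
```

Setting these values in code rather than in the YAML presets is what allows the hardcoded experiment ID and insecure TLS setting to be dropped from presets.go.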

Step 6: Deploy a test agent (e.g. weather-service)

Deploy the weather agent and tool from agent-examples/ into team1, adjusting LLM env vars and removing runAsUser: 1000 from securityContext for OpenShift SCC compatibility.

Step 7: Send a request and verify traces

kubectl port-forward svc/weather-service -n team1 8080:8080 &

curl -s -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"message/send","id":"test-1","params":{"message":{"role":"user","messageId":"msg-1","parts":[{"type":"text","text":"What is the weather in London?"}]}}}'

Step 8: Confirm no export errors

kubectl logs deployment/otel-collector -n kagenti-system --since=5m | grep -iE "error|fail|reject|404|500|401"

Expected: No errors. Traces should flow cleanly to MLflow.

Test Results

Unit tests pass (go test ./internal/bootstrap/ -v)
go build ./... clean
Pre-commit hooks pass (go-fmt, go-vet, helmlint)
End-to-end tested on live OpenShift cluster with RHOAI MLflow
Traces confirmed in MLflow workspace with correct experiment
Zero export errors in OTel collector logs

Adds MLflowOperandReconciler that watches DataScienceCluster and:
- Creates the MLflow CR in redhat-ods-applications when DSC mlflowoperator is Managed
- Waits for MLflow Service/Endpoints readiness via requeue
- Creates per-agent-namespace RoleBindings granting otel-collector SA access to MLflow ClusterRoles for trace export
- Re-reconciles when DSC MLflow component state changes

RBAC updates:
- Expand mlflows verbs to include create/update
- Add datascienceclusters get/list/watch
- Add clusterroles get and bind (scoped to mlflow-operator roles)
- Add endpoints to core resource watches

Registered under --enable-mlflow gate alongside existing MLflowReconciler. Uses DSC v2 API (mlflowoperator field is v2-only).

Ref: RHAIENG-4902
Assisted-By: cursor
Signed-off-by: Bobbins228 <mcampbel@redhat.com>

- Set service-ca.crt as ca_file for MLflow TLS verification
- Add mlflow.workspace and mlflow.experimentName Helm values for RHOAI workspace/experiment header configuration
- Create MLflow experiment via API at bootstrap using configured name
- Wire --mlflow-workspace and --mlflow-experiment-name CLI flags
- Prefer mlflow-operator-mlflow-integration ClusterRole for gatewayendpoints/use permission

Signed-off-by: Bobbins228 <mcampbel@redhat.com>
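
The ClusterRole preference mentioned in this commit amounts to an ordered lookup with a fallback. The sketch below is illustrative only: `pickClusterRole` and its `exists` callback are hypothetical names, and the real resolveMLflowClusterRole queries the cluster rather than a callback.

```go
package main

import "fmt"

// preferredClusterRoles lists candidates in priority order. The integration
// role comes first because, per the PR, it carries gatewayendpoints/use.
var preferredClusterRoles = []string{
	"mlflow-operator-mlflow-integration",
	"mlflow-operator-mlflow-edit",
}

// pickClusterRole returns the first candidate for which exists reports true.
// The exists callback is a hypothetical stand-in for a cluster lookup.
func pickClusterRole(exists func(name string) bool) (string, bool) {
	for _, name := range preferredClusterRoles {
		if exists(name) {
			return name, true
		}
	}
	return "", false
}

func main() {
	// Simulate a cluster where only the edit role is installed.
	onlyEdit := map[string]bool{"mlflow-operator-mlflow-edit": true}
	name, ok := pickClusterRole(func(n string) bool { return onlyEdit[n] })
	fmt.Println(name, ok)
}
```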

Follow-up to kagenti-extensions PR kagenti#378, which replaced authbridge's top-level `inbound:` / `outbound:` / `identity:` / `bypass:` / `routes:` YAML blocks with per-plugin config under `pipeline.*.plugins[].config`. The new binary rejects any config whose pipeline is empty (which is what yaml.v3's silent drop of unknown keys produces when the top-level blocks are removed), so the webhook had to stop synthesizing the old shape.

What changed in pod_mutator.go:

ensurePerAgentConfigMap's fallback-synthesis branch is rewritten. Previously: if baseYAML was missing any of `inbound`, `outbound`, `identity`, or `bypass`, synthesize each from NamespaceConfig values. Now: if baseYAML has no `pipeline:` section at all, synthesize one with jwt-validation (inbound) and token-exchange (outbound) entries whose configs carry Issuer / Keycloak URL / realm / default policy / identity type from NamespaceConfig.

File-path defaults (audience_file, client_id_file, client_secret_file, jwt_svid_path) are no longer emitted by the webhook; the authbridge plugin applies them from its own convention layer at Configure time (see authbridge/authlib/plugins/CONVENTIONS.md). This keeps the webhook schema-agnostic about paths and reduces duplication.

When baseYAML already has a `pipeline:` section (the post-migration Kagenti Helm chart emits it), synthesis is skipped entirely. Only mode + listener overrides layer on top. The Helm chart owns plugin config contents; the webhook owns per-agent mode selection.

Tests:
- TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig: asserts the synthesized pipeline has the correct plugin names and config values. Navigates pipeline.<direction>.plugins[<name>] via a new pluginConfigAt helper to keep assertions compact.
- TestEnsurePerAgentConfigMap_BaseYAML_PreservesExistingFields: baseYAML rewritten to new shape; asserts plugin config is preserved verbatim.
- TestEnsurePerAgentConfigMap_ListenerOverrides_Merged: baseYAML rewritten to new shape; listener assertions unchanged (listener is top-level in both schemas).
- TestEnsurePerAgentConfigMap_FederatedJWT_MapsToSpiffe: asserts identity.type = spiffe under token-exchange. Dropped assertions for file-path defaults (those moved into the plugin).

E2E fixtures:
- test/e2e/fixtures.go: two hardcoded ConfigMaps with old-shape YAML rewritten to new per-plugin shape.

Image tags: none bumped. The operator references authbridge images as `:latest` throughout; once PR kagenti#378 merges and CI publishes new `:latest`, operator CI picks it up. The version-pinning (and atomic authbridge+operator cutover) lives in the kagenti Helm chart PR that comes next.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
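
The "synthesize only when no pipeline exists" rule in this commit can be sketched over an already-parsed config map. This is a rough illustration under several assumptions: `NamespaceConfig` field names, the `plugin` type, and the `ensurePipeline` function are hypothetical, the split of fields between the two plugins is a guess, and the default policy and identity type values are left out because the commit message does not give their key names.

```go
package main

import "fmt"

// NamespaceConfig holds namespace-level values; field names are hypothetical.
type NamespaceConfig struct {
	Issuer        string
	KeycloakURL   string
	KeycloakRealm string
}

// plugin is a hypothetical in-memory form of one pipeline plugin entry.
type plugin struct {
	Name   string
	Config map[string]string
}

// ensurePipeline returns base unchanged when it already has a "pipeline" key.
// Otherwise it returns a copy with a synthesized pipeline built from ns.
func ensurePipeline(base map[string]any, ns NamespaceConfig) map[string]any {
	if _, ok := base["pipeline"]; ok {
		return base // chart-provided plugin config is preserved verbatim
	}
	out := make(map[string]any, len(base)+1)
	for k, v := range base {
		out[k] = v
	}
	out["pipeline"] = map[string]any{
		"inbound": map[string]any{"plugins": []plugin{{
			Name:   "jwt-validation",
			Config: map[string]string{"issuer": ns.Issuer},
		}}},
		"outbound": map[string]any{"plugins": []plugin{{
			Name: "token-exchange",
			Config: map[string]string{
				"keycloak_url":   ns.KeycloakURL,
				"keycloak_realm": ns.KeycloakRealm,
			},
		}}},
	}
	return out
}

func main() {
	// Placeholder values; "mode" stands in for any pre-existing top-level key.
	ns := NamespaceConfig{
		Issuer:        "https://kc.example.com/realms/demo",
		KeycloakURL:   "http://keycloak.keycloak.svc:8080",
		KeycloakRealm: "demo",
	}
	fmt.Println(ensurePipeline(map[string]any{"mode": "sidecar"}, ns)["pipeline"])
}
```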

…n synthesizePipeline

## Why

When a namespace's `authbridge-runtime-config` ConfigMap has no `pipeline:` of its own, the webhook synthesizes one from the namespace's `authbridge-config` env-var contract (NamespaceConfig). The synthesis passed `keycloak_url` + `keycloak_realm` only to the outbound token-exchange plugin. jwt-validation got just `issuer`.

kagenti-extensions#383 extends the jwt-validation plugin to accept these two fields and derive jwks_url from the INTERNAL Keycloak URL: the sidecar actually reaches the JWKS endpoint from inside the cluster, and `issuer` is the PUBLIC hostname (required for `iss`-claim matching but typically unreachable from inside the pod). Without the keycloak_* hints, jwt-validation falls back to issuer-derivation and every inbound request fails with "connection refused" fetching the JWKS → 401.

## Scope

Pure addition to `synthesizePipeline`: the same two fields that already feed token-exchange now also feed jwt-validation. Both plugins end up with the same "where is Keycloak internally" hint, mirroring the pre-PR-kagenti#378 binary behavior where jwks_url was derived from outbound.token_url via a cross-plugin pass.

Test: extend `TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig` to assert jwt-validation now receives keycloak_url and keycloak_realm.

## Compatibility

Requires kagenti-extensions ≥ the tag that includes kagenti#383; older authbridge binaries reject the new fields at DisallowUnknownFields decode. The chart in kagenti/kagenti#1507 bumps both pins together.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
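
The difference between the two derivations can be made concrete with a small sketch. It assumes Keycloak's standard realm layout, where the JWKS document is served at `/realms/<realm>/protocol/openid-connect/certs`; the `jwksURL` function is hypothetical and is not the plugin's actual code, and the hostnames are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// certsPath is Keycloak's JWKS path relative to a realm URL.
const certsPath = "/protocol/openid-connect/certs"

// jwksURL prefers the in-cluster Keycloak URL when both hints are present
// and otherwise falls back to deriving the URL from the public issuer.
func jwksURL(issuer, keycloakURL, keycloakRealm string) string {
	if keycloakURL != "" && keycloakRealm != "" {
		return strings.TrimRight(keycloakURL, "/") + "/realms/" + keycloakRealm + certsPath
	}
	return strings.TrimRight(issuer, "/") + certsPath
}

func main() {
	issuer := "https://kc.example.com/realms/demo" // public hostname
	fmt.Println(jwksURL(issuer, "http://keycloak.keycloak.svc:8080", "demo"))
	fmt.Println(jwksURL(issuer, "", "")) // fallback: may be unreachable from the pod
}
```

The issuer is still needed for `iss`-claim matching in both cases; only the location used to fetch keys changes.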

Bobbins228 requested a review from a team as a code owner May 27, 2026 08:36

rubambiza added this to Kagenti Issue Prioritization May 27, 2026

github-project-automation Bot moved this to New /:ToDo in Kagenti Issue Prioritization May 27, 2026

Bobbins228 mentioned this pull request May 27, 2026

refactor: remove MLflow CR creation and OTEL RBAC from setup script kagenti/kagenti#1692

Open

3 tasks

Bobbins228 force-pushed the feat/RHAIENG-4902-mlflow-operand-otel-rbac branch from dc58672 to e274b84 Compare May 27, 2026 08:41

Bobbins228 force-pushed the feat/RHAIENG-4902-mlflow-operand-otel-rbac branch from e274b84 to ca91c1f Compare May 27, 2026 09:46

Bobbins228 marked this pull request as draft May 27, 2026 12:49

Bobbins228 force-pushed the feat/RHAIENG-4902-mlflow-operand-otel-rbac branch from 59142ee to 75d6094 Compare May 27, 2026 16:03

Bobbins228 wants to merge 2 commits into kagenti:main from Bobbins228:feat/RHAIENG-4902-mlflow-operand-otel-rbac
