feat: MLflow operand controller for CR lifecycle and OTEL RBAC #378

Draft
Bobbins228 wants to merge 2 commits into kagenti:main from Bobbins228:feat/RHAIENG-4902-mlflow-operand-otel-rbac
Bobbins228 commented on May 27, 2026

Summary

Adds MLflowOperandReconciler to the kagenti-operator, moving MLflow CR creation and OTEL collector RBAC bootstrap from the setup shell script into the operator. Also fixes several bugs in the OTel bootstrap that prevented traces from reaching MLflow on RHOAI clusters.

Changes

MLflow Operand Controller (original)

  • internal/controller/mlflow_operand_controller.go (new) — MLflowOperandReconciler that watches DSC, creates MLflow CR, waits for readiness via requeue, creates OTEL RoleBindings in agent namespaces
  • internal/controller/mlflow_operand_controller_test.go (new) — unit tests covering DSC absence, management state handling, MLflow CR creation, readiness checks, OTEL RoleBinding creation, ClusterRole fallback, and idempotency
  • internal/mlflow/types.go — added MLflowSpec to MLflow type, added minimal DataScienceCluster types with v2 scheme registration
  • cmd/main.go — registered MLflowOperandReconciler under --enable-mlflow gate
  • internal/controller/mlflow_controller.go — updated kubebuilder RBAC markers for expanded mlflows verbs
  • charts/kagenti-operator/templates/rbac/role.yaml — added datascienceclusters get/list/watch, expanded mlflows to create/update, added clusterroles get + bind (scoped to mlflow-operator roles)
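The reconcile flow listed above (watch the DSC, create the MLflow CR, requeue until ready, then create the OTEL RoleBindings) can be pictured as a small decision function. This is only a sketch: the `Action` and `State` types, the `nextAction` name, and the choice to do nothing when the component is not `Managed` are assumptions for illustration, not the code in mlflow_operand_controller.go.

```go
package main

import "fmt"

// Action is a hypothetical label for what one reconcile pass does next.
type Action string

const (
	ActionSkip         Action = "skip"          // no DSC, or component not Managed
	ActionCreateMLflow Action = "create-mlflow" // MLflow CR is missing
	ActionRequeue      Action = "requeue"       // CR exists but is not ready yet
	ActionBindRBAC     Action = "bind-rbac"     // ready: ensure OTEL RoleBindings
)

// State is the subset of cluster state this sketch looks at (assumed shape).
type State struct {
	DSCFound        bool   // a DataScienceCluster exists
	ManagementState string // state of the DSC mlflowoperator component
	MLflowCRExists  bool
	MLflowReady     bool
}

// nextAction walks the checks in order and returns the first step still pending.
func nextAction(s State) Action {
	switch {
	case !s.DSCFound || s.ManagementState != "Managed":
		return ActionSkip
	case !s.MLflowCRExists:
		return ActionCreateMLflow
	case !s.MLflowReady:
		return ActionRequeue
	default:
		return ActionBindRBAC
	}
}

func main() {
	fmt.Println(nextAction(State{DSCFound: true, ManagementState: "Managed"}))
}
```

Because each pass returns the first pending step, running it again after a step completes is harmless, which is the property the idempotency tests would exercise.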

OTel→MLflow Trace Export Fixes

  • TLS verification (otel.go): Set ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt on the MLflow exporter so the collector trusts RHOAI's service-serving CA. Without this, exports fail with x509: certificate signed by unknown authority.
  • Workspace header — Added mlflow.workspace Helm value (wired as --mlflow-workspace CLI flag). RHOAI requires x-mlflow-workspace header on every request; without it MLflow returns 500. The workspace maps 1:1 to a Kubernetes namespace and is now an explicit configuration input rather than a runtime discovery.
  • Experiment ID — Added mlflow.experimentName Helm value (default kagenti-traces, wired as --mlflow-experiment-name). The bootstrap calls the MLflow API at startup to create or retrieve the experiment, then sets the correct x-mlflow-experiment-id header. Without a valid experiment ID, MLflow returns 422.
  • ClusterRole preference (mlflow_operand_controller.go): Flipped resolveMLflowClusterRole to prefer mlflow-operator-mlflow-integration over mlflow-operator-mlflow-edit, ensuring the gatewayendpoints/use permission is present.
  • Removed hardcoded preset values (presets.go): Removed hardcoded x-mlflow-experiment-id: "0" and tls.insecure: true from the YAML presets, since both are now set programmatically.
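The ClusterRole preference described above amounts to a first-match lookup over an ordered list. The sketch below assumes a caller-supplied `exists` check in place of a real API lookup; `pickClusterRole` and `rolePreference` are illustrative names, and the actual resolveMLflowClusterRole may be structured differently.

```go
package main

import "fmt"

// rolePreference lists the two ClusterRoles named in this PR, preferred first.
var rolePreference = []string{
	"mlflow-operator-mlflow-integration", // carries gatewayendpoints/use
	"mlflow-operator-mlflow-edit",        // fallback
}

// pickClusterRole returns the first preferred role for which exists reports
// true. The exists callback stands in for a cluster lookup (an assumption).
func pickClusterRole(exists func(name string) bool) (string, bool) {
	for _, name := range rolePreference {
		if exists(name) {
			return name, true
		}
	}
	return "", false
}

func main() {
	// Simulate a cluster where only the fallback role is installed.
	onCluster := map[string]bool{"mlflow-operator-mlflow-edit": true}
	role, ok := pickClusterRole(func(n string) bool { return onCluster[n] })
	fmt.Println(role, ok)
}
```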

Files changed

| File | What |
| --- | --- |
| charts/kagenti-operator/values.yaml | Added mlflow.workspace and mlflow.experimentName |
| charts/kagenti-operator/templates/manager/manager.yaml | Wire new CLI flags from Helm values |
| kagenti-operator/cmd/main.go | Added --mlflow-workspace and --mlflow-experiment-name flags |
| kagenti-operator/internal/bootstrap/otel.go | TLS fix, config-driven workspace/experiment, MLflow API client for experiment creation |
| kagenti-operator/internal/bootstrap/otel_test.go | Updated tests for config-driven workspace, added workspace header test |
| kagenti-operator/internal/bootstrap/presets.go | Removed hardcoded experiment-id and insecure TLS |
| kagenti-operator/internal/controller/mlflow_operand_controller.go | ClusterRole preference fix |
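For orientation, the two new flags could be parsed with the standard `flag` package roughly as follows. The flag names and the kagenti-traces default come from this PR; the `mlflowOptions` struct, the empty workspace default, and the help strings are assumptions, and cmd/main.go may wire them differently.

```go
package main

import (
	"flag"
	"fmt"
)

// mlflowOptions is a hypothetical holder for the two new settings.
type mlflowOptions struct {
	Workspace      string
	ExperimentName string
}

// parseMLflowFlags parses only the two MLflow flags from args.
func parseMLflowFlags(args []string) (mlflowOptions, error) {
	var o mlflowOptions
	fs := flag.NewFlagSet("kagenti-operator", flag.ContinueOnError)
	// Empty default for the workspace is an assumption of this sketch.
	fs.StringVar(&o.Workspace, "mlflow-workspace", "",
		"MLflow workspace sent as the x-mlflow-workspace header")
	fs.StringVar(&o.ExperimentName, "mlflow-experiment-name", "kagenti-traces",
		"MLflow experiment to create or look up at bootstrap")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseMLflowFlags([]string{"--mlflow-workspace=team1"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", o)
}
```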

Manual Test Plan (OpenShift + RHOAI)

Prerequisites

  • OpenShift cluster with RHOAI installed (MLflow operator available)
  • kagenti-system namespace with kagenti-deps deployed
  • Access to an LLM endpoint (vLLM or similar)

Step 1: Deploy the operator

helm upgrade --install kagenti-operator charts/kagenti-operator/ \
  -n kagenti-system \
  --set controllerManager.container.image.repository=<your-registry>/kagenti-operator \
  --set controllerManager.container.image.tag=<your-tag> \
  --set controllerManager.container.imagePullPolicy=Always \
  --set mlflow.enable=true \
  --set mlflow.workspace=team1 \
  --set mlflow.experimentName=kagenti-traces \
  --set otelBootstrap.enable=true

Step 2: Create agent namespace and LLM secret

kubectl create namespace team1

kubectl create secret generic llama-stack-inference-model-secret \
  --from-literal INFERENCE_MODEL="<model>" \
  --from-literal VLLM_URL="<url>" \
  --from-literal VLLM_TLS_VERIFY="<true|false>" \
  --from-literal VLLM_API_TOKEN="<token>" \
  -n team1

Step 3: Restart the operator (so bootstrap discovers the new namespace)

kubectl rollout restart deployment/kagenti-controller-manager -n kagenti-system
kubectl rollout status deployment/kagenti-controller-manager -n kagenti-system

Step 4: Verify bootstrap logs

kubectl logs deployment/kagenti-controller-manager -n kagenti-system | grep -E "bootstrap|MLflow|workspace|experiment"

Expected output:

  • MLflow CRD detected, discovering MLflow CR
  • Found available MLflow CR with correct tracesURL
  • Created/found MLflow experiment with name: kagenti-traces, workspace: team1, and an experimentID
  • OTel collector bootstrap complete
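The "Created/found MLflow experiment" line corresponds to a get-or-create call against MLflow. A minimal sketch of that pattern is shown below, assuming the upstream MLflow REST paths (`/api/2.0/mlflow/experiments/get-by-name` and `/create`) and a 404 on a missing experiment; the RHOAI deployment may differ, authentication is omitted, and `ensureExperiment`, `call`, and `fakeMLflow` are hypothetical names rather than the client in otel.go.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync/atomic"
)

const apiBase = "/api/2.0/mlflow/experiments"

// call sends one request carrying the workspace header and decodes the
// response body into out only when the server answers 200.
func call(c *http.Client, method, u, workspace string, body []byte, out any) (int, error) {
	req, err := http.NewRequest(method, u, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("x-mlflow-workspace", workspace)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		err = json.NewDecoder(resp.Body).Decode(out)
	}
	return resp.StatusCode, err
}

// ensureExperiment looks the experiment up by name and creates it on 404.
func ensureExperiment(c *http.Client, baseURL, workspace, name string) (string, error) {
	var found struct {
		Experiment struct {
			ID string `json:"experiment_id"`
		} `json:"experiment"`
	}
	getURL := baseURL + apiBase + "/get-by-name?experiment_name=" + url.QueryEscape(name)
	status, err := call(c, http.MethodGet, getURL, workspace, nil, &found)
	switch {
	case err != nil:
		return "", err
	case status == http.StatusOK:
		return found.Experiment.ID, nil
	case status != http.StatusNotFound:
		return "", fmt.Errorf("get-by-name: unexpected status %d", status)
	}
	var created struct {
		ID string `json:"experiment_id"`
	}
	payload, err := json.Marshal(map[string]string{"name": name})
	if err != nil {
		return "", err
	}
	status, err = call(c, http.MethodPost, baseURL+apiBase+"/create", workspace, payload, &created)
	if err != nil {
		return "", err
	}
	if status != http.StatusOK {
		return "", fmt.Errorf("create: unexpected status %d", status)
	}
	return created.ID, nil
}

// fakeMLflow is a local stand-in used only to exercise the sketch.
func fakeMLflow() *httptest.Server {
	var created atomic.Bool
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.Header.Get("x-mlflow-workspace") == "":
			w.WriteHeader(http.StatusInternalServerError) // mirrors the 500 noted in this PR
		case r.URL.Path == apiBase+"/get-by-name" && !created.Load():
			w.WriteHeader(http.StatusNotFound)
		case r.URL.Path == apiBase+"/get-by-name":
			fmt.Fprint(w, `{"experiment":{"experiment_id":"7"}}`)
		default:
			created.Store(true)
			fmt.Fprint(w, `{"experiment_id":"7"}`)
		}
	}))
}

func main() {
	srv := fakeMLflow()
	defer srv.Close()
	id, err := ensureExperiment(srv.Client(), srv.URL, "team1", "kagenti-traces")
	fmt.Println(id, err)
}
```

The returned ID is the value that would then be written into the x-mlflow-experiment-id header.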

Step 5: Verify the OTel collector ConfigMap

kubectl get configmap otel-collector-config -n kagenti-system -o jsonpath='{.data.base\.yaml}'

Verify:

  • x-mlflow-workspace: team1
  • x-mlflow-experiment-id: "<id>" (matches bootstrap log)
  • ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
  • bearertokenauth/mlflow extension present
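The fields checked above could be assembled programmatically along the following lines. This is a sketch under assumptions: the `mlflowExporter` function is hypothetical, the `traces_endpoint`, `tls.ca_file`, `headers`, and `auth.authenticator` keys follow the OpenTelemetry Collector's OTLP/HTTP exporter settings, and the example URL is a placeholder. The real bootstrap writes YAML into the ConfigMap; JSON is printed here only to keep the example dependency-free.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serviceCAPath is the CA bundle path named in this PR.
const serviceCAPath = "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"

// mlflowExporter builds exporter settings carrying the workspace header,
// the experiment ID header, the service CA, and the bearer-token authenticator.
func mlflowExporter(tracesURL, workspace, experimentID string) map[string]any {
	return map[string]any{
		"traces_endpoint": tracesURL, // key choice is an assumption
		"tls":             map[string]any{"ca_file": serviceCAPath},
		"headers": map[string]any{
			"x-mlflow-workspace":     workspace,
			"x-mlflow-experiment-id": experimentID,
		},
		"auth": map[string]any{"authenticator": "bearertokenauth/mlflow"},
	}
}

func main() {
	// Placeholder URL and ID, for illustration only.
	cfg := mlflowExporter("https://mlflow.example.svc:8443/v1/traces", "team1", "7")
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```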

Step 6: Deploy a test agent (e.g. weather-service)

Deploy the weather agent and tool from agent-examples/ into team1, adjusting LLM env vars and removing runAsUser: 1000 from securityContext for OpenShift SCC compatibility.

Step 7: Send a request and verify traces

kubectl port-forward svc/weather-service -n team1 8080:8080 &

curl -s -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"message/send","id":"test-1","params":{"message":{"role":"user","messageId":"msg-1","parts":[{"type":"text","text":"What is the weather in London?"}]}}}'

Step 8: Confirm no export errors

kubectl logs deployment/otel-collector -n kagenti-system --since=5m | grep -iE "error|fail|reject|404|500|401"

Expected: No errors. Traces should flow cleanly to MLflow.

Test Results

  • Unit tests pass (go test ./internal/bootstrap/ -v)
  • go build ./... clean
  • Pre-commit hooks pass (go-fmt, go-vet, helmlint)
  • End-to-end tested on live OpenShift cluster with RHOAI MLflow
  • Traces confirmed in MLflow workspace with correct experiment
  • Zero export errors in OTel collector logs

Adds MLflowOperandReconciler that watches DataScienceCluster and:
- Creates the MLflow CR in redhat-ods-applications when DSC
  mlflowoperator is Managed
- Waits for MLflow Service/Endpoints readiness via requeue
- Creates per-agent-namespace RoleBindings granting otel-collector
  SA access to MLflow ClusterRoles for trace export
- Re-reconciles when DSC MLflow component state changes

RBAC updates:
- Expand mlflows verbs to include create/update
- Add datascienceclusters get/list/watch
- Add clusterroles get and bind (scoped to mlflow-operator roles)
- Add endpoints to core resource watches

Registered under --enable-mlflow gate alongside existing
MLflowReconciler. Uses DSC v2 API (mlflowoperator field is
v2-only).

Ref: RHAIENG-4902

Assisted-By: cursor
Signed-off-by: Bobbins228 <mcampbel@redhat.com>
Bobbins228 force-pushed the feat/RHAIENG-4902-mlflow-operand-otel-rbac branch from e274b84 to ca91c1f on May 27, 2026 09:46
Bobbins228 marked this pull request as draft on May 27, 2026 12:49
- Set service-ca.crt as ca_file for MLflow TLS verification
- Add mlflow.workspace and mlflow.experimentName Helm values for
  RHOAI workspace/experiment header configuration
- Create MLflow experiment via API at bootstrap using configured name
- Wire --mlflow-workspace and --mlflow-experiment-name CLI flags
- Prefer mlflow-operator-mlflow-integration ClusterRole for
  gatewayendpoints/use permission

Signed-off-by: Bobbins228 <mcampbel@redhat.com>
Bobbins228 force-pushed the feat/RHAIENG-4902-mlflow-operand-otel-rbac branch from 59142ee to 75d6094 on May 27, 2026 16:03
odh-devops-app Bot pushed a commit to opendatahub-io/agents-operator that referenced this pull request May 29, 2026
Follow-up to kagenti-extensions PR kagenti#378, which replaced authbridge's
top-level `inbound:` / `outbound:` / `identity:` / `bypass:` /
`routes:` YAML blocks with per-plugin config under
`pipeline.*.plugins[].config`. The new binary rejects any config
whose pipeline is empty (which is what yaml.v3's silent drop of
unknown keys produces when the top-level blocks are removed), so
the webhook had to stop synthesizing the old shape.

What changed in pod_mutator.go:

ensurePerAgentConfigMap's fallback-synthesis branch is rewritten.
Previously: if baseYAML was missing any of `inbound`, `outbound`,
`identity`, or `bypass`, synthesize each from NamespaceConfig values.
Now: if baseYAML has no `pipeline:` section at all, synthesize one
with jwt-validation (inbound) and token-exchange (outbound) entries
whose configs carry Issuer / Keycloak URL / realm / default policy /
identity type from NamespaceConfig.

File-path defaults (audience_file, client_id_file, client_secret_file,
jwt_svid_path) are no longer emitted by the webhook — the authbridge
plugin applies them from its own convention layer at Configure time
(see authbridge/authlib/plugins/CONVENTIONS.md). This keeps the
webhook schema-agnostic about paths and reduces duplication.

When baseYAML already has a `pipeline:` section (the post-migration
Kagenti Helm chart emits it), synthesis is skipped entirely. Only
mode + listener overrides layer on top. The Helm chart owns plugin
config contents; the webhook owns per-agent mode selection.
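The fallback rule described in this commit (synthesize only when baseYAML has no `pipeline:` section) might look like the sketch below once the YAML is decoded into a map. The `withPipeline` and `plugin` helpers, the trimmed-down `NamespaceConfig`, and the `name` key are assumptions for illustration; the config keys `issuer`, `keycloak_url`, and `keycloak_realm` are the ones named in these commit messages, and the default policy and identity type are left out.

```go
package main

import "fmt"

// NamespaceConfig holds only the values this sketch needs (assumed shape).
type NamespaceConfig struct {
	Issuer, KeycloakURL, KeycloakRealm string
}

// plugin builds one pipeline entry; the "name" key is an assumption.
func plugin(name string, cfg map[string]any) map[string]any {
	return map[string]any{"name": name, "config": cfg}
}

// withPipeline returns base unchanged when it already has a pipeline and
// otherwise returns a copy with a synthesized one added.
func withPipeline(base map[string]any, ns NamespaceConfig) map[string]any {
	if _, ok := base["pipeline"]; ok {
		return base // chart-provided pipeline wins; nothing is synthesized
	}
	out := make(map[string]any, len(base)+1)
	for k, v := range base {
		out[k] = v
	}
	out["pipeline"] = map[string]any{
		"inbound": map[string]any{"plugins": []any{
			plugin("jwt-validation", map[string]any{"issuer": ns.Issuer}),
		}},
		"outbound": map[string]any{"plugins": []any{
			plugin("token-exchange", map[string]any{
				"keycloak_url":   ns.KeycloakURL,
				"keycloak_realm": ns.KeycloakRealm,
			}),
		}},
	}
	return out
}

func main() {
	// Placeholder values, for illustration only.
	ns := NamespaceConfig{
		Issuer:        "https://keycloak.example.com/realms/demo",
		KeycloakURL:   "http://keycloak.keycloak.svc:8080",
		KeycloakRealm: "demo",
	}
	fmt.Println(withPipeline(map[string]any{}, ns))
}
```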

Tests:

- TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig:
  asserts the synthesized pipeline has the correct plugin names
  and config values. Navigates pipeline.<direction>.plugins[<name>]
  via a new pluginConfigAt helper to keep assertions compact.
- TestEnsurePerAgentConfigMap_BaseYAML_PreservesExistingFields:
  baseYAML rewritten to new shape; asserts plugin config is
  preserved verbatim.
- TestEnsurePerAgentConfigMap_ListenerOverrides_Merged: baseYAML
  rewritten to new shape; listener assertions unchanged (listener
  is top-level in both schemas).
- TestEnsurePerAgentConfigMap_FederatedJWT_MapsToSpiffe: asserts
  identity.type = spiffe under token-exchange. Dropped assertions
  for file-path defaults (those moved into the plugin).

E2E fixtures:

- test/e2e/fixtures.go: two hardcoded ConfigMaps with old-shape
  YAML rewritten to new per-plugin shape.

Image tags: none bumped. The operator references authbridge images
as `:latest` throughout — once PR kagenti#378 merges and CI publishes new
`:latest`, operator CI picks it up. The version-pinning (and
atomic authbridge+operator cutover) lives in the kagenti Helm
chart PR that comes next.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>
odh-devops-app Bot pushed a commit to opendatahub-io/agents-operator that referenced this pull request May 29, 2026
…n synthesizePipeline

## Why

When a namespace's `authbridge-runtime-config` ConfigMap has no
`pipeline:` of its own, the webhook synthesizes one from the
namespace's `authbridge-config` env-var contract (NamespaceConfig).
The synthesis passed `keycloak_url` + `keycloak_realm` only to the
outbound token-exchange plugin. jwt-validation got just `issuer`.

kagenti-extensions#383 extends the jwt-validation plugin to accept
these two fields and derive jwks_url from the INTERNAL Keycloak
URL — the sidecar actually reaches the JWKS endpoint from inside
the cluster, and `issuer` is the PUBLIC hostname (required for
`iss`-claim matching but typically unreachable from inside the
pod). Without the keycloak_* hints, jwt-validation falls back to
issuer-derivation and every inbound request fails with
"connection refused" fetching the JWKS → 401.
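The difference between the two derivations can be sketched as follows, assuming Keycloak's standard JWKS path (`/realms/<realm>/protocol/openid-connect/certs`). The `jwksURL` function and the hostnames are hypothetical; the plugin's actual logic lives in kagenti-extensions and may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const certsPath = "/protocol/openid-connect/certs"

// jwksURL prefers the internal Keycloak URL plus realm when both hints are
// present and otherwise falls back to deriving the endpoint from the issuer.
func jwksURL(issuer, keycloakURL, realm string) string {
	if keycloakURL != "" && realm != "" {
		return strings.TrimRight(keycloakURL, "/") + "/realms/" + url.PathEscape(realm) + certsPath
	}
	return strings.TrimRight(issuer, "/") + certsPath
}

func main() {
	issuer := "https://keycloak.example.com/realms/demo" // public, placeholder
	internal := "http://keycloak.keycloak.svc:8080"      // in-cluster, placeholder

	fmt.Println(jwksURL(issuer, internal, "demo")) // reachable from the sidecar
	fmt.Println(jwksURL(issuer, "", ""))           // issuer-derived fallback
}
```

The issuer itself is still needed for `iss`-claim matching; only the key fetch is redirected to the in-cluster address.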

## Scope

Pure addition to `synthesizePipeline`: the same two fields that
already feed token-exchange now also feed jwt-validation. Both
plugins end up with the same "where is Keycloak internally" hint,
mirroring the pre-PR-kagenti#378 binary behavior where jwks_url was
derived from outbound.token_url via a cross-plugin pass.

Test: extend `TestEnsurePerAgentConfigMap_EmptyBaseYAML_FallbackFromNsConfig`
to assert jwt-validation now receives keycloak_url and keycloak_realm.

## Compatibility

Requires kagenti-extensions ≥ the tag that includes kagenti#383 — older
authbridge binaries reject the new fields at DisallowUnknownFields
decode. The chart in kagenti/kagenti#1507 bumps both pins together.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Hai Huang <huang195@gmail.com>