Skip to content

feat(helm): add Helm chart for Parsec and MLflow#26

Open
marcosmamorim wants to merge 8 commits into
mainfrom
parsec-mlflow-helm
Open

feat(helm): add Helm chart for Parsec and MLflow#26
marcosmamorim wants to merge 8 commits into
mainfrom
parsec-mlflow-helm

Conversation

@marcosmamorim

Copy link
Copy Markdown
Contributor

Summary

Adds a Helm chart for deploying Parsec and MLflow to OpenShift, migrating from the existing Ansible playbook approach. Both deployment methods coexist — the playbooks are NOT removed.

  • Parsec chart (helm/) — 30 templates with domain-prefixed naming (app-*, oauth-*, secrets-*, build-*, cronjob-*)
  • MLflow subchart (helm/charts/mlflow/) — optional, enabled via mlflow.deploy: true

Secrets Management

  • BitwardenSyncSecret for external secrets (cloud credentials, kubeconfigs, API tokens, OAuth proxy) — 9 secrets for Parsec, 1 for MLflow
  • StringSecret for auto-generated credentials (MLflow PostgreSQL password, API htpasswd)
  • Bitwarden project parsec with all secret entries

Key Design Decisions

  • Flat templates with domain prefixes (middle ground between flat and subdirectories)
  • cm-env.yaml pattern centralizes all environment variables (deployment has zero inline env: blocks)
  • OAuth proxy clientSecret stored in Bitwarden and passed via values.yaml (OAuthClient needs the literal value at render time)
  • BuildConfig/ImageStream retained for on-cluster builds (future: migrate to external registry)
  • ArgoCD-ready with sync-wave annotations, also works with helm template | oc apply
  • Stakater Reloader annotation for automatic pod restarts on ConfigMap/Secret changes

Validated On

  • babydev (ocp-babydev.infra-us-east) — Parsec full deploy + OAuth login tested
  • ocpv-infra01 (parsec-dev-helm + mlflow-dev-helm namespaces) — Parsec + MLflow full deploy, all pods running, both UIs accessible

Usage

# Deploy Parsec only
helm template parsec helm/ -f helm/values-dev.yaml | oc apply -f -

# Deploy Parsec + MLflow
helm template parsec helm/ -f helm/values-dev.yaml --set mlflow.deploy=true | oc apply -f -

Test plan

  • helm lint passes
  • helm template renders 30 Parsec resources + 26 MLflow resources
  • Dry-run accepted by API server (oc apply --dry-run=server)
  • BitwardenSyncSecrets sync correctly (all 10 synced)
  • StringSecrets generate credentials (PostgreSQL, API htpasswd)
  • BuildConfig triggers image build from GitHub
  • Parsec pod starts with init containers + icinga-mcp sidecar
  • OAuth login flow works end-to-end
  • Health endpoints respond (/api/health, /api/health/ready)
  • MLflow PostgreSQL, tracking server, OAuth proxy, API proxy all running
  • MLflow UI accessible via OAuth
  • No interference with production deployment (parsec and mlflow namespaces)

Migrate Parsec deployment from Ansible playbooks to Helm chart following
team conventions (BitwardenSyncSecret, StringSecret, cm-env pattern).

- 30 templates organized by domain prefix (app-*, oauth-*, secrets-*, build-*, cronjob-*)
- Secrets managed via BitwardenSyncSecret (9 secrets) and Bitwarden project
- OAuth proxy with OAuthClient and BitwardenSyncSecret for credentials
- BuildConfig/ImageStream for on-cluster image builds
- Stakater Reloader annotation for automatic pod restarts
- ArgoCD-ready with sync-wave annotations
- values-dev.yaml for dev environment overrides
- Update CONTRIBUTING.md with Helm deployment instructions

Ansible playbooks are NOT removed — both deployment methods coexist.
Optional subchart (mlflow.deploy: true) for MLflow tracking server with:

- PostgreSQL with StringSecret for auto-generated DB password
- S3 artifact storage via ObjectBucketClaim (NooBaa)
- OAuth proxy with reencrypt TLS and SAR authorization
- Nginx API proxy with basic auth (StringSecret for htpasswd)
- 5 NetworkPolicies for inter-component traffic control
- OAuth credentials via BitwardenSyncSecret
- Align config defaults with Ansible production values (max_tokens,
  max_tool_rounds, batch_size, vertex.region, experiment_name, mcp_url)
- Add required() validations for critical values (vertex.project_id,
  reporting_mcp.mcp_url, buildConfig.gitRepo)
- Remove duplicate env vars from cm-env.yaml that already exist in
  config.yaml (follow Ansible pattern: config.yaml for app config,
  env vars only for paths/backends/connections)
- Add seccompProfile: RuntimeDefault to init containers and icinga
  sidecar (parity with production)
- Add config.anthropic.backend to values.yaml (was hardcoded)
- Add ingressDomain guard to OAuthClient and oauth-route
- Add alert_api_key and allowed_users to app-configmap.yaml (parity)
- Revert PVC default to ReadWriteMany (production requirement)
- MLflow: replace StringSecret htpasswd with basic-auth annotation
  (generates proper htpasswd format via secret-generator)
- MLflow: parameterize NetworkPolicy namespaces via parsecNamespaces
- MLflow: fix NOTES.txt URLs to use externalHostname helpers
- Remove unused cookie-secret from both OAuth BitwardenSyncSecrets
- Update CONTRIBUTING.md with OAuth secret prerequisites
- Fix MLflow Postgres liveness/readiness probes: use /bin/sh -c wrapper
  for $POSTGRESQL_USER expansion (K8s does not expand $(VAR) in exec
  probe command arrays)
- Remove orphaned app-cm-allowed-users.yaml (allowed_users moved to
  app-configmap.yaml, nothing references this ConfigMap anymore)
- Add seccompProfile: RuntimeDefault to OAuth proxy container
  (was missing while app-deployment.yaml containers had it)
- Fix learnings.admin_users type: use string (not array) to match
  code that calls .split(",")
- Add learnings.admin_groups to values.yaml and app-configmap.yaml
  (was missing, silently removing group-based admin access)
- Add parsecNamespaces override in values-dev.yaml for dev namespace
- Remove deleted parsec-allowed-users ConfigMap from RBAC Role
- Add stabilizationWindowSeconds: 60 to HPA scaleUp behavior
- Add timeoutSeconds to liveness (5s) and readiness (3s) probes on
  both app-deployment and oauth-deployment (parity with Ansible)
- Add config.mlflow.tracking_url to values-dev.yaml pointing to
  mlflow-dev-helm namespace (without it, Parsec silently drops metrics)
- Set github.mcp_url default to public Copilot MCP endpoint
- Document expected OBC provisioning delay in mlflow-deployment.yaml
  (CreateContainerConfigError for 10-30s on fresh install is normal)
- Add required() validation to parsec.image helper when
  buildConfig.enabled is false (prevents invalid :latest image ref)
@marcosmamorim marcosmamorim requested review from ptoal and rut31337 June 4, 2026 01:22
@rut31337

rut31337 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Code Review — PR #26

Scope: 60 files, +2403 lines — Helm chart for Parsec + MLflow subchart
Effort: High (7 angles, 1-vote verify)

Findings (10, ranked by severity)

1. Plaintext OAuth client secrets committed to git (helm/values.yaml:144, helm/charts/mlflow/values.yaml:51)
Both values.yaml files contain real OAuth client secrets rendered directly into OAuthClient resources. Anyone with repo read access can impersonate the OAuth clients. These should be empty strings in the committed defaults, injected via --set or a gitignored values file.

2. BitwardenSyncSecret entries for API tokens missing the key field (helm/templates/secrets-api-tokens.yaml:14)
alert-api-key, github-token, and reporting-mcp-token entries have only secret: with no key: subfield, unlike every other BitwardenSyncSecret in the chart (e.g. secrets-cloud-creds.yaml which uses both secret: and key:). If the CRD requires key to select a field from the Bitwarden item, the synced values will be empty.

3. MLflow postgres password uses mittwald StringSecret CRD — undocumented operator dependency (helm/charts/mlflow/templates/postgres-string-secret.yaml:2)
Uses apiVersion: secretgenerator.mittwald.de/v1alpha1 / kind: StringSecret, while the rest of the chart uses BitwardenSyncSecret. If the mittwald operator isn't installed (likely), the postgres Secret never gets created → MLflow deployment enters CreateContainerConfigError.

4. Generated config.yaml omits Splunk configuration entirely (helm/templates/app-configmap.yaml:31)
No splunk: section in the ConfigMap, no env vars in app-cm-env.yaml, no Bitwarden secret template for Splunk token. Deploying via Helm silently disables all Splunk integration — Babylon pod log and AAP2 log queries will fail.

5. MLflow runs pip install on every container start with --no-cache-dir (helm/charts/mlflow/templates/mlflow-deployment.yaml:81)
Every pod restart re-downloads ~30MB from PyPI. If PyPI is down or egress is blocked, the container enters CrashLoopBackOff. A custom image with pre-installed deps would eliminate this.

6. Ansible deployment labeled "legacy" while the untested Helm chart is labeled "recommended" (docs/CONTRIBUTING.md:133)
Ansible is the current production deployment method. A new team member following CONTRIBUTING.md could deploy with the v0.1.0 Helm chart to production, discover gaps, and cause a partial outage.

7. OAuth proxy TLS cert volume mounted but HTTPS is disabled — dead config creates false dependency (helm/templates/oauth-deployment.yaml:62)
Volume oauth-serving-cert is required (no optional: true) but the proxy runs HTTP-only (-https-address= is empty, no -tls-cert/-tls-key). Service also exposes dead port 8443.

8. PVC named "pricing-cache" actually stores persistent user data (helm/templates/app-pvc.yaml:5)
Mounted at /app/data, stores conversations, shares, and learnings alongside caches. Name suggests disposable data — an operator could delete it during maintenance.

9. serviceMonitor values defined but no ServiceMonitor template exists (helm/values.yaml:215)
Operator sets serviceMonitor.enabled: true expecting Prometheus scraping — nothing happens.

10. Config.yaml duplicates keys already set as PARSEC_ env vars (helm/templates/app-configmap.yaml:2)
reporting_mcp.mcp_url is set in both app-configmap.yaml and app-cm-env.yaml. Dynaconf env vars silently override the config file value.

Move OAuth clientSecret values to --set deploy-time injection
instead of committing them to values.yaml. Add missing splunk
configuration section to the Helm chart (host and verify_ssl in
ConfigMap, token via env var / secret). Adjust CONTRIBUTING.md
deployment method labels to neutral wording.
@marcosmamorim

marcosmamorim commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Code Review — PR #26 - @rut31337

#1 — OAuth client secrets: Fixed. Moved to --set at deploy time, values.yaml now has empty defaults. CONTRIBUTING.md updated with the --set oauth.clientSecret=... pattern.

#2 — BitwardenSyncSecret missing key field: By design. API token items in Bitwarden are single-value entries (one field per item), so the key selector is not needed. The cloud credentials use key because each Bitwarden item has multiple fields (e.g. access_key_id, secret_access_key from a single parsec-aws-credentials item).

#3 — StringSecret CRD dependency: Documented operator, present in all RHDP clusters. The mittwald secret-generator is deployed cluster-wide and used by other projects (demo-reporting, poolboy-scoring). Not an undocumented dependency.

#4 — Splunk config: Fixed. Added splunk section to values.yaml and the ConfigMap (host + verify_ssl). Token stays out of the ConfigMap — injected via PARSEC_SPLUNK__TOKEN env var or K8s Secret (Dynaconf override).

#5 — MLflow pip install at startup: Acknowledged. Parsec and MLflow are distinct applications running in separate namespaces and should have independent images and CI/CD workflows. Building a custom MLflow image with psycopg2-binary and boto3 pre-installed is not currently planned but should be done in the future.

#6 — "recommended" / "legacy" labels: Fixed. Changed to neutral "Method 1: Helm" and "Method 2: Ansible" — no production validation yet for Helm, so labeling is premature.

#7 — OAuth HTTPS port 8443 / serving-cert volume: Intentional, matches Ansible. This is the standard ose-oauth-proxy pattern on OpenShift. TLS termination happens at the Route level. The oauth-serving-cert secret is generated by the service.beta.openshift.io/serving-cert-secret-name annotation on the Service — the volume mount is required even when -https-address= is empty, as the proxy validates its presence. The Ansible template uses the identical configuration.

#8 — PVC named "pricing-cache": Matches Ansible. Both Ansible (parsec-pricing-cache, line 335 of manifests.yaml.j2) and Helm produce the same PVC name. Renaming now would be a breaking change for existing deployments. Worth addressing in a future migration, not in this PR.

#9 — ServiceMonitor values without template: Intentional. Kept disabled (enabled: false) as a placeholder — when we add Prometheus monitoring, we only need to add the template and flip the flag. No operational impact today.

#10 — Duplicate reporting_mcp.mcp_url: Intentional redundancy, not a regression. The Ansible template sets reporting_mcp.mcp_url only as an env var (line 214), not in its rendered config.yaml. The Helm chart does the same in cm-env.yaml. The ConfigMap also includes it as an explicit fallback — Dynaconf env vars take precedence, so there's no conflict. Both paths resolve to the same value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants