From 502f92e350fb306c411e174fb581d5c6dbba52d0 Mon Sep 17 00:00:00 2001 From: Taylor Mutch Date: Tue, 2 Jun 2026 16:55:03 -0700 Subject: [PATCH 1/2] fix(helm): create sandbox JWT secret under cert-manager The cert-manager install path (certManager.enabled=true, pkiInitJob.enabled=false) left the gateway StatefulSet unable to start because nothing created the openshell-jwt-keys Secret: cert-manager owns TLS Secrets but does not mint the sandbox JWT signing key, and the certgen hook only rendered when pkiInitJob.enabled was true. Separate JWT signing-key provisioning from TLS PKI provisioning: - certgen: add a --jwt-only mode that creates only the Opaque JWT signing Secret, for use when another controller owns TLS Secrets. - certgen.yaml: render the hook when pkiInitJob.enabled OR certManager.enabled is true. cert-manager takes precedence and runs the hook with --jwt-only even if pkiInitJob.enabled remains true. Remove the mutual-exclusion failure between the two values. - _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook and the StatefulSet mount. - Update values, README, docs, architecture, and the debug-openshell-cluster skill to reflect the new precedence; the documented cert-manager install no longer needs pkiInitJob.enabled=false. Closes #1691 --- .../skills/debug-openshell-cluster/SKILL.md | 9 ++ architecture/gateway.md | 47 ++++--- crates/openshell-server/src/certgen.rs | 124 +++++++++++++----- crates/openshell-server/src/cli.rs | 25 ++++ deploy/helm/openshell/README.md | 27 ++-- deploy/helm/openshell/README.md.gotmpl | 25 ++-- .../openshell/ci/values-cert-manager.yaml | 3 - deploy/helm/openshell/skaffold.yaml | 4 +- deploy/helm/openshell/templates/_helpers.tpl | 7 + deploy/helm/openshell/templates/certgen.yaml | 13 +- .../helm/openshell/templates/statefulset.yaml | 2 +- deploy/helm/openshell/tests/certgen_test.yaml | 116 ++++++++++++++++ deploy/helm/openshell/values.yaml | 18 ++- docs/kubernetes/managing-certificates.mdx | 11 +- docs/kubernetes/openshift.mdx | 6 +- 15 files changed, 334 insertions(+), 103 deletions(-) create mode 100644 deploy/helm/openshell/tests/certgen_test.yaml diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index aeaa503f7..b65bf26d2 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -157,6 +157,15 @@ kubectl -n openshell get secret \ openshell-jwt-keys ``` +In cert-manager installs, `certManager.enabled=true` makes cert-manager own TLS +generation. The Helm chart should still render the `openshell-certgen` +pre-install/pre-upgrade hook in JWT-only mode to create `openshell-jwt-keys`, +even if `pkiInitJob.enabled` remains true. +If the gateway pod is pending with `MountVolume.SetUp failed for volume +"sandbox-jwt"` and `openshell-jwt-keys` is absent, inspect the rendered +`templates/certgen.yaml` output and the hook Job logs; cert-manager creates TLS +Secrets but does not create the sandbox JWT signing Secret. + If the gateway exits with `failed to read sandbox JWT signing key from /etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains `signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the diff --git a/architecture/gateway.md b/architecture/gateway.md index 533ce04a5..5bb81e5d3 100644 --- a/architecture/gateway.md +++ b/architecture/gateway.md @@ -363,32 +363,39 @@ requested for that relay. ## PKI Bootstrap -`openshell-gateway generate-certs` is the one place mTLS materials are -created. Both deployment paths use it: +`openshell-gateway generate-certs` is the one place local mTLS materials and +sandbox JWT signing material are created. Deployment paths use it as follows: | Output mode | Selector | Layout | |---|---|---| -| Kubernetes Secrets | (default) `--namespace`, `--server-secret-name`, `--client-secret-name` | Two `kubernetes.io/tls` Secrets with `tls.crt` / `tls.key` / `ca.crt`. | -| Filesystem | `--output-dir ` | `/{ca.crt, ca.key, server/tls.{crt,key}, client/tls.{crt,key}}`. Also copies client materials to `$XDG_CONFIG_HOME/openshell/gateways/openshell/mtls/` for CLI auto-discovery. | +| Kubernetes Secrets | (default) `--namespace`, `--server-secret-name`, `--client-secret-name`, `--jwt-secret-name` | Two `kubernetes.io/tls` Secrets with `tls.crt` / `tls.key` / `ca.crt` plus one Opaque sandbox JWT Secret with `signing.pem` / `public.pem` / `kid`. | +| Kubernetes JWT-only Secret | `--namespace`, `--jwt-only`, `--jwt-secret-name` | One Opaque sandbox JWT Secret with `signing.pem` / `public.pem` / `kid`. | +| Filesystem | `--output-dir ` | `/{ca.crt, ca.key, server/tls.{crt,key}, client/tls.{crt,key}, jwt/{signing.pem,public.pem,kid}}`. Also copies client materials to `$XDG_CONFIG_HOME/openshell/gateways/openshell/mtls/` for CLI auto-discovery. | On Kubernetes, the Helm chart runs the command via a pre-install/pre-upgrade hook Job using the gateway image itself -- no separate cert-generation image, -no extra mirror burden in air-gapped environments. On package-managed local -gateways, the same command runs from the systemd unit's `ExecStartPre` to -bootstrap PKI into the configured local TLS directory on first start. The -Linux package unit defaults that directory to `~/.local/state/openshell/tls` -through `OPENSHELL_LOCAL_TLS_DIR` so certificate generation and runtime -auto-detection use the same path across systemd versions. - -Both modes share the same idempotency contract: all targets present -> skip; -partial state -> fail with a recovery hint; nothing present -> generate and -write. This guards mTLS continuity across restarts and upgrades while still -recovering cleanly if an operator deletes everything and starts over. - -Operators who manage PKI externally (cert-manager, an enterprise CA, or -pre-created Secrets) disable the Helm hook via `pkiInitJob.enabled=false`. -The chart also ships a `certManager.*` path that produces equivalent Secrets -through cert-manager `Issuer`/`Certificate` resources. +no extra mirror burden in air-gapped environments. In the default built-in PKI +path the hook creates TLS and sandbox JWT Secrets. When cert-manager is enabled, +cert-manager owns TLS Secrets and the hook runs with `--jwt-only` so the +required sandbox JWT Secret still exists before the gateway StatefulSet mounts +it, even if `pkiInitJob.enabled` remains true. On package-managed local +gateways, the same command runs from the systemd +unit's `ExecStartPre` to bootstrap PKI into the configured local TLS directory +on first start. The Linux package unit defaults that directory to +`~/.local/state/openshell/tls` through `OPENSHELL_LOCAL_TLS_DIR` so certificate +generation and runtime auto-detection use the same path across systemd +versions. + +The bootstrap paths share the same idempotency contract: all requested targets +present -> skip; partial requested state -> fail with a recovery hint; nothing +requested present -> generate and write. This guards continuity across restarts +and upgrades while still recovering cleanly if an operator deletes everything +and starts over. + +Operators who manage TLS PKI with cert-manager enable `certManager.enabled`; +cert-manager takes precedence over built-in TLS generation and the chart still +renders the JWT-only hook. Operators who pre-create all TLS and JWT Secrets can +disable both `pkiInitJob.enabled` and `certManager.enabled`. ## Configuration diff --git a/crates/openshell-server/src/certgen.rs b/crates/openshell-server/src/certgen.rs index b7ce0421c..683ed180f 100644 --- a/crates/openshell-server/src/certgen.rs +++ b/crates/openshell-server/src/certgen.rs @@ -6,8 +6,12 @@ //! Two output modes, dispatched by the presence of `--output-dir`: //! //! - **Kubernetes mode** (default): create two `kubernetes.io/tls` Secrets -//! in the supplied namespace. Used by the Helm pre-install hook. Requires -//! `--namespace`, `--server-secret-name`, `--client-secret-name`. +//! and one sandbox-JWT signing Secret in the supplied namespace. Used by +//! the Helm pre-install hook. Requires `--namespace`, +//! `--server-secret-name`, `--client-secret-name`, and `--jwt-secret-name`. +//! - **Kubernetes JWT-only mode** (`--jwt-only`): create only the +//! sandbox-JWT signing Secret. Used when another controller, such as +//! cert-manager, owns the TLS Secrets. //! - **Local mode** (`--output-dir `): write PEMs to the local package //! filesystem layout. Used by systemd units' `ExecStartPre`. Also copies //! client materials to @@ -47,11 +51,11 @@ pub struct CertgenArgs { namespace: Option, /// Name of the server TLS Secret (`kubernetes.io/tls`) to create. - #[arg(long, required_unless_present = "output_dir")] + #[arg(long, required_unless_present_any = ["output_dir", "jwt_only"])] server_secret_name: Option, /// Name of the client TLS Secret (`kubernetes.io/tls`) to create. - #[arg(long, required_unless_present = "output_dir")] + #[arg(long, required_unless_present_any = ["output_dir", "jwt_only"])] client_secret_name: Option, /// Name of the sandbox-JWT signing-key Secret (`Opaque`) to create. @@ -60,6 +64,11 @@ pub struct CertgenArgs { #[arg(long, required_unless_present = "output_dir")] jwt_secret_name: Option, + /// Create only the sandbox-JWT signing-key Secret in Kubernetes mode. + /// This is used when another controller owns TLS Secret provisioning. + #[arg(long, conflicts_with = "output_dir")] + jwt_only: bool, + /// Extra Subject Alternative Name for the server certificate. Repeatable. /// Auto-detected as an IP address or DNS name. #[arg(long = "server-san", value_name = "SAN")] @@ -116,6 +125,51 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { .namespace .as_deref() .ok_or_else(|| miette::miette!("--namespace is required (or set POD_NAMESPACE)"))?; + + let client = Client::try_default() + .await + .into_diagnostic() + .wrap_err("failed to construct in-cluster Kubernetes client")?; + let api: Api = Api::namespaced(client, namespace); + + if args.jwt_only { + let jwt_name = args + .jwt_secret_name + .as_deref() + .ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?; + let jwt_exists = api + .get_opt(jwt_name) + .await + .into_diagnostic() + .wrap_err_with(|| format!("failed to read secret {jwt_name}"))? + .is_some(); + if jwt_exists { + info!( + namespace = %namespace, + jwt = %jwt_name, + "JWT signing secret already exists, skipping." + ); + return Ok(()); + } + + let jwt_secret = jwt_signing_secret( + jwt_name, + &bundle.jwt_signing_key_pem, + &bundle.jwt_public_key_pem, + &bundle.jwt_key_id, + ); + api.create(&PostParams::default(), &jwt_secret) + .await + .into_diagnostic() + .wrap_err_with(|| format!("failed to create secret {jwt_name}"))?; + info!( + namespace = %namespace, + jwt = %jwt_name, + "JWT signing secret created." + ); + return Ok(()); + } + let server_name = args .server_secret_name .as_deref() @@ -124,17 +178,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { .client_secret_name .as_deref() .ok_or_else(|| miette::miette!("--client-secret-name is required"))?; - let jwt_name = args - .jwt_secret_name - .as_deref() - .ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?; - - let client = Client::try_default() - .await - .into_diagnostic() - .wrap_err("failed to construct in-cluster Kubernetes client")?; - let api: Api = Api::namespaced(client, namespace); - let server_exists = api .get_opt(server_name) .await @@ -147,6 +190,11 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { .into_diagnostic() .wrap_err_with(|| format!("failed to read secret {client_name}"))? .is_some(); + + let jwt_name = args + .jwt_secret_name + .as_deref() + .ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?; let jwt_exists = api .get_opt(jwt_name) .await @@ -193,6 +241,34 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { K8sAction::CreateAll => {} } + create_tls_secrets(&api, server_name, client_name, bundle).await?; + let jwt_secret = jwt_signing_secret( + jwt_name, + &bundle.jwt_signing_key_pem, + &bundle.jwt_public_key_pem, + &bundle.jwt_key_id, + ); + api.create(&PostParams::default(), &jwt_secret) + .await + .into_diagnostic() + .wrap_err_with(|| format!("failed to create secret {jwt_name}"))?; + + info!( + namespace = %namespace, + server = %server_name, + client = %client_name, + jwt = %jwt_name, + "PKI secrets created." + ); + Ok(()) +} + +async fn create_tls_secrets( + api: &Api, + server_name: &str, + client_name: &str, + bundle: &PkiBundle, +) -> Result<()> { let server_secret = tls_secret( server_name, &bundle.server_cert_pem, @@ -205,12 +281,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { &bundle.client_key_pem, &bundle.ca_cert_pem, ); - let jwt_secret = jwt_signing_secret( - jwt_name, - &bundle.jwt_signing_key_pem, - &bundle.jwt_public_key_pem, - &bundle.jwt_key_id, - ); api.create(&PostParams::default(), &server_secret) .await @@ -220,18 +290,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> { .await .into_diagnostic() .wrap_err_with(|| format!("failed to create secret {client_name}"))?; - api.create(&PostParams::default(), &jwt_secret) - .await - .into_diagnostic() - .wrap_err_with(|| format!("failed to create secret {jwt_name}"))?; - - info!( - namespace = %namespace, - server = %server_name, - client = %client_name, - jwt = %jwt_name, - "PKI secrets created." - ); Ok(()) } diff --git a/crates/openshell-server/src/cli.rs b/crates/openshell-server/src/cli.rs index b8d345f9e..9fb0ba339 100644 --- a/crates/openshell-server/src/cli.rs +++ b/crates/openshell-server/src/cli.rs @@ -979,6 +979,31 @@ mod tests { )); } + #[test] + fn generate_certs_jwt_only_parses_without_tls_secret_names() { + let _lock = ENV_LOCK + .lock() + .unwrap_or_else(std::sync::PoisonError::into_inner); + let _g1 = EnvVarGuard::remove("OPENSHELL_DB_URL"); + let _g2 = EnvVarGuard::remove("POD_NAMESPACE"); + + let cli = Cli::try_parse_from([ + "openshell-gateway", + "generate-certs", + "--namespace", + "openshell", + "--jwt-only", + "--jwt-secret-name", + "openshell-jwt-keys", + ]) + .expect("--jwt-only should make TLS secret-name flags optional"); + + assert!(matches!( + cli.command, + Some(super::Commands::GenerateCerts(_)) + )); + } + #[test] fn bare_invocation_with_no_db_url_parses_for_runtime_defaults() { // db_url is Option at the clap level so subcommand parsing diff --git a/deploy/helm/openshell/README.md b/deploy/helm/openshell/README.md index 15c9f7f85..62b313e0b 100644 --- a/deploy/helm/openshell/README.md +++ b/deploy/helm/openshell/README.md @@ -109,22 +109,19 @@ Append these flags to any of the PostgreSQL commands above for OpenShift: --set securityContext.runAsUser=null ``` -## PKI bootstrap +## Secret bootstrap By default, a pre-install/pre-upgrade hook Job runs `openshell-gateway generate-certs` -to create the gateway's server and client mTLS Secrets. The Job uses the gateway image -itself, so air-gapped environments only need to mirror that one image (no separate -openssl/alpine sidecar). +to create the gateway's server/client mTLS Secrets and sandbox JWT signing Secret. +The Job uses the gateway image itself, so air-gapped environments only need to +mirror that one image (no separate openssl/alpine sidecar). -The Job is idempotent: - -- Both target Secrets exist: log and exit 0. -- Exactly one exists: fail with `kubectl delete secret -n ` recovery hint. -- Neither exists: generate a CA, server cert, and client cert; POST both `kubernetes.io/tls` Secrets (`tls.crt`, `tls.key`, `ca.crt`). - -Disable with `--set pkiInitJob.enabled=false` when bringing your own PKI (cert-manager, -external CA, or pre-created Secrets). See `certManager.*` in `values.yaml` for the -cert-manager alternative. +When `certManager.enabled=true`, cert-manager owns the TLS Secrets and the chart +runs the same hook in JWT-only mode because cert-manager does not create the +sandbox JWT signing Secret. This precedence applies even if +`pkiInitJob.enabled` remains true. Set `pkiInitJob.enabled=false` only when an +external non-cert-manager TLS source manages TLS and you pre-create the sandbox +JWT signing Secret. ## Values @@ -135,7 +132,7 @@ cert-manager alternative. | certManager.certificateDuration | string | `"8760h"` | Duration for cert-manager-issued certificates. | | certManager.certificateRenewBefore | string | `"720h"` | Renewal window for cert-manager-issued certificates. | | certManager.clientCaFromServerTlsSecret | bool | `true` | Mount gateway client CA from the server TLS secret's ca.crt (populated by cert-manager for certs issued by a CA Issuer). Avoids a separate openshell-server-client-ca Secret. | -| certManager.enabled | bool | `false` | Create cert-manager Issuer and Certificate resources instead of using the PKI bootstrap Job. | +| certManager.enabled | bool | `false` | Create cert-manager Issuer and Certificate resources. When enabled, cert-manager owns TLS and the chart runs a JWT-only certgen hook to create the sandbox JWT signing Secret that cert-manager does not manage. | | certManager.serverDnsNames | list | `["openshell","openshell.openshell.svc","openshell.openshell.svc.cluster.local","localhost","openshell.localhost","*.openshell.localhost","host.docker.internal"]` | DNS SANs on the cert-manager-issued server certificate. | | certManager.serverIpAddresses | list | `["127.0.0.1"]` | IP SANs on the cert-manager-issued server certificate. | | fullnameOverride | string | `""` | Override the full generated resource name. | @@ -155,7 +152,7 @@ cert-manager alternative. | nameOverride | string | `"openshell"` | Override the chart name used in generated resource names. | | networkPolicy.enabled | bool | `true` | Create a NetworkPolicy restricting SSH ingress on sandbox pods to the gateway. | | nodeSelector | object | `{}` | Node selector for the gateway pod. | -| pkiInitJob.enabled | bool | `true` | Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. | +| pkiInitJob.enabled | bool | `true` | Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. When certManager.enabled=true, cert-manager owns TLS and this same hook runs in JWT-only mode even if pkiInitJob.enabled remains true. | | pkiInitJob.serverDnsNames | list | `[]` | Extra DNS SANs to append to the server certificate. | | pkiInitJob.serverIpAddresses | list | `[]` | Extra IP SANs to append to the server certificate. | | podAnnotations | object | `{}` | Extra annotations to add to the gateway pod. | diff --git a/deploy/helm/openshell/README.md.gotmpl b/deploy/helm/openshell/README.md.gotmpl index fc391a416..5fc4019e8 100644 --- a/deploy/helm/openshell/README.md.gotmpl +++ b/deploy/helm/openshell/README.md.gotmpl @@ -109,22 +109,19 @@ Append these flags to any of the PostgreSQL commands above for OpenShift: --set securityContext.runAsUser=null ``` -## PKI bootstrap +## Secret bootstrap By default, a pre-install/pre-upgrade hook Job runs `openshell-gateway generate-certs` -to create the gateway's server and client mTLS Secrets. The Job uses the gateway image -itself, so air-gapped environments only need to mirror that one image (no separate -openssl/alpine sidecar). - -The Job is idempotent: - -- Both target Secrets exist: log and exit 0. -- Exactly one exists: fail with `kubectl delete secret -n ` recovery hint. -- Neither exists: generate a CA, server cert, and client cert; POST both `kubernetes.io/tls` Secrets (`tls.crt`, `tls.key`, `ca.crt`). - -Disable with `--set pkiInitJob.enabled=false` when bringing your own PKI (cert-manager, -external CA, or pre-created Secrets). See `certManager.*` in `values.yaml` for the -cert-manager alternative. +to create the gateway's server/client mTLS Secrets and sandbox JWT signing Secret. +The Job uses the gateway image itself, so air-gapped environments only need to +mirror that one image (no separate openssl/alpine sidecar). + +When `certManager.enabled=true`, cert-manager owns the TLS Secrets and the chart +runs the same hook in JWT-only mode because cert-manager does not create the +sandbox JWT signing Secret. This precedence applies even if +`pkiInitJob.enabled` remains true. Set `pkiInitJob.enabled=false` only when an +external non-cert-manager TLS source manages TLS and you pre-create the sandbox +JWT signing Secret. {{ template "chart.valuesSection" . }} {{ template "helm-docs.versionFooter" . }} diff --git a/deploy/helm/openshell/ci/values-cert-manager.yaml b/deploy/helm/openshell/ci/values-cert-manager.yaml index ed99c8b46..2d159c176 100644 --- a/deploy/helm/openshell/ci/values-cert-manager.yaml +++ b/deploy/helm/openshell/ci/values-cert-manager.yaml @@ -7,8 +7,5 @@ server: disableTls: false -pkiInitJob: - enabled: false - certManager: enabled: true diff --git a/deploy/helm/openshell/skaffold.yaml b/deploy/helm/openshell/skaffold.yaml index dcf578d22..9a056238a 100644 --- a/deploy/helm/openshell/skaffold.yaml +++ b/deploy/helm/openshell/skaffold.yaml @@ -60,7 +60,7 @@ deploy: helm: releases: # cert-manager — comment this in and add values-cert-manager.yaml below - # when you want cert-manager to manage the PKI instead of pkiInitJob. + # when you want cert-manager to manage TLS while certgen handles JWT. # Requires cert-manager CRDs to be installed in the cluster first. #- name: cert-manager # repo: https://charts.jetstack.io @@ -89,7 +89,7 @@ deploy: - values.yaml - ci/values-skaffold.yaml # Add ci/values-cert-manager.yaml here (and uncomment the cert-manager - # release above) to switch from pkiInitJob to cert-manager for PKI. + # release above) to switch TLS generation to cert-manager. #- ci/values-cert-manager.yaml # To enable OIDC with a local Keycloak instance, run the one-time # setup task first, then uncomment the line below: diff --git a/deploy/helm/openshell/templates/_helpers.tpl b/deploy/helm/openshell/templates/_helpers.tpl index c40be9b80..a8e7ac721 100644 --- a/deploy/helm/openshell/templates/_helpers.tpl +++ b/deploy/helm/openshell/templates/_helpers.tpl @@ -133,6 +133,13 @@ Name of the Secret holding the PostgreSQL connection URI. {{- end -}} {{- end }} +{{/* +Name of the Secret holding gateway-minted sandbox JWT signing material. +*/}} +{{- define "openshell.sandboxJwtSecretName" -}} +{{- .Values.server.sandboxJwt.signingSecretName | default (printf "%s-jwt-keys" (include "openshell.fullname" .)) -}} +{{- end }} + {{/* gRPC endpoint sandbox pods use to call back into the gateway. An explicit .Values.server.grpcEndpoint is used verbatim. Otherwise it is derived from diff --git a/deploy/helm/openshell/templates/certgen.yaml b/deploy/helm/openshell/templates/certgen.yaml index 61203760b..3651e188f 100644 --- a/deploy/helm/openshell/templates/certgen.yaml +++ b/deploy/helm/openshell/templates/certgen.yaml @@ -1,10 +1,7 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -{{- if and .Values.pkiInitJob.enabled .Values.certManager.enabled }} -{{- fail "pkiInitJob.enabled and certManager.enabled cannot both be true; disable one to avoid conflicting PKI sources." }} -{{- end }} -{{- if .Values.pkiInitJob.enabled }} +{{- if or .Values.pkiInitJob.enabled .Values.certManager.enabled }} {{- $hookName := printf "%s-certgen" (include "openshell.fullname" .) }} {{- $ns := .Release.Namespace }} apiVersion: v1 @@ -98,13 +95,19 @@ spec: command: ["/usr/local/bin/openshell-gateway"] args: - generate-certs + {{- if .Values.certManager.enabled }} + - --jwt-only + {{- else }} - --server-secret-name={{ .Values.server.tls.certSecretName }} - --client-secret-name={{ .Values.server.tls.clientTlsSecretName }} - - --jwt-secret-name={{ .Values.server.sandboxJwt.signingSecretName | default (printf "%s-jwt-keys" (include "openshell.fullname" .)) }} + {{- end }} + - --jwt-secret-name={{ include "openshell.sandboxJwtSecretName" . }} + {{- if and .Values.pkiInitJob.enabled (not .Values.certManager.enabled) }} {{- range .Values.pkiInitJob.serverDnsNames }} - --server-san={{ . }} {{- end }} {{- range .Values.pkiInitJob.serverIpAddresses }} - --server-san={{ . }} {{- end }} + {{- end }} {{- end }} diff --git a/deploy/helm/openshell/templates/statefulset.yaml b/deploy/helm/openshell/templates/statefulset.yaml index 087748d38..8be7fc206 100644 --- a/deploy/helm/openshell/templates/statefulset.yaml +++ b/deploy/helm/openshell/templates/statefulset.yaml @@ -147,7 +147,7 @@ spec: name: {{ include "openshell.fullname" . }}-config - name: sandbox-jwt secret: - secretName: {{ .Values.server.sandboxJwt.signingSecretName | default (printf "%s-jwt-keys" (include "openshell.fullname" .)) }} + secretName: {{ include "openshell.sandboxJwtSecretName" . }} defaultMode: {{ .Values.server.sandboxJwt.secretDefaultMode | default 0400 }} {{- if not .Values.server.disableTls }} - name: tls-cert diff --git a/deploy/helm/openshell/tests/certgen_test.yaml b/deploy/helm/openshell/tests/certgen_test.yaml new file mode 100644 index 000000000..cd88b60e9 --- /dev/null +++ b/deploy/helm/openshell/tests/certgen_test.yaml @@ -0,0 +1,116 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +suite: certgen hook +templates: + - templates/certgen.yaml +release: + name: openshell + namespace: my-namespace + +tests: + - it: renders combined TLS and JWT certgen hook by default + template: templates/certgen.yaml + asserts: + - hasDocuments: + count: 4 + - equal: + path: kind + value: Job + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--server-secret-name=openshell-server-tls" + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--client-secret-name=openshell-client-tls" + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-secret-name=openshell-jwt-keys" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--jwt-only" + documentIndex: 3 + + - it: renders JWT-only certgen hook when cert-manager owns TLS + template: templates/certgen.yaml + set: + certManager.enabled: true + pkiInitJob.enabled: false + asserts: + - hasDocuments: + count: 4 + - equal: + path: kind + value: Job + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-only" + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-secret-name=openshell-jwt-keys" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--server-secret-name=openshell-server-tls" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--client-secret-name=openshell-client-tls" + documentIndex: 3 + + - it: uses the configured sandbox JWT secret name + template: templates/certgen.yaml + set: + server.sandboxJwt.signingSecretName: custom-jwt-keys + asserts: + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-secret-name=custom-jwt-keys" + documentIndex: 3 + + - it: renders JWT-only hook when cert-manager is enabled even if pkiInitJob remains enabled + template: templates/certgen.yaml + set: + certManager.enabled: true + pkiInitJob.enabled: true + pkiInitJob.serverDnsNames: + - extra.example.test + pkiInitJob.serverIpAddresses: + - 192.0.2.10 + asserts: + - hasDocuments: + count: 4 + - equal: + path: kind + value: Job + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-only" + documentIndex: 3 + - contains: + path: spec.template.spec.containers[0].args + content: "--jwt-secret-name=openshell-jwt-keys" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--server-secret-name=openshell-server-tls" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--client-secret-name=openshell-client-tls" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--server-san=extra.example.test" + documentIndex: 3 + - notContains: + path: spec.template.spec.containers[0].args + content: "--server-san=192.0.2.10" + documentIndex: 3 diff --git a/deploy/helm/openshell/values.yaml b/deploy/helm/openshell/values.yaml index 4d03b86d7..26b1a2486 100644 --- a/deploy/helm/openshell/values.yaml +++ b/deploy/helm/openshell/values.yaml @@ -196,9 +196,11 @@ server: clientCaSecretName: openshell-server-client-ca # -- K8s secret mounted into sandbox pods for mTLS to the server. clientTlsSecretName: openshell-client-tls - # Gateway-minted sandbox JWT signing keys. The pre-install certgen hook - # generates an Ed25519 keypair and writes it to a secret containing - # signing.pem (PKCS#8), public.pem (SPKI), and kid (plain text). + # Gateway-minted sandbox JWT signing keys. The certgen hook generates an + # Ed25519 keypair and writes it to a secret containing signing.pem (PKCS#8), + # public.pem (SPKI), and kid (plain text). The hook runs in full PKI mode when + # pkiInitJob.enabled=true unless certManager.enabled=true, which takes + # precedence and runs the hook in JWT-only mode. sandboxJwt: # -- Name of the Opaque Secret holding the signing key material. Empty # falls back to the chart fullname with "-jwt-keys" appended. @@ -270,7 +272,7 @@ networkPolicy: # -- Create a NetworkPolicy restricting SSH ingress on sandbox pods to the gateway. enabled: true -# PKI bootstrap via a pre-install/pre-upgrade hook Job. +# Built-in TLS PKI bootstrap via a pre-install/pre-upgrade hook Job. # Runs `openshell-gateway generate-certs` to create the server and client TLS # Secrets in-cluster. Key material is written directly to K8s Secrets and # never appears in Helm release history. Idempotent: existing secrets are @@ -285,7 +287,9 @@ networkPolicy: # that domain, for example `*.apps.example.com` enables # `--.apps.example.com`. pkiInitJob: - # -- Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. + # -- Run a pre-install/pre-upgrade Job that creates gateway and client mTLS + # Secrets. When certManager.enabled=true, cert-manager owns TLS and this same + # hook runs in JWT-only mode even if pkiInitJob.enabled remains true. enabled: true # -- Extra DNS SANs to append to the server certificate. serverDnsNames: [] @@ -295,7 +299,9 @@ pkiInitJob: # cert-manager Certificate/Issuer resources (requires cert-manager CRDs in-cluster). # Uses namespaced Issuers only (no ClusterIssuer). Does not install cert-manager itself. certManager: - # -- Create cert-manager Issuer and Certificate resources instead of using the PKI bootstrap Job. + # -- Create cert-manager Issuer and Certificate resources. When enabled, + # cert-manager owns TLS and the chart runs a JWT-only certgen hook to create + # the sandbox JWT signing Secret that cert-manager does not manage. enabled: false # -- Secret created for the intermediate CA (Certificate with isCA: true). caSecretName: openshell-ca-tls diff --git a/docs/kubernetes/managing-certificates.mdx b/docs/kubernetes/managing-certificates.mdx index 42536cb9d..a445f77e8 100644 --- a/docs/kubernetes/managing-certificates.mdx +++ b/docs/kubernetes/managing-certificates.mdx @@ -18,7 +18,10 @@ The OpenShell gateway uses mTLS certificates for transport between the gateway a The rest of this page covers switching to cert-manager. The built-in mode requires no configuration. -cert-manager and `pkiInitJob` are mutually exclusive. The chart fails if both are enabled at the same time. +When `certManager.enabled=true`, cert-manager owns TLS certificate generation. +The chart still runs a JWT-only initialization hook because cert-manager does +not create the sandbox JWT signing Secret required by the gateway. This +cert-manager precedence applies even if `pkiInitJob.enabled` remains true. ## Install cert-manager @@ -50,11 +53,13 @@ helm upgrade --install openshell \ oci://ghcr.io/nvidia/openshell/helm-chart \ --version \ --namespace openshell \ - --set certManager.enabled=true \ - --set pkiInitJob.enabled=false + --set certManager.enabled=true ``` The chart creates a self-signed CA, issues server and client certificates from it, and cert-manager handles renewal before expiry. +The chart also runs a pre-install hook in JWT-only mode to create the gateway's +sandbox JWT signing Secret. That Secret is separate from the cert-manager TLS +certificate Secrets and is mounted at `/etc/openshell-jwt`. ## Next Steps diff --git a/docs/kubernetes/openshift.mdx b/docs/kubernetes/openshift.mdx index acad5ff79..e56fc37db 100644 --- a/docs/kubernetes/openshift.mdx +++ b/docs/kubernetes/openshift.mdx @@ -54,10 +54,14 @@ helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \ | Override | Reason | |---|---| -| `pkiInitJob.enabled=false` | The PKI init Job runs as a non-root user with a fixed UID, which the SCC admission rewrites or rejects. Disabling it skips the Job; TLS must also be disabled. | +| `pkiInitJob.enabled=false` | Skips the built-in TLS PKI Job. TLS must also be disabled unless you provide TLS Secrets another way. | | `server.disableTls=true` | The gateway has no certificates without `pkiInitJob`, so it must run plaintext. | | `podSecurityContext.fsGroup=null` / `securityContext.runAsUser=null` | Clear the chart's hardcoded UID and fsGroup so OpenShift's SCC admission can assign them. | +The gateway still needs the sandbox JWT signing Secret. When disabling +`pkiInitJob` without enabling cert-manager, pre-create that Secret before +installing the chart. + ## Wait for the gateway to be ready ```shell From 8b5b09755ae32269a367f8b6cd2142c655b8a4c6 Mon Sep 17 00:00:00 2001 From: Taylor Mutch Date: Wed, 3 Jun 2026 09:18:06 -0700 Subject: [PATCH 2/2] fix(helm): honor cert-manager precedence for client CA volume The client CA volume logic treated pkiInitJob.enabled as proof that built-in PKI owns the client CA. With cert-manager precedence now allowing certManager.enabled=true alongside the default pkiInitJob.enabled=true, that assumption mounts the server TLS cert secret as the client CA and ignores certManager.clientCaFromServerTlsSecret=false, which can break mTLS or trust the wrong CA. Gate the pkiInitJob.enabled term with (not certManager.enabled) in all three client CA conditions (volume mount, volume definition, and secret selection) so cert-manager owns TLS when enabled. Add a Helm test suite covering built-in PKI, cert-manager shared CA, the regression config (cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob), and the no-client-CA case. --- .../helm/openshell/templates/statefulset.yaml | 6 +- .../tests/statefulset_client_ca_test.yaml | 85 +++++++++++++++++++ 2 files changed, 88 insertions(+), 3 deletions(-) create mode 100644 deploy/helm/openshell/tests/statefulset_client_ca_test.yaml diff --git a/deploy/helm/openshell/templates/statefulset.yaml b/deploy/helm/openshell/templates/statefulset.yaml index 8be7fc206..d56c06355 100644 --- a/deploy/helm/openshell/templates/statefulset.yaml +++ b/deploy/helm/openshell/templates/statefulset.yaml @@ -93,7 +93,7 @@ spec: - name: tls-cert mountPath: /etc/openshell-tls/server readOnly: true - {{- if or .Values.server.tls.clientCaSecretName .Values.pkiInitJob.enabled (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} - name: tls-client-ca mountPath: /etc/openshell-tls/client-ca readOnly: true @@ -153,10 +153,10 @@ spec: - name: tls-cert secret: secretName: {{ .Values.server.tls.certSecretName }} - {{- if or .Values.server.tls.clientCaSecretName .Values.pkiInitJob.enabled (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} - name: tls-client-ca secret: - {{- if or .Values.pkiInitJob.enabled (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + {{- if or (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} secretName: {{ .Values.server.tls.certSecretName }} items: - key: ca.crt diff --git a/deploy/helm/openshell/tests/statefulset_client_ca_test.yaml b/deploy/helm/openshell/tests/statefulset_client_ca_test.yaml new file mode 100644 index 000000000..a7b02310c --- /dev/null +++ b/deploy/helm/openshell/tests/statefulset_client_ca_test.yaml @@ -0,0 +1,85 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +suite: statefulset client CA volume +templates: + - templates/gateway-config.yaml + - templates/statefulset.yaml +release: + name: openshell + namespace: my-namespace + +tests: + - it: mounts the server TLS secret ca.crt as client CA for built-in PKI + template: templates/statefulset.yaml + set: + pkiInitJob.enabled: true + certManager.enabled: false + asserts: + - equal: + path: spec.template.spec.volumes[3].name + value: tls-client-ca + - equal: + path: spec.template.spec.volumes[3].secret.secretName + value: openshell-server-tls + - equal: + path: spec.template.spec.volumes[3].secret.items[0].key + value: ca.crt + + - it: shares the cert-manager server TLS ca.crt when clientCaFromServerTlsSecret is true + template: templates/statefulset.yaml + set: + certManager.enabled: true + certManager.clientCaFromServerTlsSecret: true + asserts: + - equal: + path: spec.template.spec.volumes[3].name + value: tls-client-ca + - equal: + path: spec.template.spec.volumes[3].secret.secretName + value: openshell-server-tls + - equal: + path: spec.template.spec.volumes[3].secret.items[0].key + value: ca.crt + + # Regression: with cert-manager enabled and pkiInitJob left at its default + # `true`, the client CA condition must honor certManager precedence and NOT + # treat pkiInitJob.enabled as built-in TLS. With clientCaFromServerTlsSecret=false + # the gateway must mount the separate clientCaSecretName, not the server TLS + # cert secret. + - it: uses clientCaSecretName under cert-manager even when pkiInitJob stays enabled + template: templates/statefulset.yaml + set: + certManager.enabled: true + certManager.clientCaFromServerTlsSecret: false + pkiInitJob.enabled: true + asserts: + - equal: + path: spec.template.spec.volumes[3].name + value: tls-client-ca + - equal: + path: spec.template.spec.volumes[3].secret.secretName + value: openshell-server-client-ca + - notExists: + path: spec.template.spec.volumes[3].secret.items + + # When cert-manager owns TLS, does not share its CA, and no separate client CA + # secret is configured, there is no client CA to mount: the volume must not + # render rather than mounting an empty secret name. + - it: omits the client CA volume when cert-manager owns TLS and no client CA secret is set + template: templates/statefulset.yaml + set: + certManager.enabled: true + certManager.clientCaFromServerTlsSecret: false + pkiInitJob.enabled: true + server.tls.clientCaSecretName: "" + asserts: + - lengthEqual: + path: spec.template.spec.volumes + count: 3 + - notContains: + path: spec.template.spec.containers[0].volumeMounts + content: + name: tls-client-ca + mountPath: /etc/openshell-tls/client-ca + readOnly: true