Gatewayapi Namespaced Mode #4690
electricjesus left a comment:
Good work on the runtime Helm rendering migration — dropping 53K lines of pre-rendered YAML is a huge win. The GatewayNamespace mode looks solid with good test coverage. A few observations below.
	CurrentGatewayClasses: set.New[string](),
}

if gatewayAPI.Spec.GatewayDeploymentMode == nil {
Nit: The CRD already has +kubebuilder:default=ControllerNamespace, so any persisted GatewayAPI resource will have this field populated by the API server. This runtime defaulting only matters for in-memory objects that were never persisted (tests?). Not a problem, just noting the redundancy — if the CRD default is the source of truth, a comment here explaining why you also default in code would help future readers.
It is used by the tests
All good catches man, all sorted
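For readers following the defaulting thread above, here is a minimal sketch of what in-code defaulting of a nil mode could look like. The types and constants are local stand-ins written for illustration, not the operator's API package, and the helper name defaultDeploymentMode is hypothetical.

```go
package main

import "fmt"

// GatewayDeploymentMode is a local stand-in for the operator's enum type.
type GatewayDeploymentMode string

const (
	ControllerNamespace GatewayDeploymentMode = "ControllerNamespace"
	GatewayNamespace    GatewayDeploymentMode = "GatewayNamespace"
)

// GatewayAPISpec is a simplified stand-in for the real spec.
type GatewayAPISpec struct {
	GatewayDeploymentMode *GatewayDeploymentMode
}

// defaultDeploymentMode (hypothetical helper) fills in the mode for in-memory
// objects that never passed through the API server, such as objects built in
// unit tests. Persisted objects already carry the CRD default, so for them
// this is a no-op.
func defaultDeploymentMode(spec *GatewayAPISpec) {
	if spec.GatewayDeploymentMode == nil {
		mode := ControllerNamespace
		spec.GatewayDeploymentMode = &mode
	}
}

func main() {
	spec := GatewayAPISpec{} // as a test fixture would build it
	defaultDeploymentMode(&spec)
	fmt.Println(*spec.GatewayDeploymentMode)
}
```

Defaulting before the first dereference also keeps the later `*gatewayAPI.Spec.GatewayDeploymentMode` comparison safe for such fixtures.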
// Gateway resources using operator-managed GatewayClasses. These namespaces need
// per-namespace Enterprise resources (SA, CRB, pull secrets).
if *gatewayAPI.Spec.GatewayDeploymentMode == operatorv1.GatewayDeploymentModeGatewayNamespace &&
	variant.IsEnterprise() {
Why does the Variant matter here at all?
For resources that are rendered only with an EE license, like WAF.
Force-pushed from 04d49c6 to 8c6f64c.
- Swap the checked-in gateway_api_resources.yaml for the embedded gateway-helm.tgz rendered via the helm SDK at startup; K8SGatewayAPICRDs/GatewayAPICRDs now take a runtime.Scheme and return an error (istio_controller updated for the new signature)
- Deploy two envoy-gateway controllers: legacy in tigera-gateway (user-declared classes via Spec.GatewayClasses) and a new one in calico-system with deploy.type=GatewayNamespace; auto-provision the tigera-gateway-class-ns GatewayClass bound to the new controller
- Group the tigera-gateway install behind legacyObjects/legacyTeardownObjects so the eventual deprecation is a single delete
- HasLegacyGateways classifier in the controller: build a className -> controllerName map seeded from Spec.GatewayClasses + existing GatewayClass resources, classify every live Gateway; when no Gateway targets the tigera-gateway controller, the install is torn down; during the teardown-then-redeploy race the legacy render is deferred to avoid a "Namespace is terminating, skipping creation" log flood
- Legacy teardown queues only the Namespace + cluster-scoped objects + the Deployment (for status.RemoveDeployments); in-namespace RBAC/Secrets ride the cascade to avoid the tigera-operator-secrets RoleBinding race
- Move the shared waf-http-filter ClusterRoles out of the legacy bundle so the calico-system-side proxies keep their cluster-scoped perms after tigera-gateway is retired
- Per-namespace Enterprise resources (SA, RoleBindings, pull secret, shared CRB subject) for namespaces hosting a namespaced-class Gateway; reserved namespaces skip shared resource create/delete; Secret goes before RoleBinding on cleanup to avoid 403
- Gate v3 NetworkPolicies on the calico-system Tier; render calico-system.envoy-gateway allow for the controller and certgen
- Update unit tests and Makefile/docs accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
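The HasLegacyGateways classification described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the operator's code: the Gateway type is a simplified stand-in, the controller names are hypothetical placeholders, and the rule that existing GatewayClass resources win over declared ones on conflict is an assumption the commit message does not spell out.

```go
package main

import "fmt"

// Gateway is a simplified stand-in for the Gateway API resource.
type Gateway struct {
	Namespace string
	Name      string
	ClassName string
}

// legacyController is a hypothetical controllerName for the tigera-gateway install.
const legacyController = "example.io/legacy-gateway-controller"

// hasLegacyGateways builds a className -> controllerName map from the declared
// classes and the classes already present in the cluster, then reports whether
// any live Gateway resolves to the legacy controller.
func hasLegacyGateways(declared, existing map[string]string, gateways []Gateway) bool {
	classToController := make(map[string]string, len(declared)+len(existing))
	for class, controller := range declared {
		classToController[class] = controller
	}
	// Assumption: live GatewayClass resources take precedence on conflict.
	for class, controller := range existing {
		classToController[class] = controller
	}
	for _, gw := range gateways {
		// Gateways whose class is unknown are not counted as legacy.
		if classToController[gw.ClassName] == legacyController {
			return true
		}
	}
	return false
}

func main() {
	declared := map[string]string{"user-class": legacyController}
	existing := map[string]string{"tigera-gateway-class-ns": "example.io/namespaced-controller"}
	gateways := []Gateway{{Namespace: "app-ns", Name: "web", ClassName: "tigera-gateway-class-ns"}}
	fmt.Println("legacy install still needed:", hasLegacyGateways(declared, existing, gateways))
}
```

When the function returns false, the commit's teardown path would apply; the deferral during the teardown-then-redeploy race is separate logic not shown here.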
- Cover the calico-system envoy-gateway controller lifecycle, per-namespace resource provisioning and cleanup, custom EnvoyProxy and EnvoyGateway ConfigMap watches, owning-gateway env vars in l7-log-collector, and the legacy-class teardown path
- Teardown sequencing for tigera-gateway cascading

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lico-system

- Render one envoy-gateway controller in calico-system with deploy.type=GatewayNamespace
- Auto-provision tigera-gateway-class; honour user overrides if redeclared in Spec.GatewayClasses
- Enumerate every operator-owned object from the legacy tigera-gateway install for cleanup (pull Secrets before tigera-operator-secrets); keep the Namespace itself in case users placed their own resources there
- Point GatewayAPI finalizer at the calico-system envoy-gateway Deployment
- Drop dual-controller fixtures and the legacy-undeploy test; consolidate FV tests to the calico-system layout
Force-pushed from 0d63b8f to d3ef961.
Upstream envoy-gateway rejects the combination of mergeGateways: true and GatewayNamespaceMode, so any user-supplied EnvoyProxy with merging enabled would cause its referenced Gateways to silently stop being programmed after the switch to GatewayNamespace (https://gateway.envoyproxy.io/docs/tasks/operations/gateway-namespace-mode/).

In the GatewayAPI reconciler, when a Spec.GatewayClasses[].EnvoyProxyRef points at an EnvoyProxy with Spec.MergeGateways == true, force the field to false in our managed copy and log a warning naming the EnvoyProxy and GatewayClass. The user's source CR is not mutated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
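A rough sketch of the "force false in the managed copy, leave the source alone" behaviour follows. The EnvoyProxy type here is a trimmed local stand-in (modelling MergeGateways as a *bool is an assumption), and managedCopy is a hypothetical helper name, not the reconciler's actual function.

```go
package main

import (
	"fmt"
	"log"
)

// EnvoyProxySpec and EnvoyProxy are simplified stand-ins for the upstream types.
type EnvoyProxySpec struct {
	MergeGateways *bool
}

type EnvoyProxy struct {
	Namespace string
	Name      string
	Spec      EnvoyProxySpec
}

// managedCopy (hypothetical) returns a copy of src with MergeGateways forced to
// false when it was true. The pointer field is re-allocated so that writing to
// the copy can never reach back into the user's source object.
func managedCopy(src *EnvoyProxy, gatewayClass string) *EnvoyProxy {
	out := *src
	if src.Spec.MergeGateways != nil {
		v := *src.Spec.MergeGateways
		out.Spec.MergeGateways = &v
	}
	if out.Spec.MergeGateways != nil && *out.Spec.MergeGateways {
		*out.Spec.MergeGateways = false
		log.Printf("EnvoyProxy %s/%s (GatewayClass %s): mergeGateways is not supported with GatewayNamespace mode; forcing false in managed copy",
			src.Namespace, src.Name, gatewayClass)
	}
	return &out
}

func main() {
	merge := true
	user := &EnvoyProxy{Namespace: "default", Name: "custom", Spec: EnvoyProxySpec{MergeGateways: &merge}}
	managed := managedCopy(user, "my-class")
	fmt.Println("source:", *user.Spec.MergeGateways, "managed:", *managed.Spec.MergeGateways)
}
```

The explicit re-allocation matters: a plain struct copy would share the pointer, and the write would silently mutate the user's CR.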
- remove controllerName param (never set by callers)
- inline ReleaseName and GatewayNamespace deploy type
- add DeploymentNamespace constant for the install namespace
- drop now-unused helmGateway type
- parseManifest now errors on kinds it doesn't recognize so a chart bump that emits a new kind trips the existing render tests
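The fail-on-unknown-kind idea can be illustrated with the sketch below. It is not the operator's parseManifest: the function name parseKinds and the knownKinds set are hypothetical, and the line-based kind extraction is a deliberate simplification standing in for real YAML decoding.

```go
package main

import (
	"fmt"
	"strings"
)

// knownKinds is an illustrative allow-list; the real set is whatever the
// renderer knows how to turn into typed objects.
var knownKinds = map[string]bool{
	"Deployment":     true,
	"Service":        true,
	"ServiceAccount": true,
	"ConfigMap":      true,
}

// parseKinds walks a multi-document manifest and returns the kind of each
// document, failing on the first kind it does not recognize. Rejecting unknown
// kinds means a chart bump that emits something new breaks the render tests
// instead of being silently dropped.
func parseKinds(manifest string) ([]string, error) {
	var kinds []string
	for _, doc := range strings.Split(manifest, "\n---\n") {
		kind := ""
		for _, line := range strings.Split(doc, "\n") {
			// Only top-level "kind:" lines count; nested ones are indented.
			if strings.HasPrefix(line, "kind:") {
				kind = strings.TrimSpace(strings.TrimPrefix(line, "kind:"))
				break
			}
		}
		if kind == "" {
			continue // empty document
		}
		if !knownKinds[kind] {
			return nil, fmt.Errorf("unrecognized kind %q in rendered chart", kind)
		}
		kinds = append(kinds, kind)
	}
	return kinds, nil
}

func main() {
	manifest := "kind: Service\nmetadata: {}\n---\nkind: HorizontalPodAutoscaler\nmetadata: {}\n"
	_, err := parseKinds(manifest)
	fmt.Println("error:", err)
}
```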
Bug introduced in this branch; reverts the render and UT to master's behavior.

- drop the render-side auto-provision of tigera-gateway-class
- flip the UT that asserted the buggy output
Under deploy.type=GatewayNamespace (tigera#4690), envoy-proxy pods land in the Gateway's own namespace and mount the operator trust bundle at /etc/pki/tls/certs (added by tigera#4796). The mount references a ConfigMap in the proxy pod's own namespace, but tigera#4796 only writes the ConfigMap into calico-system (the controller's namespace), so the proxy Pod stops at Init:0/2 with:

    Warning FailedMount MountVolume.SetUp failed for volume "tigera-ca-bundle": configmap not found

Mirror the trust bundle into each Gateway namespace alongside the existing per-namespace propagation of tigera-pull-secret and the waf-http-filter SA / RoleBindings. Reuses the existing reserved-NS guard and follows the same delete-before-RoleBinding ordering as the pull-secret cleanup.

Reproduced live on seth-ez-a3b5 2026-05-19 with operator walter-merge-2026-05-18 (has both tigera#4690 and tigera#4796): fresh Gateway namespace -> everything else propagates but tigera-ca-bundle does not, proxy Pod stuck Init:0/2.

Brief: tigera/gateway-extensions-controller/docs/planning/briefs/2026-05-19-ca-bundle-propagation-brief.md
Walter-supplied positive test: configure two Gateway namespaces
("default" and "app-ns") with a TrustedBundle, render, assert the
trust bundle ConfigMap (TrustedCertConfigMapName) lands in each
Gateway namespace.
Companion to the per-NS ConfigMap copy added in the previous commit.
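The namespace selection behind that test can be sketched as follows. This is an illustration only: trustBundleTargets is a hypothetical helper, and the reserved set (calico-system, tigera-operator) is assumed from the reserved-namespace guard mentioned in the PR description.

```go
package main

import (
	"fmt"
	"sort"
)

// reservedNamespaces is assumed from the PR's reserved-namespace guard; the
// core Installation controller owns shared resources there.
var reservedNamespaces = map[string]bool{
	"calico-system":   true,
	"tigera-operator": true,
}

// trustBundleTargets (hypothetical) returns the namespaces that need their own
// copy of the trust bundle ConfigMap: every namespace hosting a Gateway,
// de-duplicated, minus the reserved ones. The result is sorted so the render
// output stays stable between reconciles.
func trustBundleTargets(gatewayNamespaces []string) []string {
	seen := make(map[string]bool, len(gatewayNamespaces))
	var targets []string
	for _, ns := range gatewayNamespaces {
		if reservedNamespaces[ns] || seen[ns] {
			continue
		}
		seen[ns] = true
		targets = append(targets, ns)
	}
	sort.Strings(targets)
	return targets
}

func main() {
	// Two Gateways in "default", one in "app-ns", one in a reserved namespace.
	fmt.Println(trustBundleTargets([]string{"default", "app-ns", "calico-system", "default"}))
}
```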
{
	Action: v3.Allow,
	Protocol: &networkpolicy.TCPProtocol,
	Source: v3.EntityRule{Nets: []string{"0.0.0.0/0"}},
This wouldn't work on IPv6. Should we add "::/0"?
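To illustrate the concern, the standalone check below uses Go's net/netip package to show that an IPv4 catch-all prefix does not match an IPv6 source address. It demonstrates general CIDR semantics only; how Calico evaluates Nets is not modelled here, and allowedBy is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedBy (hypothetical) reports whether addr falls inside any CIDR in nets.
func allowedBy(nets []string, addr string) (bool, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return false, err
	}
	for _, n := range nets {
		prefix, err := netip.ParsePrefix(n)
		if err != nil {
			return false, err
		}
		// Contains never matches across address families.
		if prefix.Contains(ip) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	v4Only := []string{"0.0.0.0/0"}
	dualStack := []string{"0.0.0.0/0", "::/0"}
	for _, nets := range [][]string{v4Only, dualStack} {
		ok, err := allowedBy(nets, "2001:db8::1")
		fmt.Println(nets, "matches IPv6 client:", ok, err)
	}
}
```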
}

// gatewayAPIControllerPolicy allows the controller + certgen to reach kube-apiserver and DNS.
func gatewayAPIControllerPolicy(namespace string, openShift bool) *v3.NetworkPolicy {
The labels on the Deployment are taken from Helm, so if upstream changes them, our network policy may start blocking traffic without us noticing. There also seems to be a mismatch between the Deployment label and the pod label. Could we add our own labels to the Deployment and pod, and use those for the network policy as well? Maybe calico-gateway-controller?
	}
}

// Start watching Gateway resources now that the CRDs are in place, so future
Should we add a check here to disallow Gateways in the tigera-gateway namespace? If users specify this namespace, will there be issues with the cleanup and creation of resources?
- add ::/0 to controller NetworkPolicy ingress for dual-stack / IPv6-only
- switch policy selector to our own k8s-app labels (controller + certgen) so a chart-side label rename can't silently break it
- legacyTeardownObjects now takes the current create-set and skips any candidate it'd otherwise delete (e.g. user Gateway in tigera-gateway)
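The create-set filter in the last bullet can be sketched as below. The ObjectKey type and the function name teardownObjects are hypothetical stand-ins for illustration; the real legacyTeardownObjects works on typed Kubernetes objects.

```go
package main

import "fmt"

// ObjectKey is a hypothetical identity for a rendered object.
type ObjectKey struct {
	Kind      string
	Namespace string
	Name      string
}

// teardownObjects returns the legacy candidates that are safe to delete:
// anything the current reconcile is also going to create is skipped, so the
// operator never deletes and recreates the same object in one pass.
func teardownObjects(candidates, createSet []ObjectKey) []ObjectKey {
	creating := make(map[ObjectKey]struct{}, len(createSet))
	for _, key := range createSet {
		creating[key] = struct{}{}
	}
	var toDelete []ObjectKey
	for _, candidate := range candidates {
		if _, stillWanted := creating[candidate]; stillWanted {
			continue
		}
		toDelete = append(toDelete, candidate)
	}
	return toDelete
}

func main() {
	sa := ObjectKey{Kind: "ServiceAccount", Namespace: "tigera-gateway", Name: "waf-http-filter"}
	deploy := ObjectKey{Kind: "Deployment", Namespace: "tigera-gateway", Name: "envoy-gateway"}
	// A user Gateway lives in tigera-gateway, so the SA there is still being created.
	fmt.Println(teardownObjects([]ObjectKey{sa, deploy}, []ObjectKey{sa}))
}
```

Using the comparable struct as a map key keeps the lookup constant-time without any string formatting.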
Description
Replace the previous Gateway-API install, which ran an Envoy Gateway controller in tigera-gateway and deployed all proxy workloads in that same namespace, with a single envoy-gateway controller in calico-system running with deploy.type=GatewayNamespace, so proxy workloads land in each Gateway's own namespace.

This is a breaking change for clusters running the legacy install. Existing Gateway CRs do not need edits (tigera-gateway-class and its controllerName are preserved), but proxy Pods, their Services and LoadBalancer addresses are recreated in each Gateway's own namespace on first reconcile after upgrade. Anything pinned to tigera-gateway (NetworkPolicies, monitoring, RBAC, external DNS) must follow.

- … calico-system with controllerName=gateway.envoyproxy.io/gatewayclass-controller (chart default) and deploy.type=GatewayNamespace. ControllerName + GatewayClass name are deliberately reused from the legacy install so existing Gateway CRs continue to be claimed.
- … tigera-gateway-class GatewayClass + EnvoyProxy. Users can declare additional classes via GatewayAPI.Spec.GatewayClasses; all classes target the single controller.
- … gateway-helm.tgz and render at runtime via the Helm SDK; result is cached per process via sync.Once. Replaces the previous pre-rendered YAML.
- … waf-http-filter SA + per-namespace waf-http-filter-gateway-resources RoleBinding (least-privilege Gateway-API reads), tigera-operator-secrets RoleBinding, tigera-pull-secret copy. Cluster-scoped perms (licensekeys, tokenreviews) go through a single shared waf-http-filter-gateway-namespaces ClusterRoleBinding whose Subjects list is recomputed each reconcile.
- … (calico-system, tigera-operator): the operator does not create or delete the shared tigera-operator-secrets RoleBinding or tigera-pull-secret copy in those namespaces; the core Installation controller owns them.
- … tigera-operator-secrets RoleBinding. That RoleBinding is what grants the operator Secret-delete perms, so reversing the order yields a 403 and aborts the reconcile.
- … tigera-gateway install: controller Deployment/Service/SAs/ConfigMap, certgen Job + RBAC, namespaced Role/RoleBinding, copied pull Secrets, tigera-operator-secrets RoleBinding, envoy-gateway-topology-injector.tigera-gateway MWC, the orphaned waf-http-filter-cluster-scoped and waf-http-filter-gateway-resources ClusterRoleBindings, and the deprecated combined waf-http-filter ClusterRole/ClusterRoleBinding. Pull Secrets are queued before tigera-operator-secrets. The tigera-gateway Namespace itself is intentionally not queued: users may have placed their own resources in it.
- … (calico-system.envoy-gateway) under the calico-system tier to keep the controller + certgen Job working under default-deny. Selector covers both app.kubernetes.io/name=gateway-helm and app=certgen (the chart applies different labels to the certgen Job vs its pod template). Egress: DNS + kube-apiserver, then Pass. Ingress: 9443 (topology-injector webhook), 18000-18002 (xDS), 19001 (metrics).
- … tigera-gateway Namespace asserted not in the delete list). FV coverage: deploys the controller in calico-system and asserts nothing lands in tigera-gateway, provisions and cleans up per-namespace resources, GatewayClass + EnvoyProxy cleanup, custom EnvoyProxy watch, l7-log-collector owning-gateway env wiring, custom EnvoyGateway ConfigMap.

Security
The Enterprise per-namespace render copies tigera-pull-secret into every namespace that hosts a Gateway, so permissive RBAC on those namespaces can expose the pull secret. Reserved namespaces are excluded from create and delete of the shared resources, so the operator does not clobber core-Installation-owned secrets.

Upgrade / compatibility
In-place upgrade. The single controller in calico-system claims all tigera-gateway-class Gateways unchanged. Proxy Pods, their Services, and LoadBalancer addresses are recreated in each Gateway's own namespace on first reconcile, so any cluster setting pinned to tigera-gateway (NetworkPolicies, monitoring, RBAC, external DNS) must be repointed to the Gateway's own namespace. The legacy controller install is removed automatically; the tigera-gateway Namespace itself is preserved in case it holds user resources.

Calico-private operator RBAC update required: envoy-gateway-topology-injector.calico-system added to the mutatingwebhookconfigurations resourceNames list with update and delete verbs (the legacy .tigera-gateway entry is retained with delete so the upgrade-cleanup can reap it).

Release Note
For PR author
- … make gen-files
- … make gen-versions

For PR reviewers
A note for code reviewers - all pull requests must have the following:
- kind/bug if this is a bugfix.
- kind/enhancement if this is a new feature.
- enterprise if this PR applies to Calico Enterprise only.