Background
PR #638 introduced the --pod-readiness-gate flag (package pkg/readinessgate), which defers certificate issuance until specified pod conditions are met. Evaluating a gate requires reading the current state of the pod that owns the volume.
The initial implementation performs a live client.CoreV1().Pods(ns).Get(...) call on every gate evaluation:
https://github.com/cert-manager/csi-driver/blob/main/pkg/readinessgate/readinessgate.go (look for the TODO comment)
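To make "specified pod conditions are met" concrete, here is a minimal sketch of what a gate check could look like. It is not the code in pkg/readinessgate. `PodCondition` is a hypothetical stand-in for `corev1.PodCondition`, `gatesMet` is an invented name, and the rule that every required condition must have status "True" is an assumption.

```go
package main

import "fmt"

// PodCondition is a hypothetical stand-in for corev1.PodCondition,
// reduced to the two fields a gate check needs.
type PodCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// gatesMet reports whether every required condition type is present with
// status "True". A missing condition counts as not met (assumed semantics).
func gatesMet(conditions []PodCondition, required []string) bool {
	status := make(map[string]string, len(conditions))
	for _, c := range conditions {
		status[c.Type] = c.Status
	}
	for _, want := range required {
		if status[want] != "True" {
			return false
		}
	}
	return true
}

func main() {
	conds := []PodCondition{{Type: "example.com/ready-for-cert", Status: "True"}}
	fmt.Println(gatesMet(conds, []string{"example.com/ready-for-cert"}))
	fmt.Println(gatesMet(conds, []string{"example.com/other"}))
}
```

The check itself is cheap. The cost discussed in this issue comes from fetching the pod's conditions, not from comparing them.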
Problem
csi-lib's renewal loop fires roughly once per second per managed volume. With the current implementation, that translates to one apiserver call per second per pending volume on the node.
On a node hosting many pods that are awaiting their gates:
- The driver's client-go QPS limit (default 5 QPS, 10 burst) gets exhausted quickly.
- Once throttled, gate evaluation slows down, which in turn delays certificate issuance for every pending volume on that node.
- It also adds avoidable load to the apiserver.
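A rough back-of-the-envelope model of the throttling effect, as a sketch only. It assumes each evaluation costs exactly one apiserver call, that the rate limiter spreads calls evenly across volumes, and it ignores the burst allowance (steady state). `evalIntervalSeconds` is a hypothetical helper, not part of the driver.

```go
package main

import "fmt"

// evalIntervalSeconds estimates how often each pending volume gets evaluated
// when every evaluation needs one apiserver call and the client sustains at
// most qps calls per second.
func evalIntervalSeconds(pendingVolumes int, qps float64) float64 {
	demand := float64(pendingVolumes) // one call per second per pending volume
	if demand <= qps {
		return 1 // not throttled: the loop keeps its roughly 1s cadence
	}
	return demand / qps // throttled: calls queue behind the rate limiter
}

func main() {
	for _, n := range []int{3, 5, 50, 200} {
		fmt.Printf("%d pending volumes -> one evaluation every ~%.0fs each\n",
			n, evalIntervalSeconds(n, 5))
	}
}
```

Under these assumptions, 50 pending volumes at the default 5 QPS means each gate is checked about once every 10 seconds instead of once per second.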
As @SgtCoDFish noted in #638 (comment), this is acceptable for an opt-in feature today, but could bite users at scale and should be tracked.
Proposed fix
Replace the live Get with a shared pod informer scoped to the local node via a spec.nodeName field selector. This:
- Eliminates the per-second apiserver call — readiness gate evaluation becomes a cache lookup.
- Bounds memory to pods scheduled on this node only (the driver runs as a DaemonSet, so each instance only ever needs the pods on its own node).
- Sets the informer up only when --pod-readiness-gate is provided, so the default deployment is unaffected.
The local node name is already available to the driver (passed via --node-id / NODE_NAME).
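The sketch below is a stdlib-only model of what a node-scoped informer cache provides. It does not use client-go. With client-go, the equivalent would typically be a shared informer factory (`informers.NewSharedInformerFactoryWithOptions`) with `informers.WithTweakListOptions` setting the `spec.nodeName` field selector, but the exact wiring in the driver is left open here. `Pod`, `nodePodCache` and its methods are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical, minimal stand-in for corev1.Pod.
type Pod struct {
	Namespace  string
	Name       string
	NodeName   string
	Conditions map[string]bool // condition type -> status is "True"
}

// nodePodCache models the store behind a node-scoped informer: watch events
// keep it current, and readers never contact the apiserver.
type nodePodCache struct {
	nodeName string
	mu       sync.RWMutex
	pods     map[string]Pod
}

func newNodePodCache(nodeName string) *nodePodCache {
	return &nodePodCache{nodeName: nodeName, pods: make(map[string]Pod)}
}

func key(ns, name string) string { return ns + "/" + name }

// upsert handles an add or update event. Pods on other nodes are dropped,
// which is the filtering a spec.nodeName field selector would do server-side.
func (c *nodePodCache) upsert(p Pod) {
	if p.NodeName != c.nodeName {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[key(p.Namespace, p.Name)] = p
}

// remove handles a delete event.
func (c *nodePodCache) remove(ns, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pods, key(ns, name))
}

// get is the in-memory lookup that would replace the live Get call.
func (c *nodePodCache) get(ns, name string) (Pod, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.pods[key(ns, name)]
	return p, ok
}

func main() {
	cache := newNodePodCache("node-a")
	cache.upsert(Pod{Namespace: "default", Name: "app", NodeName: "node-a",
		Conditions: map[string]bool{"example.com/ready-for-cert": true}})
	cache.upsert(Pod{Namespace: "default", Name: "elsewhere", NodeName: "node-b"})

	if p, ok := cache.get("default", "app"); ok {
		fmt.Println("gate met:", p.Conditions["example.com/ready-for-cert"])
	}
	_, ok := cache.get("default", "elsewhere")
	fmt.Println("pod on other node cached:", ok)
}
```

In this model, the once-per-second renewal loop only calls `get`, so its cost no longer depends on the client's QPS budget. One thing a real implementation would need to decide is what to do when a pod is not yet in the cache, for example treating the gate as not met and retrying on the next tick.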
Acceptance criteria
- readinessgate.NewReadyToRequestFunc reads pods from an informer cache rather than calling the apiserver on each evaluation.
- The informer is scoped to the local node with a spec.nodeName=<this-node> field selector.
- The informer is only set up when --pod-readiness-gate is set.
- The TODO comment in pkg/readinessgate/readinessgate.go is removed.
Related
- PR #638 (introduced --pod-readiness-gate)