readinessgate: replace live pod Get with a node-scoped informer #646

@patyogesh20

Background

PR #638 introduced the --pod-readiness-gate flag (package pkg/readinessgate), which defers certificate issuance until specified pod conditions are met. Evaluating a gate requires reading the current state of the pod that owns the volume.

The initial implementation performs a live client.CoreV1().Pods(ns).Get(...) call on every gate evaluation:

https://github.com/cert-manager/csi-driver/blob/main/pkg/readinessgate/readinessgate.go (look for the TODO comment)

Problem

csi-lib's renewal loop fires roughly once per second per managed volume. With the current implementation, that translates to one apiserver call per second per pending volume on the node.

On a node hosting many pods that are awaiting their gates:

  • The driver's client-go rate limit (default 5 QPS, 10 burst) is exceeded as soon as more than five volumes are pending, and the burst allowance only absorbs the excess briefly.
  • Once throttled, gate evaluation slows down, which in turn delays certificate issuance for every pending volume on that node.
  • It also adds avoidable load to the apiserver.
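
To put rough numbers on the throttling described above, here is a back-of-the-envelope sketch. It assumes an idealized token bucket (refilled at the QPS rate, capped at the burst size), one Get per second per pending volume, and no other API traffic from the driver. The function names are hypothetical and are not part of the driver.

```go
package main

// steadyStateInterval returns the approximate number of seconds between two
// gate evaluations of the same volume once the client-side limiter is the
// bottleneck. Assumption: each pending volume wants one Get per second and
// the limiter hands out qps tokens per second.
func steadyStateInterval(pendingVolumes int, qps float64) float64 {
	demand := float64(pendingVolumes) // requests per second the loop wants
	if demand <= qps {
		return 1.0 // limiter keeps up; the loop's own ~1s cadence dominates
	}
	return demand / qps
}

// secondsUntilThrottled returns how long a full burst allowance lasts once
// demand exceeds qps. It returns -1 when demand never exhausts the burst.
func secondsUntilThrottled(pendingVolumes int, qps float64, burst int) float64 {
	demand := float64(pendingVolumes)
	if demand <= qps {
		return -1
	}
	return float64(burst) / (demand - qps)
}
```

Under these assumptions, with the default 5 QPS and 10 burst, 50 pending volumes use up the burst in roughly 0.2 seconds, after which each volume's gate is re-evaluated about every 10 seconds instead of every second.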

As @SgtCoDFish noted in #638 (comment), this is acceptable for an opt-in feature today, but could bite users at scale and should be tracked.

Proposed fix

Replace the live Get with a shared pod informer scoped to the local node via a spec.nodeName field selector. This:

  • Eliminates the per-second apiserver call: readiness gate evaluation becomes a cache lookup.
  • Bounds memory to pods scheduled on this node only. The driver runs as a DaemonSet with one driver pod per node, so a node-scoped informer is the right granularity.
  • Sets the informer up only when --pod-readiness-gate is provided, so the default deployment is unaffected.

The local node name is already available to the driver (passed via --node-id / NODE_NAME).
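
The difference between the two strategies can be illustrated with a small stdlib-only model. This is a sketch, not the driver's code: `pod`, `fakeAPIServer`, and `nodeCache` are hypothetical stand-ins, and the model only covers the initial list, not the watch a real informer keeps open to stay current. In client-go, the equivalent wiring would likely go through `informers.NewSharedInformerFactoryWithOptions` with `informers.WithTweakListOptions` setting the `spec.nodeName` field selector.

```go
package main

import "sync"

// pod is a hypothetical, trimmed-down stand-in for corev1.Pod.
type pod struct {
	Namespace, Name, NodeName string
}

// fakeAPIServer counts requests so the two strategies can be compared.
type fakeAPIServer struct {
	mu    sync.Mutex
	pods  []pod
	calls int
}

// Get models the live per-evaluation lookup: one request per call.
func (s *fakeAPIServer) Get(ns, name string) (pod, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.calls++
	for _, p := range s.pods {
		if p.Namespace == ns && p.Name == name {
			return p, true
		}
	}
	return pod{}, false
}

// ListOnNode models a list with a spec.nodeName=<node> field selector:
// filtering happens server-side, so other nodes' pods never reach the caller.
func (s *fakeAPIServer) ListOnNode(node string) []pod {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.calls++
	var out []pod
	for _, p := range s.pods {
		if p.NodeName == node {
			out = append(out, p)
		}
	}
	return out
}

// nodeCache models the informer's local store for a single node.
type nodeCache struct {
	mu   sync.RWMutex
	pods map[string]pod // keyed by "namespace/name"
}

// newNodeCache fills the cache with one node-scoped list request.
func newNodeCache(s *fakeAPIServer, node string) *nodeCache {
	c := &nodeCache{pods: map[string]pod{}}
	for _, p := range s.ListOnNode(node) {
		c.pods[p.Namespace+"/"+p.Name] = p
	}
	return c
}

// Get is a pure in-memory lookup; it never contacts the API server.
func (c *nodeCache) Get(ns, name string) (pod, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	p, ok := c.pods[ns+"/"+name]
	return p, ok
}
```

In this model, ten gate evaluations through `fakeAPIServer.Get` cost ten requests, while ten evaluations through `nodeCache.Get` cost only the single list made when the cache was built, and the cache never holds pods from other nodes.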

Acceptance criteria

  • readinessgate.NewReadyToRequestFunc reads pods from an informer cache rather than calling the apiserver on each evaluation.
  • Informer uses a spec.nodeName=<this-node> field selector.
  • Informer is started only when --pod-readiness-gate is set.
  • Unit tests cover the cache-miss path (pod not yet known to the informer).
  • The existing TODO comment in pkg/readinessgate/readinessgate.go is removed.
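
As a starting point for the cache-miss test, here is a minimal sketch of how the lookup might behave. It rests on an assumption this issue does not settle: a pod the informer has not seen yet is treated as "not ready, retry on the next tick" rather than as an error. All names (`podLister`, `mapLister`, `gatesSatisfied`, `errNotFound`) are hypothetical; `podLister` only mirrors the shape of a client-go pod lister's `Get`, which returns a NotFound error on a miss.

```go
package main

import "errors"

// condition and cachedPod are trimmed-down stand-ins for the pod status
// fields a gate needs; they are not the real corev1 types.
type condition struct {
	Type   string
	Status string
}

type cachedPod struct {
	Conditions []condition
}

// errNotFound stands in for the NotFound error a lister returns on a miss.
var errNotFound = errors.New("pod not found in cache")

// podLister is a hypothetical local interface for cache-backed lookups.
type podLister interface {
	Get(namespace, name string) (*cachedPod, error)
}

// mapLister is a fake lister for tests, keyed by "namespace/name".
type mapLister map[string]*cachedPod

func (m mapLister) Get(ns, name string) (*cachedPod, error) {
	if p, ok := m[ns+"/"+name]; ok {
		return p, nil
	}
	return nil, errNotFound
}

// gatesSatisfied reports whether every required condition has status "True".
// Assumption: a cache miss yields (false, nil) so the caller simply retries
// on its next tick; any other lookup error is returned to the caller.
func gatesSatisfied(l podLister, ns, name string, required []string) (bool, error) {
	p, err := l.Get(ns, name)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	have := make(map[string]bool, len(p.Conditions))
	for _, c := range p.Conditions {
		if c.Status == "True" {
			have[c.Type] = true
		}
	}
	for _, r := range required {
		if !have[r] {
			return false, nil
		}
	}
	return true, nil
}
```

Returning `(false, nil)` on a miss keeps the renewal loop quiet during the short window between pod scheduling and the informer receiving the add event. If the maintainers prefer surfacing the miss, the same test can assert on a returned error instead.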
