What happened?
In multi-replica Pinniped Concierge deployments using the impersonation proxy (`mode: auto` on GKE), non-leader pods permanently fail to load the `impersonation-proxy-serving-cert` TLS certificate. The `DynamicServingCertificateController` retries every 60 seconds but never succeeds. As a result, clients see `remote error: tls: internal error` whenever the LoadBalancer routes to a non-leader pod.
**Important distinction:** there are two separate empty-certificate errors in the logs, and only one is a bug:
| Certificate | Behavior | Bug? |
| --- | --- | --- |
| `concierge-serving-cert` | Transient at startup (<2 seconds), resolves once leader populates cert | No — normal startup race |
| `impersonation-proxy-serving-cert` | Persistent every 60 seconds on non-leader pods, never resolves | Yes — this is the bug |
Non-leader pod logs (persistent, never resolves):

Repeats every 60 seconds indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"key failed with : not loading an empty serving certificate from "impersonation-proxy-serving-cert""}
```

Repeats every ~3 minutes indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"impersonator-config-controller: { } failed with:
[...] write attempt rejected as client is not leader,
failed to update CredentialIssuer status: [...] write attempt rejected as client is not leader"}
```

Client-side error (intermittent, depends on LoadBalancer routing):

```
remote error: tls: internal error
```
Leader pod works correctly — acquires lease, loads certs, handles TokenCredentialRequests.
What did you expect to happen?
Non-leader pods should load the existing `impersonation-proxy-serving-cert` from the Kubernetes Secret (created by the leader) and serve TLS successfully. The `ensureTLSSecretIsCreatedAndLoaded()` function in `impersonator_config.go` already has a read-only path for this case: when the Secret exists, it reads from the informer cache (zero writes) and calls `loadTLSCertFromSecret()`. This path would succeed on non-leader pods, but it is never reached.
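To make the expected behavior concrete, here is a minimal sketch of that read-only path. This is not Pinniped's actual code; the `certProvider` and `secretCache` types and the `"tls.crt"`/`"tls.key"` keys are simplified stand-ins for illustration, but the shape matches the described flow: read the Secret from a local cache and populate the provider, with no API writes at all.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// certProvider mimics the passive dynamiccert.Private provider: it holds
// cert/key bytes and is only ever populated via SetCertKeyContent.
// (Simplified stand-in for illustration; not Pinniped's real type.)
type certProvider struct {
	mu   sync.RWMutex
	cert []byte
	key  []byte
}

func (p *certProvider) SetCertKeyContent(cert, key []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cert, p.key = cert, key
}

func (p *certProvider) HasCert() bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return len(p.cert) > 0
}

// secretCache stands in for the informer cache of Kubernetes Secrets.
type secretCache map[string]map[string][]byte

// loadTLSCertFromSecret sketches the read-only path: if the Secret already
// exists in the cache, copy its cert/key into the provider. No API writes,
// so nothing here requires being the leader.
func loadTLSCertFromSecret(cache secretCache, name string, p *certProvider) error {
	data, ok := cache[name]
	if !ok {
		return errors.New("secret not found: " + name)
	}
	p.SetCertKeyContent(data["tls.crt"], data["tls.key"])
	return nil
}

func main() {
	// Simulate a non-leader pod whose informer cache already holds the
	// Secret that the leader created.
	cache := secretCache{
		"impersonation-proxy-serving-cert": {
			"tls.crt": []byte("PEM CERT"),
			"tls.key": []byte("PEM KEY"),
		},
	}
	p := &certProvider{}
	if err := loadTLSCertFromSecret(cache, "impersonation-proxy-serving-cert", p); err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println("cert loaded:", p.HasCert()) // prints "cert loaded: true"
}
```

Run in isolation, this succeeds on a "non-leader" because it never performs a write, which is exactly why the real read-only path would work if the controller ever got to it.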
What is the simplest way to reproduce this behavior?
- Deploy Pinniped Concierge on GKE with `replicas: 2` and impersonation proxy `mode: auto`
- Wait for leader election to complete
- Check the non-leader pod logs:
  - `"impersonation-proxy-serving-cert"` errors every 60 seconds (persistent, never resolves)
  - `"write attempt rejected as client is not leader"` errors every ~3 minutes
- Run `kubectl` commands repeatedly — some succeed (hit leader), some fail with `tls: internal error` (hit non-leader)
In what environment did you see this bug?
- Pinniped server version: v0.40.0 (also reproduced with v0.44.0)
- Pinniped client version: v0.44.0
- Pinniped container image: `ghcr.io/vmware/pinniped/pinniped-server:v0.40.0` (also with v0.44.0)
- Pinniped configuration: OIDCIdentityProvider (Dex) → Pinniped Supervisor → Concierge JWTAuthenticator. Impersonation proxy mode: `auto`. Service type: `LoadBalancer` (internal GKE).
- Kubernetes version: GKE 1.31.x (managed control plane)
- Cloud provider: Google Cloud (GKE)
What else is there to know about this bug?
Root Cause Analysis
The `doSync()` method in `internal/controller/impersonatorconfig/impersonator_config.go` executes its steps sequentially, and the Service write operations run before the TLS cert-loading step. On non-leader pods, the leader election middleware rejects the Service write with `ErrNotLeader`, causing `doSync()` to return early, so the cert-loading code is never reached.
```
Step 1: ensureImpersonatorIsStarted()      ✅ read-only, works on non-leader
Step 2: ensureLoadBalancerIsStarted()      ❌ WRITES to Service → ErrNotLeader → RETURNS EARLY
Step 3: ensureClusterIPServiceIsStarted()  ⛔ never reached
Step 4: ensureCAAndTLSSecrets()            ⛔ never reached ← cert loading here
        └─ ensureTLSSecretIsCreatedAndLoaded()
           └─ loadTLSCertFromSecret()      ⛔ never reached ← read-only, would succeed!
```
The `dynamiccert.Private` provider (`tlsServingCertDynamicCertProvider`) is entirely passive: no informer, no file watcher, no background goroutine. The only way it gets populated is via `SetCertKeyContent()` inside `loadTLSCertFromSecret()`, which is gated behind the failing write operations.

On subsequent sync cycles, the informer re-enqueues the controller, but each retry hits the same early return: Service update → `ErrNotLeader` → return. The cert-loading code is never reached, no matter how many retries occur.
Workaround
Setting `replicas: 1` on the `pinniped-concierge` Deployment eliminates the issue by ensuring only the leader pod exists. The startup-transient `concierge-serving-cert` errors still occur, but they resolve within seconds once the single pod acquires the lease.