fix(benchmarks): make kafka.version optional in Helm template by k-wall · Pull Request #96 · SamBarker/kroxylicious

k-wall · 2026-05-19T14:30:07Z

Summary

Fixes the kafka.version template rendering to be truly optional, allowing Strimzi to provide a default when the value is not set.

Problem

Commit ed1d43e removed kafka.version from values.yaml to allow Strimzi to default the version. However, the template still unconditionally rendered:

version: {{ .Values.kafka.version }}

When .Values.kafka.version is undefined, this produces version: (empty/null), which Strimzi 1.0.0 rejects with:

spec.kafka.version: Invalid value: "null": spec.kafka.version in body must be of type string: "null"

Solution

Wrap the version field in a Helm conditional:

{{- if .Values.kafka.version }}
version: {{ .Values.kafka.version }}
{{- end }}

When kafka.version is not set: field is omitted entirely, Strimzi uses its default
When kafka.version is set: field is rendered normally

Testing

Verified with helm template:

Without kafka.version:

helm template benchmark helm/kroxylicious-benchmark | grep -A 5 "spec:"

Result: version: field is absent ✓

With kafka.version:

helm template benchmark helm/kroxylicious-benchmark --set kafka.version="4.2.0" | grep -A 5 "spec:"

Result: version: 4.2.0 is rendered ✓

Notes

This issue was discovered when testing PR kroxylicious#3887 on a cluster with Strimzi 1.0.0, which has stricter validation than Strimzi 0.51.0 (the version this PR targets).

🤖 Generated with Claude Code

The kafka.version field was removed from values.yaml in commit ed1d43e to allow Strimzi to default the version, but the template still unconditionally rendered "version: {{ .Values.kafka.version }}", which produced "version: " (empty string/null) when the value was undefined. Strimzi 1.0.0 rejects this with: "spec.kafka.version: Invalid value: \"null\": spec.kafka.version in body must be of type string: \"null\"" Wrap the version field in a conditional so it's only rendered when .Values.kafka.version is set. When omitted, Strimzi uses its default Kafka version as documented in the CRD. Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org>

The kafka.version field was removed from values.yaml in commit ed1d43e to allow Strimzi to default the version, but the template still unconditionally rendered "version: {{ .Values.kafka.version }}", which produced "version: " (empty string/null) when the value was undefined. Strimzi 1.0.0 rejects this with: "spec.kafka.version: Invalid value: \"null\": spec.kafka.version in body must be of type string: \"null\"" Wrap the version field in a conditional so it's only rendered when .Values.kafka.version is set. When omitted, Strimzi uses its default Kafka version as documented in the CRD. Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

…consumer count overrides (kroxylicious#3887) * Add 10KB and 100KB single-topic workload definitions Add workload ConfigMaps for 1topic-10kb and 1topic-100kb to measure how encryption overhead scales with message size. These use random payloads (no payloadFile) since content has no effect on AES-GCM performance. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Parameterise producer and consumer counts in workload definitions producersPerTopic and consumerPerSubscription are now templated from benchmark.producersPerTopic and benchmark.consumerPerSubscription (defaulting to 1). This allows scaling client counts via --set without needing separate workload definitions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): use profile JFR settings and restart async-profiler between rate probes When rate-sweep reuses infrastructure across probes (--skip-teardown), the JFR recording and async-profiler were restarted at the end of each probe ready for the next. Two bugs: - JFR was restarted with settings=default instead of settings=profile. The default profile omits jdk.NetworkUtilization and other I/O events, so all probes after the first recorded near-zero network traffic. - async-profiler was stopped at the end of each probe but never restarted, so all subsequent probes copied the stale flamegraph.html from probe 0. Fix both in the SKIP_TEARDOWN restart block: use settings=profile for JFR and restart async-profiler (when AGENT_LIB is available) so each probe gets an independent flamegraph covering its own benchmark window. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): guard async-profiler restart on ASYNC_PROFILER_FLAGS env var The previous fix restarted async-profiler between rate-sweep probes whenever AGENT_LIB was non-empty. AGENT_LIB (ASYNC_PROFILER_LIB) is set by kroxylicious-start.sh whenever the native library is present in the image, regardless of whether profiling was actually enabled. On clusters where the Unconfined seccomp patch is rejected by PodSecurity policy, probe 0 skips applying the async-profiler deployment patch. The patch is what sets ASYNC_PROFILER_FLAGS in the container env. Checking AGENT_LIB would have caused the restart to attempt profiling on restricted clusters where probe 0 deliberately skipped it. Use ASYNC_PROFILER_FLAGS from the JVM process environment as the guard — it is only present when the deployment patch was successfully applied. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): isolate proxy from Kafka brokers by default Adds a proxy-anti-affinity.yaml SSA patch and wires it into run-benchmark.sh (applied by default) to keep the proxy pod off any node running a Kafka broker pod (matched via strimzi.io/cluster=kafka). Without this, on a 3-worker cluster the proxy inevitably shares a node with a broker. The resulting CPU contention is enough to push the encryption throughput ceiling below its true value, making the proxy appear slower than it really is on a properly provisioned cluster. The patch uses requiredDuringSchedulingIgnoredDuringExecution — if the cluster has no free node (e.g. 3 workers, all hosting a broker) the proxy will not schedule and the benchmark will fail with a clear error rather than silently produce misleading results. --no-isolate-proxy opts out with a prominent warning, for clusters where strict isolation is not possible and indicative results are acceptable. The flag is passed through from rate-sweep.sh to run-benchmark.sh. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * refactor(benchmarks): rename --no-isolate-proxy to --skip-proxy-isolation Aligns with the existing --skip-deploy / --skip-teardown naming convention. Also strengthens the warning to note that resource contention makes results hard to compare across runs with and without isolation active. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): clarify --skip-proxy-isolation warning The risk is unpredictable, run-to-run variation — not stable suppression (which would still be comparable). Update the warning to reflect this. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): add per-probe producer/consumer count overrides Adds --producers-per-topic and --consumers-per-subscription flags to run-benchmark.sh, wired through to new Helm values (omb.coordinatorProducersPerTopic, omb.coordinatorConsumerPerSubscription) which are injected via sed into the workload YAML at job creation time, mirroring the existing coordinatorProducerRate pattern. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): add connection-sweep.sh script Sweeps the number of producer connections to a single proxy pod while holding per-producer rate fixed (e.g. 1,2,4,8,16 producers at 30k msg/sec each). Consumers are scaled to match producers at each step. Deploys infrastructure once then calls run-benchmark.sh per step with --skip-deploy/--skip-teardown, matching the rate-sweep.sh pattern. Results land in <output-dir>/<scenario>/producers-<n>/result.json. Includes a connection-sweep-values.yaml profile (5 min warmup + 5 min test, workerReplicas=6 for headroom at high producer counts) and a dry-run mode that prints the step sequence without deploying. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * refactor(benchmarks): re-apply workload ConfigMap per-probe instead of sed Replaces the env-var+sed approach for injecting per-probe overrides (producerRate, producersPerTopic, consumerPerSubscription) with a cleaner split: apply_workload_configmap() re-renders and applies just the workload ConfigMap via helm template --show-only before every benchmark run, regardless of --skip-deploy. This means --skip-deploy now truly skips only infrastructure (Kafka, Strimzi, proxy, OMB workers) while workload config is always kept in sync with the current probe's parameters. Both rate-sweep and connection-sweep benefit automatically without any extra plumbing. Removes the coordinator* override fields from values.yaml and the corresponding env vars and sed replacements from the Job template. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): restart operator after Kafka is ready to fix CrashLoopBackOff The Kroxylicious operator is installed at cluster provision time, before Strimzi is deployed. Its Kafka informer fails immediately (kafka.strimzi.io/v1beta2/kafkas not yet served) and JOSDK exits the process. By benchmark time the operator may be in CrashLoopBackOff with a backoff delay of up to 5 minutes, leaving the proxy pod absent and the benchmark coordinator failing to connect. Fix: after Strimzi CRDs are available (Kafka cluster Ready), force a rollout restart of kroxylicious-operator. This bypasses the exponential backoff and gives the operator a clean start with Strimzi already serving the Kafka CRDs. Also adds an explicit wait for KafkaProxy Ready after the restart so the proxy pod is guaranteed to exist before JFR setup runs. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Revert "fix(benchmarks): restart operator after Kafka is ready to fix CrashLoopBackOff" This reverts commit 666f9e5. Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): use version query param for Strimzi install URL https://strimzi.io/install/<version>?namespace=<ns> returns 404 for newer versions; the correct form is https://strimzi.io/install/latest?namespace=<ns>&version=<version>. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): fall back to curl for operator tarball download when gh absent gh CLI may not be available on all cluster nodes. Use curl with the GitHub releases URL as a fallback. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): install Strimzi from GitHub releases, not strimzi.io/install https://strimzi.io/install/latest?version=X ignores the version parameter and always serves the latest (1.0.0) manifests, including CRDs. This caused 0.51.0 installation to receive 1.0.0 CRDs that only serve v1, breaking the Kroxylicious operator which expects v1beta2. Switch to downloading directly from the GitHub release: - strimzi-crds-<version>.yaml for CRDs (cluster-scoped, no substitution) - strimzi-cluster-operator-<version>.yaml for the operator, with 'myproject' → NAMESPACE substitution via sed Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): raise worker CPU limit in connection-sweep profile OMB assigns ~1/3 of workers to producers and the rest to consumers. With 6 workers and 8 producers, each of the 2 producer workers handles ~4 producers at 10k msg/s = 40k msg/s per JVM — enough to saturate the previous 1000m limit before the proxy CPU becomes the bottleneck. Raise the limit to 2000m so workers have clear headroom at the highest producer counts in the sweep. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): switch results-reader pod image to Chainguard busybox Docker Hub rate limits were causing the omb-results-reader pod to fail ImagePullBackOff on clusters with many unauthenticated pulls. Chainguard's busybox is equivalent in size (3.8 MB compressed) with no pull rate limits, and is pinned by digest for reproducibility. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): add phase timestamps and workload parameters to run-metadata.json ProbeContext now carries benchmark phase timing (warmupDurationMinutes, testDurationMinutes, benchmarkStartedAt, benchmarkCompletedAt) and workload shape (topics, partitionsPerTopic, messageSize, producersPerTopic, consumerPerSubscription). These fields make each probe's metadata self-describing without needing to open result.json, and allow analysis tools to filter CPU metrics to the measurement window rather than using a fixed snapshot-skip count. New factory methods ProbeContext.of() and withPhases()/withWorkload() keep call sites clean as the record grows. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): wire phase timestamps and workload params through to run-metadata.json run-benchmark.sh now records benchmarkStartedAt/benchmarkCompletedAt around the benchmark poll loop, reads warmupDurationMinutes/testDurationMinutes from the live Helm release, and reads topics/partitions/messageSize/producersPerTopic/ consumerPerSubscription from the collected result.json. All values are passed to collect-results.sh (full deploy path) and the inline jbang call (--skip-deploy path). collect-results.sh forwards the new flags to the jbang CLI. analyze-cpu-coefficient.py now filters CPU snapshots to the measurement phase using the phase timestamps from run-metadata.json, falling back to --skip-first when timestamps are absent (e.g. older result directories). Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): replace JFR PVC with emptyDir to fix scheduling deadlock The JFR PVC used a local PersistentVolume which bound to a specific node. When that node also hosted a Kafka broker pod, the proxy anti-affinity rule prevented the proxy from scheduling there, creating an unresolvable deadlock (0/N nodes available: M nodes failed PV affinity, 1 node failed anti-affinity). Switch to emptyDir: no node affinity constraint, so the proxy schedules freely. JFR collection already copies the recording from the live pod via kubectl exec before teardown, so the PVC's "survive pod termination" property was never actually needed in practice. Removes: proxy-jfr-pvc.yaml patch, JFR PVC creation/deletion in run-benchmark.sh, jfr-collect fallback pod in collect-results.sh. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): capture proxy resource config in run-metadata.json Query the proxy pod and KafkaProxy CR at metadata collection time to record cpuRequest, cpuLimit, memoryRequest, memoryLimit, and replicas in a proxyConfig block. This avoids having to thread resource values through script flags and gives a self-verifying record of what was actually deployed for each sweep probe. collect-results.sh passes PROXY_POD and NAMESPACE to the JBang CLI when the pod name is set; baseline runs (no proxy pod) omit the block. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): scale publishRate by topic count in sweep summaries OMB reports publishRate per-topic in result.json. Multi-topic workloads (e.g. 10topics-1kb) were compared against the aggregate target rate, flagging every probe as saturated. Read topic count from run-metadata.json and multiply before the saturation check so the summary is correct for both 1-topic and multi-topic workloads. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): add useRandomizedPayloads to 10kb and 100kb workloads Without payloadFile or useRandomizedPayloads, OMB's FilePayloadReader NPEs on a null file path. Random payloads are appropriate for encryption benchmarks since AES-GCM processes arbitrary bytes. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): set randomizedPayloadPoolSize for 10kb and 100kb workloads useRandomizedPayloads requires a non-zero randomizedPayloadPoolSize; with the default of 0 OMB crashes at startLoad trying to pick from an empty payload pool. Also set randomBytesRatio: 1.0 so payloads are fully random. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * feat(benchmarks): add JFR CPULoad fallback to analyze-cpu-coefficient.py When process_cpu_usage is absent from proxy-metrics.txt (not all deployments expose this Micrometer metric), the script now falls back to jdk.CPULoad events from the JFR recording. JFR CPULoad reports jvmUser + jvmSystem as a percentage of all machine CPUs. The node CPU count is read from clusterNodes.cpuPerNode in run-metadata.json and used to normalise the value to a fraction of one CPU, keeping the same scale as process_cpu_usage. Also: multi-topic achieved rates are now correctly scaled by the topics count from run-metadata.json, and the CPU source and absolute mc value are shown in the output table. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): avoid version strings that rot - Remove `latest` tag from busybox image reference in run-benchmark.sh; the SHA digest already pins the image content unambiguously. - Resolve STRIMZI_VERSION=latest dynamically in setup-cluster.sh by following the GitHub releases/latest redirect; the literal string 'latest' has no corresponding release artifact path and causes a 404. - Drop kafka.version from values.yaml; Strimzi defaults the Kafka version and an explicit pin will lag behind Strimzi upgrades. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): add blank line after license header in analyze-cpu-coefficient.py Satisfies the license-maven-plugin check which requires a blank line between the license block and the module docstring. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): make kafka.version optional in Helm template (#96) The kafka.version field was removed from values.yaml in commit ed1d43e to allow Strimzi to default the version, but the template still unconditionally rendered "version: {{ .Values.kafka.version }}", which produced "version: " (empty string/null) when the value was undefined. Strimzi 1.0.0 rejects this with: "spec.kafka.version: Invalid value: \"null\": spec.kafka.version in body must be of type string: \"null\"" Wrap the version field in a conditional so it's only rendered when .Values.kafka.version is set. When omitted, Strimzi uses its default Kafka version as documented in the CRD. Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Keith Wall <kwall@apache.org> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * fix(benchmarks): qualify Vault image with docker.io and document cluster overrides Unqualified image references (hashicorp/vault) are redirected to Red Hat's registry on some OpenShift clusters, causing ErrImagePull. Prefix with docker.io/ so the pull target is unambiguous. Add a vault image override example to the existing cluster-specific settings section in QUICKSTART.md so users know how to substitute an alternative image via --cluster-overrides without modifying scenario files. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * refactor(benchmarks): reduce cognitive complexity in RunMetadata Replace repeated null-guard if-blocks with a putIfPresent helper, collapsing 13 checks in generate() and 4 in proxyInfo() to single-line calls. Behaviour is unchanged. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * refactor(benchmarks): extract proxy container traversal to reduce proxyInfo complexity Extract the for-loop that finds the proxy container and reads its resource limits into a separate helper, removing nested branching from proxyInfo. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk> Signed-off-by: Keith Wall <kwall@apache.org> Co-authored-by: Keith Wall <kwall@redhat.com>

SamBarker merged commit 27c8153 into SamBarker:feat/omb-connection-sweep May 19, 2026
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(benchmarks): make kafka.version optional in Helm template#96

fix(benchmarks): make kafka.version optional in Helm template#96
SamBarker merged 1 commit into
SamBarker:feat/omb-connection-sweepfrom
k-wall:fix/kafka-version-template

k-wall commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

k-wall commented May 19, 2026

Summary

Problem

Solution

Testing

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants