docs: add benchmarking blog posts and performance reference page #254

SamBarker wants to merge 7 commits into
Conversation
Covers methodology, test environment, passthrough proxy results, encryption latency and throughput ceiling, the per-connection scaling insight, and sizing guidance. Includes a TODO placeholder for the connection sweep results before publication.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Covers why we chose OMB over Kafka's own tools, the benchmark harness we built (Helm chart, orchestration scripts, JBang result processors), workload design rationale, CPU flamegraphs with embedded interactive iframes, the per-connection ceiling discovery, bugs found in our own tooling, and the cluster recovery incident.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Adds /performance/ as a dedicated quick-reference page with headline benchmark numbers, comparison tables, and sizing guidance, linked from both blog posts. Updates the existing Performance section in overview.markdown with the key headline numbers and a link to the full reference page.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
> | Kroxylicious proxy | 1.4% |
> | GC | 0.1% |
> The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4%. It really is a TCP relay with protocol awareness.
I wonder how much that's down to the decode predicate thing -- basically we know the filter chain, and what each filter in it wants to intercept, and I think we avoid doing the request/response decoding when we know nothing is interested. That was code that was in there from the beginning, but I don't actually know how relevant it is -- maybe some of the internal filters mean we're decoding requests and response always, in which case 1.4% is impressive. Or maybe we're acting more like a L4 proxy most of the time, in which case 1.4% is not quite as impressive.
Great question — this is actually a stronger story than the original prose suggested. The default infrastructure filters (BrokerAddressFilter, TopicNameCacheFilter, ApiVersionsIntersect) are doing genuine L7 work: metadata, FindCoordinator, and API version exchanges are fully decoded for address rewriting and version negotiation. But the high-volume produce/consume traffic hits the decode predicate and passes through without full deserialisation. So the proxy is selectively L7 — real protocol awareness where it needs it, L4-like passthrough on the hot path. The 1.4% is the cost of that design, and it validates it. Updating the prose to make this explicit.
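The selective-L7 behaviour described here can be sketched in a few lines. This is an illustrative sketch of the decode-predicate idea under stated assumptions, not the actual Kroxylicious API: the `Filter`, `ApiKey`, and method names are invented for the example.

```java
// Hypothetical sketch of a decode predicate: the proxy only pays for full
// decoding of a frame when some filter in the chain has declared interest
// in that request type. Names are illustrative, not real Kroxylicious code.
import java.util.List;
import java.util.Set;

public class DecodePredicateSketch {

    enum ApiKey { PRODUCE, FETCH, METADATA, FIND_COORDINATOR, API_VERSIONS }

    /** A filter declares up front which request types it needs decoded. */
    interface Filter {
        Set<ApiKey> interestedIn();
    }

    /** True if any filter in the chain wants this frame decoded. */
    static boolean shouldDecode(List<Filter> chain, ApiKey key) {
        return chain.stream().anyMatch(f -> f.interestedIn().contains(key));
    }

    public static void main(String[] args) {
        // An address-rewriting filter only cares about metadata-style requests.
        Filter brokerAddressFilter =
                () -> Set.of(ApiKey.METADATA, ApiKey.FIND_COORDINATOR);
        List<Filter> chain = List.of(brokerAddressFilter);

        // Metadata is decoded (L7 work); produce passes through as raw bytes.
        System.out.println("METADATA decoded: " + shouldDecode(chain, ApiKey.METADATA)); // true
        System.out.println("PRODUCE decoded:  " + shouldDecode(chain, ApiKey.PRODUCE));  // false
    }
}
```

With no filter interested in `PRODUCE` or `FETCH`, the hot path stays L4-like, which is consistent with the small proxy CPU share in the flamegraph.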
> The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too:
> - **Buffer management (+5.8%)**: encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying
Did we ever figure out how to reuse the buffers more? I think that was a TODO at one point.
Correct — the TODO was never addressed. A BufferPool class existed at one point but was deleted as unused in early 2024. Cipher instances are still created fresh per operation. These remain genuine open optimisation opportunities.
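For illustration, a minimal sketch of what buffer reuse could look like. This is hypothetical code, not the deleted Kroxylicious `BufferPool` class: the pool simply hands out cleared `ByteBuffer`s instead of allocating a fresh one per record.

```java
// Sketch of the buffer-reuse idea that remained a TODO: a trivial pool that
// recycles ByteBuffers to cut per-record allocation and copying.
// Illustrative only; names and sizing are assumptions.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public class BufferPoolSketch {

    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    BufferPoolSketch(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Reuses a pooled buffer when one is available, otherwise allocates. */
    synchronized ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        if (b == null) {
            return ByteBuffer.allocate(bufferSize);
        }
        b.clear(); // reset position/limit before handing it back out
        return b;
    }

    /** Returns a buffer to the pool for later reuse. */
    synchronized void release(ByteBuffer b) {
        free.push(b);
    }

    public static void main(String[] args) {
        BufferPoolSketch pool = new BufferPoolSketch(64 * 1024);
        ByteBuffer first = pool.acquire();
        pool.release(first);
        // The second acquire reuses the same buffer: no new allocation.
        System.out.println(pool.acquire() == first); // true
    }
}
```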
…aming

- Shift publication dates to May 21 and May 28
- Replace speculative per-connection ceiling explanation with empirical finding: encryption throughput ceiling scales linearly with CPU budget (validated at 1000m, 2000m, 4000m)
- Add sizing formula: CPU (mc) = 20 × produce_MB_per_s, with worked example
- Add RF=3 masking caveat: initial 1-topic sweeps conflated Kafka replication ceiling with proxy CPU ceiling; coefficient derived from RF=1 multi-topic workloads
- Post 2: add full investigation narrative — workload isolation approach, coefficient derivation, 4-core confirmation, and 2-core prediction/validation
- Drop stale "future work" items that are now complete

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The proxy is selectively L7: default infrastructure filters do genuine Kafka protocol work (address rewriting, API version negotiation, metadata caching) while high-volume produce/consume traffic bypasses full deserialisation via the decode predicate. The 1.4% proxy CPU share validates this design, not just reflects it. Also drop the Fyre cluster upgrade section — OCP-internal incident with no relevance to readers.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Warm up test environment intro: realistic deployment framing
- Add conversational lead-in to sizing guidance in both documents
- Improve caveats opener in Post 1
- Add caveats section to performance page (RF=3 masking, message size, horizontal scaling)

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
PaulRMellor left a comment
I read through the first post. Reads well overall, and the tone is nice and approachable. I left a few suggestions, particularly in places where the AI-assisted wording feels a bit noticeable.
> categories: benchmarking performance
> ---
> All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
Suggested change:

> All good benchmarking stories start with a hunch. I was confident Kroxylicious was cheap to run — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
> There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.
Suggested change:

> There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just another way of asking: "is this thing going to slow down my Kafka?" We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true, but not especially useful.
> So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.
Suggested change:

> So instead of saying "it depends", we built something measurable you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.
"got off the fence" might be a bit colloquial for non-native speakers
> ## Test environment
> No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
Suggested change:

> We ran the benchmarks on a realistic deployment rather than a local development machine: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform, providing a controlled test environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
> | E2E latency p99 | 499.00 ms | 499.00 ms | 0 |
> | Publish rate | 500 msg/s | 500 msg/s | 0 |
> **The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.**
Suggested change:

> **The headline: ~0.2 ms additional average publish latency. Measured throughput was unaffected.**
> ## Record encryption: now we're doing real work
> Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).
Suggested change:

> [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) is a more representative workload because the proxy actively processes each record. Record encryption uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).
Is "understand" the right word here: "push the proxy to understand each record it receives"? Would "parse" suffice?
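As context for the per-record cost under discussion, here is a minimal, self-contained sketch of the AES-256-GCM step using the plain JDK crypto API. It is illustrative only, not Kroxylicious's actual record-encryption implementation; class and method names are invented for the example.

```java
// Minimal AES-256-GCM round trip with the standard JDK API, to make the
// per-record unpack/encrypt/repack cost concrete. Illustrative sketch only.
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesGcmRoundTrip {

    private static final int TAG_BITS = 128; // GCM authentication tag length
    private static final int IV_BYTES = 12;  // standard GCM nonce size

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return c.doFinal(plaintext);
    }

    static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return c.doFinal(ciphertext); // throws AEADBadTagException on tampering
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256); // AES-256
        SecretKey key = gen.generateKey();

        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a fresh IV per record is mandatory for GCM

        byte[] record = "record payload".getBytes(StandardCharsets.UTF_8);
        byte[] wire = encrypt(key, iv, record);

        // The repacked record grows by the 16-byte auth tag.
        System.out.println(wire.length - record.length); // 16
        System.out.println(new String(decrypt(key, iv, wire), StandardCharsets.UTF_8)); // record payload
    }
}
```

The round trip also shows why the indirect costs exist: the plaintext and ciphertext are separate byte arrays, so each record is copied at least once in each direction.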
> ### Latency at sub-saturation rates
> A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
Suggested change:

> A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Average latency can hide tail latency effects; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
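The percentile definition in this passage can be made concrete with a small nearest-rank calculation. Illustrative code, not the post's JBang result processors:

```java
// Nearest-rank percentile: the value that p% of samples complete within.
// Shows how the average can hide tail latency that the p99 exposes.
import java.util.Arrays;

public class PercentileSketch {

    static double percentile(double[] latenciesMs, double p) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 100 samples: 98 fast requests and two slow stragglers.
        double[] latencies = new double[100];
        Arrays.fill(latencies, 8.0);
        latencies[10] = 499.0;
        latencies[42] = 499.0;

        // The average barely moves, but the p99 exposes the stragglers.
        System.out.println(Arrays.stream(latencies).average().orElse(0)); // 17.82
        System.out.println(percentile(latencies, 99.0));                  // 499.0
        System.out.println(percentile(latencies, 50.0));                  // 8.0
    }
}
```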
> Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
> 2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
Suggested change:

> 2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — providing sufficient headroom keeps the latency overhead relatively small.
> | Rate | Metric | Baseline | Encryption | Delta |
> |------|--------|----------|------------|-------|
> | 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) |
We have a mixture of msg/s and msg/sec in the post; we should be consistent.
> ## Caveats and next steps
> These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck:
Suggested change:

> These are real results from real hardware, but they do not necessarily reflect your workload characteristics. A few things worth knowing before you put these numbers in a slide deck:
Summary
- /performance/ reference page summarising key numbers and linking to both posts
- overview.markdown updated with headline performance figures and a link to the reference page

Status
Draft — the posts are first drafts. Known open items:
Test plan
- Run ./run.sh and verify the site renders at http://127.0.0.1:4000/
- Verify the /performance/ page renders with correct tables
- Verify links to /performance/ work

🤖 Generated with Claude Code