Fair managed multi-seg: reuse a buffer instead of the ToArray API #41
Merged
The managed multi-seg bench called TryExtractFullHeaderValidated, which calls input.ToArray() on every call (a heap allocation), while the FFI multi-seg bench reused a buffer (seq.CopyTo into a once-allocated array). That made the bindings look ~2x faster on multi-seg when the difference was allocation strategy, not parse speed.

Now the managed multi-seg bench also linearizes into the reused buffer (seq.CopyTo + ROM parse), so every multi-seg path is reused-buffer linearize + parse. Result: multi-seg costs contiguous + a memcpy on every path, and the native-vs-managed gap matches the contiguous case. Managed multi-seg at 32KB drops from 9262 ns to 5606 ns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
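The two linearization strategies compared above can be sketched roughly as follows. This is a minimal illustration, not the repository's benchmark code: `ParseContiguous` is a hypothetical stand-in for the real parser (it only sums bytes), and `Segment`, `LinearizeSketch`, and `BuildTwoSegmentSequence` are names invented for the example. Only `ReadOnlySequence<byte>`, `ToArray`, and `CopyTo` come from `System.Buffers`.

```csharp
using System;
using System.Buffers;

// Hypothetical helper for building a multi-segment sequence in the example.
public sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(ReadOnlyMemory<byte> memory) { Memory = memory; }

    public Segment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new Segment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

public static class LinearizeSketch
{
    // Stand-in for the real contiguous (ROM) parser: sums the bytes it is given.
    public static int ParseContiguous(ReadOnlyMemory<byte> input)
    {
        int sum = 0;
        foreach (byte b in input.Span) sum += b;
        return sum;
    }

    // One-shot strategy: ToArray() allocates a fresh array on every call.
    public static int ParseWithToArray(in ReadOnlySequence<byte> seq)
        => ParseContiguous(seq.ToArray());

    // Reused-buffer strategy: copy into a caller-owned array allocated once.
    public static int ParseWithReusedBuffer(in ReadOnlySequence<byte> seq, byte[] scratch)
    {
        int length = checked((int)seq.Length);
        seq.CopyTo(scratch.AsSpan(0, length)); // throws if scratch is too small
        return ParseContiguous(new ReadOnlyMemory<byte>(scratch, 0, length));
    }

    // Builds a two-segment sequence holding the bytes 1, 2, 3 | 4, 5.
    public static ReadOnlySequence<byte> BuildTwoSegmentSequence()
    {
        var first = new Segment(new byte[] { 1, 2, 3 });
        var last = first.Append(new byte[] { 4, 5 });
        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }
}
```

Both methods pay for the copy; only `ParseWithToArray` also pays for a per-call allocation, which is the difference the benchmark was accidentally measuring.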
Benchmark
| Benchmark suite | Current: 62d6f9f | Previous: d3a06f0 | Ratio |
|---|---|---|---|
| `Benchmarks.FlexibleParserBenchmark.Small_ROM` | 139.72749559084573 ns (± 0.6902930401856936) | 139.20921897888184 ns (± 0.3414138670448296) | 1.00 |
| `Benchmarks.FlexibleParserBenchmark.Small_MultiSegment` | 354.3225266138713 ns (± 0.8942941863343116) | 349.4604838689168 ns (± 2.860363450836272) | 1.01 |
| `Benchmarks.FlexibleParserBenchmark.Header4K_ROM` | 694.9425118764242 ns (± 2.3547129820667396) | 708.5051829020182 ns (± 2.487052061932313) | 0.98 |
| `Benchmarks.FlexibleParserBenchmark.Header4K_MultiSegment` | 1778.321886698405 ns (± 17.50093798247063) | 1826.123291015625 ns (± 16.460001723138515) | 0.97 |
| `Benchmarks.FlexibleParserBenchmark.Header32K_ROM` | 4831.17500559489 ns (± 26.374917042370054) | 4949.9634958903 ns (± 10.64757006352856) | 0.98 |
| `Benchmarks.FlexibleParserBenchmark.Header32K_MultiSegment` | 12079.127970377604 ns (± 74.47769506531404) | 12010.515991210938 ns (± 66.48393176840845) | 1.01 |
| `Benchmarks.UltraHardenedParserBenchmark.Small_ROM` | 252.49250237147012 ns (± 0.4901765172340976) | 252.88829962412515 ns (± 1.7230360417772392) | 1.00 |
| `Benchmarks.UltraHardenedParserBenchmark.Small_MultiSegment` | 536.6848204930624 ns (± 0.5783545803477097) | 559.2204907735189 ns (± 4.167496342803848) | 0.96 |
| `Benchmarks.UltraHardenedParserBenchmark.Header4K_ROM` | 1135.0538514455159 ns (± 2.1511206126767997) | 1118.6354840596516 ns (± 0.7807947528732518) | 1.01 |
| `Benchmarks.UltraHardenedParserBenchmark.Header4K_MultiSegment` | 2202.86643854777 ns (± 20.347785007036975) | 2225.3782081604004 ns (± 17.533121532742204) | 0.99 |
| `Benchmarks.UltraHardenedParserBenchmark.Header32K_ROM` | 7217.457476298015 ns (± 7.737945397352009) | 7139.710075378418 ns (± 18.12358659190381) | 1.01 |
| `Benchmarks.UltraHardenedParserBenchmark.Header32K_MultiSegment` | 15316.31938680013 ns (± 67.76101236052526) | 15398.131754557291 ns (± 158.60331507605568) | 0.99 |

| Benchmark suite (allocated memory) | Current: 62d6f9f | Previous: d3a06f0 | Ratio |
|---|---|---|---|
| `Benchmarks.FlexibleParserBenchmark.Small_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.FlexibleParserBenchmark.Small_MultiSegment.Allocated` | 112 B (± 0) | 112 B (± 0) | 1 |
| `Benchmarks.FlexibleParserBenchmark.Header4K_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.FlexibleParserBenchmark.Header4K_MultiSegment.Allocated` | 4128 B (± 0) | 4128 B (± 0) | 1 |
| `Benchmarks.FlexibleParserBenchmark.Header32K_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.FlexibleParserBenchmark.Header32K_MultiSegment.Allocated` | 32800 B (± 0) | 32800 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Small_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Small_MultiSegment.Allocated` | 128 B (± 0) | 128 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Header4K_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Header4K_MultiSegment.Allocated` | 4128 B (± 0) | 4128 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Header32K_ROM.Allocated` | 0 B (± 0) | 0 B (± 0) | 1 |
| `Benchmarks.UltraHardenedParserBenchmark.Header32K_MultiSegment.Allocated` | 32800 B (± 0) | 32800 B (± 0) | 1 |

This comment was automatically generated by workflow using github-action-benchmark.
Reverts the earlier 'fair' swap. Since the C core is single-slab, the binding must linearize, so that copy is the binding's real cost (counted: reused-buffer CopyTo). By the same logic the managed column must show its own real linearization, i.e. TryExtractFullHeaderValidated, which calls input.ToArray() on every request, not a hand-rolled reused buffer. So multi-seg now reflects what each path actually does: managed allocates the linearization buffer per request (~9200 ns @ 32KB), while the bindings reuse one (~4500 ns). The copy is in both; the ~2x gap is the per-request allocation that the single-slab binding avoids (a managed caller can match it by hand-rolling CopyTo+ROM).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… buffer strategy)

The C# FFI does linearize on multi-segment input; both paths do. The bug was that the managed column used the one-shot API (ToArray) while the FFI used a reused buffer, smuggling the linearization strategy into a parser comparison. Now every multi-seg path does CopyTo/memcpy into a reused buffer + parse, so the copy is counted identically and the column reflects the parser: multi-seg = contiguous + a memcpy for all, with native ~1.2x ahead in both modes. The TryExtractFullHeaderValidated ToArray-per-request cost (~9.2 µs vs ~5.4 µs @ 32KB) is now a footnote, not a confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the prose README with verified, copy-pasteable request-header-parsing examples for the C# library (UltraHardenedParser), the .NET binding (Glyph11Parser, zero-alloc caller storage), and the Kotlin binding (Glyph11.parse). To make the Kotlin example real, the binding now surfaces parsed headers/query as List<Glyph11Field> (name/value spans) instead of only a count. All three examples were compiled and run against the real libraries; the Kotlin smoke test now also asserts a header name/value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>