Fair managed multi-seg: reuse a buffer instead of the ToArray API #41

Merged: MDA2AV merged 4 commits into main from feat/cross-bench on Jun 4, 2026

@MDA2AV MDA2AV commented Jun 4, 2026

The managed multi-seg bench called TryExtractFullHeaderValidated, which calls input.ToArray() on every call (a heap allocation), while the FFI multi-seg bench reused a buffer (seq.CopyTo into a once-allocated array). That made the bindings look ~2x faster on multi-seg when the difference was allocation strategy, not parse speed. Now the managed multi-seg bench also linearizes into the reused buffer (seq.CopyTo + ROM parse), so every multi-seg path is reused-buffer linearize + parse. Result: multi-seg = contiguous + a memcpy for all implementations, and the native-vs-managed gap matches the contiguous case. Managed multi-seg at 32KB drops from 9262 ns to 5606 ns.
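For readers who want to see the two linearization strategies side by side, here is a minimal sketch. It is not the benchmark code from this PR: `Segment`, `Linearize`, `Split`, `PerCall`, and `Reused` are hypothetical names introduced for illustration, and the parser call itself is left out. Only `ReadOnlySequence<byte>`, `ToArray()`, and `CopyTo(Span<byte>)` are real `System.Buffers` APIs.

```csharp
using System;
using System.Buffers;

// Hypothetical helper: one link in a multi-segment ReadOnlySequence<byte>.
sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(ReadOnlyMemory<byte> memory) => Memory = memory;

    // Appends a new segment after this one and returns it.
    public Segment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new Segment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

// Hypothetical illustration of the two strategies compared in this PR.
static class Linearize
{
    // Splits data into two segments so the sequence is genuinely multi-segment.
    public static ReadOnlySequence<byte> Split(byte[] data, int at)
    {
        var first = new Segment(data.AsMemory(0, at));
        var last = first.Append(data.AsMemory(at));
        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }

    // One-shot style: ToArray() allocates a fresh array on every call.
    public static ReadOnlyMemory<byte> PerCall(in ReadOnlySequence<byte> seq)
        => seq.ToArray();

    // Reused-buffer style: copy into caller-owned scratch, no new array.
    // Throws if scratch is smaller than the sequence.
    public static ReadOnlyMemory<byte> Reused(in ReadOnlySequence<byte> seq, byte[] scratch)
    {
        int length = checked((int)seq.Length);
        seq.CopyTo(scratch.AsSpan(0, length));
        return scratch.AsMemory(0, length);
    }
}
```

Both methods hand the parser the same contiguous bytes. The only difference is who owns the destination array, which is the variable the original managed column was unintentionally measuring.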

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@github-actions github-actions Bot left a comment

Benchmark

| Benchmark suite | Current: 62d6f9f | Previous: d3a06f0 | Ratio |
| --- | --- | --- | --- |
| Benchmarks.FlexibleParserBenchmark.Small_ROM | 139.72749559084573 ns (± 0.6902930401856936) | 139.20921897888184 ns (± 0.3414138670448296) | 1.00 |
| Benchmarks.FlexibleParserBenchmark.Small_MultiSegment | 354.3225266138713 ns (± 0.8942941863343116) | 349.4604838689168 ns (± 2.860363450836272) | 1.01 |
| Benchmarks.FlexibleParserBenchmark.Header4K_ROM | 694.9425118764242 ns (± 2.3547129820667396) | 708.5051829020182 ns (± 2.487052061932313) | 0.98 |
| Benchmarks.FlexibleParserBenchmark.Header4K_MultiSegment | 1778.321886698405 ns (± 17.50093798247063) | 1826.123291015625 ns (± 16.460001723138515) | 0.97 |
| Benchmarks.FlexibleParserBenchmark.Header32K_ROM | 4831.17500559489 ns (± 26.374917042370054) | 4949.9634958903 ns (± 10.64757006352856) | 0.98 |
| Benchmarks.FlexibleParserBenchmark.Header32K_MultiSegment | 12079.127970377604 ns (± 74.47769506531404) | 12010.515991210938 ns (± 66.48393176840845) | 1.01 |
| Benchmarks.UltraHardenedParserBenchmark.Small_ROM | 252.49250237147012 ns (± 0.4901765172340976) | 252.88829962412515 ns (± 1.7230360417772392) | 1.00 |
| Benchmarks.UltraHardenedParserBenchmark.Small_MultiSegment | 536.6848204930624 ns (± 0.5783545803477097) | 559.2204907735189 ns (± 4.167496342803848) | 0.96 |
| Benchmarks.UltraHardenedParserBenchmark.Header4K_ROM | 1135.0538514455159 ns (± 2.1511206126767997) | 1118.6354840596516 ns (± 0.7807947528732518) | 1.01 |
| Benchmarks.UltraHardenedParserBenchmark.Header4K_MultiSegment | 2202.86643854777 ns (± 20.347785007036975) | 2225.3782081604004 ns (± 17.533121532742204) | 0.99 |
| Benchmarks.UltraHardenedParserBenchmark.Header32K_ROM | 7217.457476298015 ns (± 7.737945397352009) | 7139.710075378418 ns (± 18.12358659190381) | 1.01 |
| Benchmarks.UltraHardenedParserBenchmark.Header32K_MultiSegment | 15316.31938680013 ns (± 67.76101236052526) | 15398.131754557291 ns (± 158.60331507605568) | 0.99 |

Allocated memory per operation (the generated comment labels these values `ns`, but they are bytes):

| Benchmark suite | Current: 62d6f9f | Previous: d3a06f0 | Ratio |
| --- | --- | --- | --- |
| Benchmarks.FlexibleParserBenchmark.Small_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.FlexibleParserBenchmark.Small_MultiSegment.Allocated | 112 B (± 0) | 112 B (± 0) | 1 |
| Benchmarks.FlexibleParserBenchmark.Header4K_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.FlexibleParserBenchmark.Header4K_MultiSegment.Allocated | 4128 B (± 0) | 4128 B (± 0) | 1 |
| Benchmarks.FlexibleParserBenchmark.Header32K_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.FlexibleParserBenchmark.Header32K_MultiSegment.Allocated | 32800 B (± 0) | 32800 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Small_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Small_MultiSegment.Allocated | 128 B (± 0) | 128 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Header4K_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Header4K_MultiSegment.Allocated | 4128 B (± 0) | 4128 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Header32K_ROM.Allocated | 0 B (± 0) | 0 B (± 0) | 1 |
| Benchmarks.UltraHardenedParserBenchmark.Header32K_MultiSegment.Allocated | 32800 B (± 0) | 32800 B (± 0) | 1 |

This comment was automatically generated by workflow using github-action-benchmark.

MDA2AV and others added 3 commits June 4, 2026 21:51
Reverts the earlier 'fair' swap. Since the C core is single-slab, the binding must linearize, so that copy is the binding's real cost (counted: reused-buffer CopyTo). By the same logic the managed column must show ITS real linearization — TryExtractFullHeaderValidated, which input.ToArray()s every request — not a hand-rolled reused buffer. So multi-seg now reflects what each actually does: managed allocates the linearization buffer per request (~9200 ns @ 32KB), the bindings reuse one (~4500 ns). The copy is in both; the ~2x gap is the per-request allocation the single-slab binding avoids (a managed caller can match it by hand-rolling CopyTo+ROM).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… buffer strategy)

The C# FFI does linearize on multi-segment — both paths do. The bug was that the managed column used the one-shot API (ToArray) while the FFI used a reused buffer, smuggling the linearization strategy into a parser comparison. Now every multi-seg path does CopyTo/memcpy into a reused buffer + parse, so the copy is counted identically and the column reflects the parser: multi-seg = contiguous + a memcpy for all, native ~1.2x ahead in both modes. The TryExtractFullHeaderValidated ToArray-per-request cost (~9.2us vs ~5.4us @ 32KB) is now a footnote, not a confound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the prose README with verified, copy-pasteable request-header-parsing examples for the C# library (UltraHardenedParser), the .NET binding (Glyph11Parser, zero-alloc caller storage), and the Kotlin binding (Glyph11.parse). To make the Kotlin example real, the binding now surfaces parsed headers/query as List<Glyph11Field> (name/value spans) instead of only a count. All three examples were compiled/run against the real libraries; Kotlin smoke now also asserts a header name/value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MDA2AV MDA2AV merged commit a9f941a into main Jun 4, 2026
4 checks passed