[Java Extension] Add fast path for decoding / unescaping of ASCII-only strings #1034

Open
samyron wants to merge 2 commits into ruby:master from samyron:sm/java-parser-fastpath

samyron commented Jun 21, 2026

Contributor

Overview

This PR is a follow-up to #1004. It adds a fast path for decoding / unescaping ASCII-only strings.
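For readers unfamiliar with the idea, here is a minimal sketch of how an ASCII-only check can be done eight bytes at a time (SWAR style). It is illustrative only and is not the code in this PR: the class and method names (`AsciiFastPath`, `isAscii`) are hypothetical, and the real parser may detect ASCII-only input differently.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the json gem.
final class AsciiFastPath {
    // One high bit per byte lane; a set bit means that byte is >= 0x80.
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Returns true if every byte in buf[start, end) is 7-bit ASCII.
    static boolean isAscii(byte[] buf, int start, int end) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        int i = start;
        // Test eight bytes per iteration using an absolute 64-bit read.
        for (; i + 8 <= end; i += 8) {
            if ((bb.getLong(i) & HIGH_BITS) != 0) {
                return false;
            }
        }
        // Check the remaining (at most seven) bytes one at a time.
        for (; i < end; i++) {
            if (buf[i] < 0) { // Java bytes are signed, so >= 0x80 is negative
                return false;
            }
        }
        return true;
    }
}
```

When a check like this succeeds, the string contains no multi-byte sequences, so a decoder could skip UTF-8 validation for it.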

Additionally, this adds a new JVM system property, jruby.json.validateUTF8Strings, which defaults to true. Setting it to false disables UTF-8 validation, which matches the behavior of the C parser. This provides an opt-in resolution to #138: JRuby users can explicitly disable UTF-8 validation. With validation disabled, some benchmarks see a modest performance benefit.
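As a rough illustration of the "defaults to true" behavior, a boolean JVM system property is commonly read like this. The class and method names (`JsonParserConfig`, `validateUtf8`) are hypothetical and only the property name comes from this PR; the actual extension may read and cache the value differently.

```java
// Hypothetical helper, not part of the json gem.
final class JsonParserConfig {
    static final String VALIDATE_UTF8_PROPERTY = "jruby.json.validateUTF8Strings";

    // Returns true unless the property is explicitly set to something
    // other than "true" (Boolean.parseBoolean is case-insensitive).
    static boolean validateUtf8() {
        return Boolean.parseBoolean(
            System.getProperty(VALIDATE_UTF8_PROPERTY, "true"));
    }
}
```

With JRuby, a system property like this can typically be passed through to the JVM on the command line, for example `jruby -J-Djruby.json.validateUTF8Strings=false script.rb`.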

Benchmarks

These benchmarks were run on an M1 MacBook Air. There is non-trivial variance between runs, but I tried to get the best "before" numbers possible. The "after" numbers are a bit more stable.

SWAR + UTF-8 Validation

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   851.000 i/100ms
Calculating -------------------------------------
               after      8.841k (± 1.0%) i/s  (113.11 μs/i) -     44.252k in   5.005525s

Comparison:
before:     6843.5 i/s
 after:     8840.6 i/s - 1.29x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    69.000 i/100ms
Calculating -------------------------------------
               after    717.676 (± 0.8%) i/s    (1.39 ms/i) -      3.588k in   4.999474s

Comparison:
before:      725.0 i/s
 after:      717.7 i/s - same-ish: difference falls within error


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    40.000 i/100ms
Calculating -------------------------------------
               after    415.652 (± 1.2%) i/s    (2.41 ms/i) -      2.080k in   5.004184s

Comparison:
before:      411.5 i/s
 after:      415.7 i/s - same-ish: difference falls within error

SWAR + No UTF-8 Validation

activitypub.json is the same, but twitter.json and citm_catalog.json see improvements.

== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    78.000 i/100ms
Calculating -------------------------------------
               after    795.335 (± 1.0%) i/s    (1.26 ms/i) -      3.978k in   5.001663s

Comparison:
before:      724.8 i/s
 after:      795.3 i/s - 1.10x  faster

Vector API + UTF-8 Validation

== Parsing activitypub.json (58160 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after   860.000 i/100ms
Calculating -------------------------------------
               after      8.884k (± 1.0%) i/s  (112.56 μs/i) -     44.720k in   5.033836s

Comparison:
before:     6971.4 i/s
 after:     8883.9 i/s - 1.27x  faster


== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    74.000 i/100ms
Calculating -------------------------------------
               after    750.589 (± 0.9%) i/s    (1.33 ms/i) -      3.774k in   5.028054s

Comparison:
before:      737.8 i/s
 after:      750.6 i/s - same-ish: difference falls within error


== Parsing citm_catalog.json (1727030 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    41.000 i/100ms
Calculating -------------------------------------
               after    428.952 (± 0.9%) i/s    (2.33 ms/i) -      2.173k in   5.065834s

Comparison:
before:      423.7 i/s
 after:      429.0 i/s - same-ish: difference falls within error

Vector API + No UTF-8 Validation

== Parsing twitter.json (567916 bytes)
jruby 10.0.5.0 (3.4.5) 2026-04-06 5db1ba72f3 OpenJDK 64-Bit Server VM 24.0.1+9-30 on 24.0.1+9-30 +indy +jit [arm64-darwin]
Warming up --------------------------------------
               after    81.000 i/100ms
Calculating -------------------------------------
               after    823.273 (± 1.1%) i/s    (1.21 ms/i) -      4.131k in   5.017779s

Comparison:
before:      735.3 i/s
 after:      823.3 i/s - 1.12x  faster

samyron added 2 commits June 19, 2026 14:19
…n decoding ASCII-only strings with escape or control characters.
… being done if UTF-8 validation is disabled.
samyron commented Jun 21, 2026

Contributor Author

Note: I'm happy to remove the jruby.json.validateUTF8Strings property if requested. I thought about adding this as an actual configuration option but decided against it. I consider the JVM property to be a somewhat "undocumented but present parameter that could be removed at any point." At the same time, as soon as it's released someone will use it, and if it's ever removed, even if it was never documented, someone will potentially raise an issue.
