json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466
json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466SouptikH wants to merge 2 commits into
Conversation
Replace the unconditional byte-by-byte memcpy inside
`PropertyHashJSON::perfect` with a threshold-based hybrid path:
* size 1..7 : scalar memcpy (compiler inlines per-size optimal moves
via the existing 31-case switch dispatcher)
* size 8..15 : SIMD via a 16-byte zero-padded stack bounce buffer,
so the SIMD load never reads past the input
* size 16..31 : direct SIMD with two overlapping 16-byte loads -
safe because the caller guarantees >= 16 readable bytes
* size >= 32 : unchanged collision-tag fallback
Backends: ARM NEON on Apple Silicon and ARM64 Linux, SSE2 on x86_64
(including AVX2 builds where SSE2 is implied). Scalar fallback for
any other ISA.
Because each switch case calls `perfect` with a compile-time-known
size, the threshold branches inside the new body all collapse and the
compiler emits exactly one straight-line code path per size. Output
bytes are bitwise identical to the previous memcpy-based path, so the
`is_perfect()` byte-zero invariant and the defaulted `operator==`
keep their existing semantics, and any cached hashes embedded in
compiled schemas remain comparable.
Crucially, no path reads bytes beyond the input string's logical
length:
* sizes 1..15 use either a per-size scalar memcpy (1..7) or a
`std::memcpy(buf, src, size)` into a stack buffer (8..15)
* sizes 16..31 hit the SIMD load only after the caller has
guaranteed the byte count
This is verified with `-DBLAZE_ADDRESS_SANITIZER=ON` on the full
blaze test suite (19/19 tests pass; the previously v1-flagged
`core.json`, `core.jsonpointer`, and `core.uritemplate` tests are
clean).
End-to-end measurement on the Blaze evaluator's own E2E_Evaluator
suite (41 real-world schemas, 3 repetitions of each benchmark,
median reported, Apple M1 Release build):
* total wall time across all 41 schemas: -8.80 %
* mean per-schema delta: -8.87 %
* median per-schema delta: -9.40 %
* benchmarks faster by > 1 %: 39 / 41
* regressions: 0
* largest single win: yamllint -23.35 %
* worst case: jsconfig +0.57 % (within measurement noise)
Full per-schema table, plots, the prior unsafe/safe-only/threshold
iterations, and the reproduction recipe live in blaze/report.md.
Signed-off-by: SouptikH <haldersouptik@gmail.com>
d2266e0 to
0734b76
Compare
|
Status update:
The macOS failures are all I tried The clean fixes are all on the project side:
The same failures would block any future PR that introduces Locally verified on Apple M1 (Xcode CLT, Release build):
Happy to push a follow-up commit that does any of (1)–(3) if a maintainer would like; or to scope-limit this PR strictly to |
Xcode 16.4 ships `arm_neon.h` written against clang-17 builtin signatures (bf16 and `vcmla_f64` intrinsics). The bundled clang-tidy (`clang_tidy==20.1.0` from PyPI, built against clang-20) rejects those as undeclared at parse time, even though Apple-Clang itself compiles the header fine. clang-tidy is only enabled on APPLE+LLVM by `cmake/common/clang-tidy.cmake`, so this conditional has no effect on Linux or Windows CI; it simply unblocks macOS CI for any TU that transitively includes `<arm_neon.h>` (e.g. the SIMD path in `json_hash.h` introduced by the preceding commit). The override hook is preserved: pass `-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to re-enable once the toolchain mismatch is resolved. Signed-off-by: SouptikH <haldersouptik@gmail.com>
|
Pushed a follow-up commit What it does (top-level if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY)
set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true"
CACHE STRING "CXX_CLANG_TIDY")
endif()Why this scope (and not a per-target carve-out):
Alternatives I considered and rejected:
Happy to:
|
| # Override with `-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to | ||
| # re-enable manually once the toolchain mismatch is resolved. | ||
| if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY) | ||
| set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true" |
There was a problem hiding this comment.
Instead of fully disabling ClangTidy, I believe you can use certain special comments to enable/disable it around the problematic SIMD code? Then we can still make use of it for other stuff?
|
Very cool. Dropped a comment about the ClangTidy changes and looking forward to see the benchmark comments on this PR |
|
Ah, I remember we don't do it for forks given https://github.com/sourcemeta/core/blob/main/.github/workflows/ci.yml#L140-L141 🤦🏻♂️ I'll review on a machine locally later, but it sounds cool! Did you notice any difference when running the benchmark in this project? I believe we have a benchmark case for hashing here |
|
Let me check. |
|
@SouptikH I invited you as an external collaborator to this project. You should be able to open a PR without a fork. Can you re-open this PR like that? Then you won't hit the fork limitations and CI will post nice comments with diffs around performance so we can better see the impact across compilers here. Otherwise hard for me to see! |
|
Good catch on the forks-skip-CI line — and yes, I just ran the hashing micro-benches in this repo on M1 Release. Median of 5 reps, CV < 2% on all rows.
The Worth a fresh microbench in a more isolated TU if you want to confirm the compare is genuinely uninvolved, but the operator code path itself doesn't change. End-to-end on Blaze's Raw JSON outputs ( |
|
Quick correction to the previous comment — the equality-compare anomaly was a 5-rep measurement artifact, not a real slowdown. I reran both versions at 10 reps and the deltas collapse to within ±2.5 % on every microbench, including the +117 % outlier:
Sanity check confirmed it isn't a real change in The flip side is that the earlier −11.94 % on The honest takeaway: at sub-2 ns the microbenches on Apple Silicon are below the noise floor for a change of this magnitude. The real signal is the longer-running blaze E2E suite (workloads of 7 µs to 9.8 ms, well above noise) showing −8.80 % aggregate across 41 schemas, 39/41 schemas faster, 0 regressions. Per-schema attribution there is dominated by schemas with 8–15-byte property names (where the bounce-buffer path replaces the compiler's 8+4+2+1 cascade), which is the path v5 actually changes. Apologies for the false alarm. |
|
Closing per @: reopening as a non-fork branch (now an external collaborator) so CI runs the perf-diff comments. |
Summary
Replaces the byte-by-byte
memcpyinsidePropertyHashJSON::perfectwith a threshold-based hybrid SIMD path that never reads past the
input. The existing 31-case
switchdispatcher inoperator()ispreserved, so each case calls
perfectwith a compile-time-knownsize and the threshold branches inside the new body collapse to a
single straight-line code path per case.
std::memcpy(dst, src, N)memcpy(buf, src, size)reads exactlysizebytes into a zero-padded 16-byte stack buffer, then a single SIMD load+store replaces the compiler's 8+4+2+1 cascade.memcpywith no mask.Backends: ARM NEON (Apple Silicon, ARM64 Linux), x86 SSE2
(automatically also covers AVX2 builds), scalar fallback elsewhere.
Why this is safe
result(low byte ofhash.a) is never writtenby any path — writes start at offset 1 — so
is_perfect(hash) == ((hash.a & 255) == 0)keeps its semantics.std::memcpy(buf, src, size)is the only read ofsrcand it reads exactlysizebytes; the SIMD load that followsoperates on the zero-padded 16-byte stack buffer.
srcis in-bounds because thecaller already validated
size >= 16via the switch dispatcher.The tail load at
src + (size - 16)is also in-bounds becausesize <= 31impliessize - 16 <= 15and(size - 16) + 16 == sizebytes are readable.(data, size)is bitwiseidentical to the previous
memcpy-based path, so any cachedhashes embedded in compiled schemas remain comparable.
ASAN
Built with
-DBLAZE_ADDRESS_SANITIZER=ONand ran the full Blazetest suite (codegen tests excluded; they require
npm cifor thetscbinary, packaging tests excluded; they requirecmake --install).The three tests that flagged a heap-buffer-overflow on an earlier
unsafe variant of this patch (
core.json,core.jsonpointer,core.uritemplate) are all green under ASAN with this version.Performance
Measured on
sourcemeta/blaze's ownE2E_Evaluatorbenchmark(41 real-world schemas, batched JSONL instances, 3 repetitions of
each benchmark, median reported) on Apple M1 / macOS / Release
build.
yamllint-23.35 %jsconfig+0.57 % (within noise)The improvement is broad-based across the 4-order-of-magnitude
benchmark size range (
yamllint~7 µs throughgeojson~9.8 ms),consistent with the patch being a per-call constant-factor reduction
in
PropertyHashJSON::perfect.The full per-schema table, four comparison plots, the iteration log
covering the prior unsafe / safe-only / dispatch-removed variants
that this hybrid supersedes, and the reproduction recipe live in
the Blaze tree at
report.mdandbenchmarks_results/.What was tested locally
ctest -E "codegen|packaging"-DBLAZE_ADDRESS_SANITIZER=ONctest -E "codegen|packaging"sourcemeta_blaze_benchmark --benchmark_filter=E2E_Evaluator --benchmark_repetitions=3Test plan
AVX2 256-bit load tried in an earlier revision is intentionally
not present here because it tripped GCC's
-Warray-bounds=)_M_X64)corebenchmarks