perf: v0.12.20 — premultiplied StateIDs, break-at-match #154

Merged: kolkov merged 1 commit into main from release/v0.12.20 on Mar 25, 2026

kolkov commented Mar 25, 2026

Summary

Premultiplied/tagged StateIDs + Rust-aligned DFA determinize with break-at-match + Phase 3 elimination. 27 files, +734 -583 lines.

DFA Core

  • Premultiplied StateIDs — eliminate sid * stride multiply from hot loop (Rust LazyStateID)
  • Tagged StateIDs — match/dead/invalid in high bits, single IsTagged() branch
  • 4x loop unrolling in all DFA search functions
  • 1-byte match delay — Rust determinize approach (mod.rs:254-286)
  • Break-at-match — Rust determinize::next break semantics, replaces filterStatesAfterMatch
  • Epsilon closure rewrite — add-on-pop DFS, reverse Split push, incremental per-target (verified vs Rust cargo run)
  • Phase 3 eliminated — bidirectional DFA reduced from 3-pass to 2-pass
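
For readers unfamiliar with the "add-on-pop" closure order, here is a minimal Go sketch of the idea. It is not the code from this PR: the `Kind`, `State`, and `epsilonClosure` names and the tiny NFA in `main` are hypothetical. A state is added to the result only when it is popped from the stack, and Split alternatives are pushed in reverse so the highest-priority branch is popped first.

```go
package main

import "fmt"

// Kind and State are a hypothetical, simplified NFA representation.
type Kind int

const (
	KindByte    Kind = iota // consumes one byte, no epsilon edges
	KindSplit               // epsilon edges to Next, in priority order
	KindEpsilon             // single epsilon edge to Next[0]
	KindMatch
)

type State struct {
	Kind Kind
	Next []int
}

// epsilonClosure returns the states reachable from start via epsilon
// edges, in priority (preorder) order. States are recorded when popped,
// not when pushed, and Split targets are pushed in reverse order.
func epsilonClosure(nfa []State, start int) []int {
	seen := make([]bool, len(nfa))
	var order []int
	stack := []int{start}
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[id] {
			continue
		}
		seen[id] = true
		order = append(order, id) // add on pop
		s := nfa[id]
		if s.Kind == KindSplit || s.Kind == KindEpsilon {
			// Reverse push: the first alternative ends up on top.
			for i := len(s.Next) - 1; i >= 0; i-- {
				stack = append(stack, s.Next[i])
			}
		}
	}
	return order
}

func main() {
	nfa := []State{
		{Kind: KindSplit, Next: []int{1, 2}}, // 0
		{Kind: KindSplit, Next: []int{3, 4}}, // 1
		{Kind: KindByte},                     // 2
		{Kind: KindByte},                     // 3
		{Kind: KindByte},                     // 4
	}
	fmt.Println(epsilonClosure(nfa, 0))
}
```

With this order the closure lists higher-priority NFA states first, which is what a leftmost-first determinizer relies on when it stops at the first Match state.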

Meta Engine

  • DFA direct FindAll path (skip meta prefilter layer)
  • Anchored FindAll fast paths (skip pool overhead, first-byte rejection)
  • BreakAtMatch config: true for forward DFA, false for reverse DFA
  • Fix: dfaConfig now uses DefaultConfig() to inherit BreakAtMatch=true
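
The first-byte rejection mentioned above can be illustrated roughly as follows. This is only a sketch under stated assumptions: `AnchoredMatcher` and `NewAnchoredMatcher` are hypothetical names, the first-byte set is supplied by hand rather than derived from the compiled pattern, the pattern is assumed unable to match the empty string, and the standard library `regexp` package stands in for the real engine.

```go
package main

import (
	"fmt"
	"regexp"
)

// AnchoredMatcher is a hypothetical wrapper around a start-anchored
// pattern. first[b] is true when a match may begin with byte b.
type AnchoredMatcher struct {
	re    *regexp.Regexp
	first [256]bool
}

// NewAnchoredMatcher takes the set of possible first bytes explicitly.
// A real engine would compute this set from the pattern itself.
func NewAnchoredMatcher(pattern, firstBytes string) *AnchoredMatcher {
	m := &AnchoredMatcher{re: regexp.MustCompile(pattern)}
	for i := 0; i < len(firstBytes); i++ {
		m.first[firstBytes[i]] = true
	}
	return m
}

// Find rejects the haystack with one table lookup when its first byte
// cannot start a match, and only otherwise runs the full engine.
// Assumes the pattern cannot match the empty string.
func (m *AnchoredMatcher) Find(haystack []byte) []int {
	if len(haystack) == 0 || !m.first[haystack[0]] {
		return nil
	}
	return m.re.FindIndex(haystack)
}

func main() {
	m := NewAnchoredMatcher(`^[0-9]+`, "0123456789")
	fmt.Println(m.Find([]byte("123abc")))
	fmt.Println(m.Find([]byte("abc")))
}
```

The check pays off mainly on non-matching input, where it avoids any engine setup.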

NFA/Prefilter

  • Lazy SlotTable init, anchored BT large input fix
  • Memmem: Memchr(rareByte) + verify (Rust approach)
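
As an illustration of the "Memchr(rareByte) + verify" shape, here is a small Go sketch. It is not the prefilter from this PR: `rank`, `rarest`, and `Memmem` are hypothetical names, the rarity heuristic is a made-up stand-in for a real byte-frequency table, and `bytes.IndexByte` plays the role of memchr.

```go
package main

import (
	"bytes"
	"fmt"
)

// rank is a toy rarity heuristic (lower means rarer). It is an
// assumption of this sketch, not a real frequency table.
func rank(b byte) int {
	switch {
	case b == ' ' || b == 'e' || b == 't' || b == 'a':
		return 3
	case b >= 'a' && b <= 'z':
		return 2
	case b >= 'A' && b <= 'Z' || b >= '0' && b <= '9':
		return 1
	default:
		return 0
	}
}

// rarest returns the index of the rarest byte in needle (first on ties).
func rarest(needle []byte) int {
	best := 0
	for i := 1; i < len(needle); i++ {
		if rank(needle[i]) < rank(needle[best]) {
			best = i
		}
	}
	return best
}

// Memmem returns the index of the first occurrence of needle in
// haystack, or -1. It scans for one rare byte, then verifies the
// full needle around each candidate position.
func Memmem(haystack, needle []byte) int {
	if len(needle) == 0 {
		return 0
	}
	if len(needle) > len(haystack) {
		return -1
	}
	ri := rarest(needle)
	rb := needle[ri]
	pos := ri                                // first index where rb may sit
	last := len(haystack) - len(needle) + ri // last such index
	for pos <= last {
		j := bytes.IndexByte(haystack[pos:last+1], rb)
		if j < 0 {
			return -1
		}
		start := pos + j - ri
		if bytes.Equal(haystack[start:start+len(needle)], needle) {
			return start
		}
		pos += j + 1
	}
	return -1
}

func main() {
	fmt.Println(Memmem([]byte("hello, world"), []byte("o, w")))
	fmt.Println(Memmem([]byte("aaa"), []byte("b")))
}
```

Scanning for a single rare byte keeps the inner loop in the vectorized byte search and leaves the full comparison to the few candidates that survive.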

Benchmarks (EPYC CI, 6MB input)

Pattern                    vs stdlib    vs Rust
ip                         675x         18.5x faster
multiline_php              288x         2.0x faster
char_class                 11x          1.3x faster
inner_literal              668x         ~parity
email                      506x         1.8x slower
LangArena total (13 pat)   30x          3.9x gap

No regressions vs v0.12.19 on any pattern.

Verification

  • go test ./... — all 9 packages pass
  • gofmt -l — clean
  • golangci-lint run — clean (only dupl on intentional DFS duplication)
  • DFA SearchAt verified against Rust regex-automata find_fwd — identical on 7 patterns
  • regex-bench CI green (EPYC 9V74 + EPYC 7763)
  • Regression check: tokens +0.3%, peak_hours +5.2% (within noise)

codecov bot commented Mar 25, 2026

github-actions bot commented Mar 25, 2026

Benchmark Comparison

Comparing main → PR #154

Summary: geomean 84.33n → 81.26n (-3.64%)

⚠️ Potential regressions detected:

Benchmark                                    main            PR #154         Delta
Accelerate/memchr1-4                         103.2n ± ∞ ¹    140.5n ± ∞ ¹    +36.14% (p=0.008 n=5)
Accelerate/memchr3-4                         101.8n ± ∞ ¹    112.8n ± ∞ ¹    +10.81% (p=0.016 n=5)
geomean                                      36.19n          36.22n           +0.08%
AnchoredLiteralVsStdlib/stdlib_short-4       258.0n ± ∞ ¹    259.1n ± ∞ ¹     +0.43% (p=0.024 n=5)
AnchoredLiteralVsStdlib/stdlib_medium-4      369.5n ± ∞ ¹    382.2n ± ∞ ¹     +3.44% (p=0.016 n=5)
AnchoredLiteralVsStdlib/coregex_no_match-4   5.645n ± ∞ ¹    6.346n ± ∞ ¹    +12.42% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

…ination

DFA Core — Premultiplied + Tagged StateIDs:
- StateID stores byte offset into flatTrans, eliminating multiply from hot loop
- Match/dead/invalid flags encoded in StateID high bits (single IsTagged branch)
- 4x loop unrolling in searchFirstAt, searchAt, searchEarliestMatch
- safeOffset eliminated from all DFA search paths
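
To make the premultiplied and tagged StateID idea concrete, here is a minimal Go sketch. It is an illustration only, not the types from this commit: `StateID`'s bit layout, the `stride` of 256, the `DFA` struct, and the hand-built automaton for the literal `ab` are all assumptions. The loop is not unrolled, and a match is reported immediately rather than with the 1-byte delay described in this PR.

```go
package main

import "fmt"

// StateID holds a premultiplied offset into the flat transition table,
// with special-state flags in the high bits (hypothetical layout).
type StateID uint32

const (
	stride = 256 // one slot per input byte; no byte classes in this sketch

	tagMatch StateID = 1 << 31
	tagDead  StateID = 1 << 30
	tagMask          = tagMatch | tagDead
)

// IsTagged is the only check the hot loop needs for the common case.
func (s StateID) IsTagged() bool { return s&tagMask != 0 }
func (s StateID) IsMatch() bool  { return s&tagMatch != 0 }
func (s StateID) IsDead() bool   { return s&tagDead != 0 }

// Offset strips the tags and returns the premultiplied table offset.
func (s StateID) Offset() int { return int(s &^ tagMask) }

type DFA struct {
	trans []StateID // flat table; row r starts at offset r*stride
	start StateID
}

// buildAB hand-builds an unanchored DFA for the literal "ab".
func buildAB() *DFA {
	const s0, s1, s2 = StateID(0 * stride), StateID(1 * stride), StateID(2 * stride)
	t := make([]StateID, 3*stride) // zero value means "back to s0"
	t[s0.Offset()+'a'] = s1
	t[s1.Offset()+'a'] = s1
	t[s1.Offset()+'b'] = s2 | tagMatch
	return &DFA{trans: t, start: s0}
}

// SearchEarliest returns the end offset of the first match, or -1.
func (d *DFA) SearchEarliest(haystack []byte) int {
	sid := d.start
	for i, b := range haystack {
		// sid is untagged here, so it indexes the table directly:
		// no sid*stride multiply and no masking on the fast path.
		sid = d.trans[int(sid)+int(b)]
		if sid.IsTagged() {
			if sid.IsMatch() {
				return i + 1
			}
			return -1 // dead state: no match is possible
		}
	}
	return -1
}

func main() {
	d := buildAB()
	for _, h := range []string{"xxaab", "abab", "ba"} {
		fmt.Printf("%q -> %d\n", h, d.SearchEarliest([]byte(h)))
	}
}
```

Because every special state shares the tag bits, the inner loop carries a single rarely taken branch, and the match/dead distinction is made only after leaving the fast path.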

DFA Core — Rust-aligned Determinize:
- 1-byte match delay (Rust determinize mod.rs:254-286)
- Break-at-match: stop NFA iteration at Match state, drop prefix restarts
- Epsilon closure rewrite: add-on-pop DFS with reverse Split push order,
  matching Rust sparse set insertion order (verified via cargo run)
- Incremental per-target epsilon closure in moveWithWordContext
- filterStatesAfterMatch removed (replaced by break-at-match)
- BreakAtMatch config: true for forward DFA, false for reverse DFA
- Phase 3 (SearchAtAnchored re-scan) eliminated — 2-pass bidirectional DFA
- Fix: meta dfaConfig uses DefaultConfig() to inherit BreakAtMatch=true
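
The break-at-match rule can be sketched in a few lines of Go. This is a simplified illustration with hypothetical names (`NFAState`, `step`): it assumes the current NFA state set is already in priority order and that byte transitions point directly at non-epsilon states, so no closure is computed here.

```go
package main

import "fmt"

// NFAState is a hypothetical state: either a Match state or a byte
// range [Lo, Hi] leading to Next.
type NFAState struct {
	Match  bool
	Lo, Hi byte
	Next   int
}

// step computes the NFA state set reached from cur on byte b.
// matched reports that cur contained a Match state, so the DFA state
// being built would be flagged as matching (a match delayed by one byte).
// With breakAtMatch, states listed after the first Match are dropped,
// since they have lower priority under leftmost-first semantics.
func step(nfa []NFAState, cur []int, b byte, breakAtMatch bool) (next []int, matched bool) {
	seen := make(map[int]bool)
	for _, id := range cur {
		s := nfa[id]
		if s.Match {
			matched = true
			if breakAtMatch {
				break
			}
			continue
		}
		if s.Lo <= b && b <= s.Hi && !seen[s.Next] {
			seen[s.Next] = true
			next = append(next, s.Next)
		}
	}
	return next, matched
}

func main() {
	nfa := []NFAState{
		{Lo: 'a', Hi: 'a', Next: 1}, // 0
		{Match: true},               // 1
		{Lo: 'a', Hi: 'z', Next: 2}, // 2: lower-priority continuation
	}
	cur := []int{1, 2} // Match comes first, so it has priority

	next, matched := step(nfa, cur, 'a', true)
	fmt.Println("forward (break):", next, matched)

	next, matched = step(nfa, cur, 'a', false)
	fmt.Println("reverse (no break):", next, matched)
}
```

With the flag set, the lower-priority continuation never enters the next state set, which appears to be the job that filterStatesAfterMatch used to do as a separate pass.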

Meta Engine:
- DFA direct FindAll path — skip meta prefilter layer, call DFA directly
- Fast path for start-anchored FindAll — skip pool overhead
- Inline first-byte rejection for anchored patterns
- Prefilter candidate pass-through to bidirectional DFA
- Skip reverse DFA for always-anchored patterns

NFA/PikeVM:
- Lazy SlotTable init — reduce cold start overhead
- Fix anchored BoundedBacktracker on large input — truncate to MaxInputSize

Prefilter:
- Memmem: Memchr(rareByte) + verify (Rust approach) — replaces MemchrPair

Benchmarks (EPYC CI, 6MB input, vs stdlib / vs Rust):
- ip: 675x faster than stdlib, 18.5x faster than Rust
- multiline_php: 288x faster than stdlib, 2.0x faster than Rust
- char_class: 11x faster than stdlib, 1.3x faster than Rust
- inner_literal: 668x faster than stdlib, at Rust parity
- email: 506x faster than stdlib, 1.8x slower than Rust
- LangArena total: 30x faster than stdlib, 3.9x gap vs Rust

27 files changed, +734 -583 lines. All tests pass.
kolkov force-pushed the release/v0.12.20 branch from a8e8632 to d22c05c on March 25, 2026 19:06
kolkov merged commit 90d77fd into main on Mar 25, 2026
9 checks passed