perf: flat DFA + integrated prefilter — 35% faster than baseline by kolkov · Pull Request #151 · coregx/coregex

kolkov · 2026-03-24T09:32:41Z

Summary

Major DFA architecture upgrade — flat transition table, integrated prefilter skip-ahead in DFA and PikeVM, 4x loop unrolling. Rust-aligned architecture.

Kostya benchmark: 3.60s -> 1.17s (3x faster). 35% faster than v0.12.14 baseline.

Performance

Flat DFA transition table — replaced stateList[id].transitions[class] (2 pointer chases) with flatTrans[sid*stride+class] (1 flat array access). Applied to all 8 DFA search functions.
DFA integrated prefilter skip-ahead — when DFA returns to start state, uses prefilter.Find() to skip ahead. Rust approach (hybrid/search.rs:232-258).
PikeVM integrated prefilter skip-ahead — prefilter inside NFA search loop (pikevm.rs:1293). Safe for partial-coverage prefilters.
4x loop unrolling in searchFirstAt — process 4 bytes per iteration.
DFA prefilter skip for incomplete prefilters — DFA verifies full pattern, skip always safe.

Fixed

NFA candidate loop guard — partialCoverage flag instead of IsComplete(). errors pattern: 1984ms -> 80ms.
DFA prefilter skip for memmem/Teddy — sessions: 229ms -> 30ms.

Results (Kostya benchmark, 7.2MB x 10 iter)

Version	Total	vs baseline	vs Rust
v0.12.14 (baseline)	1.80s	—	4.5x
v0.12.17 (regressed)	3.60s	+100%	9.0x
v0.12.18	1.17s	-35%	2.9x
Rust	0.40s	—	—

regex-bench CI (EPYC)

3 patterns faster than Rust (ip 5.3x, multiline_php 1.1x, char_class 1.1x)
No regressions vs v0.12.14 on any platform (EPYC, Xeon, M1)
Stdlib compat: 38/38 PASS

Test plan

go test ./... — all pass
TestStdlibCompatibility — 38/38 PASS
golangci-lint run — 0 issues
regex-bench CI: Benchmark + LangArena + macOS ARM64
Local A/B: all patterns faster than v0.12.17

…lete

IsComplete() guard blocked prefilter candidate loop for ALL incomplete prefilters, including prefix-only ones where all alternation branches are represented. This caused 22x regression on Kostya's errors pattern (1984ms vs 90ms on v0.12.14).

Root cause: Rust integrates prefilter as skip-ahead INSIDE PikeVM (pikevm.rs:1293-1299), not as external correctness gate. When NFA states are empty, prefilter skips ahead. Partial coverage is safe because NFA continues scanning if prefilter misses.

Fix: Added partialCoverage flag on literal.Seq (set only on overflow truncation). NFA candidate loop uses !partialCoverage guard instead of IsComplete(). DFA paths retain IsComplete() where needed.

errors: 1984ms -> 109ms. Stdlib compat: 38/38 PASS.
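The difference between the two guards can be illustrated with a small model. This is only a sketch: the `seq` type, `extractPrefixes`, `oldGuard` and `newGuard` are hypothetical names invented for illustration and are not the `literal.Seq` API. It assumes that "complete" means every literal is a whole alternative, and that `partialCoverage` is set only when the literal list is truncated.

```go
package main

import "fmt"

// seq is a hypothetical model of an extracted literal sequence.
type seq struct {
	lits            []string
	complete        bool // every literal is a full alternative, no verification needed
	partialCoverage bool // some alternatives have no literal at all (truncated)
}

// extractPrefixes keeps at most maxLits prefixes of at most prefixLen bytes.
func extractPrefixes(alts []string, prefixLen, maxLits int) seq {
	s := seq{complete: true}
	for i, alt := range alts {
		if i >= maxLits {
			// Overflow: remaining alternatives are not represented.
			s.partialCoverage = true
			s.complete = false
			break
		}
		lit := alt
		if len(lit) > prefixLen {
			lit = lit[:prefixLen] // prefix-only literal: still covers this branch
			s.complete = false
		}
		s.lits = append(s.lits, lit)
	}
	return s
}

// oldGuard models the IsComplete()-style gate; newGuard models !partialCoverage.
func oldGuard(s seq) bool { return s.complete }
func newGuard(s seq) bool { return !s.partialCoverage }

func main() {
	prefixOnly := extractPrefixes([]string{"error", "errno"}, 3, 8)
	truncated := extractPrefixes([]string{"error", "errno"}, 3, 1)
	fmt.Println(oldGuard(prefixOnly), newGuard(prefixOnly)) // false true
	fmt.Println(oldGuard(truncated), newGuard(truncated))   // false false
}
```

In this model the prefix-only sequence is rejected by the old guard but accepted by the new one, while the truncated sequence is rejected by both.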

Integrate prefilter inside PikeVM search loop as skip-ahead (pikevm.rs:1293). When NFA has no active threads, PikeVM jumps to next candidate via prefilter.Find() instead of byte-by-byte scan. Safe for partial-coverage prefilters — NFA processes all branches from each candidate position. This is architecturally cleaner than external candidate loop guards (partialCoverage flag still used for external BT candidate loop as BoundedBacktracker has no integrated skip-ahead). Also includes PR #150 changes: partialCoverage flag on literal.Seq, NFA candidate loop guard uses partialCoverage instead of IsComplete(). errors pattern: 1984ms -> 120ms. la_suspicious: 38/38 stdlib PASS.
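A toy version of the skip-ahead idea, for a plain alternation of literals. This is a sketch under stated assumptions, not the PikeVM in nfa/pikevm.go: `thread`, `findAlt` and the `firstBytes` argument are hypothetical, alternatives must be non-empty, and the stand-in prefilter (`bytes.IndexAny`) is assumed to cover the first byte of every alternative.

```go
package main

import (
	"bytes"
	"fmt"
)

// thread is one in-flight match attempt.
type thread struct{ alt, off, start int }

// findAlt returns the first match to complete and how many positions were stepped.
// If firstBytes is non-empty it is used as a prefilter whenever no thread is active.
func findAlt(input []byte, alts []string, firstBytes string) (start, end, visited int) {
	var cur, next []thread
	for pos := 0; pos < len(input); pos++ {
		if len(cur) == 0 && firstBytes != "" {
			// No active threads: jump to the next candidate instead of scanning.
			i := bytes.IndexAny(input[pos:], firstBytes)
			if i < 0 {
				return -1, -1, visited
			}
			pos += i
		}
		visited++
		// Unanchored search: seed one thread per alternative at this position.
		for a := range alts {
			cur = append(cur, thread{alt: a, off: 0, start: pos})
		}
		next = next[:0]
		for _, t := range cur {
			if alts[t.alt][t.off] != input[pos] {
				continue // thread dies
			}
			if t.off+1 == len(alts[t.alt]) {
				return t.start, pos + 1, visited
			}
			next = append(next, thread{t.alt, t.off + 1, t.start})
		}
		cur, next = next, cur
	}
	return -1, -1, visited
}

func main() {
	in := []byte("xxxxxxbar")
	fmt.Println(findAlt(in, []string{"foo", "bar"}, ""))   // 6 9 9
	fmt.Println(findAlt(in, []string{"foo", "bar"}, "fb")) // 6 9 3
}
```

Both calls report the same match; only the number of stepped positions differs.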

Replace double indirection (stateList[id].transitions[class]) with flat transition table (flatTrans[sid*stride + class]) in searchFirstAt hot loop. Also replace State.IsMatch() with compact matchFlags[sid] bool slice. Fast path now works with state ID only — no *State pointer needed. State struct accessed only in slow path (determinize, word boundary). Inspired by Rust regex-automata hybrid/dfa.rs Cache.trans flat layout. Kostya benchmark: 3.60s -> 2.56s (1.4x faster). bots pattern restored to v0.12.14 baseline (278ms vs 287ms). Stdlib compat: 38/38 PASS.
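The layout change can be sketched as follows. The field names `flatTrans`, `matchFlags` and `stride` follow the commit message; everything else (`nestedState`, `flatDFA`, `flatten`, `run`, the three-state demo automaton) is a hypothetical illustration and not the code in dfa/lazy.

```go
package main

import "fmt"

type StateID uint32

// nestedState mirrors the old layout: each state owns its own transition slice.
type nestedState struct {
	transitions []StateID
	isMatch     bool
}

// flatDFA mirrors the new layout: one contiguous table plus a match-flag slice.
type flatDFA struct {
	stride     int       // byte classes per state
	flatTrans  []StateID // len = numStates * stride
	matchFlags []bool    // matchFlags[sid] reports whether sid is a match state
}

// flatten copies a nested state list into the flat layout.
func flatten(states []nestedState, stride int) flatDFA {
	d := flatDFA{
		stride:     stride,
		flatTrans:  make([]StateID, len(states)*stride),
		matchFlags: make([]bool, len(states)),
	}
	for id, s := range states {
		copy(d.flatTrans[id*stride:(id+1)*stride], s.transitions)
		d.matchFlags[id] = s.isMatch
	}
	return d
}

// next is a single flat-array lookup; no per-state pointer is followed.
func (d *flatDFA) next(sid StateID, class int) StateID {
	return d.flatTrans[int(sid)*d.stride+class]
}

// run feeds byte classes through the DFA and reports whether it ends in a match state.
func (d *flatDFA) run(start StateID, classes []int) bool {
	sid := start
	for _, c := range classes {
		sid = d.next(sid, c)
	}
	return d.matchFlags[sid]
}

// demoStates accepts inputs whose last two classes are both 0 (think "ends in aa").
func demoStates() []nestedState {
	return []nestedState{
		{transitions: []StateID{1, 0}},
		{transitions: []StateID{2, 0}},
		{transitions: []StateID{2, 0}, isMatch: true},
	}
}

func main() {
	d := flatten(demoStates(), 2)
	fmt.Println(d.run(0, []int{1, 0, 0})) // true
	fmt.Println(d.run(0, []int{0, 1}))    // false
}
```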

Unroll DFA hot loop 4x — process 4 bytes per iteration when all transitions are in flat table (no unknown/dead states). Falls to single-byte slow path on any special state. Marginal improvement on x86 with SIMD prefilters (branch predictor handles single-byte well). May help more on ARM64 where branch prediction is less aggressive. Reference: Rust hybrid/search.rs:195-221. Stdlib compat: 38/38 PASS.
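A minimal sketch of the unrolled loop, assuming that special states are marked with a tag bit in the state ID. The tag bit, the `dfa` type and the two-state automaton for the literal `ab` are hypothetical; the real lazy DFA also has to handle unknown and dead states in its slow path.

```go
package main

import "fmt"

type StateID uint32

// special tags table entries that must be handled by the slow path (here: match).
const special StateID = 1 << 31

type dfa struct {
	stride    int
	classOf   [256]uint8
	flatTrans []StateID
}

// newDemoDFA recognises the literal "ab". Classes: 'a'=0, 'b'=1, other=2.
func newDemoDFA() *dfa {
	d := &dfa{stride: 3}
	for i := range d.classOf {
		d.classOf[i] = 2
	}
	d.classOf['a'], d.classOf['b'] = 0, 1
	d.flatTrans = []StateID{
		1, 0, 0, // state 0: start
		1, 2 | special, 0, // state 1: seen 'a'
	}
	return d
}

func (d *dfa) step(sid StateID, b byte) StateID {
	return d.flatTrans[int(sid)*d.stride+int(d.classOf[b])]
}

// firstMatchEnd returns the offset just past the first match, or -1.
func (d *dfa) firstMatchEnd(input []byte) int {
	sid, pos := StateID(0), 0
	// Fast path: four transitions per iteration while nothing special appears.
	for pos+4 <= len(input) {
		s1 := d.step(sid, input[pos])
		if s1&special != 0 {
			break
		}
		s2 := d.step(s1, input[pos+1])
		if s2&special != 0 {
			break
		}
		s3 := d.step(s2, input[pos+2])
		if s3&special != 0 {
			break
		}
		s4 := d.step(s3, input[pos+3])
		if s4&special != 0 {
			break
		}
		sid, pos = s4, pos+4
	}
	// Slow path: one byte at a time; re-processes the chunk that hit a special state.
	for pos < len(input) {
		next := d.step(sid, input[pos])
		pos++
		if next&special != 0 {
			return pos
		}
		sid = next
	}
	return -1
}

func main() {
	d := newDemoDFA()
	fmt.Println(d.firstMatchEnd([]byte("xxxxxab"))) // 7
	fmt.Println(d.firstMatchEnd([]byte("aaaa")))    // -1
}
```

The fast loop never commits a chunk that contains a special state, so the slow loop always sees it at the exact byte position.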

Extend flat table optimization from searchFirstAt to all 6 DFA search functions: searchAt, searchEarliestMatch, searchEarliestMatchAnchored, SearchReverse, SearchReverseLimited, IsMatchReverse. Hot loop pattern: ft[int(sid)*stride + classIdx] replaces stateList[id].transitions[class] and eliminates the pointer chase. State struct accessed only in slow path (determinize, word boundary). Kostya benchmark: 2.56s -> 2.28s (about 11% faster). errors pattern: 109ms -> 81ms (better than v0.12.14 baseline 90ms). Stdlib compat: 38/38 PASS.

IsComplete() guard in findIndicesDFA/findIndicesDFAAt blocked prefilter skip-ahead for incomplete prefilters (memmem, Teddy with prefix-only literals). But DFA verifies full pattern at candidate — skip is always safe. This was the root cause of sessions (229ms -> 36ms), api_calls (245ms -> 95ms), post_requests (259ms -> 114ms) regressions. Kostya benchmark total: 2.28s -> 1.62s (FASTER than v0.12.14 baseline 1.80s!). Stdlib compat: 38/38 PASS.

When DFA returns to start state with no match in progress, use prefilter to skip ahead to next candidate instead of byte-by-byte scanning. Applied to searchFirstAt and searchAt (bidirectional DFA path). This is the Rust approach (hybrid/search.rs:232-258): prefilter is called inside the DFA loop when a start state is detected, not externally. peak_hours: 197ms -> 90ms (2.2x faster, gap vs Rust: 9x -> 4x). Kostya total: 1.62s -> 1.38s (15% faster). Stdlib compat: 38/38 PASS.
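The integrated variant can be sketched like this: the prefilter is consulted inside the DFA loop, and only when the current state is the start state. All names are hypothetical (`prefilter`, `findA`, `step`, `search`, the tag bit), the automaton is a hand-written one for the literal `ab`, and `bytes.IndexByte` stands in for `prefilter.Find()`.

```go
package main

import (
	"bytes"
	"fmt"
)

type StateID uint32

const (
	startState StateID = 0
	matchTag   StateID = 1 << 31
)

// prefilter returns the offset of the next candidate at or after pos, or -1.
type prefilter func(h []byte, pos int) int

// findA is a stand-in prefilter: every match of "ab" must start at an 'a'.
func findA(h []byte, pos int) int {
	i := bytes.IndexByte(h[pos:], 'a')
	if i < 0 {
		return -1
	}
	return pos + i
}

// step is a hand-written transition function for the literal "ab".
func step(sid StateID, b byte) StateID {
	switch {
	case b == 'a':
		return 1
	case b == 'b' && sid == 1:
		return 2 | matchTag
	default:
		return startState
	}
}

// search returns the end of the first match and the number of DFA transitions taken.
func search(input []byte, pf prefilter) (end, steps int) {
	sid, pos := startState, 0
	for pos < len(input) {
		if sid == startState && pf != nil {
			// Back in the start state: nothing is in progress, so skip ahead.
			cand := pf(input, pos)
			if cand < 0 {
				return -1, steps
			}
			pos = cand
		}
		next := step(sid, input[pos])
		steps++
		pos++
		if next&matchTag != 0 {
			return pos, steps
		}
		sid = next
	}
	return -1, steps
}

func main() {
	in := []byte("xxxxxxxxab")
	fmt.Println(search(in, nil))   // 10 10
	fmt.Println(search(in, findA)) // 10 2
}
```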

Apply flat table to SearchAtAnchored — called for every prefilter candidate verification in bidirectional DFA path. Eliminates pointer chase in the most frequent DFA hot path. Kostya benchmark: 1.38s -> 1.17s (15% faster). Total improvement vs v0.12.14: 1.80s -> 1.17s (35% faster). Stdlib compat: 38/38 PASS.

…refilterAt

Apply flat table to last 2 remaining functions with old Transition() calls. No more State pointer chase in ANY DFA hot loop. Kostya benchmark: 1.17s -> 1.19s (stable, tokens 116ms->51ms). All DFA search functions now use flatTrans[sid*stride+class]. Stdlib compat: 38/38 PASS.

codecov · 2026-03-24T09:35:09Z

Codecov Report

❌ Patch coverage is 53.36700% with 277 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
dfa/lazy/lazy.go	53.21%	212 Missing and 35 partials ⚠️
nfa/pikevm.go	0.00%	10 Missing and 2 partials ⚠️
dfa/lazy/cache.go	63.33%	7 Missing and 4 partials ⚠️
literal/seq.go	0.00%	4 Missing ⚠️
literal/extractor.go	0.00%	1 Missing and 1 partial ⚠️
meta/find_indices.go	92.30%	1 Missing ⚠️

github-actions · 2026-03-24T09:39:43Z

Benchmark Comparison

Comparing main → PR #151

Summary: geomean 120.0n (main) -> 118.9n (PR #151), -0.95%

⚠️ Potential regressions detected:

Benchmark                                      main           PR #151        Delta
Accelerate/memchr1-4                           109.8n ± ∞     120.7n ± ∞     +9.93% (p=0.008 n=5)
Accelerate/memchr2-4                           210.4n ± ∞     230.2n ± ∞     +9.41% (p=0.008 n=5)
OnePassIsMatch-4                               21.42n ± ∞     21.53n ± ∞     +0.51% (p=0.040 n=5)
geomean                                        33.66n         33.68n         +0.08%
AhoCorasickLargeInput/stdlib_Match_64KB-4      14.42m ± ∞     14.70m ± ∞     +1.91% (p=0.032 n=5)
MatchAnchoredLiteral/medium_match-4            7.893n ± ∞     8.428n ± ∞     +6.78% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

On 386, int(StateID(0xFFFFFFFF)) = -1 (int is 32-bit). getState and IsMatchState used int(id) for slice indexing, causing panic: index out of range [-1]. Fix: check sid >= DeadState before int cast. DeadState (0xFFFFFFFE) and InvalidState (0xFFFFFFFF) are sentinel values not present in stateList/matchFlags.

On 386, int is 32-bit. int(StateID(0xFFFFFFFE)) = -2, causing negative slice index panic in flat table lookups. Added safeOffset() helper using uint arithmetic (always positive). Replaced all 23 occurrences of int(sid)*stride in hot loops. safeOffset inlines — zero overhead on 64-bit.

uint multiply overflows on 386: uint(0xFFFFFFFE)*uint(20) wraps around. Guard with sid >= DeadState check — returns MaxInt so bounds check fails safely. Normal state IDs (small values) take fast path without branch.
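The three 386 fixes above can be reproduced on any platform by forcing 32-bit arithmetic explicitly. This is a sketch: `safeOffset`, `DeadState` and `InvalidState` follow the names in the commit messages, but the bodies are a reconstruction from the description, and `as32`, `wrap32` and `lookup` are hypothetical helpers added for the demonstration.

```go
package main

import (
	"fmt"
	"math"
)

type StateID uint32

const (
	DeadState    StateID = 0xFFFFFFFE
	InvalidState StateID = 0xFFFFFFFF
)

// as32 shows what int(sid) yields when int is 32 bits wide.
func as32(sid StateID) int32 { return int32(sid) }

// wrap32 shows the unguarded unsigned multiply on a 32-bit platform.
func wrap32(sid StateID, stride uint32) uint32 { return uint32(sid) * stride }

// safeOffset returns the flat-table offset for sid. Sentinel states map to
// math.MaxInt so that a later bounds comparison rejects them.
func safeOffset(sid StateID, stride int) int {
	if sid >= DeadState {
		return math.MaxInt
	}
	return int(uint(sid) * uint(stride))
}

// lookup guards the table access with an explicit comparison before adding class.
func lookup(flatTrans []StateID, sid StateID, stride, class int) (StateID, bool) {
	off := safeOffset(sid, stride)
	if off > len(flatTrans)-stride {
		return InvalidState, false
	}
	return flatTrans[off+class], true
}

func main() {
	fmt.Println(as32(InvalidState), as32(DeadState)) // -1 -2
	fmt.Println(wrap32(DeadState, 20))               // 4294967256
	table := make([]StateID, 4*20)
	table[3*20+5] = 7
	fmt.Println(lookup(table, 3, 20, 5))         // 7 true
	fmt.Println(lookup(table, DeadState, 20, 5)) // 4294967295 false
}
```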

kolkov added 11 commits March 23, 2026 19:20

docs: update CHANGELOG for v0.12.18

6851adc

docs: update ROADMAP and CHANGELOG for v0.12.18

d942cfa

kolkov added 4 commits March 24, 2026 12:42

fix: safeOffset guard for DeadState/InvalidState on 386

db7a15e

docs: update README benchmark table and ROADMAP for v0.12.18

4c96632

kolkov merged commit 921d193 into main Mar 24, 2026
8 of 9 checks passed

kolkov deleted the feature/pikevm-skip-ahead branch March 24, 2026 10:25

kolkov mentioned this pull request Mar 24, 2026

fix: NFA candidate loop — partialCoverage instead of IsComplete #150

Closed

kolkov merged 15 commits into main from feature/pikevm-skip-ahead
