feat(nvisy-pattern): metadata, PatternFilter, recursive discovery, engine reorg #153
Merged
Multi-part change to nvisy-pattern (and supporting ontology/engine
wiring) that lands together because the pieces interleave on the
same files:
## Code-quality pass on nvisy-pattern
- Fix phone validator: only reject leading-zero E.164 with explicit `+`;
national-format trunk-zero numbers (UK `020 7946 0958`) now validate.
- Document MM/DD vs DD/MM ambiguity on the date validator.
- Convert deny-list scan from O(k·n) substring loop to a lazily
compiled Aho-Corasick automaton (O(n + matches)).
- Fix overlap-dedup: prefer higher confidence, then tighter span;
full duplicates collapse to the earlier-visited copy.
- Make `PatternEngineError` `pub` so callers can downcast via
`Error::source()`.
- Built-in pattern/dictionary loaders panic on malformed embedded
assets (build-time bugs, not user-time errors).
- Drop `BoxPattern` / `BoxDictionary` aliases.
- `ScanContext` → public-fields struct; `AllowList` collapsed to
`FromIterator + Extend`.
- `SmallVec<[_; 2]>` for `RawMatch::recognition_methods`.
- `DenyList` → `HashMap` (no ordering need).
- Derive `Debug` on `RegexEntry` / `DictEntry`.
- Collapse parallel `values + columns` vecs into `Vec<DictionaryTerm>`.
- Downgrade `scan_entities` tracing span to `trace` level.
- Drop redundant `Display` impl on `PatternEngine`.
- Seal `Pattern` and `Dictionary` traits.
- Flatten lib.rs re-exports.
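The phone-validator rule from the list above (a leading zero is only an error when the number is written with an explicit `+`) can be sketched as follows. This is a minimal illustration, not the crate's code: the function name `is_plausible_phone` and the 7 to 15 digit bounds are assumptions made for the example.

```rust
/// Hypothetical validator illustrating the rule; not the nvisy-pattern API.
/// Assumes 7..=15 digits (E.164 caps numbers at 15 digits; the lower bound is arbitrary).
pub fn is_plausible_phone(input: &str) -> bool {
    let trimmed = input.trim();
    let has_plus = trimmed.starts_with('+');

    // Keep only the digits; spaces, dashes and parentheses are ignored.
    let digits: Vec<char> = trimmed.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.len() < 7 || digits.len() > 15 {
        return false;
    }

    // With an explicit "+", the first digit starts the country code, which is never 0.
    // Without "+", a leading 0 is read as a national trunk prefix and accepted.
    !(has_plus && digits[0] == '0')
}
```

Under these assumptions `020 7946 0958` and `+44 20 7946 0958` are accepted, while `+0 20 7946 0958` is rejected.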
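The overlap-dedup ordering described in the list above (higher confidence first, then tighter span, then earlier-visited copy) can be illustrated with a greedy pass over ranked candidates. This is only a sketch of one way to get that ordering: the `Candidate` type, its fields, and `dedup_overlaps` are hypothetical and do not come from the crate.

```rust
/// Hypothetical match record; spans are half-open `[start, end)` with `start <= end`.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub id: usize,
    pub start: usize,
    pub end: usize,
    pub confidence: f32,
}

/// Keeps one winner per group of overlapping candidates.
/// Ranking: higher confidence, then shorter span, then earlier position in the input.
pub fn dedup_overlaps(matches: &[Candidate]) -> Vec<Candidate> {
    let mut order: Vec<usize> = (0..matches.len()).collect();
    order.sort_by(|&a, &b| {
        let (ma, mb) = (&matches[a], &matches[b]);
        mb.confidence
            .total_cmp(&ma.confidence) // descending confidence
            .then((ma.end - ma.start).cmp(&(mb.end - mb.start))) // tighter span first
            .then(a.cmp(&b)) // earlier-visited copy wins full ties
    });

    // Greedily accept each candidate that does not overlap an already accepted one.
    let mut kept: Vec<usize> = Vec::new();
    for idx in order {
        let m = &matches[idx];
        let overlaps = kept
            .iter()
            .any(|&k| m.start < matches[k].end && matches[k].start < m.end);
        if !overlaps {
            kept.push(idx);
        }
    }

    // Return survivors in their original visit order.
    kept.sort_unstable();
    kept.into_iter().map(|i| matches[i].clone()).collect()
}
```

Sorting once and accepting greedily keeps the tie-breaking rules in a single comparator, which makes the precedence easy to audit.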
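For readers unfamiliar with sealing, the usual Rust idiom is a public trait with a supertrait that lives in a private module, so downstream crates can call the trait but cannot implement it. The sketch below shows that idiom in general terms; the `name` method and `RegexPattern` type are placeholders, and the actual shape of the sealed `Pattern` and `Dictionary` traits is not shown in this PR description.

```rust
// Private module: code outside this crate cannot name `sealed::Sealed`.
mod sealed {
    pub trait Sealed {}
}

/// Public trait that only this crate can implement, because every
/// implementor must also implement the unreachable `sealed::Sealed`.
pub trait Pattern: sealed::Sealed {
    fn name(&self) -> &str;
}

/// Placeholder implementor used only for this illustration.
pub struct RegexPattern {
    name: String,
}

impl RegexPattern {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_owned() }
    }
}

impl sealed::Sealed for RegexPattern {}

impl Pattern for RegexPattern {
    fn name(&self) -> &str {
        &self.name
    }
}
```

Sealing lets the crate add trait methods later without that being a breaking change for external implementors, since there cannot be any.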
## Engine module reorganisation
- `engine/filter/{allow_list,deny_list,scan_context}.rs` — per-scan
configuration callers populate.
- `engine/scan/{entries,pattern_match,phases,dedup}.rs` — internal
matching machinery.
- `engine/{pattern_engine,builder,error}.rs` at the top level — the
public surface.
## Dictionary metadata + sidecars
- New `DictionaryMetadata` with `name`, `description`, `version`
(semver, default `0.0.0`), `languages` (BCP-47), `industries`,
`regions`, `compliance`. Loaded from sibling `<stem>.json` sidecars.
- Recursive discovery via `walkdir` for filesystem loads and a
hand-rolled walker for `include_dir`-embedded builtins.
- Path-based naming for nested dictionaries (`healthcare/drugs.csv`
→ `healthcare/drugs`); sidecar `name` wins verbatim.
- Sidecar parse errors are fatal for builtins, soft-fail for user dirs.
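The naming rule above (path relative to the dictionary root with the extension stripped, unless the sidecar supplies a `name`) can be sketched with `std::path` alone. The helper `dictionary_name` and its signature are assumptions for illustration, not the crate's API.

```rust
use std::path::Path;

/// Hypothetical helper: derives a dictionary name from a path relative to the
/// dictionary root. A sidecar-provided name is used verbatim when present.
pub fn dictionary_name(relative_path: &Path, sidecar_name: Option<&str>) -> String {
    if let Some(name) = sidecar_name {
        return name.to_owned();
    }

    // Strip the extension, then join components with '/' so the result
    // does not depend on the platform's path separator.
    relative_path
        .with_extension("")
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect::<Vec<_>>()
        .join("/")
}
```

With this sketch, `healthcare/drugs.csv` without a sidecar name resolves to `healthcare/drugs`, and any sidecar `name` overrides the path-derived value unchanged.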
## Pattern metadata + PatternFilter
- New nested `metadata` block on every pattern JSON, parsed into
`PatternMetadata` (same fields as dictionaries + `compliance` +
`references`).
- `PatternFilter` (renamed from `DictionaryFilter`) on
`PatternDetection` selects patterns by tag overlap. Applies to
both regex and dictionary-backed patterns. Untagged on an axis =
universal on that axis.
- Recursive pattern discovery in `PatternRegistry::load_dir`.
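The tag-overlap rule above, including "untagged on an axis = universal on that axis", can be sketched as a per-axis check. The types `TagFilter` and `PatternTags`, the choice of three axes, and the assumption that an empty filter axis imposes no constraint are all illustrative and not taken from the crate.

```rust
/// Hypothetical tags carried by a pattern (a subset of the metadata axes).
#[derive(Default)]
pub struct PatternTags {
    pub languages: Vec<String>,
    pub regions: Vec<String>,
    pub compliance: Vec<String>,
}

/// Hypothetical filter with the same axes. Assumption: an empty axis means "no constraint".
#[derive(Default)]
pub struct TagFilter {
    pub languages: Vec<String>,
    pub regions: Vec<String>,
    pub compliance: Vec<String>,
}

/// An axis passes if the filter does not constrain it, the pattern is
/// untagged on it (universal), or the two tag sets share at least one value.
fn axis_matches(wanted: &[String], tagged: &[String]) -> bool {
    wanted.is_empty() || tagged.is_empty() || wanted.iter().any(|w| tagged.contains(w))
}

impl TagFilter {
    /// A pattern is selected only if every axis passes.
    pub fn accepts(&self, tags: &PatternTags) -> bool {
        axis_matches(&self.languages, &tags.languages)
            && axis_matches(&self.regions, &tags.regions)
            && axis_matches(&self.compliance, &tags.compliance)
    }
}
```

Note that under this reading a pattern with no `compliance` tags still passes a `pci-dss` filter, so filtering is only as strict as the tagging is complete.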
## Builtins reorganised + tagged
- Dictionaries: `assets/dictionaries/{finance,general}/`.
- Patterns: `assets/patterns/{contact,credentials,finance,identity,
network,personal}/`.
- All 5 builtin dictionaries and all 28 builtin patterns ship with
metadata sidecars (dicts) or nested metadata blocks (patterns)
populated with reasonable defaults (region, industry, compliance,
language where applicable).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6b9229a to ff033e4
## Summary
Multi-part change to `nvisy-pattern` (and supporting ontology/engine wiring). The pieces interleave on the same files so they land together.
### Code-quality pass on nvisy-pattern
- Drop `BoxPattern` / `BoxDictionary` aliases, collapse `AllowList` insertion APIs to `FromIterator + Extend`, `ScanContext` → public-fields struct, `SmallVec<[_; 2]>` for `RawMatch::recognition_methods`, `DenyList` → `HashMap`.
- `PatternEngineError` is `pub` (downcastable), `RegexEntry` / `DictEntry` derive `Debug`, scan-entities span downgraded to `trace`.
- `Pattern` and `Dictionary` traits are sealed.
### Engine module reorganisation
- `engine/filter/`: per-scan configuration (`AllowList`, `DenyList`, `ScanContext`)
- `engine/scan/`: internal matching machinery (`entries`, `pattern_match`, `phases`, `dedup`)
- `engine/{pattern_engine,builder,error}.rs` at the top level
### Dictionary metadata + sidecars
- `DictionaryMetadata` with `name`, `description`, `version` (semver), `languages` (BCP-47), `industries`, `regions`, `compliance`
- `<stem>.json` sidecar next to each dictionary file
- Recursive discovery: `walkdir` (filesystem) + hand-rolled walker (embedded)
- Path-based naming (`healthcare/drugs.csv` → `healthcare/drugs`); sidecar `name` field wins verbatim
### Pattern metadata + `PatternFilter`
- Nested `metadata` block on every pattern JSON → `PatternMetadata` (same fields as dicts + `compliance` + `references`)
- `PatternFilter` (renamed from `DictionaryFilter`) applies to both regex and dictionary-backed patterns
- Patterns without `languages` pass any language filter
- Recursive discovery in `PatternRegistry::load_dir`
### Builtins reorganised + tagged
- `assets/dictionaries/{finance,general}/`
- `assets/patterns/{contact,credentials,finance,identity,network,personal}/`
## Test plan
- `cargo build --workspace` clean
- `cargo test --workspace`: all ~390 tests pass
- `cargo clippy --workspace --all-targets`: clean
- Apply `PatternFilter { compliance: ["pci-dss"] }` and confirm only PCI-tagged patterns fire
- Put a `.csv` + `.json` sidecar pair into a custom dir, point `load_dir` at it, confirm name/tags resolve
## Stacking
PR is based on `feat/policy-precedence` (PR #152). Once that merges this will auto-rebase onto `main`.
🤖 Generated with Claude Code