feat(nvisy-pattern): metadata, PatternFilter, recursive discovery, engine reorg #153

Merged
martsokha merged 1 commit into main from feat/pattern-metadata-and-filtering on May 20, 2026

Conversation

@martsokha
Member

Summary

Multi-part change to nvisy-pattern (and supporting ontology/engine wiring). The pieces interleave on the same files so they land together.

Code-quality pass on nvisy-pattern

  • Bugs: phone validator (national-format trunk-zero now valid), dedup correctness (prefer higher confidence, then tighter span), deny-list scan O(n + matches) via Aho-Corasick instead of O(k·n) substring loop.
  • API hygiene: drop BoxPattern/BoxDictionary aliases, collapse AllowList insertion APIs to FromIterator + Extend, ScanContext → public-fields struct, SmallVec<[_; 2]> for RawMatch::recognition_methods, DenyList → HashMap.
  • Diagnostics: PatternEngineError is pub (downcastable), RegexEntry/DictEntry derive Debug, scan-entities span downgraded to trace.
  • Strictness: builtin pattern/dictionary loaders panic on malformed embedded assets (build-time bugs).
  • Sealing: Pattern and Dictionary traits are sealed.
  • Lib re-exports flattened.
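
To make the dedup ordering concrete, here is a minimal sketch of the rule as described in this PR: higher confidence wins, ties go to the tighter span, and a full duplicate loses to the copy visited first. `SpanMatch`, `beats`, and `dedup_overlaps` are hypothetical names for illustration and are not taken from the crate; the greedy loop is an assumption about how the rule could be applied, not the actual implementation.

```rust
/// Hypothetical stand-in for a raw match; not the crate's real type.
#[derive(Debug, Clone, PartialEq)]
struct SpanMatch {
    start: usize,
    end: usize, // exclusive
    confidence: f64,
    label: &'static str,
}

impl SpanMatch {
    fn width(&self) -> usize {
        self.end - self.start
    }

    fn overlaps(&self, other: &SpanMatch) -> bool {
        self.start < other.end && other.start < self.end
    }
}

/// True when `candidate` should replace an already-kept overlapping match.
fn beats(candidate: &SpanMatch, kept: &SpanMatch) -> bool {
    if candidate.confidence != kept.confidence {
        return candidate.confidence > kept.confidence;
    }
    // Equal confidence: the tighter span wins. Equal width as well means a
    // full tie, so the earlier-visited (kept) copy stays.
    candidate.width() < kept.width()
}

/// Greedy overlap dedup in visit order (an assumed strategy).
fn dedup_overlaps(matches: &[SpanMatch]) -> Vec<SpanMatch> {
    let mut kept: Vec<SpanMatch> = Vec::new();
    for m in matches {
        // Skip the candidate if any overlapping kept match is not beaten.
        if kept.iter().any(|k| k.overlaps(m) && !beats(m, k)) {
            continue;
        }
        // Otherwise it beats everything it overlaps: evict those, keep it.
        kept.retain(|k| !k.overlaps(m));
        kept.push(m.clone());
    }
    kept
}
```

With equal confidence, a match over `2..6` replaces an earlier one over `0..10`, and a second identical `2..6` match is then dropped in favour of the first.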

Engine module reorganisation

  • engine/filter/ — per-scan configuration (AllowList, DenyList, ScanContext)
  • engine/scan/ — internal matching machinery (entries, pattern_match, phases, dedup)
  • engine/{pattern_engine,builder,error}.rs at the top level

Dictionary metadata + sidecars

  • DictionaryMetadata with name, description, version (semver), languages (BCP-47), industries, regions, compliance
  • <stem>.json sidecar next to each dictionary file
  • Recursive discovery via walkdir (filesystem) + hand-rolled walker (embedded)
  • Path-based naming for nested dicts (healthcare/drugs.csv → healthcare/drugs); sidecar name field wins verbatim
  • Sidecar parse errors: fatal for builtins, soft-fail for user dirs

Pattern metadata + PatternFilter

  • Nested metadata block on every pattern JSON → PatternMetadata (same fields as dicts + compliance + references)
  • PatternFilter (renamed from DictionaryFilter) applies to both regex and dictionary-backed patterns
  • Untagged on an axis = universal on that axis — patterns without languages pass any language filter
  • Recursive pattern discovery in PatternRegistry::load_dir
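
The "untagged = universal" rule can be sketched as a per-axis check. The types below (`TagSketch`, `FilterSketch`) are hypothetical and cover only two of the axes; the real `PatternFilter` and `PatternMetadata` fields may differ. Treating an empty filter axis as "no constraint" is also an assumption, since the PR text does not state it.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of a pattern's metadata tags.
#[derive(Debug, Default)]
struct TagSketch {
    languages: BTreeSet<String>,
    compliance: BTreeSet<String>,
}

/// Hypothetical filter over the same two axes.
#[derive(Debug, Default)]
struct FilterSketch {
    languages: BTreeSet<String>,
    compliance: BTreeSet<String>,
}

/// One axis passes when the filter does not constrain it (assumption),
/// when the pattern is untagged on it (universal, per the PR), or when
/// the two tag sets overlap.
fn axis_passes(wanted: &BTreeSet<String>, tagged: &BTreeSet<String>) -> bool {
    wanted.is_empty() || tagged.is_empty() || !wanted.is_disjoint(tagged)
}

impl FilterSketch {
    /// A pattern is selected only if every axis passes.
    fn matches(&self, tags: &TagSketch) -> bool {
        axis_passes(&self.languages, &tags.languages)
            && axis_passes(&self.compliance, &tags.compliance)
    }
}

/// Small helper to build a tag set from string literals.
fn set(items: &[&str]) -> BTreeSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}
```

Under this sketch a filter with `compliance: set(&["pci-dss"])` selects patterns tagged `pci-dss` and patterns with no compliance tags at all, and rejects patterns tagged only with other regimes.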

Builtins reorganised + tagged

  • Dictionaries: assets/dictionaries/{finance,general}/
  • Patterns: assets/patterns/{contact,credentials,finance,identity,network,personal}/
  • All 5 dicts + 28 patterns ship with metadata populated (region/industry/compliance/language)

Test plan

  • cargo build --workspace clean
  • cargo test --workspace — all ~390 tests pass
  • cargo clippy --workspace --all-targets — clean
  • manual smoke: run a pipeline with PatternFilter { compliance: ["pci-dss"] } and confirm only PCI-tagged patterns fire
  • manual smoke: drop a user-provided .csv + .json sidecar pair into a custom dir, point load_dir at it, confirm name/tags resolve

Stacking

PR is based on feat/policy-precedence (PR #152). Once that merges this will auto-rebase onto main.

🤖 Generated with Claude Code

@martsokha martsokha self-assigned this May 20, 2026
Base automatically changed from feat/policy-precedence to main May 20, 2026 03:03
@martsokha martsokha added the pattern nvisy-pattern: regex and dictionary detection label May 20, 2026
…sive discovery, engine reorg

Multi-part change to nvisy-pattern (and supporting ontology/engine
wiring) that lands together because the pieces interleave on the
same files:

## Code-quality pass on nvisy-pattern
- Fix phone validator: only reject leading-zero E.164 with explicit `+`;
  national-format trunk-zero numbers (UK `020 7946 0958`) now validate.
- Document MM/DD vs DD/MM ambiguity on the date validator.
- Convert deny-list scan from O(k·n) substring loop to a lazily
  compiled Aho-Corasick automaton (O(n + matches)).
- Fix overlap-dedup: prefer higher confidence, then tighter span;
  full duplicates collapse to the earlier-visited copy.
- Make `PatternEngineError` `pub` so callers can downcast via
  `Error::source()`.
- Built-in pattern/dictionary loaders panic on malformed embedded
  assets (build-time bugs, not user-time errors).
- Drop `BoxPattern` / `BoxDictionary` aliases.
- `ScanContext` → public-fields struct; `AllowList` collapsed to
  `FromIterator + Extend`.
- `SmallVec<[_; 2]>` for `RawMatch::recognition_methods`.
- `DenyList` → `HashMap` (no ordering need).
- Derive `Debug` on `RegexEntry` / `DictEntry`.
- Collapse parallel `values + columns` vecs into `Vec<DictionaryTerm>`.
- Downgrade `scan_entities` tracing span to `trace` level.
- Drop redundant `Display` impl on `PatternEngine`.
- Seal `Pattern` and `Dictionary` traits.
- Flatten lib.rs re-exports.
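
The phone-validator fix can be illustrated with a small sketch of the leading-zero rule alone. `passes_leading_zero_rule` is a hypothetical helper, not the crate's validator, and it deliberately ignores every other check (length, allowed separators) a real validator would perform.

```rust
/// Hypothetical check for the leading-zero rule only: a number written
/// with an explicit `+` is in international form, where country codes
/// never start with 0, so a leading zero is rejected. Without `+` the
/// number is treated as national format, where a trunk-prefix zero is
/// ordinary.
fn passes_leading_zero_rule(input: &str) -> bool {
    let trimmed = input.trim();
    let has_plus = trimmed.starts_with('+');
    // First digit after any `+`, spaces, or punctuation.
    match trimmed.chars().find(|c| c.is_ascii_digit()) {
        None => false,           // no digits at all
        Some('0') => !has_plus,  // trunk zero is fine only without `+`
        Some(_) => true,
    }
}
```

Under this rule `020 7946 0958` and `+44 20 7946 0958` both pass, while `+0 20 7946 0958` is rejected.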

## Engine module reorganisation
- `engine/filter/{allow_list,deny_list,scan_context}.rs` — per-scan
  configuration callers populate.
- `engine/scan/{entries,pattern_match,phases,dedup}.rs` — internal
  matching machinery.
- `engine/{pattern_engine,builder,error}.rs` at the top level — the
  public surface.

## Dictionary metadata + sidecars
- New `DictionaryMetadata` with `name`, `description`, `version`
  (semver, default `0.0.0`), `languages` (BCP-47), `industries`,
  `regions`, `compliance`. Loaded from sibling `<stem>.json` sidecars.
- Recursive discovery via `walkdir` for filesystem loads and a
  hand-rolled walker for `include_dir`-embedded builtins.
- Path-based naming for nested dictionaries (`healthcare/drugs.csv`
  → `healthcare/drugs`); sidecar `name` wins verbatim.
- Sidecar parse errors are fatal for builtins, soft-fail for user dirs.
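
A minimal sketch of the naming rule, assuming the loader has the dictionary's path relative to the scanned root and an optional `name` already parsed from the sidecar. `dictionary_name` is a hypothetical helper, not a function from the crate.

```rust
use std::path::Path;

/// Hypothetical helper: resolve a dictionary's name.
/// A sidecar `name` wins verbatim; otherwise the name is the relative
/// path with its extension stripped, joined with `/`.
fn dictionary_name(relative: &Path, sidecar_name: Option<&str>) -> String {
    if let Some(name) = sidecar_name {
        return name.to_string();
    }
    relative
        .with_extension("") // healthcare/drugs.csv -> healthcare/drugs
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect::<Vec<String>>()
        .join("/") // keep `/` as the separator regardless of platform
}
```

Joining the components explicitly keeps names stable across platforms, so a nested dictionary resolves to `healthcare/drugs` even where the native separator is a backslash.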

## Pattern metadata + PatternFilter
- New nested `metadata` block on every pattern JSON, parsed into
  `PatternMetadata` (same fields as dictionaries + `compliance` +
  `references`).
- `PatternFilter` (renamed from `DictionaryFilter`) on
  `PatternDetection` selects patterns by tag overlap. Applies to
  both regex and dictionary-backed patterns. Untagged on an axis =
  universal on that axis.
- Recursive pattern discovery in `PatternRegistry::load_dir`.

## Builtins reorganised + tagged
- Dictionaries: `assets/dictionaries/{finance,general}/`.
- Patterns: `assets/patterns/{contact,credentials,finance,identity,
  network,personal}/`.
- All 5 builtin dictionaries and all 28 builtin patterns ship with
  metadata sidecars (dicts) or nested metadata blocks (patterns)
  populated with reasonable defaults (region, industry, compliance,
  language where applicable).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@martsokha martsokha force-pushed the feat/pattern-metadata-and-filtering branch from 6b9229a to ff033e4 Compare May 20, 2026 03:07
@martsokha martsokha added the feat request for or implementation of a new feature label May 20, 2026
@martsokha martsokha merged commit bc434bb into main May 20, 2026
5 checks passed
@martsokha martsokha deleted the feat/pattern-metadata-and-filtering branch May 20, 2026 03:12

Labels

feat (request for or implementation of a new feature)
pattern (nvisy-pattern: regex and dictionary detection)