nvisy-nlp: lemmatization deferred: decide path when a consumer needs it #154

## Context

While designing `nvisy-nlp` we modeled it after Presidio's `NlpArtifacts`. Presidio's pipeline carries `lemmas` to support one consumer: `LemmaContextAwareEnhancer`, which boosts entity confidence when context keywords (in lemma form) appear near a detected entity.

We are **shipping v1 without lemmas** because:

1. **No current consumer needs them.** `nvisy-pattern::ContextRule` does substring matching today, not lemma matching. Pattern authors compensate by using uninflected nouns in their keyword lists.
2. **No maintained Rust lemmatizer exists.** `nlprule` (the only realistic option) has been unmaintained since Feb 2022. Forks are build-maintenance only; none is a feature successor.
3. **Trait surface absorbs them cleanly later.** Adding `pub lemmas: Option<Vec<String>>` to `NlpArtifacts` is a non-breaking change, provided the struct is `#[non_exhaustive]` or otherwise never built with a struct literal outside the crate.
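
A minimal sketch of how the field could land later without breaking downstream crates. It assumes `NlpArtifacts` is a plain struct marked `#[non_exhaustive]` and built through a constructor; the `tokens` field and all method names here are hypothetical, not the crate's actual API.

```rust
/// Hypothetical shape of `NlpArtifacts`; only the `lemmas` field comes from this issue.
/// `#[non_exhaustive]` stops downstream crates from using struct literals,
/// which is what keeps a new public field from being a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, Default)]
pub struct NlpArtifacts {
    /// Surface tokens (assumed field, for illustration only).
    pub tokens: Vec<String>,
    /// `None` means no lemmatizer ran; `Some` is expected to align with `tokens`.
    pub lemmas: Option<Vec<String>>,
}

impl NlpArtifacts {
    /// Existing callers keep using this constructor and never see `lemmas`.
    pub fn new(tokens: Vec<String>) -> Self {
        Self { tokens, lemmas: None }
    }

    /// Opt-in builder step for a pipeline that has a lemma source.
    pub fn with_lemmas(mut self, lemmas: Vec<String>) -> Self {
        self.lemmas = Some(lemmas);
        self
    }

    /// Lemma at index `i` when lemmas are present, otherwise the surface token.
    pub fn lemma_or_token(&self, i: usize) -> Option<&str> {
        match &self.lemmas {
            Some(lemmas) => lemmas.get(i).map(String::as_str),
            None => self.tokens.get(i).map(String::as_str),
        }
    }
}
```

A consumer that calls `lemma_or_token` degrades to surface forms when no lemmatizer is configured, so it can be written before the lemma source is chosen.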

## Trigger to revisit

Pick this issue up when *any* of the following lands:

- A consumer feature wants lemma-aware context boosting (e.g., a `LemmaContextEnhancer` analogue to Presidio's, or upgrading `ContextRule` to lemmatize both the window and keywords before matching)
- A consumer wants morphology normalization for any other reason (search, dedup, NER post-processing)
- Someone reports false negatives in `ContextRule` matching that lemmatization would fix
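
To make the false-negative case concrete, here is a rough sketch that contrasts substring matching with lemma-normalized matching. These functions are hypothetical stand-ins, not the real `nvisy-pattern::ContextRule` implementation, and `toy_lemma` is a two-entry placeholder for whichever lemma source is eventually chosen.

```rust
/// Stand-in for today's behaviour: case-insensitive substring search over the window.
pub fn substring_match(window: &str, keywords: &[&str]) -> bool {
    let window = window.to_lowercase();
    keywords
        .iter()
        .any(|k| window.contains(k.to_lowercase().as_str()))
}

/// Lemma-aware variant: lemmatize both the window tokens and the keywords,
/// then compare whole tokens.
pub fn lemma_match<F>(window: &str, keywords: &[&str], lemma: F) -> bool
where
    F: Fn(&str) -> String,
{
    let keys: Vec<String> = keywords
        .iter()
        .map(|k| lemma(&k.to_lowercase()))
        .collect();
    window
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| lemma(&t.to_lowercase()))
        .any(|t| keys.contains(&t))
}

/// Placeholder lemmatizer for illustration only.
pub fn toy_lemma(token: &str) -> String {
    match token {
        "paid" | "pays" => "pay".to_string(),
        other => other.to_string(),
    }
}
```

With the keyword list `["pay"]`, the window "He paid with his card" is missed by `substring_match` because "paid" does not contain "pay", while `lemma_match` with `toy_lemma` finds it. A report of that shape is the kind of evidence the last trigger refers to.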

## Options when the time comes

Listed in increasing cost/scope. Pick the smallest that fits the use case.

### (a) Static lookup-table lemmatizer (`nvisy-lemma` micro-crate or feature in `nvisy-nlp`)

- `HashMap<&'static str, &'static str>` baked at compile time
- ~5,000 English entries cover ~80% of common morphology (`ran`→`run`, `better`→`good`, plural→singular, common verb forms)
- No runtime deps, no dead-crate risk, no Python
- Limited coverage: doesn't handle productive morphology, doesn't generalize beyond the table
- **Recommended default** if the consumer is just keyword normalization
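
A minimal sketch of option (a), under a few assumptions. It swaps the `HashMap` for a sorted `static` slice with binary search, because a `std` `HashMap` cannot be built in a `static` initializer without a helper crate or lazy initialization. The five entries are illustrative, and input is assumed to be lowercased already.

```rust
/// Illustrative table of (inflected form, lemma) pairs.
/// Must stay sorted by the inflected form for the binary search to be valid.
static LEMMAS: &[(&str, &str)] = &[
    ("better", "good"),
    ("children", "child"),
    ("ran", "run"),
    ("running", "run"),
    ("went", "go"),
];

/// Returns the lemma for a lowercased token, or the token itself when the
/// table has no entry (the "limited coverage" case).
pub fn lemmatize(token: &str) -> &str {
    match LEMMAS.binary_search_by_key(&token, |&(form, _)| form) {
        Ok(i) => LEMMAS[i].1,
        Err(_) => token,
    }
}

/// Cheap guard suitable for a unit test: the table must be strictly sorted.
pub fn table_is_sorted() -> bool {
    LEMMAS.windows(2).all(|w| w[0].0 < w[1].0)
}
```

Unknown words fall through unchanged, so keyword normalization never gets worse than the current surface-form behaviour; it only fails to improve for forms missing from the table.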

### (b) Fork `nlprule` into the workspace as `nvisy-nlprule`

- ~10kloc Rust + a model format you'd own
- Full POS + lemma + chunking for en/de/es
- You sign up for maintenance, but the code itself is functional
- Right if multiple features want what nlprule offers (POS tagging, chunking, not just lemmas)

### (c) PyO3 bridge to spaCy via a separate `nvisy-nlp-spacy` crate

- Opt-in feature flag; never the default
- Full spaCy: tokenizer + lemmatizer + POS + dep parse + NER
- Costs (significant):
  - Python interpreter at build + runtime
  - GIL serializes NER calls across concurrent docs (throughput hit)
  - ~500MB-1GB per model
  - Deployment story changes (Docker bloat, no static binary, Python in CI)
  - Cross-compilation hurts
- Right only if more than just lemmas is needed *and* the deployment environment can absorb Python

## Recommendation

**(a) when a single consumer needs lemmas for keyword normalization.** Bump to (b) if multiple Rust-native features want POS/chunking too. Reach for (c) only if the consumer needs dep parse or coref or something else Rust genuinely cannot provide.

## Out of scope for this issue

Adding lemmas without a real consumer. Don't extend `NlpArtifacts` until the consumer is on a branch and you can verify the lemma source actually solves their problem.
