Context
While designing `nvisy-nlp` we modeled it after Presidio's `NlpArtifacts`. Presidio's pipeline carries `lemmas` to support one consumer: `LemmaContextAwareEnhancer`, which boosts entity confidence when context keywords (in lemma form) appear near a detected entity.
We are shipping v1 without lemmas because:
- No current consumer needs them. `nvisy-pattern::ContextRule` does substring matching today, not lemma matching. Pattern authors compensate by using uninflected nouns in their keyword lists.
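- No maintained Rust lemmatizer exists. `nlprule` (the only realistic option) has been unmaintained since Feb 2022. Existing forks only keep the build working; none is a feature successor.
- Trait surface absorbs them cleanly later. Adding `pub lemmas: Option<Vec<String>>` to `NlpArtifacts` is a non-breaking change as long as the struct is `#[non_exhaustive]` or is only built through a constructor, so downstream code never uses a struct literal.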
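To make the last point concrete, here is a minimal sketch of how an optional `lemmas` field could sit on the artifacts struct. The `tokens` field, the constructor, and the `with_lemmas` and `normalized` helpers are assumptions for illustration, not the actual `nvisy-nlp` API.

```rust
/// Hypothetical shape of `NlpArtifacts`; everything except the idea of an
/// optional `lemmas` field is an assumption made for this sketch.
#[non_exhaustive] // keeps later field additions non-breaking for other crates
#[derive(Debug, Clone, Default)]
pub struct NlpArtifacts {
    /// Surface tokens, one entry per token.
    pub tokens: Vec<String>,
    /// One lemma per token when a lemmatizer ran, `None` otherwise.
    pub lemmas: Option<Vec<String>>,
}

impl NlpArtifacts {
    /// Builds artifacts without lemmas (the v1 behaviour).
    pub fn new(tokens: Vec<String>) -> Self {
        Self { tokens, lemmas: None }
    }

    /// Attaches lemmas produced by whichever backend is chosen later.
    pub fn with_lemmas(mut self, lemmas: Vec<String>) -> Self {
        self.lemmas = Some(lemmas);
        self
    }

    /// Lemma of token `i` when lemmas are present, else the surface token.
    /// Returns `None` when `i` is out of range for the list being read.
    pub fn normalized(&self, i: usize) -> Option<&str> {
        match &self.lemmas {
            Some(lemmas) => lemmas.get(i).map(String::as_str),
            None => self.tokens.get(i).map(String::as_str),
        }
    }
}
```

Consumers that call something like `normalized` keep working before and after a lemmatizer is wired in, which is what makes deferring the field cheap.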
Trigger to revisit
Pick this issue up when any of the following lands:
- A consumer feature wants lemma-aware context boosting (e.g., a `LemmaContextEnhancer` analogue to Presidio's, or upgrading `ContextRule` to lemmatize both the window and keywords before matching)
- A consumer wants morphology normalization for any other reason (search, dedup, NER post-processing)
- Someone reports false negatives in `ContextRule` matching that lemmatization would fix
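The kind of false negative meant by the last trigger can be sketched as follows. `substring_match` is a hypothetical stand-in for what `ContextRule` does today, `lemma_match` is one possible lemma-aware variant, and `toy_lemma` is a two-entry placeholder lemmatizer; none of these names come from the codebase.

```rust
/// Hypothetical stand-in for substring-based context matching:
/// true when any keyword occurs anywhere in the window, ignoring case.
pub fn substring_match(window: &str, keywords: &[&str]) -> bool {
    let window = window.to_lowercase();
    keywords.iter().any(|k| window.contains(&k.to_lowercase()))
}

/// Lemma-aware variant: lemmatize both the window tokens and the keywords,
/// then compare whole tokens instead of substrings.
pub fn lemma_match(window: &str, keywords: &[&str], lemma: impl Fn(&str) -> String) -> bool {
    let keys: Vec<String> = keywords.iter().map(|k| lemma(&k.to_lowercase())).collect();
    window
        .split(|c: char| !c.is_alphanumeric()) // naive tokenizer, for illustration only
        .filter(|t| !t.is_empty())
        .map(|t| lemma(&t.to_lowercase()))
        .any(|t| keys.contains(&t))
}

/// Toy lemmatizer with two hard-coded entries; unknown words pass through.
pub fn toy_lemma(word: &str) -> String {
    match word {
        "paid" => "pay".to_string(),
        "cards" => "card".to_string(),
        other => other.to_string(),
    }
}
```

With the keyword `pay`, `substring_match("she paid with 4242", &["pay"])` misses because `paid` does not contain `pay`, while `lemma_match` with `toy_lemma` finds it. Regular plurals such as `cards` already match the keyword `card` by substring, which is why uninflected nouns work as a stopgap.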
Options when the time comes
Listed in increasing cost/scope. Pick the smallest that fits the use case.
(a) Static lookup-table lemmatizer (`nvisy-lemma` micro-crate or feature in `nvisy-nlp`)
- `HashMap<&'static str, &'static str>` baked at compile time
- ~5,000 English entries cover an estimated ~80% of common morphology (`ran`→`run`, `better`→`good`, plural→singular, common verb forms)
- No runtime deps, no dead-crate risk, no Python
- Limited coverage: doesn't handle productive morphology, doesn't generalize beyond the table
- Recommended default if the consumer is just keyword normalization
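A rough sketch of option (a), using only the standard library. The entries and the `lemmatize` name are illustrative assumptions. The entry list here is a compile-time constant, but the map itself is built lazily on first use; a perfect-hash crate such as `phf` could make the map itself a compile-time constant.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Illustrative entries only; a real table would be generated from a lexicon.
const ENTRIES: &[(&str, &str)] = &[
    ("ran", "run"),
    ("running", "run"),
    ("better", "good"),
    ("children", "child"),
    ("accounts", "account"),
];

/// Builds the lookup map once, on first access, from the constant entry list.
fn table() -> &'static HashMap<&'static str, &'static str> {
    static TABLE: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    TABLE.get_or_init(|| ENTRIES.iter().copied().collect())
}

/// Returns the lemma for an already-lowercased word, or the word itself
/// when the table has no entry for it.
pub fn lemmatize(word: &str) -> &str {
    table().get(word).copied().unwrap_or(word)
}
```

Words outside the table pass through unchanged, so `lemmatize("tokenizing")` returns `tokenizing`. That is the coverage limit noted above: nothing generalizes beyond the listed entries.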
(b) Fork `nlprule` into the workspace as `nvisy-nlprule`
- ~10kloc Rust + a model format you'd own
- Full POS + lemma + chunking for en/de/es
- You sign up for maintenance, but the code itself is functional
- Right if multiple features want what nlprule offers (POS tagging, chunking, not just lemmas)
(c) PyO3 bridge to spaCy via a separate `nvisy-nlp-spacy` crate
- Opt-in feature flag; never the default
- Full spaCy: tokenizer + lemmatizer + POS + dep parse + NER
- Costs (significant):
  - Python interpreter at build + runtime
  - GIL serializes NER calls across concurrent docs (throughput hit)
  - ~500MB-1GB per model
  - Deployment story changes (Docker bloat, no static binary, Python in CI)
  - Cross-compilation hurts
- Right only if more than just lemmas is needed and the deployment environment can absorb Python
Recommendation
Choose (a) when a single consumer needs lemmas for keyword normalization. Bump to (b) if multiple Rust-native features want POS tagging or chunking too. Reach for (c) only if the consumer needs dependency parsing, coreference resolution, or something else Rust genuinely cannot provide.
Out of scope for this issue
Adding lemmas without a real consumer. Don't extend `NlpArtifacts` until the consumer is on a branch and you can verify the lemma source actually solves their problem.