
nvisy-nlp: lemmatization deferred: decide path when a consumer needs it #154

@martsokha

Context

While designing `nvisy-nlp` we modeled it after Presidio's `NlpArtifacts`. Presidio's pipeline carries `lemmas` to support one consumer: `LemmaContextAwareEnhancer`, which boosts entity confidence when context keywords (in lemma form) appear near a detected entity.

We are shipping v1 without lemmas because:

  1. No current consumer needs them. `nvisy-pattern::ContextRule` does substring matching today, not lemma matching. Pattern authors compensate by using uninflected nouns in their keyword lists.
  2. No maintained Rust lemmatizer exists. `nlprule` (the only realistic option) has been unmaintained since Feb 2022. Existing forks only keep the build working; none is a feature successor.
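  3. The trait surface absorbs them cleanly later. Adding `pub lemmas: Option<Vec>` to `NlpArtifacts` is a non-breaking change, provided downstream crates cannot construct or exhaustively destructure the struct (for example, because it is marked `#[non_exhaustive]`).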
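
To make point 3 concrete, here is a minimal sketch of how an optional lemma field could sit on the struct. Everything except the `NlpArtifacts` name and the `lemmas` field is an assumption for illustration: the `tokens` field, the `String` element type, the constructor, and the helper methods are hypothetical and not taken from the actual crate.

```rust
/// Hypothetical shape of the artifacts struct. Only the `lemmas` field is
/// discussed in this issue; `tokens` and the element types are assumptions.
#[derive(Debug, Clone, Default)]
#[non_exhaustive] // keeps later field additions non-breaking for downstream crates
pub struct NlpArtifacts {
    /// Surface tokens in document order.
    pub tokens: Vec<String>,
    /// One lemma per token when a lemmatizer ran, `None` otherwise.
    pub lemmas: Option<Vec<String>>,
}

impl NlpArtifacts {
    /// Builds artifacts without lemmas, which is the v1 behavior.
    pub fn new(tokens: Vec<String>) -> Self {
        Self { tokens, lemmas: None }
    }

    /// Attaches lemmas; panics if the count does not match the token count.
    pub fn with_lemmas(mut self, lemmas: Vec<String>) -> Self {
        assert_eq!(lemmas.len(), self.tokens.len(), "one lemma per token");
        self.lemmas = Some(lemmas);
        self
    }

    /// Returns the lemma at `i` when lemmas are present, else the surface token.
    pub fn lemma_or_token(&self, i: usize) -> Option<&str> {
        match &self.lemmas {
            Some(lemmas) => lemmas.get(i).map(String::as_str),
            None => self.tokens.get(i).map(String::as_str),
        }
    }
}
```

A consumer written against `lemma_or_token` would keep working whether or not a lemma source is wired in, which is the property the deferral relies on.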

Trigger to revisit

Pick this issue up when any of the following lands:

  • A consumer feature wants lemma-aware context boosting (e.g., a `LemmaContextEnhancer` analogue to Presidio's, or upgrading `ContextRule` to lemmatize both the window and keywords before matching)
  • A consumer wants morphology normalization for any other reason (search, dedup, NER post-processing)
  • Someone reports false negatives in `ContextRule` matching that lemmatization would fix
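
For triage, the kind of false negative meant in the last bullet can be reproduced with a small stand-in for substring matching. This is a sketch only: `substring_context_match` is a hypothetical function, not the real `ContextRule` implementation, and it assumes case-insensitive matching.

```rust
/// Hypothetical stand-in for substring-based context matching: returns true
/// if any keyword occurs anywhere in the lowercased context window.
pub fn substring_context_match(window: &str, keywords: &[&str]) -> bool {
    let window = window.to_lowercase();
    keywords
        .iter()
        .any(|kw| window.contains(kw.to_lowercase().as_str()))
}
```

With this function, the keyword `card` matches the window `two cards were issued` because regular plurals contain the base noun, which is why uninflected nouns work as keywords today. The keyword `pay` does not match `amount paid in full`, and `run` does not match `she ran the account`. Irregular forms like these are the reports that should reopen this issue.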

Options when the time comes

Listed in increasing cost/scope. Pick the smallest that fits the use case.

(a) Static lookup-table lemmatizer (`nvisy-lemma` micro-crate or feature in `nvisy-nlp`)

  • `HashMap<&'static str, &'static str>` baked at compile time
  • ~5,000 English entries cover ~80% of common morphology (`ran`→`run`, `better`→`good`, plural→singular, common verb forms)
  • No runtime deps, no dead-crate risk, no Python
  • Limited coverage: doesn't handle productive morphology, doesn't generalize beyond the table
  • Recommended default if the consumer is just keyword normalization
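
A minimal sketch of option (a), under stated assumptions: the function names, the five sample entries, and the tokenization rule are all hypothetical. A `std::collections::HashMap` cannot be built in a `const` context, so this version builds the map once on first use from a static slice; a compile-time perfect-hash crate would be one way to get a truly baked table.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Illustrative entries only; a real table would hold a few thousand pairs.
static LEMMA_PAIRS: &[(&str, &str)] = &[
    ("ran", "run"),
    ("running", "run"),
    ("better", "good"),
    ("cards", "card"),
    ("paid", "pay"),
];

/// Builds the lookup map on first call and reuses it afterwards.
fn lemma_table() -> &'static HashMap<&'static str, &'static str> {
    static TABLE: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    TABLE.get_or_init(|| LEMMA_PAIRS.iter().copied().collect())
}

/// Lowercases the word and looks it up; words missing from the table fall
/// back to their lowercased surface form.
pub fn lemmatize(word: &str) -> String {
    let lower = word.to_lowercase();
    match lemma_table().get(lower.as_str()) {
        Some(lemma) => (*lemma).to_string(),
        None => lower,
    }
}

/// Lemmatizes both the window tokens and the keywords, then compares whole
/// tokens instead of substrings.
pub fn lemma_context_match(window: &str, keywords: &[&str]) -> bool {
    let keyword_lemmas: Vec<String> = keywords.iter().map(|kw| lemmatize(kw)).collect();
    window
        .split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty())
        .map(lemmatize)
        .any(|lemma| keyword_lemmas.contains(&lemma))
}
```

In this sketch `lemma_context_match("amount paid in full", &["pay"])` matches because both sides normalize to `pay`. Any word outside the table is compared as-is, which is the coverage limit noted in the list above.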

(b) Fork `nlprule` into the workspace as `nvisy-nlprule`

  • ~10kloc Rust + a model format you'd own
  • Full POS + lemma + chunking for en/de/es
  • You sign up for maintenance, but the code itself is functional
  • Right if multiple features want what nlprule offers (POS tagging, chunking, not just lemmas)

(c) PyO3 bridge to spaCy via a separate `nvisy-nlp-spacy` crate

  • Opt-in feature flag; never the default
  • Full spaCy: tokenizer + lemmatizer + POS + dep parse + NER
  • Costs (significant):
    • Python interpreter at build + runtime
    • GIL serializes NER calls across concurrent docs (throughput hit)
    • ~500MB-1GB per model
    • Deployment story changes (Docker bloat, no static binary, Python in CI)
    • Cross-compilation hurts
  • Right only if more than just lemmas is needed and the deployment environment can absorb Python

Recommendation

Choose (a) when a single consumer needs lemmas for keyword normalization. Move to (b) if multiple Rust-native features also want POS tagging or chunking. Reach for (c) only if the consumer needs dependency parsing, coreference resolution, or something else Rust genuinely cannot provide.

Out of scope for this issue

Adding lemmas without a real consumer. Don't extend `NlpArtifacts` until the consumer is on a branch and you can verify the lemma source actually solves their problem.
