
feat(nvisy-nlp): new NLP crate + nvisy-engine wiring #155

Merged
martsokha merged 13 commits into main from feat/nvisy-nlp on May 20, 2026
Conversation

@martsokha
Member

@martsokha martsokha commented May 20, 2026

Summary

New nvisy-nlp crate for NLP — NER, language detection, and tokenization — plus the nvisy-engine wiring that consumes it.

nvisy-nlp

Composable trait surface (NerBackend + LanguageDetector + Tokenizer) with a composite NlpEngine that orchestrates them in Presidio order (detect language → run NER with hint → tokenize → derive keywords). Second entrypoint analyze_in_language(text, lang) bypasses detection when the caller already knows the language.

Trait surface is transport-agnostic. v1 ships local-by-default implementations; LLM-mediated NER lives in nvisy-llm by deliberate crate split, not by trait restriction. Any third-party backend can implement NerBackend over any transport.

v1 implementations:

  • OrtNerBackend — HuggingFace token-classification via ort (ONNX Runtime, load-dynamic). Real per-span softmax confidence. Truncates inputs at max_sequence_length. Validates language hints against supported_languages. Inference dispatched via tokio::task::spawn_blocking so it doesn't starve the async runtime.
  • LinguaLanguageDetector — lingua-rs, real confidence scores, multilingual segmentation, two constructors (for_languages recommended, for_all_languages for unknown).
  • UnicodeTokenizer — unicode-segmentation based, model-free.
  • HfTokenizer — wraps tokenizers::Tokenizer for ORT NER alignment.
  • NoopNerBackend — for tests.

Key surface decisions:

  • NerBackend + LanguageDetector required at build; Tokenizer optional. Builder uses type-state so calling .build() without both required components is a compile error, not a runtime panic.
  • Tokenizer::tokenize returns Result<Vec<Token>>.
  • LanguageDetector::detect returns Result<Option<LanguageDetection>> — splits "no answer" (Ok(None)) from real failure (Err).
  • LanguageDetector::detect_multiple returns Result<Vec<LanguageSpan>> for mixed-language documents; default impl falls back to single-language detect.
  • LanguageDetection { language, confidence: Option<f64>, provenance: LanguageProvenance } — distinguishes detected from caller-asserted.
  • Crate-wide Result<T> alias defaulting to Error; pub use nvisy_nlp::{Error, Result} at the crate root.
  • Token { start, end, text, is_stop } — no is_punct (broken across both impls; reviewer caught it).
  • User-provided model paths; no bundling, no auto-download. Users install libonnxruntime out-of-band.
  • No lemmas: no maintained Rust lemmatizer exists (tracked in #154, "nvisy-nlp: lemmatization deferred: decide path when a consumer needs it").
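
The type-state builder decision above can be illustrated with a minimal sketch. It is not the crate's actual `NlpEngineBuilder`: components are plain strings here, and the field names and the shape of the `NoNer`/`WithNer`/`NoLang`/`WithLang` markers are assumptions made for illustration. The point is only that `build()` exists solely on the builder type that carries both required components.

```rust
// Marker types that record, at the type level, which components are attached.
// Their exact shape in nvisy-nlp is not shown in this PR; this is an assumption.
pub struct NoNer;
pub struct WithNer(String);
pub struct NoLang;
pub struct WithLang(String);

/// Hypothetical stand-in for the built engine; components are just names here.
pub struct NlpEngine {
    pub ner: String,
    pub language_detector: String,
    pub tokenizer: Option<String>,
}

pub struct NlpEngineBuilder<N, L> {
    ner: N,
    lang: L,
    tokenizer: Option<String>,
}

impl NlpEngineBuilder<NoNer, NoLang> {
    /// Starts with nothing attached.
    pub fn new() -> Self {
        NlpEngineBuilder { ner: NoNer, lang: NoLang, tokenizer: None }
    }
}

impl<N, L> NlpEngineBuilder<N, L> {
    /// Attaching a NER backend flips the first type parameter to `WithNer`.
    pub fn with_ner(self, name: &str) -> NlpEngineBuilder<WithNer, L> {
        NlpEngineBuilder { ner: WithNer(name.to_owned()), lang: self.lang, tokenizer: self.tokenizer }
    }

    /// Attaching a language detector flips the second type parameter to `WithLang`.
    pub fn with_language_detector(self, name: &str) -> NlpEngineBuilder<N, WithLang> {
        NlpEngineBuilder { ner: self.ner, lang: WithLang(name.to_owned()), tokenizer: self.tokenizer }
    }

    /// The tokenizer is optional, so it does not change the type state.
    pub fn with_tokenizer(mut self, name: &str) -> Self {
        self.tokenizer = Some(name.to_owned());
        self
    }
}

impl NlpEngineBuilder<WithNer, WithLang> {
    /// Only defined once both required components are present, so no runtime check is needed.
    pub fn build(self) -> NlpEngine {
        NlpEngine { ner: self.ner.0, language_detector: self.lang.0, tokenizer: self.tokenizer }
    }
}
```

With this shape, `NlpEngineBuilder::new().with_ner("ort").build()` does not compile, because `build` is not defined for `NlpEngineBuilder<WithNer, NoLang>`.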

nvisy-engine wiring

  • Renames EntityRecognition (LLM-driven) → LlmRecognition.
  • Adds NerRecognition wrapping Arc<dyn nvisy_nlp::NerBackend>.
  • Detection phase now runs three ops: LLM (skip if no provider), NER backend (skip if no backend), pattern (always).
  • Engine::with_ner_backend(...) builder method for engine init.

Behaviour unchanged when no backend is configured.

Layout

crates/nvisy-nlp/src/
├── lib.rs
├── error.rs                          Error + Result alias
├── artifacts/
│   ├── mod.rs                        Artifacts
│   └── token.rs                      Token
├── engine/
│   ├── mod.rs                        re-exports + integration tests
│   ├── nlp_engine.rs                 NlpEngine
│   └── builder.rs                    NlpEngineBuilder (type-state)
├── language/
│   ├── mod.rs                        LanguageDetector + LanguageDetection + LanguageSpan + LanguageProvenance
│   └── lingua.rs                     LinguaLanguageDetector
├── ner/
│   ├── mod.rs                        NerBackend trait
│   ├── ort.rs                        OrtNerBackend + Inferencer (pub(crate))
│   └── noop.rs                       NoopNerBackend
└── tokenizer/
    ├── mod.rs                        Tokenizer trait + SUPPORTED_STOPWORD_LANGUAGES
    ├── hugging_face.rs               HfTokenizer
    └── unicode.rs                    UnicodeTokenizer

Deferred (tracked as follow-ups)

Test plan

  • cargo build --workspace clean
  • cargo test --workspace — all ~440 tests pass, 28 in nvisy-nlp
  • cargo clippy --workspace --all-targets clean
  • cargo doc -p nvisy-nlp --no-deps clean (no broken intra-doc links)
  • cargo fmt clean
  • All commits independently buildable
  • Manual smoke: load a real ONNX NER model via OrtNerBackend and run against documents
  • Manual smoke: wire a backend into the engine and verify it runs alongside pattern recognition

🤖 Generated with Claude Code

martsokha and others added 4 commits May 20, 2026 15:39
Scope C, trimmed to what Rust can actually deliver maintainably:
NER, language detection, tokenization. No lemmas in v1 (see #154 —
no maintained Rust lemmatizer exists, nothing currently consumes
them, trait absorbs them later non-breakingly).

## Architecture

Composable trait surface, in contrast to Presidio's monolithic
NlpEngine. Each concern is independently implementable:

- `NerBackend` (async)             — NER over text
- `LanguageDetector` (sync)        — language ID
- `Tokenizer` (sync, fallible)     — token + offset extraction

Composite `NlpEngine` orchestrates them in Presidio order (detect
language → run NER with hint → tokenize → derive keywords) and has
a second entrypoint `analyze_in_language(text, lang)` for callers
who already know the document language.

`NerBackend` and `LanguageDetector` are required at build time;
`Tokenizer` is optional.
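
A rough sketch of the orchestration order described above, under several simplifying assumptions: all three traits are synchronous here (the real `NerBackend` is async), errors are plain strings, languages are bare strings rather than language tags, and "derive keywords" is assumed to mean "lower-cased non-stopword tokens". The stub implementations (`AlwaysEnglish`, `HintEchoNer`, `WhitespaceTokenizer`) are hypothetical stand-ins, not crate types.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Entity { pub start: usize, pub end: usize, pub label: String }

#[derive(Debug, Clone, PartialEq)]
pub struct Token { pub start: usize, pub end: usize, pub text: String, pub is_stop: bool }

pub trait NerBackend {
    fn recognize(&self, text: &str, language: Option<&str>) -> Result<Vec<Entity>, String>;
}
pub trait LanguageDetector {
    fn detect(&self, text: &str) -> Option<String>;
}
pub trait Tokenizer {
    fn tokenize(&self, text: &str) -> Result<Vec<Token>, String>;
}

#[derive(Debug, Clone, PartialEq)]
pub struct Artifacts {
    pub entities: Vec<Entity>,
    pub language: Option<String>,
    pub tokens: Vec<Token>,
    pub keywords: Vec<String>,
}

pub struct NlpEngine {
    pub ner: Box<dyn NerBackend>,
    pub detector: Box<dyn LanguageDetector>,
    pub tokenizer: Option<Box<dyn Tokenizer>>,
}

impl NlpEngine {
    /// Detect language, then run the shared pipeline with that hint.
    pub fn analyze(&self, text: &str) -> Result<Artifacts, String> {
        let language = self.detector.detect(text);
        self.run(text, language)
    }

    /// Skip detection when the caller already knows the language.
    pub fn analyze_in_language(&self, text: &str, lang: &str) -> Result<Artifacts, String> {
        self.run(text, Some(lang.to_owned()))
    }

    fn run(&self, text: &str, language: Option<String>) -> Result<Artifacts, String> {
        let entities = self.ner.recognize(text, language.as_deref())?;
        let tokens = match &self.tokenizer {
            Some(tokenizer) => tokenizer.tokenize(text)?,
            None => Vec::new(),
        };
        // Assumption: keywords are the lower-cased tokens that are not stopwords.
        let keywords = tokens.iter().filter(|t| !t.is_stop).map(|t| t.text.to_lowercase()).collect();
        Ok(Artifacts { entities, language, tokens, keywords })
    }
}

/// Hypothetical detector that always answers "en".
pub struct AlwaysEnglish;
impl LanguageDetector for AlwaysEnglish {
    fn detect(&self, _text: &str) -> Option<String> { Some("en".to_owned()) }
}

/// Hypothetical backend that emits one entity labelled with the hint it received.
pub struct HintEchoNer;
impl NerBackend for HintEchoNer {
    fn recognize(&self, text: &str, language: Option<&str>) -> Result<Vec<Entity>, String> {
        Ok(vec![Entity { start: 0, end: text.len(), label: language.unwrap_or("und").to_owned() }])
    }
}

/// Hypothetical tokenizer: whitespace splitting with a tiny fixed stopword list.
pub struct WhitespaceTokenizer;
impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Result<Vec<Token>, String> {
        let mut tokens = Vec::new();
        let mut cursor = 0;
        for word in text.split_whitespace() {
            // The next word always starts at the first match after the previous token.
            let start = cursor + text[cursor..].find(word).unwrap_or(0);
            let end = start + word.len();
            let is_stop = matches!(word, "the" | "a" | "in");
            tokens.push(Token { start, end, text: word.to_owned(), is_stop });
            cursor = end;
        }
        Ok(tokens)
    }
}
```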

## v1 implementations

- `OrtNerBackend` — HuggingFace token-classification model via
  `ort` (ONNX Runtime, load-dynamic). Inference behind an
  `Inferencer` trait so unit tests use canned logits, not a model
  file. BIO-tag folding + label-map lookup produce `Entities`.
- `LinguaLanguageDetector` — lingua-rs, real confidence scores,
  multilingual segmentation via `detect_multiple`, two constructors
  (`for_languages` recommended, `for_all_languages` for genuinely
  unknown). Lingua defaults to high-accuracy mode.
- `UnicodeTokenizer` — `unicode-segmentation` based, model-free.
  Optional stopword set via `stop-words`.
- `HfTokenizer` — `tokenizers` crate wrapper. Loads `tokenizer.json`,
  produces tokens with HF offsets (for ORT NER alignment). Optional
  stopword set.
- `NoopNerBackend` — empty entities; for tests.
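
For readers unfamiliar with BIO-tag folding, here is a minimal, self-contained sketch of the idea. It is not the crate's `fold_predictions`: the `Span` type and `fold_bio` name are hypothetical, there is no label-map lookup or confidence, and a stray `I-` tag with no matching open span is assumed to start a new span.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Span { pub start: usize, pub end: usize, pub label: String }

/// Folds per-token BIO tags into entity spans.
/// `offsets[i]` is the byte range of token `i`; `tags[i]` is "O", "B-XXX" or "I-XXX".
pub fn fold_bio(offsets: &[(usize, usize)], tags: &[&str]) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut current: Option<Span> = None;

    for (&(start, end), tag) in offsets.iter().zip(tags) {
        // Split "B-PER" into ("B", "PER"); anything without a dash is treated as "O".
        let (prefix, label) = match tag.split_once('-') {
            Some((prefix, label)) => (prefix, label),
            None => ("O", ""),
        };

        // An "I-" tag continues the open span only if the labels agree.
        let continues = prefix == "I"
            && current.as_ref().map_or(false, |open| open.label == label);

        if continues {
            if let Some(open) = current.as_mut() {
                open.end = end;
            }
        } else {
            // Close whatever span was open, then maybe start a new one.
            spans.extend(current.take());
            if prefix == "B" || prefix == "I" {
                current = Some(Span { start, end, label: label.to_owned() });
            }
        }
    }

    // Flush a span that runs to the end of the input.
    spans.extend(current);
    spans
}
```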

## Data carrier

`NlpArtifacts` mirrors Presidio's field set (entities, language,
tokens, keywords) reshaped for typed Rust access. No `lemmas`, no
spaCy `Doc` backref, no parallel offset arrays.

## Public surface decisions

- `LanguageDetection { language, confidence: Option<f64> }` and
  `LanguageSpan { start, end, language, confidence }` carried on the
  trait; engine strips down to `LanguageTag` for `NlpArtifacts`
  (consumers wanting confidence call the detector directly).
- `Token { start, end, text, is_stop }` — no `is_punct`; the
  detection was unreliable across both impls and nothing reads it.
- `Tokenizer::tokenize` is fallible (`Result<Vec<Token>, NlpError>`).
  HF errors propagate instead of silent-empty; Unicode impl is always
  `Ok`.

## Distribution / model story

User-provided model paths only. No bundling, no auto-download.
`load-dynamic` for ORT means users install `libonnxruntime`
out-of-band (Homebrew on macOS, distro package on Linux). Documented
in `README.md` and `DESIGN.md`.

## Deferred (tracked)

- GlotLID v3 as an alternative language detector.
- TiktokenTokenizer for LLM context-window counting.
- LinderaTokenizer for CJK content.
- Lemmatization (see #154).
- Burn as an alternative NER backend.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Renames `EntityRecognition` (LLM-driven NER) to `LlmRecognition` and
adds a new `NerRecognition` operation that wraps
`Arc<dyn nvisy_nlp::NerBackend>` and runs sequentially over text
spans. The detection phase now runs three operations in order:

1. `LlmRecognition`  — silently skipped if no LLM provider configured
2. `NerRecognition`  — silently skipped if no backend on the run ctx
3. `PatternRecognition` — always runs

Wiring:

- `Engine::with_ner_backend(Arc<dyn NerBackend>)` builder method to
  attach an offline backend at engine init.
- `EngineInner.ner_backend` carries it across runs.
- `Pipeline::new` accepts and forwards it.
- `RunContext.ner_backend` makes it available to the orchestrator.

Behaviour unchanged when no backend is configured: existing pipelines
continue to run LLM NER + patterns, no offline-NER overhead.
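
The skip-if-unconfigured ordering can be pictured with a small sketch. It assumes, purely for illustration, that each operation is a closure from text to entity labels; the real operations are structs running on a run context, and `run_detection` and `DetectionOp` are hypothetical names.

```rust
/// A detection operation: takes text, returns the entity labels it found.
pub type DetectionOp<'a> = &'a dyn Fn(&str) -> Vec<String>;

/// Runs LLM, NER and pattern detection in that order.
/// The first two are optional and silently skipped when absent; patterns always run.
pub fn run_detection<'a>(
    text: &str,
    llm: Option<DetectionOp<'a>>,
    ner: Option<DetectionOp<'a>>,
    pattern: DetectionOp<'a>,
) -> Vec<String> {
    let mut entities = Vec::new();
    if let Some(llm) = llm {
        entities.extend(llm(text));
    }
    if let Some(ner) = ner {
        entities.extend(ner(text));
    }
    entities.extend(pattern(text));
    entities
}
```

When both optional operations are `None`, the result is exactly what the pattern operation returns, which matches the "behaviour unchanged" claim for pipelines that do not configure a backend.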

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts, fallible LanguageDetector

Doc + naming + ergonomics polish on top of the new crate. Squashed
into one commit because the changes interleave on the same files.

## Naming

- `NlpError` -> `Error` + `Result<T, E = Error>` alias; the crate
  now uses `Result<T>` everywhere internally.
- `NlpArtifacts` -> `Artifacts`.
- `tokenizer/hf.rs` -> `tokenizer/hugging_face.rs` (file rename;
  type name `HfTokenizer` unchanged).

## Trait change

`LanguageDetector` is now fallible:

- `detect(&str) -> Result<Option<LanguageDetection>>` — splits "no
  answer" (`Ok(None)`, e.g. short or ambiguous text) from real
  backend failure (`Err(_)`).
- `detect_multiple(&str) -> Result<Vec<LanguageSpan>>` — default
  impl returns a single full-text span built from `detect`; impls
  with real segmentation override.
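
A minimal sketch of this trait shape, assuming a single-variant `Error`, string language codes, and an `Option<f64>` confidence on spans (the commit does not spell out those types). `StubDetector` is a hypothetical implementation used only to exercise the three outcomes.

```rust
#[derive(Debug, PartialEq)]
pub enum Error {
    Backend(String),
}

/// Crate-wide alias whose error type defaults to `Error`.
pub type Result<T, E = Error> = std::result::Result<T, E>;

#[derive(Debug, Clone, PartialEq)]
pub struct LanguageDetection { pub language: String, pub confidence: Option<f64> }

#[derive(Debug, Clone, PartialEq)]
pub struct LanguageSpan { pub start: usize, pub end: usize, pub language: String, pub confidence: Option<f64> }

pub trait LanguageDetector {
    /// `Ok(None)` means "no answer"; `Err(_)` means the backend itself failed.
    fn detect(&self, text: &str) -> Result<Option<LanguageDetection>>;

    /// Default: one span covering the whole text, built from `detect`.
    /// Implementations with real segmentation would override this.
    fn detect_multiple(&self, text: &str) -> Result<Vec<LanguageSpan>> {
        Ok(self
            .detect(text)?
            .into_iter()
            .map(|d| LanguageSpan { start: 0, end: text.len(), language: d.language, confidence: d.confidence })
            .collect())
    }
}

/// Hypothetical detector: fails when unavailable, gives no answer for very short text.
pub struct StubDetector { pub available: bool }

impl LanguageDetector for StubDetector {
    fn detect(&self, text: &str) -> Result<Option<LanguageDetection>> {
        if !self.available {
            return Err(Error::Backend("detector unavailable".to_owned()));
        }
        if text.split_whitespace().count() < 3 {
            return Ok(None);
        }
        Ok(Some(LanguageDetection { language: "en".to_owned(), confidence: Some(0.9) }))
    }
}
```

A caller can then treat `Ok(None)` as "fall back to a default language" while still propagating `Err(_)` with `?`.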

`LinguaLanguageDetector` propagates these `Result`s. Tests updated.

## Framing

- Dropped all "offline" labelling from README, Cargo description,
  module docs, struct docs, tracing strings. The trait surface is
  transport-agnostic; LLM-mediated NER lives in `nvisy-llm` by
  deliberate crate split, not by trait restriction. README and inline
  docs now say so explicitly.
- Deleted `DESIGN.md`; the README now carries the small amount of
  framing actually needed.

## Builder

Dropped `with_ner_arc` and `with_language_detector_arc` from
`NlpEngineBuilder` — the `with_ner` / `with_language_detector`
methods (which take any `T: NerBackend/LanguageDetector + 'static`)
cover the same use cases.

## Layout

- `artifacts/` is a folder with `mod.rs` (Artifacts) and `token.rs`
  (Token).
- `language/` is a folder with `mod.rs` (trait + LanguageDetection +
  LanguageSpan) and `lingua.rs` (LinguaLanguageDetector).
- `error.rs` is a single file (no folder).

## nvisy-engine side

Same sweep applied to the consumer: removed "offline" framing from
`ner_recognition`, `llm_recognition`, `detection/mod`, orchestrator
and engine docs. The trait surface and operation behaviour are
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Doc-only + style sweep across the crate:

- Hoist all intra-doc links and external URL links to bottom-of-block
  reference definitions, matching the existing style. Inline
  `[Foo](crate::path::Foo)` and `[bar](https://...)` become `[Foo]`
  and `[bar]` with reference defs below.
- Escape stray `[\`nvisy-nlp\`]` crate-name strings that rustdoc was
  mistakenly parsing as intra-doc links — they're now plain
  backticks.
- Drop the dangling "See DESIGN.md" line from ner/ort.rs (file was
  deleted last round).
- `cargo fmt` rewraps.
- `cargo clippy --workspace --all-targets` clean (one
  `manual_contains` lint fixed in `stopword_lang::is_supported`).
- `cargo doc -p nvisy-nlp --no-deps` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@martsokha martsokha self-assigned this May 20, 2026
Acts on every item from the second-round code review. Changes by
category, with reviewer's anchor codes in parens.

## Bugs

- (A1) OrtNerConfig.max_sequence_length is now wired: the tokenizer
  is configured with TruncationParams at construction time, so long
  inputs are truncated instead of triggering a shape mismatch at
  inference time.
- (A2) OrtNerBackend.recognize now validates the language hint
  against `supported_languages` (when configured) and returns
  Error::UnsupportedLanguage on mismatch. Empty list still means
  "any language" per the trait doc.
- (A3) Inference moved off the async-runtime thread via
  tokio::task::spawn_blocking. OrtNerBackend.state lives behind an
  Arc so it can cross the spawn boundary cheaply.
- (A4) Inferencer trait returns flat (Vec<f32>, num_labels) instead
  of Vec<Vec<f32>>. 512 heap allocations per inference become one.
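
To make the (A4) layout change concrete, here is a sketch of a flat logits buffer addressed by token and label. `FlatLogits` is a hypothetical wrapper; the commit only says the internal trait returns `(Vec<f32>, num_labels)`.

```rust
/// Row-major logits: `data[token * num_labels + label]`, stored in one allocation.
pub struct FlatLogits {
    data: Vec<f32>,
    num_labels: usize,
}

impl FlatLogits {
    /// Rejects buffers whose length is not a whole number of rows.
    pub fn new(data: Vec<f32>, num_labels: usize) -> Option<Self> {
        if num_labels == 0 || data.len() % num_labels != 0 {
            return None;
        }
        Some(Self { data, num_labels })
    }

    pub fn num_tokens(&self) -> usize {
        self.data.len() / self.num_labels
    }

    /// Logits for one token, borrowed from the shared buffer (panics if out of range).
    pub fn row(&self, token: usize) -> &[f32] {
        let start = token * self.num_labels;
        &self.data[start..start + self.num_labels]
    }

    /// Iterates token rows without allocating.
    pub fn rows(&self) -> impl Iterator<Item = &[f32]> {
        self.data.chunks_exact(self.num_labels)
    }
}
```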

## Ergonomics

- (B1, E1) Inferencer + OrtInferencer are pub(crate); external
  backends implement NerBackend directly, no longer locked into the
  internal trait surface.
- (B2) OrtNerConfig::with_conll03_english helper for the common
  PER/ORG/LOC/MISC label set.
- (B3) Replaced hardcoded 0.85 confidence with real per-span
  softmax confidence — argmax_softmax computes both the winning
  label and its probability in one pass, fold_predictions averages
  per-token probs across the span.
- (B5) Type-state NlpEngineBuilder. NoNer/WithNer + NoLang/WithLang
  marker types make `.build()` only callable once both required
  components are attached. The runtime `expect` panics become
  unreachable.
- (B6) LanguageProvenance { Detected, Asserted } enum on
  LanguageDetection. analyze_in_language stamps Asserted;
  LinguaLanguageDetector stamps Detected. Callers no longer have to
  reverse-engineer provenance from `confidence: None`.
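
The (B3) confidence computation can be sketched as follows. This is an independent illustration rather than the crate's `argmax_softmax`: it takes two passes (one for the maximum, one for the normaliser), assumes finite logits, and `span_confidence` is a hypothetical name for the per-span averaging step.

```rust
/// Returns the index of the largest logit and its softmax probability.
/// Subtracting the maximum before exponentiating keeps the sum numerically stable,
/// and the winner's probability is then simply 1 / sum.
pub fn argmax_softmax(logits: &[f32]) -> Option<(usize, f32)> {
    let (best, max) = logits
        .iter()
        .copied()
        .enumerate()
        .fold(None, |acc: Option<(usize, f32)>, (i, x)| match acc {
            // Keep the earlier index on ties.
            Some((_, m)) if m >= x => acc,
            _ => Some((i, x)),
        })?;
    let denom: f32 = logits.iter().map(|&x| (x - max).exp()).sum();
    Some((best, 1.0 / denom))
}

/// Span confidence as the mean of the per-token winning probabilities.
pub fn span_confidence(token_probs: &[f32]) -> Option<f32> {
    if token_probs.is_empty() {
        return None;
    }
    Some(token_probs.iter().sum::<f32>() / token_probs.len() as f32)
}
```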

## Polish

- (B4) HfTokenizer doc explicitly notes that `is_stop` doesn't work
  for subword tokenizers (BPE/WordPiece/Unigram emit fragments, not
  whole words). Recommends UnicodeTokenizer for word-level filtering.
- (C1, C2) OrtState moves the tokenizer into Arc-shared storage and
  drops everything not needed at inference time (full
  OrtNerConfig held forever was the original cut corner).
- (C3) label_order rejects `"O"` in the user-supplied label_map at
  construction with a clear error.
- (C4) lingua_to_tag uses a OnceLock<Mutex<HashSet>>-backed
  warn-once cache. Repeated failures for the same ISO code log
  once per process instead of per call.
- (C6) `recognize_sync` -> `recognize_blocking`. Actually used now,
  by the new tests and as the body run inside spawn_blocking.
- (C7, D1, D2) Real unit tests for fold_predictions: a minimal
  hand-rolled tokenizer.json + CannedInferencer drives end-to-end
  span construction including multi-span output, BIO continuation,
  language-hint validation paths.
- (D3) NoopNerBackend::default() exercised in a test.
- (E2) Dedicated stopword_lang.rs collapsed into tokenizer/mod.rs
  as `SUPPORTED_STOPWORD_LANGUAGES` const + helper fn.
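
The warn-once cache from (C4) follows a common standard-library pattern, sketched below. The function names are hypothetical, and `eprintln!` stands in for whatever logging the crate actually uses.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Process-wide set of codes that have already produced a warning.
static WARNED: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

/// Returns `true` the first time `code` is seen in this process, `false` afterwards.
pub fn should_warn(code: &str) -> bool {
    let seen = WARNED.get_or_init(|| Mutex::new(HashSet::new()));
    // A poisoned lock only means another thread panicked mid-insert; the set is still usable.
    let mut seen = seen.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    // `HashSet::insert` returns true only when the value was not already present.
    seen.insert(code.to_owned())
}

/// Logs at most once per distinct code.
pub fn warn_unmapped_language(code: &str) {
    if should_warn(code) {
        eprintln!("warning: no language tag for ISO code {code:?}");
    }
}
```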

## Docs

- All "offline NLP" framing dropped in earlier round; this commit
  also fixes the broken intra-doc link to UnicodeTokenizer from
  HfTokenizer's struct doc.

## Test count

21 -> 28 tests, all pass. Workspace builds + clippy clean + fmt +
doc warnings clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@martsokha martsokha changed the title from "feat(nvisy-nlp): new offline NLP crate + nvisy-engine wiring" to "feat(nvisy-nlp): new NLP crate + nvisy-engine wiring" May 20, 2026
martsokha and others added 8 commits May 20, 2026 17:30
…, fold engine module

Engine module is now a single file (engine/mod.rs holds Engine; builder
stays in engine/builder.rs). LanguageDetection (with LanguageProvenance)
and LanguageSpan moved out of language/mod.rs into their own files so the
module root only carries the LanguageDetector trait + re-exports.

Engine name collides with nvisy_engine::Engine at use sites; callers
importing both will need to alias one.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, NoopNer behind test-utils

nvisy-ontology
- Add Confidence(f64) primitive with Confidence::new -> Option (NonZero-style),
  Confidence::get accessor, serde-transparent. Rejects NaN/inf and values outside [0,1].
- Migrate Entity.confidence from f64 to Confidence. EntitySelector and policy
  comparisons now go through .get().
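
A sketch of what such a validated primitive might look like, without the serde derive mentioned above (omitted to keep the example dependency-free). The derive list is an assumption.

```rust
/// A probability-like score guaranteed to be finite and within [0, 1].
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Confidence(f64);

impl Confidence {
    /// Returns `None` for NaN, infinities, and values outside [0, 1].
    pub fn new(value: f64) -> Option<Self> {
        if value.is_finite() && (0.0..=1.0).contains(&value) {
            Some(Self(value))
        } else {
            None
        }
    }

    /// The wrapped value.
    pub fn get(self) -> f64 {
        self.0
    }
}
```

At boundaries where float rounding can push a score slightly past 1.0, clamping before construction (for example `Confidence::new(raw.clamp(0.0, 1.0))`) absorbs the overshoot, while NaN still comes back as `None` because `clamp` leaves NaN unchanged.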

nvisy-nlp
- LanguageSpan shrinks to {start, end}. LanguageDetection gains
  span: Option<LanguageSpan> and confidence: Option<Confidence>.
- LanguageDetector trait: detect + detect_in both return Vec<LanguageDetection>.
  Empty vec means "no answer"; mixed-language input produces one entry per region.
  Drop detect_multiple. LinguaLanguageDetector populates span from lingua's
  multi-language API.
- Rename Request -> Context (file too). Drop Response: Engine::analyze now returns
  Artifacts directly. correlation_id stays on Context, input-only for the tracing span.
- Replace type-state EngineBuilder with derive_builder + custom setters; the four
  marker types are gone from the public API. .build() returns Result<Engine, EngineBuilderError>.
- Gate NoopNerBackend behind a new test-utils feature so production builds don't
  ship the no-op fallback.

nvisy-pattern, nvisy-provider, nvisy-engine
- Wrap entity.confidence at every construction site (Confidence::new(..clamp(0,1))
  at ML/LLM boundaries to absorb float rounding).
- Read entity.confidence.get() at every comparison site (selectors, dedup
  strategies, redaction thresholds).
- Bump nvisy-ontology dev-deps to enable test-utils where tests build Entities.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oes internal

Reshape language module around a public factory trait.

- LanguageDetector trait drops to pub(crate). One method: detect(&str).
  Language scope is baked in at construction time; no detect_in.
- New public LanguagePolicy trait with associated Detector type +
  detector_for_all / detector_for(&[LanguageTag]) factory methods.
  Engine asks the policy for a fresh detector each call.
- Crate-private DynLanguagePolicy shim (object-safe twin, blanket-impl'd
  for every LanguagePolicy) lets Engine hold Arc<dyn DynLanguagePolicy>
  while the public trait keeps its associated type.
- impl<P: LanguagePolicy> LanguagePolicy for Arc<P> for cheap sharing.
- LinguaLanguagePolicy unit-like factory; LinguaLanguageDetector keeps
  pub struct visibility (needed for the associated type) but constructors
  become pub(crate) so external code only reaches it through the policy.
- EngineBuilder::with_language_detector renamed to with_language_policy.
  Asserted-language path (Context::language) bypasses the policy.
- Context::candidate_languages forwards to LanguagePolicy::detector_for
  every call; LinguaLanguagePolicy silently falls back to detector_for_all
  when no requested tag is recognised.
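
The factory-trait-plus-shim arrangement is easier to follow in code. The sketch below is a simplified reconstruction under assumptions: languages are plain strings instead of `LanguageTag`, `detect` returns `Option<String>`, the shim methods carry a `dyn_` prefix, and `FirstCandidateDetector` / `FirstCandidatePolicy` are hypothetical stand-ins for the lingua types.

```rust
use std::sync::Arc;

pub trait LanguageDetector {
    fn detect(&self, text: &str) -> Option<String>;
}

/// Public factory trait: the associated type keeps the concrete detector visible.
pub trait LanguagePolicy {
    type Detector: LanguageDetector + 'static;
    fn detector_for_all(&self) -> Self::Detector;
    fn detector_for(&self, languages: &[String]) -> Self::Detector;
}

/// Object-safe twin: erases the associated type behind a `Box<dyn LanguageDetector>`.
pub trait DynLanguagePolicy {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector>;
    fn dyn_detector_for(&self, languages: &[String]) -> Box<dyn LanguageDetector>;
}

/// Every `LanguagePolicy` gets the shim for free.
impl<P: LanguagePolicy> DynLanguagePolicy for P {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for_all())
    }
    fn dyn_detector_for(&self, languages: &[String]) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for(languages))
    }
}

/// Forwarding impl so a shared policy can be passed wherever a policy is expected.
impl<P: LanguagePolicy> LanguagePolicy for Arc<P> {
    type Detector = P::Detector;
    fn detector_for_all(&self) -> Self::Detector {
        (**self).detector_for_all()
    }
    fn detector_for(&self, languages: &[String]) -> Self::Detector {
        (**self).detector_for(languages)
    }
}

/// Hypothetical detector: answers with its first candidate for any non-blank text.
pub struct FirstCandidateDetector { candidates: Vec<String> }
impl LanguageDetector for FirstCandidateDetector {
    fn detect(&self, text: &str) -> Option<String> {
        if text.trim().is_empty() { None } else { self.candidates.first().cloned() }
    }
}

/// Hypothetical unit-like policy.
pub struct FirstCandidatePolicy;
impl LanguagePolicy for FirstCandidatePolicy {
    type Detector = FirstCandidateDetector;
    fn detector_for_all(&self) -> Self::Detector {
        FirstCandidateDetector { candidates: vec!["en".to_owned()] }
    }
    fn detector_for(&self, languages: &[String]) -> Self::Detector {
        FirstCandidateDetector { candidates: languages.to_vec() }
    }
}

pub struct Engine { policy: Arc<dyn DynLanguagePolicy> }
impl Engine {
    pub fn new<P: LanguagePolicy + 'static>(policy: P) -> Self {
        Self { policy: Arc::new(policy) }
    }
    /// Asks the policy for a fresh detector on every call.
    pub fn detect(&self, text: &str, candidates: &[String]) -> Option<String> {
        let detector = if candidates.is_empty() {
            self.policy.dyn_detector_for_all()
        } else {
            self.policy.dyn_detector_for(candidates)
        };
        detector.detect(text)
    }
}
```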

Module layout:
- language/detection.rs holds LanguageDetection, LanguageProvenance, and
  LanguageSpan (merged from former lang_detection.rs + lang_span.rs).
- language/dyn_policy.rs holds DynLanguagePolicy + the Arc<P> forwarder.
- language/lingua.rs holds both LinguaLanguageDetector and LinguaLanguagePolicy.
- language/mod.rs holds LanguageDetector (pub(crate)) and LanguagePolicy (pub).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tEnhancer, serde wire shape

Per-call context hints (Presidio's analyze(context=[...]) equivalent, with
per-kind targeting on top):

- New ContextHint { kind: Option<EntityKind>, keywords, window?, boost? }.
  ScanContext::hints: Vec<ContextHint>. The enhancer picks at most one
  bucket per match: the entry with kind == Some(match.entity_kind), falling
  back to the first kind == None bucket. Prevents a "DOB" hint from
  silently boosting an SSN pattern.
- Pull RawMatch::apply_context_adjustment into a new pub(crate)
  ContextEnhancer in engine/scan/enhancer.rs. Strict-Presidio: patterns
  without a ContextRule are skipped regardless of hints.
- Per-call window/boost overrides only fire when hint.keywords is non-empty
  (avoids silent retuning).
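
The bucket-selection rule can be sketched in a few lines. The `EntityKind` variants, the field types of `window` and `boost`, and the function names are assumptions for illustration; only the selection order and the "non-empty keywords" guard come from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityKind { Ssn, DateOfBirth }

#[derive(Debug, Clone, PartialEq)]
pub struct ContextHint {
    pub kind: Option<EntityKind>,
    pub keywords: Vec<String>,
    pub window: Option<usize>,
    pub boost: Option<f64>,
}

/// Picks at most one hint for a match: the kind-specific bucket if present,
/// otherwise the first untargeted (`kind == None`) bucket.
pub fn select_hint(hints: &[ContextHint], match_kind: EntityKind) -> Option<&ContextHint> {
    hints
        .iter()
        .find(|h| h.kind == Some(match_kind))
        .or_else(|| hints.iter().find(|h| h.kind.is_none()))
}

/// A per-call boost override only applies when the hint actually carries keywords.
pub fn effective_boost(hint: &ContextHint, rule_boost: f64) -> f64 {
    if hint.keywords.is_empty() {
        rule_boost
    } else {
        hint.boost.unwrap_or(rule_boost)
    }
}
```

Because a hint targeted at `DateOfBirth` is never selected for an `Ssn` match, a "DOB" keyword cannot boost an SSN pattern by accident.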

Wire shape:

- AllowList, DenyList, DenyRule, ContextHint, ScanContext gain serde derives.
- AllowList / DenyList serialize transparently (HashSet<String> / HashMap<String, DenyRule>).
- DenyList's lazily-compiled scanner is #[serde(skip)] and rebuilt on first scan.
- Round-trip test locks in the JSON shape.

File reshuffle:

- engine/filter/scan_context.rs deleted; ScanContext lives in filter/mod.rs.
- engine/pattern_engine.rs deleted; PatternEngine + Debug impl + scan_entities
  / scan_raw + DEFAULT_ENGINE singleton fold into engine/mod.rs.
- engine/filter/deny_list.rs split: DenyScanner moves to its own
  engine/filter/deny_scanner.rs.
- engine/filter/context_hint.rs holds the new ContextHint type (singular
  because ScanContext::hints is a Vec).
- PatternEngine field visibility tightens from pub(super) to
  pub(in crate::engine) so the post-merge visibility matches the original
  "siblings in crate::engine only" intent.

Tests: 11 new enhancer tests including the kind-targeted boost isolation
test (kind_targeted_hint_does_not_apply_to_other_kinds) plus 2 serde
round-trip tests. 83 unit + 4 + 5 integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…no longer depends on nlp/pattern

Mirrors Presidio's presidio-analyzer split: detection becomes its own crate
with a Recognizer trait + DetectionEngine, and nvisy-engine reduces to
orchestration.

New crate nvisy-detection
- Recognizer trait (async recognize + reset; reset is a no-op default).
- NerRecognizer wraps Arc<nvisy_nlp::Engine> and translates DetectionContext
  to nvisy_nlp::Context. Behavior change vs the old NerRecognition: now
  routes through the full Engine (language detection + post-filter + tokens
  + keywords) instead of bypassing it.
- PatternRecognizer wraps nvisy_pattern::PatternEngine with the
  Shared/Arc/Owned variants preserved from the old PatternEngineRef.
- LlmRecognizer wraps NerAgent + DetectionConfig; overrides Recognizer::reset
  to clear coreference state.
- DetectionEngine orchestrates Vec<Arc<dyn Recognizer>> sequentially. detect()
  runs each; reset() fans out to clear per-document state at document
  boundaries.
- DetectionContext bundles per-call inputs (text, language, candidate_languages,
  entities, score_threshold, scan_context, correlation_id).
- extension::RebaseEntities (moved from nvisy-engine) shifts recognizer
  output from context-local to document-relative byte offsets.
- Error::{Recognizer{name,cause}, Misconfigured} with From → nvisy_core::Error.
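
A simplified sketch of the recognizer/engine relationship described in this list. It is synchronous (the real `recognize` is async), uses string errors instead of `Error::Recognizer { name, cause }`, and takes bare text instead of a `DetectionContext`. `KeywordRecognizer` and the free function `rebase` are hypothetical; the latter only illustrates the offset shift that `RebaseEntities` performs.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq)]
pub struct Entity { pub start: usize, pub end: usize, pub label: String }

pub trait Recognizer: Send + Sync {
    fn name(&self) -> &str;
    fn recognize(&self, text: &str) -> Result<Vec<Entity>, String>;
    /// Clears per-document state; a no-op unless a recognizer overrides it.
    fn reset(&self) {}
}

pub struct DetectionEngine { recognizers: Vec<Arc<dyn Recognizer>> }

impl DetectionEngine {
    pub fn new(recognizers: Vec<Arc<dyn Recognizer>>) -> Self {
        Self { recognizers }
    }

    /// Runs each recognizer in order and concatenates their output.
    pub fn detect(&self, text: &str) -> Result<Vec<Entity>, String> {
        let mut out = Vec::new();
        for recognizer in &self.recognizers {
            let found = recognizer
                .recognize(text)
                .map_err(|cause| format!("{}: {}", recognizer.name(), cause))?;
            out.extend(found);
        }
        Ok(out)
    }

    /// Fans out to every recognizer at a document boundary.
    pub fn reset(&self) {
        for recognizer in &self.recognizers {
            recognizer.reset();
        }
    }
}

/// Shifts context-local byte offsets to document-relative ones.
pub fn rebase(entities: &mut [Entity], offset: usize) {
    for entity in entities.iter_mut() {
        entity.start += offset;
        entity.end += offset;
    }
}

/// Hypothetical stateful recognizer: finds a literal keyword and counts its calls.
pub struct KeywordRecognizer { keyword: String, label: String, calls: Mutex<usize> }

impl KeywordRecognizer {
    pub fn new(keyword: &str, label: &str) -> Self {
        Self { keyword: keyword.to_owned(), label: label.to_owned(), calls: Mutex::new(0) }
    }
    pub fn calls(&self) -> usize {
        *self.calls.lock().unwrap()
    }
}

impl Recognizer for KeywordRecognizer {
    fn name(&self) -> &str { "keyword" }
    fn recognize(&self, text: &str) -> Result<Vec<Entity>, String> {
        *self.calls.lock().unwrap() += 1;
        Ok(text
            .match_indices(self.keyword.as_str())
            .map(|(start, m)| Entity { start, end: start + m.len(), label: self.label.clone() })
            .collect())
    }
    fn reset(&self) {
        *self.calls.lock().unwrap() = 0;
    }
}
```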

nvisy-engine
- Drops nvisy-pattern and nvisy-nlp from its Cargo.toml. nvisy-engine now
  only depends on nvisy-detection (and nvisy-core, codec, ontology, provider)
  for detection.
- operation/detection/ folder (5 files: LlmRecognition, NerRecognition,
  PatternRecognition, PatternEngineRef, RebaseEntities) replaced by a single
  operation/detection.rs holding Detection { engine: Arc<DetectionEngine> }.
- EngineInner::ner_backend: Option<Arc<dyn NerBackend>> →
  detection_engine: Option<Arc<DetectionEngine>>.
- Engine::with_ner_backend → Engine::with_detection_engine. Users compose
  the detection engine externally with whatever recognizers they want and
  hand the assembled Arc<DetectionEngine> to the runtime engine.
- Pipeline / RunContext renamed accordingly.
- Orchestrator::run_detection collapses to a single Detection op invocation.
  Workflow DetectionConfig is currently ignored at the engine boundary
  (recognizer composition happens externally); threading per-call hints
  from the cfg into DetectionContext is left for a follow-up.

nvisy-nlp
- Engine::analyze rewritten to use tracing::Instrument instead of
  span.entered() so the returned future is Send. Required by NerRecognizer's
  async_trait impl. No behavior change.

nvisy-pattern
- ScanContext gains Clone so it can ride on DetectionContext (Clone).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Remove crates/nvisy-python (pyo3 bridge + exif Python module).
- Remove packages/ (uv workspace: nvisy-hf, nvisy-exif).
- Remove root pyproject.toml.
- Drop nvisy-python from workspace members and dependencies in
  root Cargo.toml.
- Drop pyo3, pyo3-async-runtimes, pythonize from workspace
  dependencies.
- Remove all four Install Python steps and the
  PYO3_USE_ABI3_FORWARD_COMPATIBILITY env from CI build workflow.
- Strip Python entries (__pycache__, *.py[cod], .venv,
  *.egg-info, .ruff_cache, .pytest_cache) from .gitignore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ient threading

Splits the former nvisy-provider into two focused crates and moves
agent/HTTP wiring out of nvisy-engine.

New crate nvisy-rig
- LLM agents over the rig framework: base scaffolding, NER, CV,
  generate, audio (STT/TTS), and entity_verifier.
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: openai-gpt, openai-whisper, openai-tts,
  anthropic-claude, google-gemini.
- Does NOT depend on nvisy-ocr.

New crate nvisy-ocr
- Non-LLM OCR providers (Surya, PaddleX, AWS Textract,
  Google Vision, Azure DocAI).
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: aws-textract, google-vision, azure-docai.

agent/ocr split
- LLM-side prompting + verification logic moves to
  nvisy_rig::agent::entity_verifier::EntityVerifier. verify(image,
  &[ProposedEntity]) -> VerificationOutput. No knowledge of OCR
  providers.
- OCR-engine + verifier orchestration moves into
  nvisy_engine::operation::extraction::vision. The engine now
  drives the OCR -> propose entities -> LLM verify -> merge flow
  directly instead of routing through a unified OcrAgent.

nvisy-engine reshape
- Drops nvisy-provider; adds nvisy-rig + nvisy-ocr.
- EngineInner::http_client field removed.
- Pipeline::new no longer takes HttpClient.
- RunContext::http_client removed.
- Engine::http_client() accessor removed.
- Extraction::new, VisualExtraction::new, AudialExtraction::new no
  longer take &HttpClient. Each constructs its own client(s) from
  RuntimeConfig at build time.
- Feature aliases rewired: openai=nvisy-rig/*,
  google=nvisy-rig/google-gemini + nvisy-ocr/google-vision,
  microsoft=nvisy-ocr/azure-docai, amazon=nvisy-ocr/aws-textract.

HTTP polish (both crates)
- http/middleware/{retry,tracing}.rs collapsed into a single
  http/middleware.rs exposing backoff_policy, retry_layer,
  tracing_layer.
- HttpConfig fields switch from u64 seconds to std::time::Duration
  with humantime_serde for `timeout`, `connect_timeout`,
  `idle_timeout`. Accepts "120s", "2min", "500ms", etc. Added
  humantime-serde as a workspace dep.

nvisy-provider preservation
- Old crate moved to .ignore/nvisy-provider-old for comparison;
  not deleted, not in the workspace.

Verification: workspace builds, ~400 tests pass, clippy clean,
nightly fmt applied, docs build with no new warnings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bottom

- nvisy-pattern: stop linking from public docs to private items
  (scan, RawMatch, ContextEnhancer, DenyList::scanner); rename stale
  DictionaryFilter link to PatternFilter; switch AllowList's
  PatternEngine::scan_entities link to a crate-path reference so
  rustdoc can resolve it.
- nvisy-rig: stop linking EntityVerifier's public doc to the
  crate-private BaseAgent.
- Convert every remaining inline rustdoc link of the form
  [`Foo`](path::to::Foo) to reference-style: in-prose [`Foo`] with
  the path defined once at the bottom of the doc block.

Workspace doc builds with no rustdoc warnings (the one remaining
filename-collision warning is a structural cargo issue: nvisy-cli
defines a binary named "nvisy-server" that conflicts with the
nvisy-server lib's doc output path. Renaming the binary is a
behavioural change and out of scope here.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@martsokha martsokha merged commit 657171d into main May 20, 2026
5 checks passed
@martsokha martsokha deleted the feat/nvisy-nlp branch May 20, 2026 23:42
@martsokha martsokha added the labels core (nvisy-core: content model, errors, types); rig (nvisy-rig: LLM agents (NER, CV, generate, STT/TTS, entity-verifier) over rig); and refactor (code restructuring without behavior change) May 21, 2026