feat(nvisy-nlp): new NLP crate + nvisy-engine wiring #155
Merged
Scope C, trimmed to what Rust can actually deliver maintainably: NER, language detection, tokenization. No lemmas in v1 (see #154: no maintained Rust lemmatizer exists, nothing currently consumes them, trait absorbs them later non-breakingly).
## Architecture
Composable trait surface, in contrast to Presidio's monolithic NlpEngine. Each concern is independently implementable:
- `NerBackend` (async): NER over text
- `LanguageDetector` (sync): language ID
- `Tokenizer` (sync, fallible): token + offset extraction
Composite `NlpEngine` orchestrates them in Presidio order (detect language → run NER with hint → tokenize → derive keywords) and has a second entrypoint `analyze_in_language(text, lang)` for callers who already know the document language. `NerBackend` and `LanguageDetector` are required at build time; `Tokenizer` is optional.
## v1 implementations
- `OrtNerBackend`: HuggingFace token-classification model via `ort` (ONNX Runtime, load-dynamic). Inference behind an `Inferencer` trait so unit tests use canned logits, not a model file. BIO-tag folding + label-map lookup produce `Entities`.
- `LinguaLanguageDetector`: lingua-rs, real confidence scores, multilingual segmentation via `detect_multiple`, two constructors (`for_languages` recommended, `for_all_languages` for genuinely unknown). Lingua defaults to high-accuracy mode.
- `UnicodeTokenizer`: `unicode-segmentation` based, model-free. Optional stopword set via `stop-words`.
- `HfTokenizer`: `tokenizers` crate wrapper. Loads `tokenizer.json`, produces tokens with HF offsets (for ORT NER alignment). Optional stopword set.
- `NoopNerBackend`: empty entities; for tests.
## Data carrier
`NlpArtifacts` mirrors Presidio's field set (entities, language, tokens, keywords) reshaped for typed Rust access. No `lemmas`, no spaCy `Doc` backref, no parallel offset arrays.
## Public surface decisions
- `LanguageDetection { language, confidence: Option<f64> }` and `LanguageSpan { start, end, language, confidence }` carried on the trait; engine strips down to `LanguageTag` for `NlpArtifacts` (consumers wanting confidence call the detector directly).
- `Token { start, end, text, is_stop }`: no `is_punct`; the detection was unreliable across both impls and nothing reads it.
- `Tokenizer::tokenize` is fallible (`Result<Vec<Token>, NlpError>`). HF errors propagate instead of silent-empty; Unicode impl is always `Ok`.
## Distribution / model story
User-provided model paths only. No bundling, no auto-download. `load-dynamic` for ORT means users install `libonnxruntime` out-of-band (Homebrew on macOS, distro package on Linux). Documented in `README.md` and `DESIGN.md`.
## Deferred (tracked)
- GlotLID v3 as an alternative language detector.
- TiktokenTokenizer for LLM context-window counting.
- LinderaTokenizer for CJK content.
- Lemmatization (see #154).
- Burn as an alternative NER backend.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Renames `EntityRecognition` (LLM-driven NER) to `LlmRecognition` and adds a new `NerRecognition` operation that wraps `Arc<dyn nvisy_nlp::NerBackend>` and runs sequentially over text spans.
The detection phase now runs three operations in order:
1. `LlmRecognition`: silently skipped if no LLM provider configured
2. `NerRecognition`: silently skipped if no backend on the run ctx
3. `PatternRecognition`: always runs
Wiring:
- `Engine::with_ner_backend(Arc<dyn NerBackend>)` builder method to attach an offline backend at engine init.
- `EngineInner.ner_backend` carries it across runs.
- `Pipeline::new` accepts and forwards it.
- `RunContext.ner_backend` makes it available to the orchestrator.
Behaviour unchanged when no backend is configured: existing pipelines continue to run LLM NER + patterns, no offline-NER overhead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts, fallible LanguageDetector
Doc + naming + ergonomics polish on top of the new crate. Squashed into one commit because the changes interleave on the same files.
## Naming
- `NlpError` -> `Error` + `Result<T, E = Error>` alias; the crate now uses `Result<T>` everywhere internally.
- `NlpArtifacts` -> `Artifacts`.
- `tokenizer/hf.rs` -> `tokenizer/hugging_face.rs` (file rename; type name `HfTokenizer` unchanged).
## Trait change
`LanguageDetector` is now fallible:
- `detect(&str) -> Result<Option<LanguageDetection>>`: splits "no answer" (`Ok(None)`, e.g. short or ambiguous text) from real backend failure (`Err(_)`).
- `detect_multiple(&str) -> Result<Vec<LanguageSpan>>`: default impl returns a single full-text span built from `detect`; impls with real segmentation override.
`LinguaLanguageDetector` propagates these `Result`s. Tests updated.
## Framing
- Dropped all "offline" labelling from README, Cargo description, module docs, struct docs, tracing strings. The trait surface is transport-agnostic; LLM-mediated NER lives in `nvisy-llm` by deliberate crate split, not by trait restriction. README and inline docs now say so explicitly.
- Deleted `DESIGN.md`; the README now carries the small amount of framing actually needed.
## Builder
Dropped `with_ner_arc` and `with_language_detector_arc` from `NlpEngineBuilder`; the `with_ner` / `with_language_detector` methods (which take any `T: NerBackend/LanguageDetector + 'static`) cover the same use cases.
## Layout
- `artifacts/` is a folder with `mod.rs` (Artifacts) and `token.rs` (Token).
- `language/` is a folder with `mod.rs` (trait + LanguageDetection + LanguageSpan) and `lingua.rs` (LinguaLanguageDetector).
- `error.rs` is a single file (no folder).
## nvisy-engine side
Same sweep applied to the consumer: removed "offline" framing from `ner_recognition`, `llm_recognition`, `detection/mod`, orchestrator and engine docs. The trait surface and operation behaviour are unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Doc-only + style sweep across the crate:
- Hoist all intra-doc links and external URL links to bottom-of-block reference definitions, matching the existing style. Inline `[Foo](crate::path::Foo)` and `[bar](https://...)` become `[Foo]` and `[bar]` with reference defs below.
- Escape stray `[\`nvisy-nlp\`]` crate-name strings that rustdoc was mistakenly parsing as intra-doc links; they're now plain backticks.
- Drop the dangling "See DESIGN.md" line from ner/ort.rs (file was deleted last round).
- `cargo fmt` rewraps.
- `cargo clippy --workspace --all-targets` clean (one `manual_contains` lint fixed in `stopword_lang::is_supported`).
- `cargo doc -p nvisy-nlp --no-deps` clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Acts on every item from the second-round code review. Changes by
category, with reviewer's anchor codes in parens.
## Bugs
- (A1) OrtNerConfig.max_sequence_length is now wired: the tokenizer
is configured with TruncationParams at construction time, so long
inputs are truncated instead of triggering a shape mismatch at
inference time.
- (A2) OrtNerBackend.recognize now validates the language hint
against `supported_languages` (when configured) and returns
Error::UnsupportedLanguage on mismatch. Empty list still means
"any language" per the trait doc.
- (A3) Inference moved off the async-runtime thread via
tokio::task::spawn_blocking. OrtNerBackend.state lives behind an
Arc so it can cross the spawn boundary cheaply.
- (A4) Inferencer trait returns flat (Vec<f32>, num_labels) instead
of Vec<Vec<f32>>. 512 heap allocations per inference become one.
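To make the flat layout in (A4) concrete, here is a minimal sketch of reading per-token rows out of a single row-major buffer. The helper name `argmax_per_token` and the exact argument shape are assumptions for illustration; the crate's actual `Inferencer` signature may differ.

```rust
/// Hypothetical helper (not the crate's API): picks the highest-scoring
/// label index for every token from a flat, row-major logits buffer.
/// Token `i` occupies `logits[i * num_labels .. (i + 1) * num_labels]`.
fn argmax_per_token(logits: &[f32], num_labels: usize) -> Vec<usize> {
    assert!(num_labels > 0, "num_labels must be non-zero");
    assert_eq!(
        logits.len() % num_labels,
        0,
        "logits length must be a multiple of num_labels"
    );
    logits
        // One borrowed row per token; no per-token heap allocation.
        .chunks_exact(num_labels)
        .map(|row| {
            let mut best = 0;
            for (i, &value) in row.iter().enumerate() {
                if value > row[best] {
                    best = i;
                }
            }
            best
        })
        .collect()
}
```

With one contiguous `Vec<f32>`, each row is just a borrowed slice, which is where the saving over `Vec<Vec<f32>>` comes from.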
## Ergonomics
- (B1, E1) Inferencer + OrtInferencer are pub(crate); external
backends implement NerBackend directly, no longer locked into the
internal trait surface.
- (B2) OrtNerConfig::with_conll03_english helper for the common
PER/ORG/LOC/MISC label set.
- (B3) Replaced hardcoded 0.85 confidence with real per-span
softmax confidence — argmax_softmax computes both the winning
label and its probability in one pass, fold_predictions averages
per-token probs across the span.
- (B5) Type-state NlpEngineBuilder. NoNer/WithNer + NoLang/WithLang
marker types make `.build()` only callable once both required
components are attached. The runtime `expect` panics become
unreachable.
- (B6) LanguageProvenance { Detected, Asserted } enum on
LanguageDetection. analyze_in_language stamps Asserted;
LinguaLanguageDetector stamps Detected. Callers no longer have to
reverse-engineer provenance from `confidence: None`.
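The per-span confidence in (B3) can be sketched as follows. This is an illustration under assumptions, not the crate's code: `span_confidence` is a hypothetical name, and this version of `argmax_softmax` takes two passes over the row (find the max, then sum) for numerical stability rather than the single pass the commit describes.

```rust
/// Returns (winning label index, softmax probability of that label)
/// for one token's logits, or `None` for an empty row.
fn argmax_softmax(row: &[f32]) -> Option<(usize, f32)> {
    let (best, &max) = row
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    // Subtracting the max keeps exp() from overflowing. The winning
    // label contributes exp(0) = 1, so its probability is 1 / denom.
    let denom: f32 = row.iter().map(|&x| (x - max).exp()).sum();
    Some((best, 1.0 / denom))
}

/// Hypothetical helper: averages per-token probabilities across a span.
fn span_confidence(token_probs: &[f32]) -> Option<f32> {
    if token_probs.is_empty() {
        return None;
    }
    Some(token_probs.iter().sum::<f32>() / token_probs.len() as f32)
}
```

Averaging is only one way to combine token probabilities; a minimum or a product would be stricter, so treat the choice here as following the commit text rather than as a recommendation.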
## Polish
- (B4) HfTokenizer doc explicitly notes that `is_stop` doesn't work
for subword tokenizers (BPE/WordPiece/Unigram emit fragments, not
whole words). Recommends UnicodeTokenizer for word-level filtering.
- (C1, C2) OrtState moves the tokenizer into Arc-shared storage and
drops everything not needed at inference time (full
OrtNerConfig held forever was the original cut corner).
- (C3) label_order rejects `"O"` in the user-supplied label_map at
construction with a clear error.
- (C4) lingua_to_tag uses a OnceLock<Mutex<HashSet>>-backed
warn-once cache. Repeated failures for the same ISO code log
once per process instead of per call.
- (C6) `recognize_sync` -> `recognize_blocking`. Actually used now,
by the new tests and as the body run inside spawn_blocking.
- (C7, D1, D2) Real unit tests for fold_predictions: a minimal
hand-rolled tokenizer.json + CannedInferencer drives end-to-end
span construction including multi-span output, BIO continuation,
language-hint validation paths.
- (D3) NoopNerBackend::default() exercised in a test.
- (E2) Dedicated stopword_lang.rs collapsed into tokenizer/mod.rs
as `SUPPORTED_STOPWORD_LANGUAGES` const + helper fn.
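The warn-once cache from (C4) follows a common standard-library pattern. A minimal sketch, assuming a hypothetical helper named `first_warning_for` (the real `lingua_to_tag` internals are not shown in this PR text):

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Hypothetical helper: returns `true` the first time `code` is seen in
/// this process and `false` on every later call, so the caller can log
/// a warning only once per distinct code.
fn first_warning_for(code: &str) -> bool {
    static SEEN: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();
    let seen = SEEN.get_or_init(|| Mutex::new(HashSet::new()));
    // Recover the guard even if another thread panicked while holding it;
    // a poisoned warn-once cache is not worth propagating a panic for.
    let mut guard = seen.lock().unwrap_or_else(|e| e.into_inner());
    // `HashSet::insert` returns true only when the value was not present.
    guard.insert(code.to_owned())
}
```

A caller would wrap its logging call in `if first_warning_for(code) { ... }`.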
## Docs
- All "offline NLP" framing dropped in earlier round; this commit
also fixes the broken intra-doc link to UnicodeTokenizer from
HfTokenizer's struct doc.
## Test count
21 -> 28 tests, all pass. Workspace builds + clippy clean + fmt +
doc warnings clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, fold engine module
Engine module is now a single file (engine/mod.rs holds Engine; builder stays in engine/builder.rs). LanguageDetection (with LanguageProvenance) and LanguageSpan moved out of language/mod.rs into their own files so the module root only carries the LanguageDetector trait + re-exports.
Engine name collides with nvisy_engine::Engine at use sites; callers importing both will need to alias one.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, NoopNer behind test-utils
nvisy-ontology
- Add Confidence(f64) primitive with Confidence::new -> Option (NonZero-style),
Confidence::get accessor, serde-transparent. Rejects NaN/inf and values outside [0,1].
- Migrate Entity.confidence from f64 to Confidence. EntitySelector and policy
comparisons now go through .get().
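A validated newtype along the lines described above might look like this. It is a sketch only: the serde-transparent derive is omitted to keep the block dependency-free, and `from_model_score` is a hypothetical helper illustrating the clamp-at-the-boundary idea.

```rust
/// Sketch of a confidence value guaranteed to be a finite number in [0, 1].
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Confidence(f64);

impl Confidence {
    /// Returns `None` for NaN, infinities, and anything outside [0, 1].
    pub fn new(value: f64) -> Option<Self> {
        // NaN fails every comparison, so the range check rejects it too.
        if (0.0..=1.0).contains(&value) {
            Some(Self(value))
        } else {
            None
        }
    }

    /// Returns the wrapped value.
    pub fn get(self) -> f64 {
        self.0
    }
}

/// Hypothetical boundary helper: absorbs small float overshoot from a
/// model score by clamping before validation. NaN stays NaN under
/// `clamp`, so it is still rejected.
pub fn from_model_score(raw: f64) -> Option<Confidence> {
    Confidence::new(raw.clamp(0.0, 1.0))
}
```

Returning `Option` from the constructor pushes the invalid-score decision to each construction site, which is the same trade-off `NonZero`-style types make.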
nvisy-nlp
- LanguageSpan shrinks to {start, end}. LanguageDetection gains
span: Option<LanguageSpan> and confidence: Option<Confidence>.
- LanguageDetector trait: detect + detect_in both return Vec<LanguageDetection>.
Empty vec means "no answer"; mixed-language input produces one entry per region.
Drop detect_multiple. LinguaLanguageDetector populates span from lingua's
multi-language API.
- Rename Request -> Context (file too). Drop Response: Engine::analyze now returns
Artifacts directly. correlation_id stays on Context, input-only for the tracing span.
- Replace type-state EngineBuilder with derive_builder + custom setters; the four
marker types are gone from the public API. .build() returns Result<Engine, EngineBuilderError>.
- Gate NoopNerBackend behind a new test-utils feature so production builds don't
ship the no-op fallback.
nvisy-pattern, nvisy-provider, nvisy-engine
- Wrap entity.confidence at every construction site (Confidence::new(..clamp(0,1))
at ML/LLM boundaries to absorb float rounding).
- Read entity.confidence.get() at every comparison site (selectors, dedup
strategies, redaction thresholds).
- Bump nvisy-ontology dev-deps to enable test-utils where tests build Entities.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oes internal
Reshape language module around a public factory trait.
- LanguageDetector trait drops to pub(crate). One method: detect(&str). Language scope is baked in at construction time; no detect_in.
- New public LanguagePolicy trait with associated Detector type + detector_for_all / detector_for(&[LanguageTag]) factory methods. Engine asks the policy for a fresh detector each call.
- Crate-private DynLanguagePolicy shim (object-safe twin, blanket-impl'd for every LanguagePolicy) lets Engine hold Arc<dyn DynLanguagePolicy> while the public trait keeps its associated type.
- impl<P: LanguagePolicy> LanguagePolicy for Arc<P> for cheap sharing.
- LinguaLanguagePolicy unit-like factory; LinguaLanguageDetector keeps pub struct visibility (needed for the associated type) but constructors become pub(crate) so external code only reaches it through the policy.
- EngineBuilder::with_language_detector renamed to with_language_policy. Asserted-language path (Context::language) bypasses the policy.
- Context::candidate_languages forwards to LanguagePolicy::detector_for every call; LinguaLanguagePolicy silently falls back to detector_for_all when no requested tag is recognised.
Module layout:
- language/detection.rs holds LanguageDetection, LanguageProvenance, and LanguageSpan (merged from former lang_detection.rs + lang_span.rs).
- language/dyn_policy.rs holds DynLanguagePolicy + the Arc<P> forwarder.
- language/lingua.rs holds both LinguaLanguageDetector and LinguaLanguagePolicy.
- language/mod.rs holds LanguageDetector (pub(crate)) and LanguagePolicy (pub).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
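The policy/detector split and the object-safe shim described in this commit can be sketched in plain Rust. Everything below is illustrative: `LanguageTag` is reduced to a `String`, the detector returns bare tags, `FixedPolicy` / `FixedDetector` are toy stand-ins, the method names on the shim are invented, and the `Arc<P>` forwarder is left out.

```rust
use std::sync::Arc;

/// Stand-in for the crate's language tag type (assumption: a plain string).
type LanguageTag = String;

trait LanguageDetector {
    fn detect(&self, text: &str) -> Vec<LanguageTag>;
}

/// Public factory trait: the associated type makes it non-object-safe.
trait LanguagePolicy {
    type Detector: LanguageDetector + 'static;
    fn detector_for_all(&self) -> Self::Detector;
    fn detector_for(&self, languages: &[LanguageTag]) -> Self::Detector;
}

/// Object-safe twin: erases the associated type behind a Box.
trait DynLanguagePolicy {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector>;
    fn dyn_detector_for(&self, languages: &[LanguageTag]) -> Box<dyn LanguageDetector>;
}

/// Blanket impl: every LanguagePolicy is usable as a DynLanguagePolicy.
impl<P: LanguagePolicy> DynLanguagePolicy for P {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for_all())
    }
    fn dyn_detector_for(&self, languages: &[LanguageTag]) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for(languages))
    }
}

/// Toy detector: reports its configured languages for any non-empty text.
struct FixedDetector(Vec<LanguageTag>);

impl LanguageDetector for FixedDetector {
    fn detect(&self, text: &str) -> Vec<LanguageTag> {
        if text.is_empty() { Vec::new() } else { self.0.clone() }
    }
}

/// Toy unit-like policy, loosely analogous to a unit-like factory.
struct FixedPolicy;

impl LanguagePolicy for FixedPolicy {
    type Detector = FixedDetector;
    fn detector_for_all(&self) -> FixedDetector {
        FixedDetector(vec!["en".to_owned()])
    }
    fn detector_for(&self, languages: &[LanguageTag]) -> FixedDetector {
        FixedDetector(languages.to_vec())
    }
}

/// The engine only ever sees the erased form.
struct Engine {
    policy: Arc<dyn DynLanguagePolicy>,
}

impl Engine {
    /// Asks the policy for a fresh detector on every call.
    fn detect(&self, text: &str, candidates: &[LanguageTag]) -> Vec<LanguageTag> {
        let detector = if candidates.is_empty() {
            self.policy.dyn_detector_for_all()
        } else {
            self.policy.dyn_detector_for(candidates)
        };
        detector.detect(text)
    }
}
```

The point of the twin trait is that implementors write against the typed `LanguagePolicy` while the engine stores a single `Arc<dyn DynLanguagePolicy>` without becoming generic itself.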
…tEnhancer, serde wire shape
Per-call context hints (Presidio's analyze(context=[...]) equivalent, with
per-kind targeting on top):
- New ContextHint { kind: Option<EntityKind>, keywords, window?, boost? }.
ScanContext::hints: Vec<ContextHint>. The enhancer picks at most one
bucket per match: the entry with kind == Some(match.entity_kind), falling
back to the first kind == None bucket. Prevents a "DOB" hint from
silently boosting an SSN pattern.
- Pull RawMatch::apply_context_adjustment into a new pub(crate)
ContextEnhancer in engine/scan/enhancer.rs. Strict-Presidio: patterns
without a ContextRule are skipped regardless of hints.
- Per-call window/boost overrides only fire when hint.keywords is non-empty
(avoids silent retuning).
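The bucket-selection rule above (exact kind first, then the first untargeted bucket) can be sketched as below. The types are trimmed stand-ins: `EntityKind` has only two illustrative variants, `ContextHint` omits `window` and `boost`, and `select_hint` is a hypothetical name rather than the enhancer's real method.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntityKind {
    Ssn,
    DateOfBirth,
}

#[derive(Debug, Clone, PartialEq)]
struct ContextHint {
    /// `Some(kind)` targets one entity kind; `None` applies to any kind.
    kind: Option<EntityKind>,
    keywords: Vec<String>,
}

/// Picks at most one hint bucket for a match: the entry targeted at the
/// match's kind if there is one, otherwise the first untargeted entry.
fn select_hint(hints: &[ContextHint], match_kind: EntityKind) -> Option<&ContextHint> {
    hints
        .iter()
        .find(|h| h.kind == Some(match_kind))
        .or_else(|| hints.iter().find(|h| h.kind.is_none()))
}
```

Because a hint targeted at a different kind is never selected, a `DateOfBirth` hint cannot boost an `Ssn` match unless an untargeted bucket is also present.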
Wire shape:
- AllowList, DenyList, DenyRule, ContextHint, ScanContext gain serde derives.
- AllowList / DenyList serialize transparently (HashSet<String> / HashMap<String, DenyRule>).
- DenyList's lazily-compiled scanner is #[serde(skip)] and rebuilt on first scan.
- Round-trip test locks in the JSON shape.
File reshuffle:
- engine/filter/scan_context.rs deleted; ScanContext lives in filter/mod.rs.
- engine/pattern_engine.rs deleted; PatternEngine + Debug impl + scan_entities
/ scan_raw + DEFAULT_ENGINE singleton fold into engine/mod.rs.
- engine/filter/deny_list.rs split: DenyScanner moves to its own
engine/filter/deny_scanner.rs.
- engine/filter/context_hint.rs holds the new ContextHint type (singular
because ScanContext::hints is a Vec).
- PatternEngine field visibility tightens from pub(super) to
pub(in crate::engine) so the post-merge visibility matches the original
"siblings in crate::engine only" intent.
Tests: 11 new enhancer tests including the kind-targeted boost isolation
test (kind_targeted_hint_does_not_apply_to_other_kinds) plus 2 serde
round-trip tests. 83 unit + 4 + 5 integration tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…no longer depends on nlp/pattern
Mirrors Presidio's presidio-analyzer split: detection becomes its own crate
with a Recognizer trait + DetectionEngine, and nvisy-engine reduces to
orchestration.
New crate nvisy-detection
- Recognizer trait (async recognize + reset; reset is a no-op default).
- NerRecognizer wraps Arc<nvisy_nlp::Engine> and translates DetectionContext
to nvisy_nlp::Context. Behavior change vs the old NerRecognition: now
routes through the full Engine (language detection + post-filter + tokens
+ keywords) instead of bypassing it.
- PatternRecognizer wraps nvisy_pattern::PatternEngine with the
Shared/Arc/Owned variants preserved from the old PatternEngineRef.
- LlmRecognizer wraps NerAgent + DetectionConfig; overrides Recognizer::reset
to clear coreference state.
- DetectionEngine orchestrates Vec<Arc<dyn Recognizer>> sequentially. detect()
runs each; reset() fans out to clear per-document state at document
boundaries.
- DetectionContext bundles per-call inputs (text, language, candidate_languages,
entities, score_threshold, scan_context, correlation_id).
- extension::RebaseEntities (moved from nvisy-engine) shifts recognizer
output from context-local to document-relative byte offsets.
- Error::{Recognizer{name,cause}, Misconfigured} with From → nvisy_core::Error.
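The offset shift that `extension::RebaseEntities` is described as performing can be illustrated with a small sketch. `Span` and `rebase` are hypothetical names, and the real extension operates on entities rather than bare spans.

```rust
/// Stand-in for an entity's byte range.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Shifts context-local byte offsets to document-relative ones, given the
/// byte offset at which the context begins inside the document.
fn rebase(spans: &mut [Span], context_start: usize) {
    for span in spans.iter_mut() {
        span.start += context_start;
        span.end += context_start;
    }
}
```

For example, a span at bytes 0..4 inside a context that starts at byte 10 of the document becomes 10..14.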
nvisy-engine
- Drops nvisy-pattern and nvisy-nlp from its Cargo.toml. nvisy-engine now
only depends on nvisy-detection (and nvisy-core, codec, ontology, provider)
for detection.
- operation/detection/ folder (5 files: LlmRecognition, NerRecognition,
PatternRecognition, PatternEngineRef, RebaseEntities) replaced by a single
operation/detection.rs holding Detection { engine: Arc<DetectionEngine> }.
- EngineInner::ner_backend: Option<Arc<dyn NerBackend>> →
detection_engine: Option<Arc<DetectionEngine>>.
- Engine::with_ner_backend → Engine::with_detection_engine. Users compose
the detection engine externally with whatever recognizers they want and
hand the assembled Arc<DetectionEngine> to the runtime engine.
- Pipeline / RunContext renamed accordingly.
- Orchestrator::run_detection collapses to a single Detection op invocation.
Workflow DetectionConfig is currently ignored at the engine boundary
(recognizer composition happens externally); threading per-call hints
from the cfg into DetectionContext is left for a follow-up.
nvisy-nlp
- Engine::analyze rewritten to use tracing::Instrument instead of
span.entered() so the returned future is Send. Required by NerRecognizer's
async_trait impl. No behavior change.
nvisy-pattern
- ScanContext gains Clone so it can ride on DetectionContext (Clone).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Remove crates/nvisy-python (pyo3 bridge + exif Python module).
- Remove packages/ (uv workspace: nvisy-hf, nvisy-exif).
- Remove root pyproject.toml.
- Drop nvisy-python from workspace members and dependencies in root Cargo.toml.
- Drop pyo3, pyo3-async-runtimes, pythonize from workspace dependencies.
- Remove all four Install Python steps and the PYO3_USE_ABI3_FORWARD_COMPATIBILITY env from CI build workflow.
- Strip Python entries (__pycache__, *.py[cod], .venv, *.egg-info, .ruff_cache, .pytest_cache) from .gitignore.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ient threading
Splits the former nvisy-provider into two focused crates and moves
agent/HTTP wiring out of nvisy-engine.
New crate nvisy-rig
- LLM agents over the rig framework: base scaffolding, NER, CV,
generate, audio (STT/TTS), and entity_verifier.
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: openai-gpt, openai-whisper, openai-tts,
anthropic-claude, google-gemini.
- Does NOT depend on nvisy-ocr.
New crate nvisy-ocr
- Non-LLM OCR providers (Surya, PaddleX, AWS Textract,
Google Vision, Azure DocAI).
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: aws-textract, google-vision, azure-docai.
agent/ocr split
- LLM-side prompting + verification logic moves to
nvisy_rig::agent::entity_verifier::EntityVerifier. verify(image,
&[ProposedEntity]) -> VerificationOutput. No knowledge of OCR
providers.
- OCR-engine + verifier orchestration moves into
nvisy_engine::operation::extraction::vision. The engine now
drives the OCR -> propose entities -> LLM verify -> merge flow
directly instead of routing through a unified OcrAgent.
nvisy-engine reshape
- Drops nvisy-provider; adds nvisy-rig + nvisy-ocr.
- EngineInner::http_client field removed.
- Pipeline::new no longer takes HttpClient.
- RunContext::http_client removed.
- Engine::http_client() accessor removed.
- Extraction::new, VisualExtraction::new, AudialExtraction::new no
longer take &HttpClient. Each constructs its own client(s) from
RuntimeConfig at build time.
- Feature aliases rewired: openai=nvisy-rig/*,
google=nvisy-rig/google-gemini + nvisy-ocr/google-vision,
microsoft=nvisy-ocr/azure-docai, amazon=nvisy-ocr/aws-textract.
HTTP polish (both crates)
- http/middleware/{retry,tracing}.rs collapsed into a single
http/middleware.rs exposing backoff_policy, retry_layer,
tracing_layer.
- HttpConfig fields switch from u64 seconds to std::time::Duration
with humantime_serde for `timeout`, `connect_timeout`,
`idle_timeout`. Accepts "120s", "2min", "500ms", etc. Added
humantime-serde as a workspace dep.
nvisy-provider preservation
- Old crate moved to .ignore/nvisy-provider-old for comparison;
not deleted, not in the workspace.
Verification: workspace builds, ~400 tests pass, clippy clean,
nightly fmt applied, docs build with no new warnings.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bottom
- nvisy-pattern: stop linking from public docs to private items (scan, RawMatch, ContextEnhancer, DenyList::scanner); rename stale DictionaryFilter link to PatternFilter; switch AllowList's PatternEngine::scan_entities link to a crate-path reference so rustdoc can resolve it.
- nvisy-rig: stop linking EntityVerifier's public doc to the crate-private BaseAgent.
- Convert every remaining inline rustdoc link of the form [`Foo`](path::to::Foo) to reference-style: in-prose [`Foo`] with the path defined once at the bottom of the doc block.
Workspace doc builds with no rustdoc warnings (the one remaining filename-collision warning is a structural cargo issue: nvisy-cli defines a binary named "nvisy-server" that conflicts with the nvisy-server lib's doc output path. Renaming the binary is a behavioural change and out of scope here.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
New `nvisy-nlp` crate for NLP (NER, language detection, and tokenization) plus the `nvisy-engine` wiring that consumes it.
nvisy-nlp
Composable trait surface (`NerBackend` + `LanguageDetector` + `Tokenizer`) with a composite `NlpEngine` that orchestrates them in Presidio order (detect language → run NER with hint → tokenize → derive keywords). Second entrypoint `analyze_in_language(text, lang)` bypasses detection when the caller already knows the language.
Trait surface is transport-agnostic. v1 ships local-by-default implementations; LLM-mediated NER lives in `nvisy-llm` by deliberate crate split, not by trait restriction. Any third-party backend can implement `NerBackend` over any transport.
v1 implementations:
- `OrtNerBackend`: HuggingFace token-classification via `ort` (ONNX Runtime, load-dynamic). Real per-span softmax confidence. Truncates inputs at `max_sequence_length`. Validates language hints against `supported_languages`. Inference dispatched via `tokio::task::spawn_blocking` so it doesn't starve the async runtime.
- `LinguaLanguageDetector`: lingua-rs, real confidence scores, multilingual segmentation, two constructors (`for_languages` recommended, `for_all_languages` for unknown).
- `UnicodeTokenizer`: `unicode-segmentation` based, model-free.
- `HfTokenizer`: wraps `tokenizers::Tokenizer` for ORT NER alignment.
- `NoopNerBackend`: for tests.
Key surface decisions:
- `NerBackend` + `LanguageDetector` required at build; `Tokenizer` optional. Builder uses type-state so calling `.build()` without both required components is a compile error, not a runtime panic.
- `Tokenizer::tokenize` returns `Result<Vec<Token>>`.
- `LanguageDetector::detect` returns `Result<Option<LanguageDetection>>`: splits "no answer" (`Ok(None)`) from real failure (`Err`).
- `LanguageDetector::detect_multiple` returns `Result<Vec<LanguageSpan>>` for mixed-language documents; default impl falls back to single-language detect.
- `LanguageDetection { language, confidence: Option<f64>, provenance: LanguageProvenance }`: distinguishes detected from caller-asserted.
- `Result<T>` alias defaulting to `Error`; `pub use nvisy_nlp::{Error, Result}` at the crate root.
- `Token { start, end, text, is_stop }`: no `is_punct` (broken across both impls; reviewer caught it).
- `libonnxruntime` out-of-band.
nvisy-engine wiring
- `EntityRecognition` (LLM-driven) → `LlmRecognition`.
- `NerRecognition` wrapping `Arc<dyn nvisy_nlp::NerBackend>`.
- `Engine::with_ner_backend(...)` builder method for engine init.
Behaviour unchanged when no backend is configured.
Layout
Deferred (tracked as follow-ups)
Test plan
- `cargo build --workspace` clean
- `cargo test --workspace`: all ~440 tests pass, 28 in nvisy-nlp
- `cargo clippy --workspace --all-targets` clean
- `cargo doc -p nvisy-nlp --no-deps` clean (no broken intra-doc links)
- `cargo fmt` clean
- `OrtNerBackend` and run against documents
🤖 Generated with Claude Code