feat(nvisy-nlp): new NLP crate + nvisy-engine wiring by martsokha · Pull Request #155 · nvisycom/runtime

martsokha · 2026-05-20T13:42:50Z

Summary

New nvisy-nlp crate for NLP — NER, language detection, and tokenization — plus the nvisy-engine wiring that consumes it.

nvisy-nlp

Composable trait surface (NerBackend + LanguageDetector + Tokenizer) with a composite NlpEngine that orchestrates them in Presidio order (detect language → run NER with hint → tokenize → derive keywords). Second entrypoint analyze_in_language(text, lang) bypasses detection when the caller already knows the language.

Trait surface is transport-agnostic. v1 ships local-by-default implementations; LLM-mediated NER lives in nvisy-llm by deliberate crate split, not by trait restriction. Any third-party backend can implement NerBackend over any transport.

v1 implementations:

OrtNerBackend — HuggingFace token-classification via ort (ONNX Runtime, load-dynamic). Real per-span softmax confidence. Truncates inputs at max_sequence_length. Validates language hints against supported_languages. Inference dispatched via tokio::task::spawn_blocking so it doesn't starve the async runtime.
LinguaLanguageDetector — lingua-rs, real confidence scores, multilingual segmentation, two constructors (for_languages recommended, for_all_languages for unknown).
UnicodeTokenizer — unicode-segmentation based, model-free.
HfTokenizer — wraps tokenizers::Tokenizer for ORT NER alignment.
NoopNerBackend — for tests.
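
The "real per-span softmax confidence" mentioned for OrtNerBackend can be illustrated with a minimal, std-only sketch. It assumes row-major flat logits of shape `[tokens * num_labels]`; the function names and signatures here (`argmax_softmax`, `per_token`, `span_confidence`) are illustrative and are not taken from the crate.

```rust
/// Returns the winning label index and its softmax probability for one
/// token's logits, or `None` for an empty slice.
fn argmax_softmax(logits: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        // Strict comparison keeps the first index on ties.
        if best.map_or(true, |(_, m)| x > m) {
            best = Some((i, x));
        }
    }
    let (idx, max) = best?;
    // Subtracting the max keeps exp() stable; the winner's numerator is exp(0) = 1.
    let denom: f32 = logits.iter().map(|&x| (x - max).exp()).sum();
    Some((idx, 1.0 / denom))
}

/// Splits a flat logits buffer into per-token rows and classifies each row.
/// (Assumed layout: row-major, `num_labels` values per token.)
fn per_token(flat: &[f32], num_labels: usize) -> Vec<(usize, f32)> {
    if num_labels == 0 {
        return Vec::new();
    }
    flat.chunks_exact(num_labels).filter_map(argmax_softmax).collect()
}

/// Averages per-token probabilities across one span.
fn span_confidence(token_probs: &[f32]) -> Option<f32> {
    if token_probs.is_empty() {
        return None;
    }
    Some(token_probs.iter().sum::<f32>() / token_probs.len() as f32)
}

fn main() {
    // Two tokens, two labels each.
    let flat = [0.0_f32, 0.0, 1.0, 3.0];
    let preds = per_token(&flat, 2);
    let probs: Vec<f32> = preds.iter().map(|&(_, p)| p).collect();
    println!("{preds:?} -> {:?}", span_confidence(&probs));
}
```

Computing the winning label and its probability in a single pass over each row avoids materialising a full probability vector per token.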

Key surface decisions:

NerBackend + LanguageDetector required at build; Tokenizer optional. Builder uses type-state so calling .build() without both required components is a compile error, not a runtime panic.
Tokenizer::tokenize returns Result<Vec<Token>>.
LanguageDetector::detect returns Result<Option<LanguageDetection>> — splits "no answer" (Ok(None)) from real failure (Err).
LanguageDetector::detect_multiple returns Result<Vec<LanguageSpan>> for mixed-language documents; default impl falls back to single-language detect.
LanguageDetection { language, confidence: Option<f64>, provenance: LanguageProvenance } — distinguishes detected from caller-asserted.
Crate-wide Result<T> alias defaulting to Error; pub use nvisy_nlp::{Error, Result} at the crate root.
Token { start, end, text, is_stop } — no is_punct (broken across both impls; reviewer caught it).
User-provided model paths; no bundling, no auto-download. Users install libonnxruntime out-of-band.
No lemmas (nvisy-nlp: lemmatization deferred: decide path when a consumer needs it #154 — no maintained Rust lemmatizer).
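
To show how a type-state builder turns a missing required component into a compile error, here is a small sketch. The marker names and the use of plain strings as stand-ins for the NER backend and language detector are assumptions for illustration; the crate's real builder has a different shape.

```rust
// Marker/storage types: the "With*" variants carry the attached component.
struct NoNer;
struct WithNer(String);
struct NoLang;
struct WithLang(String);

/// Hypothetical builder; `N` and `L` record which required parts are attached.
struct Builder<N, L> {
    ner: N,
    lang: L,
    tokenizer: Option<String>, // optional component
}

#[derive(Debug, PartialEq)]
struct Engine {
    ner: String,
    lang: String,
    tokenizer: Option<String>,
}

impl Builder<NoNer, NoLang> {
    fn new() -> Self {
        Builder { ner: NoNer, lang: NoLang, tokenizer: None }
    }
}

impl<N, L> Builder<N, L> {
    fn with_ner(self, name: &str) -> Builder<WithNer, L> {
        Builder { ner: WithNer(name.to_owned()), lang: self.lang, tokenizer: self.tokenizer }
    }

    fn with_language_detector(self, name: &str) -> Builder<N, WithLang> {
        Builder { ner: self.ner, lang: WithLang(name.to_owned()), tokenizer: self.tokenizer }
    }

    fn with_tokenizer(mut self, name: &str) -> Self {
        self.tokenizer = Some(name.to_owned());
        self
    }
}

// `build` exists only once both required components are present.
impl Builder<WithNer, WithLang> {
    fn build(self) -> Engine {
        Engine { ner: self.ner.0, lang: self.lang.0, tokenizer: self.tokenizer }
    }
}

fn main() {
    let engine = Builder::new()
        .with_ner("ort")
        .with_language_detector("lingua")
        .build();
    println!("{engine:?}");
    // Builder::new().with_ner("ort").build(); // does not compile: no `build` on Builder<WithNer, NoLang>
}
```

Because the required components live in the type parameters themselves, `build` needs no `expect` or `unwrap`.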

nvisy-engine wiring

Renames EntityRecognition (LLM-driven) → LlmRecognition.
Adds NerRecognition wrapping Arc<dyn nvisy_nlp::NerBackend>.
Detection phase now runs three ops: LLM (skip if no provider), NER backend (skip if no backend), pattern (always).
Engine::with_ner_backend(...) builder method for engine init.

Behaviour unchanged when no backend is configured.
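
The skip-if-unconfigured ordering of the detection phase can be sketched as follows. This is a synchronous simplification with a hypothetical `DetectionOp` trait and string entities; the real operations are async and return typed entities.

```rust
/// Hypothetical stand-in for one detection operation.
trait DetectionOp {
    fn recognize(&self, text: &str) -> Vec<String>;
}

/// LLM and NER are optional; pattern recognition always runs.
struct Detection<'a> {
    llm: Option<&'a dyn DetectionOp>,
    ner: Option<&'a dyn DetectionOp>,
    pattern: &'a dyn DetectionOp,
}

impl Detection<'_> {
    fn run(&self, text: &str) -> Vec<String> {
        let mut out = Vec::new();
        // Fixed order: LLM, NER backend, pattern. Missing ops are skipped silently.
        for op in [self.llm, self.ner, Some(self.pattern)].into_iter().flatten() {
            out.extend(op.recognize(text));
        }
        out
    }
}

/// Toy op that reports its own name as the only "entity".
struct Fixed(&'static str);

impl DetectionOp for Fixed {
    fn recognize(&self, _text: &str) -> Vec<String> {
        vec![self.0.to_owned()]
    }
}

fn main() {
    let ner = Fixed("ner");
    let pattern = Fixed("pattern");
    let detection = Detection { llm: None, ner: Some(&ner), pattern: &pattern };
    println!("{:?}", detection.run("Alice lives in Paris"));
}
```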

Layout

crates/nvisy-nlp/src/
├── lib.rs
├── error.rs                          Error + Result alias
├── artifacts/
│   ├── mod.rs                        Artifacts
│   └── token.rs                      Token
├── engine/
│   ├── mod.rs                        re-exports + integration tests
│   ├── nlp_engine.rs                 NlpEngine
│   └── builder.rs                    NlpEngineBuilder (type-state)
├── language/
│   ├── mod.rs                        LanguageDetector + LanguageDetection + LanguageSpan + LanguageProvenance
│   └── lingua.rs                     LinguaLanguageDetector
├── ner/
│   ├── mod.rs                        NerBackend trait
│   ├── ort.rs                        OrtNerBackend + Inferencer (pub(crate))
│   └── noop.rs                       NoopNerBackend
└── tokenizer/
    ├── mod.rs                        Tokenizer trait + SUPPORTED_STOPWORD_LANGUAGES
    ├── hugging_face.rs               HfTokenizer
    └── unicode.rs                    UnicodeTokenizer

Deferred (tracked as follow-ups)

GlotLID v3 as alternative language detector — better tail coverage, short-text accuracy, Apache-2.0.
TiktokenTokenizer for LLM context-window counting.
LinderaTokenizer for CJK content.
Burn as alternative NER backend.
Lemmatization (nvisy-nlp: lemmatization deferred: decide path when a consumer needs it #154).

Test plan

cargo build --workspace clean
cargo test --workspace — all ~440 tests pass, 28 in nvisy-nlp
cargo clippy --workspace --all-targets clean
cargo doc -p nvisy-nlp --no-deps clean (no broken intra-doc links)
cargo fmt clean
All commits independently buildable
Manual smoke: load a real ONNX NER model via OrtNerBackend and run against documents
Manual smoke: wire a backend into the engine and verify it runs alongside pattern recognition

🤖 Generated with Claude Code

Scope C, trimmed to what Rust can actually deliver maintainably: NER, language detection, tokenization. No lemmas in v1 (see #154: no maintained Rust lemmatizer exists, nothing currently consumes them, trait absorbs them later non-breakingly).

## Architecture

Composable trait surface, in contrast to Presidio's monolithic NlpEngine. Each concern is independently implementable:

- `NerBackend` (async): NER over text
- `LanguageDetector` (sync): language ID
- `Tokenizer` (sync, fallible): token + offset extraction

Composite `NlpEngine` orchestrates them in Presidio order (detect language → run NER with hint → tokenize → derive keywords) and has a second entrypoint `analyze_in_language(text, lang)` for callers who already know the document language. `NerBackend` and `LanguageDetector` are required at build time; `Tokenizer` is optional.

## v1 implementations

- `OrtNerBackend`: HuggingFace token-classification model via `ort` (ONNX Runtime, load-dynamic). Inference behind an `Inferencer` trait so unit tests use canned logits, not a model file. BIO-tag folding + label-map lookup produce `Entities`.
- `LinguaLanguageDetector`: lingua-rs, real confidence scores, multilingual segmentation via `detect_multiple`, two constructors (`for_languages` recommended, `for_all_languages` for genuinely unknown). Lingua defaults to high-accuracy mode.
- `UnicodeTokenizer`: `unicode-segmentation` based, model-free. Optional stopword set via `stop-words`.
- `HfTokenizer`: `tokenizers` crate wrapper. Loads `tokenizer.json`, produces tokens with HF offsets (for ORT NER alignment). Optional stopword set.
- `NoopNerBackend`: empty entities; for tests.

## Data carrier

`NlpArtifacts` mirrors Presidio's field set (entities, language, tokens, keywords) reshaped for typed Rust access. No `lemmas`, no spaCy `Doc` backref, no parallel offset arrays.

## Public surface decisions

- `LanguageDetection { language, confidence: Option<f64> }` and `LanguageSpan { start, end, language, confidence }` carried on the trait; engine strips down to `LanguageTag` for `NlpArtifacts` (consumers wanting confidence call the detector directly).
- `Token { start, end, text, is_stop }`: no `is_punct`; the detection was unreliable across both impls and nothing reads it.
- `Tokenizer::tokenize` is fallible (`Result<Vec<Token>, NlpError>`). HF errors propagate instead of silent-empty; Unicode impl is always `Ok`.

## Distribution / model story

User-provided model paths only. No bundling, no auto-download. `load-dynamic` for ORT means users install `libonnxruntime` out-of-band (Homebrew on macOS, distro package on Linux). Documented in `README.md` and `DESIGN.md`.

## Deferred (tracked)

- GlotLID v3 as an alternative language detector.
- TiktokenTokenizer for LLM context-window counting.
- LinderaTokenizer for CJK content.
- Lemmatization (see #154).
- Burn as an alternative NER backend.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
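
The BIO-tag folding step described for `OrtNerBackend` can be sketched roughly like this. The `Span` type, the function name `fold_bio`, and the choice to treat a stray `I-` tag as the start of a new span are assumptions for illustration, not the crate's actual behaviour.

```rust
#[derive(Debug, PartialEq)]
struct Span {
    label: String,
    start: usize,
    end: usize,
}

/// Folds per-token BIO tags into entity spans.
/// `tags[i]` is e.g. "B-PER", "I-PER" or "O"; `offsets[i]` is that token's byte range.
fn fold_bio(tags: &[&str], offsets: &[(usize, usize)]) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut open: Option<Span> = None;

    for (tag, &(start, end)) in tags.iter().zip(offsets) {
        // "B-PER" -> ("B", "PER"); anything without a dash counts as outside.
        let (prefix, label) = tag.split_once('-').unwrap_or(("O", ""));

        let continues =
            prefix == "I" && open.as_ref().map_or(false, |span| span.label == label);

        if continues {
            // Same entity keeps going: extend the open span.
            if let Some(span) = open.as_mut() {
                span.end = end;
            }
        } else {
            // Close whatever was open, then maybe start a new span.
            spans.extend(open.take());
            if prefix == "B" || prefix == "I" {
                // Assumption: a stray "I-" with no matching open span starts a new one.
                open = Some(Span { label: label.to_owned(), start, end });
            }
        }
    }
    spans.extend(open);
    spans
}

fn main() {
    let tags = ["B-PER", "I-PER", "O", "B-LOC"];
    let offsets = [(0, 4), (5, 10), (11, 13), (14, 20)];
    println!("{:?}", fold_bio(&tags, &offsets));
}
```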

Renames `EntityRecognition` (LLM-driven NER) to `LlmRecognition` and adds a new `NerRecognition` operation that wraps `Arc<dyn nvisy_nlp::NerBackend>` and runs sequentially over text spans.

The detection phase now runs three operations in order:

1. `LlmRecognition`: silently skipped if no LLM provider configured
2. `NerRecognition`: silently skipped if no backend on the run ctx
3. `PatternRecognition`: always runs

Wiring:

- `Engine::with_ner_backend(Arc<dyn NerBackend>)` builder method to attach an offline backend at engine init.
- `EngineInner.ner_backend` carries it across runs.
- `Pipeline::new` accepts and forwards it.
- `RunContext.ner_backend` makes it available to the orchestrator.

Behaviour unchanged when no backend is configured: existing pipelines continue to run LLM NER + patterns, no offline-NER overhead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ts, fallible LanguageDetector

Doc + naming + ergonomics polish on top of the new crate. Squashed into one commit because the changes interleave on the same files.

## Naming

- `NlpError` -> `Error` + `Result<T, E = Error>` alias; the crate now uses `Result<T>` everywhere internally.
- `NlpArtifacts` -> `Artifacts`.
- `tokenizer/hf.rs` -> `tokenizer/hugging_face.rs` (file rename; type name `HfTokenizer` unchanged).

## Trait change

`LanguageDetector` is now fallible:

- `detect(&str) -> Result<Option<LanguageDetection>>`: splits "no answer" (`Ok(None)`, e.g. short or ambiguous text) from real backend failure (`Err(_)`).
- `detect_multiple(&str) -> Result<Vec<LanguageSpan>>`: default impl returns a single full-text span built from `detect`; impls with real segmentation override.

`LinguaLanguageDetector` propagates these `Result`s. Tests updated.

## Framing

- Dropped all "offline" labelling from README, Cargo description, module docs, struct docs, tracing strings. The trait surface is transport-agnostic; LLM-mediated NER lives in `nvisy-llm` by deliberate crate split, not by trait restriction. README and inline docs now say so explicitly.
- Deleted `DESIGN.md`; the README now carries the small amount of framing actually needed.

## Builder

Dropped `with_ner_arc` and `with_language_detector_arc` from `NlpEngineBuilder`: the `with_ner` / `with_language_detector` methods (which take any `T: NerBackend/LanguageDetector + 'static`) cover the same use cases.

## Layout

- `artifacts/` is a folder with `mod.rs` (Artifacts) and `token.rs` (Token).
- `language/` is a folder with `mod.rs` (trait + LanguageDetection + LanguageSpan) and `lingua.rs` (LinguaLanguageDetector).
- `error.rs` is a single file (no folder).

## nvisy-engine side

Same sweep applied to the consumer: removed "offline" framing from `ner_recognition`, `llm_recognition`, `detection/mod`, orchestrator and engine docs. The trait surface and operation behaviour are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
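
A rough sketch of the fallible `LanguageDetector` shape at this stage of the PR, with a default `detect_multiple` built on `detect`. The struct fields are simplified (languages are plain strings), the `ToyDetector` is hypothetical, and returning an empty vector when `detect` yields `Ok(None)` is an assumption rather than something the commit message states.

```rust
#[derive(Debug)]
struct Error(String);

/// Crate-style alias with a defaulted error type.
type Result<T, E = Error> = std::result::Result<T, E>;

#[derive(Debug, Clone, PartialEq)]
struct LanguageDetection {
    language: String,
    confidence: Option<f64>,
}

#[derive(Debug, Clone, PartialEq)]
struct LanguageSpan {
    start: usize,
    end: usize,
    language: String,
    confidence: Option<f64>,
}

trait LanguageDetector {
    /// `Ok(None)` means "no answer"; `Err` means the backend failed.
    fn detect(&self, text: &str) -> Result<Option<LanguageDetection>>;

    /// Default: one span covering the whole text, derived from `detect`.
    fn detect_multiple(&self, text: &str) -> Result<Vec<LanguageSpan>> {
        let span = self.detect(text)?.map(|d| LanguageSpan {
            start: 0,
            end: text.len(),
            language: d.language,
            confidence: d.confidence,
        });
        Ok(span.into_iter().collect())
    }
}

/// Toy detector: refuses to answer for very short input.
struct ToyDetector;

impl LanguageDetector for ToyDetector {
    fn detect(&self, text: &str) -> Result<Option<LanguageDetection>> {
        if text.len() < 4 {
            return Ok(None);
        }
        Ok(Some(LanguageDetection { language: "en".to_owned(), confidence: Some(0.9) }))
    }
}

fn main() {
    let spans = ToyDetector.detect_multiple("hello world").expect("toy detector never fails");
    println!("{spans:?}");
}
```

Keeping "no answer" in the `Ok` channel lets callers use `?` for real failures while still handling short or ambiguous text explicitly.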

Doc-only + style sweep across the crate:

- Hoist all intra-doc links and external URL links to bottom-of-block reference definitions, matching the existing style. Inline `[Foo](crate::path::Foo)` and `[bar](https://...)` become `[Foo]` and `[bar]` with reference defs below.
- Escape stray `[\`nvisy-nlp\`]` crate-name strings that rustdoc was mistakenly parsing as intra-doc links; they're now plain backticks.
- Drop the dangling "See DESIGN.md" line from ner/ort.rs (file was deleted last round).
- `cargo fmt` rewraps.
- `cargo clippy --workspace --all-targets` clean (one `manual_contains` lint fixed in `stopword_lang::is_supported`).
- `cargo doc -p nvisy-nlp --no-deps` clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Acts on every item from the second-round code review. Changes by category, with reviewer's anchor codes in parens.

## Bugs

- (A1) OrtNerConfig.max_sequence_length is now wired: the tokenizer is configured with TruncationParams at construction time, so long inputs are truncated instead of triggering a shape mismatch at inference time.
- (A2) OrtNerBackend.recognize now validates the language hint against `supported_languages` (when configured) and returns Error::UnsupportedLanguage on mismatch. Empty list still means "any language" per the trait doc.
- (A3) Inference moved off the async-runtime thread via tokio::task::spawn_blocking. OrtNerBackend.state lives behind an Arc so it can cross the spawn boundary cheaply.
- (A4) Inferencer trait returns flat (Vec<f32>, num_labels) instead of Vec<Vec<f32>>. 512 heap allocations per inference become one.

## Ergonomics

- (B1, E1) Inferencer + OrtInferencer are pub(crate); external backends implement NerBackend directly, no longer locked into the internal trait surface.
- (B2) OrtNerConfig::with_conll03_english helper for the common PER/ORG/LOC/MISC label set.
- (B3) Replaced hardcoded 0.85 confidence with real per-span softmax confidence: argmax_softmax computes both the winning label and its probability in one pass, fold_predictions averages per-token probs across the span.
- (B5) Type-state NlpEngineBuilder. NoNer/WithNer + NoLang/WithLang marker types make `.build()` only callable once both required components are attached. The runtime `expect` panics become unreachable.
- (B6) LanguageProvenance { Detected, Asserted } enum on LanguageDetection. analyze_in_language stamps Asserted; LinguaLanguageDetector stamps Detected. Callers no longer have to reverse-engineer provenance from `confidence: None`.

## Polish

- (B4) HfTokenizer doc explicitly notes that `is_stop` doesn't work for subword tokenizers (BPE/WordPiece/Unigram emit fragments, not whole words). Recommends UnicodeTokenizer for word-level filtering.
- (C1, C2) OrtState moves the tokenizer into Arc-shared storage and drops everything not needed at inference time (full OrtNerConfig held forever was the original cut corner).
- (C3) label_order rejects `"O"` in the user-supplied label_map at construction with a clear error.
- (C4) lingua_to_tag uses a OnceLock<Mutex<HashSet>>-backed warn-once cache. Repeated failures for the same ISO code log once per process instead of per call.
- (C6) `recognize_sync` -> `recognize_blocking`. Actually used now, by the new tests and as the body run inside spawn_blocking.
- (C7, D1, D2) Real unit tests for fold_predictions: a minimal hand-rolled tokenizer.json + CannedInferencer drives end-to-end span construction including multi-span output, BIO continuation, language-hint validation paths.
- (D3) NoopNerBackend::default() exercised in a test.
- (E2) Dedicated stopword_lang.rs collapsed into tokenizer/mod.rs as `SUPPORTED_STOPWORD_LANGUAGES` const + helper fn.

## Docs

- All "offline NLP" framing dropped in earlier round; this commit also fixes the broken intra-doc link to UnicodeTokenizer from HfTokenizer's struct doc.

## Test count

21 -> 28 tests, all pass. Workspace builds + clippy clean + fmt + doc warnings clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
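
The warn-once cache from (C4) might look something like the following std-only sketch. The function name `warn_once`, its boolean return value, and the use of `eprintln!` in place of a tracing macro are assumptions made for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Logs a warning the first time `code` is seen in this process.
/// Returns `true` if the warning was emitted, `false` if it was suppressed.
fn warn_once(code: &str) -> bool {
    // Process-wide set of codes that have already been reported.
    static SEEN: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

    let seen = SEEN.get_or_init(|| Mutex::new(HashSet::new()));
    // A poisoned lock only means another thread panicked mid-insert; the set is still usable.
    let mut guard = seen.lock().unwrap_or_else(|poisoned| poisoned.into_inner());

    // `insert` returns true only when the value was not present yet.
    let first_time = guard.insert(code.to_owned());
    if first_time {
        eprintln!("warn: no language tag mapping for ISO code {code}");
    }
    first_time
}

fn main() {
    for code in ["xx", "xx", "yy", "xx"] {
        println!("{code}: emitted = {}", warn_once(code));
    }
}
```

The set grows by one entry per distinct code, so it stays small as long as the space of ISO codes is bounded.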

…, fold engine module

Engine module is now a single file (engine/mod.rs holds Engine; builder stays in engine/builder.rs).

LanguageDetection (with LanguageProvenance) and LanguageSpan moved out of language/mod.rs into their own files so the module root only carries the LanguageDetector trait + re-exports.

Engine name collides with nvisy_engine::Engine at use sites; callers importing both will need to alias one.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…, NoopNer behind test-utils

nvisy-ontology

- Add Confidence(f64) primitive with Confidence::new -> Option (NonZero-style), Confidence::get accessor, serde-transparent. Rejects NaN/inf and values outside [0,1].
- Migrate Entity.confidence from f64 to Confidence. EntitySelector and policy comparisons now go through .get().

nvisy-nlp

- LanguageSpan shrinks to {start, end}. LanguageDetection gains span: Option<LanguageSpan> and confidence: Option<Confidence>.
- LanguageDetector trait: detect + detect_in both return Vec<LanguageDetection>. Empty vec means "no answer"; mixed-language input produces one entry per region. Drop detect_multiple. LinguaLanguageDetector populates span from lingua's multi-language API.
- Rename Request -> Context (file too). Drop Response: Engine::analyze now returns Artifacts directly. correlation_id stays on Context, input-only for the tracing span.
- Replace type-state EngineBuilder with derive_builder + custom setters; the four marker types are gone from the public API. .build() returns Result<Engine, EngineBuilderError>.
- Gate NoopNerBackend behind a new test-utils feature so production builds don't ship the no-op fallback.

nvisy-pattern, nvisy-provider, nvisy-engine

- Wrap entity.confidence at every construction site (Confidence::new(..clamp(0,1)) at ML/LLM boundaries to absorb float rounding).
- Read entity.confidence.get() at every comparison site (selectors, dedup strategies, redaction thresholds).
- Bump nvisy-ontology dev-deps to enable test-utils where tests build Entities.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
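
A validated `Confidence` newtype along the lines described above could be sketched as below. The serde-transparent derive is omitted to keep the example std-only, and the derive list is an assumption.

```rust
/// A probability-like score guaranteed to be finite and within [0, 1].
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Confidence(f64);

impl Confidence {
    /// Returns `None` for NaN, infinities, and values outside [0, 1].
    fn new(value: f64) -> Option<Self> {
        (value.is_finite() && (0.0..=1.0).contains(&value)).then_some(Self(value))
    }

    /// Unwraps the raw score for comparisons.
    fn get(self) -> f64 {
        self.0
    }
}

fn main() {
    // At an ML boundary, clamping first absorbs tiny float overshoot such as 1.0000001.
    let raw = 1.000_000_1_f64;
    let clamped = Confidence::new(raw.clamp(0.0, 1.0));
    println!("{:?} {:?}", Confidence::new(raw), clamped);

    // Comparisons go through `get`.
    let threshold = 0.8;
    if let Some(score) = clamped {
        println!("passes threshold: {}", score.get() >= threshold);
    }
}
```

Clamping does not rescue NaN: `f64::clamp` returns NaN for a NaN input, so `Confidence::new` still rejects it.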

…oes internal

Reshape language module around a public factory trait.

- LanguageDetector trait drops to pub(crate). One method: detect(&str). Language scope is baked in at construction time; no detect_in.
- New public LanguagePolicy trait with associated Detector type + detector_for_all / detector_for(&[LanguageTag]) factory methods. Engine asks the policy for a fresh detector each call.
- Crate-private DynLanguagePolicy shim (object-safe twin, blanket-impl'd for every LanguagePolicy) lets Engine hold Arc<dyn DynLanguagePolicy> while the public trait keeps its associated type.
- impl<P: LanguagePolicy> LanguagePolicy for Arc<P> for cheap sharing.
- LinguaLanguagePolicy unit-like factory; LinguaLanguageDetector keeps pub struct visibility (needed for the associated type) but constructors become pub(crate) so external code only reaches it through the policy.
- EngineBuilder::with_language_detector renamed to with_language_policy. Asserted-language path (Context::language) bypasses the policy.
- Context::candidate_languages forwards to LanguagePolicy::detector_for every call; LinguaLanguagePolicy silently falls back to detector_for_all when no requested tag is recognised.

Module layout:

- language/detection.rs holds LanguageDetection, LanguageProvenance, and LanguageSpan (merged from former lang_detection.rs + lang_span.rs).
- language/dyn_policy.rs holds DynLanguagePolicy + the Arc<P> forwarder.
- language/lingua.rs holds both LinguaLanguageDetector and LinguaLanguagePolicy.
- language/mod.rs holds LanguageDetector (pub(crate)) and LanguagePolicy (pub).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
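
The pairing of a public trait with an associated type and an object-safe twin can be sketched as follows. All types here are simplified stand-ins: language tags are strings, `FixedPolicy` and `FixedDetector` are hypothetical, and the `dyn_` method names are invented for the example.

```rust
use std::sync::Arc;

trait LanguageDetector {
    fn detect(&self, text: &str) -> Option<String>;
}

/// Public factory trait: keeps a concrete associated detector type.
trait LanguagePolicy {
    type Detector: LanguageDetector + 'static;
    fn detector_for_all(&self) -> Self::Detector;
    fn detector_for(&self, languages: &[String]) -> Self::Detector;
}

/// Object-safe twin: erases the associated type behind a Box.
trait DynLanguagePolicy {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector>;
    fn dyn_detector_for(&self, languages: &[String]) -> Box<dyn LanguageDetector>;
}

// Blanket impl: every LanguagePolicy is usable as a DynLanguagePolicy.
impl<P: LanguagePolicy> DynLanguagePolicy for P {
    fn dyn_detector_for_all(&self) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for_all())
    }
    fn dyn_detector_for(&self, languages: &[String]) -> Box<dyn LanguageDetector> {
        Box::new(self.detector_for(languages))
    }
}

/// Toy detector: answers with its first candidate for any non-empty text.
struct FixedDetector(Vec<String>);

impl LanguageDetector for FixedDetector {
    fn detect(&self, text: &str) -> Option<String> {
        if text.is_empty() { None } else { self.0.first().cloned() }
    }
}

struct FixedPolicy;

impl LanguagePolicy for FixedPolicy {
    type Detector = FixedDetector;
    fn detector_for_all(&self) -> FixedDetector {
        FixedDetector(vec!["en".to_owned()])
    }
    fn detector_for(&self, languages: &[String]) -> FixedDetector {
        // Falls back to the all-languages detector when nothing was requested.
        if languages.is_empty() { self.detector_for_all() } else { FixedDetector(languages.to_vec()) }
    }
}

/// The engine stores only the type-erased policy.
struct Engine {
    policy: Arc<dyn DynLanguagePolicy>,
}

impl Engine {
    fn detect(&self, text: &str, candidates: &[String]) -> Option<String> {
        // A fresh detector is requested from the policy on every call.
        self.policy.dyn_detector_for(candidates).detect(text)
    }
}

fn main() {
    let engine = Engine { policy: Arc::new(FixedPolicy) };
    println!("{:?}", engine.detect("hola", &["es".to_owned()]));
}
```

The public trait keeps static dispatch for callers that know the concrete policy, while the engine pays for one boxed detector per call in exchange for a non-generic struct.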

…tEnhancer, serde wire shape

Per-call context hints (Presidio's analyze(context=[...]) equivalent, with per-kind targeting on top):

- New ContextHint { kind: Option<EntityKind>, keywords, window?, boost? }. ScanContext::hints: Vec<ContextHint>. The enhancer picks at most one bucket per match: the entry with kind == Some(match.entity_kind), falling back to the first kind == None bucket. Prevents a "DOB" hint from silently boosting an SSN pattern.
- Pull RawMatch::apply_context_adjustment into a new pub(crate) ContextEnhancer in engine/scan/enhancer.rs. Strict-Presidio: patterns without a ContextRule are skipped regardless of hints.
- Per-call window/boost overrides only fire when hint.keywords is non-empty (avoids silent retuning).

Wire shape:

- AllowList, DenyList, DenyRule, ContextHint, ScanContext gain serde derives.
- AllowList / DenyList serialize transparently (HashSet<String> / HashMap<String, DenyRule>).
- DenyList's lazily-compiled scanner is #[serde(skip)] and rebuilt on first scan.
- Round-trip test locks in the JSON shape.

File reshuffle:

- engine/filter/scan_context.rs deleted; ScanContext lives in filter/mod.rs.
- engine/pattern_engine.rs deleted; PatternEngine + Debug impl + scan_entities / scan_raw + DEFAULT_ENGINE singleton fold into engine/mod.rs.
- engine/filter/deny_list.rs split: DenyScanner moves to its own engine/filter/deny_scanner.rs.
- engine/filter/context_hint.rs holds the new ContextHint type (singular because ScanContext::hints is a Vec).
- PatternEngine field visibility tightens from pub(super) to pub(in crate::engine) so the post-merge visibility matches the original "siblings in crate::engine only" intent.

Tests: 11 new enhancer tests including the kind-targeted boost isolation test (kind_targeted_hint_does_not_apply_to_other_kinds) plus 2 serde round-trip tests. 83 unit + 4 + 5 integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
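
The bucket-selection rule for context hints (kind-targeted entry first, then the first untargeted entry) can be sketched like this. The `EntityKind` variants, the field types, and the helper name `pick_hint` are assumptions; taking the first entry when several target the same kind is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntityKind {
    Ssn,
    DateOfBirth,
}

#[derive(Debug, Clone, PartialEq)]
struct ContextHint {
    kind: Option<EntityKind>, // None = applies to any kind
    keywords: Vec<String>,
    window: Option<usize>,
    boost: Option<f64>,
}

/// Picks at most one hint for a match of `match_kind`:
/// the entry targeting that kind, else the first untargeted entry.
fn pick_hint(hints: &[ContextHint], match_kind: EntityKind) -> Option<&ContextHint> {
    hints
        .iter()
        .find(|hint| hint.kind == Some(match_kind))
        .or_else(|| hints.iter().find(|hint| hint.kind.is_none()))
}

fn hint(kind: Option<EntityKind>, keyword: &str) -> ContextHint {
    ContextHint { kind, keywords: vec![keyword.to_owned()], window: None, boost: None }
}

fn main() {
    let hints = vec![hint(Some(EntityKind::DateOfBirth), "dob"), hint(None, "patient")];

    // The DOB-targeted hint never applies to an SSN match; the generic one does.
    println!("{:?}", pick_hint(&hints, EntityKind::Ssn).map(|h| &h.keywords));
    println!("{:?}", pick_hint(&hints, EntityKind::DateOfBirth).map(|h| &h.keywords));
}
```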

…no longer depends on nlp/pattern

Mirrors Presidio's presidio-analyzer split: detection becomes its own crate with a Recognizer trait + DetectionEngine, and nvisy-engine reduces to orchestration.

New crate nvisy-detection

- Recognizer trait (async recognize + reset; reset is a no-op default).
- NerRecognizer wraps Arc<nvisy_nlp::Engine> and translates DetectionContext to nvisy_nlp::Context. Behavior change vs the old NerRecognition: now routes through the full Engine (language detection + post-filter + tokens + keywords) instead of bypassing it.
- PatternRecognizer wraps nvisy_pattern::PatternEngine with the Shared/Arc/Owned variants preserved from the old PatternEngineRef.
- LlmRecognizer wraps NerAgent + DetectionConfig; overrides Recognizer::reset to clear coreference state.
- DetectionEngine orchestrates Vec<Arc<dyn Recognizer>> sequentially. detect() runs each; reset() fans out to clear per-document state at document boundaries.
- DetectionContext bundles per-call inputs (text, language, candidate_languages, entities, score_threshold, scan_context, correlation_id).
- extension::RebaseEntities (moved from nvisy-engine) shifts recognizer output from context-local to document-relative byte offsets.
- Error::{Recognizer{name,cause}, Misconfigured} with From → nvisy_core::Error.

nvisy-engine

- Drops nvisy-pattern and nvisy-nlp from its Cargo.toml. nvisy-engine now only depends on nvisy-detection (and nvisy-core, codec, ontology, provider) for detection.
- operation/detection/ folder (5 files: LlmRecognition, NerRecognition, PatternRecognition, PatternEngineRef, RebaseEntities) replaced by a single operation/detection.rs holding Detection { engine: Arc<DetectionEngine> }.
- EngineInner::ner_backend: Option<Arc<dyn NerBackend>> → detection_engine: Option<Arc<DetectionEngine>>.
- Engine::with_ner_backend → Engine::with_detection_engine. Users compose the detection engine externally with whatever recognizers they want and hand the assembled Arc<DetectionEngine> to the runtime engine.
- Pipeline / RunContext renamed accordingly.
- Orchestrator::run_detection collapses to a single Detection op invocation. Workflow DetectionConfig is currently ignored at the engine boundary (recognizer composition happens externally); threading per-call hints from the cfg into DetectionContext is left for a follow-up.

nvisy-nlp

- Engine::analyze rewritten to use tracing::Instrument instead of span.entered() so the returned future is Send. Required by NerRecognizer's async_trait impl. No behavior change.

nvisy-pattern

- ScanContext gains Clone so it can ride on DetectionContext (Clone).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Remove crates/nvisy-python (pyo3 bridge + exif Python module).
- Remove packages/ (uv workspace: nvisy-hf, nvisy-exif).
- Remove root pyproject.toml.
- Drop nvisy-python from workspace members and dependencies in root Cargo.toml.
- Drop pyo3, pyo3-async-runtimes, pythonize from workspace dependencies.
- Remove all four Install Python steps and the PYO3_USE_ABI3_FORWARD_COMPATIBILITY env from CI build workflow.
- Strip Python entries (__pycache__, *.py[cod], .venv, *.egg-info, .ruff_cache, .pytest_cache) from .gitignore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ient threading

Splits the former nvisy-provider into two focused crates and moves agent/HTTP wiring out of nvisy-engine.

New crate nvisy-rig

- LLM agents over the rig framework: base scaffolding, NER, CV, generate, audio (STT/TTS), and entity_verifier.
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: openai-gpt, openai-whisper, openai-tts, anthropic-claude, google-gemini.
- Does NOT depend on nvisy-ocr.

New crate nvisy-ocr

- Non-LLM OCR providers (Surya, PaddleX, AWS Textract, Google Vision, Azure DocAI).
- Owns its own http/ (HttpClient + middleware + HttpConfig).
- Features: aws-textract, google-vision, azure-docai.

agent/ocr split

- LLM-side prompting + verification logic moves to nvisy_rig::agent::entity_verifier::EntityVerifier. verify(image, &[ProposedEntity]) -> VerificationOutput. No knowledge of OCR providers.
- OCR-engine + verifier orchestration moves into nvisy_engine::operation::extraction::vision. The engine now drives the OCR -> propose entities -> LLM verify -> merge flow directly instead of routing through a unified OcrAgent.

nvisy-engine reshape

- Drops nvisy-provider; adds nvisy-rig + nvisy-ocr.
- EngineInner::http_client field removed.
- Pipeline::new no longer takes HttpClient.
- RunContext::http_client removed.
- Engine::http_client() accessor removed.
- Extraction::new, VisualExtraction::new, AudialExtraction::new no longer take &HttpClient. Each constructs its own client(s) from RuntimeConfig at build time.
- Feature aliases rewired: openai=nvisy-rig/*, google=nvisy-rig/google-gemini + nvisy-ocr/google-vision, microsoft=nvisy-ocr/azure-docai, amazon=nvisy-ocr/aws-textract.

HTTP polish (both crates)

- http/middleware/{retry,tracing}.rs collapsed into a single http/middleware.rs exposing backoff_policy, retry_layer, tracing_layer.
- HttpConfig fields switch from u64 seconds to std::time::Duration with humantime_serde for `timeout`, `connect_timeout`, `idle_timeout`. Accepts "120s", "2min", "500ms", etc. Added humantime-serde as a workspace dep.

nvisy-provider preservation

- Old crate moved to .ignore/nvisy-provider-old for comparison; not deleted, not in the workspace.

Verification: workspace builds, ~400 tests pass, clippy clean, nightly fmt applied, docs build with no new warnings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…bottom

- nvisy-pattern: stop linking from public docs to private items (scan, RawMatch, ContextEnhancer, DenyList::scanner); rename stale DictionaryFilter link to PatternFilter; switch AllowList's PatternEngine::scan_entities link to a crate-path reference so rustdoc can resolve it.
- nvisy-rig: stop linking EntityVerifier's public doc to the crate-private BaseAgent.
- Convert every remaining inline rustdoc link of the form [`Foo`](path::to::Foo) to reference-style: in-prose [`Foo`] with the path defined once at the bottom of the doc block.

Workspace doc builds with no rustdoc warnings. (The one remaining filename-collision warning is a structural cargo issue: nvisy-cli defines a binary named "nvisy-server" that conflicts with the nvisy-server lib's doc output path. Renaming the binary is a behavioural change and out of scope here.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

martsokha and others added 4 commits May 20, 2026 15:39

martsokha self-assigned this May 20, 2026

martsokha changed the title ~~feat(nvisy-nlp): new offline NLP crate + nvisy-engine wiring~~ feat(nvisy-nlp): new NLP crate + nvisy-engine wiring May 20, 2026

martsokha and others added 8 commits May 20, 2026 17:30

martsokha merged commit 657171d into main May 20, 2026
5 checks passed

martsokha deleted the feat/nvisy-nlp branch May 20, 2026 23:42

martsokha added the labels core (nvisy-core: content model, errors, types), rig (nvisy-rig: LLM agents (NER, CV, generate, STT/TTS, entity-verifier) over rig), and refactor (code restructuring without behavior change) May 21, 2026

martsokha merged 13 commits into main from feat/nvisy-nlp