Corpus-derived entity patterns #65

@VarunGitGood

Why

The first cut of entity detection in the intent resolver is regex-only. That covers the datasets currently shipped in the eval harness (HDFS blk_…, req_…, hex hashes, UUIDs, hyphenated identifiers) but breaks on custom ID schemes such as cust_7f3a or order#X9-2024. When a query contains an entity that is neither a known service nor matched by a regex, the resolver should fall back to patterns learned from the corpus instead of asking the LLM every time.

Scope (in)

  • New entity_patterns table: (pattern text pk, frequency bigint, nl_density real, last_seen_at timestamptz) plus a descending-frequency index.
  • Incremental learning hook in repi/ingestion/log_ingestor.py: after each ingest batch, tokenise the new chunk text, score each token by frequency * (1 - nl_density), with nl_density measured against a small English wordlist, and upsert tokens whose score is above a threshold.
  • repi/intent/resolver.py loads the top-N patterns at startup with a ~5-min cache TTL.
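
The scoring step in the learning hook could look roughly like the sketch below. It is not the implementation: the tokeniser regex, the wordlist contents, the threshold value, the function names, and in particular the definition of nl_density (here, the fraction of a token's alphanumeric parts found in the wordlist) are all assumptions made for illustration. Frequency is counted as the number of distinct chunks containing the token, following the acceptance criterion. The upsert into entity_patterns is left out.

```python
import re
from collections import Counter

# Hypothetical tokeniser: keeps identifiers such as cust_7f3a or order#X9-2024 whole.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_#-]*")
# Splits a token into alphanumeric parts, e.g. "cust_7f3a" -> ["cust", "7f3a"].
PART_RE = re.compile(r"[A-Za-z0-9]+")

# Hypothetical stand-in for the "small English wordlist".
ENGLISH_WORDS = frozenset(
    {"the", "for", "with", "request", "failed", "order", "timeout", "customer"}
)


def nl_density(token, wordlist=ENGLISH_WORDS):
    """Assumed definition: share of the token's parts that are English words."""
    parts = PART_RE.findall(token.lower())
    if not parts:
        return 1.0
    return sum(part in wordlist for part in parts) / len(parts)


def learn_patterns(chunks, threshold=10.0, wordlist=ENGLISH_WORDS):
    """Return {token: (frequency, nl_density)} for tokens scoring above threshold."""
    chunk_frequency = Counter()
    for text in chunks:
        # Count each token once per chunk, so frequency means "distinct chunks".
        chunk_frequency.update(set(TOKEN_RE.findall(text)))

    promoted = {}
    for token, frequency in chunk_frequency.items():
        density = nl_density(token, wordlist)
        if frequency * (1.0 - density) > threshold:
            promoted[token] = (frequency, density)
    return promoted


if __name__ == "__main__":
    fixture = [f"request {i} for cust_7f3a failed" for i in range(11)]
    print(learn_patterns(fixture))
```

Under these assumptions a frequent English word scores 0 because its density is 1.0, while a frequent ordinary word that is missing from the wordlist would still be promoted, so the quality of the wordlist matters.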

Scope (out)

  • LLM-based entity extraction (a separate, more expensive path; not needed if corpus learning works).
  • Cross-tenant pattern sharing.

Acceptance

  • Ingesting a fixture corpus with the token cust_7f3a repeated more than 10 times across distinct chunks promotes it into entity_patterns.
  • After the learning pass, resolve("why did cust_7f3a fail", known_services=[], now=...) returns entities=["cust_7f3a"] without asking for clarification.
  • Existing regex-only behaviour is unchanged when entity_patterns is empty.
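
A minimal sketch of how the resolver side might satisfy the last two criteria, assuming learned patterns are stored as literal tokens and matched on token boundaries. PatternCache, detect_entities, the loader callback, and the use of float seconds for now are hypothetical names and choices, not the actual repi/intent/resolver.py API, and the single UUID regex only stands in for the regex layer from #56.

```python
import re

# Stand-in for the regex layer from #56; only a UUID pattern is included here.
BUILTIN_PATTERNS = [
    re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        re.IGNORECASE,
    ),
]


class PatternCache:
    """Holds the top-N learned patterns and reloads them once the TTL expires."""

    def __init__(self, loader, ttl_seconds=300.0):
        self._loader = loader        # hypothetical callable returning pattern strings
        self._ttl = ttl_seconds      # roughly the 5-minute TTL from the issue
        self._patterns = []
        self._loaded_at = None

    def get(self, now):
        # Reload on first use and whenever the cached copy is older than the TTL.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._patterns = list(self._loader())
            self._loaded_at = now
        return self._patterns


def detect_entities(query, cache, now):
    """Run the built-in regexes first, then the learned literal patterns."""
    entities = []
    for regex in BUILTIN_PATTERNS:
        entities.extend(match.group(0) for match in regex.finditer(query))

    for pattern in cache.get(now):
        # Match the literal token only when it is not part of a longer identifier.
        bounded = r"(?<![A-Za-z0-9_])" + re.escape(pattern) + r"(?![A-Za-z0-9_])"
        if re.search(bounded, query) and pattern not in entities:
            entities.append(pattern)
    return entities


if __name__ == "__main__":
    cache = PatternCache(lambda: ["cust_7f3a"])
    print(detect_entities("why did cust_7f3a fail", cache, now=0.0))
```

With an empty loader the learned-pattern loop does nothing, so the result is exactly what the built-in regexes produce, which is the behaviour the last criterion asks for.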

Files (estimated)

  • db/schema.sql
  • repi/models/schema.py
  • repi/ingestion/log_ingestor.py
  • repi/intent/resolver.py
  • tests/ingestion/test_pattern_learning.py

Depends on

The regex-layer entity detection (#56) must land first.
