Why
The first cut of entity detection in the intent resolver is regex-only. That covers the datasets currently shipped in the eval harness (HDFS blk_…, req_…, hex hashes, UUIDs, hyphenated identifiers) but is brittle to custom ID schemes like cust_7f3a or order#X9-2024. When the resolver misses an entity that's neither a service nor matched by a regex, fall back to corpus learning instead of asking the LLM every time.
Scope (in)
- New entity_patterns table: (pattern text pk, frequency bigint, nl_density real, last_seen_at timestamptz) plus a descending-frequency index.
- Incremental learning hook in repi/ingestion/log_ingestor.py: after each ingest batch, tokenise new chunk text, score each token by frequency * (1 - nl_density), with nl_density computed against a small English wordlist, and upsert tokens above a threshold.
- repi/intent/resolver.py loads the top-N patterns at startup with a ~5-min cache TTL.
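A minimal sketch of the scoring step, to make the formula concrete. Everything here is an assumption for illustration, not the planned implementation: the tokeniser regex, the tiny wordlist, the definition of nl_density as the fraction of a token's alphanumeric fragments found in that wordlist, counting frequency once per chunk, and the threshold of 10. The names select_patterns, nl_density, and PatternRow are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable, List, NamedTuple

# Assumed tokeniser: alphanumeric runs joined by '_', '#', or '-'.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[_#-][A-Za-z0-9]+)*")
FRAGMENT_RE = re.compile(r"[A-Za-z0-9]+")

# Stand-in for the "small English wordlist"; a real one would be larger.
ENGLISH_WORDS: FrozenSet[str] = frozenset(
    {"why", "did", "fail", "failed", "for", "the", "order",
     "payment", "attempt", "request", "error", "timeout"}
)


class PatternRow(NamedTuple):
    """Mirrors the (pattern, frequency, nl_density) columns of entity_patterns."""
    pattern: str
    frequency: int
    nl_density: float


def nl_density(token: str, wordlist: FrozenSet[str] = ENGLISH_WORDS) -> float:
    """Fraction of the token's fragments that are English words (assumed definition)."""
    fragments = FRAGMENT_RE.findall(token.lower())
    if not fragments:
        return 1.0  # nothing identifier-like; treat as natural language
    return sum(f in wordlist for f in fragments) / len(fragments)


def select_patterns(
    chunks: Iterable[str],
    threshold: float = 10.0,  # assumed cut-off, not taken from the issue
    wordlist: FrozenSet[str] = ENGLISH_WORDS,
) -> List[PatternRow]:
    """Return tokens whose frequency * (1 - nl_density) exceeds the threshold."""
    chunk_counts: Counter = Counter()
    for chunk in chunks:
        # Count each token once per chunk, so frequency means "distinct chunks".
        chunk_counts.update(set(TOKEN_RE.findall(chunk)))

    rows = []
    for token, frequency in chunk_counts.items():
        density = nl_density(token, wordlist)
        if frequency * (1.0 - density) > threshold:
            rows.append(PatternRow(token, frequency, density))
    return sorted(rows, key=lambda r: r.frequency, reverse=True)
```

With this definition, cust_7f3a seen in 11 distinct chunks scores 11 and is selected, while ordinary words in the wordlist score 0 regardless of frequency. Frequent tokens that are neither identifiers nor in the wordlist (for example bare numbers) would also pass, so a real hook would likely need extra filtering before the upsert.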
Scope (out)
- LLM-based entity extraction (a separate, more expensive path; not needed if corpus learning works).
- Cross-tenant pattern sharing.
Acceptance
- Ingesting a fixture corpus with the token cust_7f3a repeated more than 10 times across distinct chunks promotes it into entity_patterns.
- resolve("why did cust_7f3a fail", known_services=[], now=...) (after the learning pass) returns entities=["cust_7f3a"] with no clarification.
- Existing regex-only behaviour is unchanged when entity_patterns is empty.
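The resolver-side criteria could be exercised along these lines. This is a self-contained stand-in rather than the real repi/intent/resolver.py: resolve_sketch, PatternCache, Resolution, the simplified regex layer, and the extra cache argument are all hypothetical, and learned patterns are assumed to be literal tokens matched exactly.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

# Simplified stand-in for the regex layer (#56); covers only blk_/req_ ids.
REGEX_LAYER = re.compile(r"\b(?:blk|req)_\w+\b")
WORD_RE = re.compile(r"[\w#-]+")


@dataclass
class Resolution:
    entities: List[str]
    needs_clarification: bool


class PatternCache:
    """Reloads learned patterns via `loader` once the TTL has elapsed."""

    def __init__(self, loader: Callable[[], List[str]],
                 ttl: timedelta = timedelta(minutes=5)) -> None:
        self._loader = loader
        self._ttl = ttl
        self._patterns: List[str] = []
        self._loaded_at: Optional[datetime] = None

    def get(self, now: datetime) -> List[str]:
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._patterns = list(self._loader())
            self._loaded_at = now
        return self._patterns


def resolve_sketch(text: str, known_services: List[str], now: datetime,
                   cache: PatternCache) -> Resolution:
    """Regex layer first, then learned patterns, then known services."""
    entities = REGEX_LAYER.findall(text)
    tokens = set(WORD_RE.findall(text))
    for candidate in list(cache.get(now)) + list(known_services):
        if candidate in tokens and candidate not in entities:
            entities.append(candidate)
    return Resolution(entities, needs_clarification=not entities)


# Usage: an in-memory list stands in for the entity_patterns table.
table: List[str] = []
cache = PatternCache(loader=lambda: list(table))
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)

before = resolve_sketch("why did cust_7f3a fail", [], t0, cache)
table.append("cust_7f3a")  # simulate the learning pass
after = resolve_sketch("why did cust_7f3a fail", [], t0 + timedelta(minutes=5), cache)
```

Passing now explicitly keeps the TTL behaviour deterministic in tests: the learned token only becomes visible once the cached copy is at least five minutes old, and with an empty table the result comes from the regex layer alone.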
Files (estimated)
- db/schema.sql
- repi/models/schema.py
- repi/ingestion/log_ingestor.py
- repi/intent/resolver.py
- tests/ingestion/test_pattern_learning.py
Depends on
The regex-layer entity detection (#56) — must land first.