Add unknown text BM25 fallback #209

Open
Mathews-Tom wants to merge 1 commit into feat/lang-extraction-fixtures from feat/unknown-file-fallback

@Mathews-Tom (Owner) commented Jun 13, 2026

Stack: C3 language coverage expansion

Position: 4/4
Base: feat/lang-extraction-fixtures
Head: feat/unknown-file-fallback

Changes:

  • Add deterministic 80-line windows for unknown UTF-8 text files.
  • Keep binary files skipped.
  • Keep unknown chunks BM25-only by excluding them from vector, rerank, and SPLADE embedding paths.
  • Reduce cold parse/index cost by extracting symbols and imports from one tree-sitter parse per file.
  • Reduce chunking/index cost by caching chunk candidate text/token counts and skipping surrogate generation for BM25-only indexes.
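
The first three changes can be illustrated with a minimal sketch. It is not the code in this PR: the names `TextWindow`, `decode_text`, and `window_unknown_text` are hypothetical, and the NUL-byte check and non-overlapping windows are assumptions. Only the 80-line window size, the binary skip, and the BM25-only marking come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

WINDOW_LINES = 80  # window size stated in the PR description


@dataclass(frozen=True)
class TextWindow:
    """Hypothetical chunk record for an unknown-language text file."""
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str
    bm25_only: bool = True  # assumed flag: keep out of vector, rerank, and SPLADE paths


def decode_text(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None when the bytes look binary.

    Assumption: a NUL byte or invalid UTF-8 marks the file as binary.
    """
    if b"\x00" in raw:
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def window_unknown_text(raw: bytes, window_lines: int = WINDOW_LINES) -> List[TextWindow]:
    """Split text into fixed, non-overlapping line windows; skip binary input."""
    text = decode_text(raw)
    if text is None:
        return []  # binary files produce no chunks
    lines = text.splitlines(keepends=True)
    windows = []
    for start in range(0, len(lines), window_lines):
        block = lines[start:start + window_lines]
        windows.append(TextWindow(start + 1, start + len(block), "".join(block)))
    return windows
```

Because the window boundaries depend only on the line count, the same bytes always yield the same chunks, which is what makes the fallback deterministic in this sketch.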

Validation:

  • uv run ruff check && uv run ruff format --check . && uv run pyright passed.
  • uv run pytest tests/parse/ tests/pipeline/ tests/index/test_chunker.py tests/test_performance.py::TestQueryCacheSkipsParse -q reported 510 passed but exited non-zero because coverage is configured for the whole source tree.
  • uv run pytest -q reported 2240 passed, 4 deselected, 89.33% coverage.
  • The operator run before the optimization completed 35 benchmarks, and archex benchmark gate printed "Quality gate passed".
  • Existing-language recall regression check: 0 regressions across archex_query, archex_query_fusion, and archex_query_fusion_rerank.
  • Local micro-timing on the archex_project_index query improved from a pre-fix mean parse/index/total of 490.6/1632.9/2316.6 ms to 325.0/1092.4/1560.8 ms including the first cold import, and to approximately 264/1079/1499 ms after a warm import. Rerun the operator benchmark gate to confirm the 10% benchmark-wide index-time ceiling with full task coverage.
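
For reference, the relative change in the local micro-timings can be worked out as below. This is only arithmetic on the cold-import numbers quoted above; the helper name `percent_reduction` is hypothetical, and the result says nothing about the benchmark-wide 10% index-time ceiling, which still needs the operator gate rerun.

```python
# Mean timings in milliseconds, copied from the validation notes above.
PRE_FIX = {"parse": 490.6, "index": 1632.9, "total": 2316.6}
POST_FIX_COLD = {"parse": 325.0, "index": 1092.4, "total": 1560.8}


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


reductions = {
    stage: percent_reduction(PRE_FIX[stage], POST_FIX_COLD[stage])
    for stage in PRE_FIX
}

for stage, value in reductions.items():
    print(f"{stage}: {value:.1f}% lower")
# parse: 33.8% lower
# index: 33.1% lower
# total: 32.6% lower
```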

@Mathews-Tom force-pushed the feat/lang-extraction-fixtures branch from 28630b0 to e3500e1 on June 13, 2026 12:32
@Mathews-Tom force-pushed the feat/unknown-file-fallback branch 5 times, most recently from b512686 to 6e295e9, on June 13, 2026 14:12
@Mathews-Tom force-pushed the feat/lang-extraction-fixtures branch from e3500e1 to bab3b98 on June 13, 2026 15:14
Stack-Id: c3-language-coverage-20260613

Stack-Position: 4/4
@Mathews-Tom force-pushed the feat/unknown-file-fallback branch from 6e295e9 to 431dd71 on June 13, 2026 15:14