Skip to content

feat: ship default knowledge store + eval harness for forks#155

Merged
mabry1985 merged 6 commits into
mainfrom
feat/default-kb-and-eval-harness
Apr 27, 2026
Merged

feat: ship default knowledge store + eval harness for forks#155
mabry1985 merged 6 commits into
mainfrom
feat/default-kb-and-eval-harness

Conversation

@mabry1985
Copy link
Copy Markdown
Contributor

@mabry1985 mabry1985 commented Apr 27, 2026

Closes #154.

Summary

  • Default knowledge storeknowledge/store.py (sqlite + FTS5, LIKE fallback). One chunks table backs operator notes, daily-log entries, and conversation findings extracted by MemoryMiddleware. Default-on; middleware.knowledge: false to opt out.
  • Five memory tools in tools/lg_tools.pymemory_ingest, memory_recall, memory_list, memory_stats, daily_log — bound to the bundled store. echo removed.
  • Eval harness under evals/ — A2A client + runner + side-effect-verified tasks.json (15 starter cases: agent card, auth, abstention, every shipped tool, KB recall, chained, KnowledgeMiddleware). Audit log + reply text + KB state all checked independently — hallucinated tool results get caught.
  • Docsdocs/guides/evals.md how-to + every reference page updated to match the new defaults / nine-tool starter set.

Why side-effect verification

When the model produces "Logged: ..." without actually calling daily_log, text-only checks pass while the DB stays empty. The harness reads audit.jsonl and the chunks table after every turn so that bug doesn't ship.

Default flip — what existing forks will notice

knowledge_middleware and memory_middleware now default to true in LangGraphConfig. On first run:

  • The store opens at /sandbox/knowledge/agent.db (Docker default).
  • If that path is read-only or absent (running locally outside the container), it falls back to ~/.protoagent/knowledge/agent.db automatically — no permission errors.
  • Forks that want a stateless agent can flip middleware.knowledge: false in config/langgraph-config.yaml.

Test plan

  • python -m evals.runner against a wizard-set-up local agent — expect ≥13/15 cases passing on a small/fast model (text-pattern noise on a couple).
  • python -m evals.runner --category a2a-protocol — agent card + auth gating gate at 100% (deterministic).
  • Existing test suite — pytest tests/test_starter_tools.py tests/test_config_io.py tests/test_skill_emission.py tests/test_a2a_handler.py — should be all green (was 202/203 with one unrelated origin test failing on main).
  • Drawer Save & Reload — confirm the rebuilt graph still has the memory tools wired up (server.py reload path also constructs a store).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added persistent knowledge store with full‑text search, new memory tools (ingest, recall, list, stats) and daily logging; new evaluation harness and runner to validate agent/tool/KB behavior.
  • Configuration

    • Knowledge and memory middleware enabled by default; worker subagent default tool list updated and the simple "echo" starter tool removed.
  • Documentation

    • Expanded guides and references covering evals, configuration, starter tools, and tutorials.
  • Tests

    • Updated tests to reflect the new starter-tools set and tool-tracking changes.

The template now ships a working memory loop and end-to-end eval suite
on day one so forks have a green baseline before they touch a single
line of code. Closes #154.

What lands:

- knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks``
  table backs operator notes (memory_ingest), daily-log entries
  (daily_log), and conversation findings extracted by
  MemoryMiddleware (domain='finding'). Path resolves env >
  config > default with an automatic ~/.protoagent/ fallback when
  /sandbox isn't writable.

- tools/lg_tools.py — five new memory tools (memory_ingest,
  memory_recall, memory_list, memory_stats, daily_log) bound to the
  store via a closure factory so tests get a fresh store per run.
  ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses
  its parameter now.

- server.py — _build_knowledge_store() constructs the store and
  threads it through both initial init and the drawer reload path.
  Defaults flipped: knowledge_middleware + memory_middleware now
  ON by default (config/langgraph-config.yaml + graph/config.py).

- evals/ — A2A client + runner + verify helpers + 15 starter cases
  (tasks.json) covering agent card discovery, bearer auth gating,
  abstention, every shipped tool, KB recall, a chained two-tool
  case, and KnowledgeMiddleware injection. Side-effect-verified:
  audit log + reply text + KB chunks all checked independently so
  hallucinated tool results get caught.

- docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/
  starter-tools/first-agent updated to reflect the new defaults and
  the additional five memory tools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces the inline echo starter tool with five KnowledgeStore-bound memory tools plus daily_log, enables knowledge and memory middleware by default, adds a sqlite+FTS5 KnowledgeStore, wires it into server startup, and ships an eval harness (client, runner, verify, tasks, README) with docs and test updates.

Changes

Cohort / File(s) Summary
Knowledge Store
knowledge/__init__.py, knowledge/store.py
Add SQLite-backed KnowledgeStore and Chunk dataclass with FTS5 probe/fallback, search/list/stats/find/delete APIs, DB path resolution/fallback, safe error handling, and chunk/finding write APIs.
Memory Tools & Registry
tools/lg_tools.py
Remove echo; add _build_memory_tools(knowledge_store) producing memory_ingest, memory_recall, memory_list, memory_stats, daily_log; update get_all_tools(knowledge_store=None) to include memory tools only when a store is provided.
Server Integration & Config
server.py, graph/config.py, graph/subagents/config.py, config/langgraph-config.yaml
Enable knowledge/memory middleware by default; expand worker subagent tool allowlist to include memory/daily_log tools; add _build_knowledge_store(config) and pass knowledge_store into graph creation/reload; update inline docs and defaults.
Eval Harness
evals/client.py, evals/runner.py, evals/verify.py, evals/tasks.json, evals/README.md
Add async AgentClient/TaskResult, runner executing tasks.json kinds (agent_card, auth_check, ask, stream), verification helpers for audit/KB checks, setup/teardown, 13 starter tasks, and README describing usage and CI guidance.
Docs & Guides
docs/guides/evals.md, docs/guides/subagents.md, docs/guides/*, docs/reference/*, docs/tutorials/*
Add eval how‑to guide and update guides, tutorials, and reference docs to reflect memory tools, knowledge middleware defaults, updated starter-tools API (get_all_tools(knowledge_store)), and config examples.
Tests
tests/test_config_io.py, tests/test_skill_curator.py, tests/test_skill_emission.py, tests/test_starter_tools.py
Adjust fixtures and assertions to remove/replace echo with current_time, add expectations for web_search/fetch_url, and remove the echo-specific starter-tools test.
Starter tooling docs
docs/reference/starter-tools.md
Document new memory tools, update get_all_tools(knowledge_store) example, describe behavior when knowledge_store is disabled, and add examples for each memory tool.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Eval Runner
    participant Client as AgentClient
    participant Agent as Agent (A2A)
    participant Audit as Audit Log
    participant KB as Knowledge Store

    rect rgba(100, 150, 200, 0.5)
    Note over Runner,Client: Case setup & execution
    end

    Runner->>Client: ask(prompt, timeout_s=90)
    Client->>Agent: JSON-RPC /message/send
    Agent->>Audit: emit audit entries (tool calls)
    Agent->>KB: memory_* / daily_log operations
    Agent-->>Client: final TaskResult / status

    rect rgba(100, 200, 150, 0.5)
    Note over Runner,KB: Assertion validation
    end

    Runner->>Audit: audit_entries_since(ts)
    Audit-->>Runner: tool firing records
    Runner->>KB: search / find_chunk_containing / list_chunks
    KB-->>Runner: stored facts & side effects
    Runner->>Client: parse TaskResult.text
    Client-->>Runner: extracted reply text

    rect rgba(200, 150, 100, 0.5)
    Note over Runner,KB: Teardown
    end

    Runner->>KB: delete_by_content / delete_by_heading
    KB-->>Runner: deletion confirmations
Loading
sequenceDiagram
    participant Tool as Memory Tool
    participant KStore as KnowledgeStore
    participant DB as SQLite + FTS5

    rect rgba(150, 150, 200, 0.5)
    Note over Tool,DB: Tool invocation & persistence
    end

    Tool->>KStore: add_chunk(content, domain, heading)
    KStore->>DB: INSERT into chunks
    KStore->>DB: maintain chunks_fts via triggers (if FTS5)
    DB-->>KStore: row id | OperationalError -> None
    KStore-->>Tool: chunk_id | None

    rect rgba(150, 200, 150, 0.5)
    Note over Tool,DB: Retrieval with FTS
    end

    Tool->>KStore: search(query, domain)
    KStore->>DB: SELECT via FTS5 MATCH (or LIKE fallback)
    DB-->>KStore: scored results
    KStore-->>Tool: standardized result dicts

    rect rgba(200, 150, 150, 0.5)
    Note over Tool,DB: Cleanup
    end

    Tool->>KStore: delete_by_content / delete_by_heading
    KStore->>DB: DELETE + cascade FTS triggers
    DB-->>KStore: affected row count
    KStore-->>Tool: count
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changeset: shipping a default knowledge store and eval harness.
Description check ✅ Passed The description covers the Summary, Closes reference, and detailed Test plan. It explains what changed (knowledge store, memory tools, eval harness, docs), why (side-effect verification), and what users will notice (default middleware flip). All template sections are addressed.
Linked Issues check ✅ Passed The PR fully implements the objectives from issue #154: ships a how-to guide (docs/guides/evals.md), three-channel assertion harness (audit-log, reply text, KB state), reference layout (evals/{client,runner,verify}.py + tasks.json), and 15 starter cases covering agent card, auth, tools, chained calls, and KnowledgeMiddleware.
Out of Scope Changes check ✅ Passed All changes align with the stated objectives: knowledge store implementation, memory tools, eval harness, documentation updates, and test updates. No extraneous changes detected; even the middleware default flip is part of the intended design to enable the new features by default.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/default-kb-and-eval-harness

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/test_config_io.py (1)

321-331: 🧹 Nitpick | 🔵 Trivial

Tighten starter-tool contract coverage in this test.

Line 328–331 now verifies only part of the core starter set; a regression dropping web_search would no longer fail this test. Consider explicitly asserting it too.

✅ Small test hardening diff
 def test_list_available_tools_returns_starter_set():
@@
     assert "current_time" in names
     assert "calculator" in names
+    assert "web_search" in names
     assert "fetch_url" in names
     assert all(isinstance(n, str) for n in names)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_config_io.py` around lines 321 - 331, The test
test_list_available_tools_returns_starter_set currently asserts presence of
"current_time", "calculator", and "fetch_url" but misses "web_search"; update
the test in tests/test_config_io.py to also assert "web_search" is in the names
returned by list_available_tools (function list_available_tools in
graph.config_io), keeping the existing type check (all elements are str) so the
starter-tool contract is fully covered.
docs/reference/starter-tools.md (1)

183-188: ⚠️ Potential issue | 🟡 Minor

Preserve the bundled memory tools in the extension example.

Copying this snippet verbatim replaces the new conditional registry and drops all KnowledgeStore-backed tools when a store is configured. Show my_tool being appended to the existing pattern instead of returning a fresh list.

Suggested doc fix
 def get_all_tools(knowledge_store=None):
-    return [current_time, calculator, web_search, fetch_url, my_tool]
+    tools = [current_time, calculator, web_search, fetch_url, my_tool]
+    if knowledge_store is not None:
+        tools.extend(_build_memory_tools(knowledge_store))
+    return tools
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/starter-tools.md` around lines 183 - 188, The snippet replaces
the existing conditional registry and drops KnowledgeStore-backed tools; update
get_all_tools to obtain the existing tools list (keeping bundled tools like
current_time, calculator, web_search, fetch_url and any KnowledgeStore-backed
tools when knowledge_store is provided) and then append my_tool instead of
returning a fresh list. Concretely, inside get_all_tools(knowledge_store=None)
call or build the existing tools list used elsewhere (respecting the conditional
registry when knowledge_store is set), then tools.append(my_tool) and return
tools; reference get_all_tools, my_tool, and the knowledge_store parameter when
making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/subagents.md`:
- Around line 89-99: The docs updated the tools allowlist (current_time,
calculator, web_search, fetch_url, memory_ingest, memory_recall, memory_list,
memory_stats, daily_log) but the Step 2 LangGraphConfig snippet still includes
the legacy "echo" tool, causing a mismatch; update the LangGraphConfig example
to replace or remove "echo" so it exactly matches the new tools list and
max_turns setting (e.g., ensure the LangGraphConfig.tools array contains the
same tool names and that LangGraphConfig.max_turns is set to 20) so both
snippets are consistent.

In `@docs/tutorials/first-tool.md`:
- Around line 53-56: The example redefines WORKER_CONFIG.tools and accidentally
omits the default knowledge-backed tools; instead of replacing the whole
allowlist in the SubagentConfig example, update the snippet to append "git_sha"
to the existing default tools (or explicitly include the default entries:
"memory_ingest", "memory_recall", "memory_list", "memory_stats", "daily_log"
alongside "current_time", "calculator", "web_search", "fetch_url", "git_sha") so
the example matches the template behavior; locate WORKER_CONFIG and
SubagentConfig in the doc and change only the tools list accordingly.

In `@evals/runner.py`:
- Around line 205-209: The _RUNNERS dispatch dict only maps "agent_card",
"auth_check", and "ask" (functions _run_agent_card, _run_auth_check, _run_ask)
so evals/client.py::AgentClient.stream() is never reachable from tasks.json; add
a runner entry that invokes the streaming path (e.g., "agent_stream":
_run_agent_stream) or modify an existing runner (e.g., enhance _run_agent_card)
to detect streaming tasks and call AgentClient.stream(), then register that new
function in _RUNNERS and ensure tasks.json uses the new key so A2A
streaming/push-notification cases are testable.
- Around line 133-149: The teardown must run in a finally block so it always
executes even if client.ask or the verifier raises: wrap the call sequence in
run_one so that after verify.apply_setup(case["setup"]) you set a flag (e.g.,
setup_applied = True) and perform client.ask / any verifier calls inside try,
and then in finally check if setup_applied and "teardown" in case and call
verify.apply_teardown(case["teardown"]); apply the same pattern for the later
teardown usage around the code near the 185-186 region to ensure seeded KB rows
are always cleaned up.
- Around line 158-168: The code currently uses expected_tools =
case.get("expected_tools") or [] which conflates "no key" and "empty list";
change to expected_tools = case.get("expected_tools", None) and then run the
audit check only when expected_tools is not None so that an explicit empty list
asserts that no tool entries were emitted. Keep the rest of the logic (await
asyncio.sleep, entries = verify.audit_entries_since(since), require_success
calculation, and calling verify.assert_tools_fired) but wrap it in if
expected_tools is not None: ... so assert_tools_fired is invoked for [] as well
and add any failure detail to problems as before.

In `@knowledge/store.py`:
- Around line 394-423: The helpers must reject empty or whitespace-only match
strings before constructing LIKE queries: in find_chunk_containing (parameter
text) and in delete_by_content / any verifier helper using contains, call
strip() on the incoming string and if the result is empty immediately return
None (or no-op for delete) instead of building a "%...%" pattern; do this check
at the start of the functions (e.g., in find_chunk_containing, check if not text
or not text.strip() then return None) to prevent LIKE '%%' queries that match
everything.
- Around line 171-174: After installing the FTS schema, ensure existing rows in
the external-content table get indexed: when self._fts_available is true and
after db.executescript(_FTS_SCHEMA) completes, run an FTS5 rebuild for the
chunks_fts virtual table (i.e. perform the FTS5 "rebuild" insert into
chunks_fts) so preexisting rows in chunks are indexed; add this rebuild call
alongside the existing db.executescript usage and keep the guard using
_has_fts5(db)/self._fts_available.
- Around line 162-166: The PRAGMA journal_mode=WAL call in _connect currently
makes WAL mandatory and can raise sqlite3.OperationalError for read-only DBs;
modify _connect (or move the PRAGMA to the writable init path) so WAL is
best-effort: open the sqlite3.Connection and set row_factory as now, then
attempt db.execute("PRAGMA journal_mode=WAL") inside a try/except that catches
sqlite3.OperationalError (and logs/debugs if desired) but does not close or
return None on failure, so read-only connections remain usable; alternatively
ensure only the writable initialization path (e.g., the method that prepares a
writable DB) performs the WAL PRAGMA while _connect always returns a usable
connection for reads.

In `@tools/lg_tools.py`:
- Around line 312-329: Clamp tool inputs before calling the store: cap the
incoming k (and any limit arg in the other memory tools on lines ~332-348) to a
small positive max and enforce a minimum of 1, then pass the clamped value to
knowledge_store.search instead of the raw argument. Concretely, add a MAX_K
constant (e.g. 20) and replace uses of k with something like clamped_k = max(1,
min(k, MAX_K)) (and do the same for any limit parameter in the other memory_*
function(s)), then call knowledge_store.search(query, k=clamped_k) so the tool
cannot request arbitrarily large slices of the KB; apply the same clamping logic
to all memory_* entry points in this file.

---

Outside diff comments:
In `@docs/reference/starter-tools.md`:
- Around line 183-188: The snippet replaces the existing conditional registry
and drops KnowledgeStore-backed tools; update get_all_tools to obtain the
existing tools list (keeping bundled tools like current_time, calculator,
web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store
is provided) and then append my_tool instead of returning a fresh list.
Concretely, inside get_all_tools(knowledge_store=None) call or build the
existing tools list used elsewhere (respecting the conditional registry when
knowledge_store is set), then tools.append(my_tool) and return tools; reference
get_all_tools, my_tool, and the knowledge_store parameter when making the
change.

In `@tests/test_config_io.py`:
- Around line 321-331: The test test_list_available_tools_returns_starter_set
currently asserts presence of "current_time", "calculator", and "fetch_url" but
misses "web_search"; update the test in tests/test_config_io.py to also assert
"web_search" is in the names returned by list_available_tools (function
list_available_tools in graph.config_io), keeping the existing type check (all
elements are str) so the starter-tool contract is fully covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e658a3ad-718e-464a-a26c-f5dc1e6ba884

📥 Commits

Reviewing files that changed from the base of the PR and between 3c25c41 and 433b44e.

⛔ Files ignored due to path filters (2)
  • README.md is excluded by !*.md
  • TEMPLATE.md is excluded by !*.md
📒 Files selected for processing (27)
  • config/langgraph-config.yaml
  • docs/guides/customize-and-deploy.md
  • docs/guides/evals.md
  • docs/guides/fork-the-template.md
  • docs/guides/index.md
  • docs/guides/subagents.md
  • docs/reference/configuration.md
  • docs/reference/starter-tools.md
  • docs/tutorials/first-agent.md
  • docs/tutorials/first-tool.md
  • evals/README.md
  • evals/__init__.py
  • evals/client.py
  • evals/results/.gitkeep
  • evals/runner.py
  • evals/tasks.json
  • evals/verify.py
  • graph/config.py
  • graph/subagents/config.py
  • knowledge/__init__.py
  • knowledge/store.py
  • server.py
  • tests/test_config_io.py
  • tests/test_skill_curator.py
  • tests/test_skill_emission.py
  • tests/test_starter_tools.py
  • tools/lg_tools.py
💤 Files with no reviewable changes (1)
  • tests/test_starter_tools.py

Comment thread docs/guides/subagents.md
Comment thread docs/tutorials/first-tool.md
Comment thread evals/runner.py Outdated
Comment thread evals/runner.py Outdated
Comment thread evals/runner.py
Comment thread knowledge/store.py
Comment thread knowledge/store.py
Comment thread knowledge/store.py
Comment thread tools/lg_tools.py
Real bugs:
- evals/runner.py: teardown now runs in a finally block so seeded KB
  rows get cleaned up even when the verifier or client.ask() raises.
  expected_tools=[] now means "assert no tools fired" (was conflated
  with "no key" via the `or []` short-circuit, making the abstention
  case a no-op).
- evals/runner.py + tasks.json: added a `stream` runner kind so
  AgentClient.stream() is reachable from tasks.json — new
  streaming_status_updates case asserts the SSE event sequence.
- knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort
  (read-only DBs no longer break _connect). FTS5 rebuild after
  schema install so an existing chunks table populated before FTS
  was added gets indexed. find_chunk_containing/delete_by_content
  reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards
  from matching every row.

Hardening:
- tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and
  memory_list(limit) to [1, 200] so the agent can't request
  arbitrarily large slices of the KB.

Doc cleanup:
- docs/guides/subagents.md: LangGraphConfig snippet had a stale
  "echo" reference; replaced with the new memory-tool list.
- docs/tutorials/first-tool.md: WORKER_CONFIG example now appends
  git_sha alongside the bundled defaults instead of replacing
  them and dropping the memory tools.
- docs/reference/starter-tools.md: "adding your own" snippet now
  preserves the conditional _build_memory_tools(knowledge_store)
  extension.
- tests/test_config_io.py: starter-tool contract assertion now
  also covers web_search.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mabry1985
Copy link
Copy Markdown
Contributor Author

Round 1 of CodeRabbit feedback addressed in 4d0d428. Real bugs first, hardening after, docs last.

Real bugs

  • evals/runner.py — teardown moved into a finally block so seeded KB rows always clean up, even when client.ask() or the verifier raises.
  • evals/runner.pyexpected_tools is not None so [] now correctly asserts no tools fired (the abstention case was a no-op before).
  • evals/runner.py + tasks.json — added stream runner kind + a streaming_status_updates case so AgentClient.stream() is reachable from task definitions.
  • knowledge/store.pyPRAGMA journal_mode=WAL is now best-effort (read-only DBs stay usable). FTS5 rebuild runs after schema install so a chunks table populated before FTS was added gets indexed. find_chunk_containing and delete_by_content reject empty / whitespace-only inputs (prevents LIKE '%%' from matching every row).

Hardening

  • tools/lg_tools.pymemory_recall(k) clamped to [1, 20], memory_list(limit) to [1, 200].

Docs

  • docs/guides/subagents.md — stale echo reference removed from the LangGraphConfig snippet.
  • docs/tutorials/first-tool.mdWORKER_CONFIG example now appends git_sha alongside the bundled defaults rather than replacing them and dropping the memory tools.
  • docs/reference/starter-tools.md — "adding your own" snippet preserves the conditional _build_memory_tools(knowledge_store) extension.
  • tests/test_config_io.py — starter-tool contract now also asserts web_search.

Tests in scope (203 across the affected files): same 1 pre-existing baseline failure (test_subscribe_endpoint_origin_rejected) that reproduces on main and is unrelated to this PR.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 243-244: The nested condition checking setup_applied before
testing for a teardown key is redundant; update the block that currently uses
`if setup_applied or "teardown" in case:` followed by `if "teardown" in case:`
to a single guard that directly checks `if "teardown" in case:` (remove the
`setup_applied` portion and the inner duplicate check) so the teardown branch
runs only when the `"teardown"` key is present; adjust any dependent logic that
relied on `setup_applied` being true inside that branch if necessary (look for
references to `setup_applied` and `case` in the surrounding code).

In `@knowledge/store.py`:
- Around line 91-92: Update the _now_iso function to use the Python 3.11+
datetime.UTC alias instead of timezone.utc; specifically modify the _now_iso()
implementation so it calls datetime.now(datetime.UTC).isoformat(); keep the
function name _now_iso and its return type, and only apply this change if the
project targets Python 3.11+ (otherwise leave as timezone.utc).

In `@tools/lg_tools.py`:
- Around line 331-334: Replace the explicit loop that builds lines with a list
comprehension for clarity and conciseness: instead of initializing lines and
looping over results, create lines with a comprehension like the one that
produces f"[{r.get('domain', '?')}] {r['preview']}" for each r in results, then
return "\n".join(lines); update the block containing variables results, lines,
and r accordingly to remove the loop and keep the same behavior.
- Around line 284-285: Update the _build_memory_tools function signature to
include a return type annotation (e.g. -> list[Tool] or -> list[Any]) to reflect
that it returns a list of tool objects; import or reference the appropriate type
(Tool or Any) from typing or the module that defines the tool class and adjust
the signature from def _build_memory_tools(knowledge_store): to include the
chosen return annotation so IDEs and type checkers pick up the correct return
type.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6e241b4-d596-4c12-9a86-9665470c476a

📥 Commits

Reviewing files that changed from the base of the PR and between 433b44e and 4d0d428.

📒 Files selected for processing (8)
  • docs/guides/subagents.md
  • docs/reference/starter-tools.md
  • docs/tutorials/first-tool.md
  • evals/runner.py
  • evals/tasks.json
  • knowledge/store.py
  • tests/test_config_io.py
  • tools/lg_tools.py

Comment thread evals/runner.py Outdated
Comment thread knowledge/store.py Outdated
Comment thread tools/lg_tools.py Outdated
Comment thread tools/lg_tools.py Outdated
- evals/runner.py: collapse redundant nested teardown guard into a
  single `if "teardown" in case:` (SIM102); remove now-unused
  `setup_applied` flag
- knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017)
- tools/lg_tools.py: add `-> list` return annotation to
  `_build_memory_tools` (ANN202); replace explicit loop with list
  comprehension in `memory_recall` (PERF401)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ
Copy link
Copy Markdown
Contributor Author

Round 2 review — addressed in commit cab3bd8

CodeRabbit's second review (submitted 2026-04-27T21:51Z, after the round-1 push at 21:47Z) posted 4 new inline comments, all labelled Trivial / Nitpick. All four have been applied:

File Finding Action
evals/runner.py L243-244 Redundant nested if setup_applied or "teardown" in case: wrapping if "teardown" in case: (SIM102) Collapsed to single if "teardown" in case:; removed the now-unused setup_applied flag.
knowledge/store.py L92 Use datetime.UTC alias instead of timezone.utc (UP017) Changed import to from datetime import UTC, datetime; updated call-site. Valid: project requires Python ≥ 3.11.
tools/lg_tools.py L284 Missing return type annotation on _build_memory_tools (ANN202) Added -> list to the function signature.
tools/lg_tools.py L331-334 Explicit loop building lines — should be a list comprehension (PERF401) Replaced with one-liner comprehension.

All three touched files were spot-checked against the pre-existing test run: same 11 failures (asyncio/import issues) with and without these changes — confirmed pre-existing baseline.


Generated by Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
evals/runner.py (1)

150-157: ⚠️ Potential issue | 🟠 Major

Setup failures can still leak seeded KB rows.

evals/verify.py::apply_setup() applies steps incrementally and returns on the first failure. Because this return happens before Line 162 enters the try, a partial setup failure skips the finally on Lines 237-242 and any configured teardown never runs, so later cases can inherit leftover fixture data. Move setup inside the try so cleanup also happens on setup errors.

🧹 Suggested fix
-    # Pre-seed state via direct DB writes (model never sees this).
-    if "setup" in case:
-        err = verify.apply_setup(case["setup"])
-        if err:
-            return CaseResult(
-                case["id"], case["category"], case["name"], False,
-                f"setup failed: {err}",
-            )
-
     events: list[dict] = []
     result: TaskResult | None = None
 
     try:
+        # Pre-seed state via direct DB writes (model never sees this).
+        if "setup" in case:
+            err = verify.apply_setup(case["setup"])
+            if err:
+                return CaseResult(
+                    case["id"], case["category"], case["name"], False,
+                    f"setup failed: {err}",
+                )
         since = verify.audit_now()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 150 - 157, The setup path currently calls
verify.apply_setup(...) before entering the surrounding try/finally, so a
partial failure can return early and skip the cleanup; move the call to
verify.apply_setup(case["setup"]) inside the same try block that contains the
finally cleanup, and on error still produce the same CaseResult (using
CaseResult(case["id"], case["category"], case["name"], False, f"setup failed:
{err}")) but only after the finally has run; reference verify.apply_setup and
the existing try/finally/CaseResult flow to locate where to relocate the setup
call and ensure teardown always executes even when setup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 97-103: The current code builds a bearer-only headers dict
(variables expected_status, bad, headers), which prevents tests from exercising
X-API-Key scenarios; modify the request header construction to accept headers
from the test case (e.g., case.get("headers")) and merge them with sensible
defaults like "Content-Type": "application/json" and a default Authorization
using case.get("bad_token",""), so tasks can specify missing/invalid X-API-Key
or other auth combos; ensure case-supplied headers override defaults when
present so auth scenarios in tasks.json are honored.
- Around line 192-199: The fixed 0.3s sleep is brittle; replace it with a short
polling loop that repeatedly calls verify.audit_entries_since(since) and
verify.assert_tools_fired(entries, expected_tools, require_success=...) until
either the assertion passes or a deadline (e.g. 1–3 seconds) is reached; on
timeout fail the test with the last detail. Locate the block that uses
expected_tools, since, verify.audit_entries_since and verify.assert_tools_fired
in runner.py and swap the single asyncio.sleep for an async loop that awaits
small intervals (e.g. 50–200ms) between polls, updates entries each iteration,
and breaks early when passed is True. Ensure the final behavior preserves
require_success computed from case.get("tool_outcome", "success") and returns or
raises the same failure path when timed out.

In `@knowledge/store.py`:
- Around line 311-316: The FTS5 query builds MATCH from raw tokens in
_search_fts (tokens = [...] and match = " OR ".join(tokens)), which lets
apostrophes and reserved words break the SQL; instead quote/escape each token
with SQLite's FTS5 quoting helper (use fts5_quote() or an equivalent escape
function) before joining so the MATCH expression is safe (e.g., mapped_token =
fts5_quote(t) for t in tokens), then join mapped tokens with " OR " and use that
for the MATCH; keep the early empty-token check and existing fallback-to-LIKE
behavior intact.
- Around line 176-199: The exception handlers in the store code only catch
sqlite3.OperationalError but must catch sqlite3.DatabaseError to honor the
corrupt-database/read-only contract; update every except block that currently
uses "except sqlite3.OperationalError" to "except sqlite3.DatabaseError" (or the
broader sqlite3.Error/DatabaseError as appropriate) so operations like the
schema init (the try around self._connect(), db.executescript(_SCHEMA), the FTS
rebuild block that calls db.execute(...), and the outer commit/close) and any
other error-handling paths that reference sqlite3.OperationalError will
correctly handle corrupt or read-only DBs (search for the other handlers in this
module and replace them similarly).
- Around line 429-458: The LIKE patterns in find_chunk_containing (the SQL using
"WHERE (content LIKE ? OR heading LIKE ?)") and delete_by_content (the "DELETE
FROM chunks WHERE content LIKE ?") must treat input literally: escape any '%'
and '_' in the input by replacing '%' -> '\%' and '_' -> '\_' (taking care to
escape the backslash in Python strings), build the parameter as f"%{escaped}%",
and add " ESCAPE '\'" to the SQL statements so SQLite treats '\' as the escape
character; update the params passed to db.execute accordingly and keep the
domain handling as-is.

---

Duplicate comments:
In `@evals/runner.py`:
- Around line 150-157: The setup path currently calls verify.apply_setup(...)
before entering the surrounding try/finally, so a partial failure can return
early and skip the cleanup; move the call to verify.apply_setup(case["setup"])
inside the same try block that contains the finally cleanup, and on error still
produce the same CaseResult (using CaseResult(case["id"], case["category"],
case["name"], False, f"setup failed: {err}")) but only after the finally has
run; reference verify.apply_setup and the existing try/finally/CaseResult flow
to locate where to relocate the setup call and ensure teardown always executes
even when setup fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: de696887-9338-4c91-8268-f0dc82e83016

📥 Commits

Reviewing files that changed from the base of the PR and between 4d0d428 and cab3bd8.

📒 Files selected for processing (3)
  • evals/runner.py
  • knowledge/store.py
  • tools/lg_tools.py

Comment thread evals/runner.py
Comment thread evals/runner.py
Comment thread knowledge/store.py Outdated
Comment thread knowledge/store.py Outdated
Comment thread knowledge/store.py Outdated
Real bugs:
- evals/runner.py: setup is now inside the try block so a partial
  setup failure (e.g. step 2 of 3 errors) still triggers the
  finally teardown — rows from steps that did succeed no longer
  leak into the next case. Was flagged as a duplicate from round 2.
- knowledge/store.py: LIKE patterns now escape % and _ via
  ESCAPE '\' on every clause that takes user input
  (find_chunk_containing, delete_by_content, _search_like). A
  query for "100%" or "hello_world" no longer silently matches
  every row containing "100" or any single character between
  "hello" and "world".
- knowledge/store.py: FTS5 MATCH tokens are now double-quoted via
  _fts_quote() so user-supplied query terms can't smuggle FTS5
  operators (column filters, prefix wildcards, NEAR, AND/OR/NOT).
  Defence in depth — the [\w']+ tokenizer already filters most
  special chars.

Hardening:
- evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the
  audit log to flush is gone. _await_audit_assertion now polls
  every 50ms up to a 2s deadline and returns as soon as the
  assertion passes — exits early on success, only burns the full
  deadline when the tool genuinely never fired.
- evals/runner.py: _run_auth_check accepts case["headers"] so cases
  can override the default bearer-only header set and exercise
  X-API-Key auth scenarios (or both auths together).
- knowledge/store.py: per-method exception handlers broadened from
  sqlite3.OperationalError to sqlite3.DatabaseError. Catches
  IntegrityError, ProgrammingError, and corruption variants too
  without crashing the agent loop. _has_fts5 (probe) and _connect
  (connection-time errors only) keep the narrower OperationalError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mabry1985
Copy link
Copy Markdown
Contributor Author

Round 3 of CodeRabbit feedback addressed in b3a9f1d. All 5 actionable comments + the duplicate Major from round 2 are applied.

Real bugs

  • evals/runner.pyapply_setup moved inside the try block so a partial setup failure (e.g. step 2 of 3 errors) still triggers the finally teardown. Was the round-2 duplicate Major.
  • knowledge/store.py — LIKE patterns now escape % and _ with ESCAPE '\' in find_chunk_containing, delete_by_content, and _search_like. Verified: a search for "100%" matches only rows containing literal 100%, not every row starting with 100.
  • knowledge/store.py — FTS5 MATCH tokens are double-quoted via _fts_quote() so user-supplied query terms can't smuggle FTS5 operators (column filters, prefix wildcards, NEAR, AND/OR/NOT). Defence in depth — the [\w']+ tokenizer already filtered most special chars, but the quote pass closes the rest.

Hardening

  • evals/runner.py — the fixed await asyncio.sleep(0.3) audit-log wait is gone. _await_audit_assertion polls every 50ms up to a 2s deadline and exits as soon as the assertion passes — early exit on success, deadline only used when the tool genuinely never fired.
  • evals/runner.py_run_auth_check now accepts case["headers"] so cases can exercise X-API-Key auth too. Default behaviour is unchanged.
  • knowledge/store.py — per-method exception handlers broadened from sqlite3.OperationalError to sqlite3.DatabaseError. Catches IntegrityError, ProgrammingError, and corruption variants too. _has_fts5 (probe) and _connect (connection-time only) keep the narrower OperationalError.

Smoke tests pass:

  • _escape_like("100%")"100\\%"; _fts_quote('he"llo')'"he""llo"'.
  • End-to-end: find_chunk_containing("100%") matches only the literal-percent row; find_chunk_containing("hello_world") matches only the underscore row; search("AND OR NOT") is treated as three literal tokens, no FTS5 syntax error.
  • Scoped pytest (168 tests across test_starter_tools / test_config_io / test_skill_emission / test_a2a_handler / test_a2a_integration): all pass except the same single pre-existing test_subscribe_endpoint_origin_rejected failure that reproduces on main.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
knowledge/store.py (1)

234-239: ⚠️ Potential issue | 🟠 Major

Widen exception scope in _get_db to honor the no-crash contract.

The module docstring (lines 20–22) states that the store "swallows sqlite3.DatabaseError" so it "never crashes the agent loop on a corrupt or read-only DB." However, _get_db only catches OperationalError. Corrupt databases raise sqlite3.DatabaseError (the parent class), which would propagate and crash the caller.

🔧 Suggested fix
     def _get_db(self) -> sqlite3.Connection | None:
         try:
             return self._connect()
-        except sqlite3.OperationalError as exc:
+        except sqlite3.DatabaseError as exc:
             log.error("[knowledge] connect failed: %s", exc)
             return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@knowledge/store.py` around lines 234 - 239, In _get_db, broaden the exception
catch to sqlite3.DatabaseError (the parent of OperationalError) so corrupt or
read-only DB errors are swallowed as the module docstring requires; change the
except clause in the _get_db method (currently "except sqlite3.OperationalError
as exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
evals/runner.py (1)

237-239: 🧹 Nitpick | 🔵 Trivial

Prefix unused entries variable with underscore.

The entries value from _await_audit_assertion is unpacked but never used within _run_prompt_case. Prefix it with _ to indicate it's intentionally ignored and silence the static analysis warning.

♻️ Suggested fix
-            entries, passed, detail = await _await_audit_assertion(
+            _entries, passed, detail = await _await_audit_assertion(
                 since, expected_tools, require_success=require_success,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 237 - 239, In _run_prompt_case, the first value
returned by _await_audit_assertion is unpacked to entries but never used; change
the unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 170-179: Replace uses of asyncio.get_event_loop().time() with
asyncio.get_running_loop().time() when computing deadline and checking time in
the loop: update the two call sites around the deadline assignment and the loop
condition where deadline is compared, keeping the same logic with variables
'deadline', 'since', and the verify.audit_entries_since /
verify.assert_tools_fired calls unchanged so the behavior and comparisons remain
identical.
- Around line 339-340: The timestamp in the payload uses
datetime.now(timezone.utc).isoformat() which is inconsistent with the codebase
alias; update the payload construction to use the datetime.UTC alias (i.e.,
datetime.now(datetime.UTC).isoformat()) so it matches other files like
knowledge/store.py and keeps style consistent; locate the payload assignment in
evals/runner.py (the payload variable and the datetime.now(...) call) and
replace timezone.utc with datetime.UTC.

In `@knowledge/store.py`:
- Around line 229-230: Replace the log.error(...) calls inside the exception
handlers that catch sqlite3.DatabaseError (the blocks that currently read
"except sqlite3.DatabaseError as exc: log.error(..., exc)") with
log.exception(...) so the error message for schema/init/database failures
includes the full traceback; keep the same descriptive message (including
self.path or other context) when calling log.exception so callers see both the
message and the exception stack. Ensure you update every occurrence of that
pattern (the sqlite3.DatabaseError except blocks where log.error is used) rather
than adding new try/except logic.

---

Duplicate comments:
In `@evals/runner.py`:
- Around line 237-239: In _run_prompt_case, the first value returned by
_await_audit_assertion is unpacked to entries but never used; change the
unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.

In `@knowledge/store.py`:
- Around line 234-239: In _get_db, broaden the exception catch to
sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only
DB errors are swallowed as the module docstring requires; change the except
clause in the _get_db method (currently "except sqlite3.OperationalError as
exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 547c1280-742e-4685-bc51-8c02425b92dd

📥 Commits

Reviewing files that changed from the base of the PR and between cab3bd8 and b3a9f1d.

📒 Files selected for processing (2)
  • evals/runner.py
  • knowledge/store.py

Comment thread evals/runner.py Outdated
Comment thread evals/runner.py Outdated
Comment thread knowledge/store.py Outdated
- evals/runner.py: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside the _await_audit_assertion coroutine
- evals/runner.py: prefix unused _entries return value with underscore
- evals/runner.py: use datetime.UTC alias (consistent with store.py),
  drop now-unused timezone import
- knowledge/store.py: broaden _get_db exception catch from
  OperationalError to DatabaseError so corrupt-DB errors are swallowed
  per the module's no-crash contract
- knowledge/store.py: replace log.error with log.exception in all three
  DatabaseError handlers (schema init, _get_db, add_chunk) so
  tracebacks appear in error logs

Co-Authored-By: Claude <claude@anthropic.com>

https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH
Copy link
Copy Markdown
Contributor Author

Round 4 — fix(review-4) (commit b713d8d)

CodeRabbit's round-4 review (submitted 2026-04-27T22:30:33Z, covering commit b3a9f1d) posted 3 new actionable items plus 2 duplicate carryovers from round 3. All 5 are addressed below.


✅ Fixed

# File Finding Action
1 evals/runner.py L170, L179 asyncio.get_event_loop().time() used inside a coroutine — deprecated since Python 3.10, raises DeprecationWarning Changed to asyncio.get_running_loop().time() in both call sites inside _await_audit_assertion
2 knowledge/store.py _get_db Catches sqlite3.OperationalError instead of the broader sqlite3.DatabaseError, violating the module's stated no-crash contract (corrupt-DB errors would propagate) Changed except sqlite3.OperationalErrorexcept sqlite3.DatabaseError
3 knowledge/store.py L230, L238, L270 log.error(...) in three DatabaseError handlers omits tracebacks Replaced with log.exception(...) (no exc arg needed — captured automatically) across schema-init, _get_db, and add_chunk
4 evals/runner.py L237 entries unpacked from _await_audit_assertion but never used (duplicate carryover) Renamed to _entries
5 evals/runner.py L340 datetime.now(timezone.utc) inconsistent with knowledge/store.py style (duplicate carryover) Changed to datetime.now(UTC); removed now-unused timezone import

ℹ️ Not addressed (out of scope / already correct)

  • knowledge/store.py L133 (_has_fts5) and L202 (WAL PRAGMA) — both still catch sqlite3.OperationalError. These are intentional: _has_fts5 is probing for FTS5 support, and the WAL PRAGMA is explicitly best-effort for read-only mounts. Neither is a contract-violation.
  • knowledge/store.py L319, L422, L439, L479, L504, L522 — these DatabaseError handlers use log.warning, not log.error. CodeRabbit's finding targeted the log.error pattern specifically; warning-level handlers are deliberately lower-severity for expected operational failures (search miss, stats failure, etc.).

Round 4 has no new blockers. Ready to merge when the team is.


Generated by Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
evals/runner.py (1)

231-241: ⚠️ Potential issue | 🟠 Major

Regression: expected_tools: [] no longer asserts abstention.

The logic at Line 235 still treats empty lists like normal expected-tool checks, and verify.assert_tools_fired(..., expected=[]) passes even when tools fired. This contradicts the Line 231-233 contract.

💡 Suggested fix
         expected_tools = case.get("expected_tools")
         if expected_tools is not None:
             require_success = case.get("tool_outcome", "success") == "success"
-            _entries, passed, detail = await _await_audit_assertion(
-                since, expected_tools, require_success=require_success,
-            )
-            if not passed:
-                problems.append(detail)
+            if not expected_tools:
+                loop = asyncio.get_running_loop()
+                deadline = loop.time() + _AUDIT_POLL_DEADLINE_S
+                while True:
+                    entries = verify.audit_entries_since(since)
+                    fired = [e.get("tool", "?") for e in entries if e.get("tool")]
+                    if fired:
+                        problems.append(f"unexpected tools fired: {fired}")
+                        break
+                    if loop.time() >= deadline:
+                        break
+                    await asyncio.sleep(_AUDIT_POLL_INTERVAL_S)
+            else:
+                _entries, passed, detail = await _await_audit_assertion(
+                    since, expected_tools, require_success=require_success,
+                )
+                if not passed:
+                    problems.append(detail)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 231 - 241, The test treats an explicit empty
list for expected_tools as an abstention assertion but the current branch treats
[] like a normal expectation; add an explicit branch for expected_tools == []
that enforces “no tools fired”: detect if expected_tools is a list and
len(expected_tools) == 0, then call the audit check that verifies zero tool
entries (either by adding an assert_none/expect_empty flag to
_await_audit_assertion or by calling a new helper like
_await_no_tools_assertion(since)) and append detail to problems if that check
fails; keep the existing path for non-empty expected_tools and continue to use
require_success derived from case.get("tool_outcome").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@evals/runner.py`:
- Around line 231-241: The test treats an explicit empty list for expected_tools
as an abstention assertion but the current branch treats [] like a normal
expectation; add an explicit branch for expected_tools == [] that enforces “no
tools fired”: detect if expected_tools is a list and len(expected_tools) == 0,
then call the audit check that verifies zero tool entries (either by adding an
assert_none/expect_empty flag to _await_audit_assertion or by calling a new
helper like _await_no_tools_assertion(since)) and append detail to problems if
that check fails; keep the existing path for non-empty expected_tools and
continue to use require_success derived from case.get("tool_outcome").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d06a7a4-c90c-4424-9401-2d0aeb53f4d3

📥 Commits

Reviewing files that changed from the base of the PR and between b3a9f1d and ae752a5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • evals/runner.py
  • knowledge/store.py

@mabry1985 mabry1985 merged commit 3749914 into main Apr 27, 2026
1 check passed
@mabry1985 mabry1985 deleted the feat/default-kb-and-eval-harness branch April 27, 2026 22:54
mabry1985 added a commit that referenced this pull request Apr 28, 2026
…uler

docs: sync after KB store + scheduler PRs (#155 / #156)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs: add eval-setup guide for forks

2 participants