feat: ship default knowledge store + eval harness for forks#155
Conversation
The template now ships a working memory loop and end-to-end eval suite on day one so forks have a green baseline before they touch a single line of code. Closes #154. What lands: - knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks`` table backs operator notes (memory_ingest), daily-log entries (daily_log), and conversation findings extracted by MemoryMiddleware (domain='finding'). Path resolves env > config > default with an automatic ~/.protoagent/ fallback when /sandbox isn't writable. - tools/lg_tools.py — five new memory tools (memory_ingest, memory_recall, memory_list, memory_stats, daily_log) bound to the store via a closure factory so tests get a fresh store per run. ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses its parameter now. - server.py — _build_knowledge_store() constructs the store and threads it through both initial init and the drawer reload path. Defaults flipped: knowledge_middleware + memory_middleware now ON by default (config/langgraph-config.yaml + graph/config.py). - evals/ — A2A client + runner + verify helpers + 15 starter cases (tasks.json) covering agent card discovery, bearer auth gating, abstention, every shipped tool, KB recall, a chained two-tool case, and KnowledgeMiddleware injection. Side-effect-verified: audit log + reply text + KB chunks all checked independently so hallucinated tool results get caught. - docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/ starter-tools/first-agent updated to reflect the new defaults and the additional five memory tools. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces the inline Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as Eval Runner
participant Client as AgentClient
participant Agent as Agent (A2A)
participant Audit as Audit Log
participant KB as Knowledge Store
rect rgba(100, 150, 200, 0.5)
Note over Runner,Client: Case setup & execution
end
Runner->>Client: ask(prompt, timeout_s=90)
Client->>Agent: JSON-RPC /message/send
Agent->>Audit: emit audit entries (tool calls)
Agent->>KB: memory_* / daily_log operations
Agent-->>Client: final TaskResult / status
rect rgba(100, 200, 150, 0.5)
Note over Runner,KB: Assertion validation
end
Runner->>Audit: audit_entries_since(ts)
Audit-->>Runner: tool firing records
Runner->>KB: search / find_chunk_containing / list_chunks
KB-->>Runner: stored facts & side effects
Runner->>Client: parse TaskResult.text
Client-->>Runner: extracted reply text
rect rgba(200, 150, 100, 0.5)
Note over Runner,KB: Teardown
end
Runner->>KB: delete_by_content / delete_by_heading
KB-->>Runner: deletion confirmations
sequenceDiagram
participant Tool as Memory Tool
participant KStore as KnowledgeStore
participant DB as SQLite + FTS5
rect rgba(150, 150, 200, 0.5)
Note over Tool,DB: Tool invocation & persistence
end
Tool->>KStore: add_chunk(content, domain, heading)
KStore->>DB: INSERT into chunks
KStore->>DB: maintain chunks_fts via triggers (if FTS5)
DB-->>KStore: row id | OperationalError -> None
KStore-->>Tool: chunk_id | None
rect rgba(150, 200, 150, 0.5)
Note over Tool,DB: Retrieval with FTS
end
Tool->>KStore: search(query, domain)
KStore->>DB: SELECT via FTS5 MATCH (or LIKE fallback)
DB-->>KStore: scored results
KStore-->>Tool: standardized result dicts
rect rgba(200, 150, 150, 0.5)
Note over Tool,DB: Cleanup
end
Tool->>KStore: delete_by_content / delete_by_heading
KStore->>DB: DELETE + cascade FTS triggers
DB-->>KStore: affected row count
KStore-->>Tool: count
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/test_config_io.py (1)
321-331: 🧹 Nitpick | 🔵 TrivialTighten starter-tool contract coverage in this test.
Line 328–331 now verifies only part of the core starter set; a regression dropping
web_searchwould no longer fail this test. Consider explicitly asserting it too.✅ Small test hardening diff
def test_list_available_tools_returns_starter_set(): @@ assert "current_time" in names assert "calculator" in names + assert "web_search" in names assert "fetch_url" in names assert all(isinstance(n, str) for n in names)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_config_io.py` around lines 321 - 331, The test test_list_available_tools_returns_starter_set currently asserts presence of "current_time", "calculator", and "fetch_url" but misses "web_search"; update the test in tests/test_config_io.py to also assert "web_search" is in the names returned by list_available_tools (function list_available_tools in graph.config_io), keeping the existing type check (all elements are str) so the starter-tool contract is fully covered.docs/reference/starter-tools.md (1)
183-188:⚠️ Potential issue | 🟡 MinorPreserve the bundled memory tools in the extension example.
Copying this snippet verbatim replaces the new conditional registry and drops all
KnowledgeStore-backed tools when a store is configured. Showmy_toolbeing appended to the existing pattern instead of returning a fresh list.Suggested doc fix
def get_all_tools(knowledge_store=None): - return [current_time, calculator, web_search, fetch_url, my_tool] + tools = [current_time, calculator, web_search, fetch_url, my_tool] + if knowledge_store is not None: + tools.extend(_build_memory_tools(knowledge_store)) + return tools🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/starter-tools.md` around lines 183 - 188, The snippet replaces the existing conditional registry and drops KnowledgeStore-backed tools; update get_all_tools to obtain the existing tools list (keeping bundled tools like current_time, calculator, web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store is provided) and then append my_tool instead of returning a fresh list. Concretely, inside get_all_tools(knowledge_store=None) call or build the existing tools list used elsewhere (respecting the conditional registry when knowledge_store is set), then tools.append(my_tool) and return tools; reference get_all_tools, my_tool, and the knowledge_store parameter when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/subagents.md`:
- Around line 89-99: The docs updated the tools allowlist (current_time,
calculator, web_search, fetch_url, memory_ingest, memory_recall, memory_list,
memory_stats, daily_log) but the Step 2 LangGraphConfig snippet still includes
the legacy "echo" tool, causing a mismatch; update the LangGraphConfig example
to replace or remove "echo" so it exactly matches the new tools list and
max_turns setting (e.g., ensure the LangGraphConfig.tools array contains the
same tool names and that LangGraphConfig.max_turns is set to 20) so both
snippets are consistent.
In `@docs/tutorials/first-tool.md`:
- Around line 53-56: The example redefines WORKER_CONFIG.tools and accidentally
omits the default knowledge-backed tools; instead of replacing the whole
allowlist in the SubagentConfig example, update the snippet to append "git_sha"
to the existing default tools (or explicitly include the default entries:
"memory_ingest", "memory_recall", "memory_list", "memory_stats", "daily_log"
alongside "current_time", "calculator", "web_search", "fetch_url", "git_sha") so
the example matches the template behavior; locate WORKER_CONFIG and
SubagentConfig in the doc and change only the tools list accordingly.
In `@evals/runner.py`:
- Around line 205-209: The _RUNNERS dispatch dict only maps "agent_card",
"auth_check", and "ask" (functions _run_agent_card, _run_auth_check, _run_ask)
so evals/client.py::AgentClient.stream() is never reachable from tasks.json; add
a runner entry that invokes the streaming path (e.g., "agent_stream":
_run_agent_stream) or modify an existing runner (e.g., enhance _run_agent_card)
to detect streaming tasks and call AgentClient.stream(), then register that new
function in _RUNNERS and ensure tasks.json uses the new key so A2A
streaming/push-notification cases are testable.
- Around line 133-149: The teardown must run in a finally block so it always
executes even if client.ask or the verifier raises: wrap the call sequence in
run_one so that after verify.apply_setup(case["setup"]) you set a flag (e.g.,
setup_applied = True) and perform client.ask / any verifier calls inside try,
and then in finally check if setup_applied and "teardown" in case and call
verify.apply_teardown(case["teardown"]); apply the same pattern for the later
teardown usage around the code near the 185-186 region to ensure seeded KB rows
are always cleaned up.
- Around line 158-168: The code currently uses expected_tools =
case.get("expected_tools") or [] which conflates "no key" and "empty list";
change to expected_tools = case.get("expected_tools", None) and then run the
audit check only when expected_tools is not None so that an explicit empty list
asserts that no tool entries were emitted. Keep the rest of the logic (await
asyncio.sleep, entries = verify.audit_entries_since(since), require_success
calculation, and calling verify.assert_tools_fired) but wrap it in if
expected_tools is not None: ... so assert_tools_fired is invoked for [] as well
and add any failure detail to problems as before.
In `@knowledge/store.py`:
- Around line 394-423: The helpers must reject empty or whitespace-only match
strings before constructing LIKE queries: in find_chunk_containing (parameter
text) and in delete_by_content / any verifier helper using contains, call
strip() on the incoming string and if the result is empty immediately return
None (or no-op for delete) instead of building a "%...%" pattern; do this check
at the start of the functions (e.g., in find_chunk_containing, check if not text
or not text.strip() then return None) to prevent LIKE '%%' queries that match
everything.
- Around line 171-174: After installing the FTS schema, ensure existing rows in
the external-content table get indexed: when self._fts_available is true and
after db.executescript(_FTS_SCHEMA) completes, run an FTS5 rebuild for the
chunks_fts virtual table (i.e. perform the FTS5 "rebuild" insert into
chunks_fts) so preexisting rows in chunks are indexed; add this rebuild call
alongside the existing db.executescript usage and keep the guard using
_has_fts5(db)/self._fts_available.
- Around line 162-166: The PRAGMA journal_mode=WAL call in _connect currently
makes WAL mandatory and can raise sqlite3.OperationalError for read-only DBs;
modify _connect (or move the PRAGMA to the writable init path) so WAL is
best-effort: open the sqlite3.Connection and set row_factory as now, then
attempt db.execute("PRAGMA journal_mode=WAL") inside a try/except that catches
sqlite3.OperationalError (and logs/debugs if desired) but does not close or
return None on failure, so read-only connections remain usable; alternatively
ensure only the writable initialization path (e.g., the method that prepares a
writable DB) performs the WAL PRAGMA while _connect always returns a usable
connection for reads.
In `@tools/lg_tools.py`:
- Around line 312-329: Clamp tool inputs before calling the store: cap the
incoming k (and any limit arg in the other memory tools on lines ~332-348) to a
small positive max and enforce a minimum of 1, then pass the clamped value to
knowledge_store.search instead of the raw argument. Concretely, add a MAX_K
constant (e.g. 20) and replace uses of k with something like clamped_k = max(1,
min(k, MAX_K)) (and do the same for any limit parameter in the other memory_*
function(s)), then call knowledge_store.search(query, k=clamped_k) so the tool
cannot request arbitrarily large slices of the KB; apply the same clamping logic
to all memory_* entry points in this file.
---
Outside diff comments:
In `@docs/reference/starter-tools.md`:
- Around line 183-188: The snippet replaces the existing conditional registry
and drops KnowledgeStore-backed tools; update get_all_tools to obtain the
existing tools list (keeping bundled tools like current_time, calculator,
web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store
is provided) and then append my_tool instead of returning a fresh list.
Concretely, inside get_all_tools(knowledge_store=None) call or build the
existing tools list used elsewhere (respecting the conditional registry when
knowledge_store is set), then tools.append(my_tool) and return tools; reference
get_all_tools, my_tool, and the knowledge_store parameter when making the
change.
In `@tests/test_config_io.py`:
- Around line 321-331: The test test_list_available_tools_returns_starter_set
currently asserts presence of "current_time", "calculator", and "fetch_url" but
misses "web_search"; update the test in tests/test_config_io.py to also assert
"web_search" is in the names returned by list_available_tools (function
list_available_tools in graph.config_io), keeping the existing type check (all
elements are str) so the starter-tool contract is fully covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e658a3ad-718e-464a-a26c-f5dc1e6ba884
⛔ Files ignored due to path filters (2)
README.mdis excluded by!*.mdTEMPLATE.mdis excluded by!*.md
📒 Files selected for processing (27)
config/langgraph-config.yamldocs/guides/customize-and-deploy.mddocs/guides/evals.mddocs/guides/fork-the-template.mddocs/guides/index.mddocs/guides/subagents.mddocs/reference/configuration.mddocs/reference/starter-tools.mddocs/tutorials/first-agent.mddocs/tutorials/first-tool.mdevals/README.mdevals/__init__.pyevals/client.pyevals/results/.gitkeepevals/runner.pyevals/tasks.jsonevals/verify.pygraph/config.pygraph/subagents/config.pyknowledge/__init__.pyknowledge/store.pyserver.pytests/test_config_io.pytests/test_skill_curator.pytests/test_skill_emission.pytests/test_starter_tools.pytools/lg_tools.py
💤 Files with no reviewable changes (1)
- tests/test_starter_tools.py
Real bugs: - evals/runner.py: teardown now runs in a finally block so seeded KB rows get cleaned up even when the verifier or client.ask() raises. expected_tools=[] now means "assert no tools fired" (was conflated with "no key" via the `or []` short-circuit, making the abstention case a no-op). - evals/runner.py + tasks.json: added a `stream` runner kind so AgentClient.stream() is reachable from tasks.json — new streaming_status_updates case asserts the SSE event sequence. - knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort (read-only DBs no longer break _connect). FTS5 rebuild after schema install so an existing chunks table populated before FTS was added gets indexed. find_chunk_containing/delete_by_content reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards from matching every row. Hardening: - tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and memory_list(limit) to [1, 200] so the agent can't request arbitrarily large slices of the KB. Doc cleanup: - docs/guides/subagents.md: LangGraphConfig snippet had a stale "echo" reference; replaced with the new memory-tool list. - docs/tutorials/first-tool.md: WORKER_CONFIG example now appends git_sha alongside the bundled defaults instead of replacing them and dropping the memory tools. - docs/reference/starter-tools.md: "adding your own" snippet now preserves the conditional _build_memory_tools(knowledge_store) extension. - tests/test_config_io.py: starter-tool contract assertion now also covers web_search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round 1 of CodeRabbit feedback addressed in 4d0d428. Real bugs first, hardening after, docs last. Real bugs
Hardening
Docs
Tests in scope (203 across the affected files): same 1 pre-existing baseline failure ( 🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 243-244: The nested condition checking setup_applied before
testing for a teardown key is redundant; update the block that currently uses
`if setup_applied or "teardown" in case:` followed by `if "teardown" in case:`
to a single guard that directly checks `if "teardown" in case:` (remove the
`setup_applied` portion and the inner duplicate check) so the teardown branch
runs only when the `"teardown"` key is present; adjust any dependent logic that
relied on `setup_applied` being true inside that branch if necessary (look for
references to `setup_applied` and `case` in the surrounding code).
In `@knowledge/store.py`:
- Around line 91-92: Update the _now_iso function to use the Python 3.11+
datetime.UTC alias instead of timezone.utc; specifically modify the _now_iso()
implementation so it calls datetime.now(datetime.UTC).isoformat(); keep the
function name _now_iso and its return type, and only apply this change if the
project targets Python 3.11+ (otherwise leave as timezone.utc).
In `@tools/lg_tools.py`:
- Around line 331-334: Replace the explicit loop that builds lines with a list
comprehension for clarity and conciseness: instead of initializing lines and
looping over results, create lines with a comprehension like the one that
produces f"[{r.get('domain', '?')}] {r['preview']}" for each r in results, then
return "\n".join(lines); update the block containing variables results, lines,
and r accordingly to remove the loop and keep the same behavior.
- Around line 284-285: Update the _build_memory_tools function signature to
include a return type annotation (e.g. -> list[Tool] or -> list[Any]) to reflect
that it returns a list of tool objects; import or reference the appropriate type
(Tool or Any) from typing or the module that defines the tool class and adjust
the signature from def _build_memory_tools(knowledge_store): to include the
chosen return annotation so IDEs and type checkers pick up the correct return
type.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c6e241b4-d596-4c12-9a86-9665470c476a
📒 Files selected for processing (8)
docs/guides/subagents.mddocs/reference/starter-tools.mddocs/tutorials/first-tool.mdevals/runner.pyevals/tasks.jsonknowledge/store.pytests/test_config_io.pytools/lg_tools.py
- evals/runner.py: collapse redundant nested teardown guard into a single `if "teardown" in case:` (SIM102); remove now-unused `setup_applied` flag - knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017) - tools/lg_tools.py: add `-> list` return annotation to `_build_memory_tools` (ANN202); replace explicit loop with list comprehension in `memory_recall` (PERF401) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ
Round 2 review — addressed in commit cab3bd8CodeRabbit's second review (submitted 2026-04-27T21:51Z, after the round-1 push at 21:47Z) posted 4 new inline comments, all labelled Trivial / Nitpick. All four have been applied:
All three touched files were spot-checked against the pre-existing test run: same 11 failures (asyncio/import issues) with and without these changes — confirmed pre-existing baseline. Generated by Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (1)
evals/runner.py (1)
150-157:⚠️ Potential issue | 🟠 MajorSetup failures can still leak seeded KB rows.
evals/verify.py::apply_setup()applies steps incrementally and returns on the first failure. Because this return happens before Line 162 enters thetry, a partial setup failure skips thefinallyon Lines 237-242 and any configured teardown never runs, so later cases can inherit leftover fixture data. Move setup inside thetryso cleanup also happens on setup errors.🧹 Suggested fix
- # Pre-seed state via direct DB writes (model never sees this). - if "setup" in case: - err = verify.apply_setup(case["setup"]) - if err: - return CaseResult( - case["id"], case["category"], case["name"], False, - f"setup failed: {err}", - ) - events: list[dict] = [] result: TaskResult | None = None try: + # Pre-seed state via direct DB writes (model never sees this). + if "setup" in case: + err = verify.apply_setup(case["setup"]) + if err: + return CaseResult( + case["id"], case["category"], case["name"], False, + f"setup failed: {err}", + ) since = verify.audit_now()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 150 - 157, The setup path currently calls verify.apply_setup(...) before entering the surrounding try/finally, so a partial failure can return early and skip the cleanup; move the call to verify.apply_setup(case["setup"]) inside the same try block that contains the finally cleanup, and on error still produce the same CaseResult (using CaseResult(case["id"], case["category"], case["name"], False, f"setup failed: {err}")) but only after the finally has run; reference verify.apply_setup and the existing try/finally/CaseResult flow to locate where to relocate the setup call and ensure teardown always executes even when setup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 97-103: The current code builds a bearer-only headers dict
(variables expected_status, bad, headers), which prevents tests from exercising
X-API-Key scenarios; modify the request header construction to accept headers
from the test case (e.g., case.get("headers")) and merge them with sensible
defaults like "Content-Type": "application/json" and a default Authorization
using case.get("bad_token",""), so tasks can specify missing/invalid X-API-Key
or other auth combos; ensure case-supplied headers override defaults when
present so auth scenarios in tasks.json are honored.
- Around line 192-199: The fixed 0.3s sleep is brittle; replace it with a short
polling loop that repeatedly calls verify.audit_entries_since(since) and
verify.assert_tools_fired(entries, expected_tools, require_success=...) until
either the assertion passes or a deadline (e.g. 1–3 seconds) is reached; on
timeout fail the test with the last detail. Locate the block that uses
expected_tools, since, verify.audit_entries_since and verify.assert_tools_fired
in runner.py and swap the single asyncio.sleep for an async loop that awaits
small intervals (e.g. 50–200ms) between polls, updates entries each iteration,
and breaks early when passed is True. Ensure the final behavior preserves
require_success computed from case.get("tool_outcome", "success") and returns or
raises the same failure path when timed out.
In `@knowledge/store.py`:
- Around line 311-316: The FTS5 query builds MATCH from raw tokens in
_search_fts (tokens = [...] and match = " OR ".join(tokens)), which lets
apostrophes and reserved words break the SQL; instead quote/escape each token
with SQLite's FTS5 quoting helper (use fts5_quote() or an equivalent escape
function) before joining so the MATCH expression is safe (e.g., mapped_token =
fts5_quote(t) for t in tokens), then join mapped tokens with " OR " and use that
for the MATCH; keep the early empty-token check and existing fallback-to-LIKE
behavior intact.
- Around line 176-199: The exception handlers in the store code only catch
sqlite3.OperationalError but must catch sqlite3.DatabaseError to honor the
corrupt-database/read-only contract; update every except block that currently
uses "except sqlite3.OperationalError" to "except sqlite3.DatabaseError" (or the
broader sqlite3.Error/DatabaseError as appropriate) so operations like the
schema init (the try around self._connect(), db.executescript(_SCHEMA), the FTS
rebuild block that calls db.execute(...), and the outer commit/close) and any
other error-handling paths that reference sqlite3.OperationalError will
correctly handle corrupt or read-only DBs (search for the other handlers in this
module and replace them similarly).
- Around line 429-458: The LIKE patterns in find_chunk_containing (the SQL using
"WHERE (content LIKE ? OR heading LIKE ?)") and delete_by_content (the "DELETE
FROM chunks WHERE content LIKE ?") must treat input literally: escape any '%'
and '_' in the input by replacing '%' -> '\%' and '_' -> '\_' (taking care to
escape the backslash in Python strings), build the parameter as f"%{escaped}%",
and add " ESCAPE '\'" to the SQL statements so SQLite treats '\' as the escape
character; update the params passed to db.execute accordingly and keep the
domain handling as-is.
---
Duplicate comments:
In `@evals/runner.py`:
- Around line 150-157: The setup path currently calls verify.apply_setup(...)
before entering the surrounding try/finally, so a partial failure can return
early and skip the cleanup; move the call to verify.apply_setup(case["setup"])
inside the same try block that contains the finally cleanup, and on error still
produce the same CaseResult (using CaseResult(case["id"], case["category"],
case["name"], False, f"setup failed: {err}")) but only after the finally has
run; reference verify.apply_setup and the existing try/finally/CaseResult flow
to locate where to relocate the setup call and ensure teardown always executes
even when setup fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: de696887-9338-4c91-8268-f0dc82e83016
📒 Files selected for processing (3)
evals/runner.pyknowledge/store.pytools/lg_tools.py
Real bugs: - evals/runner.py: setup is now inside the try block so a partial setup failure (e.g. step 2 of 3 errors) still triggers the finally teardown — rows from steps that did succeed no longer leak into the next case. Was flagged as a duplicate from round 2. - knowledge/store.py: LIKE patterns now escape % and _ via ESCAPE '\' on every clause that takes user input (find_chunk_containing, delete_by_content, _search_like). A query for "100%" or "hello_world" no longer silently matches every row containing "100" or any single character between "hello" and "world". - knowledge/store.py: FTS5 MATCH tokens are now double-quoted via _fts_quote() so user-supplied query terms can't smuggle FTS5 operators (column filters, prefix wildcards, NEAR, AND/OR/NOT). Defence in depth — the [\w']+ tokenizer already filters most special chars. Hardening: - evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the audit log to flush is gone. _await_audit_assertion now polls every 50ms up to a 2s deadline and returns as soon as the assertion passes — exits early on success, only burns the full deadline when the tool genuinely never fired. - evals/runner.py: _run_auth_check accepts case["headers"] so cases can override the default bearer-only header set and exercise X-API-Key auth scenarios (or both auths together). - knowledge/store.py: per-method exception handlers broadened from sqlite3.OperationalError to sqlite3.DatabaseError. Catches IntegrityError, ProgrammingError, and corruption variants too without crashing the agent loop. _has_fts5 (probe) and _connect (connection-time errors only) keep the narrower OperationalError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round 3 of CodeRabbit feedback addressed in b3a9f1d. All 5 actionable comments + the duplicate Major from round 2 are applied. Real bugs
Hardening
Smoke tests pass:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
knowledge/store.py (1)
234-239:⚠️ Potential issue | 🟠 MajorWiden exception scope in
_get_dbto honor the no-crash contract.The module docstring (lines 20–22) states that the store "swallows
sqlite3.DatabaseError" so it "never crashes the agent loop on a corrupt or read-only DB." However,_get_dbonly catchesOperationalError. Corrupt databases raisesqlite3.DatabaseError(the parent class), which would propagate and crash the caller.🔧 Suggested fix
def _get_db(self) -> sqlite3.Connection | None: try: return self._connect() - except sqlite3.OperationalError as exc: + except sqlite3.DatabaseError as exc: log.error("[knowledge] connect failed: %s", exc) return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@knowledge/store.py` around lines 234 - 239, In _get_db, broaden the exception catch to sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only DB errors are swallowed as the module docstring requires; change the except clause in the _get_db method (currently "except sqlite3.OperationalError as exc") to catch sqlite3.DatabaseError instead and keep the existing log.error call and return None behavior.evals/runner.py (1)
237-239: 🧹 Nitpick | 🔵 TrivialPrefix unused
entriesvariable with underscore.The
entriesvalue from_await_audit_assertionis unpacked but never used within_run_prompt_case. Prefix it with_to indicate it's intentionally ignored and silence the static analysis warning.♻️ Suggested fix
- entries, passed, detail = await _await_audit_assertion( + _entries, passed, detail = await _await_audit_assertion( since, expected_tools, require_success=require_success, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 237 - 239, In _run_prompt_case, the first value returned by _await_audit_assertion is unpacked to entries but never used; change the unpacking to prefix that variable with an underscore (e.g., _entries) so it becomes _entries, passed, detail = await _await_audit_assertion(...) to signal intentional unused value and silence static-analysis warnings referencing _await_audit_assertion and _run_prompt_case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 170-179: Replace uses of asyncio.get_event_loop().time() with
asyncio.get_running_loop().time() when computing deadline and checking time in
the loop: update the two call sites around the deadline assignment and the loop
condition where deadline is compared, keeping the same logic with variables
'deadline', 'since', and the verify.audit_entries_since /
verify.assert_tools_fired calls unchanged so the behavior and comparisons remain
identical.
- Around line 339-340: The timestamp in the payload uses
datetime.now(timezone.utc).isoformat() which is inconsistent with the codebase
alias; update the payload construction to use the datetime.UTC alias (i.e.,
datetime.now(datetime.UTC).isoformat()) so it matches other files like
knowledge/store.py and keeps style consistent; locate the payload assignment in
evals/runner.py (the payload variable and the datetime.now(...) call) and
replace timezone.utc with datetime.UTC.
In `@knowledge/store.py`:
- Around line 229-230: Replace the log.error(...) calls inside the exception
handlers that catch sqlite3.DatabaseError (the blocks that currently read
"except sqlite3.DatabaseError as exc: log.error(..., exc)") with
log.exception(...) so the error message for schema/init/database failures
includes the full traceback; keep the same descriptive message (including
self.path or other context) when calling log.exception so callers see both the
message and the exception stack. Ensure you update every occurrence of that
pattern (the sqlite3.DatabaseError except blocks where log.error is used) rather
than adding new try/except logic.
---
Duplicate comments:
In `@evals/runner.py`:
- Around line 237-239: In _run_prompt_case, the first value returned by
_await_audit_assertion is unpacked to entries but never used; change the
unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.
In `@knowledge/store.py`:
- Around line 234-239: In _get_db, broaden the exception catch to
sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only
DB errors are swallowed as the module docstring requires; change the except
clause in the _get_db method (currently "except sqlite3.OperationalError as
exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 547c1280-742e-4685-bc51-8c02425b92dd
📒 Files selected for processing (2)
evals/runner.pyknowledge/store.py
- evals/runner.py: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside the _await_audit_assertion coroutine - evals/runner.py: prefix unused _entries return value with underscore - evals/runner.py: use datetime.UTC alias (consistent with store.py), drop now-unused timezone import - knowledge/store.py: broaden _get_db exception catch from OperationalError to DatabaseError so corrupt-DB errors are swallowed per the module's no-crash contract - knowledge/store.py: replace log.error with log.exception in all three DatabaseError handlers (schema init, _get_db, add_chunk) so tracebacks appear in error logs Co-Authored-By: Claude <claude@anthropic.com> https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH
Round 4 —
|
| # | File | Finding | Action |
|---|---|---|---|
| 1 | evals/runner.py L170, L179 |
asyncio.get_event_loop().time() used inside a coroutine — deprecated since Python 3.10, raises DeprecationWarning |
Changed to asyncio.get_running_loop().time() in both call sites inside _await_audit_assertion |
| 2 | knowledge/store.py _get_db |
Catches sqlite3.OperationalError instead of the broader sqlite3.DatabaseError, violating the module's stated no-crash contract (corrupt-DB errors would propagate) |
Changed except sqlite3.OperationalError → except sqlite3.DatabaseError |
| 3 | knowledge/store.py L230, L238, L270 |
log.error(...) in three DatabaseError handlers omits tracebacks |
Replaced with log.exception(...) (no exc arg needed — captured automatically) across schema-init, _get_db, and add_chunk |
| 4 | evals/runner.py L237 |
entries unpacked from _await_audit_assertion but never used (duplicate carryover) |
Renamed to _entries |
| 5 | evals/runner.py L340 |
datetime.now(timezone.utc) inconsistent with knowledge/store.py style (duplicate carryover) |
Changed to datetime.now(UTC); removed now-unused timezone import |
ℹ️ Not addressed (out of scope / already correct)
knowledge/store.pyL133 (_has_fts5) and L202 (WAL PRAGMA) — both still catchsqlite3.OperationalError. These are intentional:_has_fts5is probing for FTS5 support, and the WAL PRAGMA is explicitly best-effort for read-only mounts. Neither is a contract-violation.knowledge/store.pyL319, L422, L439, L479, L504, L522 — theseDatabaseErrorhandlers uselog.warning, notlog.error. CodeRabbit's finding targeted thelog.errorpattern specifically; warning-level handlers are deliberately lower-severity for expected operational failures (search miss, stats failure, etc.).
Round 4 has no new blockers. Ready to merge when the team is.
Generated by Claude Code
There was a problem hiding this comment.
♻️ Duplicate comments (1)
evals/runner.py (1)
231-241:⚠️ Potential issue | 🟠 MajorRegression:
expected_tools: []no longer asserts abstention.The logic at Line 235 still treats empty lists like normal expected-tool checks, and
verify.assert_tools_fired(..., expected=[])passes even when tools fired. This contradicts the Line 231-233 contract.💡 Suggested fix
expected_tools = case.get("expected_tools") if expected_tools is not None: require_success = case.get("tool_outcome", "success") == "success" - _entries, passed, detail = await _await_audit_assertion( - since, expected_tools, require_success=require_success, - ) - if not passed: - problems.append(detail) + if not expected_tools: + loop = asyncio.get_running_loop() + deadline = loop.time() + _AUDIT_POLL_DEADLINE_S + while True: + entries = verify.audit_entries_since(since) + fired = [e.get("tool", "?") for e in entries if e.get("tool")] + if fired: + problems.append(f"unexpected tools fired: {fired}") + break + if loop.time() >= deadline: + break + await asyncio.sleep(_AUDIT_POLL_INTERVAL_S) + else: + _entries, passed, detail = await _await_audit_assertion( + since, expected_tools, require_success=require_success, + ) + if not passed: + problems.append(detail)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 231 - 241, The test treats an explicit empty list for expected_tools as an abstention assertion but the current branch treats [] like a normal expectation; add an explicit branch for expected_tools == [] that enforces “no tools fired”: detect if expected_tools is a list and len(expected_tools) == 0, then call the audit check that verifies zero tool entries (either by adding an assert_none/expect_empty flag to _await_audit_assertion or by calling a new helper like _await_no_tools_assertion(since)) and append detail to problems if that check fails; keep the existing path for non-empty expected_tools and continue to use require_success derived from case.get("tool_outcome").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@evals/runner.py`:
- Around line 231-241: The test treats an explicit empty list for expected_tools
as an abstention assertion but the current branch treats [] like a normal
expectation; add an explicit branch for expected_tools == [] that enforces “no
tools fired”: detect if expected_tools is a list and len(expected_tools) == 0,
then call the audit check that verifies zero tool entries (either by adding an
assert_none/expect_empty flag to _await_audit_assertion or by calling a new
helper like _await_no_tools_assertion(since)) and append detail to problems if
that check fails; keep the existing path for non-empty expected_tools and
continue to use require_success derived from case.get("tool_outcome").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1d06a7a4-c90c-4424-9401-2d0aeb53f4d3
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
evals/runner.pyknowledge/store.py
Closes #154.
Summary
knowledge/store.py(sqlite + FTS5, LIKE fallback). Onechunkstable backs operator notes, daily-log entries, and conversation findings extracted byMemoryMiddleware. Default-on;middleware.knowledge: falseto opt out.tools/lg_tools.py—memory_ingest,memory_recall,memory_list,memory_stats,daily_log— bound to the bundled store.echoremoved.evals/— A2A client + runner + side-effect-verifiedtasks.json(15 starter cases: agent card, auth, abstention, every shipped tool, KB recall, chained, KnowledgeMiddleware). Audit log + reply text + KB state all checked independently — hallucinated tool results get caught.docs/guides/evals.mdhow-to + every reference page updated to match the new defaults / nine-tool starter set.Why side-effect verification
When the model produces "Logged: ..." without actually calling
daily_log, text-only checks pass while the DB stays empty. The harness readsaudit.jsonland thechunkstable after every turn so that bug doesn't ship.Default flip — what existing forks will notice
knowledge_middlewareandmemory_middlewarenow default totrueinLangGraphConfig. On first run:/sandbox/knowledge/agent.db(Docker default).~/.protoagent/knowledge/agent.dbautomatically — no permission errors.middleware.knowledge: falseinconfig/langgraph-config.yaml.Test plan
python -m evals.runneragainst a wizard-set-up local agent — expect ≥13/15 cases passing on a small/fast model (text-pattern noise on a couple).python -m evals.runner --category a2a-protocol— agent card + auth gating gate at 100% (deterministic).pytest tests/test_starter_tools.py tests/test_config_io.py tests/test_skill_emission.py tests/test_a2a_handler.py— should be all green (was 202/203 with one unrelated origin test failing onmain).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Configuration
Documentation
Tests