Trogers/sre hard assertion improvements #85

Open
teriyakichild wants to merge 2 commits into worktree-sre-hard-e2e from trogers/sre-hard-assertion-improvements

@teriyakichild
Contributor

No description provided.

Add deterministic tool-trace assertions grounded in the mock MCP's
canned responses instead of relying solely on answer-text substring
matching.

New assertion types:
- tool_args_contains: verify tool arguments (exact or substring)
- tools_not_called: negative tool assertions (routing discipline)
- answer_contains_any: alternative acceptable strings
- answer_not_contains: hallucination guards
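
The four assertion types above could be sketched roughly as follows. This is a minimal Python illustration, not the runner's actual code: the `ToolCall` trace shape, the function signatures, the case-insensitive answer matching, and the tool names `get_pod_spec` and `delete_pod` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical trace shape)."""
    name: str
    args: dict = field(default_factory=dict)


def tool_args_contains(trace, tool, key, expected, exact=False):
    """True if some call to `tool` has args[key] equal to (exact) or containing `expected`."""
    for call in trace:
        if call.name != tool or key not in call.args:
            continue
        value = str(call.args[key])
        if (value == expected) if exact else (expected in value):
            return True
    return False


def tools_not_called(trace, forbidden):
    """Negative assertion: none of the forbidden tools appear in the trace."""
    return {call.name for call in trace}.isdisjoint(forbidden)


def answer_contains_any(answer, options):
    """True if at least one acceptable string appears in the answer (case-insensitive, an assumption)."""
    text = answer.lower()
    return any(opt.lower() in text for opt in options)


def answer_not_contains(answer, banned):
    """Hallucination guard: none of the banned strings appear in the answer."""
    text = answer.lower()
    return not any(b.lower() in text for b in banned)


# Hypothetical trace and answer, for illustration only.
trace = [ToolCall("get_pod_spec", {"namespace": "payments", "pod": "api-7f9c"})]
answer = "The readiness probe hits /healthz/ready."
```

Because the checks run against the recorded trace rather than the answer text, they stay deterministic as long as the mock returns the same canned responses.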

Updated all 5 SRE-hard prompt specs with worker-dispatch, tool-name,
tool-argument, and negative assertions. Fixed answer_contains values
to match the actual probe paths (/healthz/ready, /actuator/health/ready).
Raised category_min from 6 to 7 (all data is in one API call).

Bug fixes:
- Readiness loops in runner now fail on timeout instead of silently
  falling through
- model_name_from_config now strips the sre-hard-e2e- prefix
- Fixed path iteration in extract-continuation-frames.sh
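
The readiness-loop fix can be pictured with a sketch like the one below. The runner's real implementation is not shown in this PR text, so the function name `wait_until_ready`, its parameters, and the use of `TimeoutError` are assumptions; the point is only that an exhausted loop raises instead of returning as if the service were ready.

```python
import time


def wait_until_ready(check, timeout_s=30.0, interval_s=1.0, what="service"):
    """Poll `check()` until it returns True; raise TimeoutError if the deadline passes.

    Hypothetical helper: the buggy pattern would simply fall out of the loop
    and let the caller continue against a service that never came up.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval_s)
    raise TimeoutError(f"{what} not ready after {timeout_s}s")
```

Raising makes the failure surface at the readiness step rather than later as a confusing assertion failure.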

Refs: #32
…prompt

The coordinator was producing meta-summaries referencing task numbers
("see Task 8") instead of inlining concrete findings from task results.
The user never sees task results directly, so these references were
opaque.

Root cause: neither the continuation prompt nor the respond_directly
tool description told the coordinator that its response IS the final
user-facing answer. The tool description actively misdirected it toward
"general knowledge."

Changes:
- continuation_prompt.md: add synthesis rules requiring the coordinator
  to inline all concrete data points and never reference tasks by number
- respond_directly tool: update description and response field to
  clarify that the response is the only text the user sees

Before: multi-category-findings 0/9 alert names in answer
After: multi-category-findings 9/9 alert names in answer
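
A count like "9/9 alert names in answer" could be computed along these lines. This is an illustrative sketch only: the function name `alert_name_coverage`, the case-insensitive substring match, and the alert names in the example are assumptions, not the evaluation's actual code or data.

```python
def alert_name_coverage(answer, expected_alerts):
    """Return (found, total): how many expected alert names appear in the answer text."""
    text = answer.lower()
    found = [name for name in expected_alerts if name.lower() in text]
    return len(found), len(expected_alerts)


# Hypothetical alert names, for illustration only.
expected = ["KubePodCrashLooping", "HighErrorRate"]
meta_summary = "Several alerts are firing; see Task 8 for details."
inlined = "Firing alerts: KubePodCrashLooping on api-7f9c and HighErrorRate on checkout."
```

With these inputs, the meta-summary scores 0 of 2 and the inlined answer scores 2 of 2, which mirrors the before/after pattern reported above.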

Refs: #32