Token efficiency Phase 2: dynamic tool selection#18
Conversation
Instead of sending all 82 tools on every LLM call (~14K tokens overhead), the ToolSelector now loads a core set (~18 tools) and activates contextual tool groups when keywords appear in the conversation. Architecture: - ToolGroup enum: 17 contextual groups (git, docker, json, browser, etc.) each with keyword triggers and capability gates - ToolSelector: stateful per-bout selector with monotonic growth — once a group activates it stays active to maximize prompt cache hits - Core tools always loaded: file_read, file_write, shell_exec, web_fetch, web_search, memory, screenshot, channel_notify, self-config - shell_exec is always in core as the universal fallback — the model never loses capability, it just gets a leaner tool menu Token savings: - Core only (typical): ~18 tools, ~3K tokens (was 82 tools, ~14K tokens) - Core + 1 group: ~23 tools, ~3.8K tokens - Core + 2 groups: ~28 tools, ~4.7K tokens - 66-80% reduction in tool token overhead per request The tool executor still runs ANY tool regardless of what was sent to the LLM — capabilities are checked at execution time, not selection time. Includes 10 unit tests covering: core presence, keyword activation, capability gating, monotonic growth, change detection, case insensitivity, scan window limits, no duplicates, and count bounds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 156aef31ef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| memory: Arc::clone(&self.memory), | ||
| driver: Arc::clone(&self.driver), | ||
| available_tools, | ||
| available_tools: Vec::new(), |
There was a problem hiding this comment.
Keep capability tools when enabling dynamic selection
Ring::send_message now always passes available_tools: Vec::new(), which forces run_fighter_loop into keyword-triggered dynamic selection for every fighter. Because ToolSelector::core_tools only includes a subset of capabilities, restricted fighters without ShellExec can start a bout without their granted tools unless the user wording matches a hard-coded trigger, so requests that previously worked can fail on the first turn.
Useful? React with 👍 / 👎.
| ToolGroup::AgentCoordination => { | ||
| matches!(c, Capability::AgentSpawn | Capability::AgentMessage) | ||
| } |
There was a problem hiding this comment.
Gate agent coordination tools per individual capability
ToolGroup::AgentCoordination is considered permitted when either AgentSpawn or AgentMessage is present, but the group later contributes all three tools (agent_spawn, agent_message, agent_list). This exposes tools a fighter is not authorized to run (for example, AgentSpawn-only fighters seeing message/list tools), which leads to avoidable permission-denied tool calls and wasted loop iterations.
Useful? React with 👍 / 👎.
humancto
left a comment
There was a problem hiding this comment.
Review: Token efficiency Phase 2 — dynamic tool selection
Good concept — reducing 82 tools / ~14K tokens down to ~18 core tools per turn is a meaningful optimization. The monotonic growth design for prompt cache efficiency is smart, and the test coverage is solid. A few issues need addressing before merge.
Issues
1. Overly broad keywords cause false positives (High)
Several keywords are too common and will trigger groups unnecessarily, undermining the token savings:
"image"→ triggers Container group, but "image" is extremely common in non-Docker contexts (screenshots, file uploads, profile pictures)"push"/"pull"→ triggers SourceControl, but common in general conversation ("push a notification", "pull data from API")"click"→ triggers BrowserControl, but used in any UI discussion"process"→ triggers ProcessManagement, one of the most common words in programming"agent"/"fighter"/"spawn"→ triggers AgentCoordination — these are core Punch terminology that appears in nearly every system prompt and conversation"hash"→ triggers Crypto, but common in data structures context"render"→ triggers Template, but common in UI/frontend discussions"graph"→ triggers KnowledgeGraph, but common in general programming"definition"/"references"→ triggers CodeAnalysis, extremely common words"extract"→ triggers Archive, but common in data/text discussions
With monotonic growth, each false positive permanently inflates the tool list for the entire bout. Consider using multi-word phrases, requiring 2+ keyword matches per group, or using word-boundary matching instead of substring contains().
2. Keyword scan only reads msg.content, misses tool context (Medium)
build_scan_text only reads msg.content, but the Message struct stores tool calls in msg.tool_calls and tool results in msg.tool_results — both have empty content fields. If a user says "run this" and the tool result contains Docker output, the Container group won't activate on the next turn. This means the keyword scanner is blind to the tool-use context that's most informative about what tools are actually needed.
Consider also scanning tool_calls[].name and tool_results[].content (with a size cap to avoid scanning huge outputs).
3. available_tools.clone() every loop iteration in static path (Medium)
When use_dynamic_tools is false (workflows, gorillas), the code runs params.available_tools.clone() on every iteration of the main loop. This allocates a new Vec of potentially 80+ ToolDefinition structs each turn — a regression from the prior code which passed the vec directly. For gorillas running 10+ iterations this adds up. Consider using Cow or taking the vec by reference in the CompletionRequest.
4. The Codex P1 about restricted fighters is valid (Medium)
The existing tools_for_capabilities returned all granted tools immediately. The new ToolSelector::core_tools only returns a subset — tools like git_*, docker_*, json_*, code_*, process_*, http_* are gated behind keywords even when the fighter has the corresponding capability. A fighter with SourceControl but not ShellExec (no fallback) will have no way to interact with git on the first turn unless the user happens to say "git" or "commit". This is a capability regression for specialized fighters.
Consider pre-activating groups whose capability is granted, or making core_tools include all capability-matched tools (which defeats the purpose). Alternatively, scan the first user message more aggressively or use a larger scan window for turn 1.
5. DefaultHasher comment says "stable" but it isn't (Nit)
The doc comment on compute_hash says "Compute a stable hash" but DefaultHasher is explicitly not guaranteed stable across Rust versions. It's fine for same-process change detection here, but the comment is misleading. Either drop "stable" from the doc or use a fixed hasher.
What looks good
- Monotonic growth + BTreeSet ordering for deterministic tool lists is well-thought-out
- Separation of MCP tools from built-in tools is clean
- The
_changedreturn fromselect_toolsis a nice hook for future cache invalidation - 12 new tests with good coverage of edge cases (capability gating, window limits, monotonicity, dedup)
- Clean integration: workflows/gorillas/tests bypass dynamic selection naturally via
available_toolspre-population - CI is all green
Generated by Claude Code
…auto-activate Address all 5 review issues on the dynamic tool selection PR: 1. HIGH - Tighten overly broad keywords: Replace generic words like push, image, process, agent, hash, render, graph, definition, extract with specific multi-word phrases (e.g. git push, docker image, kill process, spawn fighter, sha256, render template, knowledge graph, find definition). 2. MEDIUM - Scan tool context: build_scan_text now includes tool call names, short argument values, and tool result snippets (first 200 chars). 3. MEDIUM - Avoid static path clone overhead: Move params.available_tools into static_tools Option before the loop. 4. MEDIUM - Auto-activate capability groups: Groups with no core-tool overlap auto-activate on construction when the fighter has the required capability. 5. NIT - Stable hash: Replace DefaultHasher with FNV-1a-inspired hash that produces identical results across Rust versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
humancto
left a comment
There was a problem hiding this comment.
Re-review: All 5 issues addressed
Nice iteration. Going through each original issue:
-
Overly broad keywords - Fixed. Keywords are now multi-word phrases:
"git "(with trailing space),"git commit","docker image"instead of"image","click button"/"click element"instead of"click","spawn fighter"/"spawn agent"instead of"agent","sha256"instead of"hash","knowledge graph"instead of"graph","find definition"instead of"definition", etc. Much better false-positive resistance. -
Keyword scan misses tool context - Fixed.
build_scan_textnow scanstool_calls[].name+ short string args, and first 200 chars oftool_results[].content. -
available_tools.clone()per iteration - Partially fixed. The code movedparams.available_toolsintostatic_toolsto avoid holding the full params, but it still calls.clone()every iteration sinceCompletionRequesttakes ownership. Acceptable — would require a larger API change to fully eliminate. -
Capability regression for restricted fighters - Fixed.
ToolSelector::new()auto-activates groups (SourceControl, Container, DataManipulation, Schedule, BrowserControl, AgentCoordination, ProcessManagement) when the fighter has the corresponding capability. Clean approach withgroup_needs_auto_activate(). -
DefaultHasher stability - Fixed. Replaced with a proper FNV-1a implementation with explicit offset basis and prime. Truly stable across Rust versions.
One new issue (Low)
build_scan_text can panic on multi-byte UTF-8 boundaries:
let snippet = if tr.content.len() > 200 {
&tr.content[..200] // panics if byte 200 is mid-character
} else {
&tr.content
};tr.content[..200] slices at byte offset 200. If a multi-byte UTF-8 character spans that boundary, this panics at runtime. Tool results containing non-ASCII text (emoji, CJK, etc.) could trigger this. Fix:
let end = tr.content.get(..200).map_or(tr.content.len(), |_| 200);
// or more robustly:
let end = (200..=tr.content.len().min(203)).find(|&i| tr.content.is_char_boundary(i)).unwrap_or(tr.content.len());
let snippet = &tr.content[..end];Same applies to the tool call arg truncation at s.len() <= 200 — though that one uses <= comparison rather than slicing, so it's safe.
Summary
CI is green. Tests updated and expanded (new test_selector_no_auto_activate_without_capability, test_selector_full_caps_with_auto_activation). The UTF-8 slicing issue is low-severity (tool results are usually ASCII/JSON) but trivial to fix. LGTM after that.
Generated by Claude Code
Use floor_char_boundary(200) instead of byte slice [..200] to avoid panicking when byte 200 lands mid-character in multi-byte UTF-8 text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
humancto
left a comment
There was a problem hiding this comment.
Re-review (round 3)
The UTF-8 slicing fix is correct — tr.content.floor_char_boundary(200) is the right approach and is stable since Rust 1.82.
However, the Format CI check is failing. The diff shows some reformatted assertions in the browser tool tests (lines around browser_navigate and browser_click) — looks like cargo fmt changed the formatting of those multi-line assert! calls. Please run cargo fmt --all and push.
Test check is still in progress at time of review. Everything else (Check, Clippy, GitGuardian) is green.
All original review items remain resolved. LGTM once the formatting CI is fixed.
Generated by Claude Code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
ToolSelector— instead of sending all 82 tools on every LLM call (~14K tokens), loads a core set (~18 tools) and activates contextual tool groups when keywords appear in conversationshell_execis always in core as the universal fallback; tool executor still runs ANY tool regardless of what was sent to the LLMToken savings
For a 10-turn conversation, this saves approximately 90K-110K input tokens.
Files changed
crates/punch-runtime/src/tools.rs—ToolGroupenum,ToolSelectorstruct, keyword matching, 10 unit testscrates/punch-runtime/src/fighter_loop.rs— per-turn dynamic tool selection whenavailable_toolsis emptycrates/punch-kernel/src/ring.rs— pass emptyavailable_tools+ separatemcp_toolsto enable dynamic selectioncrates/punch-runtime/src/lib.rs— export new typescrates/punch-kernel/src/{workflow,background,heartbeat_scheduler}.rs— addmcp_toolsfield (keep static tools for workflows/gorillas)crates/punch-runtime/tests/fighter_loop_tests.rs— addmcp_toolsfield to test paramsTest plan
cargo clippy --workspace -- -D warningscleancargo fmt --all -- --checkclean🤖 Generated with Claude Code