Token efficiency Phase 2: dynamic tool selection by humancto · Pull Request #18 · humancto/punch

humancto · 2026-03-26T07:09:40Z

Summary

Dynamic tool selection via ToolSelector — instead of sending all 82 tools on every LLM call (~14K tokens), loads a core set (~18 tools) and activates contextual tool groups when keywords appear in conversation
17 contextual tool groups (git, docker, json, browser, archive, crypto, etc.) each with keyword triggers and capability gates
Monotonic growth — once a group activates in a bout, it stays active to maximize Anthropic prompt cache hits
Zero capability loss — shell_exec is always in core as the universal fallback; tool executor still runs ANY tool regardless of what was sent to the LLM

Token savings

Scenario	Before	After	Savings
Core only (typical first turn)	82 tools, ~14K tokens	18 tools, ~3K tokens	78%
Core + 1 group (e.g. git)	82 tools, ~14K tokens	23 tools, ~3.8K tokens	73%
Core + 2 groups	82 tools, ~14K tokens	28 tools, ~4.7K tokens	66%

For a 10-turn conversation, this saves approximately 90K-110K input tokens.

Files changed

crates/punch-runtime/src/tools.rs — ToolGroup enum, ToolSelector struct, keyword matching, 10 unit tests
crates/punch-runtime/src/fighter_loop.rs — per-turn dynamic tool selection when available_tools is empty
crates/punch-kernel/src/ring.rs — pass empty available_tools + separate mcp_tools to enable dynamic selection
crates/punch-runtime/src/lib.rs — export new types
crates/punch-kernel/src/{workflow,background,heartbeat_scheduler}.rs — add mcp_tools field (keep static tools for workflows/gorillas)
crates/punch-runtime/tests/fighter_loop_tests.rs — add mcp_tools field to test params

Test plan

10 new unit tests for ToolSelector (core presence, keyword activation, capability gating, monotonic growth, change detection, case insensitive, scan window limits, no duplicates, count bounds, group coverage)
All 1500+ existing tests pass
cargo clippy --workspace -- -D warnings clean
cargo fmt --all -- --check clean
Manual test: rebuild daemon, verify Telegram bot still reads files, takes screenshots, uses tools correctly
Verify tool count in logs drops from ~82 to ~18 on first turn

🤖 Generated with Claude Code

Instead of sending all 82 tools on every LLM call (~14K tokens overhead), the ToolSelector now loads a core set (~18 tools) and activates contextual tool groups when keywords appear in the conversation. Architecture: - ToolGroup enum: 17 contextual groups (git, docker, json, browser, etc.) each with keyword triggers and capability gates - ToolSelector: stateful per-bout selector with monotonic growth — once a group activates it stays active to maximize prompt cache hits - Core tools always loaded: file_read, file_write, shell_exec, web_fetch, web_search, memory, screenshot, channel_notify, self-config - shell_exec is always in core as the universal fallback — the model never loses capability, it just gets a leaner tool menu Token savings: - Core only (typical): ~18 tools, ~3K tokens (was 82 tools, ~14K tokens) - Core + 1 group: ~23 tools, ~3.8K tokens - Core + 2 groups: ~28 tools, ~4.7K tokens - 66-80% reduction in tool token overhead per request The tool executor still runs ANY tool regardless of what was sent to the LLM — capabilities are checked at execution time, not selection time. Includes 10 unit tests covering: core presence, keyword activation, capability gating, monotonic growth, change detection, case insensitivity, scan window limits, no duplicates, and count bounds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156aef31ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-26T07:14:17Z

            memory: Arc::clone(&self.memory),
            driver: Arc::clone(&self.driver),
-            available_tools,
+            available_tools: Vec::new(),


Keep capability tools when enabling dynamic selection

Ring::send_message now always passes available_tools: Vec::new(), which forces run_fighter_loop into keyword-triggered dynamic selection for every fighter. Because ToolSelector::core_tools only includes a subset of capabilities, restricted fighters without ShellExec can start a bout without their granted tools unless the user wording matches a hard-coded trigger, so requests that previously worked can fail on the first turn.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-26T07:14:17Z

+            ToolGroup::AgentCoordination => {
+                matches!(c, Capability::AgentSpawn | Capability::AgentMessage)
+            }


Gate agent coordination tools per individual capability

ToolGroup::AgentCoordination is considered permitted when either AgentSpawn or AgentMessage is present, but the group later contributes all three tools (agent_spawn, agent_message, agent_list). This exposes tools a fighter is not authorized to run (for example, AgentSpawn-only fighters seeing message/list tools), which leads to avoidable permission-denied tool calls and wasted loop iterations.

Useful? React with 👍 / 👎.

humancto

Review: Token efficiency Phase 2 — dynamic tool selection

Good concept — reducing 82 tools / ~14K tokens down to ~18 core tools per turn is a meaningful optimization. The monotonic growth design for prompt cache efficiency is smart, and the test coverage is solid. A few issues need addressing before merge.

Issues

1. Overly broad keywords cause false positives (High)

Several keywords are too common and will trigger groups unnecessarily, undermining the token savings:

"image" → triggers Container group, but "image" is extremely common in non-Docker contexts (screenshots, file uploads, profile pictures)
"push" / "pull" → triggers SourceControl, but common in general conversation ("push a notification", "pull data from API")
"click" → triggers BrowserControl, but used in any UI discussion
"process" → triggers ProcessManagement, one of the most common words in programming
"agent" / "fighter" / "spawn" → triggers AgentCoordination — these are core Punch terminology that appears in nearly every system prompt and conversation
"hash" → triggers Crypto, but common in data structures context
"render" → triggers Template, but common in UI/frontend discussions
"graph" → triggers KnowledgeGraph, but common in general programming
"definition" / "references" → triggers CodeAnalysis, extremely common words
"extract" → triggers Archive, but common in data/text discussions

With monotonic growth, each false positive permanently inflates the tool list for the entire bout. Consider using multi-word phrases, requiring 2+ keyword matches per group, or using word-boundary matching instead of substring contains().

2. Keyword scan only reads msg.content, misses tool context (Medium)

build_scan_text only reads msg.content, but the Message struct stores tool calls in msg.tool_calls and tool results in msg.tool_results — both have empty content fields. If a user says "run this" and the tool result contains Docker output, the Container group won't activate on the next turn. This means the keyword scanner is blind to the tool-use context that's most informative about what tools are actually needed.

Consider also scanning tool_calls[].name and tool_results[].content (with a size cap to avoid scanning huge outputs).

3. available_tools.clone() every loop iteration in static path (Medium)

When use_dynamic_tools is false (workflows, gorillas), the code runs params.available_tools.clone() on every iteration of the main loop. This allocates a new Vec of potentially 80+ ToolDefinition structs each turn — a regression from the prior code which passed the vec directly. For gorillas running 10+ iterations this adds up. Consider using Cow or taking the vec by reference in the CompletionRequest.

4. The Codex P1 about restricted fighters is valid (Medium)

The existing tools_for_capabilities returned all granted tools immediately. The new ToolSelector::core_tools only returns a subset — tools like git_*, docker_*, json_*, code_*, process_*, http_* are gated behind keywords even when the fighter has the corresponding capability. A fighter with SourceControl but not ShellExec (no fallback) will have no way to interact with git on the first turn unless the user happens to say "git" or "commit". This is a capability regression for specialized fighters.

Consider pre-activating groups whose capability is granted, or making core_tools include all capability-matched tools (which defeats the purpose). Alternatively, scan the first user message more aggressively or use a larger scan window for turn 1.

5. DefaultHasher comment says "stable" but it isn't (Nit)

The doc comment on compute_hash says "Compute a stable hash" but DefaultHasher is explicitly not guaranteed stable across Rust versions. It's fine for same-process change detection here, but the comment is misleading. Either drop "stable" from the doc or use a fixed hasher.

What looks good

Monotonic growth + BTreeSet ordering for deterministic tool lists is well-thought-out
Separation of MCP tools from built-in tools is clean
The _changed return from select_tools is a nice hook for future cache invalidation
12 new tests with good coverage of edge cases (capability gating, window limits, monotonicity, dedup)
Clean integration: workflows/gorillas/tests bypass dynamic selection naturally via available_tools pre-population
CI is all green

Generated by Claude Code

…auto-activate Address all 5 review issues on the dynamic tool selection PR: 1. HIGH - Tighten overly broad keywords: Replace generic words like push, image, process, agent, hash, render, graph, definition, extract with specific multi-word phrases (e.g. git push, docker image, kill process, spawn fighter, sha256, render template, knowledge graph, find definition). 2. MEDIUM - Scan tool context: build_scan_text now includes tool call names, short argument values, and tool result snippets (first 200 chars). 3. MEDIUM - Avoid static path clone overhead: Move params.available_tools into static_tools Option before the loop. 4. MEDIUM - Auto-activate capability groups: Groups with no core-tool overlap auto-activate on construction when the fighter has the required capability. 5. NIT - Stable hash: Replace DefaultHasher with FNV-1a-inspired hash that produces identical results across Rust versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

humancto

Re-review: All 5 issues addressed

Nice iteration. Going through each original issue:

Overly broad keywords - Fixed. Keywords are now multi-word phrases: "git " (with trailing space), "git commit", "docker image" instead of "image", "click button"/"click element" instead of "click", "spawn fighter"/"spawn agent" instead of "agent", "sha256" instead of "hash", "knowledge graph" instead of "graph", "find definition" instead of "definition", etc. Much better false-positive resistance.
Keyword scan misses tool context - Fixed. build_scan_text now scans tool_calls[].name + short string args, and first 200 chars of tool_results[].content.
available_tools.clone() per iteration - Partially fixed. The code moved params.available_tools into static_tools to avoid holding the full params, but it still calls .clone() every iteration since CompletionRequest takes ownership. Acceptable — would require a larger API change to fully eliminate.
Capability regression for restricted fighters - Fixed. ToolSelector::new() auto-activates groups (SourceControl, Container, DataManipulation, Schedule, BrowserControl, AgentCoordination, ProcessManagement) when the fighter has the corresponding capability. Clean approach with group_needs_auto_activate().
DefaultHasher stability - Fixed. Replaced with a proper FNV-1a implementation with explicit offset basis and prime. Truly stable across Rust versions.

One new issue (Low)

build_scan_text can panic on multi-byte UTF-8 boundaries:

let snippet = if tr.content.len() > 200 {
    &tr.content[..200]  // panics if byte 200 is mid-character
} else {
    &tr.content
};

tr.content[..200] slices at byte offset 200. If a multi-byte UTF-8 character spans that boundary, this panics at runtime. Tool results containing non-ASCII text (emoji, CJK, etc.) could trigger this. Fix:

let end = tr.content.get(..200).map_or(tr.content.len(), |_| 200);
// or more robustly:
let end = (200..=tr.content.len().min(203)).find(|&i| tr.content.is_char_boundary(i)).unwrap_or(tr.content.len());
let snippet = &tr.content[..end];

Same applies to the tool call arg truncation at s.len() <= 200 — though that one uses <= comparison rather than slicing, so it's safe.

Summary

CI is green. Tests updated and expanded (new test_selector_no_auto_activate_without_capability, test_selector_full_caps_with_auto_activation). The UTF-8 slicing issue is low-severity (tool results are usually ASCII/JSON) but trivial to fix. LGTM after that.

Generated by Claude Code

Use floor_char_boundary(200) instead of byte slice [..200] to avoid panicking when byte 200 lands mid-character in multi-byte UTF-8 text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

humancto

Re-review (round 3)

The UTF-8 slicing fix is correct — tr.content.floor_char_boundary(200) is the right approach and is stable since Rust 1.82.

However, the Format CI check is failing. The diff shows some reformatted assertions in the browser tool tests (lines around browser_navigate and browser_click) — looks like cargo fmt changed the formatting of those multi-line assert! calls. Please run cargo fmt --all and push.

Test check is still in progress at time of review. Everything else (Check, Clippy, GitGuardian) is green.

All original review items remain resolved. LGTM once the formatting CI is fixed.

Generated by Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Mar 26, 2026

View reviewed changes

humancto commented Mar 26, 2026

View reviewed changes

Fix UTF-8 panic in build_scan_text tool result truncation

1a72d54

Use floor_char_boundary(200) instead of byte slice [..200] to avoid panicking when byte 200 lands mid-character in multi-byte UTF-8 text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

humancto commented Mar 26, 2026

View reviewed changes

Fix formatting in browser tool tests (cargo fmt)

2d88ed9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

humancto merged commit ce93514 into main Mar 26, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Token efficiency Phase 2: dynamic tool selection#18

Token efficiency Phase 2: dynamic tool selection#18
humancto merged 4 commits into
mainfrom
feat/token-efficiency-phase2

humancto commented Mar 26, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Uh oh!

humancto left a comment

Uh oh!

humancto left a comment

Uh oh!

humancto left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

humancto commented Mar 26, 2026

Summary

Token savings

Files changed

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

humancto left a comment

Choose a reason for hiding this comment

Review: Token efficiency Phase 2 — dynamic tool selection

Issues

What looks good

Uh oh!

humancto left a comment

Choose a reason for hiding this comment

Re-review: All 5 issues addressed

One new issue (Low)

Summary

Uh oh!

humancto left a comment

Choose a reason for hiding this comment

Re-review (round 3)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant