Skip to content

Token efficiency Phase 2: dynamic tool selection#18

Merged
humancto merged 4 commits into
mainfrom
feat/token-efficiency-phase2
Mar 26, 2026
Merged

Token efficiency Phase 2: dynamic tool selection#18
humancto merged 4 commits into
mainfrom
feat/token-efficiency-phase2

Conversation

@humancto
Copy link
Copy Markdown
Owner

Summary

  • Dynamic tool selection via ToolSelector — instead of sending all 82 tools on every LLM call (~14K tokens), loads a core set (~18 tools) and activates contextual tool groups when keywords appear in conversation
  • 17 contextual tool groups (git, docker, json, browser, archive, crypto, etc.) each with keyword triggers and capability gates
  • Monotonic growth — once a group activates in a bout, it stays active to maximize Anthropic prompt cache hits
  • Zero capability lossshell_exec is always in core as the universal fallback; tool executor still runs ANY tool regardless of what was sent to the LLM

Token savings

Scenario Before After Savings
Core only (typical first turn) 82 tools, ~14K tokens 18 tools, ~3K tokens 78%
Core + 1 group (e.g. git) 82 tools, ~14K tokens 23 tools, ~3.8K tokens 73%
Core + 2 groups 82 tools, ~14K tokens 28 tools, ~4.7K tokens 66%

For a 10-turn conversation, this saves approximately 90K-110K input tokens.

Files changed

  • crates/punch-runtime/src/tools.rsToolGroup enum, ToolSelector struct, keyword matching, 10 unit tests
  • crates/punch-runtime/src/fighter_loop.rs — per-turn dynamic tool selection when available_tools is empty
  • crates/punch-kernel/src/ring.rs — pass empty available_tools + separate mcp_tools to enable dynamic selection
  • crates/punch-runtime/src/lib.rs — export new types
  • crates/punch-kernel/src/{workflow,background,heartbeat_scheduler}.rs — add mcp_tools field (keep static tools for workflows/gorillas)
  • crates/punch-runtime/tests/fighter_loop_tests.rs — add mcp_tools field to test params

Test plan

  • 10 new unit tests for ToolSelector (core presence, keyword activation, capability gating, monotonic growth, change detection, case insensitive, scan window limits, no duplicates, count bounds, group coverage)
  • All 1500+ existing tests pass
  • cargo clippy --workspace -- -D warnings clean
  • cargo fmt --all -- --check clean
  • Manual test: rebuild daemon, verify Telegram bot still reads files, takes screenshots, uses tools correctly
  • Verify tool count in logs drops from ~82 to ~18 on first turn

🤖 Generated with Claude Code

Instead of sending all 82 tools on every LLM call (~14K tokens overhead),
the ToolSelector now loads a core set (~18 tools) and activates contextual
tool groups when keywords appear in the conversation.

Architecture:
- ToolGroup enum: 17 contextual groups (git, docker, json, browser, etc.)
  each with keyword triggers and capability gates
- ToolSelector: stateful per-bout selector with monotonic growth — once a
  group activates it stays active to maximize prompt cache hits
- Core tools always loaded: file_read, file_write, shell_exec, web_fetch,
  web_search, memory, screenshot, channel_notify, self-config
- shell_exec is always in core as the universal fallback — the model never
  loses capability, it just gets a leaner tool menu

Token savings:
- Core only (typical): ~18 tools, ~3K tokens (was 82 tools, ~14K tokens)
- Core + 1 group: ~23 tools, ~3.8K tokens
- Core + 2 groups: ~28 tools, ~4.7K tokens
- 66-80% reduction in tool token overhead per request

The tool executor still runs ANY tool regardless of what was sent to the
LLM — capabilities are checked at execution time, not selection time.

Includes 10 unit tests covering: core presence, keyword activation,
capability gating, monotonic growth, change detection, case insensitivity,
scan window limits, no duplicates, and count bounds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156aef31ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

memory: Arc::clone(&self.memory),
driver: Arc::clone(&self.driver),
available_tools,
available_tools: Vec::new(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep capability tools when enabling dynamic selection

Ring::send_message now always passes available_tools: Vec::new(), which forces run_fighter_loop into keyword-triggered dynamic selection for every fighter. Because ToolSelector::core_tools only includes a subset of capabilities, restricted fighters without ShellExec can start a bout without their granted tools unless the user wording matches a hard-coded trigger, so requests that previously worked can fail on the first turn.

Useful? React with 👍 / 👎.

Comment on lines +455 to +457
ToolGroup::AgentCoordination => {
matches!(c, Capability::AgentSpawn | Capability::AgentMessage)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate agent coordination tools per individual capability

ToolGroup::AgentCoordination is considered permitted when either AgentSpawn or AgentMessage is present, but the group later contributes all three tools (agent_spawn, agent_message, agent_list). This exposes tools a fighter is not authorized to run (for example, AgentSpawn-only fighters seeing message/list tools), which leads to avoidable permission-denied tool calls and wasted loop iterations.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

@humancto humancto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Token efficiency Phase 2 — dynamic tool selection

Good concept — reducing 82 tools / ~14K tokens down to ~18 core tools per turn is a meaningful optimization. The monotonic growth design for prompt cache efficiency is smart, and the test coverage is solid. A few issues need addressing before merge.

Issues

1. Overly broad keywords cause false positives (High)

Several keywords are too common and will trigger groups unnecessarily, undermining the token savings:

  • "image" → triggers Container group, but "image" is extremely common in non-Docker contexts (screenshots, file uploads, profile pictures)
  • "push" / "pull" → triggers SourceControl, but common in general conversation ("push a notification", "pull data from API")
  • "click" → triggers BrowserControl, but used in any UI discussion
  • "process" → triggers ProcessManagement, one of the most common words in programming
  • "agent" / "fighter" / "spawn" → triggers AgentCoordination — these are core Punch terminology that appears in nearly every system prompt and conversation
  • "hash" → triggers Crypto, but common in data structures context
  • "render" → triggers Template, but common in UI/frontend discussions
  • "graph" → triggers KnowledgeGraph, but common in general programming
  • "definition" / "references" → triggers CodeAnalysis, extremely common words
  • "extract" → triggers Archive, but common in data/text discussions

With monotonic growth, each false positive permanently inflates the tool list for the entire bout. Consider using multi-word phrases, requiring 2+ keyword matches per group, or using word-boundary matching instead of substring contains().

2. Keyword scan only reads msg.content, misses tool context (Medium)

build_scan_text only reads msg.content, but the Message struct stores tool calls in msg.tool_calls and tool results in msg.tool_results — both have empty content fields. If a user says "run this" and the tool result contains Docker output, the Container group won't activate on the next turn. This means the keyword scanner is blind to the tool-use context that's most informative about what tools are actually needed.

Consider also scanning tool_calls[].name and tool_results[].content (with a size cap to avoid scanning huge outputs).

3. available_tools.clone() every loop iteration in static path (Medium)

When use_dynamic_tools is false (workflows, gorillas), the code runs params.available_tools.clone() on every iteration of the main loop. This allocates a new Vec of potentially 80+ ToolDefinition structs each turn — a regression from the prior code which passed the vec directly. For gorillas running 10+ iterations this adds up. Consider using Cow or taking the vec by reference in the CompletionRequest.

4. The Codex P1 about restricted fighters is valid (Medium)

The existing tools_for_capabilities returned all granted tools immediately. The new ToolSelector::core_tools only returns a subset — tools like git_*, docker_*, json_*, code_*, process_*, http_* are gated behind keywords even when the fighter has the corresponding capability. A fighter with SourceControl but not ShellExec (no fallback) will have no way to interact with git on the first turn unless the user happens to say "git" or "commit". This is a capability regression for specialized fighters.

Consider pre-activating groups whose capability is granted, or making core_tools include all capability-matched tools (which defeats the purpose). Alternatively, scan the first user message more aggressively or use a larger scan window for turn 1.

5. DefaultHasher comment says "stable" but it isn't (Nit)

The doc comment on compute_hash says "Compute a stable hash" but DefaultHasher is explicitly not guaranteed stable across Rust versions. It's fine for same-process change detection here, but the comment is misleading. Either drop "stable" from the doc or use a fixed hasher.

What looks good

  • Monotonic growth + BTreeSet ordering for deterministic tool lists is well-thought-out
  • Separation of MCP tools from built-in tools is clean
  • The _changed return from select_tools is a nice hook for future cache invalidation
  • 12 new tests with good coverage of edge cases (capability gating, window limits, monotonicity, dedup)
  • Clean integration: workflows/gorillas/tests bypass dynamic selection naturally via available_tools pre-population
  • CI is all green

Generated by Claude Code

…auto-activate

Address all 5 review issues on the dynamic tool selection PR:

1. HIGH - Tighten overly broad keywords: Replace generic words like push,
   image, process, agent, hash, render, graph, definition, extract with
   specific multi-word phrases (e.g. git push, docker image, kill process,
   spawn fighter, sha256, render template, knowledge graph, find definition).

2. MEDIUM - Scan tool context: build_scan_text now includes tool call names,
   short argument values, and tool result snippets (first 200 chars).

3. MEDIUM - Avoid static path clone overhead: Move params.available_tools
   into static_tools Option before the loop.

4. MEDIUM - Auto-activate capability groups: Groups with no core-tool overlap
   auto-activate on construction when the fighter has the required capability.

5. NIT - Stable hash: Replace DefaultHasher with FNV-1a-inspired hash that
   produces identical results across Rust versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Owner Author

@humancto humancto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: All 5 issues addressed

Nice iteration. Going through each original issue:

  1. Overly broad keywords - Fixed. Keywords are now multi-word phrases: "git " (with trailing space), "git commit", "docker image" instead of "image", "click button"/"click element" instead of "click", "spawn fighter"/"spawn agent" instead of "agent", "sha256" instead of "hash", "knowledge graph" instead of "graph", "find definition" instead of "definition", etc. Much better false-positive resistance.

  2. Keyword scan misses tool context - Fixed. build_scan_text now scans tool_calls[].name + short string args, and first 200 chars of tool_results[].content.

  3. available_tools.clone() per iteration - Partially fixed. The code moved params.available_tools into static_tools to avoid holding the full params, but it still calls .clone() every iteration since CompletionRequest takes ownership. Acceptable — would require a larger API change to fully eliminate.

  4. Capability regression for restricted fighters - Fixed. ToolSelector::new() auto-activates groups (SourceControl, Container, DataManipulation, Schedule, BrowserControl, AgentCoordination, ProcessManagement) when the fighter has the corresponding capability. Clean approach with group_needs_auto_activate().

  5. DefaultHasher stability - Fixed. Replaced with a proper FNV-1a implementation with explicit offset basis and prime. Truly stable across Rust versions.

One new issue (Low)

build_scan_text can panic on multi-byte UTF-8 boundaries:

let snippet = if tr.content.len() > 200 {
    &tr.content[..200]  // panics if byte 200 is mid-character
} else {
    &tr.content
};

tr.content[..200] slices at byte offset 200. If a multi-byte UTF-8 character spans that boundary, this panics at runtime. Tool results containing non-ASCII text (emoji, CJK, etc.) could trigger this. Fix:

let end = tr.content.get(..200).map_or(tr.content.len(), |_| 200);
// or more robustly:
let end = (200..=tr.content.len().min(203)).find(|&i| tr.content.is_char_boundary(i)).unwrap_or(tr.content.len());
let snippet = &tr.content[..end];

Same applies to the tool call arg truncation at s.len() <= 200 — though that one uses <= comparison rather than slicing, so it's safe.

Summary

CI is green. Tests updated and expanded (new test_selector_no_auto_activate_without_capability, test_selector_full_caps_with_auto_activation). The UTF-8 slicing issue is low-severity (tool results are usually ASCII/JSON) but trivial to fix. LGTM after that.


Generated by Claude Code

Use floor_char_boundary(200) instead of byte slice [..200] to avoid
panicking when byte 200 lands mid-character in multi-byte UTF-8 text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Owner Author

@humancto humancto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (round 3)

The UTF-8 slicing fix is correct — tr.content.floor_char_boundary(200) is the right approach and is stable since Rust 1.82.

However, the Format CI check is failing. The diff shows some reformatted assertions in the browser tool tests (lines around browser_navigate and browser_click) — looks like cargo fmt changed the formatting of those multi-line assert! calls. Please run cargo fmt --all and push.

Test check is still in progress at time of review. Everything else (Check, Clippy, GitGuardian) is green.

All original review items remain resolved. LGTM once the formatting CI is fixed.


Generated by Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@humancto humancto merged commit ce93514 into main Mar 26, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant