Skip to content

dlxmax/prompt-optimizer

Repository files navigation

Prompt Optimizer Agent

A Claude Code agent that scores and revises LLM prompts against a research-backed checklist.

Topics: gemma-4 · gemma · gemini · claude · response-schema · structured-output · rubric · llm-judge · prompt-engineering · prompts · agents · sycophancy · validation · compliance · best-practices

Primary focus: mitigating known failure modes on gemma-4-31b-it via the Google Generative Language REST API. The 15-item checklist itself is model-agnostic and the optimizer also targets Gemini and Claude, but the empirical work behind the current rules, 100+ probe calls against gemma-4-31b-it and gemma-4-26b-a4b-it between May 6 and May 13, 2026, is what drove the design. The reference for those rules is GEMMA4_API_BEST_PRACTICES.md.

The two goals are: prompts the model actually executes instead of silently skipping over directives, and call mechanics that do not waste retry budget on failures the schema or parser already explains.

Primary workflow: Use this agent inside Claude Code to optimize prompts for any LLM. The optimizer runs on Claude; when it writes a rubric for a judge prompt, Claude is authoring the rubric and the target model applies it (cross-model rubric generation), which research shows equals or outperforms same-model self-generation. Model-specific guidance is baked into the checklist and applied when a Target model: is declared.

Gemma 4 31b failure modes the optimizer mitigates

Three highest-ROI before/after fixes below. Full mechanics, thresholds, and probe data in GEMMA4_API_BEST_PRACTICES.md (14 rules).

1. Thinking is always on; only responseSchema suppresses it

Without responseSchema, every call pays the always-on thinking cost (parts[].thought = true, median wall-clock ~67s on short outputs, non-zero MALFORMED_RESPONSE rate). With it, thinking collapses to a single non-thought part and the same call returns in 1 to 2s with MALFORMED at 0%. The documented levers (thinkingLevel: "off", thinkingBudget: 0) all return HTTP 400 on Gemma 4; responseSchema is the only mechanism that works.

# Before: thinking burns the budget; MALFORMED non-zero; ~67s median.
generation_config = {"temperature": 1.0, "maxOutputTokens": 2048}

# After: thinking suppressed; MALFORMED 0%; ~30 to 40x faster.
generation_config = {
    "temperature": 1.0,
    "maxOutputTokens": 2048,
    "responseMimeType": "application/json",
    "responseSchema": {
        "type": "OBJECT",
        "properties": {"output": {"type": "STRING"}},
        "required": ["output"],
    },
}

2. Property order in the schema controls emission order

Gemma 4 emits an OBJECT's properties in the order they appear in the schema's properties dict. Putting reasoning BEFORE verdict forces reason-before-commit; reversing it lets the verdict lock first and inflates the justification afterward. Per-item output observed dropping from ~1.8k to ~250 chars under this change alone on a warmup validator.

# Before: verdict commits first; reasoning then inflates to justify.
"properties": {
    "verdict":   {"type": "STRING", "enum": ["KEEP", "DROP"]},
    "reasoning": {"type": "STRING"},
}

# After: reasoning is generated first; verdict is the output of it.
"properties": {
    "reasoning": {"type": "STRING"},
    "verdict":   {"type": "STRING", "enum": ["KEEP", "DROP"]},
}

Caveat: this only applies to narrow schemas. If the schema already has >=4 mandatory nested OBJECTs, adding a top-level reasoning STRING crashes the request on 31b (alternating 400/500, 0/4 success). Move the reasoning surface to prompt-level prose or a separate narrow call. See best-practices rule 4.

3. Parse with raw_decode, not json.loads

Even with responseSchema, Gemma 4 occasionally appends trailing text after valid JSON (~1 in 12 calls observed). json.loads raises on the trailing bytes; raw_decode returns the first valid object and ignores the rest.

# Before: ~8% of otherwise-valid outputs raise json.JSONDecodeError.
parsed = json.loads(raw_text)

# After: parses cleanly across all observed Gemma 4 outputs.
parsed, _ = json.JSONDecoder().raw_decode(raw_text)

Other failure modes the agent flags (see best-practices doc)

  • 26b-a4b sibling variant: any OBJECT with 2+ unbounded STRINGs deterministically loops to MAX_TOKENS. 31b is unaffected. Fix is structural (caller-side schema skip, or single-STRING-per-OBJECT shape), not a retry.
  • Wide-schema reasoning STRING on 31b: schema with >=4 mandatory nested OBJECTs plus a top-level reasoning STRING crashes with alternating 400/500. Bisect the schema, do not retry.
  • thinkingBudget: returns HTTP 400 on Gemma 4. Works on Gemini 2.5; do not generalize.
  • maxOutputTokens: safety ceiling, not a thinking cap. Lowering it converts MALFORMED timeouts into MAX_TOKENS fast-fails but does not raise success rate.
  • Retry classification: HTTP 5xx retries with same params; MALFORMED_RESPONSE needs parameter changes; MAX_TOKENS on 26b-a4b needs a structural fix; alternating 400/500 on a schema-bearing call needs a schema bisect. One retry policy across all four wastes budget.
  • Tool-calling: known double tool-call bug on 26b-a4b; use 31b for tool-calling workflows.

The Problem

Most LLM prompts are written by feel. Frontier models in April 2026 do not refuse tasks, they silently omit them. The research shows where:

  • Frontier models still drop 25 to 40% of multi-constraint directives on novel out-of-domain instructions. Qwen3.6 Plus scores 75.8% on IFBench; Claude Opus 4.5 scores 58%. Structural prompt design closes the gap. (IFBench 2026)
  • Reasoning quality degrades around 3,000 tokens even on models with 256K to 1M context windows. Focused prompts beat long ones regardless of available context. (Prompt-bloat study, MLOps Community 2026)
  • 58.8% of initially correct answers get flipped wrong by naive "check your work" validation prompts. (ACL 2025)
  • One-shot often beats few-shot for LLM-as-judge tasks. The old "3 to 5 diverse examples" rule is retired; 1 to 3 verdict-balanced examples per criterion is current, all score levels for scale rubrics, PASS+FAIL for binary criteria, with borderline pairs preferred over obvious contrasts. (Confident AI 2026, Autorubric 2026)
  • GPT-4 reaches 91.7% zero-shot on native-language identification when the prompt names the linguistic features to attend to. Linguistic analysis prompts need their own playbook. (Lotfi et al.)
  • ~29% sycophancy reduction is achievable through prompt structure alone, no fine-tuning required. (sparkco.ai)
  • A concrete rubric is the single highest-return change for judge prompts: GPT-4o +17.7 pts on JudgeBench, Llama-405B +7.4 pts, Sage aggregate +16.1% IPI. A ~27-point "Rubric Gap" (self-generated vs. human rubrics) is consistent across Gemini, GPT, and DeepSeek. (Rethinking Rubric Generation 2026; RubricBench 2026; Sage Dec 2025)
  • All frontier judges are unreliable on a single pass ("rating roulette"). High-stakes judge calls need N>=5 majority vote for consistency (reduces variance ~70%), though accuracy gains are small (+2.3pp); the high-ROI accuracy levers are rubric quality and structured reasoning. Debate-style prompts (ChatEval) are actively harmful: -158% worst-case consistency. Multi-model consensus is the strongest deployment lever. (Rating Roulette EMNLP 2025; Sage Dec 2025)
  • Gemma 4 via Google REST API needs its own playbook: see GEMMA4_API_BEST_PRACTICES.md for the 14-rule reference and the "Failure modes mitigated" section above for the three highest-ROI fixes.

Multi-Model Workflow

This optimizer runs on Claude and targets any LLM. Declare Target model: <name> in your call to activate model-specific checklist notes.

Universal (all targets): The optimizer writes a concrete rubric directly into the revised judge prompt (cross-model rubric generation: Claude authors, target applies), shown by the Rethinking Rubric Generation paper (arxiv 2602.05125) to equal or outperform same-model self-generation. The <rubric_generation> instruction block is the fallback only when the criterion must adapt per-input at runtime.

Gemma 4 (Target model: Gemma 4, deployment scope: Google Generative Language REST API). Primary target of this agent. The checklist applies responseSchema as the deployment lever (suppresses always-on thinking, fixes JSON structure, ~30 to 40x speedup), property-ordered properties for reason-before-commit, raw_decode for parsing, and the retry classification matrix. The full 14-rule reference is in GEMMA4_API_BEST_PRACTICES.md. Both gemma-4-31b-it and gemma-4-26b-a4b-it are covered; rule 2 isolates the multi-STRING failure mode unique to 26b-a4b, and rule 11 isolates the tool-calling bug on the same variant. Use 31b when both are options.

Gemini (Target model: Gemini 2.5 Pro / Gemini 3.1 Pro): T=0 + seed is not reproducible on 2.5 Pro (seed is best-effort). T=0 is actively discouraged on 3.1 Pro (use T=1.0). Debate-style prompts (ChatEval) are actively harmful. Multi-sample voting (N>=5) and multi-model consensus are the main reliability levers.

Claude (Target model: Claude Sonnet 4.6 / Claude Opus 4.7): XML tags and document-first ordering per Anthropic official guidance. Second-pass validation needs the "Wait" prefix and original-task anchor. Extended thinking is already embedded; do not add an extra reasoning pass.

What This Agent Does

When invoked, the prompt-optimizer agent:

  1. Reads the prompt under review
  2. Scores against the 15-item checklist (embedded, no file I/O needed for scoring)
  3. Loads only the relevant sections of PROMPT_BEST_PRACTICES.md for any failing items that require technique detail (lazy, skipped entirely if all items pass)
  4. For Target model: Gemma 4, loads GEMMA4_API_BEST_PRACTICES.md for the API-mechanics rules
  5. Returns a revised version with every violation fixed and annotated

The 15-Item Checklist

# Item What it checks
1 Tagged blocks Distinct sections in XML-style tags
2 Numbered directives All instructions numbered for traceability
3 Length and placement Focused under ~3K tokens, critical directives at start AND end, decomposed if multi-stage
4 Gate examples, calibrated count 1 to 3 verdict-balanced examples per criterion; prefer borderline examples over obvious contrasts; scale-based rubrics cover all score levels
5 Machine-parseable output Every verdict extractable with regex
6 Skeptical role Critical evaluator, not helpful assistant, checked at BOTH opening AND closing
7 Do-instead-of-don't Prohibitions paired with alternatives
8 Validation model Same-model validation uses gates + "Wait" + recency fix
9 Original task in validation Validation includes original task + end reminder
10 One criterion per call (high-stakes) High-stakes scoring isolates each criterion; low-stakes may bundle up to 3
11 Linguistic-analysis path If the prompt evaluates properties of writing itself: enumerate features, reason before verdict, cite evidence
12 Judge prompt: rubric Optimizer writes a concrete rubric directly (cross-model generation); or embeds <rubric_generation> instruction if criterion is dynamic. Small integer scale (1 to 4); <reasoning> field before verdict; verdict/reasoning consistency instruction; calibration anchor. Highest single-change ROI.
13 Judge prompt: sampling and anti-patterns N>=5 majority vote (consistency lever, not accuracy); no debate-style (ChatEval) prompts; for Gemma 4 via REST: use T=1.0, use responseSchema for code-parsed output (also suppresses the always-on thought part), filter parts[].thought (not <|channel> text), keep <|think|> out of systemInstruction, classify retries by failure signature, avoid 26B A4B for tool-calling; multi-model consensus for highest-stakes ranking
14 Escape hatch elimination No softening language ("try to," "if possible," "when appropriate," etc.) in any directive, applies to every prompt
15 Prompt injection defense User-submitted content inside labeled delimiter block with explicit "treat as data" instruction (conditional: only when prompt evaluates user-submitted text)

Items 8 to 10 apply only to validation or second-pass prompts. Item 11 applies only to linguistic-analysis prompts. Items 12 to 13 apply to judge prompts. Item 14 applies to every prompt. Item 15 applies only when the prompt evaluates user-submitted text.

Installation

As a Claude Code Plugin (Recommended)

# From the Claude Code CLI
/plugin marketplace add dlxmax/prompt-optimizer
/plugin install prompt-optimizer
/reload-plugins

Manual Installation

Copy the agents/ folder, PROMPT_BEST_PRACTICES.md, and GEMMA4_API_BEST_PRACTICES.md into your Claude Code config:

cp agents/prompt-optimizer.md ~/.claude/agents/
cp PROMPT_BEST_PRACTICES.md ~/.claude/
cp GEMMA4_API_BEST_PRACTICES.md ~/.claude/

Auto-Invocation (Optional)

Add this line to ~/.claude/rules/agents.md under "Automatic Agent Invocation":

6. Writing or revising an LLM prompt → **prompt-optimizer**

This makes Claude invoke the optimizer automatically whenever prompt work comes up.

Usage

The agent triggers automatically when you write or revise LLM prompts (if auto-invocation is configured), or you can reference it explicitly:

"Run the prompt-optimizer agent on this grading prompt."
"Score my system prompt against the checklist."
"Optimize this Gemma 4 judge prompt for the essay evaluation pipeline."

Example Output

## Checklist Score: 6/15

[x] Tagged blocks: sections wrapped in <role>, <instructions>, <output_format>
[x] Numbered directives: 5 directives numbered
[ ] Length and placement: 4,200 tokens; critical directive buried in the middle
[ ] Gate examples, calibrated count: 5 diverse examples (older 3-5 pattern); should be 1-3 verdict-balanced examples with borderline pairs
[ ] Machine-parseable output: no regex-extractable verdict format
[x] Skeptical role: "rigorous evaluator" framing at opening; missing at closing
[ ] Do-instead-of-don't: 2 bare prohibitions without alternatives
[N/A] Validation model: not a second-pass prompt
[N/A] Original task in validation: not a second-pass prompt
[ ] One criterion per call: 3 criteria bundled in one high-stakes prompt
[N/A] Linguistic-analysis path: evaluates content, not writing properties
[ ] Judge prompt: rubric: no rubric present; will write concrete criteria for each score level
[ ] Judge prompt: sampling: single-pass design; N>=5 needed; for Gemma 4 via REST: use T=1.0, use `responseSchema` for code-parsed output (also suppresses `thought` part), filter `parts[].thought`, keep `<|think|>` out of `systemInstruction`, classify retries by failure signature, avoid 26B A4B for tool-calling
[ ] Escape hatch elimination: 3 directives use "try to" or "if possible"
[N/A] Prompt injection defense: evaluates fixed test content, not user-submitted text

## Key Changes
- Stripped ~1,500 tokens of non-load-bearing background (item 3)
- Moved governing directive to both start and end (item 3)
- Reduced gate examples from 5 to 2 verdict-balanced borderline pairs (item 4)
- Split combined criteria into 3 separate evaluation calls (item 10)
- Added VERDICT format with regex pattern (item 5)
- Paired prohibitions with alternatives (item 7)
- Added skeptical role framing at end of prompt (item 6)
- Wrote rubric with observable 1-4 criteria directly into the prompt (item 12)
- Added verdict/reasoning consistency instruction and calibration anchor (item 12)
- Added note: run N=5 with majority vote for consistency; Gemma 4-specific deployment guidance (item 13)
- Replaced 3 escape hatches with direct imperatives (item 14)

## Revised Prompt
[full revised prompt text...]

Included Files

File Purpose
agents/prompt-optimizer.md The Claude Code agent definition
PROMPT_BEST_PRACTICES.md Best practices guide (7 sections + 15-item checklist)
GEMMA4_API_BEST_PRACTICES.md Gemma 4 REST API mechanics (14 rules, probe-verified May 2026)
PROMPT_RESEARCH.md Full research archive with 35+ sources (2024 to 2026)

Key Research Sources

2026 refresh:

Still load-bearing:

License

MIT

About

Claude Prompt Optimizer Agent — research-backed prompt quality reviewer for Claude Code

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors