Prompt Optimizer Agent

A Claude Code agent that scores and revises LLM prompts against a research-backed checklist.

Topics: gemma-4 · gemma · gemini · claude · response-schema · structured-output · rubric · llm-judge · prompt-engineering · prompts · agents · sycophancy · validation · compliance · best-practices

Primary focus: mitigating known failure modes on gemma-4-31b-it via the Google Generative Language REST API. The 15-item checklist itself is model-agnostic and the optimizer also targets Gemini and Claude, but the empirical work behind the current rules, 100+ probe calls against gemma-4-31b-it and gemma-4-26b-a4b-it between May 6 and May 13, 2026, is what drove the design. The reference for those rules is GEMMA4_API_BEST_PRACTICES.md.

The two goals are: prompts the model actually executes instead of silently skipping over directives, and call mechanics that do not waste retry budget on failures the schema or parser already explains.

Primary workflow: Use this agent inside Claude Code to optimize prompts for any LLM. The optimizer runs on Claude; when it writes a rubric for a judge prompt, Claude is authoring the rubric and the target model applies it (cross-model rubric generation), which research shows equals or outperforms same-model self-generation. Model-specific guidance is baked into the checklist and applied when a Target model: is declared.

Gemma 4 31b failure modes the optimizer mitigates

Three highest-ROI before/after fixes below. Full mechanics, thresholds, and probe data in GEMMA4_API_BEST_PRACTICES.md (14 rules).

1. Thinking is always on; only `responseSchema` suppresses it

Without responseSchema, every call pays the always-on thinking cost (parts[].thought = true, median wall-clock ~67s on short outputs, non-zero MALFORMED_RESPONSE rate). With it, thinking collapses to a single non-thought part and the same call returns in 1 to 2s with MALFORMED at 0%. The documented levers (thinkingLevel: "off", thinkingBudget: 0) all return HTTP 400 on Gemma 4; responseSchema is the only mechanism that works.

# Before: thinking burns the budget; MALFORMED non-zero; ~67s median.
generation_config = {"temperature": 1.0, "maxOutputTokens": 2048}

# After: thinking suppressed; MALFORMED 0%; ~30 to 40x faster.
generation_config = {
    "temperature": 1.0,
    "maxOutputTokens": 2048,
    "responseMimeType": "application/json",
    "responseSchema": {
        "type": "OBJECT",
        "properties": {"output": {"type": "STRING"}},
        "required": ["output"],
    },
}

2. Property order in the schema controls emission order

Gemma 4 emits an OBJECT's properties in the order they appear in the schema's properties dict. Putting reasoning BEFORE verdict forces reason-before-commit; reversing it lets the verdict lock first and inflates the justification afterward. Per-item output observed dropping from ~1.8k to ~250 chars under this change alone on a warmup validator.

# Before: verdict commits first; reasoning then inflates to justify.
"properties": {
    "verdict":   {"type": "STRING", "enum": ["KEEP", "DROP"]},
    "reasoning": {"type": "STRING"},
}

# After: reasoning is generated first; verdict is the output of it.
"properties": {
    "reasoning": {"type": "STRING"},
    "verdict":   {"type": "STRING", "enum": ["KEEP", "DROP"]},
}

Caveat: this only applies to narrow schemas. If the schema already has >=4 mandatory nested OBJECTs, adding a top-level reasoning STRING crashes the request on 31b (alternating 400/500, 0/4 success). Move the reasoning surface to prompt-level prose or a separate narrow call. See best-practices rule 4.

3. Parse with `raw_decode`, not `json.loads`

Even with responseSchema, Gemma 4 occasionally appends trailing text after valid JSON (~1 in 12 calls observed). json.loads raises on the trailing bytes; raw_decode returns the first valid object and ignores the rest.

# Before: ~8% of otherwise-valid outputs raise json.JSONDecodeError.
parsed = json.loads(raw_text)

# After: parses cleanly across all observed Gemma 4 outputs.
parsed, _ = json.JSONDecoder().raw_decode(raw_text)

Other failure modes the agent flags (see best-practices doc)

26b-a4b sibling variant: any OBJECT with 2+ unbounded STRINGs deterministically loops to MAX_TOKENS. 31b is unaffected. Fix is structural (caller-side schema skip, or single-STRING-per-OBJECT shape), not a retry.
Wide-schema reasoning STRING on 31b: schema with >=4 mandatory nested OBJECTs plus a top-level reasoning STRING crashes with alternating 400/500. Bisect the schema, do not retry.
thinkingBudget: returns HTTP 400 on Gemma 4. Works on Gemini 2.5; do not generalize.
maxOutputTokens: safety ceiling, not a thinking cap. Lowering it converts MALFORMED timeouts into MAX_TOKENS fast-fails but does not raise success rate.
Retry classification: HTTP 5xx retries with same params; MALFORMED_RESPONSE needs parameter changes; MAX_TOKENS on 26b-a4b needs a structural fix; alternating 400/500 on a schema-bearing call needs a schema bisect. One retry policy across all four wastes budget.
Tool-calling: known double tool-call bug on 26b-a4b; use 31b for tool-calling workflows.

The Problem

Most LLM prompts are written by feel. Frontier models in April 2026 do not refuse tasks, they silently omit them. The research shows where:

Frontier models still drop 25 to 40% of multi-constraint directives on novel out-of-domain instructions. Qwen3.6 Plus scores 75.8% on IFBench; Claude Opus 4.5 scores 58%. Structural prompt design closes the gap. (IFBench 2026)
Reasoning quality degrades around 3,000 tokens even on models with 256K to 1M context windows. Focused prompts beat long ones regardless of available context. (Prompt-bloat study, MLOps Community 2026)
58.8% of initially correct answers get flipped wrong by naive "check your work" validation prompts. (ACL 2025)
One-shot often beats few-shot for LLM-as-judge tasks. The old "3 to 5 diverse examples" rule is retired; 1 to 3 verdict-balanced examples per criterion is current, all score levels for scale rubrics, PASS+FAIL for binary criteria, with borderline pairs preferred over obvious contrasts. (Confident AI 2026, Autorubric 2026)
GPT-4 reaches 91.7% zero-shot on native-language identification when the prompt names the linguistic features to attend to. Linguistic analysis prompts need their own playbook. (Lotfi et al.)
~29% sycophancy reduction is achievable through prompt structure alone, no fine-tuning required. (sparkco.ai)
A concrete rubric is the single highest-return change for judge prompts: GPT-4o +17.7 pts on JudgeBench, Llama-405B +7.4 pts, Sage aggregate +16.1% IPI. A ~27-point "Rubric Gap" (self-generated vs. human rubrics) is consistent across Gemini, GPT, and DeepSeek. (Rethinking Rubric Generation 2026; RubricBench 2026; Sage Dec 2025)
All frontier judges are unreliable on a single pass ("rating roulette"). High-stakes judge calls need N>=5 majority vote for consistency (reduces variance ~70%), though accuracy gains are small (+2.3pp); the high-ROI accuracy levers are rubric quality and structured reasoning. Debate-style prompts (ChatEval) are actively harmful: -158% worst-case consistency. Multi-model consensus is the strongest deployment lever. (Rating Roulette EMNLP 2025; Sage Dec 2025)
Gemma 4 via Google REST API needs its own playbook: see GEMMA4_API_BEST_PRACTICES.md for the 14-rule reference and the "Failure modes mitigated" section above for the three highest-ROI fixes.

Multi-Model Workflow

This optimizer runs on Claude and targets any LLM. Declare Target model: <name> in your call to activate model-specific checklist notes.

Universal (all targets): The optimizer writes a concrete rubric directly into the revised judge prompt (cross-model rubric generation: Claude authors, target applies), shown by the Rethinking Rubric Generation paper (arxiv 2602.05125) to equal or outperform same-model self-generation. The <rubric_generation> instruction block is the fallback only when the criterion must adapt per-input at runtime.

Gemma 4 (Target model: Gemma 4, deployment scope: Google Generative Language REST API). Primary target of this agent. The checklist applies responseSchema as the deployment lever (suppresses always-on thinking, fixes JSON structure, ~30 to 40x speedup), property-ordered properties for reason-before-commit, raw_decode for parsing, and the retry classification matrix. The full 14-rule reference is in GEMMA4_API_BEST_PRACTICES.md. Both gemma-4-31b-it and gemma-4-26b-a4b-it are covered; rule 2 isolates the multi-STRING failure mode unique to 26b-a4b, and rule 11 isolates the tool-calling bug on the same variant. Use 31b when both are options.

Gemini (Target model: Gemini 2.5 Pro / Gemini 3.1 Pro): T=0 + seed is not reproducible on 2.5 Pro (seed is best-effort). T=0 is actively discouraged on 3.1 Pro (use T=1.0). Debate-style prompts (ChatEval) are actively harmful. Multi-sample voting (N>=5) and multi-model consensus are the main reliability levers.

Claude (Target model: Claude Sonnet 4.6 / Claude Opus 4.7): XML tags and document-first ordering per Anthropic official guidance. Second-pass validation needs the "Wait" prefix and original-task anchor. Extended thinking is already embedded; do not add an extra reasoning pass.

What This Agent Does

When invoked, the prompt-optimizer agent:

Reads the prompt under review
Scores against the 15-item checklist (embedded, no file I/O needed for scoring)
Loads only the relevant sections of PROMPT_BEST_PRACTICES.md for any failing items that require technique detail (lazy, skipped entirely if all items pass)
For Target model: Gemma 4, loads GEMMA4_API_BEST_PRACTICES.md for the API-mechanics rules
Returns a revised version with every violation fixed and annotated

The 15-Item Checklist

#	Item	What it checks
1	Tagged blocks	Distinct sections in XML-style tags
2	Numbered directives	All instructions numbered for traceability
3	Length and placement	Focused under ~3K tokens, critical directives at start AND end, decomposed if multi-stage
4	Gate examples, calibrated count	1 to 3 verdict-balanced examples per criterion; prefer borderline examples over obvious contrasts; scale-based rubrics cover all score levels
5	Machine-parseable output	Every verdict extractable with regex
6	Skeptical role	Critical evaluator, not helpful assistant, checked at BOTH opening AND closing
7	Do-instead-of-don't	Prohibitions paired with alternatives
8	Validation model	Same-model validation uses gates + "Wait" + recency fix
9	Original task in validation	Validation includes original task + end reminder
10	One criterion per call (high-stakes)	High-stakes scoring isolates each criterion; low-stakes may bundle up to 3
11	Linguistic-analysis path	If the prompt evaluates properties of writing itself: enumerate features, reason before verdict, cite evidence
12	Judge prompt: rubric ★	Optimizer writes a concrete rubric directly (cross-model generation); or embeds `<rubric_generation>` instruction if criterion is dynamic. Small integer scale (1 to 4); `<reasoning>` field before verdict; verdict/reasoning consistency instruction; calibration anchor. Highest single-change ROI.
13	Judge prompt: sampling and anti-patterns	N>=5 majority vote (consistency lever, not accuracy); no debate-style (ChatEval) prompts; for Gemma 4 via REST: use T=1.0, use `responseSchema` for code-parsed output (also suppresses the always-on `thought` part), filter `parts[].thought` (not `<\|channel>` text), keep `<\|think\|>` out of `systemInstruction`, classify retries by failure signature, avoid 26B A4B for tool-calling; multi-model consensus for highest-stakes ranking
14	Escape hatch elimination	No softening language ("try to," "if possible," "when appropriate," etc.) in any directive, applies to every prompt
15	Prompt injection defense	User-submitted content inside labeled delimiter block with explicit "treat as data" instruction (conditional: only when prompt evaluates user-submitted text)

Items 8 to 10 apply only to validation or second-pass prompts. Item 11 applies only to linguistic-analysis prompts. Items 12 to 13 apply to judge prompts. Item 14 applies to every prompt. Item 15 applies only when the prompt evaluates user-submitted text.

Installation

As a Claude Code Plugin (Recommended)

# From the Claude Code CLI
/plugin marketplace add dlxmax/prompt-optimizer
/plugin install prompt-optimizer
/reload-plugins

Manual Installation

Copy the agents/ folder, PROMPT_BEST_PRACTICES.md, and GEMMA4_API_BEST_PRACTICES.md into your Claude Code config:

cp agents/prompt-optimizer.md ~/.claude/agents/
cp PROMPT_BEST_PRACTICES.md ~/.claude/
cp GEMMA4_API_BEST_PRACTICES.md ~/.claude/

Auto-Invocation (Optional)

Add this line to ~/.claude/rules/agents.md under "Automatic Agent Invocation":

6. Writing or revising an LLM prompt → **prompt-optimizer**

This makes Claude invoke the optimizer automatically whenever prompt work comes up.

Usage

The agent triggers automatically when you write or revise LLM prompts (if auto-invocation is configured), or you can reference it explicitly:

"Run the prompt-optimizer agent on this grading prompt."
"Score my system prompt against the checklist."
"Optimize this Gemma 4 judge prompt for the essay evaluation pipeline."

Example Output

## Checklist Score: 6/15

[x] Tagged blocks: sections wrapped in <role>, <instructions>, <output_format>
[x] Numbered directives: 5 directives numbered
[ ] Length and placement: 4,200 tokens; critical directive buried in the middle
[ ] Gate examples, calibrated count: 5 diverse examples (older 3-5 pattern); should be 1-3 verdict-balanced examples with borderline pairs
[ ] Machine-parseable output: no regex-extractable verdict format
[x] Skeptical role: "rigorous evaluator" framing at opening; missing at closing
[ ] Do-instead-of-don't: 2 bare prohibitions without alternatives
[N/A] Validation model: not a second-pass prompt
[N/A] Original task in validation: not a second-pass prompt
[ ] One criterion per call: 3 criteria bundled in one high-stakes prompt
[N/A] Linguistic-analysis path: evaluates content, not writing properties
[ ] Judge prompt: rubric: no rubric present; will write concrete criteria for each score level
[ ] Judge prompt: sampling: single-pass design; N>=5 needed; for Gemma 4 via REST: use T=1.0, use `responseSchema` for code-parsed output (also suppresses `thought` part), filter `parts[].thought`, keep `<|think|>` out of `systemInstruction`, classify retries by failure signature, avoid 26B A4B for tool-calling
[ ] Escape hatch elimination: 3 directives use "try to" or "if possible"
[N/A] Prompt injection defense: evaluates fixed test content, not user-submitted text

## Key Changes
- Stripped ~1,500 tokens of non-load-bearing background (item 3)
- Moved governing directive to both start and end (item 3)
- Reduced gate examples from 5 to 2 verdict-balanced borderline pairs (item 4)
- Split combined criteria into 3 separate evaluation calls (item 10)
- Added VERDICT format with regex pattern (item 5)
- Paired prohibitions with alternatives (item 7)
- Added skeptical role framing at end of prompt (item 6)
- Wrote rubric with observable 1-4 criteria directly into the prompt (item 12)
- Added verdict/reasoning consistency instruction and calibration anchor (item 12)
- Added note: run N=5 with majority vote for consistency; Gemma 4-specific deployment guidance (item 13)
- Replaced 3 escape hatches with direct imperatives (item 14)

## Revised Prompt
[full revised prompt text...]

Included Files

File	Purpose
`agents/prompt-optimizer.md`	The Claude Code agent definition
`PROMPT_BEST_PRACTICES.md`	Best practices guide (7 sections + 15-item checklist)
`GEMMA4_API_BEST_PRACTICES.md`	Gemma 4 REST API mechanics (14 rules, probe-verified May 2026)
`PROMPT_RESEARCH.md`	Full research archive with 35+ sources (2024 to 2026)

Key Research Sources

2026 refresh:

IFBench leaderboard, April 2026: current frontier instruction-following scores
Rethinking Rubric Generation (RRD), arxiv 2602.05125: GPT-4o +17.7 pts, Llama-405B +7.4 pts from rubric design; cross-model generation validated
RubricBench, arxiv 2603.01562: ~27-pt Rubric Gap is equal across Gemini, GPT, DeepSeek, universal bottleneck
Same Input, Different Scores, arxiv 2603.04417: Gemini shows highest single-model variance among major families
LLMLingua-2, NAACL 2025: task-agnostic prompt compression, 3x to 6x
Prompt-bloat study, MLOps Community 2026: the ~3K token degradation threshold
Label Your Data LLM-as-judge 2026: few-shot instability and one-shot dominance
Native Language Identification with LLMs (Lotfi et al.): GPT-4 zero-shot 91.7% TOEFL11
Rating Roulette, EMNLP 2025: single-pass judges unreliable; N>=5 needed
Sage benchmark, Dec 2025: rubric generation +16.1% IPI; debate prompts -158%; Gemini degrades 200% on hard cases
Google Gemma 4 Technical Report, 2026: T=1.0 recommended, 26B A4B double tool-call bug, JSON adherence weakness, injection susceptibility
REST API empirical probes against gemma-4-31b-it, gemma-4-26b-a4b-it, gemini-2.5-flash, and gemini-3.1-flash-lite-preview on generativelanguage.googleapis.com/v1beta/models/<model>:generateContent. Summary of probe set (full mechanics in GEMMA4_API_BEST_PRACTICES.md):
- May 6, 2026: 28 calls. Thinking surfaces structurally as parts[].thought = true, not as <|channel> text markers. Thinking cannot be disabled via thinkingConfig. Baseline transient 500 INTERNAL rate ~20%.
- May 12, 2026: 72-call burst-rewrite benchmark. responseSchema produces ~30 to 40x wall-clock speedup and drops MALFORMED rate to 0%. 26b-a4b deterministically loops on multi-STRING schemas; 31b is unaffected.
- May 13, 2026: schema bisect. Wide schemas (>=4 mandatory nested OBJECTs) plus a top-level reasoning STRING crash 31b 0/4 with alternating 400/500. Property order in responseSchema controls emission order: reason-before-commit shrinks output ~7x on a warmup validator.
Judging the Judges, ACL/IJCNLP 2025: position bias is incoherent; swap-and-count less effective

Still load-bearing:

AGENTIF: NeurIPS 2025 decomposition finding (headline numbers superseded by IFBench 2026)
Self-Correction Blind Spot: the "Wait" prefix discovery
Dark Side of Self-Correction: ACL 2025 recency bias fix
HuggingFace LLM-as-Judge cookbook: 1-4 scale; evaluation field before verdict; 0.563 to 0.843 correlation improvement
Anthropic Claude Prompting Guide: XML tags and document-first ordering

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.claude-plugin		.claude-plugin
agents		agents
.gitignore		.gitignore
GEMMA4_API_BEST_PRACTICES.md		GEMMA4_API_BEST_PRACTICES.md
LICENSE		LICENSE
PROMPT_BEST_PRACTICES.md		PROMPT_BEST_PRACTICES.md
PROMPT_RESEARCH.md		PROMPT_RESEARCH.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prompt Optimizer Agent

Gemma 4 31b failure modes the optimizer mitigates

1. Thinking is always on; only `responseSchema` suppresses it

2. Property order in the schema controls emission order

3. Parse with `raw_decode`, not `json.loads`

Other failure modes the agent flags (see best-practices doc)

The Problem

Multi-Model Workflow

What This Agent Does

The 15-Item Checklist

Installation

As a Claude Code Plugin (Recommended)

Manual Installation

Auto-Invocation (Optional)

Usage

Example Output

Included Files

Key Research Sources

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Prompt Optimizer Agent

Gemma 4 31b failure modes the optimizer mitigates

1. Thinking is always on; only responseSchema suppresses it

2. Property order in the schema controls emission order

3. Parse with raw_decode, not json.loads

Other failure modes the agent flags (see best-practices doc)

The Problem

Multi-Model Workflow

What This Agent Does

The 15-Item Checklist

Installation

As a Claude Code Plugin (Recommended)

Manual Installation

Auto-Invocation (Optional)

Usage

Example Output

Included Files

Key Research Sources

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

1. Thinking is always on; only `responseSchema` suppresses it

3. Parse with `raw_decode`, not `json.loads`

Packages