APE — Automated Prompt Engineer

A system that automatically finds the best prompt for solving math problems, then routes user queries to the right pipeline.

Instead of manually tweaking prompts, APE runs an optimization loop: it generates prompt variations, scores each one against a dataset, and carries the highest-scoring prompts forward across multiple generations.

How It Works

User Input
    ↓
Router (gpt-4o-mini classifier)
    ↓
┌─────────────┬──────────────────┐
│   Math QA   │  Email Evaluator │
│             │                  │
│ Best prompt │  LLM-as-judge    │
│ from APE    │  rubric scoring  │
└─────────────┴──────────────────┘
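
The routing step above can be sketched roughly as follows. This is an illustration, not the code in router.py: `route`, `ROUTES`, and the `classify` callable are hypothetical names, and the classifier is passed in as a plain function so the dispatch logic can run without an API call. In the real pipeline that callable would presumably wrap a gpt-4o-mini request.

```python
from typing import Callable

# The two pipeline labels named in this README.
ROUTES = ("math_qa", "email_gen")


def route(query: str, classify: Callable[[str], str], default: str = "math_qa") -> str:
    """Return the pipeline label for a query.

    `classify` is a hypothetical stand-in for the LLM call: it receives the
    query and returns free-form text that should contain one of the labels.
    """
    raw = classify(query).strip().lower()
    # Models sometimes wrap the label in extra words or punctuation, so
    # search for a known label instead of requiring an exact match.
    for label in ROUTES:
        if label in raw:
            return label
    # Unrecognized output: fall back to an assumed default route.
    return default


def keyword_classifier(query: str) -> str:
    """Toy classifier, used only to make this sketch runnable offline."""
    return "email_gen" if "email" in query.lower() else "math_qa"


if __name__ == "__main__":
    print(route("solve 3x + 7 = 22", keyword_classifier))
    print(route("draft a follow-up email to a client", keyword_classifier))
```

Searching for the label inside the model output, rather than comparing for equality, keeps the router tolerant of replies such as "Label: email_gen.".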

The APE optimization loop runs offline once:

Seed prompt → evaluate on 50 questions
    ↓
Generate 5 mutations (persona, reasoning style, format)
    ↓
Score each mutation
    ↓
Top 2 survive → repeat for N generations
    ↓
Best prompt saved to best_prompt.json
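
A minimal sketch of the loop above, under stated assumptions. `ape_search`, `toy_score`, and `toy_mutate` are hypothetical names; in the real loop, scoring would run the prompt over the 50 questions and mutation would ask the model for variations. This sketch also assumes that only the top survivor is mutated and that parents compete with their children, neither of which is confirmed by this README.

```python
from typing import Callable, List, Tuple


def ape_search(
    seed: str,
    score: Callable[[str], float],
    mutate: Callable[[str, int], List[str]],
    generations: int = 2,
    n_mutations: int = 5,
    n_survivors: int = 2,
) -> Tuple[str, float]:
    """Evolve prompts and return the best (prompt, score) pair seen."""
    # Generation 0: the seed prompt is the only survivor.
    survivors = [(seed, score(seed))]
    for _ in range(generations):
        parent = survivors[0][0]  # assumption: mutate only the current best
        candidates = [(p, score(p)) for p in mutate(parent, n_mutations)]
        # Pool parents with children so the best score never regresses.
        pool = survivors + candidates
        pool.sort(key=lambda pair: pair[1], reverse=True)
        survivors = pool[:n_survivors]
    return survivors[0]


# --- Toy stand-ins so the sketch runs without any API calls ---

SUFFIXES = [
    "Think step by step.",
    "You are a math tutor.",
    "Show your work.",
    "Answer concisely.",
    "Double-check the result.",
]


def toy_mutate(prompt: str, n: int) -> List[str]:
    """Append one fixed suffix per mutation (the real loop would use an LLM)."""
    return [f"{prompt} {suffix}" for suffix in SUFFIXES[:n]]


def toy_score(prompt: str) -> float:
    """Fraction of two target phrases present (stands in for dataset accuracy)."""
    targets = ("step by step", "Double-check")
    return sum(t in prompt for t in targets) / len(targets)


if __name__ == "__main__":
    best_prompt, best_score = ape_search("Solve the problem.", toy_score, toy_mutate)
    print(best_score, best_prompt)
```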

Project Structure

APE/
├── main.py               # Entry point — routes query, loads best prompt
├── router.py             # Classifies input as math_qa or email_gen
├── math_evaluator.py     # Solves + grades math questions
├── email_evaluator.py    # LLM-as-judge email scoring
├── ape_loop.py           # APE optimization loop
├── math_dataset.json     # 50 questions (25 medium, 25 hard)
├── best_prompt.json      # Output of APE loop (auto-generated)
├── ape_log.json          # Full generation history (auto-generated)
├── requirements.txt
└── .env                  # Your OpenAI API key (never commit this)

Setup

git clone https://github.com/Protham1/APE.git
cd APE
pip install -r requirements.txt
cp .env.example .env      # add your OpenAI API key

Usage

Step 1 — Run the APE optimization loop (once)

python ape_loop.py

This runs ~1100 API calls on gpt-4o-mini and takes around 15-25 minutes. It saves the best prompt found to best_prompt.json.
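
To get a feel for how the call count scales before running the loop, a rough estimator like the one below may help. Every parameter here is an assumption for illustration (the function name, the number of calls per question, and one mutation request per generation are not taken from ape_loop.py), so treat the output as a ballpark only.

```python
def estimate_api_calls(
    n_questions: int = 50,
    generations: int = 2,
    n_mutations: int = 5,
    calls_per_question: int = 1,
    mutation_calls_per_generation: int = 1,
) -> int:
    """Rough API-call budget for one run of an evolutionary prompt search."""
    # The seed plus every mutation in every generation is scored once.
    prompts_scored = 1 + generations * n_mutations
    scoring_calls = prompts_scored * n_questions * calls_per_question
    # Assumed: generating a batch of mutations costs a fixed number of calls.
    mutation_calls = generations * mutation_calls_per_generation
    return scoring_calls + mutation_calls


if __name__ == "__main__":
    print(estimate_api_calls())                      # one call per question
    print(estimate_api_calls(calls_per_question=2))  # e.g. solve + grade
```

With two generations and a hypothetical two calls per question, the estimate comes out at 1102, in the neighbourhood of the figure quoted above. That pairing is a guess, and the actual breakdown may differ.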

Step 2 — Run the pipeline

python main.py

The pipeline automatically loads the optimized prompt. If best_prompt.json doesn't exist yet, it falls back to the default seed prompt.
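
The fallback behaviour described above could look something like this. It is a sketch, not the code in main.py: `load_best_prompt` and `SEED_PROMPT` are hypothetical names, and the `"prompt"` key is an assumed schema for best_prompt.json.

```python
import json
from pathlib import Path

# Hypothetical seed prompt; the real default lives in the project code.
SEED_PROMPT = "Solve the problem step by step."


def load_best_prompt(path: str = "best_prompt.json", default: str = SEED_PROMPT) -> str:
    """Return the optimized prompt if the file exists and parses, else the default."""
    file = Path(path)
    if not file.exists():
        return default
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # A corrupt or half-written file should not crash the pipeline.
        return default
    # Assumed schema: {"prompt": "..."}. Adjust the key to match the real file.
    if isinstance(data, dict) and isinstance(data.get("prompt"), str):
        return data["prompt"]
    return default
```

Falling back on a parse error as well as on a missing file means an interrupted optimization run still leaves the pipeline usable.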

Example Queries

Enter your query: solve 3x + 7 = 22
→ Router detects: math_qa
→ Uses APE-optimized prompt
→ FINAL ANSWER: 5

Enter your query: draft a follow-up email to a client about a delayed project
→ Router detects: email_gen
→ Evaluates on clarity, professionalism, grammar
→ Score: 87%
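
One way a rubric could be collapsed into a single percentage like the one in the email example is sketched below. The three criteria come from the example; the 0 to 10 scale per criterion, the equal weighting, and the name `rubric_percentage` are assumptions, and the judge's per-criterion scores are supplied by hand here instead of coming from an LLM call.

```python
from typing import Dict

# Criteria named in the example above.
CRITERIA = ("clarity", "professionalism", "grammar")
MAX_SCORE = 10  # assumed per-criterion scale


def rubric_percentage(scores: Dict[str, float]) -> int:
    """Average per-criterion scores into a whole-number percentage."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing rubric scores: {missing}")
    # Clamp each score to the valid range in case the judge overshoots.
    total = sum(min(max(scores[c], 0), MAX_SCORE) for c in CRITERIA)
    return round(100 * total / (MAX_SCORE * len(CRITERIA)))


if __name__ == "__main__":
    # Illustrative values only, not real judge output.
    print(rubric_percentage({"clarity": 9, "professionalism": 9, "grammar": 8}))
```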

Tech Stack

Component      Details
Language       Python 3.10+
Models         gpt-4o-mini (all components)
Routing        Zero-shot classification
Math eval      Deterministic answer matching
Email eval     LLM-as-judge with rubric
Prompt search  Evolutionary mutation loop
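
Deterministic answer matching for the math pipeline might be sketched as below. The `FINAL ANSWER:` marker is taken from the example output earlier in this README; the function names, the numeric tolerance, and the string fallback for non-numeric answers are assumptions, and math_evaluator.py may grade differently.

```python
import re
from typing import Optional

# Capture everything after the marker up to the end of that line.
_ANSWER_RE = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, if any."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


def is_correct(response: str, expected: str, tol: float = 1e-6) -> bool:
    """Grade a model response against the expected answer without an LLM."""
    answer = extract_answer(response)
    if answer is None:
        return False  # no marker means the response cannot be graded
    try:
        # Numeric comparison, so "5" and "5.0" count as the same answer.
        return abs(float(answer) - float(expected)) <= tol
    except ValueError:
        # Non-numeric answers (e.g. "1/2") fall back to a case-insensitive match.
        return answer.lower() == expected.strip().lower()


if __name__ == "__main__":
    print(is_correct("3x = 15, so x = 5\nFINAL ANSWER: 5", "5.0"))
```

Because no model call is involved, grading the same response twice always gives the same result, which keeps scores comparable across generations.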

Dataset

50 math questions, none of them easy. The set is intentionally hard so that the seed prompt leaves APE room to improve:

Difficulty  Count  Topics
Medium      25     Arithmetic, Algebra, Probability
Hard        25     Arithmetic, Algebra, Probability

APE Results

Generation  Best Score  Prompt Strategy
Seed        ~75%        Default step-by-step
Gen 1       TBD         Mutated variations
Gen 2       TBD         Refined best survivors

(Update this table after running ape_loop.py)
