Skip to content

edholofy/dojo.md

Repository files navigation

dojo.md

dojo.md — Training Arena for AI Agents

Your agent demos well. It fails in production. dojo.md fixes that.

Train any model through scenario-based courses. Graduate with a SKILL.md — portable expertise that makes agents reliable. No fine-tuning. No weight modification. Just knowledge, distilled and proven.

npm CI License: MIT

Works with Claude Code, Codex, OpenClaw, Cursor, Windsurf, and any MCP-compatible agent.

"Hey Claude, train yourself on cold-email-b2b and loop until you hit 90"

  Iteration 1: 31/100 — doesn't know subject line rules, no personalization
  Iteration 2: 58/100 — SKILL.md injected, learns "under 50 chars, no caps"
  Iteration 3: 74/100 — gets structure right, still weak on CTAs
  Iteration 4: 86/100 — nails the pain→solution→ask framework
  Iteration 5: 91/100 — ✓ target reached

  SKILL.md → .claude/skills/cold-email-b2b/SKILL.md

Now every cold email it writes follows the framework. Permanently.

What Just Happened

You told Claude Code to train itself. It ran through 50 scenarios of progressively harder cold emails — bad subject lines, wrong tone, missing personalization, weak CTAs. An LLM judge scored every attempt. After 5 iterations, it extracted everything it learned into a SKILL.md that lives in your project forever.

Next time you say "write a cold email to this VP of Engineering", it loads the SKILL.md automatically. It knows the rules now.

This works for anything:

# Your agent writes terrible Google Ads? Fix it in 5 minutes
dojo train ad-copy-google-ads --target 85

# Support agent keeps giving wrong refund info? Train it
dojo train stripe-refunds --target 90

# Code reviews are too vague? There's a course for that
dojo train code-review-feedback-writing --target 85

# Incident postmortems are weak? Train on 50 real scenarios
dojo train incident-postmortem-writing --target 80

# Your agent can't write a proper RFC? Now it can
dojo train technical-rfc-writing --target 85

How the loop works

Scenarios → Mock Services → LLM Judge → Failure Patterns → SKILL.md → Re-inject → Repeat

Each iteration: the agent gets smarter. The SKILL.md compounds. It stops when it hits the target or plateaus.

Per-model skills

Claude, GPT, DeepSeek — they all fail differently. Each gets its own SKILL.md:

.claude/skills/cold-email-b2b/
├── anthropic--claude-sonnet-4-6/SKILL.md    # Was too formal, learned casual tone
├── openai--gpt-4o/SKILL.md                  # Was too long, learned brevity
└── deepseek--deepseek-v3.2/SKILL.md         # Missed personalization hooks

Quick Start

Option 1: Zero-cost with Claude Code or Codex (recommended)

Already paying for Claude Code or Codex? Training costs $0 extra. The agent trains AND judges itself — no API keys needed.

Just paste this into Claude Code:

Install dojo.md as an MCP server, then train yourself on cold-email-b2b
using autopilot mode. Loop until you hit 90.

Or add dojo as an MCP server manually:

{
  "mcpServers": {
    "dojo": { "command": "npx", "args": ["dojo.md", "mcp"] }
  }
}

Then tell your agent what to train on. It handles the rest.

CLI (dojo train) Autopilot (Claude Code / Codex)
Cost ~$0.50–5 per run $0 extra
Agent API calls Your subscription
Judge API calls Agent self-judges
Setup API keys required Just MCP config

Option 2: CLI with any model via OpenRouter

npm install -g dojo.md
export OPENROUTER_API_KEY=sk-or-...

# Train DeepSeek on Google Ads copy for $0.03
dojo train ad-copy-google-ads --model deepseek/deepseek-v3.2 --target 85

# Train GPT-5 on incident response, judged by Claude
dojo train incident-response --model openai/gpt-5.2 --judge claude-sonnet-4-6 --target 90

# Train Gemini on customer support escalation
dojo train customer-support-escalation --model google/gemini-3-flash-preview --target 80

Arena — Model Benchmarking

Compare models head-to-head on the same course. Same judge, same scenarios, no SKILL.md — raw capability only.

dojo arena ad-copy-google-ads --level 1

═══ Arena Leaderboard ════════════════════════
  1st  Claude Opus 4.6    █████████████████░░░  84
  2nd  Claude Sonnet 4.6  █████████████████░░░  84
  3rd  GPT-5.2            ████████████████░░░░  82
  4th  GLM 5              ████████████████░░░░  79
  5th  Gemini 3 Flash     ███████████████░░░░░  76
══════════════════════════════════════════════

Above 70, every point gets exponentially harder — like ELO, small gaps mean big differences. See the live leaderboard.

Any Model

200+ models via OpenRouter:

dojo train cold-email-b2b --model openai/gpt-4o
dojo train cold-email-b2b --model google/gemini-2.5-pro
dojo train cold-email-b2b --model deepseek/deepseek-v3.2
dojo train cold-email-b2b --model x-ai/grok-4.1-fast
dojo train cold-email-b2b --model meta-llama/llama-3.3-70b-instruct

125 Pre-Built Courses (6,250+ Scenarios)

Domain Examples Courses
Customer Support Stripe refunds, escalation, churn prevention, SLA breaches, onboarding 14
Marketing & Content Google Ads, Meta ads, SEO blogs, email sequences, social media, UGC 18
Sales & Revenue Cold email B2B, objection handling, proposals, battlecards, lead scoring 9
Engineering & DevOps Incident response, Docker, Kubernetes, CI/CD, AWS Lambda, security 17
Writing & Docs Technical RFCs, postmortems, SOPs, newsletters, Twitter/X threads 16
Data & Analytics A/B testing, cohort analysis, segmentation, funnel analysis, forecasting 9
Design & UX Accessibility audits, design systems, user personas, journey mapping 9
Education Quiz creation, study guides, workshop facilitation, training materials
Legal & Compliance Contract review, compliance checklists, clause summarization
Real Estate Listing descriptions, open house promos, buyer inquiry response
Healthcare Appointment reminders, intake review, billing inquiries, pre-auth
dojo list                    # See all 125 courses
dojo generate "Handle Zendesk ticket routing and priority assignment"  # Create your own

Works With Everything

dojo.md generates AgentSkills-standard SKILL.md files. Train once, use everywhere.

Claude Code

dojo.md is an MCP server — train from inside your IDE:

{
  "mcpServers": {
    "dojo": {
      "command": "npx",
      "args": ["dojo.md", "mcp"]
    }
  }
}

MCP tools: dojo_discover, dojo_train, dojo_tool, dojo_submit, dojo_results, dojo_skill, dojo_apply

OpenClaw

Drop your graduated SKILL.md into OpenClaw's skill directory. dojo.md skills follow the same AgentSkills standard — cross-compatible by design.

ClawHub has 13,000+ community skills. The difference: dojo skills are earned, not written. Every SKILL.md has a training score, validated scenarios, and failure patterns it addresses. It's a diploma, not a blog post.

Cursor, Windsurf, and any MCP agent

Same MCP config. Same SKILL.md output. Portable.

The SKILL.md Standard

Generated skills follow the AgentSkills open standard:

---
name: stripe-refunds
description: >-
  Handle Stripe refund requests correctly. Use when processing
  refunds, duplicate charges, or customer disputes.
---

## Domain Knowledge
[Non-obvious insights distilled from training curriculum]

## Quick Start
[Most common failure, corrected]

## Core Rules
[Freedom-calibrated: ALWAYS/step-by-step/prefer]

## Decision Tree
[If/then branching logic]

## Edge Cases
[Every trap, with correct handling]

## Anti-Patterns
[DON'T X. Instead, Y.]

The description triggers loading — ~100 tokens idle, ~5,000 tokens when activated. Progressive disclosure keeps context clean.

CLI Reference

Command Description
dojo train <course> Run training session
dojo train <course> -m openai/gpt-4o -j claude-sonnet-4-6 -t 85 Full multi-model auto-loop
dojo retrain <course> Auto-loop with defaults (target 90, max 5)
dojo arena <course> Benchmark multiple models head-to-head
dojo arena <course> --models m1,m2,m3 Arena with specific models
dojo results [course] Show latest results
dojo list List installed courses
dojo generate <skill> Generate a course from description

Train Options

Flag Description Default
-m, --model Agent model claude-sonnet-4-6
-j, --judge Judge model claude-sonnet-4-6
-t, --target Target score (enables auto-loop)
--max-retrain Max loop iterations 5
--level Run specific level only all
--report Save detailed report

Arena Options

Flag Description Default
--models Comma-separated model list top 5 models
-j, --judge Shared judge model claude-opus-4-6
-l, --level Run specific level only all
-o, --output Output JSON path auto-generated

Scenario Format

meta:
  id: simple-refund
  level: 1
  course: stripe-refunds
  description: Process a straightforward refund
  type: tool

state:
  customers:
    - id: cus_001
      email: alice@example.com
      name: Alice Johnson
  charges:
    - id: ch_001
      amount: 5000
      customer: cus_001
      status: succeeded

trigger: >
  Customer Alice Johnson (cus_001) is requesting
  a refund for charge ch_001 ($50.00).

assertions:
  - type: api_called
    tool: stripe_customers_retrieve
    description: Verify customer identity
  - type: api_called
    tool: stripe_refunds_create
    params: { charge: ch_001 }
    description: Create the refund
  - type: llm_judge
    criteria: >
      Agent confirms refund was processed and explains
      the 5-10 business day timeline for the credit
      to appear on the customer's statement.
    description: Communicate success with timeline

Development

git clone https://github.com/edholofy/dojo.md
cd dojo.md
npm install
npm run build
npm test       # 116 tests

# Dev mode
npm run dev -- train stripe-refunds

Mission

Turn experience into expertise for AI agents.

Today: Author courses, train models, graduate with SKILL.md. Tomorrow: Production feedback loops that generate scenarios from real failures. Future: The open knowledge layer for agent expertise — proven, portable, model-agnostic.

License

MIT

About

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors