A-EVO-Lab

A-Evo Lab (Agentic Evolution Laboratory) 🧬

The path to recursive self-improvement (RSI) is to let AI take over how humans build AI.

A-Evo Lab, led by Henry Lu, studies self-evolving agents under one thesis — AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. Today humans build AI in three critical stages — pre-training → post-training → harness building. We are building an autonomous AI researcher for each, have reached SOTA results where we've shipped, and develop everything on one shared stack, A-Evolve, so we can iterate fast.

🗺 The Map

Human stage of building AI	Our program	What the AI researcher does	Status
Harness building	AI-Harness	Evolves prompts / skills / memory / tools around a frozen model	✅ SOTA across benchmarks
↳ long-running deployment	AI-Harness · Adaptive	Sustains performance on open-ended task streams	✅ Leads every reported stream metric
Post-training	AI-Training	Designs data mixtures, schedules, hyperparameters & ablations end-to-end	🔜 First public datapoint of Auto-post-training at 30B scale
Pre-training	AI-Pretraining	—	🧭 The open frontier

🛠 AI-Harness — replacing human harness engineering

With zero manual harness engineering, A-Evolve's reference algorithms push a single Claude Opus 4.6 base model to top-tier performance across diverse agentic benchmarks:

Benchmark	Rank / status	Result
🟢 MCP-Atlas	🥇 #1	Baseline → 79.4% (+3.4pp)
🔵 SWE-bench Verified	~#5	Baseline → 76.8% (+2.6pp)
🟣 Terminal-Bench 2.0	~#7	Baseline → 76.5% (+13.0pp)
🟡 SkillsBench	#2	Baseline → 34.9% (+15.2pp)
🟢 ARC-AGI	🥇 #2 Community Leaderboard	Baseline → 12.3% (+2.2pp)
🔵 OSWorld	—	Baseline → 69.6% (+3.9pp)
🟣 SWE-bench Lite	Evolved	63.7 → 67.0% (+3.3pp)
🟡 τ-bench	Evolved	72.7 → 77.0% (+4.3pp)
🟢 CL-Bench	Evolved	29.5 → 34.0% (+4.5pp)
🔵 WebArena-Infinity	Evolved	72.5 → 76.3% (+3.8pp)

Single Claude Opus 4.6 base model, evolved with A-Evolve's reference algorithms, with 0 hours of human harness engineering. CL-Bench, SWE-bench Lite, τ-bench, and WebArena-Infinity show before → after scores on the same base model. Data checked March 2026.
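Six of the entries above report only the evolved score and the gain in percentage points, so the baseline is implicit. The sketch below uses only the figures quoted above to recover each implied baseline and to express the gain relative to it. The function names are ours and exist purely for illustration.

```python
# (evolved score in %, absolute gain in percentage points), copied from the results above.
REPORTED = {
    "MCP-Atlas": (79.4, 3.4),
    "SWE-bench Verified": (76.8, 2.6),
    "Terminal-Bench 2.0": (76.5, 13.0),
    "SkillsBench": (34.9, 15.2),
    "ARC-AGI": (12.3, 2.2),
    "OSWorld": (69.6, 3.9),
}


def implied_baseline(evolved: float, gain_pp: float) -> float:
    """Baseline score implied by an evolved score and an absolute gain in pp."""
    return round(evolved - gain_pp, 1)


def relative_gain(evolved: float, gain_pp: float) -> float:
    """Gain as a percentage of the implied baseline (relative, not pp)."""
    return round(100.0 * gain_pp / (evolved - gain_pp), 1)


if __name__ == "__main__":
    for name, (evolved, gain) in REPORTED.items():
        print(
            f"{name}: implied baseline {implied_baseline(evolved, gain)}%, "
            f"relative gain {relative_gain(evolved, gain)}%"
        )
```

Percentage points and relative gains answer different questions: a +13.0pp gain on Terminal-Bench 2.0 starts from a much higher implied baseline than the +15.2pp gain on SkillsBench, so the relative improvements differ far more than the pp figures suggest.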

Key finding: evolver capability decouples from harness quality. A 9B model (Qwen3.5) writes harness updates as good as those from Claude Opus 4.6 (best-vs-worst evolver gap ≤ 3.1pp). The benefit, however, is non-monotonic in agent capability: mid-tier agents gain most, while weak agents fail to even load the harness. Implication: put your capability budget on the agent, not the evolver.
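To make the two quantities in this finding concrete, here is a minimal sketch of how a best-vs-worst evolver gap and a non-monotonic benefit curve could be computed from a grid of per-pair gains. The grid values are toy numbers invented for illustration and are not results from the paper; the function names and the aggregation (a plain mean over solvers) are also our assumptions.

```python
from statistics import mean

# TOY numbers, invented for illustration only. NOT results from the paper.
# TOY_GAINS[evolver][solver] = harness benefit in pp (evolved minus baseline).
# Solvers are listed from weakest to strongest.
TOY_GAINS = {
    "small-evolver": {"weak": 0.0, "mid": 6.0, "strong": 2.0},
    "large-evolver": {"weak": 0.0, "mid": 7.0, "strong": 3.0},
}


def evolver_spread(gains):
    """Best-minus-worst mean gain across evolvers, in pp."""
    per_evolver = [mean(by_solver.values()) for by_solver in gains.values()]
    return max(per_evolver) - min(per_evolver)


def gain_by_solver(gains):
    """Mean gain per solver, averaged over all evolvers."""
    solvers = list(next(iter(gains.values())))
    return {s: mean(g[s] for g in gains.values()) for s in solvers}


def is_non_monotonic(values):
    """True if the sequence neither only rises nor only falls."""
    pairs = list(zip(values, values[1:]))
    rising = all(a <= b for a, b in pairs)
    falling = all(a >= b for a, b in pairs)
    return not (rising or falling)


if __name__ == "__main__":
    by_solver = gain_by_solver(TOY_GAINS)
    print("evolver spread (pp):", round(evolver_spread(TOY_GAINS), 2))
    print("gain by solver:", by_solver)
    print("non-monotonic:", is_non_monotonic(list(by_solver.values())))
```

A small spread with a peaked per-solver curve is the pattern described above: swapping the evolver changes little, while the choice of agent changes a lot.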

📄 Evolver-Solver-Bench — Harness Updating Is Not Harness Benefit. arXiv 2605.30621 · HF Daily

📄 Evo-Harness — Context-to-Harness Skill Compilation (online evolution: feedback grounding, abstraction level, solver–evolver alignment). Releasing soon.

↳ Adaptive — sustaining agents on long-running streams

Naive self-evolving agents peak early and then decline — a single dense harness overfits to early evidence. Adaptive Auto-Harness fixes this with a stateful multi-agent evolver, a harness tree with solve-time routing, and scoped human-steering hooks — leading every reported metric against five auto-harness baselines plus the human-designed OctoTools:

Stream	Domain	A-Evolve-Adaptive	Next best
PolyBench	Prediction markets	80.9% Accuracy	50.8%
CTF-Dojo	Security competitions	50.2% Pass	45.2%
FutureX	Event forecasting	49.5% Pass	47.5%

📄 Adaptive Auto-Harness — Sustained Self-Improvement on Open-Ended Task Streams. arXiv 2606.01770
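One way to picture a harness tree with solve-time routing is sketched below. This is our own illustration under stated assumptions, not the design from the paper: the `HarnessNode` class, the keyword predicates, and the `route` and `build_prompt` helpers are all hypothetical. Each node holds guidance scoped to a slice of the task stream, and routing walks from the root to the most specific matching node, so a lesson learned on one domain is not forced onto every task.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessNode:
    name: str
    matches: Callable[[str], bool]  # hypothetical routing predicate on the task text
    guidance: str                   # harness content scoped to this node
    children: List["HarnessNode"] = field(default_factory=list)


def route(root: HarnessNode, task: str) -> List[HarnessNode]:
    """Walk from the root to the most specific node whose predicate accepts the task."""
    path = [root]
    node = root
    while True:
        child = next((c for c in node.children if c.matches(task)), None)
        if child is None:
            return path
        path.append(child)
        node = child


def build_prompt(root: HarnessNode, task: str) -> str:
    """Concatenate guidance along the routed path, general first, specific last."""
    guidance = "\n".join(node.guidance for node in route(root, task))
    return f"{guidance}\n\nTask: {task}"


# Toy tree with keyword predicates; a real router would be far richer.
tree = HarnessNode(
    "root",
    lambda task: True,
    "General guidance.",
    [
        HarnessNode("forecasting", lambda t: "forecast" in t.lower(), "Forecasting guidance."),
        HarnessNode("security", lambda t: "ctf" in t.lower(), "Security guidance."),
    ],
)

if __name__ == "__main__":
    print(build_prompt(tree, "CTF: find the flag in the binary"))
```

Tasks that match no child fall back to the root guidance alone, which is the property that keeps early, narrow evidence from dominating every later task.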

🧪 AI-Training — replacing human post-training

The same loop, carried all the way into model weights: an evolver autonomously runs end-to-end 30B post-training — designing data mixtures, training schedules, hyperparameter regimes, and ablation protocols — reaching parity with a human post-training team. To our knowledge, the first time an autonomous system has done so at this scale.

Four self-directed rounds on a production GPU cluster. The autonomously produced model placed 8th of ~4,000 on NVIDIA's Nemotron Reasoning Challenge (snapshot 6/1/2026) — one point behind the top human team.

The same autonomous system has since post-trained the 120B and 550B Nemotron models end-to-end — evidence the loop closes at that scale too. (No public human baseline exists there yet, so we report it as infrastructure evidence, not a competitiveness claim.)

Tech report: Tech Blog · Tech Report.

🧭 AI-Pretraining — the open frontier

The largest and most expensive stage of building AI — and the one we have not automated yet. It is where this thesis goes next.

⚙️ One Shared Stack: A-Evolve

Every result above was developed on A-Evolve, our open-source infrastructure for self-improving agents — "the PyTorch for Agentic AI." It evolves any agent, in any domain, with any evolution algorithm, and is what makes fast iteration across all three programs possible.

import agent_evolve as ae

evolver = ae.Evolver(agent="./my_agent", benchmark="swe-verified")
results = evolver.run(cycles=10)        # SOTA agent. 3 lines. 0 hours of manual harness engineering.
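For readers who want a feel for what one evolution cycle involves, here is a self-contained toy sketch of the general idea: propose an edit to the harness around a frozen model, score it, and keep it only if the score improves. It does not use or mirror the `agent_evolve` API. The `Harness` dataclass, the `evolve` function, and the toy evaluator and proposer are all hypothetical stand-ins, and greedy acceptance is just one simple choice of evolution algorithm.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Harness:
    """Everything around the frozen model that an evolver may edit (hypothetical)."""
    system_prompt: str = ""
    skills: Tuple[str, ...] = ()
    memory: Tuple[str, ...] = ()


Evaluate = Callable[[Harness], float]          # benchmark score for a harness
Propose = Callable[[Harness, float], Harness]  # evolution algorithm: one proposed edit


def evolve(harness: Harness, evaluate: Evaluate, propose: Propose, cycles: int):
    """Greedy loop: keep a proposed harness only if it scores strictly higher."""
    best, best_score = harness, evaluate(harness)
    for _ in range(cycles):
        candidate = propose(best, best_score)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the sketch runs end to end without a model or a benchmark.
USEFUL_SKILLS = ("read-tests-first", "run-linter")


def toy_evaluate(harness: Harness) -> float:
    """Fraction of the useful skills that the harness already contains."""
    return sum(s in harness.skills for s in USEFUL_SKILLS) / len(USEFUL_SKILLS)


def toy_propose(harness: Harness, _score: float) -> Harness:
    """Add the first missing useful skill, or return the harness unchanged."""
    missing = [s for s in USEFUL_SKILLS if s not in harness.skills]
    if not missing:
        return harness
    return replace(harness, skills=harness.skills + (missing[0],))


best, score = evolve(
    Harness(system_prompt="You are a coding agent."),
    toy_evaluate,
    toy_propose,
    cycles=3,
)

if __name__ == "__main__":
    print(best.skills, score)
```

Keeping the evaluator and the proposer as plain callables is what lets the same loop be reused across agents, domains, and algorithms in this sketch.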

Adopted & integrated by: OpenRLHF · DeepSpeed · SGLang · GEPA · AutoResearch

⭐ Star the repo → github.com/A-EVO-Lab/a-evolve

📫 Contact

Building in this direction, or want to collaborate? Reach out — X / Twitter · LinkedIn.

📢 News

06/11 New Tech Report on Auto-post-training, A-Evolve-Training: Autonomous Post-Training of a 30B Model. We built an AI system that ran the post-training loop for a 30B model with no human in the loop, over four self-directed rounds on GPU clusters. The autonomously produced model placed 8th of ~4,000 on NVIDIA's Nemotron Reasoning Challenge, one point behind the top human team. The same autonomous system has since post-trained the 120B and 550B Nemotron models. This is, to the best of our knowledge, the first public evidence at this scale.
06/01 New Research Paper, Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams (arXiv 2606.01770). We address the brittleness of traditional auto-harness systems when moving from fixed benchmarks to open-ended, shifting task streams. We introduce Adaptive Auto-Harness, a framework that significantly outperforms five existing auto-harness baselines across prediction-market, security-competition, and event-forecasting streams. Code and algorithms are available at A-Evolve.
05/30 New Paper, Harness Updating Is Not Harness Benefit (arXiv 2605.30621). 7 evolver models × 6 solver agents × 3 benchmarks: counterintuitive answers on who produces good harness updates and who benefits. Code and algorithms are available at A-Evolve.
05/04 New Benchmark Results — A-Evolve results on ARC-AGI-3, evolving a multi-agent system from 10% → 12%.
04/20 New Algorithm — GEPA, submitted by the GEPA team.
04/10 Integration — into Orch-Research Skills Library, alongside AutoResearch, OpenRLHF, DeepSpeed, SGLang.
04/07 New Agent — transplanted our Terminal-Bench 2.0 harness onto ClawCode: 67.8% → 72.9% (+5.1pp).
04/03 New Algorithm — Meta-Harness.
03/25 🚀 Open-sourced A-Evolve + 4 reference algorithms achieving SOTA (#1, ~#5, ~#7, #2) on MCP-Atlas, SWE-bench Verified, Terminal-Bench 2.0, SkillsBench.
02/17 📄 Position paper: Agentic Evolution is the Path to Evolving LLMs (arXiv 2602.00359).

We are evolving fast — support our research by leaving a ⭐ on A-Evolve.

LinkedIn | Twitter/X
