"What if you could run a tireless AI researcher on your hardest problem: one that reads the literature, designs experiments, runs them, and learns from every failure?"
That's ASI-Evolve, a general agentic framework that closes the loop of knowledge → hypothesis → experiment → analysis and repeats it autonomously, round after round, until it finds something that works.
We built it for AI research. But the loop doesn't care about domain.
A financial analyst, a biomedical engineer, a climate scientist, or a game developer can all plug their own problem into ASI-Evolve and let it search for better solutions than any human has time to manually explore.
The paper validated ASI-Evolve on four hard problems, each requiring expensive compute, open-ended search, and multi-dimensional feedback:
| Domain | What It Discovered | Gain |
|---|---|---|
| Neural Architecture Design | 105 SOTA linear attention architectures | +0.97 pts over DeltaNet (≈3× recent human gains) |
| Pretraining Data Curation | Evolved pipeline that selects cleaner training data | +3.96 pts avg, +18 pts on MMLU |
| RL Algorithm Design | Novel optimization mechanisms with mathematical innovations | +12.5 pts on AMC32 vs GRPO |
| Biomedical (Drug-Target Interaction) | Stronger architecture for cold-start generalization | +6.94 AUROC |
These are not toy benchmarks. These are real, frontier-level results, produced autonomously.
| Step | Action |
|---|---|
| 1 | LEARN – retrieve relevant prior knowledge |
| 2 | DESIGN – propose the next candidate program/idea |
| 3 | EXPERIMENT – run it and collect metrics |
| 4 | ANALYZE – turn results into reusable lessons |
Repeat the loop and improve each round.
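As a minimal sketch of that loop, here is a toy version in which the "program" is just a number being hill-climbed; all names are illustrative, not ASI-Evolve's actual interfaces:

```python
import random

def evolve(steps, evaluate, mutate, seed_program):
    """Toy LEARN -> DESIGN -> EXPERIMENT -> ANALYZE loop.
    'Programs' here are plain numbers; in ASI-Evolve they are real code."""
    database = [{"program": seed_program,
                 "score": evaluate(seed_program),
                 "lesson": "seed"}]
    for _ in range(steps):
        parent = max(database, key=lambda n: n["score"])  # LEARN: revisit the best trial so far
        candidate = mutate(parent["program"])             # DESIGN: propose a variant
        score = evaluate(candidate)                       # EXPERIMENT: measure it
        lesson = ("improved on parent" if score > parent["score"]
                  else "regressed vs parent")             # ANALYZE: record a lesson
        database.append({"program": candidate, "score": score, "lesson": lesson})
    return max(database, key=lambda n: n["score"])

# Example: hill-climb toward x = 3 on the score -(x - 3)^2.
random.seed(0)
best = evolve(200,
              evaluate=lambda x: -(x - 3) ** 2,
              mutate=lambda x: x + random.uniform(-0.5, 0.5),
              seed_program=0.0)
```

The real system replaces the greedy `max` with smarter parent sampling, and the one-line lesson with an Analyzer agent, but the control flow is the same.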
Three agents drive this loop:
- Researcher – reads the database and cognition store, proposes the next candidate
- Engineer – executes the candidate and collects structured metrics
- Analyzer – distills outcomes into transferable lessons for future rounds
Two memory systems keep the loop from going in circles:
- Cognition Store – inject your domain knowledge, papers, or heuristics upfront so the AI doesn't start from zero
- Experiment Database – every trial is stored with its motivation, code, result, and analysis; parent selection uses UCB1, greedy, random, or MAP-Elites island sampling
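As an illustration of UCB1 parent selection, here is a minimal sketch; the `score`/`visits` node fields are assumptions for the example, not the database's actual schema:

```python
import math

def ucb1_select(nodes, c=1.4):
    """Pick the next parent: exploit high-scoring nodes while still
    exploring rarely-revisited ones; unvisited nodes go first."""
    for n in nodes:
        if n["visits"] == 0:
            return n
    total = sum(n["visits"] for n in nodes)
    return max(nodes,
               key=lambda n: n["score"]
               + c * math.sqrt(math.log(total) / n["visits"]))

nodes = [
    {"id": "A", "score": 0.80, "visits": 10},  # strong but heavily mined
    {"id": "B", "score": 0.75, "visits": 1},   # slightly weaker, barely explored
]
```

Here the exploration bonus makes `B` the next parent even though `A` scores higher; raising `c` shifts the balance further toward exploration.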
You don't need to be an AI researcher. You need:
- A problem where better code = better outcome – optimization, algorithm design, pipeline tuning, simulation strategy
- An evaluation script – something that takes a program and returns a score
- Some domain knowledge – papers, rules of thumb, known good approaches
If you have those three things, ASI-Evolve can search the space for you.
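The evaluation script's contract is simply "program in, score out." Here is a minimal sketch for a toy sorting task; the `solve` entry point and the exact JSON shape are assumptions for illustration, not a fixed ASI-Evolve API:

```python
import importlib.util
import json
import sys
import time

def evaluate(candidate_path):
    """Load a candidate program and score it. The candidate must define
    solve(xs) that sorts a list; correctness gates the score, speed ranks it."""
    spec = importlib.util.spec_from_file_location("candidate", candidate_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    data = list(range(1000, 0, -1))
    start = time.perf_counter()
    out = mod.solve(list(data))
    elapsed = time.perf_counter() - start

    correct = out == sorted(data)
    score = (1.0 / (1e-6 + elapsed)) if correct else 0.0  # wrong answers score zero
    return {"score": score, "metrics": {"correct": correct, "seconds": elapsed}}

if __name__ == "__main__":
    print(json.dumps(evaluate(sys.argv[1])))
```

Anything you can phrase this way (load candidate, run it, emit a number) is fair game for the loop.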
Examples across domains:
| Industry | Problem You Can Throw At It |
|---|---|
| 🧬 Biomedicine | Drug-target interaction models, protein folding strategies, clinical trial algorithms |
| ⚡ ML Infra | Inference schedulers, KV-cache eviction policies, batching strategies, kernel tiling |
| 🌍 Climate / Energy | Grid load-balancing heuristics, carbon footprint optimization pipelines |
| 🎮 Game AI | Bot strategies, procedural level generation algorithms, reward shaping functions |
| 🏭 Manufacturing | Quality control classifiers, scheduling heuristics, defect detection pipelines |
| 🔬 Materials Science | Crystal structure search algorithms, synthesis route optimization |
| 📦 Logistics | Routing algorithms, warehouse assignment policies, demand forecasting models |
Scenario: You're an ML infra engineer. Your LLM serving system uses a fixed continuous-batching scheduler. You want to automatically discover a better request-scheduling policy that maximizes throughput while keeping P99 latency under budget, without spending weeks hand-tuning heuristics.
```bash
git clone https://github.com/GAIR-NLP/ASI-Evolve.git
cd ASI-Evolve
pip install -r requirements.txt
mkdir -p experiments/infer_scheduler/prompts
```

```
experiments/infer_scheduler/
├── input.md            – problem description
├── config.yaml         – API key + overrides
├── initial_program     – baseline scheduler to evolve from
├── init_cognition.py   – inject your domain knowledge
├── evaluator.py        – benchmark the candidate scheduler
└── eval.sh             – shell wrapper called each round
```
The problem description, `input.md`:

```markdown
# LLM Serving Scheduler
Write a Python function `schedule(queue, gpu_mem_free_gb)` that selects
a batch of requests from `queue` (list of dicts with keys: id, seq_len,
priority, wait_ms) given available GPU memory, and returns a list of
selected request ids. Optimize for: tokens/sec throughput (primary),
P99 latency ms (must stay < 500 ms).
```

The baseline to evolve from, `initial_program`:

```python
def schedule(queue, gpu_mem_free_gb):
    """FCFS baseline: take requests in arrival order (longest wait first)."""
    budget = int(gpu_mem_free_gb * 1024)  # rough token budget
    selected, used = [], 0
    for req in sorted(queue, key=lambda r: r["wait_ms"], reverse=True):
        if used + req["seq_len"] <= budget:
            selected.append(req["id"])
            used += req["seq_len"]
    return selected
```

Domain knowledge goes into the cognition store via `init_cognition.py`:

```python
from cognition.store import CognitionStore

store = CognitionStore(storage_dir="experiments/infer_scheduler/cognition_data")
store.add([
    {
        "title": "Continuous Batching",
        "content": "Orca-style iteration-level scheduling avoids head-of-line blocking "
                   "by preempting long requests and inserting shorter ones mid-flight.",
    },
    {
        "title": "Sequence Length Bucketing",
        "content": "Grouping requests by similar sequence length reduces padding waste "
                   "and improves GPU utilization significantly.",
    },
    {
        "title": "Priority + Starvation",
        "content": "Pure priority scheduling causes starvation. Age-weighted priority "
                   "(priority * log(1 + wait_ms)) balances latency and throughput.",
    },
])
```

```bash
python experiments/infer_scheduler/init_cognition.py
```

The benchmark harness, `evaluator.py`:

```python
import importlib.util
import json
import sys

from benchmark import run_serving_simulation  # your internal benchmark harness

def load(path):
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod.schedule

if __name__ == "__main__":
    fn = load(sys.argv[1])
    result = run_serving_simulation(scheduler_fn=fn, trace="traces/sharegpt_1k.jsonl")
    # Score = throughput; a latency-budget violation zeroes the score.
    score = result["tokens_per_sec"] if result["p99_latency_ms"] < 500 else 0
    print(json.dumps({"score": score, "metrics": result}))
```

`eval.sh` just forwards the candidate path:

```bash
#!/bin/bash
python /path/to/experiments/infer_scheduler/evaluator.py "$1"
```

Then launch the run:

```bash
python main.py \
  --experiment infer_scheduler \
  --steps 40 \
  --sample-n 3 \
  --eval-script /path/to/experiments/infer_scheduler/eval.sh
```

ASI-Evolve will iterate through scheduling policies, from simple heuristics to multi-factor priority functions with dynamic chunking, and write a structured lesson after every trial. After 40 rounds you have a ranked database of every policy tried, why each worked or failed, and the best-performing code ready to deploy.
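For a feel of what such a run can converge on, here is a hand-written sketch of an age-weighted priority policy built from the cognition-store hints above; it is an illustration, not an actual ASI-Evolve output:

```python
import math

def schedule(queue, gpu_mem_free_gb):
    """Age-weighted priority scheduler: rank requests by
    priority * log(1 + wait_ms), so high-priority work runs first but
    long-waiting requests cannot starve; then pack greedily into the
    token budget."""
    budget = int(gpu_mem_free_gb * 1024)  # same rough token budget as the baseline
    ranked = sorted(queue,
                    key=lambda r: r["priority"] * math.log1p(r["wait_ms"]),
                    reverse=True)
    selected, used = [], 0
    for req in ranked:
        if used + req["seq_len"] <= budget:
            selected.append(req["id"])
            used += req["seq_len"]
    return selected
```

Unlike the FCFS baseline, this policy lets a fresh high-priority request jump the queue, while the `log1p` term guarantees that a long-stuck low-priority request eventually outranks it.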
| Manual Research | ASI-Evolve |
|---|---|
| Try 5–10 ideas per week | 50–200 candidates per run |
| Knowledge stays in one person's head | Every insight written to the database |
| Cold start on each new hypothesis | Cognition store primes every round |
| Hard to know why something worked | Analyzer explains every outcome |
| Results hard to reproduce | Full experiment tree stored on disk |
```
ASI-Evolve/
├── main.py        – entry point
├── config.yaml    – global defaults
├── pipeline/      – Researcher, Engineer, Analyzer agents
├── cognition/     – cognition store (embedding + FAISS)
├── database/      – experiment database (nodes + sampling)
├── utils/
└── experiments/
    ├── circle_packing_demo/   – included runnable demo
    └── best/circle_packing/   – top programs from our ablation runs
```
```bash
pip install -r requirements.txt
```

Requirements:
- Python 3.10+
- `bash` and `python3` on your system path
- Any OpenAI-compatible API endpoint (GPT-4o, Claude, Gemini, local models via LiteLLM, etc.)
- Optional: Weights & Biases for experiment tracking
The included demo evolves a program to pack 26 circles in a unit square, a clean benchmark used in our ablation studies.
```bash
# Initialize the cognition store
python experiments/circle_packing_demo/init_cognition.py

# Run 10 evolution steps
python main.py \
  --experiment circle_packing_demo \
  --steps 10 \
  --sample-n 3 \
  --eval-script /path/to/experiments/circle_packing_demo/eval.sh
```

ASI-Evolve reaches SOTA-level circle-packing results in as few as 17 rounds.
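Under the hood, the demo's evaluator scores each candidate packing. Here is a sketch of such a scorer, assuming the common formulation of maximizing the sum of radii of non-overlapping circles inside the unit square; the demo's exact objective may differ:

```python
import math

def packing_score(circles, eps=1e-9):
    """Score a candidate packing: sum of radii if valid, else 0.
    circles is a list of (x, y, r) triples; every circle must lie inside
    the unit square and no two circles may overlap."""
    for x, y, r in circles:
        # Reject circles that poke outside the unit square (or are degenerate).
        if r <= 0 or x - r < -eps or x + r > 1 + eps or y - r < -eps or y + r > 1 + eps:
            return 0.0
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            # Reject any pair closer than the sum of their radii.
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - eps:
                return 0.0
    return sum(r for _, _, r in circles)
```

The hard-zero on invalid packings mirrors the serving example above: constraint violations kill the score, so evolution is pushed toward feasible solutions first and better ones second.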
Configuration merges in this order (later overrides earlier):
1. `config.yaml` at the repository root
2. `experiments/<name>/config.yaml`
3. An explicit file passed with `--config`
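That "later overrides earlier" behavior can be sketched as a recursive dict merge; this is an illustration with made-up values, not ASI-Evolve's actual loader:

```python
def deep_merge(base, override):
    """Merge two config dicts: override wins on conflicts, but nested
    dicts are merged key-by-key instead of being replaced wholesale."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# repo default <- experiment config <- --config file, applied left to right
repo = {"api": {"model": "gpt-4o", "timeout": 60}, "pipeline": {"sample_n": 3}}
experiment = {"api": {"model": "claude-sonnet"}}
cli = {"pipeline": {"sample_n": 5}}
final = deep_merge(deep_merge(repo, experiment), cli)
```

The experiment file overrides only `api.model` while `api.timeout` survives from the repo default, which is why per-experiment configs can stay tiny.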
Key settings:
| Key | What It Controls |
|---|---|
| `api.model` | The LLM driving all agents |
| `pipeline.sample_n` | How many historical nodes to show the Researcher each round |
| `pipeline.parallel.num_workers` | Parallel evolution workers (2–4 for production) |
| `database.sampling.algorithm` | `ucb1` / `greedy` / `random` / `island` |
| `cognition.retrieval.top_k` | How many cognition items to retrieve per round |
If you use ASI-Evolve in your work, please cite:
```bibtex
@misc{asi_evolve_2026,
  title  = {ASI-Evolve: AI Accelerates AI},
  author = {Xu, Weixian and Mi, Tiantian and Liu, Yixiu and Nan, Yang and Zhou, Zhimeng and Ye, Lyumanshan and Zhang, Lin and Qiao, Yu and Liu, Pengfei},
  year   = {2026},
  note   = {SJTU / SII / GAIR. https://github.com/GAIR-NLP/ASI-Evolve}
}
```

ASI-Evolve is open-source and ready for your domain.
Fork it. Point it at your problem. Let it run.
