SQL Debug Environment (`sql-debug-env`)

title	Sql Debug Env
emoji	💻
colorFrom	indigo
colorTo	gray
sdk	docker
pinned	false

SQL Debug Environment (`sql-debug-env`)

Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.

Quick Links

Maintained with reproducible Space evidence and benchmark artifacts.

Live Space: https://md896-sql-debug-env.hf.space
Demo page: https://md896-sql-debug-env.hf.space/demo
Gradio app: https://md896-sql-debug-env.hf.space/gradio/
Swagger: https://md896-sql-debug-env.hf.space/docs
OpenAPI: https://md896-sql-debug-env.hf.space/openapi.json
GitHub: https://github.com/mdayan8/sql-debug-env
W&B dashboard: https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag

Model Card Highlights

Model: md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2

Field	Value
Task	Text generation (SQL repair style prompts)
Libraries	Transformers, TRL (GRPO), Safetensors, TGI-compatible
Family tags	qwen2, grpo, conversational, text-generation-inference
Base tracks used in workflow	Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track
Training signal	Execution-grounded reward from OpenEnv SQL tasks
Reference	arXiv:1910.09700 (as listed in model metadata)

Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

static checks catch syntax, not business-logic correctness
generated SQL can look plausible and still fail at execution time
production schemas and data distribution shifts expose brittle query behavior

This environment is designed to optimize for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.

Training Pipeline (How We Trained)

The project uses an execution-grounded RL loop where model outputs are scored by actual SQL runtime behavior.

Stage	What happens	Output artifact
1. Bridge run	Fast wiring validation on smaller model (0.5B track)	Session traces + first sanity metrics
2. Baseline eval	Run base 7B model on the same deterministic tasks	Baseline benchmark charts
3. RL training (GRPO)	Optimize policy with reward from SQL execution + grader signals	Trained checkpoints + W&B logs
4. Hard-test eval	Re-run benchmark suite post-training	Delta charts + proof PNGs
5. Publish	Push model + Space artifacts	HF model card + Space static evidence

Training references:

Colab training entrypoint: colab_real_world.py
Job launcher: launch_job.py
Main training/eval script: ultimate_sota_training.py
W&B workspace: sql-debug-grpo-best-budget

Agent Workflow (How It Works)

At runtime, the agent does not "one-shot" SQL. It iterates through an environment loop:

POST /reset initializes a task and returns typed observation.
Agent proposes action (submit_query, inspect_schema, inspect_error, inspect_sample, reset_query).
POST /step executes action in isolated session state (in-memory SQLite).
Environment returns next observation + reward + done + info.
Loop repeats until done/timeout, with reward shaping encouraging valid, safe, task-correct SQL.

This design converts SQL debugging into a measurable control loop rather than prompt-only guessing.

Benchmark Snapshot

Metric snapshot	Value
Spider chart: Industry baseline	48.2%
Spider chart: Qwen-7B base	52.4%
Spider chart: RL agent	78.5%
Performance leap view	0.0% -> 25.0%
Eval artifact pass	32-run

Benchmark Themes and Failure Classes

Task themes

Theme	Representative task	Typical difficulty driver
Syntax repair	`easy_syntax_fix`	Broken tokens / malformed SQL
Logic and grouping	`medium_logic_fix`	Aggregation correctness
Multi-bug joins	`hard_multi_bug`	Alias/join/key mismatch
Finance/cartesian risk	`hard_finance_explosion`	Fan-trap joins, unsafe blow-ups

Common failure classes tracked

Failure class	Symptom	Why execution feedback matters
Schema mismatch	Unknown table/column	Immediate error + corrective signal
Join logic bug	Duplicated/missing rows	Captures semantic correctness beyond syntax
Aggregation drift	Wrong totals / GROUP BY	Deterministic graders expose numeric error
Unsafe behavior	Risky or invalid query path	Reviewer/guardrail can block while preserving signal
Hallucinated structure	Invented fields/relations	Penalized by execution + schema checks

Proof and Evidence Artifacts

End-to-end environment map

Core benchmark visuals

Training and evaluation proofs

Run folders and model

Sample rewards (32 eval): HF artifacts folder
Earlier 32-eval pass: HF artifacts folder
Model card: md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2

System Architecture

flowchart LR
    agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
    api --> env[SQLDebugEnv]
    env --> db[In-memory SQLite DB]
    env --> tasks[Task Registry + Graders]
    tasks --> reward[Reward Engine]
    env --> reward
    reward --> api

Core components:

API layer: server/main.py
Environment engine: server/env.py
Episode DB: server/database.py
Typed models: server/models.py
Reward logic: server/reward.py
Task + graders: server/tasks/
Baseline runner: inference.py

OpenEnv Contract and Action Space

API surface:

POST /reset
POST /step
GET /state
GET /tasks
GET /health
GET /benchmark

Actions:

Action	Required fields	Purpose
`submit_query`	`query`	Execute/grade SQL candidate
`inspect_schema`	none	Return schema metadata
`inspect_error`	none	Return last execution error
`inspect_sample`	`table_name`	Return sample rows
`reset_query`	none	Restore original broken query

Reward (clamped to [0.0, 1.0]) blends:

correctness (0.0-0.6)
efficiency (0.0-0.2)
syntax_progress (0.0-0.1)
schema_bonus (0.0-0.1)
penalties (0.0-0.2 magnitude)

Task Suite

Easy: easy_syntax_fix
Medium: medium_logic_fix
Hard: hard_multi_bug
Expert: hard_finance_explosion (fan-trap/cartesian explosion)

Reliability and Validation

openenv validate --verbose: PASS
python3 -m unittest discover -s tests -p "test_*.py": PASS
Docker smoke checks: PASS (/health, /tasks, /reset, /step)

Live benchmark example:

curl "http://localhost:7860/benchmark?runs=20"

Quick Start

Local

pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860

Docker

docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env

Baseline Inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py

Repository Structure

sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SQL Debug Environment (`sql-debug-env`)

Quick Links

Model Card Highlights

Problem and Motivation

Training Pipeline (How We Trained)

Agent Workflow (How It Works)

Benchmark Snapshot

Benchmark Themes and Failure Classes

Task themes

Common failure classes tracked

Proof and Evidence Artifacts

End-to-end environment map

Core benchmark visuals

Training and evaluation proofs

Run folders and model

System Architecture

OpenEnv Contract and Action Space

Task Suite

Reliability and Validation

Quick Start

Local

Docker

Baseline Inference

Repository Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
archive		archive
docs		docs
scripts		scripts
server		server
skills/graphify		skills/graphify
sql-debug-env		sql-debug-env
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
colab_pro_training.py		colab_pro_training.py
colab_real_world.py		colab_real_world.py
inference.py		inference.py
launch_job.py		launch_job.py
openenv.yaml		openenv.yaml
presentation_graphs.py		presentation_graphs.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
spider_chart.py		spider_chart.py
ultimate_benchmark.py		ultimate_benchmark.py
ultimate_sota_training.py		ultimate_sota_training.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

SQL Debug Environment (sql-debug-env)

Quick Links

Model Card Highlights

Problem and Motivation

Training Pipeline (How We Trained)

Agent Workflow (How It Works)

Benchmark Snapshot

Benchmark Themes and Failure Classes

Task themes

Common failure classes tracked

Proof and Evidence Artifacts

End-to-end environment map

Core benchmark visuals

Training and evaluation proofs

Run folders and model

System Architecture

OpenEnv Contract and Action Space

Task Suite

Reliability and Validation

Quick Start

Local

Docker

Baseline Inference

Repository Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

SQL Debug Environment (`sql-debug-env`)

Packages