Under which category would you file this issue?
Airflow Core
Apache Airflow version
3.2.2
What happened and how to reproduce it?
Apologies for AI slop but the issue is clearly reproducible.
<🤖>
Sending multiple concurrent POST /api/v2/backfills requests for the same DAG results in HTTP 500 with empty response body, but those failed requests partially commit backfill_dag_run rows to SQLite. The partially-created backfill then blocks future creation with "already running backfill".
Root cause visible in API server logs:
sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
_create_runs_non_partitioned inserts rows in a loop using begin_nested() savepoints. When SQLite's busy timeout fires mid-loop the exception propagates as HTTP 500, but rows flushed by earlier savepoints survive — leaving a partially-initialised backfill.
My mise.toml
[tools]
python = "3.12"
[env]
_.path = [".venv/bin"]
AIRFLOW_HOME = "{{config_root}}"
AIRFLOW__API__PORT = "48001"
AIRFLOW__CORE__EXECUTION_API_SERVER_URL = "http://localhost:48001/execution/"
AIRFLOW__CORE__LOAD_EXAMPLES = "false"
AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL = "1"
AIRFLOW__LOGGING__WORKER_LOG_SERVER_HOST = "127.0.0.1"
AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP = "15"
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG = "15"
AIRFLOW__CORE__PARALLELISM = "50"
[tasks.setup]
description = "Install Airflow into venv"
run = """
pip install "apache-airflow==3.2.2" \
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.2.2/constraints-no-providers-3.12.txt"
"""
To reproduce:
-
Create a minimal DAG (test_backfill_bug.py):
from airflow.sdk import dag, task
from datetime import datetime
@dag(
dag_id="test_backfill_bug",
schedule="@daily",
start_date=datetime(2020, 1, 1),
end_date=datetime(2022, 12, 31),
catchup=False,
)
def test_backfill_bug():
@task
def noop():
pass
noop()
test_backfill_bug()
-
Fire 5 concurrent backfill creation requests:
TOKEN=$(curl -s -X POST "http://localhost:8080/auth/token" \
-H "Content-Type: application/json" \
-d '{"username": "admin", "password": "admin"}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
for i in 1 2 3 4 5; do
curl -s -o /dev/null -w "[%{http_code}] " \
-X POST "http://localhost:8080/api/v2/backfills" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"dag_id":"test_backfill_bug","from_date":"2020-01-01T00:00:00Z","to_date":"2022-12-31T00:00:00Z","max_active_runs":1,"reprocess_behavior":"none"}' &
done
wait
-
Check SQLite for partial data:
python3 -c "
import sqlite3, os
conn = sqlite3.connect(os.path.expanduser('~/airflow/airflow.db'))
cur = conn.cursor()
cur.execute('SELECT id, created_at FROM backfill WHERE dag_id=\"test_backfill_bug\" ORDER BY id')
for b_id, created_at in cur.fetchall():
cur.execute('SELECT COUNT(*) FROM backfill_dag_run WHERE backfill_id=?', (b_id,))
print(f'backfill {b_id}: {cur.fetchone()[0]}/1096 runs created_at={created_at}')
conn.close()
"
Observed:
[201] [500] [500] [409] [409]
backfill 1: 1096/1096 runs created_at=...
backfill 2: 2/1096 runs created_at=... <- HTTP 500 but partial data committed
What you think should happen instead?
Either full success (201) or a clean failure with zero rows written. A request that returns HTTP 500 should never leave partial data in the database. Currently the partially-created backfill cannot be retried (409 "already running") and must be manually cleaned up from SQLite.
Operating System
macOS
Deployment
Virtualenv installation
Apache Airflow Provider(s)
No response
Versions of Apache Airflow Providers
No response
Official Helm Chart version
Not Applicable
Kubernetes Version
No response
Helm Chart configuration
No response
Docker Image customizations
No response
Anything else?
Distinct from #61375 (premature completion, fixed in 3.2.2). This is a data integrity issue in the creation path — _create_runs_non_partitioned in airflow/models/backfill.py has no protection against partial commit on SQLite lock timeout.
Are you willing to submit PR?
Code of Conduct
Under which category would you file this issue?
Airflow Core
Apache Airflow version
3.2.2
What happened and how to reproduce it?
Apologies for AI slop but the issue is clearly reproducible.
<🤖>Sending multiple concurrent
POST /api/v2/backfillsrequests for the same DAG results in HTTP 500 with empty response body, but those failed requests partially commitbackfill_dag_runrows to SQLite. The partially-created backfill then blocks future creation with "already running backfill".Root cause visible in API server logs:
_create_runs_non_partitionedinserts rows in a loop usingbegin_nested()savepoints. When SQLite's busy timeout fires mid-loop the exception propagates as HTTP 500, but rows flushed by earlier savepoints survive — leaving a partially-initialised backfill.My mise.toml
To reproduce:
Create a minimal DAG (
test_backfill_bug.py):Fire 5 concurrent backfill creation requests:
Check SQLite for partial data:
Observed:
What you think should happen instead?
Either full success (201) or a clean failure with zero rows written. A request that returns HTTP 500 should never leave partial data in the database. Currently the partially-created backfill cannot be retried (409 "already running") and must be manually cleaned up from SQLite.
Operating System
macOS
Deployment
Virtualenv installation
Apache Airflow Provider(s)
No response
Versions of Apache Airflow Providers
No response
Official Helm Chart version
Not Applicable
Kubernetes Version
No response
Helm Chart configuration
No response
Docker Image customizations
No response
Anything else?
Distinct from #61375 (premature completion, fixed in 3.2.2). This is a data integrity issue in the creation path —
_create_runs_non_partitionedinairflow/models/backfill.pyhas no protection against partial commit on SQLite lock timeout.Are you willing to submit PR?
Code of Conduct