Skip to content

Concurrent POST /api/v2/backfills causes HTTP 500 + partial data with SQLite metadata DB #68699

@mwisnicki

Description

@mwisnicki

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.2

What happened and how to reproduce it?

Apologies for AI slop but the issue is clearly reproducible.

<🤖>

Sending multiple concurrent POST /api/v2/backfills requests for the same DAG results in HTTP 500 with empty response body, but those failed requests partially commit backfill_dag_run rows to SQLite. The partially-created backfill then blocks future creation with "already running backfill".

Root cause visible in API server logs:

sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

_create_runs_non_partitioned inserts rows in a loop using begin_nested() savepoints. When SQLite's busy timeout fires mid-loop the exception propagates as HTTP 500, but rows flushed by earlier savepoints survive — leaving a partially-initialised backfill.

My mise.toml

[tools]
python = "3.12"

[env]
_.path = [".venv/bin"]
AIRFLOW_HOME = "{{config_root}}"
AIRFLOW__API__PORT = "48001"
AIRFLOW__CORE__EXECUTION_API_SERVER_URL = "http://localhost:48001/execution/"
AIRFLOW__CORE__LOAD_EXAMPLES = "false"
AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL = "1"
AIRFLOW__LOGGING__WORKER_LOG_SERVER_HOST = "127.0.0.1"
AIRFLOW__SCHEDULER__MAX_DAGRUNS_TO_CREATE_PER_LOOP = "15"
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG = "15"
AIRFLOW__CORE__PARALLELISM = "50"

[tasks.setup]
description = "Install Airflow into venv"
run = """
  pip install "apache-airflow==3.2.2" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.2.2/constraints-no-providers-3.12.txt"
"""

To reproduce:

  1. Create a minimal DAG (test_backfill_bug.py):

     from airflow.sdk import dag, task
     from datetime import datetime
    
     @dag(
         dag_id="test_backfill_bug",
         schedule="@daily",
         start_date=datetime(2020, 1, 1),
         end_date=datetime(2022, 12, 31),
         catchup=False,
     )
     def test_backfill_bug():
         @task
         def noop():
             pass
         noop()
    
     test_backfill_bug()
    
  2. Fire 5 concurrent backfill creation requests:

     TOKEN=$(curl -s -X POST "http://localhost:8080/auth/token" \
       -H "Content-Type: application/json" \
       -d '{"username": "admin", "password": "admin"}' \
       | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
    
     for i in 1 2 3 4 5; do
       curl -s -o /dev/null -w "[%{http_code}] " \
         -X POST "http://localhost:8080/api/v2/backfills" \
         -H "Content-Type: application/json" \
         -H "Authorization: Bearer $TOKEN" \
         -d '{"dag_id":"test_backfill_bug","from_date":"2020-01-01T00:00:00Z","to_date":"2022-12-31T00:00:00Z","max_active_runs":1,"reprocess_behavior":"none"}' &
     done
     wait
    
  3. Check SQLite for partial data:

     python3 -c "
     import sqlite3, os
     conn = sqlite3.connect(os.path.expanduser('~/airflow/airflow.db'))
     cur = conn.cursor()
     cur.execute('SELECT id, created_at FROM backfill WHERE dag_id=\"test_backfill_bug\" ORDER BY id')
     for b_id, created_at in cur.fetchall():
         cur.execute('SELECT COUNT(*) FROM backfill_dag_run WHERE backfill_id=?', (b_id,))
         print(f'backfill {b_id}: {cur.fetchone()[0]}/1096 runs  created_at={created_at}')
     conn.close()
     "
    

Observed:

[201] [500] [500] [409] [409]
backfill 1: 1096/1096 runs  created_at=...
backfill 2:    2/1096 runs  created_at=...   <- HTTP 500 but partial data committed

What you think should happen instead?

Either full success (201) or a clean failure with zero rows written. A request that returns HTTP 500 should never leave partial data in the database. Currently the partially-created backfill cannot be retried (409 "already running") and must be manually cleaned up from SQLite.

Operating System

macOS

Deployment

Virtualenv installation

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

Distinct from #61375 (premature completion, fixed in 3.2.2). This is a data integrity issue in the creation path — _create_runs_non_partitioned in airflow/models/backfill.py has no protection against partial commit on SQLite lock timeout.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:APIAirflow's REST/HTTP APIarea:MetaDBMeta Database related issues.area:backfillSpecifically for backfill relatedkind:bugThis is a clearly a bugneeds-triagelabel for new issues that we didn't triage yet

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions