Apache Airflow version
Other Airflow 3 version (please specify below)
If "Other Airflow 3 version" selected, which one?
3.1.6
What happened?
Many of my Airflow DAGs are failing because some tasks are killed and marked as failed even though they completed successfully. This seems to happen randomly, but often multiple DAGs fail at around the same time.
Behavior:
Just after a task logs its last message and returns, an internal error occurs involving the Airflow SDK:
{"logger": "airflow.task.operators.dags.custom_operators.DjangoOperator", "filename": "python.py", "lineno": 216, "event": "Done. Returned value was: None", "level": "info"}
{"logger": "task", "filename": "task_runner.py", "lineno": 1562, "error_detail": [{"exc_type": "AirflowRuntimeError", "exc_value": "API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'success'}}}", "exc_notes": [], "syntax_error": null, "is_cause": false, "frames": [{"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", "lineno": 1555, "name": "main"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", "lineno": 1083, "name": "run"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 206, "name": "send"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 270, "name": "_get_response"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 257, "name": "_from_frame"}], "is_group": false, "exceptions": []}], "event": "Top level error", "level": "error"}
{"exit_code": 1, "event": "Process exited abnormally", "level": "warning", "logger": "task"}
[{"exc_notes":[],"exc_type":"AirflowRuntimeError","exc_value":"API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'success'}}}","exceptions":[],"frames":[{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1555,"name":"main"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1083,"name":"run"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":206,"name":"send"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":270,"name":"_get_response"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":257,"name":"_from_frame"}],"is_cause":false,"is_group":false,"syntax_error":"\u003cnil\u003e"}]
What you think should happen instead?
After the task ends, it should update its state to success, and updating the state should not result in an error. It looks as if Airflow has already updated the state and then tries to update it again even though the task is already complete.
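The suspected double update can be illustrated with a minimal sketch. This is plain Python, not Airflow's actual implementation: it assumes a server-side guard that only permits state updates while the TI is `running`, which would reject a second success report with exactly the 409 `invalid_state` error seen in the logs.

```python
# Hypothetical model of the API server's state-transition guard
# (names and structure are assumptions for illustration only).

class Conflict(Exception):
    """Stands in for the HTTP 409 the API server returns."""

def update_ti_state(ti: dict, new_state: str) -> None:
    # Reject updates unless the TI is currently running, mirroring the
    # "TI was not in the running state so it cannot be updated" message.
    if ti["state"] != "running":
        raise Conflict({
            "status_code": 409,
            "reason": "invalid_state",
            "message": "TI was not in the running state so it cannot be updated",
            "previous_state": ti["state"],
        })
    ti["state"] = new_state

ti = {"state": "running"}
update_ti_state(ti, "success")        # first update succeeds
try:
    update_ti_state(ti, "success")    # duplicate update is rejected with 409
except Conflict as exc:
    print(exc.args[0]["previous_state"])  # → success
```

Under this model, `previous_state: success` in the error detail suggests the TI had already been marked successful (perhaps by the supervisor or a heartbeat path) before the task runner sent its own success report.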
How to reproduce
I don't know how to reproduce this, but there are similar issues to this one covering different cases of "TI was not in the running state so it cannot be updated".
Operating System
WSL
Versions of Apache Airflow Providers
apache-airflow[celery, redis, amazon, postgres, docker, fab]==3.1.6
apache-airflow-providers-fab==3.1.2
Deployment
Other
Deployment details
Deployed on ECS using a Fargate Cluster. One task per worker, scheduler, etc. Only one container running per task with additional logging sidecar containers.
Anything else?
No response
Are you willing to submit PR?
Code of Conduct