Task completes successfully but gets killed and marked as failed #63183

@cedric-fauth

Apache Airflow version

Other Airflow 3 version (please specify below)

If "Other Airflow 3 version" selected, which one?

3.1.6

What happened?

Many of my Airflow DAGs fail because some tasks are killed and marked as failed even though they completed successfully. This seems to happen at random, but often several DAGs fail at around the same time.

Behavior:

Right after a task logs its last message and returns, an internal error involving the Airflow SDK occurs:

{"logger": "airflow.task.operators.dags.custom_operators.DjangoOperator", "filename": "python.py", "lineno": 216, "event": "Done. Returned value was: None", "level": "info"}
{"logger": "task", "filename": "task_runner.py", "lineno": 1562, "error_detail": [{"exc_type": "AirflowRuntimeError", "exc_value": "API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'success'}}}", "exc_notes": [], "syntax_error": null, "is_cause": false, "frames": [{"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", "lineno": 1555, "name": "main"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py", "lineno": 1083, "name": "run"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 206, "name": "send"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 270, "name": "_get_response"}, {"filename": "/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py", "lineno": 257, "name": "_from_frame"}], "is_group": false, "exceptions": []}], "event": "Top level error", "level": "error"}
{"exit_code": 1, "event": "Process exited abnormally", "level": "warning", "logger": "task"}

[{"exc_notes":[],"exc_type":"AirflowRuntimeError","exc_value":"API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'success'}}}","exceptions":[],"frames":[{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1555,"name":"main"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1083,"name":"run"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":206,"name":"send"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":270,"name":"_get_response"},{"filename":"/bin/app/lib/python3.12/site-packages/airflow/sdk/execution_time/comms.py","lineno":257,"name":"_from_frame"}],"is_cause":false,"is_group":false,"syntax_error":"\u003cnil\u003e"}]
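
For anyone triaging similar logs, here is a minimal sketch of pulling the rejected state transition out of entries like the two above. It assumes the `exc_value` text after the `API_SERVER_ERROR: ` prefix is a Python dict literal, as it appears in these logs. `extract_conflict` is a hypothetical helper written for illustration and is not part of Airflow.

```python
from __future__ import annotations

import ast
import json

# Prefix seen in the exc_value field of the log entries above.
PREFIX = "API_SERVER_ERROR: "


def extract_conflict(log_line: str) -> dict | None:
    """Return reason and previous_state of a 409 error in one log entry.

    Hypothetical helper: accepts either a full JSON log record that has an
    "error_detail" list, or the bare JSON list of error objects.
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if isinstance(record, list):
        errors = record
    elif isinstance(record, dict):
        errors = record.get("error_detail", [])
    else:
        return None
    for err in errors:
        value = err.get("exc_value", "") if isinstance(err, dict) else ""
        if not value.startswith(PREFIX):
            continue
        try:
            # The payload is a Python repr (single quotes), not JSON.
            payload = ast.literal_eval(value[len(PREFIX):])
        except (ValueError, SyntaxError):
            continue
        if not isinstance(payload, dict) or payload.get("status_code") != 409:
            continue
        detail = payload.get("detail")
        inner = detail.get("detail") if isinstance(detail, dict) else None
        if isinstance(inner, dict):
            return {
                "reason": inner.get("reason"),
                "previous_state": inner.get("previous_state"),
            }
    return None
```

Running this over each line of a task log and collecting the non-`None` results, together with the log timestamps, could help check whether the failures really cluster in time.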

What you think should happen instead?

After the task ends, it should update its state to success, and that update should not result in an error. It looks as if Airflow has already updated the state and then tries to update it again even though the task has already completed.
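
The suspected sequence can be sketched with a toy model. This is not Airflow's server code: `ToyTaskInstance`, `InvalidState`, and `report_success_twice` are hypothetical names, and the only behavior assumed is what the error message states, namely that an update is rejected with a 409 when the task instance is no longer in the running state.

```python
from __future__ import annotations

from dataclasses import dataclass


class InvalidState(Exception):
    """Toy stand-in for the 409 'invalid_state' response in the logs."""

    def __init__(self, previous_state: str) -> None:
        super().__init__(
            "TI was not in the running state so it cannot be updated"
        )
        self.status_code = 409
        self.previous_state = previous_state


@dataclass
class ToyTaskInstance:
    """Hypothetical model: only a running TI may move to a final state."""

    state: str = "running"

    def finish(self, new_state: str) -> None:
        if self.state != "running":
            raise InvalidState(self.state)
        self.state = new_state


def report_success_twice() -> tuple[str, str | None]:
    """Send the same final state twice and return what the second call saw."""
    ti = ToyTaskInstance()
    ti.finish("success")  # first update is accepted
    try:
        ti.finish("success")  # duplicate update is rejected
    except InvalidState as exc:
        return ti.state, exc.previous_state
    return ti.state, None
```

In this model the second call fails with `previous_state` equal to `success`, matching the shape of the error in the logs. Whether a duplicate update is what actually happens in 3.1.6 is only a guess.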

How to reproduce

I don't know how to reproduce this, but there are similar issues that cover other cases of "TI was not in the running state so it cannot be updated".

Operating System

WSL

Versions of Apache Airflow Providers

apache-airflow[celery, redis, amazon, postgres, docker, fab]==3.1.6
apache-airflow-providers-fab==3.1.2

Deployment

Other

Deployment details

Deployed on ECS using a Fargate cluster, with one task per component (worker, scheduler, etc.). Each task runs a single container plus additional logging sidecar containers.

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Labels

area:core, kind:bug, needs-triage
