Fix task runner failure on duplicate TI success update conflict#63355
Fix task runner failure on duplicate TI success update conflict#63355sidshas03 wants to merge 1 commit intoapache:mainfrom
Conversation
Did you find out why did it errror with 409 invalid state in the first place. |
Yes, I checked this from my side. 409 invalid_state is coming when same TI gets terminal So issue is race/idempotency in state update flow, not wrong task result. So, this PR only handles that exact duplicate success path as no-op, so task process won’t fail for this case. Other invalid_state conflicts are still raised as before. I’m also checking which path is writing first in HA (scheduler side vs worker retry path). If I get clean reproducible logs, I’ll add them in follow-up. |
Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place. |
I traced this deeper and found the root-cause path. This is primarily a scheduler-side HA race, not only a task-runner finalization issue. In Then one attempt finishes first and sets TI to This matches the observed ordering in related HA logs:
So the band-aid behavior in this PR avoids task-process failure for duplicate-success report, but the underlying cause is the scheduler HA race on TI scheduling/ownership. But, this PR (#63355) fixes only the worker-side symptom for this specific case: when final It does not fully solve the scheduler HA root cause (duplicate TI scheduling / try-number race in schedule_tis). That part needs scheduler-side fix separately (related to #57618 / #60330 path). |
|
@sidshas03 Could you post logs please |
Hi @kaxil, as requested, sharing the relevant logs and exact lines.
Links:
From these logs, root cause is scheduler-side HA race/duplicate scheduling path; #63355 is a focused mitigation so duplicate-success 409 does not fail task process, while scheduler race handling is in #63367. |
|
Hi! Even though this does not fix the root cause I would appreciate a quick fix to get our DAGs working again rather then waiting for larger scoped issues to be resolved some time in the future 😄 |
|
cc @amoghrajesh Could you take a look please, I won't get to it today |
|
@sidshas03 A few things need addressing before review — see our Pull Request quality criteria.
No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
Thanks for flagging this. I’m addressing the unresolved maintainer thread now (moving the duplicate-state handling to API side as suggested), then I’ll rebase this branch on latest I’ll resolve the review conversation after the changes are in. |
748e9cd to
ccfc763
Compare
|
Thanks for the review, addressed the requested changes in this update.
For Validation
|
If you say that tghe issue is due to a stale view, maybe this can be solved by an exclusive lock? or if the stale view comes from ORM cache, maybe we can clear the cache before running the query? as this probably should also be part of the critical section of the main scheduler loop, what do you think? |
Thanks @Nataneljpwd, very valid point. I dug into it and agree the root cause is scheduler side HA race (stale scheduler view / competing updates), not only task runner finalization. This PR is intentionally narrow: it makes the TI state endpoint idempotent for duplicate same state terminal updates (returns 200), so duplicate SUCCESS report doesn’t fail the task process. Real conflicting transitions still return 409. On lock/cache part:
The durable fix should be in scheduler write path ( |
ccfc763 to
4770f21
Compare
Are you running 2 schedulers? |
Yes, this was reproduced with HA scheduling enabled using 2 scheduler replicas. In the logs, the same task instance is handled by both scheduler pods ( |
|
While you already have them running, could you find these please:
The def get_running_dag_runs_to_examine(cls, session: Session) -> ScalarResult[DagRun]:
"""
Return the next DagRuns that the scheduler should attempt to schedule.
This will return zero or more DagRun rows that are row-level-locked with a "SELECT ... FOR UPDATE"
query, you should ensure that any scheduling decisions are made in a single transaction -- as soon as
the transaction is committed it will be unlocked.
"""
...
return session.scalars(with_row_locks(query, of=cls, session=session, skip_locked=True)).unique()Full Code: airflow/airflow-core/src/airflow/models/dagrun.py Lines 581 to 616 in 267a566 Given this, the From #57618 (comment) |
Thanks for the detailed questions. I tried collecting full logs, but I currently don’t have kube access from this local environment (no active kubectl context; kubectl falls back to localhost:8080), so I can’t fetch scheduler pod logs directly from this setup yet. |
4770f21 to
ee49981
Compare
|
Any updates on this one? |
942a209 to
f38a28f
Compare
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adjusts Task Instance state update behavior to treat duplicate terminal-state updates as idempotent, avoiding failures when the TI is already in the requested terminal state.
Changes:
- Added an idempotency short-circuit in the TI state update endpoint for duplicate terminal-state updates.
- Updated OpenAPI response descriptions to distinguish “already in requested state” vs “invalid transition”.
- Added unit tests to validate idempotent duplicate updates and conflict on terminal-state mismatches.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py | Adds tests for idempotent duplicate terminal updates and conflict on mismatched terminal updates |
| airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py | Implements idempotent handling for same-state updates and updates documented responses |
|
@sidshas03 can you address the open comments and close what is already fixed? |
|
@sidshas03 Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.
See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
|
@sidshas03 This draft PR has been inactive for 13 days since the last triage comment and no response from the author. Closing to keep the queue clean. You are welcome to reopen this PR when you resume work, or to open a new one addressing the issues previously raised. There is no rush — take your time. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
|
@sidshas03 @potiuk we're already using this fix as a monkeypatch in our setup but it would be very helpful for us if this is going into the next release! |
|
I reopened it but moving forward with this depends on @sidshas03 to finish it. |
|
Hi @eladkal, thanks for reopening this. I'm still working on it — was caught up with office work and actually had the fixes done locally a while back but forgot to push. I'll address all 8 unresolved review threads now and push the update shortly. |
b994fb2 to
2bd59fc
Compare
|
Hey, sorry for going quiet on this - got caught up with work stuff. I'd actually made most of these fixes a while ago but completely forgot to push them. Changes I made:
The 4 CI failures are all the same SageMaker provider test ( Going through the open review threads now. |
2bd59fc to
780feb0
Compare
396f0e2 to
c5ca7c9
Compare
c5ca7c9 to
0976b34
Compare








Fixes #63183
Implemented a focused fix for duplicate terminal state update conflict.
What I changed:
invalid_stateprevious_state=successwhile sending success), task runner will not fail the processTests added:
Ran targeted task-sdk tests and lint locally.