Skip to content

Fix task runner failure on duplicate TI success update conflict#63355

Open
sidshas03 wants to merge 1 commit intoapache:mainfrom
sidshas03:fix-63183-invalid-state-v2
Open

Fix task runner failure on duplicate TI success update conflict#63355
sidshas03 wants to merge 1 commit intoapache:mainfrom
sidshas03:fix-63183-invalid-state-v2

Conversation

@sidshas03
Copy link
Copy Markdown
Contributor

@sidshas03 sidshas03 commented Mar 11, 2026

Fixes #63183
Implemented a focused fix for duplicate terminal state update conflict.

What I changed:

  • added guard in task runner for API 409 invalid_state
  • if API says TI is already in requested state (example: previous_state=success while sending success), task runner will not fail the process
  • if API state does not match requested state, existing failure behaviour is kept

Tests added:

  • duplicate success conflict is ignored
  • conflicting state update still raises error

Ran targeted task-sdk tests and lint locally.

@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 11, 2026

When task runner sends final SUCCESS, if API replies with 409 invalid_state and previous_state=success, we now treat it as duplicate/idempotent state update and do not fail the task process.

Did you find out why did it errror with 409 invalid state in the first place.

@sidshas03
Copy link
Copy Markdown
Contributor Author

When task runner sends final SUCCESS, if API replies with 409 invalid_state and previous_state=success, we now treat it as duplicate/idempotent state update and do not fail the task process.

Did you find out why did it errror with 409 invalid state in the first place.

Yes, I checked this from my side.

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

So issue is race/idempotency in state update flow, not wrong task result.

So, this PR only handles that exact duplicate success path as no-op, so task process won’t fail for this case. Other invalid_state conflicts are still raised as before.

I’m also checking which path is writing first in HA (scheduler side vs worker retry path). If I get clean reproducible logs, I’ll add them in follow-up.

@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 11, 2026

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place.

@sidshas03
Copy link
Copy Markdown
Contributor Author

sidshas03 commented Mar 11, 2026

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place.

I traced this deeper and found the root-cause path.

This is primarily a scheduler-side HA race, not only a task-runner finalization issue.

In DagRun.schedule_tis() (airflow-core/src/airflow/models/dagrun.py), the scheduling update is keyed by TI id and can run from a stale scheduler view, which allows duplicate scheduling / try bump for the same TI under contention.
That means the same TI can be enqueued as try 1 and try 2.

Then one attempt finishes first and sets TI to success.
When the other attempt later reports terminal state, execution API correctly rejects it with 409 invalid_state / previous_state=success (airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py).

This matches the observed ordering in related HA logs:

  • enqueue try 1
  • enqueue try 2 for same TI
  • try 2 runs
  • DagRun marked success
  • late try 1 report hits invalid_state conflict

So the band-aid behavior in this PR avoids task-process failure for duplicate-success report, but the underlying cause is the scheduler HA race on TI scheduling/ownership.

But, this PR (#63355) fixes only the worker-side symptom for this specific case: when final SUCCESS update gets 409 invalid_state with previous_state=success, we now treat it as duplicate/idempotent and avoid failing the task process.

It does not fully solve the scheduler HA root cause (duplicate TI scheduling / try-number race in schedule_tis). That part needs scheduler-side fix separately (related to #57618 / #60330 path).

@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 11, 2026

@sidshas03 Could you post logs please

@sidshas03
Copy link
Copy Markdown
Contributor Author

sidshas03 commented Mar 12, 2026

@sidshas03 Could you post logs please

Hi @kaxil, as requested, sharing the relevant logs and exact lines.

Screenshot 2026-03-11 at 22 17 37 Screenshot 2026-03-11 at 22 16 49 Screenshot 2026-03-11 at 22 16 20 Screenshot 2026-03-11 at 22 15 34 Screenshot 2026-03-11 at 22 15 13 Screenshot 2026-03-11 at 22 14 41 Screenshot 2026-03-11 at 22 14 05 Screenshot 2026-03-11 at 22 13 14 Screenshot 2026-03-11 at 22 12 42 Screenshot 2026-03-11 at 22 12 13 Screenshot 2026-03-11 at 22 12 07 Screenshot 2026-03-11 at 22 10 56
  1. Symptom from Task completes successfully but gets killed and marked as failed #63183 (task completed, then 409, then process exits)
  • "event": "Done. Returned value was: None"
  • "exc_value": "API_SERVER_ERROR: {'status_code': 409, ... 'reason': 'invalid_state', ... 'previous_state': 'success'}"
  • "event": "Process exited abnormally"
  1. Scheduler race timeline from Schedulers race condition when using with kubernetes executor #57618 (same TI getting two tries)
  • 2025-12-15T04:42:29.146414Z [info] Add task ... try_number=1
  • 2025-12-15T04:43:58.258654Z [info] Reset the following 12 orphaned TaskInstances ...
  • 2025-12-15T04:44:01.790234Z [info] Add task ... try_number=2
  • 2025-12-15T04:45:19.337468Z [info] Marking run <DagRun sadp_dag_128 ...> successful
  • 2025-12-15T04:45:20.038291Z [info] Creating kubernetes pod ... try_number=1
  • 2025-12-15T04:46:04.784337Z [info] Received executor event with state failed ... try_number=1
  1. API/server side conflict log from Schedulers race condition when using with kubernetes executor #57618
  • {"event":"API server error","status_code":409,"detail":{"detail":{"reason":"invalid_state","previous_state":"success"}}}
  • {"event":"Top level error", ... "API_SERVER_ERROR: {'status_code': 409, ... 'previous_state': 'success'}"}
  • airflow.sdk.api.client.ServerResponseError: Server returned error

Links:

From these logs, root cause is scheduler-side HA race/duplicate scheduling path; #63355 is a focused mitigation so duplicate-success 409 does not fail task process, while scheduler race handling is in #63367.

@cedric-fauth
Copy link
Copy Markdown

Hi! Even though this does not fix the root cause I would appreciate a quick fix to get our DAGs working again rather then waiting for larger scoped issues to be resolved some time in the future 😄

@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 12, 2026

cc @amoghrajesh Could you take a look please, I won't get to it today

Comment thread task-sdk/src/airflow/sdk/execution_time/task_runner.py
@potiuk
Copy link
Copy Markdown
Member

potiuk commented Mar 12, 2026

@sidshas03 A few things need addressing before review — see our Pull Request quality criteria.

  • ⚠️ Unresolved review comments: This PR has 1 unresolved review thread from maintainers. Please review and resolve all inline review comments before requesting another review. You can resolve a conversation by clicking 'Resolve conversation' on each thread after addressing the feedback. See pull request guidelines.

Note: Your branch is 155 commits behind main. Please rebase and push again to get up-to-date CI results.

No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@sidshas03
Copy link
Copy Markdown
Contributor Author

@sidshas03 This PR has a few issues that need to be addressed before it can be reviewed — please see our Pull Request quality criteria.

Issues found:

  • ⚠️ Unresolved review comments: This PR has 1 unresolved review thread from maintainers. Please review and resolve all inline review comments before requesting another review. You can resolve a conversation by clicking 'Resolve conversation' on each thread after addressing the feedback. See pull request guidelines.

Note: Your branch is 155 commits behind main. Some check failures may be caused by changes in the base branch rather than by your PR. Please rebase your branch and push again to get up-to-date CI results.

What to do next:

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • Maintainers will then proceed with a normal review.

Please address the issues above and push again. If you have questions, feel free to ask on the Airflow Slack.

Thanks for flagging this.

I’m addressing the unresolved maintainer thread now (moving the duplicate-state handling to API side as suggested), then I’ll rebase this branch on latest main and push an updated commit.

I’ll resolve the review conversation after the changes are in.

@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from 748e9cd to ccfc763 Compare March 12, 2026 15:44
@sidshas03
Copy link
Copy Markdown
Contributor Author

sidshas03 commented Mar 12, 2026

Thanks for the review, addressed the requested changes in this update.

  • Moved duplicate same-state handling from task runner to the Execution API layer.
  • PATCH /execution/task-instances/{id}/state now treats same-state updates as idempotent no-op (200).
  • Real conflicting transitions still return 409 invalid_state.
  • Removed task-runner-side duplicate-state catch logic.
  • Added API tests covering:
    • duplicate same-state update -> success/no-op
    • conflicting state update -> 409

For Validation

  • Rebased branch on latest main.
  • Re-ran targeted lint and tests locally:
    • ruff on modified files: pass
    • test_ti_update_state_same_state_is_idempotent: pass
    • test_ti_update_state_not_running: pass

@Nataneljpwd
Copy link
Copy Markdown
Contributor

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place.

I traced this deeper and found the root-cause path.

This is primarily a scheduler-side HA race, not only a task-runner finalization issue.

In DagRun.schedule_tis() (airflow-core/src/airflow/models/dagrun.py), the scheduling update is keyed by TI id and can run from a stale scheduler view, which allows duplicate scheduling / try bump for the same TI under contention. That means the same TI can be enqueued as try 1 and try 2.

Then one attempt finishes first and sets TI to success. When the other attempt later reports terminal state, execution API correctly rejects it with 409 invalid_state / previous_state=success (airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py).

If you say that tghe issue is due to a stale view, maybe this can be solved by an exclusive lock? or if the stale view comes from ORM cache, maybe we can clear the cache before running the query?

as this probably should also be part of the critical section of the main scheduler loop, what do you think?

@sidshas03
Copy link
Copy Markdown
Contributor Author

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place.

I traced this deeper and found the root-cause path.
This is primarily a scheduler-side HA race, not only a task-runner finalization issue.
In DagRun.schedule_tis() (airflow-core/src/airflow/models/dagrun.py), the scheduling update is keyed by TI id and can run from a stale scheduler view, which allows duplicate scheduling / try bump for the same TI under contention. That means the same TI can be enqueued as try 1 and try 2.
Then one attempt finishes first and sets TI to success. When the other attempt later reports terminal state, execution API correctly rejects it with 409 invalid_state / previous_state=success (airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py).

If you say that tghe issue is due to a stale view, maybe this can be solved by an exclusive lock? or if the stale view comes from ORM cache, maybe we can clear the cache before running the query?

as this probably should also be part of the critical section of the main scheduler loop, what do you think?

409 invalid_state is coming when same TI gets terminal SUCCESS update more than once. In this case, by the time task runner sends final SUCCESS, DB already has SUCCESS (previous_state=success), so API rejects the second write.

Yeah, that's what I meant, i.e. let's find out the root cause first and not add band-aid -- what's causing task to succeed before the worker reports in the first place.

I traced this deeper and found the root-cause path.
This is primarily a scheduler-side HA race, not only a task-runner finalization issue.
In DagRun.schedule_tis() (airflow-core/src/airflow/models/dagrun.py), the scheduling update is keyed by TI id and can run from a stale scheduler view, which allows duplicate scheduling / try bump for the same TI under contention. That means the same TI can be enqueued as try 1 and try 2.
Then one attempt finishes first and sets TI to success. When the other attempt later reports terminal state, execution API correctly rejects it with 409 invalid_state / previous_state=success (airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py).

If you say that tghe issue is due to a stale view, maybe this can be solved by an exclusive lock? or if the stale view comes from ORM cache, maybe we can clear the cache before running the query?

as this probably should also be part of the critical section of the main scheduler loop, what do you think?

Thanks @Nataneljpwd, very valid point.

I dug into it and agree the root cause is scheduler side HA race (stale scheduler view / competing updates), not only task runner finalization.

This PR is intentionally narrow: it makes the TI state endpoint idempotent for duplicate same state terminal updates (returns 200), so duplicate SUCCESS report doesn’t fail the task process. Real conflicting transitions still return 409.

On lock/cache part:

  • clearing ORM cache won’t reliably solve it, because this race is across scheduler processes, not just one session cache.
  • a global/exclusive lock in the scheduler critical section would likely hurt HA throughput and increase contention.

The durable fix should be in scheduler write path (DagRun.schedule_tis) with state guarded conditional updates so stale updates become no-op. I’ve handled that separately in #63367 with race focused tests.

@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from ccfc763 to 4770f21 Compare March 12, 2026 18:37
@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 12, 2026

I dug into it and agree the root cause is scheduler side HA race (stale scheduler view / competing updates), not only task runner finalization.

Are you running 2 schedulers?

@sidshas03
Copy link
Copy Markdown
Contributor Author

I dug into it and agree the root cause is scheduler side HA race (stale scheduler view / competing updates), not only task runner finalization.

Are you running 2 schedulers?

Yes, this was reproduced with HA scheduling enabled using 2 scheduler replicas.

In the logs, the same task instance is handled by both scheduler pods (...-nzlm5 and ...-rkv75), confirming this comes from a 2 scheduler setup.

@kaxil
Copy link
Copy Markdown
Member

kaxil commented Mar 12, 2026

While you already have them running, could you find these please:

  1. Could you upload logs from both schedulers when this happens? (full logs around the time of the race, not just excerpts)
  2. What is your scheduler_health_check_threshold setting? And how long do your tasks typically take to execute?
  3. Was Scheduler A actually unhealthy, or was it still running fine when Scheduler B marked its job as failed?

The DagRun.get_running_dag_runs_to_examine() uses SELECT ... FOR UPDATE SKIP LOCKED on DagRun rows. This means two schedulers should NOT be able to process the same DagRun concurrently - Scheduler B would skip any DagRun locked by Scheduler A:

def get_running_dag_runs_to_examine(cls, session: Session) -> ScalarResult[DagRun]:
    """
    Return the next DagRuns that the scheduler should attempt to schedule.

    This will return zero or more DagRun rows that are row-level-locked with a "SELECT ... FOR UPDATE"
    query, you should ensure that any scheduling decisions are made in a single transaction -- as soon as
    the transaction is committed it will be unlocked.
    """
    ...
    return session.scalars(with_row_locks(query, of=cls, session=session, skip_locked=True)).unique()

Full Code:

def get_running_dag_runs_to_examine(cls, session: Session) -> ScalarResult[DagRun]:
"""
Return the next DagRuns that the scheduler should attempt to schedule.
This will return zero or more DagRun rows that are row-level-locked with a "SELECT ... FOR UPDATE"
query, you should ensure that any scheduling decisions are made in a single transaction -- as soon as
the transaction is committed it will be unlocked.
:meta private:
"""
from airflow.models.backfill import BackfillDagRun
from airflow.models.dag import DagModel
query = (
select(cls)
.with_hint(cls, "USE INDEX (idx_dag_run_running_dags)", dialect_name="mysql")
.where(cls.state == DagRunState.RUNNING)
.join(DagModel, DagModel.dag_id == cls.dag_id)
.join(BackfillDagRun, BackfillDagRun.dag_run_id == DagRun.id, isouter=True)
.where(
DagModel.is_paused == false(),
DagModel.is_stale == false(),
)
.options(joinedload(cls.task_instances))
.order_by(
nulls_first(cast("ColumnElement[Any]", BackfillDagRun.sort_ordinal), session=session),
nulls_first(cast("ColumnElement[Any]", cls.last_scheduling_decision), session=session),
cls.run_after,
)
.limit(cls.DEFAULT_DAGRUNS_TO_EXAMINE)
)
query = query.where(DagRun.run_after <= func.now())
result = session.scalars(with_row_locks(query, of=cls, session=session, skip_locked=True)).unique()
return result

Given this, the schedule_tis() race (where two schedulers both increment try_number on the same TI) shouldn't be possible under normal operation.

From #57618 (comment)

@sidshas03
Copy link
Copy Markdown
Contributor Author

While you already have them running, could you find these please:

  1. Could you upload logs from both schedulers when this happens? (full logs around the time of the race, not just excerpts)
  2. What is your scheduler_health_check_threshold setting? And how long do your tasks typically take to execute?
  3. Was Scheduler A actually unhealthy, or was it still running fine when Scheduler B marked its job as failed?

The DagRun.get_running_dag_runs_to_examine() uses SELECT ... FOR UPDATE SKIP LOCKED on DagRun rows. This means two schedulers should NOT be able to process the same DagRun concurrently - Scheduler B would skip any DagRun locked by Scheduler A:

def get_running_dag_runs_to_examine(cls, session: Session) -> ScalarResult[DagRun]:
    """
    Return the next DagRuns that the scheduler should attempt to schedule.

    This will return zero or more DagRun rows that are row-level-locked with a "SELECT ... FOR UPDATE"
    query, you should ensure that any scheduling decisions are made in a single transaction -- as soon as
    the transaction is committed it will be unlocked.
    """
    ...
    return session.scalars(with_row_locks(query, of=cls, session=session, skip_locked=True)).unique()

Full Code:

def get_running_dag_runs_to_examine(cls, session: Session) -> ScalarResult[DagRun]:
"""
Return the next DagRuns that the scheduler should attempt to schedule.
This will return zero or more DagRun rows that are row-level-locked with a "SELECT ... FOR UPDATE"
query, you should ensure that any scheduling decisions are made in a single transaction -- as soon as
the transaction is committed it will be unlocked.
:meta private:
"""
from airflow.models.backfill import BackfillDagRun
from airflow.models.dag import DagModel
query = (
select(cls)
.with_hint(cls, "USE INDEX (idx_dag_run_running_dags)", dialect_name="mysql")
.where(cls.state == DagRunState.RUNNING)
.join(DagModel, DagModel.dag_id == cls.dag_id)
.join(BackfillDagRun, BackfillDagRun.dag_run_id == DagRun.id, isouter=True)
.where(
DagModel.is_paused == false(),
DagModel.is_stale == false(),
)
.options(joinedload(cls.task_instances))
.order_by(
nulls_first(cast("ColumnElement[Any]", BackfillDagRun.sort_ordinal), session=session),
nulls_first(cast("ColumnElement[Any]", cls.last_scheduling_decision), session=session),
cls.run_after,
)
.limit(cls.DEFAULT_DAGRUNS_TO_EXAMINE)
)
query = query.where(DagRun.run_after <= func.now())
result = session.scalars(with_row_locks(query, of=cls, session=session, skip_locked=True)).unique()
return result

Given this, the schedule_tis() race (where two schedulers both increment try_number on the same TI) shouldn't be possible under normal operation.

From #57618 (comment)

Thanks for the detailed questions.

I tried collecting full logs, but I currently don’t have kube access from this local environment (no active kubectl context; kubectl falls back to localhost:8080), so I can’t fetch scheduler pod logs directly from this setup yet.

@sidshas03 sidshas03 closed this Mar 12, 2026
@sidshas03 sidshas03 reopened this Mar 12, 2026
@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from 4770f21 to ee49981 Compare March 13, 2026 00:27
@sidshas03 sidshas03 requested a review from amoghrajesh March 13, 2026 08:26
@cedric-fauth
Copy link
Copy Markdown

Any updates on this one?

@potiuk potiuk modified the milestones: Airflow 3.2.0, Airflow 3.2.1 Apr 2, 2026
@potiuk potiuk force-pushed the fix-63183-invalid-state-v2 branch from 942a209 to f38a28f Compare April 2, 2026 18:02
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adjusts Task Instance state update behavior to treat duplicate terminal-state updates as idempotent, avoiding failures when the TI is already in the requested terminal state.

Changes:

  • Added an idempotency short-circuit in the TI state update endpoint for duplicate terminal-state updates.
  • Updated OpenAPI response descriptions to distinguish “already in requested state” vs “invalid transition”.
  • Added unit tests to validate idempotent duplicate updates and conflict on terminal-state mismatches.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py Adds tests for idempotent duplicate terminal updates and conflict on mismatched terminal updates
airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py Implements idempotent handling for same-state updates and updates documented responses

Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py Outdated
@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented Apr 13, 2026

@sidshas03 can you address the open comments and close what is already fixed?

@potiuk potiuk marked this pull request as draft April 23, 2026 00:38
@potiuk
Copy link
Copy Markdown
Member

potiuk commented Apr 23, 2026

@sidshas03 Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.

  • Unresolved review comments (8 threads (from maintainers)): please walk through each unresolved review thread. Even if a suggestion looks incorrect or irrelevant — and some of them will be, especially any comments left by automated reviewers like GitHub Copilot — it is still the author's responsibility to respond: apply the fix, reply in-thread with a brief explanation of why the suggestion does not apply, or resolve the thread if the feedback is no longer relevant. Once you believe a thread is resolved — whether by pushing a fix or by explaining why the suggestion doesn't apply — please mark it as resolved yourself by clicking the 'Resolve conversation' button at the bottom of each thread. Reviewers don't auto-close their own threads, so an addressed-but-unresolved thread reads as 'still waiting on the author' and keeps the PR from moving forward.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@potiuk
Copy link
Copy Markdown
Member

potiuk commented May 6, 2026

@sidshas03 This draft PR has been inactive for 13 days since the last triage comment and no response from the author. Closing to keep the queue clean.

You are welcome to reopen this PR when you resume work, or to open a new one addressing the issues previously raised. There is no rush — take your time.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@potiuk potiuk closed this May 6, 2026
@cedric-fauth
Copy link
Copy Markdown

@sidshas03 @potiuk we're already using this fix as a monkeypatch in our setup but it would be very helpful for us if this is going into the next release!

@eladkal eladkal reopened this May 6, 2026
@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented May 6, 2026

I reopened it but moving forward with this depends on @sidshas03 to finish it.
If the author is unwilling / don't have the time maybe he can hand over it to you @cedric-fauth to complete it?

@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label May 6, 2026
@sidshas03
Copy link
Copy Markdown
Contributor Author

Hi @eladkal, thanks for reopening this. I'm still working on it — was caught up with office work and actually had the fixes done locally a while back but forgot to push. I'll address all 8 unresolved review threads now and push the update shortly.

@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from b994fb2 to 2bd59fc Compare May 6, 2026 13:52
@sidshas03
Copy link
Copy Markdown
Contributor Author

Hey, sorry for going quiet on this - got caught up with work stuff. I'd actually made most of these fixes a while ago but completely forgot to push them.

Changes I made:

  • Dropped the str() calls for state comparison. The state column in the DB is just a String(20), so previous_state comes back as a plain string — now comparing ti_patch_payload.state.value directly against it
  • Swapped response.text == "" to response.content == b"" in the test
  • Rebased on latest main

The 4 CI failures are all the same SageMaker provider test (test_delete_model_when_not_exist, assert 400 == 404) — moto version issue, nothing to do with this PR.

Going through the open review threads now.

@eladkal eladkal marked this pull request as ready for review May 6, 2026 21:09
@eladkal eladkal force-pushed the fix-63183-invalid-state-v2 branch from 2bd59fc to 780feb0 Compare May 6, 2026 21:09
@sidshas03 sidshas03 requested a review from Copilot May 7, 2026 15:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py Outdated
@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from 396f0e2 to c5ca7c9 Compare May 7, 2026 15:27
@sidshas03 sidshas03 force-pushed the fix-63183-invalid-state-v2 branch from c5ca7c9 to 0976b34 Compare May 7, 2026 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:task-sdk type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Task completes successfully but gets killed and marked as failed

10 participants