Include hook name in suppressed listener-exception log#66395
Draft
1fanwang wants to merge 7 commits intoapache:mainfrom
Draft
Include hook name in suppressed listener-exception log#663951fanwang wants to merge 7 commits intoapache:mainfrom
1fanwang wants to merge 7 commits intoapache:mainfrom
Conversation
When a listener hookimpl raises, the suppressed-exception log line now identifies which hook raised. The previous generic "error calling listener" made it impossible to tell which of the registered hooks failed without re-reading the stack trace, especially painful when several listeners are registered and one of them sporadically misbehaves. The new format is "error calling listener for hook 'on_task_instance_<x>'". Existing listener-error suppression behavior is preserved unchanged. Scope: task instance listener call sites (task_runner.py, taskinstance.py on the API server retry path, and the manual-set-state path in the fastapi service). DagRun, asset, and dag-processing listener call sites follow the same pattern and can be migrated incrementally.
This was referenced May 5, 2026
stdlib Logger.exception accepts (msg, *args) but not arbitrary kwargs;
mypy flagged log.exception('msg', hook=name) as call-arg error in
taskinstance.py and other stdlib-Logger sites.
Reverting to format-string form which works for both stdlib and structlog
loggers. The structlog adapter interpolates the format args into the
event field, so cap_structlog still captures the rendered hook name.
Tests updated to match the rendered event field.
1fanwang
added a commit
to 1fanwang/airflow
that referenced
this pull request
May 6, 2026
Mirrors the same fix in PR-B (apache#66395) — extends to lifecycle and asset listener call sites.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
When a listener hookimpl raises, the suppressed-exception log line now
identifies which hook raised. The previous generic
log.exception(\"error calling listener\")made it impossible to tell whichof the registered hooks failed without re-reading the stack trace —
especially painful when several listeners are registered and one of them
sporadically misbehaves.
New format:
```text
error calling listener for hook 'on_task_instance_failed'
```
The existing substring
error calling listenerremains in the message,so any downstream log-grep tooling continues to match.
Scope
This PR covers the task instance listener call sites only, mirroring
the surface of #66394:
task-sdk/src/airflow/sdk/execution_time/task_runner.py— 5 sites(running, success, skipped, up_for_retry, failed)
airflow-core/src/airflow/models/taskinstance.py— 1 site (APIserver retry path)
airflow-core/src/airflow/api_fastapi/core_api/services/public/task_instances.py— split the single try/except wrapping three branches into per-branch
try/except so each log line names exactly one hook
DagRun, asset, dag-processing, and lifecycle listener call sites follow the
same pattern and can be migrated as a follow-up; keeping this PR narrow
makes the diff trivially reviewable.
Behavior
Listener-exception suppression is preserved — task execution is not
affected by listener failures.
Testing
test_listener_suppresses_exceptions(airflow-core) isextended to also assert the hook name appears in the captured log
output. It uses the existing
throwing_listenerfixture which raisesin
on_task_instance_success.test_listener_error_log_includes_hook_name(task-sdk)registers a listener that raises in
on_task_instance_success,drives the runner, and asserts
log.exceptionwas called with(\"error calling listener for hook %r\", \"on_task_instance_success\").^ Add meaningful description above
Read the Pull Request Guidelines for more information.
E2E validation
9/9 TI listener call sites updated. Lifecycle hooks (
on_starting,before_stopping) are scoped to the follow-up #66397.Real e2e validation (Airflow standalone)
Re-ran with
airflow standalone(real scheduler + API server + LocalExecutor + sqlite). Registered a listener that raises on every TI hook; observed the new log format on every suppressed exception:All 4 TI hooks now identify themselves by name in the log when an impl raises. Listener-exception suppression behavior is preserved — the DAGs that triggered these errors all completed normally (success/failed/skipped) regardless of the listener throwing.
Integrated mega-branch validation (all 7 PRs composed)
This PR was independently validated, plus all seven PRs in this stack (#66394, #66395, #66397, #66399, #66402, #66405, #66410) were merged onto a single branch and exercised end-to-end through real services —
airflow standalonerunning scheduler + API server + LocalExecutor + Postgres-equivalent (sqlite for the test). A single listener plugin declaring every new hook and parameter was registered, then 5 DAGs covering every state-transition path were triggered + a manual-set-state PATCH via the public API was issued. The listener log is below — every annotation maps a line to the PR that introduced it:What this validates jointly:
msg=...started,success,failed,skipped,up_for_retry,manually_set_to_failed)error: BaseException | NoneRuntimeError(wasstron PR-A alone)AirflowTaskCheckpointedrunning → checkpointedtransition observed at the listener and at the supervisor message boundaryfailure_detailskwargfailure_details=Noneflowing through every failure (no executor populates yet)checkpointed task=checkpoint_task checkpoint_data={'step': 5, ...}Repro
Bugs surfaced and fixed during this validation
This step caught 6 bugs that the layer-2 unit-test pass missed — every fix is a separate commit on its respective PR's branch:
AirflowTaskCheckpointedimport inrun()(NameError)_generated.pyTaskInstanceStatemissing CHECKPOINTED (AttributeError)TaskStatesupervisor message Literal rejected CHECKPOINTED (PydanticValidationError)_generated.pyIntermediateTIStatemissing CHECKPOINTEDfailure_details=Nonedefault silently received Noneon_task_instance_failedcall site missingfailure_detailskwarg (HookCallError: hook call must provide argument)Last two would have broken every task failure on apache/airflow
mainif the foundation PRs landed without the call-site fixes. The standalone-against-editable-install harness is a fast catch for this class.Documented gap (deliberately not fixed in this stack)
task-sdk/.../supervisor.py:STATES_SENT_DIRECTLYlists the states the worker sends to the supervisor with a dedicated direct-send branch. CHECKPOINTED is not in that list, so it falls back toclient.task_instances.finish()which the API server constrains to terminal states. The mega listener log shows the worker successfully loggingTask checkpointed; reporting CHECKPOINTED state.andon_task_instance_checkpointedfiring with the correct payload — but the DB row eventually transitions tofailedbecause the supervisor cannot persist CHECKPOINTED throughfinish(). This is the AIP-96 design knob (auto-resume vs manual-resume-only) we deliberately want the discussion to settle, not silently pick. Documented in #66402.