Fix CloudWatch remote logging for ephemeral lifecycle executor#68779
Open
jason810496 wants to merge 2 commits into
Open
Fix CloudWatch remote logging for ephemeral lifecycle executor#68779jason810496 wants to merge 2 commits into
jason810496 wants to merge 2 commits into
Conversation
Member
Author
|
cc @sarvesh371, @seanghaeli Could you verify this patch for your setup when you have a moment? Since it likely #66633 won't catch the 3.3 release (we're close to dev freeze for 3.3), so we might release this provider-only patch first. Thanks. |
The streaming CloudWatch handler is rebuilt whenever it reports
shutting_down, so logs survive configure_logging() closing it. But
shutting_down alone cannot tell a mid-task close apart from genuine
teardown, so a record arriving after teardown would spin up an orphan
handler and its background queue thread that nobody flushes or closes.
The supervisor lifecycle makes the two cases distinguishable in time:
1. configure_logging() builds the handler via remote.processors
(processors does `_ = self.handler`), registering it in
logging._handlerList.
2. The same call then runs dictConfig, whose non-incremental reset
closes that handler -> watchtower sets shutting_down=True.
3. Child log records stream through proc -> self.handler, which sees
shutting_down and rebuilds. This is the case we must keep working.
4. At the last possible moment _upload_logs() -> upload() -> close()
flushes; nothing logs after this.
shutting_down is watchtower's flag set by dictConfig (step 2); the new
_closed flag is ours, set only by close() (step 4). dictConfig never
touches _closed, so the rebuild in step 3 still fires, while a late
record after step 4 keeps the closed handler instead of orphaning a new
one. close() on the outer CloudwatchTaskHandler now closes the handler
the IO is currently using rather than the reference captured in
set_context(), which dictConfig may have closed and the IO since rebuilt.
a92793f to
f28925a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
While trying to setup cloudwatch remote logging in #68709 in order to persist the logs in real time. I encounter the same errors as above listed issues.
The root cause I found is same as #66475 (comment) pointed out. The
configure_logging -> dictConfig -> _clearExistingHandlerscall chain shutdown the watchtower handler.How
I went to the another than #66633, instead of configuring the
processorsafter thedictConfigcall. We could make the cloudwatch remote logger itself self-healing by creating the fresh instance if previous instance wasshutdownby thedictConfigcall but also ensure the.closesemantic by guarding with the_closestate.What
Fix the lifecycle issue of cloudwatch remote logging and verify with
breeze k8ssystem test with provider only changes without touching the Task-SDK changes.