avoid crash/assert on periodic events with too large a random spread#19
Open
hmh wants to merge 3 commits intoschoenw:masterfrom
Open
avoid crash/assert on periodic events with too large a random spread#19hmh wants to merge 3 commits intoschoenw:masterfrom
hmh wants to merge 3 commits intoschoenw:masterfrom
Conversation
Internal errors (bugs) could cause lmapd to call event_free() on a NULL pointer. event_free() is *not* documented to handle NULL pointers, unlike free(), and in fact it does not handle them well at all (at least on libevent2 2.1). That could cause a NULL pointer derreference, and a crash. It is better to cope with it: log the problem, and keep the daemon alive. This issue has been triggered in production, it can happen. CEPTRO.br-issue: MDC-1016
Should the random spread mechanism cause two firings of an event to accumulate, we'd call event_gaga() on an event that already has event->fire_event set and active. This either causes an assertion failure on event_gaga() when trying to add the second event while the first is still active, or if assertions are disabled (production build), it overwrites event->fire_event, forgetting about the already active event. This causes two "fire" events to be active, something the current code is not prepared to support. The obvious issue is that fire_cb() will run twice for the same (lmapd) struct event, which would crash before the change of that codepath to safe_event_free(). But there could be other issues too, such as incorrect event timestamp metadata. Explicitly detect that an event is already active in event_gaga() and output an error message (and cope gracefully, by losing the new event). This ensures that the event that does trigger will be handled correctly and have correct metadata, etc.
Refuse configurations that would knowingly cause periodic event reordering due to a too large random spread. It would be nice to detect this also for the calendar event type, but that's non-trivial, and not worth the hassle now that lmapd has been changed to handle this gracefully (instead of triggering a crash or assertion failure). CEPTRO.br-issue: MDC-1016
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
On production builds where assert() does nothing, a large random spread can cause event_gaga() to schedule a libevent2 "event->fire_event" past the time where the next libevent2 "event->trigger_event" will run.
When that trigger event runs, it schedules a new fire_event (which might come earlier or later than the one already schedule, depending on the two spreads). libevent2 suports this, and will callback fire_cb (the callback for event->fire_event) twice. But fire_cb does not, and lmapd will lose track of the situation. Eventually, one fire_cb will run with event->fire_event already set to NULL, and crash.
Also, events could be reordered by this situation.
On debug builds, you hit an assert at runtime when the situation first happens and event_gaga is called with event->fire_event already pending. The assert() will kill lmapd with a signal, and it will die without any exit cleanups (atexit handlers do not run in this situation).
The proposed fix attempts to turn such a situation (arguably caused by user error, but it can be hard to predict in complex calendar events) into properly logged warnings/errors and recover sanely, so that lmapd isn't stopped and measurements can continue, etc.