
Conversation

@roystgnr (Member)

This should accommodate users who need to exit suddenly (without manually unwinding a stack) and quietly (without distracting from their own exit output) while still cleaning up (so they don't get segfaults when dangling references to our custom output buffers interact with iostreams cleanup later).

We don't currently have libMesh infrastructure for testing output of
lots of executable runs, so this will have to do until we get to MOOSE
CI.
@loganharbour (Member) left a comment


This looks good to me


void libmesh_terminate_handler()
{
  bool print_debug_info = true;
@loganharbour (Member)


Suggested change:
- bool print_debug_info = true;
+ bool print_libmesh_info = true;

I don't know the best name here, but this feels a little better to me. We're still printing the std_ex.what(), so it feels like we're still printing some debug info. And this boolean also controls more things than strictly debug info, such as perf log output. Anyway, this isn't a big deal.

@roystgnr (Member, Author)


We're actually not printing std_ex.what() if that gets set to off: TerminationException doesn't inherit from std::exception and doesn't even implement what().

"debug info" is a kind of bad name for something that includes the perf log, though, yeah. If this was a public-facing identifier or if this wasn't blocking a huge update I'd change it, but now that Logan reports it (plus a tiny MOOSE patch to use it) works for his segfault issues, I want to get this merged ASAP.

@jwpeterson (Member)

If I understand correctly, the idea here is to create an unlikely-to-be-caught kind of exception so that the terminate handler is reached. But isn't it still implementation-defined whether the stack is unwound or not for an uncaught exception? I'm not against the change; just curious whether there's any way to guarantee the stack won't be unwound.

@moosebuild

Job Coverage, step Generate coverage on 1693412 wanted to post the following:

Coverage

         b252b8   #4369 169341
         Total    Total         +/-      New
Rate     65.31%   65.31%        -0.01%   0.00%
Hits     77568    77571         +3       0
Misses   41196    41208         +12      20

Diff coverage report

Full coverage report

Warnings

  • New line coverage rate 0.00% is less than the suggested 90.0%

This comment will be updated on new commits.

@roystgnr (Member, Author)

It is implementation-defined whether the stack will be unwound before an uncaught exception hits the terminate handler. In the case of unwinding ... damn, that's actually a problem (albeit a theoretical and pre-existing problem), isn't it? We'll hit all the FooFinalize() APIs instead of getting to MPI_Abort(), and if we only threw the uncaught exception on a proper subset of ranks then we'll hang in MPI_Finalize() on those ranks while we're in MPI_Whatever() on the rest.

Maybe we really just need to expose the terminate handler (or rather a terminate function with parameters that the handler function can call with default values)? But even that might not work! One process calls MPI_Abort() first, then the rest get killed by the MPI implementation, somehow (it's not specified by the MPI standard!). So even if the process which called MPI_Abort() does its cleanup first, we may still end up with other processes hitting iostreams cleanup (via static destructors or some C++ magic) without hitting our cleanup_stream_buffers() first, at which point those ranks could segfault.

Maybe we could put critical cleanup into an MPI error handler callback? Hand that to MPI_Comm_set_errhandler() and then MPI_Abort() is supposed to call our handler, I think on every process.

@roystgnr (Member, Author)

> Hand that to MPI_Comm_set_errhandler() and then MPI_Abort() is supposed to call our handler, I think on every process.

No, I think I'm misunderstanding the docs. That error handler isn't for when an abort hits you, it's for when any MPI function associated with that communicator (including MPI_Abort) is about to return an error.

So I'm again out of ideas. OpenMPI says it aborts processes via SIGTERM, so we could set a signal handler for that. With MPICH it looks like it might also be SIGTERM, but it depends on the process manager.

Maybe what we ought to be doing here is using dumb pointers to manage our _foo_syncd_thread_buffer objects? If what's going wrong is that we're hitting global destructors but not hitting cleanup, then the unique_ptrs we use are deleting those buffers without disconnecting them from the output streams first, which of course makes any subsequent output stream usage undefined behavior: a dangling pointer dereference. But if we were using dumb pointers and manually deleting their targets after swapping them out, then we'd only ever delete after swapping. We'd now have a memory leak we might have to tell valgrind to ignore, but better that than a segfault.
