xios: null-guard CXios::globalRegistry deref in CContext::finalize (attached-mode SIGSEGV) #6
Merged
In attached mode (using_server=false), CContext::finalize takes the 'Mode attache' branch, which dereferences CXios::globalRegistry unconditionally. CXios::globalRegistry is only allocated on client rank 0 (cxios.cpp:112), but the merge is gated on server->intraCommRank==0. In attached mode those two ranks need not coincide, so a non-rank-0 process segfaults in xios::CRegistry::mergeRegistry at xios_context_finalize. Add a null check: the patched line is equivalent for server mode (globalRegistry is always non-null on the server's rank 0) but lets attached-mode FESOM_WITH_XIOS=ON runs exit cleanly. Verified locally on Levante (intel + openmpi, 2-rank pi mesh, 6 XIOS-output fields): without the patch, SIGSEGV in xios_close; with the patch, 'fesom should stop with exit status = 0'.
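The rank mismatch described above can be illustrated with a minimal, self-contained sketch. It is not the XIOS source: `Registry`, `Process`, `makeProcess`, and `finalizeGuarded` are hypothetical stand-ins, chosen only to show why gating the merge on the server rank alone is unsafe when the pointer is allocated by client rank, and how an added null check makes the non-owning process skip the merge.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a registry type; not xios::CRegistry.
struct Registry {
    std::vector<std::string> keys;
    void mergeRegistry(const Registry& other) {
        keys.insert(keys.end(), other.keys.begin(), other.keys.end());
    }
};

// Hypothetical per-process state. The two ranks are tracked separately
// because, in the scenario described, they need not coincide.
struct Process {
    int clientRank = 0;
    int serverIntraCommRank = 0;
    std::unique_ptr<Registry> globalRegistry;  // null unless clientRank == 0
    Registry contextRegistry;
};

// Mirrors the allocation rule from the description: only client rank 0
// owns a global registry; every other process keeps a null pointer.
Process makeProcess(int clientRank, int serverIntraCommRank) {
    assert(clientRank >= 0 && serverIntraCommRank >= 0);
    Process p;
    p.clientRank = clientRank;
    p.serverIntraCommRank = serverIntraCommRank;
    p.contextRegistry.keys.push_back("ctx");
    if (clientRank == 0) {
        p.globalRegistry = std::make_unique<Registry>();
    }
    return p;
}

// Guarded merge: the rank test alone is not enough, so the pointer is
// checked as well. Returns true if the merge was performed.
bool finalizeGuarded(Process& p) {
    if (p.serverIntraCommRank == 0 && p.globalRegistry != nullptr) {
        p.globalRegistry->mergeRegistry(p.contextRegistry);
        return true;
    }
    return false;
}
```

In this sketch, a process built with `makeProcess(1, 0)` passes the rank test but holds a null pointer. Without the second condition it would dereference null; with it, the merge is skipped and the function returns `false`. When both ranks are 0, the second condition is always true, so behaviour is unchanged.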
Summary
Fixes a SIGSEGV at `xios_context_finalize` for FESOM_WITH_XIOS=ON runs in attached mode (using_server=false), discovered while bringing up the new `fesom2_xios.yml` CI run-test on FESOM/fesom2.

In `src/node/context.cpp`'s `CContext::finalize`, the "Mode attache" branch dereferences `CXios::globalRegistry` unconditionally. `CXios::globalRegistry` is only allocated on client rank 0 (cxios.cpp:112) but the merge is gated on `server->intraCommRank==0`; in attached mode those two ranks need not coincide, so a non-rank-0 process hits the deref on a null pointer and SIGSEGVs in `xios::CRegistry::mergeRegistry`. Files were already flushed by `closeAllFile()` earlier in the same branch, so the crash leaves the model output intact but the process exits with status 139.

The fix is a one-token addition of a null check; the patched line is equivalent for server mode (`globalRegistry` is always non-null on the server's rank 0) and lets attached-mode runs exit cleanly. Applied to both `fesom2_ci/xios/` and `fesom2_test_refactoring/xios/` source trees.

Test plan
- Without the patch, the run reports `Caught signal 11` in `xios::CRegistry::mergeRegistry`; with the patch, the run produces the same 6 netCDF files and exits cleanly with `fesom should stop with exit status = 0`.
- `docker-publish.yml` rebuilds both `fesom2_ci-master` and `fesom2_test_refactoring-master`.
- `fesom2_xios.yml` workflow passes end-to-end against the new images.

Notes
`fesom2_ci/xios/` and `fesom2_test_refactoring/xios/` carry duplicated source today (per the note in "fesom2_test_refactoring: add XIOS 2.5 at /xios for upcoming xios run-test" #5). A future cleanup that moves the XIOS source to a shared repo-root `_shared/xios/` will collapse this two-file change to one.