xios: null-guard CXios::globalRegistry deref in CContext::finalize (attached-mode SIGSEGV) #6

Merged
JanStreffing merged 1 commit into master from fix/xios-finalize-segv
May 3, 2026

Conversation

@JanStreffing
Contributor

Summary

Fixes a SIGSEGV at xios_context_finalize for FESOM_WITH_XIOS=ON runs in attached mode (using_server=false), discovered while bringing up the new fesom2_xios.yml CI run-test on FESOM/fesom2.

In src/node/context.cpp's CContext::finalize, the "Mode attache" branch dereferences CXios::globalRegistry unconditionally:

if (server->intraCommRank==0) CXios::globalRegistry->mergeRegistry(*registryOut) ;

CXios::globalRegistry is only allocated on client rank 0 (cxios.cpp:112), but the merge is gated on server->intraCommRank==0. In attached mode those two ranks need not coincide, so a process that passes the server-rank check without being client rank 0 dereferences a null pointer and SIGSEGVs in xios::CRegistry::mergeRegistry. Files were already flushed by closeAllFile() earlier in the same branch, so the crash leaves the model output intact, but the process exits with status 139.

The fix adds a null check on CXios::globalRegistry to that line. In server mode the patched line is equivalent to the original (globalRegistry is always non-null on the server's rank 0); in attached mode it lets runs exit cleanly. The change is applied to both the fesom2_ci/xios/ and fesom2_test_refactoring/xios/ source trees.
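The guard described above can be sketched in isolation as follows. This is a minimal illustration and not the actual XIOS source: Registry, globalRegistry and mergeIfOwner are hypothetical stand-ins for xios::CRegistry, CXios::globalRegistry and the patched line in CContext::finalize, and the exact spelling of the null check in the real patch is an assumption.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for xios::CRegistry (not the real class).
struct Registry {
    std::map<std::string, int> entries;

    // Copy every entry of `other` into this registry.
    void mergeRegistry(const Registry& other) {
        for (const auto& kv : other.entries) entries[kv.first] = kv.second;
    }
};

// Hypothetical stand-in for CXios::globalRegistry. It stays null on every
// process that did not allocate it (in XIOS: every process except client rank 0).
static std::unique_ptr<Registry> globalRegistry;

// Sketch of the patched condition: merge only when this process passes the
// server-rank check AND actually owns a global registry.
// Returns true if the merge was performed, false if it was skipped.
bool mergeIfOwner(int serverIntraCommRank, const Registry& registryOut) {
    if (serverIntraCommRank == 0 && globalRegistry != nullptr) {
        globalRegistry->mergeRegistry(registryOut);
        return true;
    }
    return false;  // unpatched code would dereference a null pointer here
}
```

With globalRegistry left null, mergeIfOwner(0, out) skips the merge instead of crashing. Once the registry is allocated, the same call merges as before, which mirrors the claim that server-mode behaviour is unchanged.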

Test plan

  • Verified locally on Levante (intel-oneapi-compilers/2022.0.1 + openmpi/4.1.2 + netcdf-c/4.8.1, 2-rank pi mesh, 6 XIOS-output fields sst/a_ice/temp/salt/u/v): without patch the run produces all 6 netcdf files but exits 139 with Caught signal 11 in xios::CRegistry::mergeRegistry; with patch the run produces the same 6 netcdf files and exits cleanly with fesom should stop with exit status = 0.
  • docker-publish.yml rebuilds both fesom2_ci-master and fesom2_test_refactoring-master.
  • FESOM/fesom2 PR #902's fesom2_xios.yml workflow passes end-to-end against the new images.
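The exit status 139 and "Caught signal 11" in the first bullet are the same event seen two ways: shells conventionally report a signal-terminated process as 128 plus the signal number. A small sketch of that arithmetic, with hypothetical helper names and assuming the usual SIGSEGV value of 11 as on Linux:

```cpp
#include <csignal>

// Hypothetical helper: recover the signal number from a shell-style exit
// status (128 + signal). Returns 0 when the status does not encode a signal.
int signalFromExitStatus(int status) {
    return (status > 128) ? status - 128 : 0;
}

// True if the status corresponds to termination by SIGSEGV.
bool exitedOnSegv(int status) {
    return signalFromExitStatus(status) == SIGSEGV;
}
```

A CI step that only checks for a zero exit status treats 139 as a failure even when all output files were written, which is why the crash surfaced in the run-test.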

Notes

  • The patch is identical to one carried in some downstream XIOS forks. We could upstream it to the IPSL XIOS svn separately.
  • fesom2_ci/xios/ and fesom2_test_refactoring/xios/ carry duplicated source today (per the note in #5, "fesom2_test_refactoring: add XIOS 2.5 at /xios for upcoming xios run-test"). A future cleanup that moves the XIOS source to a shared repo-root _shared/xios/ will collapse this two-file change to one.

In attached mode (using_server=false) CContext::finalize hits the
'Mode attache' branch which dereferences CXios::globalRegistry
unconditionally. CXios::globalRegistry is only allocated on client
rank 0 (cxios.cpp:112) but the merge is gated on
server->intraCommRank==0; in attached mode those two ranks need not
coincide, so a non-rank-0 process segfaults in
xios::CRegistry::mergeRegistry at xios_context_finalize.

Add a null check; the patched line is equivalent for server mode
(globalRegistry is always non-null on the server's rank 0) but lets
attached-mode FESOM_WITH_XIOS=ON runs exit cleanly.

Verified locally on Levante (intel + openmpi, 2-rank pi mesh, 6
XIOS-output fields): without patch SIGSEGV at xios_context_finalize, with
patch 'fesom should stop with exit status = 0'.
@JanStreffing merged commit 78fc0da into master on May 3, 2026
5 checks passed