
Update to v2.0.0-alpha.1#944

Draft
xylar wants to merge 34 commits into MPAS-Dev:main from xylar:switch-to-mache-deploy

Conversation

@xylar
Collaborator

@xylar xylar commented Mar 21, 2026

This pull request updates to mache.deploy, which uses the ./deploy.py script instead of ./conda/configure-compass-env.py.

It switches to using pixi in the background for creating environments with conda packages.

Updates:

  • esmf v8.9.1
  • mache v3.6.1 -- brings in mache.deploy, mache.jigsaw and mache.parallel as well as module updates on many machines and several bug fixes
  • moab v5.6.0
  • albany tag compass-2026-03-21
  • trilinos tag compass-2026-02-06

Testing

Only testing MALI, as MPAS-Ocean is no longer being tested regularly on Compass.

Deployed

MALI with full_integration:

  • Chrysalis (@xylar)
    • gnu and openmpi
  • Perlmutter (@xylar)
    • gnu and mpich
    • gnugpu and mpich

@xylar xylar force-pushed the switch-to-mache-deploy branch from 9c93e54 to 0303b90 Compare March 21, 2026 13:14
@xylar xylar added documentation Improvements or additions to documentation enhancement New feature or request ci Changes affect Azure Pipelines CI MALI-Dev PR finished dependencies and deployment Changes relate to creating conda and Spack environments, and creating a load script framework dependencies Pull requests that update a dependency file labels Mar 21, 2026
@xylar xylar force-pushed the switch-to-mache-deploy branch from 0303b90 to ba7c900 Compare March 21, 2026 13:41
@xylar xylar force-pushed the switch-to-mache-deploy branch from 74bb269 to cec99ef Compare March 21, 2026 16:07
@xylar xylar force-pushed the switch-to-mache-deploy branch 2 times, most recently from fd971e0 to 7f43434 Compare March 31, 2026 17:58
@xylar
Collaborator Author

xylar commented Mar 31, 2026

@matthewhoffman and @trhille, to test this for now, use:

./deploy.py --with-albany --deploy-spack --mache-fork xylar/mache --mache-branch update-to-3.3.0 ...

This branch is needed until I tag a 3.3.0rc2 for mache.

@matthewhoffman
Member

@xylar, can you walk me through a few more details about the transition to deploy.py?

First off, is the mache branch in your previous comment out of date? The branch update-to-3.3.0 doesn't exist on your mache fork, so I used fix-mache-deploy-with-mache-rc instead. I invoked it with:

./deploy.py --with-albany --deploy-spack --mache-fork xylar/mache --mache-branch fix-mache-deploy-with-mache-rc --compiler gnu --mpi mpich --machine pm-cpu

It ran great for a while and seemed much faster than the old ./conda/configure-compass-env.py script. But after finishing the jigsaw build, it died with this error:

 Running:
   env -i bash -l /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/spack/build_compass_albany_gnu_mpich.bash

fatal: detected dubious ownership in repository at '/global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/dev_compass_2.0.0'
To add an exception for this directory, call:

	git config --global --add safe.directory /global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/dev_compass_2.0.0
Traceback (most recent call last):
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/bin/mache", line 6, in <module>
    sys.exit(main())
             ~~~~^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/__main__.py", line 21, in main
    args.func(args)
    ~~~~~~~~~^^^^^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/cli.py", line 91, in _dispatch_deploy
    run_deploy(args=args)
    ~~~~~~~~~~^^^^^^^^^^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/run.py", line 289, in run_deploy
    spack_results = deploy_spack_envs(
        ctx=ctx,
    ...<2 lines>...
        quiet=quiet,
    )
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/spack.py", line 374, in deploy_spack_envs
    _install_spack_env(
    ~~~~~~~~~~~~~~~~~~^
        ctx=ctx,
        ^^^^^^^^
    ...<10 lines>...
        quiet=quiet,
        ^^^^^^^^^^^^
    )
    ^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/spack.py", line 914, in _install_spack_env
    check_call(cmd, log_filename=log_filename, quiet=quiet)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/bootstrap.py", line 222, in check_call
    raise subprocess.CalledProcessError(
        process.returncode, commands, output=stdout_data
    )
subprocess.CalledProcessError: Command 'env -i bash -l /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/spack/build_compass_albany_gnu_mpich.bash' returned non-zero exit status 128.

ERROR: Deployment step failed (exit code 1). See the error output above.

Am I doing this wrong? Is this trying to deploy for the entire project? I don't think you want me interacting with /global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/dev_compass_2.0.0, do you?

@xylar
Collaborator Author

xylar commented Apr 3, 2026

@matthewhoffman, I'm sorry. I'm developing mache for 3 projects at once -- E3SM-Unified, Polaris and Compass. That's a situation I usually try to avoid for precisely this type of reason.

I needed to release mache 3.3.0 for Polaris yesterday. As a result, the update-to-3.3.0 branch is gone. But I neglected to update this Compass branch until just now. At this point, no --mache-fork and --mache-branch should be needed for testing.

You also don't want to deploy spack. That was a mistake in my command above.

./deploy.py --with-albany --compiler gnu --mpi mpich --machine pm-cpu

@xylar xylar force-pushed the switch-to-mache-deploy branch from d3cc399 to 21d2713 Compare April 3, 2026 07:54
@matthewhoffman
Member

Thanks, @xylar. I made a little more progress with the command you suggested. I had to make this change:

diff --git a/deploy/cli_spec.json b/deploy/cli_spec.json
index 56a951b1f..ebdea2c4c 100644
--- a/deploy/cli_spec.json
+++ b/deploy/cli_spec.json
@@ -1,7 +1,7 @@
 {
   "meta": {
     "software": "compass",
-    "mache_version": "3.3.0rc2",
+    "mache_version": "3.3.0",
     "description": "Deploy compass environment"
   },
   "arguments": [
diff --git a/deploy/pins.cfg b/deploy/pins.cfg
index bfe79a90e..6ca63db19 100644
--- a/deploy/pins.cfg
+++ b/deploy/pins.cfg
@@ -4,7 +4,7 @@ bootstrap_python = 3.14
 python = 3.14
 esmf = 8.9.1
 geometric_features = 1.6.1
-mache = 3.3.0rc2
+mache = 3.3.0
 mpas_tools = 1.4.0
 otps = 2021.10
 parallelio = 2.6.9

but then I still ran into an issue of it trying to touch the deployed spack env in the e3sm project space:

 Running:
   source /global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/dev_compass_2.0.0/share/spack/setup-env.sh
   spack env activate compass_albany_gnu_mpich
   spack config add modules:prefix_inspections:lib:[LD_LIBRARY_PATH]
   spack config add modules:prefix_inspections:lib64:[LD_LIBRARY_PATH]

==> Error: cannot write to config file [Errno 13] Permission denied: '/global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/dev_compass_2.0.0/var/spack/environments/compass_albany_gnu_mpich/.spack.yaml.tmp'
Traceback (most recent call last):
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/hooks.py", line 103, in run_hook
    result = func(context)
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy/hooks.py", line 62, in post_spack
    _set_ld_library_path_for_spack_env(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        ctx=ctx,
        ^^^^^^^^
        spack_path=spack_path,
        ^^^^^^^^^^^^^^^^^^^^^^
        env_name=env_name,
        ^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy/hooks.py", line 212, in _set_ld_library_path_for_spack_env
    check_call(
    ~~~~~~~~~~^
        commands,
        ^^^^^^^^^
        log_filename=_get_log_filename(ctx),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        quiet=bool(getattr(ctx.args, 'quiet', False)),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/deploy_tmp/bootstrap_pixi/.pixi/envs/default/lib/python3.14/site-packages/mache/deploy/bootstrap.py", line 222, in check_call
    raise subprocess.CalledProcessError(
        process.returncode, commands, output=stdout_data
    )

@xylar
Collaborator Author

xylar commented Apr 3, 2026

I had to make this change:

I think that's in 21d2713. Did you not have that commit or did I miss something?

@xylar
Collaborator Author

xylar commented Apr 3, 2026

but then I still ran into an issue of it trying to touch the deployed spack env in the e3sm project space:

Yep, that's something I need to fix. Sorry about that!

@xylar
Collaborator Author

xylar commented Apr 3, 2026

@matthewhoffman, the second issue should be fixed.

@matthewhoffman
Member

@xylar, thanks for addressing the second issue. The first must have been because I had failed to update my local branch this morning. After updating to 160d75d, ./deploy.py runs successfully and I'm able to load the compass env. I will move on to trying to build MALI next. One question: do you plan to add the version number back to the load_compass_pm-cpu_gnu_mpich.sh script that gets generated?

@xylar
Collaborator Author

xylar commented Apr 3, 2026

One question - do you plan to add the version number back to the load_compass_pm-cpu_gnu_mpich.sh script that gets generated?

The Compass version is in there:

export MACHE_DEPLOY_TARGET_VERSION="2.0.0-alpha.1"

It's just called something different than before. We can copy that into another environment variable if you need it.

@xylar
Collaborator Author

xylar commented Apr 3, 2026

Oh, wait, it already is:

export COMPASS_VERSION="2.0.0-alpha.1"
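To check quickly from a shell, you can source your generated load script and echo the variable. In essence (a sketch with the values hard-coded rather than sourcing the real script):

```shell
# Sketch only: these two export lines stand in for the generated load
# script (which sets much more than this).
export MACHE_DEPLOY_TARGET_VERSION="2.0.0-alpha.1"
export COMPASS_VERSION="2.0.0-alpha.1"
echo "${COMPASS_VERSION}"
```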

@xylar
Collaborator Author

xylar commented Apr 3, 2026

Are you not seeing that in your load script?

@matthewhoffman
Member

matthewhoffman commented Apr 3, 2026

I just mean the name of the load script used to have the version in the filename, but I'm not seeing that. It's not a big deal, I was just wondering if that was intentional.

As for progress: when I compile MALI, I am seeing the same PIO lib errors that you describe in the issue you opened. I'm working on debugging them with help from ChatGPT, and so far the obvious things are not working, but I'll keep at it while I have time.

@xylar
Collaborator Author

xylar commented Apr 3, 2026

I see. No, the load script won't include the compass version anymore. I didn't find that to be particularly useful.

@xylar xylar force-pushed the switch-to-mache-deploy branch from 069483c to d121b6b Compare April 24, 2026 07:01
@matthewhoffman
Member

@xylar and @mperego , I think we should go ahead and remove the exodus output from the tests. We had added it at some point because when runs fail it is sometimes useful to be able to look at the velocity solution on the exo mesh to see what's going on, and it's convenient to not have to rerun the tests. But I can't remember the last time I've actually had to look at them, so it's not a big inconvenience to disable them again, and that seems a much better use of time than trying to debug these libraries.

@xylar , are you ok if I push a commit to your branch that makes the changes to disable the exo output? That way we can test everything in this branch.

@matthewhoffman
Member

Also, is there anything I should be aware of in your push since the last discussion messages? Should I rebuild my env locally before testing again?

@xylar
Collaborator Author

xylar commented Apr 30, 2026

@xylar , are you ok if I push a commit to your branch that makes the changes to disable the exo output? That way we can test everything in this branch.

Yes, go for it.

@xylar
Collaborator Author

xylar commented Apr 30, 2026

Also, is there anything I should be aware of in your push since the last discussion messages? Should I rebuild my env locally before testing again?

Yes, you need to rebuild. I switched to a much newer version of mache.

I think I also rebuilt the spack environments, but it's hard to keep track of everything right now :-(

@matthewhoffman
Member

@xylar, I'm getting a lot of failures, and the problems are surprisingly occurring in mpas_tools, in mesh/conversion.py. Before I dig further, do you have any insight into why that might be?

It looks like every test that is creating a mesh is failing on that step. Here is the first example (landice_dome_2000m_sia_restart_test):

compass calling: compass.landice.tests.dome.restart_test.RestartTest.run()
  inherited from: compass.testcase.TestCase.run()
  in /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/testcase.py

compass calling: compass.run.serial._run_test()
  in /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/run/serial.py

Running steps:
  setup_mesh
  full_run
  restart_run

  * step: setup_mesh

compass calling: compass.landice.tests.dome.setup_mesh.SetupMesh.runtime_setup()
  inherited from: compass.step.Step.runtime_setup()
  in /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/step.py


compass calling: compass.landice.tests.dome.setup_mesh.SetupMesh.run()
  in /global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/landice/tests/dome/setup_mesh.py

      Failed
Exception raised while running the steps of the test case
Traceback (most recent call last):
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/run/serial.py", line 322, in _log_and_run_test
    _run_test(test_case, available_resources)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/run/serial.py", line 419, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              available_resources)
              ^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/run/serial.py", line 470, in _run_step
    step.run()
    ~~~~~~~~^^
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/compass/landice/tests/dome/setup_mesh.py", line 68, in run
    dsMesh = cull(dsMesh, logger=logger)
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/pixi-env/.pixi/envs/default/lib/python3.14/site-packages/mpas_tools/mesh/conversion.py", line 126, in cull
    dsIn = _masks_to_int(dsIn)
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/pixi-env/.pixi/envs/default/lib/python3.14/site-packages/mpas_tools/mesh/conversion.py", line 249, in _masks_to_int
    dsOut = xr.Dataset(dsIn, attrs=dsIn.attrs)
  File "/global/cfs/cdirs/fanssie/users/hoffman2/compass/v2.0.0-alpha.1/pixi-env/.pixi/envs/default/lib/python3.14/site-packages/xarray/core/dataset.py", line 389, in __init__
    raise TypeError(
    ...<2 lines>...
    )
TypeError: Passing a Dataset as `data_vars` to the Dataset constructor is not supported. Use `ds.copy()` to create a copy of a Dataset.
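The failure mode can be reproduced and fixed in isolation. Here is a toy example of my own (not the actual mpas_tools code) showing the pattern newer xarray rejects and the supported replacement:

```python
import xarray as xr

# Toy dataset standing in for the MPAS mesh dataset in conversion.py.
ds = xr.Dataset({"cullCell": ("nCells", [1, 0, 1])}, attrs={"source": "demo"})

# Old pattern, now rejected by recent xarray with the TypeError above:
#     ds_out = xr.Dataset(ds, attrs=ds.attrs)
# Supported replacement: copy the dataset, then update its attributes.
ds_out = ds.copy()
ds_out.attrs.update(ds.attrs)

print(list(ds_out.data_vars), ds_out.attrs["source"])
```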

These versions fix an issue with copying datasets in the latest
xarray.
@xylar
Collaborator Author

xylar commented May 5, 2026

@matthewhoffman, sorry about that. These are fixed in the latest mpas_tools and geometric_features. I have switched to those versions.

Another case where the redundant maintenance burden between Compass and Polaris means that things get lost, I'm afraid.

xylar added 2 commits May 5, 2026 08:01
@matthewhoffman
Member

No problem, @xylar! Thanks for updating the branch so quickly. I'm no longer seeing that error during mesh generation. I'll give a broader update once my tests complete and I've had a chance to apply the update disabling the exodus output.

Due to issues with i/o library linking in this branch, Albany is dying
with FPE when MALI is compiled in debug mode.  The easiest solution is
to disable Exodus output being created by default in MALI's compass
tests.  If one needs to use it, they will need to compile MALI in
non-debug mode (or at least without the -ffpe-trap compiler flag).
@matthewhoffman
Member

After a few false starts, I managed to successfully test this branch with the full_integration suite on pm-cpu, using a baseline of compass=0418b56e3 (1.9.0-alpha.2). For both the baseline and test run, MALI-Dev/E3SM version=f0521783f7 and MALI is compiled with DEBUG=true. The test includes disabling exodus output from Albany as set by the most recent commit on this branch.

All tests without Albany pass, including baseline comparison. All tests with Albany pass execution and validation but fail baseline comparison, which is expected with the updated Albany and Trilinos versions. I checked that the baseline comparison diffs are, as expected, small: <1e-3 m for thickness and <1e-7 m/s for normalVelocity across all tests, with some tests much smaller.
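For reference, the kind of threshold check described here can be sketched as follows (illustrative arrays, not the actual test data):

```python
import numpy as np

# Illustrative values only: confirm that baseline-comparison diffs stay
# below the thresholds quoted above (1e-3 m thickness, 1e-7 m/s velocity).
thickness_base = np.array([100.0, 250.0, 300.0])
thickness_test = thickness_base + 4e-4  # perturbation below 1e-3 m

velocity_base = np.array([1.2e-6, 3.4e-6, 5.6e-6])
velocity_test = velocity_base + 2e-8  # perturbation below 1e-7 m/s

max_dh = np.max(np.abs(thickness_test - thickness_base))
max_dv = np.max(np.abs(velocity_test - velocity_base))
print(max_dh < 1e-3, max_dv < 1e-7)
```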

full_integration results
Test Runtimes:
00:27 PASS landice_dome_2000m_sia_restart_test
00:05 PASS landice_dome_2000m_sia_decomposition_test
00:06 PASS landice_dome_variable_resolution_sia_restart_test
00:04 PASS landice_dome_variable_resolution_sia_decomposition_test
00:29 PASS landice_enthalpy_benchmark_A
00:16 PASS landice_eismint2_decomposition_test
00:14 PASS landice_eismint2_enthalpy_decomposition_test
00:15 PASS landice_eismint2_restart_test
00:15 PASS landice_eismint2_enthalpy_restart_test
00:13 PASS landice_greenland_sia_restart_test
00:08 PASS landice_greenland_sia_decomposition_test
00:40 PASS landice_hydro_radial_restart_test
00:10 PASS landice_hydro_radial_decomposition_test
00:18 PASS landice_humboldt_mesh-3km_decomposition_test_velo-none_calving-none_subglacialhydro
00:49 PASS landice_humboldt_mesh-3km_restart_test_velo-none_calving-none_subglacialhydro
00:16 FAIL landice_dome_2000m_fo_decomposition_test
00:15 FAIL landice_dome_2000m_fo_restart_test
00:10 FAIL landice_dome_variable_resolution_fo_decomposition_test
00:10 FAIL landice_dome_variable_resolution_fo_restart_test
00:14 FAIL landice_circular_shelf_decomposition_test
00:37 FAIL landice_greenland_fo_decomposition_test
00:35 FAIL landice_greenland_fo_restart_test
00:20 FAIL landice_thwaites_fo_decomposition_test
00:29 FAIL landice_thwaites_fo_restart_test
00:12 FAIL landice_thwaites_fo-depthInt_decomposition_test
00:20 FAIL landice_thwaites_fo-depthInt_restart_test
00:48 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo_calving-von_mises_stress_damage-threshold_faceMelting
00:19 FAIL landice_humboldt_mesh-3km_restart_test_velo-fo-depthInt_calving-von_mises_stress_damage-threshold_faceMelting
Total runtime 09:31
FAIL: 13 tests failed, see above.

With this set of tests, this branch is approved for current functionality! Before approving the PR, I still need to test the new debris friction functionality in MALI, which will require a throwaway merge of #938 and associated MALI throwaway merge.

@matthewhoffman
Member

Testing with debris-friction updates

This comment describes throwaway testing for the in-waiting debris-friction feature in MALI and in-waiting debris-friction tests in compass. Testing is performed by merging each of those branches into the respective branches for MALI and compass used in the previous comment. These merges will occur after this v2.0.0-alpha.1 env is merged, but I need to ensure the Albany updates in this PR work correctly with those features, because that was the original reason for updating Albany in compass that led to this PR.

Results of new debris friction tests:

landice/mismipplus/smoke_test/2000m/regularized_coulomb
  * step: run_model
  test execution:      SUCCESS
  test validation:     PASS
  baseline comparison: PASS
  test runtime:        00:18
landice/mismipplus/smoke_test/2000m/debris_friction
  * step: run_model
  test execution:      SUCCESS
  test validation:     PASS
  baseline comparison: PASS
  test runtime:        00:21

Based on this, it appears everything is working as intended.

@matthewhoffman matthewhoffman self-requested a review May 5, 2026 21:59
Member

@matthewhoffman matthewhoffman left a comment


Approving based on test results. I've only cursorily skimmed the changes.

