Skip to content

Plain MD restart support#1884

Open
jthorton wants to merge 3 commits intomainfrom
md_restarts
Open

Plain MD restart support#1884
jthorton wants to merge 3 commits intomainfrom
md_restarts

Conversation

@jthorton
Copy link
Collaborator

@jthorton jthorton commented Mar 23, 2026

Fixes #1719, #1720, #1721
This PR splits up the plain MD protocol unit into a setup and simulation unit and adds the ability to restart failed simulations.

Notes

  • The simulation must reach the production stage to generate a restart file, failures at the equilibration stages will not trigger a restart and will start again.
  • Restarts using the state checkpoint of openmm and are portable between hardware
  • The versions of openfe, gufe and openmm are required to be the same inorder for restarts to be safe

Checklist

  • All new code is appropriately documented (user-facing code must have complete docstrings).
  • Added a news entry, or the changes are not user-facing.
  • Ran pre-commit: you can run pre-commit locally or comment on this PR with pre-commit.ci autofix.

Manual Tests: these are slow so don't need to be run every commit, only before merging and when relevant changes are made (generally at reviewer-discretion).

Developers certificate of origin

@jthorton
Copy link
Collaborator Author

pre-commit.ci autofix

@github-actions
Copy link

No API break detected ✅

@IAlibay IAlibay self-requested a review March 23, 2026 13:21
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. There's a few areas where it might be good to enhance things, especially when it comes maybe just sticking on the checkpoint reporter the whole way through the simulation, but they wouldn't affect the overall organization of the code.

return pdbs


class PlainMDProtocol(gufe.Protocol):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] It may be good to take the opportunity to move Protocols, results, settings and units to their own files.

equilibration_length=1.0 * unit.nanosecond,
production_length=5.0 * unit.nanosecond,
),
output_settings=MDOutputSettings(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is changing the filename necessary or just a nice-to-have?

Exception if anything failed
"""

if verbose:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] It would be nice to try to split up some of the code here into their own class methods. E.g. _prepare, _get_settings, etc.. like we do in the AFE and RFE Protocols).

We're not in a situation where we can deduplicate across Protocols yet, however attempting to do so in each Protocol would be useful in finding areas of duplication.

integrator_settings = protocol_settings.integrator_settings

# is the timestep good for the mass?
settings_validation.validate_timestep(forcefield_settings.hydrogen_mass, timestep)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] we should add a _validate method to the Protocol.

# g. Save the system and positions to file
system_outfile = shared_basepath / "system.xml.bz2"
serialization.serialize(stateA_system, system_outfile)
# not really need if we save out the pre-minimized file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somewhat agree, however PDBs are low precision, savin the npy is good in my opinion.

traj.save_pdb(shared_basepath / output_settings.minimized_structure)
# equilibrate
# NVT equilibration
if equil_steps_nvt:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] Thinking about this - is there any reason we can't just add the checkpoint reporter at this stage? It's more likely that someone running the PlainMD protocol would be running longer equilibrations / productions, so it might be nice to be able to restart at any point during it.

"""
if verbose:
self.logger.info("Creating system")
if shared_basepath is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the paths can be None, I think you'll want to have logic to deal with this somewhere here.

stateA_topology, stateA_positions, file=f, keepIds=True
)

# 10. Get platform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] The comments will need to be updated once the structure is settled.

if x.getName() == "MonteCarloBarostat":
x.setFrequency(0)

simulation.context.setVelocitiesToTemperature(to_openmm(temperature))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope / nit] it would definitely be good to split this up into method calls or something like that. There's a lot of code in an "else" section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Switch PlainMDProtocol checkpointing to a serialized State for better inter-machine compatibility

2 participants