
Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?


The current stable release of Open MPI does not support the checkpointing and restarting of processes. However, the Open MPI development trunk does contain such support. The Open MPI team is actively working on integrating a variety of checkpoint and restart techniques into Open MPI, including similar functionality to that supported by LAM/MPI. Open MPI’s implementation supports both the BLCR checkpoint/restart system and a “self” checkpointer that allows applications to perform their own checkpoint/restart functionality. For both of these, Open MPI provides a coordinated checkpoint/restart protocol and integration with a variety of network interconnects. The implementation introduces a series of new frameworks and components designed to support a variety of checkpoint and restart techniques. This will allow us to support the methods described above (application-directed, BLCR, etc.) as well as other kinds of checkpoint/restart systems (e.g., Condor, libckpt) and protocols (e.g., uncoo
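To make the "self" checkpointer idea more concrete, here is a minimal sketch of application-directed checkpoint/restart for a single process. It is not Open MPI's API and makes no MPI calls; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` parameter used to simulate an interruption are all assumptions made for illustration. The only point is the pattern: the application decides what its state is, writes it out periodically, and resumes from the last saved state on restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file first, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    stop_after (hypothetical, for demonstration) ends the run early after
    that many steps to mimic a failure between checkpoints.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # simulated interruption; unsaved progress is lost
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run` a second time with the same path picks up from the last checkpoint written rather than from step 0. In a parallel job each process would also have to agree on when to checkpoint and what to do about in-flight messages, which is the part the coordinated protocol described above is meant to handle.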
