Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?
The current stable release of Open MPI does not support checkpointing and restarting processes; that support currently exists only in the Open MPI development trunk. The Open MPI team is actively integrating a variety of checkpoint and restart techniques into Open MPI, including functionality similar to that supported by LAM/MPI. Open MPI’s implementation supports both the BLCR checkpoint/restart system and a “self” checkpointer that allows applications to perform their own checkpoint/restart functionality. For both of these, Open MPI provides a coordinated checkpoint/restart protocol and integration with a variety of network interconnects. The implementation introduces a series of new frameworks and components designed to support a variety of checkpoint and restart techniques. This will allow us to support the methods described above (application-directed, BLCR, etc.) as well as other kinds of checkpoint/restart systems (e.g., Condor, libckpt) and protocols (e.g., uncoo
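To make the application-directed ("self") idea concrete, here is a minimal, generic sketch of an application saving and restoring its own state. It does not use Open MPI or its "self" component interface; the names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file format, and the `stop_after` parameter are all hypothetical and chosen only for illustration. It covers a single process, so the coordinated protocol across MPI ranks is not shown.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the target.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after (hypothetical) simulates an interruption after that many
    steps have been completed in this call.
    """
    state = load_checkpoint(path)
    done_this_call = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_call >= stop_after:
            return state  # simulated interruption; checkpoint is on disk
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        done_this_call += 1
    return state
```

Calling `run` a second time with the same checkpoint path resumes from the last saved step rather than starting over. In a real parallel job, each process would also have to agree on when to checkpoint and account for in-flight messages, which is the role of the coordinated protocol described above.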