A Heuristic Approach for the Automatic Insertion of Checkpoints in Message-Passing Codes
Gabriel Rodríguez (University of A Coruña, Spain)
Maria J. Martín (University of A Coruña, Spain)
Patricia González (University of A Coruña, Spain)
Juan Touriño (University of A Coruña, Spain)
Abstract: Checkpointing tools are typically implemented at one of two abstraction levels: the system level or the application level. The latter has become a more popular alternative due to its flexibility and its ability to operate in different environments. However, application-level checkpointing tools often require the user to insert checkpoints manually to ensure that certain requirements are met (e.g. forcing checkpoints to be taken in the user code and not inside kernel routines). The approach presented in this work is twofold. First, a spatial coordination protocol for checkpointing parallel SPMD applications is proposed, based on forcing all processes to take checkpoints at the same places in the application code. Global consistency is thus achieved without adding any new runtime communications or piggybacked data, and without the need for specific fault-tolerant message-passing implementations. Second, the paper introduces a compilation technique for the automatic insertion of checkpoints using the spatial coordination protocol, based on a static analysis of communications and a heuristic analysis of computational load. These analyses can also be used to achieve automatic checkpoint insertion in approaches based on classical protocols, such as uncoordinated checkpointing or distributed snapshots.
Keywords: checkpointing, compiler-support, fault tolerance, message-passing, parallel programming
Categories: C.4, D.1.3