Application checkpointing

Last updated January 04, 2023

Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.

Checkpointing in distributed systems
Implementations for applications
Save State
Checkpoint/Restart
Fault Tolerance Interface (FTI)
Berkeley Lab Checkpoint/Restart (BLCR)
DMTCP
Collaborative checkpointing
Docker
CRIU
Implementation for embedded and ASIC devices
Mementos
Idetic
See also
References
Further reading
External links

Checkpointing in distributed systems

In the distributed computing environment, checkpointing is a technique that helps tolerate failures that otherwise would force long-running application to restart from the beginning. The most basic way to implement checkpointing, is to stop the application, copy all the required data from the memory to reliable storage (e.g., parallel file system) and then continue with the execution.^[1] In case of failure, when the application restarts, it does not need to start from scratch. Rather, it will read the latest state ("the checkpoint") from the stable storage and execute from that. While there is ongoing debate on whether checkpointing is the dominating I/O workload on distributed computing systems, there is general consensus that checkpointing is one of the major I/O workloads.^[2]^[3]

There are two main approaches for checkpointing in the distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit protocol algorithm. In the uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect ).^[4]^[5]

Implementations for applications

Save State

One of the original and now most common means of application checkpointing was a "save state" feature in interactive applications, in which the user of the application could save the state of all variables and other data to a storage medium at the time they were using it and either continue working, or exit the application and at a later time, restart the application and restore the saved state. This was implemented through a "save" command or menu option in the application. In many cases it became standard practice to ask the user if they had unsaved work when exiting the application if they wanted to save their work before doing so.

This sort of functionality became extremely important for usability in applications where the particular work could not be completed in one sitting (such as playing a video game expected to take dozens of hours, or writing a book or long document amounting to hundreds or thousands of pages) or where the work was being done over a long period of time such as data entry into a document such as rows in a spreadsheet.

The problem with save state is it requires the operator of a program to request the save. For non-interactive programs, including automated or batch processed workloads, the ability to checkpoint such applications also had to be automated.

Checkpoint/Restart

As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, the need for the application to be restartable at some point without the need to rerun the entire job from scratch became imperative. Thus the "checkpoint/restart" capability was born, in which after a number of transactions had been processed, a "snapshot" or "checkpoint" of the state of the application could be taken. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the last place in the transaction file where a transaction had successfully completed. The application could then restart at that point.

Checkpointing tends to be expensive, so it was generally not done with every record, but at some reasonable compromise between the cost of a checkpoint vs. the value of the computer time needed to reprocess a batch of records. Thus the number of records processed for each checkpoint might range from 25 to 200, depending on cost factors, the relative complexity of the application and the resources needed to successfully restart the application.

Fault Tolerance Interface (FTI)

FTI is a library that aims to provide computational scientists with an easy way to perform checkpoint/restart in a scalable fashion.^[6] FTI leverages local storage plus multiple replications and erasures techniques to provide several levels of reliability and performance. FTI provides application-level checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid space, time and energy waste. It offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI in a transparent fashion for the user. If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.

Berkeley Lab Checkpoint/Restart (BLCR)

The Future Technologies Group at the Lawrence National Laboratories are developing a hybrid kernel/user implementation of checkpoint/restart called BLCR. Their goal is to provide a robust, production quality implementation that checkpoints a wide range of applications, without requiring changes to be made to application code.^[7] BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. Its work is broken down into 4 main areas: Checkpoint/Restart for Linux (CR), Checkpointable MPI Libraries, Resource Management Interface to Checkpoint/Restart and Development of Process Management Interfaces.

DMTCP

DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets.^[8] It does not modify the user's program or the operating system. Among the applications supported by DMTCP are Open MPI, Python, Perl, and many programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and mmap/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.^[9]

Collaborative checkpointing

Some recent protocols perform collaborative checkpointing by storing fragments of the checkpoint in nearby nodes.^[10] This is helpful because it avoids the cost of storing to a parallel file system (which often becomes a bottleneck for large-scale systems) and it uses storage that is closer. This has found use particularly in large-scale supercomputing clusters. The challenge is to ensure that when the checkpoint is needed when recovering from a failure, the nearby nodes with fragments of the checkpoints are available.

Docker

Docker and the underlying technology contain a checkpoint and restore mechanism.^[11]

CRIU

CRIU is a user space checkpoint library.

Implementation for embedded and ASIC devices

Mementos

Mementos is a software system that transforms general-purpose tasks into interruptible programs for platforms with frequent interruptions such as power outages. It was designed for batteryless embedded devices such as RFID tags and smart cards which rely on harvesting energy from ambient background sources. Mementos frequently senses the available energy in the system and decides whether to checkpoint the program due to impending power loss versus continuing computation. If checkpointing, data will be stored in a non-volatile memory. When the energy becomes sufficient for reboot, the data is retrieved from non-volatile memory and the program continues from the stored state. Mementos has been implemented on the MSP430 family of microcontrollers. Mementos is named after Christopher Nolan's Memento.^[12]

Idetic

Idetic is a set of automatic tools which helps application-specific integrated circuit (ASIC) developers to automatically embed checkpoints in their designs. It targets high-level synthesis tools and adds the checkpoints at the register-transfer level (Verilog code). It uses a dynamic programming approach to locate low overhead points in the state machine of the design. Since the checkpointing in hardware level involves sending the data of dependent registers to a non-volatile memory, the optimum points are required to have minimum number of registers to store. Idetic is deployed and evaluated on energy harvesting RFID tag device.^[13]

Related Research Articles

<span class="mw-page-title-main">Client–server model</span> Distributed application structure in computing

The client–server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients. Often clients and servers communicate over a computer network on separate hardware, but both client and server may reside in the same system. A server host runs one or more server programs, which share their resources with clients. A client usually does not share any of its resources, but it requests content or service from a server. Clients, therefore, initiate communication sessions with servers, which await incoming requests. Examples of computer applications that use the client–server model are email, network printing, and the World Wide Web.

Scalability is the property of a system to handle a growing amount of work by adding resources to the system.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

Edge computing is a distributed computing paradigm that brings computation and data storage closer to the sources of data. This is expected to improve response times and save bandwidth. Edge computing is an architecture rather than a specific technology, and a topology- and location-sensitive form of distributed computing.

The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for use in large scale cluster computing. PVFS focuses on high performance access to large data sets. It consists of a server process and a client library, both of which are written entirely of user-level code. A Linux kernel module and pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides for high performance access via the message passing interface (MPI). PVFS is being jointly developed between The Parallel Architecture Research Laboratory at Clemson University and the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center. PVFS development has been funded by NASA Goddard Space Flight Center, The DOE Office of Science Advanced Scientific Computing Research program, NSF PACI and HECURA programs, and other government and private agencies. PVFS is now known as OrangeFS in its newest development branch.

Charm++ is a parallel object-oriented programming paradigm based on C++ and developed in the Parallel Programming Laboratory at the University of Illinois at Urbana–Champaign. Charm++ is designed with the goal of enhancing programmer productivity by providing a high-level abstraction of a parallel program while at the same time delivering good performance on a wide variety of underlying hardware platforms. Programs written in Charm++ are decomposed into a number of cooperating message-driven objects called chares. When a programmer invokes a method on an object, the Charm++ runtime system sends a message to the invoked object, which may reside on the local processor or on a remote processor in a parallel computation. This message triggers the execution of code within the chare to handle the message asynchronously.

Within cluster and parallel computing, a cluster manager is usually backend graphical user interface (GUI) or command-line interface (CLI) software that runs on a set of cluster nodes that it manages. The cluster manager works together with a cluster management agent. These agents run on each node of the cluster to manage and configure services, a set of services, or to manage and configure the complete cluster server itself In some cases the cluster manager is mostly used to dispatch work for the cluster to perform. In this last case a subset of the cluster manager can be a remote desktop application that is used not for configuration but just to send work and get back work results from a cluster. In other cases the cluster is more related to availability and load balancing than to computational or specific service clusters.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

Many-task computing (MTC) in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) and high-performance computing (HPC).

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

Quasi-opportunistic supercomputing is a computational paradigm for supercomputing on a large number of geographically disperse computers. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.

Tachyon is a parallel/multiprocessor ray tracing software. It is a parallel ray tracing library for use on distributed memory parallel computers, shared memory computers, and clusters of workstations. Tachyon implements rendering features such as ambient occlusion lighting, depth-of-field focal blur, shadows, reflections, and others. It was originally developed for the Intel iPSC/860 by John Stone for his M.S. thesis at University of Missouri-Rolla. Tachyon subsequently became a more functional and complete ray tracing engine, and it is now incorporated into a number of other open source software packages such as VMD, and SageMath. Tachyon is released under a permissive license.

<span class="mw-page-title-main">CRIU</span> Checkpoint/restore software

Checkpoint/Restore In Userspace (CRIU), is a software tool for the Linux operating system. Using this tool, it is possible to freeze a running application and checkpoint it to persistent storage as a collection of files. One can then use the files to restore and run the application from the point it was frozen at. The distinctive feature of the CRIU project is that it is mainly implemented in user space, rather than in the kernel.

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Computation offloading is the transfer of resource intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or a cloud. Offloading to a coprocessor can be used to accelerate applications including: image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome hardware limitations of a device, such as limited computational power, storage, and energy.

In the high-performance computing environment, burst buffer is a fast intermediate storage layer positioned between the front-end computing processes and the back-end storage systems. It bridges the performance gap between the processing speed of the compute nodes and the Input/output (I/O) bandwidth of the storage systems. Burst buffers are often built from arrays of high-performance storage devices, such as NVRAM and SSD. It typically offers from one to two orders of magnitude higher I/O bandwidth than the back-end storage systems.

ACM SIGHPC is the Association for Computing Machinery's Special Interest Group on High Performance Computing, an international community of students, faculty, researchers, and practitioners working on research and in professional practice related to supercomputing, high-end computers, and cluster computing. The organization co-sponsors international conferences related to high performance and scientific computing, including: SC, the International Conference for High Performance Computing, Networking, Storage and Analysis; the Platform for Advanced Scientific Computing (PASC) Conference; Practice and Experience in Advanced Research Computing (PEARC); and PPoPP, the Symposium on Principles and Practice of Parallel Programming.

References

↑ Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1994). Libckpt: Transparent checkpointing under unix. Computer Science Department.
↑ Wang, Teng; Snyder, Shane; Lockwood, Glenn; Carns, Philip; Wright, Nicholas; Byna, Suren (Sep 2018). "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs". 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 466–476. doi:10.1109/CLUSTER.2018.00062. ISBN 978-1-5386-8319-4. S2CID 53235850.
↑ "Comparative I/O workload characterization of two leadership class storage clusters Logs" (PDF). ACM. Nov 2015.
↑ Bouteiller, B., Lemarinier, P., Krawezik, K., & Capello, F. (2003, December). Coordinated checkpoint versus message log for fault tolerant MPI. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on (pp. 242-250). IEEE.
↑ Elnozahy, E. N., Alvisi, L., Wang, Y. M., & Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 375-408.
↑ Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.
↑ Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
↑ Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.
↑ "GitHub - DMTCP/DMTCP: DMTCP: Distributed MultiThreaded CheckPointing". GitHub . 2019-07-11.
↑ Walters, J. P.; Chaudhary, V. (2009-07-01). "Replication-Based Fault Tolerance for MPI Applications". IEEE Transactions on Parallel and Distributed Systems. 20 (7): 997–1010. CiteSeerX 10.1.1.921.6773 . doi:10.1109/TPDS.2008.172. ISSN 1045-9219. S2CID 2086958.
↑ "Docker - CRIU".
↑ Benjamin Ransford, Jacob Sorber, and Kevin Fu. 2011. Mementos: system support for long-running computation on RFID-scale devices. ACM SIGPLAN Notices 47, 4 (March 2011), 159-170. DOI=10.1145/2248487.1950386 http://doi.acm.org/10.1145/2248487.1950386
↑ Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1994). Libckpt: Transparent checkpointing under unix. Computer Science Department.

[2] Wang, Teng; Snyder, Shane; Lockwood, Glenn; Carns, Philip; Wright, Nicholas; Byna, Suren (Sep 2018). "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs". 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 466–476. doi:10.1109/CLUSTER.2018.00062. ISBN 978-1-5386-8319-4. S2CID 53235850.

[3] "Comparative I/O workload characterization of two leadership class storage clusters Logs" (PDF). ACM. Nov 2015.

[4] Bouteiller, B., Lemarinier, P., Krawezik, K., & Capello, F. (2003, December). Coordinated checkpoint versus message log for fault tolerant MPI. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on (pp. 242-250). IEEE.

[5] Elnozahy, E. N., Alvisi, L., Wang, Y. M., & Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 375-408.

[6] Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.

[7] Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.

[8] Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.

[9] "GitHub - DMTCP/DMTCP: DMTCP: Distributed MultiThreaded CheckPointing". GitHub . 2019-07-11.

[10] Walters, J. P.; Chaudhary, V. (2009-07-01). "Replication-Based Fault Tolerance for MPI Applications". IEEE Transactions on Parallel and Distributed Systems. 20 (7): 997–1010. CiteSeerX 10.1.1.921.6773 . doi:10.1109/TPDS.2008.172. ISSN 1045-9219. S2CID 2086958.

[11] "Docker - CRIU".

[12] Benjamin Ransford, Jacob Sorber, and Kevin Fu. 2011. Mementos: system support for long-running computation on RFID-scale devices. ACM SIGPLAN Notices 47, 4 (March 2011), 159-170. DOI=10.1145/2248487.1950386 http://doi.acm.org/10.1145/2248487.1950386

[13] Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing