Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing
As modern society relies on the fault-free operation of complex computing systems, system fault-tolerance has become an indispensable requirement. Therefore, we need mechanisms that guarantee correct service in cases where system components fail, be they software or hardware elements. Redundancy patterns are commonly used, for either redundancy in space or redundancy in time.
Wolter’s book details methods of redundancy in time that need to be issued at the right moment. In particular, she addresses the so-called "timeout selection problem", i.e., the question of choosing the right time for different fault-tolerance mechanisms like restart, rejuvenation and checkpointing. Restart indicates the pure system restart, rejuvenation denotes the restart of the operating environment of a task, and checkpointing includes saving the system state periodically and reinitializing the system at the most recent checkpoint upon failure of the system. Her presentation includes a brief introduction to the methods, their detailed stochastic description, and also aspects of their efficient implementation in real-world systems.
The book is targeted at researchers and graduate students in system dependability, stochastic modeling and software reliability. Readers will find here an up-to-date overview of the key theoretical results, making this the only comprehensive text on stochastic models for restart-related problems.
Bound on the number of restarts. Although this need not generally be the case (the hyper-exponential distribution is a counterexample), for many distributions it may be wise to limit the number of restarts, or to increase the period between restarts with the restart count. This leads to a situation with finite and non-identical restart intervals, for which we derive an algorithm to compute all moments. Perhaps one would expect that restarts should take place with fixed-length intervals between.
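To make the setting of finite, non-identical restart intervals concrete, the following is a minimal simulation sketch, not the moment-computing algorithm the excerpt refers to. It assumes that each restart draws a fresh, independent completion time, that all work done before a timeout is lost, and that the task runs to completion once the listed intervals are used up. The function names, the hyper-exponential parameters, and the interval values are illustrative assumptions.

```python
import random


def completion_time(sample_task_time, intervals):
    """Total time to finish a task that is restarted whenever a timeout expires.

    sample_task_time: zero-argument callable returning a fresh completion time.
    intervals: finite list of restart intervals, which may be non-identical.
    Assumption: after the last interval the task runs without further restarts.
    """
    elapsed = 0.0
    for tau in intervals:
        t = sample_task_time()
        if t <= tau:
            return elapsed + t
        elapsed += tau  # attempt aborted at the timeout; its work is lost
    return elapsed + sample_task_time()


def mean_completion_time(sample_task_time, intervals, runs=20000):
    """Monte Carlo estimate of the expected completion time."""
    total = sum(completion_time(sample_task_time, intervals) for _ in range(runs))
    return total / runs


def hyperexp_sample(rng, p_fast=0.9, fast_rate=10.0, slow_rate=0.1):
    """Two-phase hyper-exponential sample (parameters are assumptions)."""
    rate = fast_rate if rng.random() < p_fast else slow_rate
    return rng.expovariate(rate)


if __name__ == "__main__":
    rng = random.Random(42)
    sampler = lambda: hyperexp_sample(rng)
    print("no restart:          ", mean_completion_time(sampler, []))
    print("fixed intervals:     ", mean_completion_time(sampler, [0.5] * 3))
    print("increasing intervals:", mean_completion_time(sampler, [0.5, 1.0, 2.0]))
```

Passing an empty list gives the baseline without restarts, so different interval sequences can be compared against it for a given completion-time distribution.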
Performed. The many existing publications differ in assumptions on the aging, the system descriptions, the failure models, the way to determine the number k of partial renewals after which to perform a complete renewal, etc. (5) Shock models assume that the failure of a system or its components is caused by shocks the system experiences. These shocks happen randomly such that the time between shocks and the damage caused by a shock are random variables that follow some probability distribution.
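As an illustration of the shock-model idea, here is a small simulation sketch. The excerpt only states that the times between shocks and the damage per shock are random; the exponential distributions, the cumulative-damage failure rule with a fixed threshold, and all names and parameter values below are assumptions chosen for simplicity.

```python
import random


def time_to_failure(rng, shock_rate, mean_damage, threshold):
    """Simulate one system lifetime under a cumulative-damage shock model.

    Assumptions: exponential times between shocks (rate shock_rate),
    exponential damage per shock (mean mean_damage), and failure as soon
    as the accumulated damage reaches the threshold.
    """
    clock = 0.0
    damage = 0.0
    while damage < threshold:
        clock += rng.expovariate(shock_rate)          # wait for the next shock
        damage += rng.expovariate(1.0 / mean_damage)  # damage added by that shock
    return clock


def mean_time_to_failure(rng, shock_rate, mean_damage, threshold, runs=20000):
    """Monte Carlo estimate of the mean time to failure."""
    total = sum(time_to_failure(rng, shock_rate, mean_damage, threshold)
                for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    rng = random.Random(1)
    print(mean_time_to_failure(rng, shock_rate=2.0, mean_damage=1.0, threshold=5.0))
```

Other variants, such as failure on a single shock that exceeds a critical size, would need only a different stopping condition in the loop.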
Rejuvenation. To apply checkpointing, the system failure behaviour and the task processing must be observed. The rollback to the most recent checkpoint can be applied to either the task, the system or both. As with rejuvenation, the purpose is to circumvent system failures and achieve completion of processes in as short a time as possible. However, the most important difference between restart and both rejuvenation and checkpointing is that the former relates to a minimisation problem with known.
The system fails and the checkpointing overhead. Many stochastic models aim at optimising this trade-off with respect to some performance metric, typically either the expected completion time of the task, or the availability of the software system.
Forked checkpointing. When using sequential checkpointing on a uni-processor system, no useful work is being performed during the whole save process of a checkpoint. Therefore, in forked checkpointing the process whose state is being saved creates a.
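The trade-off between work lost at a failure and checkpointing overhead can be sketched with a deliberately simple model, which is an assumption here and not one of the models from the book: the task is split into equal segments, each followed by a checkpoint of fixed cost; failures occur at an exponential rate, also during checkpointing; recovery is instantaneous and rolls back to the most recent checkpoint. Under these assumptions a segment of length s takes (e^{λs} - 1)/λ in expectation, and the number of segments can be chosen by direct search. All names and numbers are illustrative.

```python
import math


def expected_completion_time(work, n_segments, checkpoint_cost, failure_rate):
    """Expected completion time with equidistant checkpoints.

    Assumptions: exponential failures with rate failure_rate, zero repair
    time, rollback to the most recent checkpoint, failures possible while
    a checkpoint is being saved.
    """
    segment = work / n_segments + checkpoint_cost
    # Expected time to get through one segment without interruption:
    # (exp(rate * segment) - 1) / rate
    return n_segments * math.expm1(failure_rate * segment) / failure_rate


def best_segment_count(work, checkpoint_cost, failure_rate, max_segments=1000):
    """Number of segments minimising the expected completion time (brute force)."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_completion_time(work, n, checkpoint_cost, failure_rate),
    )


if __name__ == "__main__":
    work, cost, rate = 100.0, 1.0, 0.01
    n = best_segment_count(work, cost, rate)
    print("best number of segments:", n)
    print("expected completion time:", expected_completion_time(work, n, cost, rate))
```

Too few checkpoints make each failure expensive, too many make the overhead dominate, so the expected completion time has a minimum at an intermediate segment count.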
Requests to be retried. Therefore retry is necessary if restart is performed and the set of all retries is a subset of the set of all restarts. Both are subsets of the set of reboots. In other words, system reboot requires application restart and request retry, while application restart only requires request retry, and retry is possible without either restart or reboot. As a matter of fact,.
[Fig. 1.1: Relation of retry, restart, and reboot]