Basics of Fault Tolerant Computing:

Copyright(c), 1990, 1995 Fred Cohen - All Rights Reserved

Redundancy is the basic tool of fault tolerance. Since things fall apart, we attempt to use redundancy so that when some components fail, other components compensate for their failure. As an example, a circular table with 5 equally spaced legs around the perimeter will still stand even if one of the legs fails. By properly designing tables, we can make them tolerate numerous leg failures, but each time we add a new leg, we increase the price of the product. Some typical types of redundancy used in information systems include:

Backups
N-modular redundancy
Duplicate and compare
Cold, warm, and hot standbys
Parity, SEC, DED, and CRC codes

Backups are copies of information, kept so that in the event that information is lost or modified, another copy can be made available at a relatively low cost. Many businesses have come into being for the sole purpose of providing a remote, off-site backup facility designed to survive nearly any possible event. In some cases, duplicate copies of entire computer systems are kept off-site in case of fires or other disasters. In most cases, it is more cost effective for a set of organizations to group together to provide such a facility, since the cost to any single organization may be prohibitive. For backups to cover many faults, they must be both separate and different. Separation prevents backups from being affected along with originals. Differences prevent events that damage only one type of medium from affecting all copies.
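One way to make this concrete is a sketch in which a stored checksum lets us restore from whichever backup copy survived intact. This is only an illustration of the idea; the checksum scheme, names, and data below are mine, not from the text:

```python
# Hedged sketch (not from the original text): keep separate backup copies
# and restore from any copy whose checksum still matches the original's.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

original = b"critical business records"
reference = checksum(original)           # stored separately from the data

backups = {
    "on-site disk":  b"critical business records",
    "off-site tape": b"critical business records",
}
backups["on-site disk"] = b"corrupted!"  # simulate damage to one copy

# Restore from the first copy that still checks out.
recovered = next(data for data in backups.values()
                 if checksum(data) == reference)
print(recovered)   # b'critical business records', from the surviving copy
```

Keeping the copies on different media in different places is what makes it unlikely that a single event damages both, which is the "separate and different" requirement above.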

N-Modular Redundancy (NMR) is a technique in which the same circuit is replicated N times, and N copies of a voting circuit take a majority vote of the results. This uses more than N times the number of components of a circuit without redundancy, and standard analytical techniques show that after a certain amount of time, a circuit without NMR is more likely to be operational than an NMR version. Thus there is a tradeoff between redundancy and mission time. A similar technique, called N-Version Programming, is under study for software redundancy [Kelly83] [Chen78] [Chen78-2] [Littlewood84].
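The N=3 case (triple modular redundancy, TMR) and the mission-time tradeoff can be sketched as follows. This is an illustrative model of my own, assuming the usual exponential component reliability R(t) = exp(-λt), which the text does not specify:

```python
# Illustrative sketch: majority voting for TMR (the N=3 case of NMR), plus
# the standard reliability comparison behind the mission-time tradeoff.
import math
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a majority of the N modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority -- too many modules disagree")
    return value

def simplex_reliability(lam, t):
    """Reliability of a single, non-redundant module at time t."""
    return math.exp(-lam * t)

def tmr_reliability(lam, t):
    """TMR survives if at least 2 of 3 modules work (ideal voter assumed)."""
    r = simplex_reliability(lam, t)
    return 3 * r**2 - 2 * r**3

print(majority_vote([1, 1, 0]))    # -> 1: one faulty module is outvoted
# Early in the mission TMR wins; past the crossover the simplex unit wins.
print(tmr_reliability(0.01, 10) > simplex_reliability(0.01, 10))    # True
print(tmr_reliability(0.01, 200) < simplex_reliability(0.01, 200))  # True
```

Under this model the crossover happens when a single module's reliability falls below one half, which is why short missions favor NMR and long ones favor the simplex circuit.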

Duplicate and compare is a technique in which errors are detected but not corrected. We detect component failures by duplicating a circuit and comparing the two results. If they differ, one or the other must be in error, but we cannot tell which without further analysis. Many other forms of built-in self-test have been explored, both in the hardware domain and in the software domain [Yau75].
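A minimal sketch of duplicate and compare (my illustration; the function names and the injected fault are hypothetical):

```python
# Sketch of duplicate and compare: run the same computation on two
# independent copies and flag a fault when the results differ.
# Detection only -- we cannot tell which copy is wrong.
def duplicate_and_compare(f, g, x):
    """f and g are two copies of the same computation; return (ok, result)."""
    a, b = f(x), g(x)
    if a != b:
        return (False, None)   # error detected, but not located
    return (True, a)

square = lambda x: x * x
faulty_square = lambda x: x * x + 1    # hypothetical stuck-at-style fault

print(duplicate_and_compare(square, square, 4))         # (True, 16)
print(duplicate_and_compare(square, faulty_square, 4))  # (False, None)
```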

If we have some error detection technique, we may design systems with spare components which are activated when other components fail. Hot standby systems are systems in which a standby unit performs all operations of the primary unit in parallel. Thus when the primary unit fails, switchover to the standby unit for continued operation can be made very quickly. Cold standbys are often used when short lapses in operation are acceptable and state data is not critical to operation. In these cases, the standby unit is kept non-operational until it is needed. Warm standbys fall in between: the spare is powered up but does not track the primary's full state, so switchover is slower than with a hot standby but faster than starting up a cold one. Since systems tend to fail as they operate longer, the expected lifetime of a system may be increased by using cold standbys instead of hot standbys. Cold standbys also don't require as much power as hot standbys, and are thus often used in remote operations where power is hard to come by. In the AT&T ESS processors, duplicate and compare is used to allow a machine with a hot standby to shut itself down, thereby allowing the standby to carry on without error.
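Hot-standby switchover can be sketched roughly as below. This is my simplified model, not the actual ESS design; the class and method names are hypothetical:

```python
# Hedged sketch of hot-standby switchover: both units are kept running,
# so when the primary fails the pair switches over with no lost state.
class Unit:
    def __init__(self, name):
        self.name = name
        self.failed = False
    def process(self, request):
        if self.failed:
            raise RuntimeError(f"{self.name} has failed")
        return request.upper()   # stand-in for real work

class HotStandbyPair:
    def __init__(self):
        self.primary = Unit("primary")
        self.standby = Unit("standby")
    def process(self, request):
        try:
            return self.primary.process(request)
        except RuntimeError:
            # The standby has been tracking everything, so switchover
            # is just a role swap -- this is what makes it "hot".
            self.primary, self.standby = self.standby, self.primary
            return self.primary.process(request)

pair = HotStandbyPair()
print(pair.process("req1"))   # REQ1, served by the primary
pair.primary.failed = True
print(pair.process("req2"))   # REQ2, served after switchover to the standby
```

A cold standby, by contrast, would have to be powered up and loaded with state inside the `except` branch, which is exactly the operational lapse the text describes.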

Parity is a simple method of coding whether the number of 1s (or 0s) in a binary representation is odd or even into a single bit. By using a parity code, we can test whether an odd number of bit changes have occurred without our explicit instructions. Single error correcting (SEC) codes allow us to detect and correct a single bit error, while double error detecting (DED) codes allow us to detect a second error, so that we know whether there have been multiple errors. CRC codes allow us to detect large numbers of errors, and are commonly used on disks to assure that data isn't lost in cases where several bytes are in error. There are many other coding methods, some involving cryptographic techniques, and all based to some degree on Shannon's information theory [Shannon48].
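An even-parity code can be sketched as follows. This is my illustration; it shows why a single flipped bit is caught while two flips go unnoticed:

```python
# Sketch of even parity: the parity bit records whether the number of 1s
# in the data is odd, so any odd number of bit flips is detectable.
def parity_bit(bits):
    """Even-parity bit: 1 if the data has an odd number of 1s."""
    return sum(bits) % 2

def encode(bits):
    return bits + [parity_bit(bits)]

def check(word):
    """True iff the word passes the check (even total number of 1s)."""
    return sum(word) % 2 == 0

word = encode([1, 0, 1, 1])
print(check(word))   # True: no error
word[2] ^= 1         # flip one bit
print(check(word))   # False: single error detected
word[0] ^= 1         # flip a second bit
print(check(word))   # True: an even number of flips goes unnoticed
```

The last line is the limitation that SEC and DED codes address by spending extra check bits to locate, and not merely detect, errors.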

These examples only scratch the surface of the field of fault tolerant computing. For further reading, we suggest [Breuer81], [Siewiorek82], and [Scott84].