Fault Tolerance

The typical approach to fault tolerance is to provide more resources than necessary and let an unused one take over when a busy one fails. Designing for fault tolerance requires careful selection and evaluation of an error model; otherwise you may spend a lot of effort and money handling failures that are very unlikely to happen, while overlooking one that occurs once an hour.

Providing fault tolerance works like insurance: there is always a premium to pay to cover losses. Sometimes that premium is prohibitively high, and you may be better off without fault tolerance if you can simply tolerate the failures.
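A quick back-of-the-envelope calculation makes the trade-off concrete (all numbers below are made up for illustration):

 # Is the fault-tolerance "premium" worth paying? Hypothetical numbers.
 failure_prob_per_year = 0.02   # chance the unprotected system fails in a year
 cost_per_failure = 50000       # loss per failure (downtime, recovery, lost data)
 redundancy_cost = 8000         # yearly cost of the spare hardware and its upkeep

 expected_loss = failure_prob_per_year * cost_per_failure   # 1000 per year
 if redundancy_cost < expected_loss:
     print("the premium is worth paying")
 else:
     print("cheaper to tolerate the failures")   # wins here: 8000 > 1000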

There is no absolute safety.

The physical environment might require a non-trivial physical distribution of the hardware (fire, water, earthquakes, nuclear war, aliens, etc.).

Not only the boxes but also the connections between them must be secured.


"Fault-tolerance" usually connotes resistance to mishap or accidental damage (e.g. unexpected failure of hardware, loss of network connectivity, accidental erasure of data) and not intentional sabotage. Hence it overlaps with, but has a different perspective than, the topic of security. In a sense, fault-tolerant design is a special case of designing for security, where the assumed attacker is Murphy.

Fault tolerance is not the same as ruggedness; see RuggedizedComputer.


For a discussion of the kinds of issues that must be considered in designing for fault-tolerance, see "A Fault-Tolerant System Architecture for Navy Applications", IBM Journal of Research & Development, Volume 27, Number 3, Page 219 (1983), on-line at

http://domino.watson.ibm.com/tchjr/journalindex.nsf/0/69c09a6f2ba76be785256bfa0067f580?openDocument

This is just an article I happen to remember having read, which happens to be available on-line. I'm sure it's not the only example of its kind, and it may not be the best. But it does give an overview of the issues involved, and it discusses a real-life system, designed over a period of years and intended to have a service life of two decades after installation, with an MTBF of 2000 hours. The article particularly mentions the need for cooperation between layers of the system: fault-tolerance must be designed in from top to bottom, including software, operating system, and hardware.


There is a RuleOfThree for fault-tolerance that says you should do every computation three times, on three independent processors. If they all agree, great; if two agree and one does not, the majority wins, and the dissenter is flagged for testing. Cf. RobertHeinlein's novel TheNumberOfTheBeast, the Spielberg film Minority Report (based on a story by Philip K. Dick, WikiName PhilDick), and other standard engineering references. :-)
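A minimal sketch of the voting step, here in Python (in a real system the three results would come from three independent processors, not three calls in one process):

 # Majority voting over three redundant results. Returns the agreed value
 # and the index of the dissenting processor, if any, to flag for testing.
 def vote(a, b, c):
     if a == b == c:
         return a, None    # unanimous
     if a == b:
         return a, 2       # processor 2 dissents
     if a == c:
         return a, 1
     if b == c:
         return b, 0
     raise RuntimeError("no two processors agree")

 result, dissenter = vote(42, 42, 41)
 print(result, dissenter)  # 42 2 -- the majority wins, processor 2 is flagged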

Interestingly, the design for the Navy system referenced above did not take this approach: they used two CPUs, not three.

RuleOfThree may help for some failures, but not when the specification itself has a fault, in which case all three processes may share the same fault. Another approach is a "back-up handler": when the primary process is known to have a non-recoverable problem, a secondary process takes over. The secondary process is usually much simpler and handles only basic survival operations. It's roughly comparable to the autonomic nervous system in mammals (not to be confused with "automatic"): it keeps the heart and lungs going, but doesn't handle more complex things like finding shelter or evading predators. It's of little use if you are attacked by a predator and about to be eaten, but it can help if, say, you slip and fall down the side of a hill, smashing into things on the way down. Your brain may be knocked out, but perhaps this "backup brain" buys time for your comrades to find and assist you.

Similarly, a simpler back-up system may not help a spacecraft trying to land, since landing requires its full faculties. However, if the craft is parked in orbit or has already landed, a basic survival system may buy time while ground controllers work out a solution. For example, the backup process may do little more than regulate temperature and aim the solar panels to keep the battery and transmitter alive. I believe NASA calls this "safe mode".
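A sketch of the pattern in Python (the craft object and its methods are invented for illustration): the primary loop does the complex work, and any non-recoverable failure drops the system into a deliberately dumb survival loop.

 import time

 def primary_loop(craft):
     while True:
         craft.navigate()           # complex work; may fail non-recoverably
         craft.run_experiments()

 def backup_loop(craft):
     # Deliberately simple: keep the battery and transmitter alive and
     # wait for ground controllers to work out a solution.
     while True:
         craft.regulate_temperature()
         craft.aim_solar_panels()
         time.sleep(60)

 def run(craft):
     try:
         primary_loop(craft)
     except Exception:
         backup_loop(craft)         # fall back to survival ("safe") mode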

Sorry, did you say "where the specification has a fault"?!? That makes no sense whatsoever. You are talking about a defective product that could never have worked correctly. That is something which should have been discovered during specification inspection. How would such a product ever get built?


Flowers help my fault tolerance.


See: FaultIsolation, FailFast, FailureIsInevitable, ApplicationRecycling, FaultTreeAnalysis, RealTime, RecoveryOrientedComputing

CategoryRealTime