Single Point Of Failure

A point in a system where, if a failure occurs, there is no redundancy to compensate for it:

Not to be confused with a system that can fail at one point only.

In fact, a system could have more than one SinglePointOfFailure.

Contributors: EricHerman, BrettNeumeier


The power supply failed in TeleHouse a year or so back, taking down most of the UK Internet for half a day. This wasn't supposed to be possible. Telehouse has 2 independent external power supplies, plus its own generator. It's highly unlikely they would all fail at once.

I don't believe the exact cause of the incident was ever made public, but apparently there was a switch which changed over to a new power supply when the current supply went broke. This switch acted as a SinglePointOfFailure. I tell the story to illustrate how hard it can be to truly eliminate these things.

(Of course, the Telehouse guys were aware of the potential problem and tested the change-over switch carefully. According to rumor, an engineer was engaged in testing the switch at the very time the incident happened...)

Is this a good example of a SinglePointOfFailure? If there are protective redundancies built into a system, but both (or all) of them fail at once (or concurrently), then multiple points have failed. I think that we often mistakenly join together multiple points into a SinglePointOfFailure. -- JeremyCromwell

The TeleHouse event wasn't due to two power supplies failing at the same time. It was due to the switch which selected the power supply failing. There was only one switch. So I think the switch failure is a good example.

Is it perhaps a good example of why testing live systems you're slightly nervous about can sometimes be a bad idea? Compare: doing a regular test restore of your backup tapes, and accidentally overwriting the live copy in the process. -- MatthewAstley


A better example might be the end of Star Wars Episode I.

(It can be argued, with regard to the end of Star Wars Episode I, that all those robots had intentional self-destruction logic built in, so as not to be captured by the enemy. After all, the loss of a command center is a rather grave reason for some panic: robots breaking down can be thought of as a panic-consequence-minimization measure. As to the command center itself, I recall that it was hit somewhat hard, in a force majeure manner: this failure had much sooner occurred due to flawed combat tactics than to an engineering mistake, no?)


(I was aiming for a Compare: line, like the See also: line, but I got carried away.)

Compare: Chain being only as strong as its weakest link.

Actually, the comparison is between a chain and a mesh. A chain is a sequence of single points of failure, whereas a mesh is a highly redundant set of links.

Interesting. This fits the security aspect of the cliche quite nicely too: "your system is only as secure as the largest hole in the mesh", or something. 8-)


See: LocationTransparency, RedundancyIsInertia


EditText of this page (last edited August 25, 2008) or FindPage with title or text search