Mean Time Between Failure For Software

[inspired by comments on ThereIsNoSuchThingAsNoBugs.]

[Note: MTBF is also commonly used to stand for "mean time before failure", which de-emphasises the repetitive aspect some people have a problem with when applying the term to software.]


Hardware reliability is often defined as "mean time between failures."

But, is this an appropriate measure for software?

No. :-)

Why not? Most of the arguments I've seen against this had to do with most software failures being one-offs: a bug gets patched, and we never see it again. But this misses the point that MTBF is a statistical measure, an estimate made at the point when the hardware is put into service. To take another example of 'bug finding' - discoveries of new species of insect are one-offs in exactly the same way; once it's discovered, it's discovered. However, by looking at the time between each discovery, you can estimate the number of bugs that remain to be discovered. At any given point in time (such as, say, the launch date of Windows XP) it becomes possible to estimate the amount of time before the next bug - insect or otherwise - is found. There is some information on estimating MTBF for systems which become increasingly reliable (i.e. where you can fix bugs) here: http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm .
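Here is a minimal sketch of that statistical view in Python (all numbers invented): given the points in test time at which bugs were found, it takes the mean of the gaps as a crude MTBF and reads that as the expected wait for the next bug. It assumes a constant failure rate, so it ignores the reliability growth you get from fixing bugs; a growth model like those at the NIST link above would weight recent gaps more heavily.

 # Crude MTBF estimate from bug-discovery times (hypothetical test history).
 discovery_times = [12, 30, 41, 77, 102, 160]  # test-hours at which each bug appeared

 def estimate_mtbf(times):
     """Mean of the gaps between successive discoveries."""
     gaps = [b - a for a, b in zip([0] + times[:-1], times)]
     return sum(gaps) / len(gaps)

 mtbf = estimate_mtbf(discovery_times)
 print(f"Estimated MTBF: {mtbf:.1f} test-hours")
 print(f"Expected wait for the next bug: roughly {mtbf:.1f} more test-hours")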

A better argument against MTBF for software is that for many kinds of software we cannot reproduce 'normal use' (see e.g. http://www.cigitallabs.com/resources/definitions/software_reliability.html), and thus the product history from testing that I used above is not realistic. However, many software products do have environments which we can reproduce, and this argument does not apply (think e.g. a bespoke application delivered on its own server, instead of Word running on some user's PC with god knows what else running).

MTBF is not a very good measure for hardware either. There are a lot of ways to make the calculation and you often pick the way that gives the best final number. It is often a matter of adjusting the calculation, not the design, to improve the MTBF number.

Depends what you use it for. Examples:

Knowing MTBF or some equivalent measure of reliability by looking at product history (or pre-release test history) lets you estimate, among other things, how much it's going to cost you to find the next bug. This can be useful in deciding if a product is ready for release.
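As a back-of-the-envelope continuation of the sketch above (both figures here are invented, not from any real project): if the pre-release test history gives a crude MTBF of 27 test-hours and an hour of testing costs $150, the next bug costs roughly 27 * 150 dollars to find, which can then be weighed against the expected cost of shipping with it.

 mtbf_test_hours = 27          # crude MTBF from the pre-release test history above
 cost_per_test_hour = 150.0    # assumed fully loaded cost of an hour of testing

 # Expected cost to surface the next bug; compare against the expected cost of
 # shipping with that bug still in the product to judge release readiness.
 print(f"Expected cost to find the next bug: ${mtbf_test_hours * cost_per_test_hour:,.0f}")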

MTBF is usually calculated in hardware by taking a DoD specification, which lists various characteristics such as transistor count, or by using information provided by the vendor. This information is usually generated using an accelerated life-testing procedure, where hundreds of parts are operated under extreme conditions and the results of this test are used to extrapolate the MTBF for a single part. The system builder then uses a "derating" equation to convert this MTBF value into one for the expected operating environment of the system. The individual MTBF values are then combined to create a system MTBF. None of these operations is really statistically justifiable, and a lot of judgment is involved. The calculated MTBF values usually have no correlation to measured values obtained after the system is fielded, as most failures result from things that do not even feed into the equations. Making decisions based on the calculated MTBF values is kidding yourself.
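In miniature, that parts-count arithmetic looks something like the following sketch. Every number is made up, and the "environment factor" is only a stand-in for whatever derating equation is actually applied: each part contributes a failure rate (the reciprocal of its MTBF), the factor adjusts that rate for the operating environment, the rates of a series system add, and the system MTBF is the reciprocal of the sum.

 # Toy parts-count calculation: part name -> (vendor-quoted MTBF in hours,
 # environment adjustment factor standing in for the "derating" equation).
 parts = {
     "cpu":          (1_000_000, 2.0),
     "power_supply": (  200_000, 1.5),
     "disk":         (  500_000, 3.0),
 }

 # Series system: failure rates (1/MTBF) add; the factor scales each part's
 # rate for the assumed operating environment.
 system_failure_rate = sum(factor / mtbf for mtbf, factor in parts.values())
 system_mtbf = 1 / system_failure_rate

 print(f"System MTBF: {system_mtbf:,.0f} hours")  # ~64,500 hours for these numbers

The arithmetic itself is trivial; the point of the paragraph stands, since the output is only as good as the judgment calls hidden in the factors.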



Software has this habit of working 100% reliably until you hit a bug. Then it may hit 100% failure mode from then on.

Consider the AT&T bug with 1-800 service: When call volume hit a certain level, an uninitialized variable in exception processing caused the telephone switch to crash. Other switches, sensing the problem, would reroute calls to other running switches -- causing them to overload and crash. Any switch that came up would instantly be swamped with calls, overload and crash.

Fortunately, they were able to "downgrade" the switches to a previous software version that did not have the bug; otherwise there might not have been any way to restore telephone service.

So how do you measure this in terms of a traditional MTBF?

MTBF is a measure of average uptime, not downtime. Clearly in this case the uptime was related to the frequency with which call volumes would reach a given threshold. It's actually more common for broken hardware to be 100% broke when it's broke :o).

The payroll system works until you hire Mr. O'Leary (a person with a quote character in their name). Then you can no longer print checks. So you mangle that person's name. Now the system works again -- until the next data entry clerk enters O'Rilley.
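A guess at the kind of failure being described, sketched in Python with sqlite3 standing in for whatever the payroll system actually used: a name containing a quote breaks naive string concatenation into a query, while a parameterized query copes with it.

 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE employees (name TEXT)")

 name = "O'Leary"

 try:
     # Fragile approach: the apostrophe terminates the SQL string literal early.
     conn.execute(f"INSERT INTO employees (name) VALUES ('{name}')")
 except sqlite3.OperationalError as e:
     print(f"Naive concatenation fails: {e}")

 # Robust approach: let the driver quote the value.
 conn.execute("INSERT INTO employees (name) VALUES (?)", (name,))
 print(conn.execute("SELECT name FROM employees").fetchall())  # [("O'Leary",)]

Until the unhandled input arrives, both versions look equally reliable, which is exactly the point being made here.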

Software failures are design problems, not wear and tear problems. Software doesn't fail due to getting "worn out" with use. Software fails because of design errors. So, while an engine will fail from time to time (without preventative maintenance) due to parts wearing out, software will work perfectly until it encounters inputs it was not designed to handle, then it fails completely until those inputs are removed. [ Hardware fails due to design errors too, and there are MTBF models specifically aimed at this 'prototyping' phase. See the argument up top and the link to http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm ]

While strictly true, observation suggests the opposite. My laptop works fine. Then I install FrameMaker and I can't dial out twice. Then I install an update to InternetExplorer and the machine won't shut down. Then... Sure looks like wear and tear to me. See also BitRot.


Can someone post a link to an article on this AT&T bug?

It'll take some effort to track it down. I read about it in comp.risks and Communications of the ACM.

From the CACM article, it became clear to me that the error was caused by a C 'break' statement executed in exceptional circumstances (i.e., heavy load) that bypassed some necessary initialization logic. -- JeffGrigg

TheScienceOfDebugging has a detailed account. IIRC, it involved a classic C mistake, forgetting to include a break at the end of a case.


CategoryTime
