Tests Cant Prove The Absence Of Bugs

Regarding KentBecksChangeCostXpArticle:

Issue #2: You didn't really KNOW that the change was safe.

There being no instances of ">2 line" transactions in the data does not prove that there is no code that can produce them; it only shows that in that period of time none did. If the system imported transactions from another system, for example, it might be just a coincidence that they had not yet done "year end processing" or some other process that produces "N line" transactions. And after refactoring this functionality out of the system, what would you do on the day that the system breaks online, because you missed one of the cases?

Here's another reason to be nervous: None of the UnitTests failed. How well are you testing the functionality of the Transaction class, if changing some of its methods so they throw away data does not cause any of the tests to fail?

-- JeffGrigg


So true, they can't. Consider two new airframes. The one on the left has flown 10,000 hours of tests. At the beginning, it crashed a lot. Lately, it hasn't crashed at all. The one on the right hasn't been tested. You need to get home from this desert island. Which one will you ride in? Justify your answer. -- RonJeffries

Hang on a minute. Has the airframe on the right undergone strict Fagan Inspection by a trained team of problem-domain competent Inspectors? Thanks, I'll take that one. Four people, skilled and experienced, trained in the Way of the Process, have inspected this airframe for defects and have found none. None. Zero. Zip. Zilch. Nada. Any defects that were found in previous inspections were corrected and the fixes themselves inspected in context for potential new defects introduced by the fix. This iterative process continued until no more defects showed up.

Gas 'er up.


''This concern appears to be based on the notion that the UnitTests were devised at random, not with careful attention. While it is certainly the case that errors do slip through conscientious testing, it's still way ahead of the usual alternative, i.e. not testing.

Jeff, what alternative practice do you follow that provides even better certainty during refactoring?''


This is the old saw that tests can only prove the presence of features, not the absence of bugs. True, but irrelevant for our purposes. There is some defect rate for any project below which you are willing to act like you don't have defects as long as the tests run.

We crossed this threshold while I was working on LifeTech. At first, nobody could make a single move because they would break something else. (ExtremeFrustration!) After about four months of OnDemandUnitTesting? (see UnitTestTrial, I think), we changed from thinking "all the tests running just proves we don't have enough tests" to thinking "all the tests run and we can't think of anything that might break, so put it into production".

Terrific. We tested this implanted defibrillator. Well, we tested it until we ran out of things to test. Here, let's stick it in your chest...

Most times we were right, but occasionally we were wrong and we would put code into production that broke. This happened less and less frequently as we got more tests. Every time something broke in production we learned about another kind of test we needed to write. -- KentBeck


Would coverage tools help ensure the completeness of testing? Clearly, nothing will let you say, "there are absolutely no bugs", but being able to state "we've executed every line of code, and all our tests passed" adds to the comfort level.

Coverage tools add comfort, for sure. The fact that coverage is complete also increases the number of "edge" cases that are checked. However, it's coverage of combinations of conditions that really does the checking - and that generally requires human thought: "OK, but what if he's at the savings match and gets a raise this month?" Don't let your coverage tool discourage thinking and discussion of what to test.

(Also, code coverage only shows that a piece of code was _executed_ at least once, not that every line of code has been tested with every option...)
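
To make that point concrete, here is a minimal, hypothetical sketch (the shipping_cost function and its test are invented for illustration, not taken from any system discussed here): a single test can execute every line of a function - satisfying a line-coverage tool completely - while never exercising the case that is actually broken.

 # Hypothetical example: 100% line coverage, and a bug survives anyway.

 def shipping_cost(weight_kg, express=False):
     """Return the shipping cost in dollars; express shipping doubles the rate."""
     rate = 4 if express else 2
     return weight_kg * rate      # bug: negative weights yield negative costs


 def test_shipping_cost():
     # Both branches of the 'rate' expression run, so line coverage is 100%,
     # but nothing here asks what happens when weight_kg is negative.
     assert shipping_cost(3) == 6
     assert shipping_cost(3, express=True) == 12


 if __name__ == "__main__":
     test_shipping_cost()
     print("tests pass, coverage reports 100%, and the bug is still there")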


When I first met Kent, I did not understand this either. Let me try to explain my newfound understanding. Many people assume that UnitTesting as used within ExtremeProgramming is the same as UnitTesting as per some other methodology. There are some subtle differences that make it more effective.

The first is a misconception that comes from familiarity with BigBangTesting. That is, you sit down and just test and test your code all in one session. And you always, always miss several big bugs. Thus you find TestsCantProveTheAbsenceOfBugs. But this is not XP style. With XP, you create tests before you code, while you code, and then even after you code. You create tests whenever a bug is found. You create tests when you look at someone else's code and see a testing hole. And you don't just throw those tests away after you believe you are done. You evolve and expand your test suite over time. The testing suite that results is (trust me) very different from what you might imagine.

The second misconception comes from a lack of YouArentGonnaNeedIt. That is, you wonder how much of your system is not being exercised by your tests. The XP answer is none of it. You have been very careful not to include anything extra, and you tested 100% of everything you did include. If you missed something, someone else will add a test for it sooner rather than later.

After a while (and I am talking about only a few months of evolution) you will have a collection of tests which do, in fact, prove the absence of many bugs. Not all of course, but any relevant ones will be covered. -- DonWells


I believe the objection is to the use of the word "prove", which has a particular mathematical interpretation. I don't usually use "prove" for exactly this reason, although I can believe I boo-booed in the article. The question is whether or not you can act like you have proved the absence of defects, even though you know you haven't, not in the mathematical sense of the word. -- KentBeck

ProofsCantProveTheAbsenceOfBugs, either.


Passing a test suite just means you are ready to release the software. This is a good thing, but there's always another bug.

To illustrate, a story by EricUlevik (don't trust the details):

Doing forex trading, we found a case where our software would produce a bad calculation, once a week or so. Eventually, the problem turned out to be a failure in a CPU cache line refresh - a hardware design fault in the PC.

The test suite included running for up to two weeks at maximum update rate without error, so this bug was found.

If you need this level of quality, a good place to look is Intel's web site for processor errata.

See HardwareErrata.


Note that after restricting the functionality of the Transaction object, all the UnitTests ran successfully. While this does increase one's confidence that the change did not break existing functionality (the most severe risk of a change like this), it also leads to another, more disturbing conclusion:

The unit tests for the Transaction class are not sufficient to detect this loss of functionality.

My response would be to add more Transaction unit tests. -- JeffGrigg

ItDepends. If the functionality is required, i.e. you really do have a card for it, then test it and leave it in. The test was erroneously left out. On the other hand, if the functionality is not an immediate requirement, you should invoke YouArentGonnaNeedIt and remove the functionality rather than test it.

Either way, by paying attention to having tests and functionality aligned, our confidence in the system increases.
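
For illustration, here is a hedged sketch of the kind of Transaction test Jeff suggests above. The Transaction class shown, and its add_line and lines methods, are invented for this page; the real class in KentBecksChangeCostXpArticle may look nothing like this.

 import unittest

 # Hypothetical stand-in for the article's Transaction class; its real
 # interface is not shown on this page, so these names are invented.
 class Transaction:
     def __init__(self):
         self._lines = []

     def add_line(self, account, amount):
         self._lines.append((account, amount))

     def lines(self):
         return list(self._lines)


 class TransactionTest(unittest.TestCase):
     def test_keeps_more_than_two_lines(self):
         # The missing test: a transaction with three lines must not
         # silently drop any of them.
         t = Transaction()
         t.add_line("cash", -300)
         t.add_line("rent", 200)
         t.add_line("utilities", 100)
         self.assertEqual(3, len(t.lines()))


 if __name__ == "__main__":
     unittest.main()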


A question for the XP gurus:

--

In refactoring, you are either changing implementation, or you are changing interface. It is bad to change both at once. If you are changing only implementation, then the tests do not require changing.

If you are changing interface, the tests need changing. Generally, of course, they are quite similar to the old ones.

It is in no way a problem: if you were implementing new functionality, you'd have to do tests, typically from scratch. New tests are not a problem. If you're refactoring interface, you still have to do tests, and they aren't even as hard as new tests, which aren't hard.

As always, with new code or refactoring, if a test breaks, there are two possibilities: the code is wrong, or the test is wrong. If the developer can't tell which, he should consider a job in Marketing. ;-> -- RonJeffries
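
A minimal, hypothetical sketch of that distinction (the Invoice class and its methods are invented here): an implementation-only refactoring leaves the test untouched, while an interface change forces only a small, mechanical change to the test.

 import unittest

 # The public interface is total(); the test pins that interface down.
 class Invoice:
     def __init__(self, amounts):
         self.amounts = amounts

     def total(self):
         # Implementation-only refactoring - e.g. replacing this loop with
         # sum(self.amounts) - changes nothing the test can observe, so the
         # test stays exactly as it is.
         result = 0
         for amount in self.amounts:
             result += amount
         return result


 class InvoiceTest(unittest.TestCase):
     def test_total(self):
         self.assertEqual(60, Invoice([10, 20, 30]).total())
         # If the interface changes - say total() is renamed grand_total() -
         # then this test changes too, but only in the obvious, mechanical way.


 if __name__ == "__main__":
     unittest.main()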


What about ProbabilisticProofs?? The reason we can generate large PrimeNumbers for cryptography is that we use a prime testing algorithm that's very fast, only it goofs up from time to time. But the algorithm also has the property that the more times we run it successfully on a number, the more certain we can be that the number is prime. This really pisses off the purists, but seems to work in the real world.

Maybe a sufficient suite of unit (and functional) tests serves as a probabilistic proof that the system is correct.

-- JohnBrewer
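
The kind of test John describes could be, for example, the standard MillerRabin algorithm; a minimal sketch follows (for illustration only - real cryptographic code would use a vetted library). Each round a candidate survives shrinks the chance that a composite slipped through, which is exactly the "run it more times, trust it more" property at issue.

 import random

 def is_probable_prime(n, rounds=20):
     """MillerRabin: False means definitely composite; True means 'probably
     prime', wrong with probability at most 4**-rounds."""
     if n < 2:
         return False
     for p in (2, 3, 5, 7, 11, 13):
         if n % p == 0:
             return n == p
     # write n - 1 as d * 2**r with d odd
     d, r = n - 1, 0
     while d % 2 == 0:
         d //= 2
         r += 1
     for _ in range(rounds):
         a = random.randrange(2, n - 1)
         x = pow(a, d, n)
         if x in (1, n - 1):
             continue
         for _ in range(r - 1):
             x = pow(x, 2, n)
             if x == n - 1:
                 break
         else:
             return False      # found a witness: n is definitely composite
     return True               # survived every round: probably prime

Each round multiplies the residual doubt by at most 1/4, so twenty rounds leave less than one chance in 10^12 that a composite got through - analogous to the way each additional passing test shrinks, but never eliminates, the chance that a defect remains.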

It's refreshing to hear this point of view. People get hung up on extremes: either it's provable or everything sucks. But just because a test can't definitively prove the absence of bugs in a system does not mean that we as engineers shouldn't test that system as rigorously as possible. So, I would agree: you don't have to convince everyone that software can be proven correct in order to show the value of testing. I think someone (maybe WattsHumphrey?) came up with the maxim that if you have found a bug, you can be certain there is another one. :) -- RobertDiFalco

That's a pretty safe conclusion even if you haven't found a bug yet. -- GarethMcCaughan

I think WattsHumphrey's view is that finding a bug in a particular area suggests that there are more bugs in that area. Example: code that has a lot of syntax errors will likely also have more serious errors. Not sure how that fits on this page... -- JasonYip

Yeah, that was it. Maybe I was mistaken, but it seemed relevant to the idea that you cannot prove the absence of bugs. In fact, finding a bug is often an indication that more serious bugs exist. So, while testing cannot prove the absence of bugs, it can prove their existence - and this is an important and good thing. It allows the engineer to remove those detected bugs and, hence, increase confidence in the system. Seems logical to me. Another great thing about UnitTesting is that (when you are rigorous about the tests and add them to your daily build) you waste less of QA's time when they do integration testing. There is no excuse for giving QA a build that has errors in the individual, unintegrated units.

This will be slightly off-topic, but I have always been pretty religious about developing many small and atomic UnitTests that map to component interfaces. Actually, I used to create a single TestCase for every public method. Now I just make sure that each method works, write tests to simulate every fail-condition, and make sure that running everyone's UnitTests is an automatic part of the DailyBuild script. However (while admittedly a subtle distinction), I don't really believe that I am removing any bugs (nor proving the absence of new bugs). Actually, I visualize my unit tests as an extension of traditional DesignByContract. The TestMethods? provide a harness for exercising these contracts. My general rule is that I shouldn't have behavior (for failure or success) that can't be asserted in some way. The more rigorous my assertions, the higher my confidence level is in the code's behavior. So, in this way, each TestCase is really just a harness or an artificial agitator for my Contracts.

Finally, just as I would never skip UnitTesting, I would also never be naive enough to think UnitTesting alone is sufficient. And so, I also make certain that QA performs black-box and white-box functionality tests that don't test my implementation as much as they test the solutions my units of implementation provide when integrated together. -- RobertDiFalco
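
As a hypothetical sketch of the "tests as a harness for contracts" idea above (the Account class and its rules are invented for illustration): every success condition and every fail-condition gets an assertion, so the tests agitate the contract rather than claiming to prove the absence of bugs.

 import unittest

 class Account:
     """Invented example: the contract is 'the balance never goes negative'."""
     def __init__(self, balance=0):
         if balance < 0:
             raise ValueError("opening balance must be non-negative")
         self.balance = balance

     def withdraw(self, amount):
         if amount <= 0:
             raise ValueError("amount must be positive")
         if amount > self.balance:
             raise ValueError("insufficient funds")
         self.balance -= amount
         return self.balance


 class AccountContractTest(unittest.TestCase):
     # One assertion per clause of the contract: the success path...
     def test_withdraw_reduces_balance(self):
         self.assertEqual(70, Account(100).withdraw(30))

     # ...and every fail-condition we can think of.
     def test_cannot_overdraw(self):
         with self.assertRaises(ValueError):
             Account(10).withdraw(20)

     def test_amount_must_be_positive(self):
         with self.assertRaises(ValueError):
             Account(10).withdraw(0)


 if __name__ == "__main__":
     unittest.main()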

There is evidence of BadModules? where one finds 80% of a system's faults in 20% of its modules. This seems to be true, so far, of all systems, whatever methodology or tools are used.


''That's a pretty safe conclusion even if you haven't found a bug yet.''

Another thought experiment. Two modules to run your pacemaker have undergone large amounts of testing, equivalent as far as you know. From module A, no defects have ever been found. From module B, hundreds have been found so far. Your pacemaker has a JulY2K problem and you must choose a new module NOW, or die. Which one do you choose? Justify your answer. -- RonJeffries

I know what you are trying to prove, Ron - but your example has major holes. Obviously, it must depend on the kind of bugs found, how much actual time the two have been in use, the reputation of the manufacturers and many other factors. Otherwise, we would expect a bug-ridden OS like Windows to fail utterly compared with the much more stable Linux.

The observation that bugs have been found is certainly a good predictor that still more remain - but it is not a reliable predictor that people will not use the product. -- RussellGold

All you know is that both modules have undergone extensive testing, equivalent as far as you know. One has never exhibited a defect, one has exhibited hundreds. Your heart is stopping. Pick one. Justify your answer. -- rj

Ron, if both underwent rigorous testing, I would have to select module A. Just by the fact that hundreds have been found in module B, I can pretty safely bet that many more are yet to be found. However, since so few (actually none) have been found in module A, chances are that none will be found. Is this where you were heading? Which would you pick? BTW, I think this was a useful thought experiment. -- RobertDiFalco

A. I can accept no errors in my pacemaker. Only one of the systems has exhibited no errors. End of consideration. For extra credit, consider pacemaker software C, which has shown a declining error curve. It started with lots, but no new defects have been discovered for a "long time" now. Do we prefer it to B? Do we prefer it to A? And why? Your heart is stopping. Pick one. ;->

Ron, I would still choose module A. I still prefer code which has never shown bugs in production testing to one that has radically reduced its number of bugs. For anyone who disagrees with the concept of UnitTesting, here's another question. How can you achieve the success of module A without UnitTesting? Personally, I think the only answer is either serendipity or a team mate who did some unit testing behind your back. :) -- RobertDiFalco

This is the TheresTheBug principle.

My flippant remark appears to have been taken much more seriously than it deserved. Anyway, obviously the answer to the pacemaker question is that I pick module A rather than module B. If pacemakers actually contain non-trivial software (which I rather doubt, but what do I know?) then I'd guess A probably has bugs, but clearly any bugs it has show up less than the bugs in B. Do I prefer A or C? Depends on the amount of testing they've had. If the total is the same, then C's latest version, despite not having exhibited any bugs, hasn't been tested as much as A's (presumably only) version. If the total on C's latest version equals the total on A's only version, I might pick C. I don't think any of this is surprising, unless someone here is claiming that testing isn't useful. I doubt anyone is. -- GarethMcCaughan

I'm going to step out on a limb since I haven't actually worked on a pacemaker, but from talking to biomedical engineers, they do consider pacemaker software to be trivial. Or at least, it is straightforward to formally verify the algorithms. Software tends not to be the main failure condition but rather rejection and hardware problems. -- JasonYip

The people wearing the pacemaker do not consider the software to be trivial. My father's pacemaker refused to speed up when appropriate. Programming? Configuration? Still an error. -- GeorgeCatlin?

I'll try not to belabor the point too much, but the testing thing is not the issue here. Tests may be at fault or improperly implemented. Inspection by a trained team, knowledgeable in what they are doing, is kinda hard to fool. Forgive me if I sound like some kind of new religious convert. It's just that I have seen the FaganDefectFreeProcess work in real life, right in front of my face, using real code and real documents that we created. I have yet to see any of the aspects of XP come to bear fruit at this client. My time here has been short - about ten months - but so far, nothing. Fagan, on the other hand, is happening right away and in real time. -- MartySchrader

I might pick C, because when you do TDD, your module should fail each test at least once. Unless A failed its test cases at least once early on, the fact that A has never failed a test makes me think it is possible that the developer didn't write good test cases at all (maybe all his test cases were just empty test methods). C, on the other hand, failed all its tests first and was gradually fixed to do what the test cases said, so I'm quite sure module C's developer spent some time writing his test cases. Module B's test cases may be as good as C's or as bad as A's; however, module B doesn't pass its tests, so I'll not pick it.


Hmm... (light bulb flashes briefly)... I think a reason why this all works is that, with all the small simplest-thing-that-could-possibly-work steps going on, the formal process of analysis can be done in a moment, rather than taking 'x' number of minutes, hours, or days. Or am I on crack? -- CwillUnderwood?


TestsCantProveTheAbsenceOfBugs

Of course not. But proof is not a required condition for usefulness.

It is for liability.

Veering away from the absolute for a moment, while testing cannot prove the absence of all errors, it can demonstrate the absence of certain errors.


TestsCantProveTheAbsenceOfBugs

This seems to be a semantic quibble masquerading as some profound truth. Tests, particularly as done in TestDrivenDesign, are a good indicator of correctness.

An individual test shows correct operation under a very specific set of conditions, some of which may not be obvious. It remains a matter of experience to determine how much one can extrapolate from a single test. It is not feasible to test every possible parameter value and every combination of parameter values for a function. Hence, one chooses a small representative set of parameters to verify operation of the function and assumes correct operation for the remaining cases.

A second scenario concerns differences between the operational scenario of a function and the test scenario. If the operational scenario requires the function to do something that was not covered in the test scenario, the result will be indeterminate. It may work, it may not. If one follows Test Driven Design, this can result only if the developer did not know what needed to be built. In an independent testing approach, this will result if the tester did not know what needed to be tested (we cannot make any assumptions about the knowledge of the developer in this case).

Testing, especially when using Test Driven Design, is a powerful way to determine the conditions under which a function will operate correctly. It takes some knowledge and experience to extrapolate whether these successful conditions indicate the function will work without fault in operation.

-- WayneMack
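
A small, hypothetical sketch of the "small representative set of parameters" point above (the bulk_discount function and its boundaries are invented): rather than every possible input, the test picks values in each region and at its edges, and assumes correct operation for the remaining cases.

 import unittest

 def bulk_discount(quantity):
     """Invented example: 10% off for 10-99 items, 20% off for 100 or more."""
     if quantity >= 100:
         return 0.20
     if quantity >= 10:
         return 0.10
     return 0.0


 class BulkDiscountTest(unittest.TestCase):
     def test_representative_values(self):
         # We cannot test every integer, so we test each region and its edges
         # and assume correct operation for the remaining cases.
         cases = [(0, 0.0), (9, 0.0), (10, 0.10), (99, 0.10), (100, 0.20), (5000, 0.20)]
         for quantity, expected in cases:
             self.assertEqual(expected, bulk_discount(quantity))


 if __name__ == "__main__":
     unittest.main()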


Corollary: ErrorsCannotProveTheAbsenceOfTesting? :-)


See SmallestProgramWithaBug, ProofsCantProveTheAbsenceOfBugs, TestingCanReduceNumberOfErrors?, NothingCanProveTheAbsenceOfErrors?


CategoryTesting CategoryBug

