Lots Of Cheap Tests

A local WikiZen beat a nice essay on cheap tests out of MichaelBolton:

Reposted from comp.software.testing without permission & with tiny edits

The blog entry at http://www.developsense.com/blog.html contained an intriguing line: "lots of quick tests that have very low cost but uncertain value".

(I don't know where its permanent link will go. Google the web for the subject line if you are in the future.)

How defensible is such a position? How could cheap but inaccurate tests possibly help anything?

As Lllary pointed out, "uncertain value" does not mean "inaccurate". It means "we don't know whether this test will reveal a bug or not". In fact, all tests are like that, so it really means "we're even less sure than usual that this test will reveal a bug". What we do know about a cheap test is that it takes little time, little effort, and little cost to perform, so why not?

Certain kinds of automated tests are often--though not always--cheap. In fact, the intent of good automated tests is to make testing cheaper. If we can push a button, walk away, and do some other testing tasks (such as manual tests or writing more automated tests), that's swell.

Here's a great example of a cheap test, arrived at through automation. It comes from Doug Hoffman. A while ago, he was working on a new 32-bit processor, and he was trying to think of powerful tests for the integer square root function. He considered values that would be interesting in various number bases, he considered values that would twiddle the carry and overflow and sign flags in various ways, and he selected values that were at the edges of various conceptual boundaries, based on all sorts of theories of error. All those tests passed. Everything looked good. And then he thought: wait--why not try them all? On a 32-bit processor, there are only four billion integers to choose from, and he had access to an oracle in the form of a reference processor, so he cobbled together a little program to try them all. The program took six minutes to run. He found two bugs that he contends he never would have found otherwise; they just didn't fall into any theory of error that he had considered.
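(Not Hoffman's code--just a minimal sketch of the idea, in C, with every name invented for illustration: a slow but obviously-correct bitwise routine stands in for the reference-processor oracle, and a float-based routine stands in for the implementation under test.)

 #include <stdint.h>
 #include <stdio.h>
 #include <math.h>

 /* Stand-in for the implementation under test -- in Hoffman's case,
    the new processor's integer square root. */
 static uint32_t isqrt_under_test(uint32_t n) {
     return (uint32_t)sqrt((double)n);
 }

 /* The oracle: a slow bit-by-bit integer square root that is easy to
    convince yourself is correct, standing in for the reference processor. */
 static uint32_t isqrt_reference(uint32_t n) {
     uint32_t root = 0, bit = 1u << 30;
     while (bit > n) bit >>= 2;
     while (bit != 0) {
         if (n >= root + bit) { n -= root + bit; root = (root >> 1) + bit; }
         else root >>= 1;
         bit >>= 2;
     }
     return root;
 }

 int main(void) {
     uint32_t n = 0;
     do {    /* "why not try them all?" -- every 32-bit input */
         if (isqrt_under_test(n) != isqrt_reference(n))
             printf("mismatch at %u\n", n);
     } while (n++ != UINT32_MAX);
     return 0;
 }

The design point is the oracle: sweeping the whole input space is only cheap if every answer can be checked automatically, so the reference must be trusted more than the thing under test.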

Now: why wouldn't you always do this? Today we're being asked to test a 64-bit processor. The same heuristic--"why not try them all?"--has a teensy little problem: the input space is now 2^32 (over four billion) times larger, so the cost of the test has gone up from six minutes to something like 40,000 years. So now we have to go back to more targeted approaches--tests that we believe are most likely to expose bugs--because what was once cheap is now expensive.

There are other kinds of cheap tests, some of which can be done with automation assistance, and some of which can more quickly and easily be done manually. Undermining the application is a form of cheap test. Unplug the network cable in the middle of a transaction--almost no setup cost, and we expect that our product will recover gracefully, but maybe not. Use a tool like PerlClip to create a string of 10,000,000 characters on the clipboard in a second or two, and throw that at any given input field. We expect the input to be truncated pretty close to the GUI level, but maybe not. On the last project I worked on, I wanted to do this in FitNesse, but found that either the tool or our fixtures wouldn't support strings in excess of 60K or so. (Why such a big number? A moderately huge input (60K) has some likelihood of exposing a buffer overflow; a huge-huge input (10M) has a higher likelihood.)
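(Here's roughly what such a tool might do, sketched against the Win32 clipboard API. This is my own illustration, not PerlClip itself--PerlClip is a Perl script, and as I recall it can also emit counted patterns so you can read off exactly where a truncation happened.)

 #include <windows.h>
 #include <string.h>

 int main(void) {
     const size_t len = 10000000;            /* ten million characters */
     HGLOBAL h = GlobalAlloc(GMEM_MOVEABLE, len + 1);
     if (h == NULL) return 1;

     char *buf = (char *)GlobalLock(h);
     memset(buf, 'A', len);                  /* monotonous fill, for simplicity */
     buf[len] = '\0';
     GlobalUnlock(h);

     if (!OpenClipboard(NULL)) { GlobalFree(h); return 1; }
     EmptyClipboard();
     SetClipboardData(CF_TEXT, h);           /* the clipboard now owns the memory */
     CloseClipboard();
     return 0;
 }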

My daughter performs cheap tests on JumpStart Preschool by clicking at weird times in weird places, too soon after the last click, and so on--a quick test technique that James calls "click frenzy". Her hand-eye-mouse-pointer co-ordination is a little rough, which increases the randomness of the behaviour. But we get crashes often enough that she recognizes the Windows General Protection Fault dialog (which appears sometimes) and the runtime library's Bad Memory Allocation dialog, and she says "OH-OH!" every time she sees them. She's learned to be sanguine about it, but our lives would have been easier if JumpStart's testers had performed tests using this same approach. It takes less than a minute per screen. It would likely have been possible to automate this, but apparently they either automated it poorly or didn't automate it at all, thinking it too expensive. A click frenzy session might have been sufficiently valuable and sufficiently cheap for them, had they known to do it.
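(A click-frenzy monkey can be as small as this--again my own sketch, against the Win32 API, not anything JumpStart's testers had. A real one would confine its clicks to the window of the application under test; this one sprays them anywhere on screen.)

 #include <windows.h>
 #include <stdlib.h>
 #include <time.h>

 int main(void) {
     srand((unsigned)time(NULL));
     int w = GetSystemMetrics(SM_CXSCREEN);
     int h = GetSystemMetrics(SM_CYSCREEN);

     for (int i = 0; i < 1000; i++) {
         SetCursorPos(rand() % w, rand() % h);   /* weird places */

         INPUT in[2];
         ZeroMemory(in, sizeof in);
         in[0].type = INPUT_MOUSE;
         in[0].mi.dwFlags = MOUSEEVENTF_LEFTDOWN;
         in[1].type = INPUT_MOUSE;
         in[1].mi.dwFlags = MOUSEEVENTF_LEFTUP;
         SendInput(2, in, sizeof(INPUT));

         Sleep(rand() % 100);                    /* weird times, often too
                                                    soon after the last click */
     }
     return 0;
 }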

Cheap tests often cause developers to say "no user would ever do that". No user, for example, would place a shoe on a keyboard in order to see how the application reacted to an endless stream of keystrokes. But we reframe "no user would ever do that" into "no user that I can think of, and that I like, would do that on purpose". A nice user might place a book or a binder on a keyboard by mistake. A malicious or impish user (think of a 15-year-old encountering a keyboard on a kiosk) might hold down a key for fun.

Here's the rub: TestDrivenDevelopment tests, when run after the code has been written and has passed the test the first time, tend to fall into the category of cheap tests too. We don't really believe that we're going to break anything, do we? But we run them, because they're already written, they have very low cost, and uncertain but potentially very high value. Moreover, WardCunningham has pointed out that TDD tests should be relatively trivial; if they get too elaborate, they take too long to run, and thus they won't get run; and if they get too elaborate, they get too specific, and they'll need to be modified whenever the code changes significantly. The whole idea is to keep them at some uncertain value but low cost. As a tester, I like the idea of TDD. Its risk is that the tests will reflect the same mental models of the developers that lead to bugs in the code, but the value is that they will detect certain kinds of problems very rapidly and effectively, and they're inserted into the development process at the time when it's really inexpensive to fix a mistake (that is, immediately after the mistake has been made).
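(To make "relatively trivial" concrete: a cheap TDD-style test can be as small as this invented example--no setup, runs in microseconds, and we don't really expect it to fail, but it's already written, so we keep running it.)

 #include <assert.h>

 /* clamp() pins a value into the range [lo, hi]. */
 static int clamp(int value, int lo, int hi) {
     if (value < lo) return lo;
     if (value > hi) return hi;
     return value;
 }

 int main(void) {
     assert(clamp( 5, 0, 10) ==  5);   /* in range: unchanged */
     assert(clamp(-3, 0, 10) ==  0);   /* below: pinned to the lower bound */
     assert(clamp(42, 0, 10) == 10);   /* above: pinned to the upper bound */
     return 0;
 }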

In answer to your description of "inaccurate tests": I don't know what you mean by inaccurate tests, so you can define them as you like. :) Do you mean "useless tests"? Tests that are based on erroneous presumptions? Tests that ignore things like system requirements? Tests that could not conceivably matter to anyone? Tests that do not address some risk? Such tests can exist, but I don't think I would call these "inaccurate"; perhaps "inappropriate" or "unhelpful" or "(probably) valueless". But you may have something else in mind.

You suggested "I think it's a test that fails even when the user could never perceive any bug." There are a couple of sticking points for me in that suggestion. One is "the user"--but which user? Some users might be exposed to or vulnerable to some risk; others not. Some users--like hackers--might exploit some risk that "normal" users would not. Some users--like blind, deaf, or physically disabled people--might be ill-served by things that would do fine for able-bodied folks. "Never" is another sticking point, and "perceive" is another one. As an ordinary observer, I would "never perceive" a blown pointer, for example--for a while.

In the paradox of Achilles and the Tortoise, we imagine the Greek hero Achilles in a footrace with the plodding reptile. Because he is so fast a runner, Achilles graciously allows the tortoise a head start of a hundred feet. If we suppose that each racer starts running at some constant speed (one very fast and one very slow), then after some finite time, Achilles will have run a hundred feet, bringing him to the tortoise's starting point; during this time, the tortoise has "run" a (much shorter) distance, say one foot. It will then take Achilles some further period of time to run that distance, during which the tortoise will advance farther; and then another period of time to reach this third point, while the tortoise moves ahead. Thus, whenever Achilles reaches somewhere the tortoise has been, he still has farther to go. Therefore, Zeno says, swift Achilles can never overtake the tortoise. Thus, while common sense and common experience would hold that one runner can catch another, according to the above argument, he cannot; this is the paradox.

(Any similarity between that and the WikiPedia entry must be just a shocking coincidence.)

Now suppose Achilles is a software tester, trying to exhaustively test a program. He writes 100 test cases, and tests the program to 50% exhaustion. The next 100 test cases go to 66% exhaustion. The next 100 go to 70% exhaustion. Each batch of tests covers less of the remaining space than the one before, so Achilles can never squeeze out the last 0.00...1% of exhaustion.

Now suppose Achilles writes cheap tests, and writes a program that can be constrained by them. The tests can fail too easily, even when there's no bug. The metaphor here is Achilles leaping over the tortoise instead of running to it. Achilles writes a program that can be constrained by tests as cheap as the ones he has written.

Your Achilles metaphor is nicely put. However, I'd suggest that the tortoise has an effectively infinite head start, since anything other than the most trivial program has an effectively infinite input space (the set of all valid inputs PLUS the set of all invalid inputs gets to infinity pretty quickly). We never get anywhere near 100% exhaustion. What we might get is near-100% skepticism that there's a bug there, but it's asymptotic. We do that to some degree with shotguns--some number of quick, uncomplicated, scattered tests that fill a space relatively close by--and to some degree with rifles--some number of harsh, elaborate, precisely targeted tests that might be (to stretch the metaphor) some distance off.

There are at least two risks associated with any test. First, the test might fail to identify a bug that's there, and second, it might falsely identify a bug that isn't there. In the first case, we know going in that a cheap test might fail to identify a bug, but running the test is by definition cheap enough that we don't mind. In the second case, if a cheap test consistently finds bugs that aren't there, it defines itself out of existence by being expensive, so we don't run it.

Finally, note that a test strategy composed entirely of cheap tests is likely to be a failing strategy. But a test strategy that is composed entirely of expensive tests is likely to be... expensive.

Again, thanks for affording me the opportunity to explain this stuff. I hope it helps.

---Michael B.

