Bugs Aren't Voodoo

If your program has a bug, you have two courses of action:

One: Assume a CompilerBug. Assume an OS bug. Assume a CowOrker error. Assume that black magic is at work. Whatever it takes to keep you from admitting the possibility that you made a mistake. Now, to avoid seeing that particular bug, forever stay away from that particular compiler, OS, coworker, or phase of the moon. This is convenient, but you never figure out what actually happened.

Two: Assume it's something you've done. Figure out what it was. Write a test to make sure it doesn't happen again.
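As a minimal sketch of "write a test to make sure it doesn't happen again": suppose the bug was a crash when averaging an empty list. The function `average` and the test names below are hypothetical, invented purely for illustration. The point is only that the fix and the test that pins it down arrive together.

```python
import unittest


def average(values):
    """Hypothetical function that once raised ZeroDivisionError on []."""
    if not values:  # the fix: handle the empty case explicitly
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_list_does_not_crash(self):
        # Reproduces the original (hypothetical) bug report.
        self.assertEqual(average([]), 0.0)

    def test_typical_input_still_works(self):
        # Guards against the fix breaking the normal case.
        self.assertEqual(average([1, 2, 3]), 2.0)
```

Saved in a file, this can be run with `python -m unittest`. If the bug ever comes back, the first test fails and names the cause.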

Other excuses heard 'round the programming world:

But I didn't change anything since that last build!
Oh, weird. It shouldn't be doing that.
You're sure you didn't check in any new code yesterday?
The new operating system service pack must have broken it.
It works on my machine. There must be something wrong with yours.
What do you expect? C++ has pointers, so it's dangerous and it's impossible to write code that doesn't crash once in a while.
The design document must be flawed. I implemented exactly what was specified.
...

Sometimes bugs really are due to the phase of the moon or, more famously, the change of the century.

When you're writing code and you find that something doesn't work right, assume that you made a mistake. 99.9% of the time you'll be right.

Note that this would mean that, over your career, about one time in a thousand you'd really have a compiler bug or whatever. That sounds like about the right ratio... Not convinced? How many real compiler and/or OS bugs have you found? In 12 years, maybe 6. (Not counting Java runtime bugs.)

Also note that compiler bugs are relatively easy to find, if you really think you have one. Just look at the disassembled instructions (or get a RealProgrammer to check if you can't do it yourself). OS bugs are easy to find as well: just create a test case where you call the API with the parameters your program uses, and if it returns an incorrect result, then you've proven that a bug exists. In either case, there is no excuse for saying "I think there's a bug in the compiler/OS, but I'm not sure."
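The "call the API with the parameters your program uses" approach can be sketched in a few lines. The helper `reproduce` below is a hypothetical name, and the suspicion being tested ("`round(2.5)` should give 3, so `round` is broken") is invented for illustration. The built-in `round` stands in for whatever OS or library call you suspect.

```python
def reproduce(api, args, expected):
    """Call `api` with exactly the suspect arguments and record what happens."""
    actual = api(*args)
    return {
        "args": args,
        "expected": expected,
        "actual": actual,
        "matches": actual == expected,
    }


# Hypothetical suspicion: "round() is buggy, it rounds 2.5 down!"
report = reproduce(round, (2.5,), 3)
print(report)
```

Here the report shows a mismatch, so there is something concrete to check against the documentation. In this case the documentation says that Python 3 rounds ties to the nearest even value, so the "expected" value was the mistake, not the library. Either way you end up knowing, rather than thinking there might be a bug.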

I gave the 99.9% figure, which is hard to argue against, as I'm sure the ratio of compiler to application errors is much, much smaller than 0.1%! Yes, I've found compiler errors. The Microsoft M80 8-bit 8080 assembler had a problem with symbols used in named common blocks. (This is something you'd only do if you were writing TurboDOS drivers.) And I found something like a dozen different errors in Sybase Transact-SQL -- typically, "call a stored procedure with a NULL parameter, but change it to a real value after the execution plan has been bound," or other such unusual stuff. -- JeffGrigg

Do OSes not have intermittent bugs resulting from race conditions?

Those of us who worked the computer help desk always got a chuckle out of folks that came in saying that they had found a bug in the FORTRAN compiler. That changed when I started developing embedded software. The tools there were (are?) miserable. What is worse is that the processors themselves were buggy. I remember tracking a task switching bug down to the read-interrupt-status instruction, which would give an unpredictable answer if an interrupt came in during the instruction. The vendor denied that this was an error and claimed that the proper way to read the interrupt status is to execute the instruction three times and average. I had similar problems with memory that would forget things. We bought memory cards from two different memory manufacturers. Neither could reliably refresh their own dynamic memory under high interrupt load. Tracking this stuff down requires very simple and reliable tools. Here is what I used ...

a source processor that could insert trap instructions before every program statement.
a trap handler that could trace, dump and disassemble.
a bus monitor that could detect trace output to selectable ports
an oscilloscope triggered by the bus monitor
a missing event detector (to hunt DRAM refresh bugs)

The first two of these are programs that I wrote myself. The rest is hardware that is easy to build out of a handful of parts, though I didn't do the building. I've often wondered how people find these sorts of bugs without this kind of equipment. I guess they just don't. -- WardCunningham
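For readers who have never used a statement-level trace, here is a rough software analogue of the first two tools, assuming nothing about the original embedded implementation. It uses Python's `sys.settrace` hook in place of inserted trap instructions. The names `trace_lines` and `clamp` are hypothetical.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args), logging each executed line as an offset from the def line."""
    log = []
    first = func.__code__.co_firstlineno

    def tracer(frame, event, arg):
        # Only follow the function under test; ignore everything it calls.
        if frame.f_code is func.__code__:
            if event == "line":
                log.append(frame.f_lineno - first)
            return tracer
        return None

    previous = sys.gettrace()  # don't clobber a debugger or coverage tool
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, log


def clamp(x):
    if x < 0:
        return 0
    return x


result, lines = trace_lines(clamp, -1)
print(result, lines)
```

The log tells you which statements actually ran, which is often enough to show that the path you assumed was taken never was.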

Inspiration:

When we realized that the "garbage" bytes we kept getting from RAM were exactly the same as the ROM bytes in the other bank of memory... Yes, we had a design problem in the memory bank switching hardware.
Unreliable machine that randomly rebooted: One *BANG* on the desk it sat on, which caused the reboot, and "aha! we have a loose connection." (Didn't take long to find a bad solder joint in the power supply.)

-- JeffGrigg

I once spent a week trying to determine why we couldn't get a video card to initialize properly. This was an old video card, with an unsupported beta OS/2 driver, so we expected a few problems. The manufacturer no longer supported either the hardware or the software, so we were on our own. We got the screen background to turn blue once in a while, but never got any graphics drawn. We tried every driver option, tried various combinations of initialization parameters, traced through some of the disassembled driver code, ...

Finally, on Friday, someone took the card out of the machine, blew the dust off, and put it back in. Everything worked fine after that.

I had a similar experience with an embedded traffic signal controller I was debugging. It wasn't behaving the way it was supposed to, and everyone in the company was stumped. This time I only wasted a day rather than a week--replacing the serial cable between my desktop machine and the controller made everything work.

This is the stuff they never cover in ComputerScience classes...

-- KrisJohnson

Step one: reseat the boards. Step two: reseat the boards. Step three: reseat the boards....

It works on my machine. There must be something wrong with yours.

Step one: Recompile (preferably from scratch). Test on your machine.
Step two: Give newly built version to user. Test on their machine.
Step three: Isolate the differences between your machine and their machine. In Java, for example, be particularly alert for things like system classpath settings, or libraries placed in the ext directory.
Step four: Having identified the differences, go back and fix your program so that the differences don't cause a problem.

A high percentage (arbitrarily set at 90%) of problems in this category will be "solved" by either step one or step two ("solved" meaning either you can reproduce the problem on your box, or the problem no longer reproduces on theirs).
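For the remaining cases, step three (isolating the differences) is easier if both machines describe themselves in the same format. The sketch below is one way to do that, not a standard tool: `snapshot`, `diff`, and the particular environment variables listed are assumptions chosen for illustration.

```python
import os
import platform
import sys


def snapshot(env_keys=("PATH", "PYTHONPATH", "CLASSPATH", "JAVA_HOME")):
    """Collect a few settings that commonly differ between machines."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "sys.path": list(sys.path),
        "env": {key: os.environ.get(key) for key in env_keys},
    }


def diff(mine, theirs):
    """Return {setting: (my value, their value)} for every setting that differs."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }


# Hypothetical snapshots from two machines, trimmed to two settings.
my_box = {"python": "3.11.4", "env": {"CLASSPATH": None}}
their_box = {"python": "3.11.4", "env": {"CLASSPATH": "/opt/lib/old.jar"}}
print(diff(my_box, their_box))
```

Run `snapshot()` on each machine, save the results (as JSON, say), and feed both to `diff`. Whatever comes back is the short list of suspects for step four.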

Step Zero: Reboot both your machine and their machine and try again before proceeding. Especially on Windows.