Bug From Hell

A BugFromHell is any bug where several hours or more of time is spent by a veteran developer attempting to track-down (and fix) the cause of a software bug. By definition, any bug that takes this long to find is almost always the result of a side-effect of the problematic code (otherwise, the problem would be readily visible via typical debug tools--e.g., stack trace, stepping through code in debug mode, etc). A BugFromHell is very elusive and is typically cannot be isolated or consistently reproduced.

The effects of a BugFromHell typically appear anywhere except near the problematic code. Such a bug will write to random part of memory, flip bits that aren't detected for a long period of running time, or appear to happen randomly without appearing to have been triggered by anything; or, worse, appear to be affected by the act of observing it (a HeisenBug).

Examples:


I've got my own BugFromHell - Copying database records to an array in memory, where logical FALSEs appear in the middle of said array. No logical fields are part of the record.


C code which depends upon the order of evaluation, left to right or right to left.

  x[i] = y[i++];
The example was hidden in a macro and the code worked on one compiler and not on another.


I find it interesting that most all of the examples given can't happen in Java. :)

Of course, you can still get a BFH. Not synchronizing concurrent access to a shared object. Not acquiring all pairs of locks in a deterministic order (can lead to deadlock). Not validating some preconditions. (Say a certain collection is of a type and implementation that tolerates nulls, but this particular one should never be null. Access to it should go through encapsulation, and the accessors should test parameters for null before adding them. One doesn't. Code A inevitably gets written that calls this with a null, which it dutifully stuffs into the collection, which naturally accepts it without question. Hours later, code B happens to draw the null out of the collection and dereferences it. Boom! NPE, with a stack trace pointing far from the real cause of the problem. Fortunately, proper encapsulation makes this not so hard to fix, because adding stuff to the collection is probably done through just one method, with proper factoring. Add the precondition test. Run again -- boom, NPE with a stack trace showing the attempt to insert the problem null. Now look for the cause of this null. Repeat as necessary.)


I have a lovely one, CeePlusPlus code which runs when compiled with the optimiser on but not when it is off!! My thoughts are fairly unprintable. Solved - but it was the optimiser which was wrong, making wrong code work. -- JohnFletcher

Back in my CeePlusPlus days, I encountered that on occasion. It was usually the result of exceeding array bounds or using an un-initialised pointer. What was the problem with your code?

The problem was the failure to include return *this; in an operator= member of a class. The optimizer supplied my missing intention. All I needed to do was to reach for my copy of AdvancedCeePlusPlusProgrammingStylesAndIdioms and look at page 41. I now have a better knowledge of my code.

Are you sure the optimizer was wrong? I think you may have just encountered UndefinedBehavior.

No, the compiler is bugged. Missing return statement of a non-int type, non-void type function is supposed to be a compile time error.


I now have another one (I do love this page for letting off steam) - I am running some tests on a program which is comparing calculations on the main computer with the same calculations on a GPGPU (GeneralPurposeGraphicsProcessUnits). Yesterday they ran and most cases do today but one set inconsistently returns NotaNumber where it should return a small number. The detail depends on what has run immediately before. I have now turned the computer off for a break - its not this one where I type onto C2. Time for a rest.

This has turned out to be quite a saga. It is not reproducible except that it goes away with a different choice of BLAS routines. More to come. This is turning into quite a saga. There is in fact more than one problem, which always makes for confusion. One is in an array of data which is acquired and then used. Setting it to zero first makes a difference. Secondly it turns out that GotoBlas has bugs on the particular CPU in use and needs to be compiled for a different one. Mosts things now work but there are still some unexplained inconsistencies. Work in progress. Sigh. -- JohnFletcher


I know you guys get tired of hearing this, but in my entire career I have not once ever encountered a bug that would have remained hidden from a serious application of formal inspection. I have found bugs that could never have been encountered in execution (my own code has contained such), but never have I run into something that would have survived inspection of either the design or the code.

Testing can only prove the presence of features, not the absence of bugs.

-- MartySchrader

You're a fookin' hero, Marty. Remind me to put you in charge of everything. With such perfect hindsight, I bet you can fix all of yesterday, today.

Well, thanks, AnonymousCoward. I would prefer not to be in charge of everything, but if you expect yer tests to bail you out of trouble then I guess I'll have to step up. [Aside, to self] Weenie.


A good one: I was working under HP-UX and was porting CeePlusPlus under Solaris. While all programs were tested OK under HP-UX (debug and release) and there was a recurring core dump under Solaris under certain conditions only (second build !). The stack trace was pointing inside the STL in some strange template. After investigating the code and finding nothing, the fact was that Solaris CeePlusPlus compiler was generating temporary code from the STL templates instantiation (called template DB) in a hidden directory and was not reinitialized when rebuilding. After deleting the template DB, all was OK. I adapted the make files accordingly.

This bug took me two hours to find and an application team were blocked for more than one week on that problem. -- OlivierRey


Mine was 3+ hours when I was using a scripting dialect of Basic that did COM interop with a C dll. Basic has a true value of 1 and C has -1 (both use 0 for false). My basic subroutine of toggling the flag was 1 - result, which works in basic for true/false but not in C.

I don't suppose you could have done actual comparisons to "true" and "false," eh?

[Comparisons to "false" might have helped. But C (pre C99) is allowed to use any non-zero value for true. It doesn't even have to be consistant, so comparisons to "true" are useless.]

Good grief. If a project can't be moved to a compiler created in this millennium then I suggest you bail, and post haste. I had one such project back in about 1994 where the product was built on a Latice C compiler from 1988. The multi-megabyte mess couldn't compile under a modern compiler because of thousands of various coding standards violations. After spending a couple of weeks trying to get the code cleaned up sufficiently to compile under Borland I gave up. Oh, well. Luckily, they dumped me soon thereafter. Good riddance to bad rubbish.

If a tool prevents you from using what you know to be good practice and forces you into a compromising position with respect to code robustness or even correctness then the tool needs to go. If you can't convince the client (or your boss) of that then you need to go. Otherwise, there is a gotcha awaiting you down the road, and no amount of "I told you so" will get you out of that bear trap once you've stepped in it.

{Amen to that!}

[I agree with the part about ditching tools that get in the way. I disagree with the part about when the compiler was created. In 2005, we were still supporting an application written in a variant of BASIC from the late 70s. It met the client's needs and wasn't any harder to maintain than the applications we supported written for more modern languages.]

//Indeed. In the early 2000s, we were still occasionally using a C compiler written in the late 1980s because it was the only one to support both of the platforms upon which an older, but still supported, version of our product ran. Would it have been reasonable to re-write a legacy product solely so we could use a newer compiler? Of course not.//

And would your early-2000s product have passed lint? Would it have caused a modern day compiler to RunAwayScreaming? If so, then perhaps you are defending the wrong issue. Products that have inherent problems requiring the use of some tool or other that will ignore the improprieties are not to be tolerated. These sorts of products are destined to byte you in the buttocks, you betcha. Hiding bugs is not the same as squashing bugs.

//Yes. It did pass lint. By the way, it wasn't an early-2000s product as such. It was an early 90s product that was still being maintained as of the early-2000s.//

Okay. Can we stop discussing ancient history and get to modern times? If a currently-supported product has a history of bugs and can only be built under a very narrow set of tool and OS conditions then isn't this a clear sign that the product has built-in problems? Is it not our duty to recommend to the client that he think about a code audit to see where some of these potential problems may be hiding? Perhaps we can recommend some inexpensive tools for regression testing and such. All of this presumes the client is completely aghast at the idea of doing formal inspections, of course. Heh.


FebruaryEleven

CategoryBug


EditText of this page (last edited February 6, 2011) or FindPage with title or text search