Heroic Debugging

See also HeroicProgramming.

Also see SpecializationIsForInsects. Before reading that page and its discussion, ponder how many of the stories below involve collaborating experts.

Heroic Debugging is sometimes caused by HeisenBugs.

WarStories:


I remember this one time when one of our SDS 940s was crashing intermittently. Other systems running the same software were working fine, so even the hardware guys were convinced it was hardware. They flew Paul, our best hardware guy, Steve, our best operating system guy, and me out to New Jersey to fix it.

Steve quickly found a fairly repeatable way to make it happen. Paul put probes on the computer and we got to debugging. The system would crash, Steve would tell us what was in the registers and what that meant. I would explain to Paul what Steve meant, and Paul would tell me what the transistors were doing. I would suggest an experiment to Steve, or he would think of one. We zeroed in.

Middle of the night, we found it - Paul found what he called a "slow" diode. Replaced it, system wouldn't crash any more.

We realized that no one or two of us could have done it, we needed all three roles. It was a great night, except that we had to stay in New Jersey until the next day.

-- RonJeffries


One of our server programs was leaking memory. It runs continuously, 24 hours a day, so that's bad. This was a COBOL program with Embedded SQL accessing an Oracle database and making some use of Unix System V message queues. We figured it was probably the Oracle run-time library, so top management decided to quickly, neatly, and easily solve the problem by having a DBA open an incident with Oracle and demand that they fix the problem for us. I asked for permission to track down the problem but was repeatedly denied this "because it's Oracle's problem."

Oracle didn't make much progress with our thousands of lines of COBOL code, database schema, megabytes of data, and message queue management code. They insisted that the memory leak must be in our code, and that the Oracle code "doesn't have any bugs in it." (In case you missed it, both statements in the previous sentence are ridiculous nonsense: Memory leak in COBOL code? There's no dynamic memory management in the language. No bugs in a product the size of an RDBMS? HA!) This was compounded by a careless statement by one of their technicians that "you must be doing something stupid" [...for the customer's program to have this bug and be leaking memory].

After two weeks of stalemate between the two companies, I convinced the people on our side to let me track down the problem to produce a simple test case.

I used conventional methods (the BinaryChop):

  1. Make a test case to demonstrate the problem: I used "ps" to monitor the memory pages of the process. (A minimal sketch of this kind of harness follows the list.)
  2. Divide and conquer: Disable or delete parts of the program until the problem goes away, then put parts back and eliminate others. This narrows down the problem to smaller parts of the program until the problem is literally staring you in the face.
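
Here is a minimal sketch of step 1, in C++ rather than the original COBOL: a harness that repeats the suspect operation and watches its own memory grow. The operation below is a deliberate stand-in leak, and reading /proc/self/statm is Linux-specific - the story watched the same numbers from outside the process with "ps".

  #include <cstdio>
  #include <fstream>

  // Hypothetical stand-in for the operation under suspicion; it leaks on
  // purpose so the harness has something to demonstrate.
  static void suspectOperation()
  {
      new char[1024];   // intentionally never freed
  }

  // Resident pages of this process, read from /proc/self/statm (Linux).
  static long residentPages()
  {
      std::ifstream statm("/proc/self/statm");
      long size = 0, resident = 0;
      statm >> size >> resident;
      return resident;
  }

  int main()
  {
      for (int i = 0; i <= 100000; ++i) {
          suspectOperation();
          if (i % 10000 == 0)
              std::printf("iteration %6d  resident pages %ld\n", i, residentPages());
      }
      return 0;   // steadily growing numbers = a reproducible leak
  }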

The Oracle runtime was leaking memory whenever VARCHAR2 columns were read into character variables using an explicitly defined cursor. (A surprising problem, as this is common usage.) I sent the one-page COBOL program to Oracle, and they quickly verified that it was, indeed, a problem in their run-time library. (And the next infuriating revelation was that it was a known bug: they knew about it before we encountered the problem.)

Lessons Learned: Tracking down problems in 3rd party products is a cooperative venture: Customer and vendor must work together to narrow the complaint down to a simple, easily reproducible test case. The customer must be willing to put work into demonstrating the problem. The vendor must be honest about known bugs in its product. -- JeffGrigg


Third-party products can be a big problem. I remember having a problem at a customer site in Germany with a voice-over-IP product I was the chief engineer of. The specialized Ethernet card would just drop off the network once every two or three weeks until it was re-loaded.

We couldn't reproduce the problem at all in our labs and the vendor claimed that the problem was with the customer's network. We sent them a memory dump of the board and they said the customer's network was "too busy for their Ethernet card to handle" (looking at their stats, it didn't seem any more loaded than our Ethernet which didn't produce the problem).

Finally, I had a conference with the vendor - thinking it might be a hardware problem, I offered to help fix their problem if they would send me schematics and specs for the board. They said they didn't have them because a third-party company (to them) did the hardware design! Aieeee!

The board never worked right and eventually failed at other customer sites as well. The product is now being redesigned with completely new hardware.

Lessons learned: Never base your product on a third party unless you can get support from them in writing. Even so, be wary of support tactics that blame everything but their own product.

-- MarkGrosberg


(debugging a HeisenBug)

Tracked down a problem in a 3rd party geometry library that had trouble determining whether a point was within a polygon. The problem occurred when the code was run full-speed, but the result was correct when I single-stepped through the code. Experimentation with running full speed to breakpoints set in varying locations helped me to narrow down a rough location of the problem. (...another BinaryChop.) Then I found that if I set the breakpoint inside one of the important functions it was calling, the problem happened every time. Careful examination of this function revealed that in a commonly occurring case it failed to initialize one of its variables, but always used it. The problem was hard to debug because the previous function being called happened to always put a non-zero value in the stack location that the uninitialized variable used, while the execution of debugger breakpoints happened to always put a zero (the correct value) into this stack location. I fixed the code locally, and reported the problem to the vendor.
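
A tiny hypothetical C++ illustration of that class of bug (not the vendor's code): the result is assigned only on the rare path, so on the common path the caller gets whatever happens to be in the stack slot - often a value that looks right, which is exactly what makes it a HeisenBug.

  // Hypothetical sketch of the failure pattern, not the vendor's code.
  // What the caller sees is undefined behavior: the leftover contents of
  // whatever stack slot "inside" happens to occupy.
  bool pointLooksInside(double crossings)
  {
      bool inside;                // BUG: never initialized
      if (crossings < 0.0)
          inside = false;         // only the rare error path assigns it
      return inside;              // common path returns stack garbage
  }

  // The local fix the story describes amounts to one line:
  //     bool inside = false;     // initialize before any possible use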

(A few additional details on this story can be found on the ContradictionIndicatesFalseFact page. -- jg)

Lessons Learned: An uninitialized variable that usually happens to pick up a "correct" value is a bug waiting for the worst possible moment to show itself.

(If you're still not sure, let's talk about AT&T's loss of 1-800 service for a day due to "break;"-ing out of a loop...) -- JeffGrigg


A relational database stored procedure worked fine in development, but "crashed" in production (causing loss of that user's connection). We could not reproduce the problem in the development environment. Tested (very carefully) in production. The production version used a different execution plan, due to there being more data in the tables, and this (along with some other rather unusual factors unique to the way we were using stored procedure parameters) was triggering a rather obscure bug in the database code.

Lessons Learned:

  1. If you're pushing the vendor's product beyond the bounds of common usage, you will probably find bugs in the product.
  2. Size matters: To properly test a relational database system, you must run at least some tests against a test database with at least as much data as you expect to have in production. Less data changes execution plans, which can lead to dramatic changes in performance, and possibly functionality. -- JeffGrigg

Seconded. Query optimizers are influenced by table sizes and other factors such as key distributions. When the DB is part of a chain that involves components that might time out (e.g., web servers), performance problems can quickly turn into functionality problems. The added time and expense it takes to simulate a customer environment, including data mix, can pay off when compared to the cost of flying people across the country to dampen a fire at a customer site. -- DaveSmith


BinaryChop was the only technique available in my most HeroicDebugging performance. In a big C program (SNMP agent embedded in an OO language interpreter), I was getting a CoreDump. I added a printf, and it dumped in a different place. I added another printf, and it didn't dump at all. So then the parameters were clear - I had to debug using only one printf. So, for two weeks we moved that printf backwards and forwards through the code. If the printf happened, the bug was after the printf. If the printf didn't happen, the bug was before the printf. Finally we got down to a suspicious looking routine in the ASN1 parser which was overrunning a 60 character buffer. We extended it to be 256 characters, so someday someone will have that problem again :-). -- JohnFarrell
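
The underlying defect is a classic one - an unchecked copy into a fixed-size buffer. A hypothetical sketch of the pattern and a bounds-checked version (the real ASN1 parser's names are not these; and as the story admits, the actual "fix" was just a bigger buffer):

  #include <cstring>

  // Hypothetical sketch of the defect class: copying an incoming string
  // into a fixed 60-character buffer with no length check.
  void storeFieldUnsafe(char *dest60, const char *src)
  {
      std::strcpy(dest60, src);              // overruns once src exceeds 59 characters
  }

  // Bounded version: truncates instead of trampling the stack or heap.
  void storeFieldSafe(char *dest, std::size_t destSize, const char *src)
  {
      std::strncpy(dest, src, destSize - 1);
      dest[destSize - 1] = '\0';             // strncpy does not always terminate
  }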


There's a true HeroicDebugging story in TarongaZooStory.

Lesson learned: Let sleeping lions lie.


I was working on a data collection system designed to record digital data collected from jet engines. We had installed more than 20 of these systems in our various test cells. Since the system was critical for collecting data, there was a watchdog timer monitoring it that would ring an alarm whenever a particular output bit stopped changing for more than a quarter of a second.

One day we got a call from one of our test cells complaining that the stupid watchdog timer kept going off. Sure enough, they were right. Every 5 minutes exactly the system would halt for about half a second, the alarm would sound, and then the system would continue running.

What was happening every 5 minutes? The data system ran on a cycle of 1 second. Nothing was scheduled at the minute level, so we were mystified about the cause of the alarm. So, we shut down the program and started it back up. The alarm still occurred every 5 minutes. In fact, it was occurring on the original 5 minute schedule. "Must be some system software that's causing the problem," we thought. So we rebooted the system. After a cold reboot, the system came back online with the watchdog alarm still on the original 5 minute schedule! Finally, we powered down the system, let it cool, and then let it power back up. We still had an alarm problem, but the 5 minute cycle had shifted in time. It was a hardware problem (the rallying cry of software folk everywhere).

So the question remained: what piece of hardware was causing the problem? After verifying that the problem system had the same components configured in the same way as a working system, we brought in the big guns: an HP Logic Analyzer. It had an octopus of probes that we could connect to any electrical signal in our system and capture the signal levels before and after a triggering event. We started by monitoring the watchdog bit and discovered that every 5 minutes it would stop toggling for 500 milliseconds, and then continue as if nothing happened. I'll spare you the hours I spent in front of that analyzer, but I finally narrowed it down to the SCSI bus. Every 5 minutes, the SCSI bus would stop for 500 milliseconds and hold off the CPU board using it. Since the CPU using the SCSI board was the same board that sent out our 10 ms timing ticks, the whole system stopped, waiting for more timing information.

A quick call to the disk drive manufacturer turned up the fact that the latest shipment of disk drives we received had "thermal compensation logic" built into them. Every 5 minutes, the disk drive would run its thermal compensation logic for 500 milliseconds to recalibrate due to temperature changes. During that time, it refused to respond to SCSI commands.

Lesson Learned: You need the right tools for debugging. Without the logic analyzer, we would have had difficulty pinpointing the source of the problem. -- JimWeirich


Years ago, we were due to buy a new parallel computer: a Meiko CS-2, remember them? But it would be several months before it arrived, and I wanted to start writing my code, which was a parallel discrete event simulation system based on Time Warp. I decided to write a simulation of my simulation. (As you can imagine, finding good names got complicated.) It was important to me that I keep track of the amount of time spent in various parts of the code, so I wrote a class to keep timing statistics. Anyway, as is normal I had a bug in which my program would crash. I used BinaryChop to try to locate the bug, but every time I put in a print statement the program would seem to crash in a different, earlier, place. I remember commenting to my wife that this bug was driving me wild; it seemed to be related to the real time that the program had been running, and was therefore unrelated to how far through the logic the program had run. After several days pulling out my hair, I eventually tracked down the bug to an arithmetic overflow in my timing statistics class. I was recursively updating both the mean and sum-of-squares, so that I could report time +- std-error. Of course, a cumulative sum-of-squares very quickly overflows a float, which generated a hardware exception that I hadn't caught. When I realized that I had spotted the essential clue very early on in the debugging process, but had not been able to turn data into information, I went puce with rage that I could only direct at myself.
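
For the curious, a minimal sketch of the arithmetic that bites here and the usual way around it (class and member names are mine, not the original code): a raw running sum of squares grows without bound and quickly overflows single precision, while Welford's recurrence keeps every quantity near the scale of the data.

  #include <cmath>

  // Illustrative only; not the original timing-statistics class.
  class TimingStats
  {
  public:
      void add(double x)
      {
          ++n;
          sumSq += x * x;              // naive term: grows without bound, overflows a float
          double delta = x - mean;     // Welford's update: values stay near the data's scale
          mean += delta / n;
          m2 += delta * (x - mean);
      }
      double stdError() const          // reported as time +- stdError
      {
          return n > 1 ? std::sqrt(m2 / (n - 1) / n) : 0.0;
      }
  private:
      long   n     = 0;
      double sumSq = 0.0;              // kept only to show the overflow-prone quantity
      double mean  = 0.0;
      double m2    = 0.0;              // running sum of squared deviations from the mean
  };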

Lesson Learned: don't ignore data, even if it seems irrelevant. Any potential explanation for the problem you have should be able to explain all the data. -- ChrisBooth


Once upon a time, when I was a mere stripling, I interned for a couple summers at a telecoms equipment firm. We made digital voice/data multiplexers (in fact, they still do: http://www.nuera.com). What this meant, for the purposes of this story, was two things: Our boxes had a main network interface where you'd plug in a big fat cable; and all the programming for all the chips in the box was done basically right down there on the bare metal, all hand-coded embedded assembler, no OS, nothing - just interrupts and I/O.

We had this bug once. It seemed that every so very infrequently, like every 50 or 100 times, when you'd unplug and reconnect the main network cable, the box would crash and reboot. It was the ultimate HeisenBug - it never occurred the same way twice and was never consistently reproducible.

Jerry found the bug - sitting there for hours, watching the stack trace on an HP protocol analyzer, plugging and unplugging the damn cable. See, what happened was that he'd ported his code from a TI TMS320C25 to a TMS320C50, and when TI put out the 'C50, its interrupt vector table was in a very slightly different location than on the 'C25. So when this particular interrupt would come in to re-establish the main network connection, it just happened to be the first interrupt in the IVT. So control would jump to this slightly wrong place, just before the start of the IVT, and start executing. And just a few bytes later, what would it find? It would find the correct branch instruction, branch to the correct ISR, and keep right on going like nothing had happened, and no-one would be the wiser. And after a few hundred rounds of executing a few spurious bytes before entering the ISR, it would have corrupted enough memory to bring the system down. Jerry said that he found it because the hex dump of the disassembly looked a bit familiar - he changed it to ASCII display, and it was executing our copyright notice from the start of the ROMs.

Lessons Learned: The same lesson as above about having the right tools - this time the HP protocol analyzer. Also, that Jerry Riner is a friggin' genius.


This isn't a HeisenBug story, but a regular bug story, no hardware involved. There seems to be a difference in what is to be learned when the problem is purely software... Or else I'm missing something.

We were working on a 2D geometry library. One problem the library was aimed at was to partition an arbitrary line segment into sub-segments such that each sub-segment would lie within exactly one polygon. The polygons could be nested, but would never intersect themselves or each other. My first attempt was a mess, but I learned a lot and came up with this object that would traverse the given line and, at appropriate intervals, look up where it was in a precomputed graph of polygon containment relationships.

This was some years ago. Today, I'd attack this problem with Scheme and set theory. Then, it was C++ and brute force.

We had a testing framework. (I learned about the importance of automated testing frameworks from RobWhittick, who carries one around inside his head, in Pascal, and re-writes it at each office he moves to, in the current language.) You could give your objects a graphical representation, and the harness would draw polygons and lines and such as the computation proceeded, as well as checking results. It was a cool thing.

I'd covered two sheets of A3 with maths to derive my (I believed) complete set of possibilities for a pair of line segments meeting at a polygon edge, which became the test cases.

Most of the time it was okay, but sometimes it would fail, mis-assigning a segment and all the segments thereafter.

The partition would fail in certain exotic circumstances, mostly to do with approaching a concave vertex along an edge. After a couple of long, long days fiddling with the class hierarchy that represented the intersections, I became desperate. A colleague, Warren, a well-known genius debugger, came in to help me. After another full day and full night (through to 6am) we had got down to two sets of cases, either of which we could solve, but not both. There were two or three cases out of dozens that didn't sit in the intersection.

Rob took me to one side and said, very kindly, "You remind me of someone I knew once, who carried on beating his head against a tricky problem beyond the point of sense. Leave this, and go do something else. I'll look at it."

Rob re-wrote my object to traverse the given line segment and hold a stack, pushing a polygon whenever an edge was passed going inwards, popping in the converse case. His solution covered both sets of cases, allowed us to dispense with the hierarchy of intersections and the pre-computed graph, and was twenty percent slower than my solution.
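
A sketch of Rob's idea under my own assumed data structures (the original was production C++ against a real geometry library): walk the edge crossings in order along the segment, push a polygon on entering it and pop on leaving; between crossings, the containing polygon is whatever sits on top of the stack. The segment is assumed to start outside all polygons, and the polygons are nested and non-intersecting, as in the story.

  #include <vector>

  // Hypothetical input: crossings already computed and sorted by their
  // parameter t along the segment (0..1), each tagged with the polygon
  // whose edge is crossed and whether the segment is entering it.
  struct Crossing   { double t; int polygon; bool entering; };
  struct SubSegment { double t0, t1; int polygon; };   // polygon == -1 means "outside all"

  std::vector<SubSegment> partition(const std::vector<Crossing>& crossings)
  {
      std::vector<SubSegment> result;
      std::vector<int> stack;                  // innermost containing polygon on top
      double t0 = 0.0;
      for (const Crossing& c : crossings) {
          int current = stack.empty() ? -1 : stack.back();
          if (c.t > t0)
              result.push_back({t0, c.t, current});
          if (c.entering)
              stack.push_back(c.polygon);      // passed an edge going inwards
          else
              stack.pop_back();                // passed the same polygon's edge going outwards
          t0 = c.t;
      }
      result.push_back({t0, 1.0, stack.empty() ? -1 : stack.back()});
      return result;
  }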

Lessons Learned: 1) Heroic debugging is, more often than not, stupid. So DontBeaHero. 2) Everyone has to learn that for themselves, but may need prompting to notice that they have learned it. 3) Working around the clock to fix something, and failing, is a good indication that you are trying to fix the wrong thing. -- KeithBraithwaite


Ran into a mysterious problem with a Linotype 606 phototypesetter I was setting up at a NZ newspaper. Every once in a while, a short section of the printed text would come out as complete garbage. Always in a different place, different text, and seemingly unpredictable.

The system worked by reading font descriptions from disk into a memory buffer that drove the video. Each printed character instance consisted of a variable amount of code in the buffer at a certain address. It seemed likely that the starting addresses of these characters were being mixed up somehow, which would result in garbage, but I checked the code thoroughly and couldn't find a problem.

I noticed there was a special routine to handle disk errors, and while it appeared to be almost trivial and obviously bug-free, for some reason I decided to call it for every 10,000th character typeset, to simulate a disk problem and AskTheCode if this was the cause of the problem. Sure enough, there was a block of garbage text every few pages of output. It turns out the way the buffer was filled after a disk error omitted the first step of writing the code for a blank scan line for reverse video (background for white text on dark ground) - a clear violation of OnceAndOnlyOnce, since there were two versions of the code to fill the buffer. This OffByOne error caused all the characters in the buffer to be at the wrong address, so the video printed garbage for the whole buffer.
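
The "call the error handler on purpose" trick generalizes well. A hypothetical C++ sketch of that kind of fault-injection hook (the typesetter's routines were of course nothing like these stand-ins):

  #include <cstdio>

  // Hypothetical stand-ins for the real disk read and its recovery routine.
  static bool readBlockFromDisk(char* /*buf*/)  { return true; }
  static void handleDiskError(char* /*buf*/)    { std::printf("recovery path exercised\n"); }

  // Wrapper: every Nth call, pretend the operation failed so the
  // rarely-executed recovery code gets exercised deliberately.
  bool readBlock(char* buf)
  {
      static long calls = 0;
      const long injectEvery = 10000;       // mirrors the story's "every 10,000th character"
      if (++calls % injectEvery == 0) {
          handleDiskError(buf);
          return false;
      }
      return readBlockFromDisk(buf);
  }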

Lessons Learned: DeliberateErrors? can be useful in debugging hard-to-reproduce problems. Also related to DefectSeeding.


We were working on a portable Java VM (it was a big wad of Smalltalk code that generated and optimized assembly... some people here might be able to guess which company this was ;). Every night, we did automated builds to many different platforms and ran automated tests against them. My job at this time was often to try to fix things that mysteriously broke on obscure platforms as a consequence of ongoing development (i.e. changes would only be manually tested on a major platform or two).

So, we had a JIT compiler for the Windows/x86 platform. And one day the team decided it would be nice to have a JIT for the Linux/x86 platform too. So they hacked it around and suddenly we did, but few people knew that it existed and we didn't pay much attention to it for a while.

Then, within a space of a couple of days:

  1. Outside people started to care that this Linux JIT existed, and
  2. We realized that it was horribly broken and had been for some time!

So for at least two days I tried everything I could think of to expose the problem or pin it down to a particular part of the JIT. It resisted my efforts - when I ran a test, the whole JIT would promptly dissolve in a ball of flames, but often not in the same place, etc.

Finally, out of sheer desperation this is what I did: I set up the same test instance on a remote Linux machine and my local Windows machine, and started both VMs under the respective native debuggers. In theory, the generated assembly code for both platforms was "the same". I single-stepped through the machine code one instruction at a time. After a grueling hour or two, I was suddenly staring at two very different versions of a function prologue: one that popped its arguments off the stack and another that left them on the stack to be removed by the callee. It was in the interface between JIT-generated code and static assembly code, so the rest of the VM code worked fine. I showed it to a GrandMasterProgrammer and 30 seconds later he had found and fixed the bug.

So what was it? A little base class method had been accidentally copied into a subclass, and sat there unnoticed for a long time. When another programmer came along and modified the method as part of a calling convention optimization for several platforms, one part of this optimization was not performed on the platform to which the subclass belonged (you guessed it: Linux/x86). The bug was not really anybody's fault; our Smalltalk environment kind of hid the bogus method from us, and at the time the changes in question were made, problems with other platforms had much higher priority than problems with the fledgling Linux JIT. It had been failing its nightly testing for many days (weeks?) and there was other small rot besides what I described here. IMHO in the VM space it is very difficult to have sufficient test coverage of everything, and we constantly tolerated a number of failing tests on 'lesser' platforms. We had more platforms than people and it was simply not practical to make the tests pass 100%. But that meant that it was often a lot of work to localize the cause of a failed test.

Lessons Learned: Test suites that don't pass 100% most of the time are dangerous, because "old" problems might be ignored. And "old" problems can mask new problems (WrongTestGranularity?). And "old" problems might be anywhere, instead of just in recently changed code. -- WylieGarvin


I wrote an operating system for multiple Z80 microprocessors on an IEEE Multibus. When we started writing interrupt-driven applications, we found that the systems would lock up, with interrupts off, every few hours. Since I'd written the kernel, I could add diagnostic code that would catch the problem, though I can't remember now what that was. The problem turned out to be the read interrupt status instruction, which would return the wrong value under obscure circumstances that would probably never occur except in a multiprocessor. Clearly the vendor had never tested this. They claimed that the part was not in error and that the proper way to use the read interrupt status instruction was to read three times and average. Incidentally, the OS communicated with messages that looked a lot like XML without the angle brackets. Addresses went out as decimal strings, which avoided any byte sex problems when we started adding other kinds of processors. -- WardCunningham

Back in the 1980s, when I was multi-tasking a Z80, I spotted this problem and managed to pin it down with a logic analyser and tie it to a particular machine cycle - I can't remember which one off hand; it was somewhere around 17, I think (not many instructions had that many cycles!). If the status was being read and an interrupt occurred during this cycle, then the read interrupt status flag would give the wrong value. Using a TTL counter to identify cycle 17 (reset on every instruction fetch) and disallowing interrupts by gating the interrupt line during this cycle eliminated the problem, with no need to average status flags. -- SteveWebb?


I was part of a startup using a very early (1983) Masscomp Motorola 68K machine running Unix. We were pushing the machine hard (for the time), running batched data analysis programs that typically ran overnight. Some mornings, we arrived to find that the machine had crashed - not just the program, but the entire machine - sometime during the night. All of us (young turks at the time) thrashed at the problem for weeks, without success. Phil, the old-timer (he was probably about 50), asked us to write code that would print *all* of the machine's dynamic state onto fanfold when the crash occurred (we were already getting coredumps). We all thought he was nuts - this was, literally, a six-inch brick of fanfold, packed with line after line of hex (we did divide the hex into 8-char fields, as I recall). We arrived the morning after handing Phil his requested package (he worked at night) to find a specific page on each of our keyboards, with a specific longword highlighted - and a handwritten note stating "Here's the bug".

It turned out that there was a bug in the Masscomp assembler, so that on a specific set of machine instructions, the condition code bits were not correctly tested. The assembler listing was fine - it was the executable code that it emitted that was busted. If the essentially random bits happened to produce the correct result (which they usually did), everything worked. Sometimes - about once a night - they produced an incorrect result and crashed the system.

The only way to identify the problem was to compare what the executable should have been with what the executable was. That's what Phil did, after poring over all those fanfolds.

-- TomStambaugh


There is a bug in Apache Xerces C++ 2.3.0. When dealing with a passed-in character string, there was a line similar to this:

  while ( isValidCharacter( *pch ) && pch < endPch ) 
  {
     ++pch;
  }
where pch is a pointer to the current character and endPch is a pointer just past the last valid character.

pch would be dereferenced when it was equal to endPch, and thus invalid.

99% of the time this would be OK, but if the allocated array ended right at the boundary of a memory region, it would result in a segmentation violation.
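
The fix is simply to test the bound before the dereference - a sketch of the corrected condition order, not the actual Xerces patch:

  while ( pch < endPch && isValidCharacter( *pch ) ) 
  {
     ++pch;
  }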

Thank God for Purify, which let us locate the ABR (array bounds read) in the Xerces code.


I once was called in to help a colleague who had come to the end of his tether. His software controlled a large science museum exhibit with pneumatics, fans, coloured balls, relays, etc. It had worked perfectly in test back home, but now on site every few seconds the relays would drop out. We wrote test programs to check the relay commands; no joy. After he'd been on site a week, we noticed that when we turned off an unassociated air blower on the other side of the room the problem occurred less often (failure rate dropped from seconds to minutes). We then noticed that the relay control cables were in the same conduit as the video and the power cables. Pulling them out decreased the failure rate again. In the end, I modified the software to set the relays to the correct values each time round the control loop rather than relying on them to stay there.
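
A sketch of that defensive fix with hypothetical names (writeRelays and computeDesiredRelays stand in for whatever the exhibit's I/O layer provided): rather than writing the relay word only when it changes, re-assert the desired state on every pass, so a noise-induced dropout is corrected within one loop cycle.

  #include <chrono>
  #include <cstdint>
  #include <thread>

  // Hypothetical stubs standing in for the exhibit's hardware interface.
  static void          writeRelays(std::uint16_t bits)  { (void)bits; }
  static std::uint16_t computeDesiredRelays()           { return 0x00FF; }

  int main()
  {
      for (;;) {
          std::uint16_t desired = computeDesiredRelays();
          writeRelays(desired);   // re-assert every pass, even if "nothing changed":
                                  // an EMI-induced dropout lasts at most one cycle
          std::this_thread::sleep_for(std::chrono::milliseconds(50));
      }
  }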

Lessons Learned: Make sure the test environment really is the same as in production. Take notice of anything that makes a difference, even if it "can't possibly". Program defensively.


Nah, that's not a bug. None of these are bugs. This is a bug: I spent 4 weeks on a bug in a C++ module in what became the Foxboro I/A Series SCADA product. Using the Centerline 2.1 C++ compiler. 4 weeks of my life.

And, yes, thank you, I knew what BinaryChop was. I chopped and hacked and dumped and checked every damn thing you or anyone else I knew could think of to do. Zippo. I knew it had something to do with linking in this one module. Nothing complicated inside. Not badly coded for them days. Nothing fishy in the link itself - I ran experiments until my fingertips bled. Not a side effect - I've got it in a single .o. It is this module all right.

But no matter what, it wasn't in this half. And it wasn't in that half. Real problem was, I wouldn't believe that. It's in this module. I mean, let's just say now it really was. But it's not in this half, and it's not in that half. No matter what combination of chops, just when I think I have it pinned down the bastard just evaporates.

Centerline support can't find it either. Maybe needless to say, 4 weeks creates a lot of time pressure when it's blocking a major release. I'm the most experienced hand in the joint, and no one else has any idea. Despite all appearances of cool and calm I project in all directions that count, I've died and gone to H-E-double-hockey-sticks.

One day I get a dumb idea. I'd like to say I had deduced the only possible location of the bug by the traditional Holmesian method. But you stop reading a moment and think. You have a bug. Binary chop proves it's not in the contents of the file. You know it's not in any other file. So where is it?

Close your eyes and do your best Holmes impression on that a moment. If you eliminate the impossible ...

Got it? Well, try your luck: it was the filename. It turned out the Centerline 2.1 name mangler had a weakness. If you gave it a file with a name exactly 13 characters long - not <=12, not >=14, but 13 characters exactly - then it produced ambiguous symbols. And if you had enough code in that file - any old combination of code lines you can think of - kerblooey! - one of a hundred different kinds of stack corruption. And when you try to sneak up carefully on one to nab it in the act - poof! - gone like spit on a summer sidewalk.

That, my friends, is a bug.

Lessons Learned: When BinaryChop insists the bug is in none of the code you can see, start questioning everything else - your tools have bugs too.


What was your heroic debugging experience?


CategoryDebugging

