Extreme Programming Challenge Fourteen Split

w.r.t. this CodeReview not finding the bug, the bug in question was originally found by code review. (More precisely, it was found as a side-effect of a failure to prove the correctness of what was believed to be correct code, after its initial testing.) I review code for DougLea. I find bugs in his code. It's the hardest work I do. Usually, they are bugs for which I do not see how to construct a test. -- TomCargill

No question. And I suspect that if this is a bug for which you don't see how to construct a test, this author won't see how either - you've done more of it more recently than I. OTOH, I really want to know what the bug is ... because I sure don't see it! -- rj

Relatedly, I'd love to see a correctness proof for this one (or the corrected one). That'd be cool! -- rj


Since Don seems to be busy and Ron has only finite patience, I've described the bug in ExtremeProgrammingChallengeFourteenTheBug.

The concrete challenge is to create a test that exposes the bug.

The abstract challenge is to explain how concurrent code can be refactored at an extreme rate.

-- TomCargill


Yes, I have been busy, but not too busy to work on a good challenge in my spare time. Take a look at JavaUnitTestChallengeSolved. -- DonWells


I see what you are doing. I don't see how your approach makes enough progress against the problems of state space explosion and unpredictable context switching.

I did discover ReproducibleTestingOfMonitors, a 1978 paper by PerBrinchHansen, though it doesn't show me how to create a test that exposes the bug in the challenge code.

-- TomCargill


When writing UnitTests, you try to cover as many of the ways things can go wrong as you can think of. You test the entire public interface of an object. You test some of the interactions between groups of objects. And then you let the AcceptanceTests do the rest. And when the AcceptanceTests, or someone else on your team, find a bug, you add another test or two. You do the best job you can initially, but you stay open to the possibility that you may find more bugs later. A suite of UnitTests needs to mature over time like a fine wine.

In this instance, I don't know that I would address the entire set of possibilities. I suspect that there are diminishing returns on time invested. A few well-chosen examples will suffice. I would test normal operation and boundary conditions. It also helps to keep things very simple to avoid the explosion in possible tests to write. KentBeck once advised me to write about 2 lines of test code for every line of actual code. (Ron, is 2 to 1 what Kent says these days?) You can test a whole lot of stuff with double the code, but you can't test code that is too complex or has not been designed to be testable.
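[As an aside: a minimal sketch of what "test normal operation and boundary conditions" might look like for code of this shape, assuming a hypothetical BoundedBuffer class with a fixed capacity and blocking put() and take(). The class name and signatures are assumptions, not the challenge code.]

 // Two illustrative tests: normal operation (what goes in comes out) and one
 // boundary (take() on an empty buffer must block). BoundedBuffer is assumed.
 import junit.framework.TestCase;

 public class BoundedBufferBasicTest extends TestCase {

     public void testPutThenTakeReturnsSameObject() throws InterruptedException {
         BoundedBuffer buffer = new BoundedBuffer(1);   // capacity 1 is itself a boundary
         Object item = new Object();
         buffer.put(item);
         assertSame(item, buffer.take());
     }

     public void testTakeBlocksWhenBufferIsEmpty() throws InterruptedException {
         final BoundedBuffer buffer = new BoundedBuffer(1);
         Thread consumer = new Thread(new Runnable() {
             public void run() {
                 try {
                     buffer.take();
                 } catch (InterruptedException ignored) {
                 }
             }
         });
         consumer.start();
         consumer.join(500);                 // give it half a second to (not) finish
         assertTrue("take() should still be blocked", consumer.isAlive());
         consumer.interrupt();               // clean up the blocked thread
     }
 }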

In this case we wrote a whole lot more than twice the code. But probably we have redundant tests that could be pruned back a bit. I tested more than the public interface in the first test and should not have. And the second test is duplicated. So we could remove 2 of the 7 tests and, if we score 100%, still feel sure that this code runs correctly. -- DonWells


You are showing me tests that do not expose the bug. That only reinforces my belief that concurrent code is hard to test, and therefore hard to refactor at an extreme rate. What I need to see is one test that does expose the bug. Telling me how you would normally test sequential code is moot unless that same process yields effective concurrent tests as well. -- TomCargill


I was going to say that XP would have trouble in a concurrent program environment, because RegressionTests are so necessary. However, after reading the discussion, and rereading Tom's original question, my thoughts are different.

The question is - suppose you were using XP, and had this sort of code to write. Would you drop XP tactics? I wrote, on DesignBeforeCoding and ProgramYourWayOut, that sometimes you have to think before you type. I think this is one of those times. Then, having thought, you implement the smallest concurrent fragment you need, and develop test cases as best you can. At this moment, XP looks similar to many development methods, but different from some in that it explicitly recommends making a small chunk of code work early. After that moment, I would agree with Tom that you will not want to go in and change that code on every whim. When you do decide you need to upgrade it, you would go in with the same care as when you first typed it in.

So where does that break XP? You *might* still have implemented the simplest solution that works, you still have the best test cases you can think of (and no other method has better), you still do all the other XP things. What you do not do is change the code willy-nilly, which would only be silly, not XP.

Is this reply close to target, Tom and XPers? -- Alistair


When I compared a suite of tests to a fine wine I was perhaps exaggerating. I have done this UnitTest gig several times now, and each time it is surprising how fast you can converge on a good test suite (be it unit or acceptance) if you can stick to one simple practice: if you find a bug in production, you add an AcceptanceTest; if you find a bug in an AcceptanceTest, you add a UnitTest. It can be very tempting when fixing a bug to say "now that this is fixed I never need to test it again". That is when problems occur.

Yes, Tom, automated tests guard your functionality. They do not necessarily show bugs. But do we need to know anything other than that all the required functionality is working? Anyone who claims that testing by the seat of the pants is going to find more bugs than automated testing is foolish.

XP does not say you don't need to think. But there is a different way of thinking. I had a devil of a time learning how to do it after decades of think first, code second. But basically what you want to do is think a little, then conduct an experiment by coding. Then think a little more and code a little more. I am ashamed to admit that at the top of this thread I made some wild, unsubstantiated claims about the code having bugs; that was not the ExtremeWay, and I turned out to be wrong. But now look at JavaUnitTestChallengeSolved. The first thing I did was to create an environment where I could learn more about Tom's challenge code by actually running it. I started out simple and learned about the problem by creating one UnitTest, then just one more unit test, then just one more, etc. I tried experiments with different numbers of threads and priorities, many of which I did not show, but which did teach me a lesson. I thought as I experimented, and the experiments gave me concrete feedback with which to do even more thinking.

I don't think changing code willy-nilly sounds appealing to anyone, but you do need to be able to change any code in the system fearlessly. -- DonWells


Alistair: Your position is that all code is not treated uniformly when it comes to making refactoring decisions. That position is not XP, as I understand it. I don't have an opinion on how severe the impact might be.

Don: Suppose then that you decide that you have enough UnitTests, proceed to production, and then discover the bug described in ExtremeProgrammingChallengeFourteenTheBug. Now you need an AcceptanceTest that exposes the bug, so that it does not return. I don't know how to create a test that depends on a particular scheduling event. What do you do? (I ran into almost exactly this situation two weeks ago. Would the details help?)

-- TomCargill


I just spotted this by AlistairCockburn in SimplifyTheRequirements, which I think is relevant here. "In fact, we never figured out how to test the mutual exclusion, because they couldn't generate test cases that would get in between those three statements."

-- TomCargill

Indeed. Good reading. Fortunately, in this case, the critical code was tiny. I concluded there was nothing we could do beyond a code reading. Even single-stepping the reader and writer pair would not have proved anything except that the code worked the way it appeared to. It wouldn't really have tested our thinking. In the context of this challenge, I was determined to create a safe zone for that code that people wouldn't go in and touch. Making it 3 lines long was the best way I could find. Making it match a 2-page proof would invite arbitrary "refactoring", and it would almost certainly not work after the first change.

With respect to XP, Tom has my assertion correct: that not all code is to be treated with the same casual attitude with respect to refactoring. -- AlistairCockburn


When I was an OperationsResearch guy, I learned that there are two basic types of simulation. One is called deterministic and the other MonteCarlo.

The test I created in AddOneMoreCheckToTestTwentyThreads, which I believe shows ExtremeProgrammingChallengeFourteenTheBug, is deterministic. In order to make it deterministic I had to set things up just so. I had to control the order in which threads ran and the priority of the threads. In an AcceptanceTest, you have far less leeway in setting up your test. You want the test to simulate the actual running of the system as much as possible. You want to input data as far upstream and interpret data as far downstream as possible.

So in an AcceptanceTest I would prefer not to reach into the guts of the system to contrive some test. Instead I would make it a MonteCarlo simulation. That is, I would run the AcceptanceTest over and over a sufficient number of times. I could then come up with a score based on how many times the test failed. -- DonWells
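[A minimal sketch of that MonteCarlo idea: run one end-to-end scenario many times and score the failures. runScenarioOnce() is a placeholder for whatever the real acceptance test drives, assumed to return false when a run does not complete normally; it is not part of the challenge code.]

 // MonteCarlo-style scoring: the same acceptance scenario is run repeatedly
 // and the score is the number of runs that failed.
 public class MonteCarloScore {

     public static void main(String[] args) throws Exception {
         int runs = 1000;
         int failures = 0;
         for (int i = 0; i < runs; i++) {
             if (!runScenarioOnce()) {
                 failures++;
             }
         }
         System.out.println("score: " + failures + " failures in " + runs + " runs");
     }

     // Drive the system from as far upstream as possible and check its output
     // as far downstream as possible; return false if anything is missing or
     // wrong. Deliberately left abstract here.
     private static boolean runScenarioOnce() throws Exception {
         return true;
     }
 }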


Having looked at AddOneMoreCheckToTestTwentyThreads, I have two comments.

First, this test does not expose ExtremeProgrammingChallengeFourteenTheBug. The bug is a matter of LiveNess, not FairNess. When the bug arises, producer and consumer threads that should transfer objects and proceed instead end up stuck in wait(). This manifests itself as an accumulated output that is shorter than expected, not an output of the right length with elements in the wrong order.
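[To make that distinction concrete, a sketch of a check aimed at LiveNess rather than ordering: after a fixed deadline, every produced object should have been consumed, and no worker should still be parked in wait(). BoundedBuffer, its capacity, and the thread counts are assumptions for illustration, not the challenge code or Don's tests.]

 // Liveness-oriented check: if the bug strikes, some producers/consumers stay
 // parked in wait() and fewer objects than expected come out before the deadline.
 public class LivenessCheck {

     private static int consumed = 0;

     private static synchronized void countConsumed() { consumed++; }
     private static synchronized int getConsumed() { return consumed; }

     public static void main(String[] args) throws InterruptedException {
         final BoundedBuffer buffer = new BoundedBuffer(1);   // assumed class
         final int threads = 10;
         final int itemsPerThread = 100;
         Thread[] workers = new Thread[2 * threads];

         for (int i = 0; i < threads; i++) {
             workers[i] = new Thread(new Runnable() {             // producer
                 public void run() {
                     try {
                         for (int j = 0; j < itemsPerThread; j++) {
                             buffer.put(new Object());
                         }
                     } catch (InterruptedException ignored) {
                     }
                 }
             });
             workers[threads + i] = new Thread(new Runnable() {   // consumer
                 public void run() {
                     try {
                         for (int j = 0; j < itemsPerThread; j++) {
                             buffer.take();
                             countConsumed();
                         }
                     } catch (InterruptedException ignored) {
                     }
                 }
             });
         }
         for (int i = 0; i < workers.length; i++) {
             workers[i].start();
         }
         for (int i = 0; i < workers.length; i++) {
             workers[i].join(5000);       // generous deadline per thread
         }

         // The output must be full length; anything less means some threads
         // are stuck in wait() and liveness has been lost.
         System.out.println("consumed " + getConsumed() + " of " + (threads * itemsPerThread));

         for (int i = 0; i < workers.length; i++) {
             workers[i].interrupt();      // unstick anything still in wait()
         }
     }
 }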

Second, with respect to FairNess, the challenge code makes no attempt to be fair. Generally, the concurrency primitives in Java are not fair. To achieve FairNess in Java, we must add an explicit layer of policy (as described, for example, in http://www.sni.net/~cargill/jgf/9809/SpecificNotification.html ). Using testing to establish that concurrent code is fair sounds at least as hard as establishing the correctness property we already have here. I would prefer to put FairNess on hold for now.

Unless I've missed something, the challenge stands.

-- TomCargill

Far as I'm concerned, the core of the challenge was to show that it's difficult if not impossible to find some defects in concurrent code by testing. That makes refactoring difficult - but it makes writing the code originally difficult as well. Personally, I accept that as having been shown long ago. I applaud Don's continuing effort to build a test that will show the defect at all. We might well learn something if/when he's successful.

As I understand the defect, my guess is that it can't be shown except by a real (if simulated) scheduler that can put threads on and off wait.


It's an XP challenge because Tom said, "Alistair: Your position is that all code is not treated uniformly when it comes to making refactoring decisions. That position is not XP, as I understand it."

Right. My original challenge is about refactoring. See the top of the page. -- tc


Looking at Don's most recent tests, in JavaUnitTestChallengeSolved, I believe that he has created a test that exposes the bug. I studied the OneAndTen test, taking a thread dump at the point where it hangs: I saw that both a producer and a consumer thread were in the wait state, which is the crucial property. Congratulations, Don.

Readers should probably take a quick look at the test code itself. Even if you don't care about the details, scan the code to get a general feel for its size and complexity. Don posted the code at ftp://members.aol.com/jdwells123/challenge14.zip.

I think that Don's efforts reinforce my original hypothesis that testing concurrent code is very hard. (Scan the whole page to see the path by which we got here.) In this case, it was easier than I had speculated, but still very hard.

Given how hard it is to test, can concurrent code be refactored at an extreme rate?

-- TomCargill


Tom, ThankYou, and I appreciate your fairness in judging. Unfortunately, we are still at odds.

I believe that this challenge shows that where there is a will, there is a way to test.

The challenge code was originally believed to be untestable, yet we found a way to test it fairly completely. Was it hard? I did the work. I can say truthfully that it was not too bad. Each test is less than a dozen lines long, exclusive of the testing framework and common or boilerplate code. It did take some time, but then XP has taught me that any time spent developing UnitTests is recouped many times during the lifetime of a project. Keeping things simple helps reduce the time spent developing tests. But if the simplest solution is still complex, as concurrent code can be, then you must take the time to test it all the more. In other words, the harder your code is to test, the greater the time you save in the long run.

Can we write sufficient tests that we feel confident about refactoring concurrent code in general? I have written a system that included concurrent, re-entrant, real-time code. We reviewed our code to ensure correctness. But testing is how we found our bugs. So for me the answer is yes. Many are the hours I could have saved if I had known then what I know now.

Concurrent code is not a barrier to ExtremeProgramming. -- DonWells


Having followed this discussion for the weeks/months it took to unfold, I side with Tom. No one found the bug until Tom pointed it out. Don thereupon managed to detect a known failure, over a fairly long time (and with Tom continually pointing out that the test did not uncover the defect). Detecting a known bug with coaching from the sidelines is hardly the point of the exercise. If Tom had not pointed out the bug, it is unpredictable whether Don would have found it, or at what time cost. The moment the code changes, it is again unpredictable whether or when any defects would be caught. Once you have this code working, I would be quite surprised if you decide to go in and refactor except under duress - and quite suspicious of the test harness for the result.

However (in a typical wiki experience), we learned more than originally anticipated. Special kudos to Don for producing the test code, and to Tom for examining all the versions. Thanks from the sidelines. -- AlistairCockburn


Alistair, I believe you are misinterpreting the results. First, let's add up how long this challenge actually took. Extending the original testing framework for deadlock detection and writing the original 7 unit tests took 3 skating lessons, 3 gymnastics lessons, plus one evening. The original incorrect solution was developed while I took my wife to the beauty salon. The final two tests took an additional skating lesson. Extending the testing framework for non-deterministic tests took an additional evening.

Altogether, about 16 hours. That would be two days on an actual project. I hardly believe this qualifies as difficult or even unreasonable. For comparison, let us note that the first UnitTest I created for the phase 1 VcapsProject required two weeks. That first test was the foundation of a test suite that has reduced our bug count by about half and given us the confidence we need to refactor even the complex JangIt-style sections. So is two days unreasonable to test something tricky like concurrent access to a shared resource? To make sure it not only works today, but a year from today? Absolutely not - if this were a project, the time would be well spent indeed.

Could the bug have been found if Tom had not given it away? If you have been following this thread, you already know that knowing the answer actually hindered my progress. Coaching from the sidelines? Tom will vouch for the fact that I asked him exactly one question about Java threads.

So, as I have already said, testing this code was not hard.

But again let's return to the real question: can we refactor concurrent or even complex code in general? This raises the next question: can we have confidence in a set of UnitTests? The answer is obvious if we stick to the rules. We create an initial test suite, make it as good as we can conceive, and then add to it whenever a new bug is found. Then yes, we can have the confidence to refactor as required.

If we identify some code as untestable and therefore unchangeable, we will end up with an entire system that is (obviously) unmaintainable too. I have worked that way before. I did not enjoy it.

Get off the sidelines: the time it takes to develop good unit tests is not as great as your fears are telling you. Approach unit and acceptance testing with confidence and tenacity. Allow your test suites to evolve with your project. Have confidence that you can change any line of any method in the system. It is a better way to work. -- DonWells


Well put, Don! I wind up agreeing with both sides: you can test sufficiently to give yourself confidence, and it is harder to test concurrent code. Don, I think the tests you created are quite possibly beyond the skills of many people who (unfortunately) are trying to write concurrent code. And it's not clear whether finding a similar defect would require even more testing creativity, but I think it might. I wonder if a general approach could be defined.

On the third hand, concurrent code is usually not the bulk of the system one is developing. "Ordinary" UnitTests can be written for the bulk of the code, and you can dig deeper in the bag of tricks for the concurrency.

Don ... do you think the stuff you did is worth posting with the other JUnit files on my Web site or another repository? -- RonJeffries


By way of comparison with Don's numbers, here's how my original time was spent. Writing and basic unit testing of the code took less than an hour. At that point, I believed it correct, and decided to prove it so. I then spent two sessions, of about an hour each, failing to prove the critical property. So I switched to trying to find a counterexample, which took less than an hour.

(Don't misconstrue the paragraph above as a claim that verification techniques are more effective than testing in general. I do believe that subtle concurrency bugs can be found more effectively by various forms of inspection than by testing.)

Don's numbers suggest that for concurrent code he is prepared to spend an order of magnitude more time testing than coding. That makes concurrent code an outlier in terms of the expected effort that testing takes in (my understanding of) XP. It's not "normal" XP.

The number we don't have is what it might have cost to encounter this bug in production. It's a nasty one, because it can result in a system that hangs for a while and then resumes normal operation. (Don uses the term 'deadlock' in the tests, but it's not a true deadlock, because the hanging state can be resolved by the arrival of a new thread.)

-- TomCargill


The last bone of contention here seems to be how long it takes to create UnitTests such that XP can apply in this situation. That is good, because I have not yet heard KentBeck put a time limit on how long it should take to create unit tests relative to the actual code. My belief is that "normal" XP means you take as long as it takes, no more and no less.

As for seeing this bug in production, that would happen only if the maintenance programmers who follow us don't take the time to either rewrite the ProofOfCorrectness (in your case) or press the button on the TestingFramework (in my case). Neither of which is very likely to happen, so either way we're covered...right? -- DonWells


I think that stricter coding standards would be appropriate when doing XP and concurrency. I may be wrong, but I seem to recall reading in a book about concurrent programming in Java that one should always use notifyAll() and never notify(). In that case, the rule would be: use notifyAll(), not notify(). (Is there ever a situation where one should use notify() instead of notifyAll()?) When using semaphores, I'd have a rule about always locking a group of multiple semaphores in the same order. A pair-partner would help you to follow these rules. -- KeithRay
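[A sketch of the coding standard Keith describes, applied to a generic bounded buffer (an illustration, not the challenge code): every wait() sits in a loop that re-checks its condition, and every state change is followed by notifyAll().]

 // Illustration of the proposed rule: condition re-checked in a while loop
 // around wait(), and notifyAll() (never notify()) after every state change.
 public class GuardedBoundedBuffer {
     private final Object[] slots;
     private int count = 0, putIndex = 0, takeIndex = 0;

     public GuardedBoundedBuffer(int capacity) {
         slots = new Object[capacity];
     }

     public synchronized void put(Object item) throws InterruptedException {
         while (count == slots.length) {   // re-check after every wakeup
             wait();
         }
         slots[putIndex] = item;
         putIndex = (putIndex + 1) % slots.length;
         count++;
         notifyAll();                      // wake both producers and consumers
     }

     public synchronized Object take() throws InterruptedException {
         while (count == 0) {              // re-check after every wakeup
             wait();
         }
         Object item = slots[takeIndex];
         slots[takeIndex] = null;
         takeIndex = (takeIndex + 1) % slots.length;
         count--;
         notifyAll();                      // wake both producers and consumers
         return item;
     }
 }

The cost is that every notifyAll() wakes threads that may not be able to proceed; the benefit is that a notification can no longer be consumed by a thread waiting for the other condition, which is the liveness failure described in ExtremeProgrammingChallengeFourteenTheBug.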


Once the test is written, of course people will press the button. Once the proof is written, I don't believe every project will rewrite it every time they edit the code, time pressures being what they are.

Further, though we should take as long as necessary to write the tests, if it's hard to write a test, its chances of being written are reduced.

AlistairCockburn has pointed out elsewhere that XP is a HighDisciplineMethodology. I agree with him: you have to make yourself do it. I've seen teams that know better fall away from testing in times of stress.

Also, Don, I think you underestimate how hard it would be for an ordinary person to have written the tests you wrote. A lesser being might well have said, "I don't know how to test any better than this". A lazier being, such as myself, might have said "OK, that's good enough, and it obviously works anyway."

Finally, and this is the scariest: this test was hand-crafted to find a particular known defect. As such, it's not clear how much protection it gives against future related defects.

I put all these together and agree with TomCargill: testing is very difficult in Java concurrency. My conclusions are:

When doing concurrency,


[From UnitTestsReconsidered...]

ExtremeProgrammingChallengeFourteen took a very long time to verify one bug (16 hours). A bug that I would have just flushed out by reading the code and thinking it through. Time is expensive. [presumed to be a comment from SunirShah]

Actually, ExtremeProgrammingChallengeFourteen took 16 hours for a suite of tests that proved the existence of one error and tested for many others. I have heard many, many excuses for why someone cannot create UnitTests. These are all just that: excuses. I have learned to interpret such excuses as someone telling me they just don't want to write code that way. There is absolutely nothing wrong with not wanting to write code this way or that way, and no one should feel guilty about it or feel the need to justify it or apologize for it. -- DonWells


I hate to jump into this so long after this thread's been idle... but it strikes me that much of the difficulty and uncertainty stems from the fact that the 'bug' being tested for is not the root bug. The root issue (as I believe has been mentioned elsewhere) is that there are two different types of notifications being sent on a single lock, only a single notify is sent at a time, and a thread can release the lock without sending a new notify. (I.e., use of 'notify()' would be acceptable if every point where the lock is released sent one.)
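[One way to remove that root issue (two kinds of notifications multiplexed onto one lock) is to give each kind of waiter its own condition queue, which Java 5's java.util.concurrent.locks makes straightforward. A generic sketch, not a rewrite of the challenge code:]

 // Sketch: splitting the two notification types onto separate condition
 // queues (notFull for producers, notEmpty for consumers) so a signal can
 // only ever reach a thread that is actually waiting for that event.
 import java.util.concurrent.locks.Condition;
 import java.util.concurrent.locks.ReentrantLock;

 public class TwoConditionBuffer {
     private final ReentrantLock lock = new ReentrantLock();
     private final Condition notFull = lock.newCondition();
     private final Condition notEmpty = lock.newCondition();
     private final Object[] slots;
     private int count = 0, putIndex = 0, takeIndex = 0;

     public TwoConditionBuffer(int capacity) {
         slots = new Object[capacity];
     }

     public void put(Object item) throws InterruptedException {
         lock.lock();
         try {
             while (count == slots.length) {
                 notFull.await();          // producers wait only here
             }
             slots[putIndex] = item;
             putIndex = (putIndex + 1) % slots.length;
             count++;
             notEmpty.signal();            // wakes a consumer, never a producer
         } finally {
             lock.unlock();
         }
     }

     public Object take() throws InterruptedException {
         lock.lock();
         try {
             while (count == 0) {
                 notEmpty.await();         // consumers wait only here
             }
             Object item = slots[takeIndex];
             slots[takeIndex] = null;
             takeIndex = (takeIndex + 1) % slots.length;
             count--;
             notFull.signal();             // wakes a producer, never a consumer
             return item;
         } finally {
             lock.unlock();
         }
     }
 }

This is essentially the structure used by ArrayBlockingQueue in java.util.concurrent.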


See ProofOfCorrectness, ProofsCantProveTheAbsenceOfBugs, TestsCantProveTheAbsenceOfBugs, and SynchronizationStrategies.


Please also see DiningPhilosophersChallenge.

