Tdd Code Coverage

What CodeCoverage do you see on TestDrivenDevelopment (TDD) projects?

Do TestDrivenDevelopment projects get "high" CodeCoverage "without even trying"? How high? 90%?

Would measuring the CodeCoverage of production code run by xUnit tests be useful and/or helpful on a TestDrivenDevelopment project? What kind of problems would it help detect, illustrate, and/or correct?


This page was inspired by the "Code Coverage and QC" thread on news:comp.software.extreme-programming in March 2004.

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&threadm=78eac%244063273b%24cff54506%241003%40dcanet.allthenewsgroups.com&prev=/groups%3Fhl%3Den%26group%3Dcomp.software.extreme-programming

http://makeashorterlink.com/?Q103228D7


RobertMartin posted to the news:comp.software.extreme-programming thread "Code Coverage and QC" on 2004-03-21:

"Just for fun I downloaded Clover and ran a rough coverage analysis over FitNesse (http://fitnesse.org). It reports over 80% line coverage, 80% method coverage, and 60% conditional coverage -- just when I run the unit tests."

VladimirTrushkin? posted:

"60% is [...] low. You have left 40% of decisions in your code not examined by tests"

RobertMartin posted:

"I came to the same conclusion, though it's important to remember that there's a large chunk of the code that was throwing exceptions because clover wasn't accessible from the second JVM. As such, the paths in that code weren't tracked well.

Still, I Imagine that the 60% ratio is close to what we'll find once we fix that problem. I looked at a number of the condition issues, and a large proportion of them turned out to look like this:

 public ItalicWidget(ParentWidget parent, String text) throws Exception
 {
   super(parent);
   Matcher match = pattern.matcher(text);
   if (match.find())
     addChildWidgets(match.group(1));
   else
     System.err.println("ItalicWidget: match was not found");
 }

This is a design pattern that we used many dozens of times throughout this particular part of the code. The 'if' statement checks for something that should be a logical impossibility. So no unit test was ever written to see if it failed.

Indeed, we've never seen these messages in real life; so we've dodged this bullet so far. Still, as I looked at it I realized that the condition is not utterly impossible. One of us could make a modification to the code that allowed the false branch of that 'if' statement to execute.

The issue is that there are *two* regular expressions used to match the text string. The first is applied by the caller of this constructor. The second is the one you see in the constructor itself. The two regular expressions are not identical but are supposed to be "compatible". For example, the two regular expressions for the ItalicWidget? are:

 ".+?"
 "(.+?)"

Clearly if a string matches the first, it'll match the second.

Still, if a programmer creates a pair of patterns that aren't quite compatible, then we could get these errors. On the other hand the unit tests of the *first* pattern are typically very tight so it's very unlikely that an incompatibility can exist.

This leaves me with the following dilemma:

  - Should I write tests for the failing cases?
  - Should I eliminate the if statement?
  - Should I change the design?

Writing tests for the failing cases doesn't prove anything very useful. It doesn't prove the two regular expressions are consistent. It only proves that if they are inconsistent an error will be printed. In order to write the tests I'll have to replace one of the two existing expressions, which is tantamount to changing the code.

Eliminating the if statements would increase my Decision coverage, but would just feel *wrong*. I've seen these "can't happen" messages printed too often to simply eliminate them.

I don't know how yet, but there may be a way to change the structure of this code so that the dilemma goes away. Perhaps there is a way to generate one of the regular expressions from the other. But is this worth it? What would I really be buying for my effort?

I think this is interesting. Getting these failing branches covered by tests doesn't feel necessary to my pragmatic self, but feels like an omission to the part of me that wants things to be tight and ship-shape. I guess I'll just have to pull out that part of me that is an engineer and make a decision based upon all the conflicting variables."
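
One way the "generate one of the regular expressions from the other" option above might look is to define the raw expression once and build the extracting pattern by wrapping it in a capturing group, so an incompatible pair simply cannot be written. This is only a minimal sketch under that assumption; the WidgetPatterns class and the RAW/SCAN/EXTRACT names are hypothetical illustrations, not FitNesse's actual code.

 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 class WidgetPatterns
 {
   // Single definition of what an italic widget's body looks like.
   private static final String RAW = ".+?";

   // Pattern the caller would use to decide whether to construct the widget.
   static final Pattern SCAN = Pattern.compile(RAW);

   // Pattern the constructor would use to extract the body: RAW wrapped in a
   // capturing group, so any text that matches SCAN also matches EXTRACT and
   // group(1) is guaranteed to exist.
   static final Pattern EXTRACT = Pattern.compile("(" + RAW + ")");

   public static void main(String[] args)
   {
     String text = "some italic text";
     if (SCAN.matcher(text).find())
     {
       Matcher match = EXTRACT.matcher(text);
       match.find();                       // cannot fail when SCAN found a match
       System.out.println(match.group(1)); // prints "s" -- ".+?" matches minimally
     }
   }
 }

With the two patterns derived from one definition, the "can't happen" else branch in the constructor really can't happen, and the dilemma about covering it goes away.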

RobertMartin posted to the news:comp.software.extreme-programming thread "Code Coverage and QC" on 2004-03-22:

"So here are some better numbers. I excluded all of the third party code in FitNesse and made the funny process boundary thing work well.

"Coverage in FitNesse is:

  90.3% of methods
  92.5% of statements
  82.1% of conditionals.

On a module by module basis the averages are:
 fitnesse.............................76.5%
 fitnesse.components..................94.3%
 fitnesse.http........................84.7%
 fitnesse.responders..................89.4%
 fitnesse.responders.editing..........96.8%
 fitnesse.responders.files............92.2%
 fitnesse.responders.html.............95.4%
 fitnesse.responders.refactoring......97.6%
 fitnesse.schedule....................96.2%
 fitnesse.socketservice...............97.8%
 fitnesse.updates.....................93.9%
 fitnesse.util........................94.8%
 fitnesse.wiki........................91.0%
 fitnesse.wikitext....................93.0%
 fitnesse.wikitext.widgets............94.2%

"The one we really got dinged on was the fitnesse module which holds some older archival code that doesn't run in any of the unit tests. If I pull this module out then coverage for the whole program goes to
  91.0% of methods
  93.7% of statements
  84.3% of conditionals.

"So, clearly we aren't in the high 90's.


AnthonyWilliams posted to the news:comp.software.extreme-programming thread "Code Coverage and QC" on 2004-03-24:

"I just ran gcov on my latest project (which is C++), and the line coverage results vary from 80% to 100% across the modules, with the average I would guess to be about 92%. I am going to use this information to help eliminate dead code, and add tests.

"As far as I can see, most of the code that isn't covered is the error handling, so in general I need to add more tests for things handling errors correctly. The code is just there as a reflex, rather than explicitly added in response to a failing test --- I need to get out of that habit, and into the habit of writing the failing test to force me to add the code. However, what really concerns me is that some of the missed code is supposed to be live code, and is supposed to be covered by the tests."

and

"Just to add: this was done using TDD. Also, this is the first real project I've done using TDD, and the first time I've run a coverage analysis on this codebase."

From later notes: The coverage is only for unit tests, not acceptance tests, which are done manually on the project.

JasonNocks asked:

"So, with respect to the important code that is not covered by TDD Unit Tests, do you remember whether this code was typically written during, or in direct response to, manual acceptance/validation testing?

"If so, then do you think it's possible that you stopped following TDD at those points in time, but practiced TDD the remainder of the time? Thoughts? I ask this, because I've personally experienced this myself."

AnthonyWilliams responded:

"The acceptance test required that it handle data types A, B and C. I wrote the tests for A and B then added the code for all 3, as it was nearly the same. Consequently the code for type C is untested. This overgeneralizing goes in the same bin as writing error handling code without tests --- a bad habit I need to get out of.

"Personally, I'm impressed at how high the figures are --- the tests for the legacy client have nowhere near so much coverage. Not that we've run a coverage analysis, but we keep finding bugs that have been there for years, and the manual testing means it is impractical to test everything.

"I am sure that Vladimir will feel that this is unacceptable, but it's the reality for most software developers that the 90% coverage we're seeing from TDD is actually a much higher level of test coverage than they're used to."


[In response to a request for code coverage numbers for a non-TestDrivenDevelopment OpenSource project...]

JasonNocks posted to the news:comp.software.extreme-programming thread "Code Coverage and QC" on 2004-03-23:

"I'll pony up another data point for comparison (contrast).

The Linux Test Project (LTP) has been going on for something like 4 years now. You can check the project out at: http://ltp.sourceforge.net/coverage/coverage.php

Here's a summary of the last posted results I could find quickly:

  Minimum overall: 0%
  Maximum overall: 50.65%
  Average overall code coverage: 16.04%

I feel it's important to add that I place a great deal of value on GPL'd projects, and Linux in particular. I do not think that these numbers in any way reflect negatively on GNU/Linux, or Free Software / Open Source in general. To me, the numbers achieved reflect the difficulty in achieving high code coverage with an approach that creates tests after implementing.

These results were obtained from tossing the numbers into a quick spreadsheet. The actual numbers are organized by kernel path and are available at: http://ltp.sourceforge.net/coverage/coverage-2.5.26/index.html These numbers are from Linux Kernel 2.5.26, from August 2002 (a bit out of date, but still after more than 2 years of the project's lifetime, and the latest that I could quickly find).

Here's a more interesting point. Not all of the tests pass! What good would high coverage be if a number of the tests don't pass? The list of "known failing tests" shortly after the test results were published can be found at: http://ltp.sourceforge.net/errorstemp.php?file=20020910-x86.html

Now I know that GNU/Linux has been around for a little over 10 years, and has a much larger code base. But, the LTP currently has something like 18 developers working on it, and has been around for about 4 of those years. I don't know how many were on the project later in the year 2002, but it's not been just 1 guy. There have been contributions from IBM, SGI, and others from very early on.

Lastly, the LTP has lots of other data, including stress and regression testing results. I think that those tests results are actually more interesting (not just because I think those results are more impressive for GNU/Linux).

Here's what I see: Results have been provided for two stable GPL'd projects, both of which receive commercial support of one sort or another.

The LTP actively tracks and works to improve its metrics, and has done so for a timespan of years. After over two years, it reached an average overall test coverage of 16.04% with a number of the tests not even passing.

FitNesse, which was primarily test-driven (that's my understanding), was observed, after downloading and running Clover "Just for fun" (curiously also the name of Linus' book), to have several test coverage metrics ranging from around 82% to 92%, with tests that *all pass* (this is again my understanding).

We can continue to collect data points, but so far I see absolutely nothing to suggest that we should question TDD's validity or feasibility with respect to code coverage. Again, everything I see so far indicates to me that TDD's results are quite good."


SpringFramework:

"Spring itself has an excellent unit test suite. Our unit test coverage as of 1.0 M1 is over 75%, and we’re hoping to go above 80% unit test coverage for 1.0 RC1. We’ve found the benefits of test first development to be very real on this project. For example, it has made working as an internationally distributed team extremely efficient, and users comment that CVS snapshots tend to be stable and safe to use." -- RodJohnson

From http://www.theserverside.com/articles/printfriendly.tss?l=SpringFramework


CategoryTestDrivenDevelopment

