Cleanroom Software Development Empirical Evaluation

Discussion of the paper "Cleanroom Software Development: An Empirical Evaluation" by Richard W. Selby, Victor R. Basili, and F. Terry Baker, from IEEE Transactions on Software Engineering, Volume SE-13, Number 9 (September 1987), pages 1027 to 1037. (Also available as Chapter 16 in the SoftwareStateOfTheArt book, ISBN 0-932633-14-5.)

It relates to CleanroomSoftwareEngineering.

[Editorial comments in this section are in separate paragraphs or in square brackets, and marked with the contributor's name. -- JeffGrigg]

The paper describes an empirical study of cleanroom vs. traditional techniques, using 3-person ChiefProgrammerTeam-s of students working in a software development course in Washington DC, in 1982 and 1983. (Average experience: 1.6 years.) Teams independently developed a simple EMAIL system -- around 1500 lines of code.

From 10 cleanroom teams, and 5 teams using conventional methods, they concluded the following:

"Cleanroom teams met the requirements more completely than did the non-Cleanroom teams." (And JeffGrigg sees in the data that cleanroom teams more consistently delivered functionality: About 1/2 of both teams delivered "nearly all" required functions. But ALL cleanroom teams delivered over 50% of required functionality, while 2 of 5 non-cleanroom teams "flamed out" and delivered only 25% of required functionality.)
Based on testing designed to measure the expected reliability of the system, as experienced by the end user, "Cleanroom projects had a higher percentage of successful test cases at system delivery."

"In both of the product quality measures of implementation completeness and operational testing results, there was quite a variation in performance." And, in a footnote, where they drop the less successful 2/5 of teams using each approach from the statistics, they conclude "comparing the best teams from each approach increases the evidence in favor of Cleanroom in both of these product quality measures."

[Dropping nearly half of the data collected seems highly questionable. The comment that dropping data "increases the evidence in favor of Cleanroom..." raises concerns for me regarding the objectivity of this study.]

Applying metrics to the code revealed that the cleanroom teams produced code that was...

more highly commented
of lower complexity density

Some other measures did not indicate significant differences. (Other observations are more obscure. Feel free to go read the article yourself. ;-)

"... These observations suggest that the more successful Cleanroom developers simplified their use of the implementation language." (more procedure calls and IFs, fewer CASE and WHILE statements, etc.)

On schedule slippage: "All the teams using Cleanroom kept to their original schedules by making all planned deliveries; only 2 of the 5 non-Cleanroom teams made all their scheduled deliveries." (Teams determined their own schedules, but were required to make incremental deliveries 1 to 2 weeks apart.) JeffGrigg's observation: The two teams that ended up doing one monolithic release, instead of incremental releases, performed the worst of all 15 teams in all the other important measures -- particularly conformance to requirements and reliability. Neither of these two was a cleanroom team.

[JeffGrigg's observation: In terms of conventional unit testing, the two teams that did worst in all the most important measures had another interesting property when the amount of computer usage was measured: One team used the least computer time of all teams in the study. [Later discovery: This was a one-person "team" with a "highly experienced" (7 years?) member. All other teams started with 3 members, but some cleanroom teams lost a member during the study.]

The other "problem" team used far more computer time than any other team (they're WAY out of the pack). JeffGrigg thinks they did lots of unit testing -- to no good effect, but there's no clear evidence to support or refute this. This team did not produce significantly more code (lines or executable statements) than other teams, but did divide this code into significantly more functions than any other team.]

Not surprisingly, the cleanroom developers would agree with the statement "I missed the satisfaction of program execution." In the study's conclusion, it's suggested that developers be allowed to witness, but not influence, the use of the product during independent testing. [JeffGrigg suggests that watching real users use your product is a good idea, regardless of methodology. Care to try watching them through a one way mirror?]

On other measures of developer opinion, it's surprising to find that most of the cleanroom developers (26 of 28) claimed that cleanroom development did not "substantially change their development style." [See article for exact wording of the question.] 11 of the 28 claimed that they did not change their development style at all. (That leaves 2 who said that "Yes, my style was substantially revised.") Some developers mentioned that they already utilized some of the techniques, in work before entering the study. [JeffGrigg thinks this indicates that using the CleanRoomMethodology will not require substantial changes to your work habits -- you'll just do more of the things you already know are good.] [On the other hand, JeffGrigg thinks that the student responses to this question could be explained as an ego thing. Like "no way will I admit that I was doing it wrong in all my work before this class." ;-> ] [ Or, that as a student, "change" is so normal that you have a hard time noticing it (until you read your two year old code :-) -- PerGunnarHanso ]

In another interesting measure of developer subjective opinion, most of the developers concluded that they would use the CleanRoomMethodology again, but not for all projects. The reasons for this result were not investigated. [Perhaps cleanroom is good for some projects, but not others? -- JeffGrigg]

Selected references, of the 47 in the article:

M. Dyer and H.D. Mills, "The Cleanroom Approach to Reliable Software Development," in Proceedings of the Validation Methods Research for Fault-Tolerant Avionics and Control Systems Sub-Working-Group Meeting: Production of Reliable Flight-Critical Software, Research Triangle Institute, N.C., November 2-4, 1981.
M. Dyer, "Cleanroom Software Development Method," IBM Federal Systems Division, Bethesda, Md., October 14, 1982.
"Developing Electronic Systems with Certifiable Reliability," in Proceedings of the NATO Conference, Summer 1982.
"Software Development Under Statistical Quality Control," in Proceedings of the NATO Advanced Study Institute: The Challenge of Advanced Computing Technology to System Design Methods, Durham, UK, July 29 to August 10, 1985.
M. Dyer and H.D. Mills, "Certifying the Reliability of Software," IEEE Transactions on Software Engineering, Volume SE-12, pages 3 to 11, January 1986.

Comments? (I know I'm tired of typing. ;-> -- JeffGrigg)

Seems like the polar opposite of ExtremeProgramming. What I want to know is:

What was the interface between stakeholders and developers? How often did they meet? Did the stakeholders meet with both the developers and the QA team?
What process or documentation did they use to communicate requirements? What stakeholder feedback process did they use to deal with bogus requirements? What was their scheduling process?
How were the resulting deliverables evaluated by the stakeholders?
How maintainable was the result? Did they try to toss in a wild-ass requirement at one of these projects after the first release, and how readily was it assimilated?
How fragile was the process? Did they try throwing a wild-ass requirement in *after* the design team's effort but *before* the QA team's first release, and how bad a slippage did it make?
If the QA team noticed any architectural or design flaws, scaling problems and so on, did they communicate those back to the design team, or did they try to do some design themselves?
Did either team iterate in parallel, or were the QA folk sitting on their hands the whole time the design team did their thing?
You say this compared well with "traditional" techniques. Do you mean just waterfall, or has it performed well against spiral methodologies as well?
How does this methodology differ from BigDesignUpFront?

--PeterMerel

We have quite a few questions here.

Regarding gathering requirements from users: My reading indicates that cleanroom development relies on a WaterFall-like method to gather requirements. That is: First you gather and document all the requirements, then you do cleanroom development, then you deliver the product. I find this interesting, as my several references clearly describe the cleanroom development process itself as iterative, with incremental deliveries. Cleanroom QA testing, for reliability measures, is always done on incremental releases.

I suspect that cleanroom development would be compatible with iterative requirements gathering, but that the organizations that most commonly practice it are politically incompatible with that approach. (...tho' I don't have evidence to support this.)

In this particular study, "a requirements document for an electronic message system (read, send, mailing lists, authorized capabilities, etc.) was distributed to each of the teams. The project was to be completed in six weeks [...]" I think it's safe to say that developers had no contact with the fictitious end users, and I strongly doubt that any requirements changed during the project.

We'll have to look to other sources to address the requirements gathering questions, particularly those involving interaction with real end users and requirements changes. But I think this is typical of empirical studies: How many businesses would be willing to fund 15 different development efforts to produce the same product?

On interaction with the QA team: It's important to realize that the role of QA testing in cleanroom is different from its traditional role. In most other projects, the QA team finds bugs, and the development team "works off" the list of bugs. The QA team tests boundary conditions and exceptions, in an attempt to find as many bugs as possible. You deliver the product when the list of known bugs is sufficiently small.

With cleanroom development, the QA team tests each incremental delivery using randomly selected test cases that are designed to mimic the usage patterns of real users. That is, they focus on determining the reliability of the software, as the user will experience it. They DO NOT go out of their way to find bugs. (The software should be 99.9% bug-free when they get it.) QA's role is to VERIFY that the code is 99.9% bug-free, not to find as many bugs as possible.
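To make the idea of usage-based random testing concrete, here is a minimal Python sketch. It is not taken from the paper: the feature names only echo the requirements document quoted above, and the usage weights, the function names, and the stand-in "system" are all invented for illustration. It draws test cases at random in proportion to an assumed usage profile, then reports the fraction that pass as a rough reliability estimate.

```python
import random

# Hypothetical usage profile: how often real users are ASSUMED to invoke each
# feature. The names echo the study's requirements document; the weights are
# invented for this example.
USAGE_PROFILE = {
    "read": 0.6,
    "send": 0.3,
    "mailing_list": 0.1,
}


def draw_test_cases(profile, count, seed=None):
    """Pick `count` operations at random, weighted by expected usage."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=count)


def estimate_reliability(run_case, cases):
    """Return the fraction of sampled cases the system handles correctly."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for case in cases if run_case(case))
    return passed / len(cases)


def fake_system(operation):
    """Stand-in for the system under test: pretend mailing lists are broken."""
    return operation != "mailing_list"


if __name__ == "__main__":
    cases = draw_test_cases(USAGE_PROFILE, 1000, seed=1)
    print(f"estimated reliability: {estimate_reliability(fake_system, cases):.3f}")
```

Because the cases are weighted by expected usage, a defect in a rarely used feature lowers the estimate only a little, while a defect in a heavily used one lowers it a lot. That is the sense in which this kind of testing measures reliability "as the user will experience it" rather than hunting for as many bugs as possible.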

Three things: One, Jeff, are you doing Cleanroom? If not, why not? If so, please report how it's going.

Two, would anyone care to fund a code-off experiment between a Cleanroom team and an Extreme team? I can come up with an ExtremeTeam. Maybe Beck, Cunningham, Auer, and, to hold the team down to finite capability, me.

Three, the team that used the most computer time was not clearly doing a lot of unit testing. It isn't at all clear what they were doing. --RonJeffries

[Yes, you have a very good point there. -- JeffGrigg]

Oh, and four ... I guess I'll have to read the article, but how do you do Cleanroom with only three people of whom one drops out?

To me, three people seems like an excessively small team. But on a 6-week project with rather straightforward and stable requirements, it's probably workable. You can do CodeReview and PairProgramming, even with only two people.

I talk a lot, but I haven't done cleanroom development yet. Nor have I personally seen a team use it. I am a contract computer programmer, and I find that most development environments I visit are strongly disposed to continue techniques they're comfortable with, no matter how irrational or self-destructive those techniques may be. They're often hostile even to small improvements, like an occasional DesignReview or a little PairProgramming. Change can be threatening. Thus I find, in most cases, that the only development process I have the authority to control is my own: One (extraordinarily modest (-: ) person doing one job really well.

Cleanroom development can only be attempted with strong project management support, and with some training of developers to ensure that they can successfully use techniques, like code reviews, instead of relying on unit testing to find their errors.

For me, the most compelling result of learning about cleanroom development is accepting the concept that it is possible to produce bug-free code. This stands in stark contrast to conventional wisdom, which is that the code you produce must always contain lots of bugs, and so you use testing to find them.

My current practice is not CleanRoomMethodology. [Over time (since I wrote this), I've gotten closer and closer to ExtremeProgramming.] -- JeffGrigg

Could someone please summarize what process the control groups in this evaluation were doing? What is the clean room methodology an improvement on?

CategoryMethodology, CategoryBook (actually an IEEE magazine article; see CleanroomSoftwareEngineeringBooks)