Behavioral Effect Of Metrics

Extracted from XP egroups discussion as relevant to this crowd:

Anon. ...metrics exist, and how do they affect a project? XP projects in particular, since they are lightweight and can change direction fairly easily, may feel the effects of metrics (read: exhibit interesting phenomena) more quickly.


JimLittle (To: extremeprogramming@egroups.com Sent: Monday, October 09, 2000 9:19 PM Subject: [XP] The power of the right metric)

Per XP Explained, our team has a whiteboard on which we track a few important metrics. We track our load factor because we sometimes have a tendency to goof off. We track time and task progress because we like to see if we'll meet our deadlines. And we track the number of unresolved support incidents. At least, we used to.

We started tracking the number of support incidents because it seemed like the support swami (our support role; it rotates weekly) was always buried in work. In addition, there seemed to be some cards that just never got resolved. We thought that tracking the number might help the situation.

Well, it did, but not by much. The number went down a bit and then hovered around 13-14 outstanding incidents. There didn't seem to be anything we could do to get the number any lower.

Then, a week ago, we decided to take another approach. We decided that the number of support incidents we resolved wasn't nearly as important as the *speed* at which we resolved them. To that end, we stopped tracking the number of outstanding issues and started tracking the age of the oldest support issue instead.

The first day, our oldest issue was over a month old. Now, a week later, we have no outstanding issues at all. In the six months since we started the support swami role, we have NEVER had zero outstanding issues. Until now.

What happened? Well, it turned out that there were a bunch of support issues that were especially ugly. Since the support swami was a rotating role, and since he was incented to resolve as many issues as possible, those "ugly" issues never got looked at. "The next guy will handle them," was the rationale.

When we changed our metric, suddenly there was incentive to re-examine those ugly issues. A number of them turned out to be cases where we had been waiting for a call back from the user. We resolved these by handing ownership back to the user if they didn't get back to us quickly. (The support swami is only supposed to be used for emergency issues; our reasoning was that, if the user can't get back to us fairly soon, it must not really be an emergency.)

Most of the remainder of the "ugly" issues turned out not to be support issues at all. They were actually stories in disguise. People had called the support swami with them in order to make an end-run around our iteration planning process (demand for our time is very high) and the swamis hadn't been diligent about making sure only emergency issues were accepted. We converted these issues into stories.

The remaining "ugly" issues didn't turn out to be so bad. They were just old, and hadn't been taken by the current swami. Once the swami got past the hump of having to figure out what the issue was, those were resolved as well.

The bottom line? The feedback provided by a few carefully-selected metrics can be a powerful incentive for your team. Make sure you're tracking the right things, though, or you may be incenting people in the wrong direction. Be sure that an improvement in the metric will actually reflect an improvement in the code or in the customer's happiness.

Jim
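

A minimal sketch of the two metrics Jim contrasts, assuming each open issue carries the date it was opened (the issue names and dates below are made up for illustration, not taken from his team's tracker):

    from datetime import date

    # Illustrative only: each open support issue with the date it was opened.
    # (Jim's team tracked these on cards and a whiteboard, not in code.)
    open_issues = [
        {"title": "report prints blank page", "opened": date(2000, 9, 4)},
        {"title": "user locked out",          "opened": date(2000, 10, 2)},
        {"title": "export crashes on save",   "opened": date(2000, 10, 8)},
    ]

    today = date(2000, 10, 9)

    # Old metric: how many issues are outstanding.
    # A swami can improve this by closing many easy issues and leaving the ugly ones.
    outstanding_count = len(open_issues)

    # New metric: age (in days) of the oldest outstanding issue.
    # The only way to improve this is to deal with the oldest, ugliest issue first.
    oldest_age_days = max((today - issue["opened"]).days for issue in open_issues) if open_issues else 0

    print(outstanding_count)   # 3
    print(oldest_age_days)     # 35

The difference in incentive falls straight out of the arithmetic: closing the two newest, easiest issues drops the first number from 3 to 1, but leaves the second number stuck at 35 until someone finally tackles the oldest card.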


JimLittle A few weeks after I posted the above message, I retired from the tracker position. The support swami, which is a role that rotates weekly, was put in charge of updating his own support metric. Some swamis kept the number up to date and some didn't. I noticed that the ones who kept the number up to date were also the ones with the best numbers (though it's hard to say, since I don't know the numbers for sure).

My final impression is that tracking the support metric definitely has had a significant effect on team members' behavior. However, without a dedicated tracker helping with discipline, the effect can occur either as increased effort or as hiding from the metric, depending on the individual. No swami has ever voluntarily put up a red (bad) metric, but I have seen the metric stop getting updated for several days in a row. Now, two months later, it hasn't been updated for over a week.

Interestingly, the team's quarterly bonus was directly tied to their responsiveness to support stories this quarter. That has had no visible effect whatsoever. Last quarter, when we were conscientious about updating the metric but it had no effect on bonuses, support stories were taken care of quickly. This quarter, the metric isn't being updated but money is involved... but support stories aren't being taken care of. I think the moral is that daily feedback is more important than money.

Jim


As JimCoplien has been saying for years (repeating an old adage), "What gets measured gets done." Measure the Severity A bug reports? Nothing that doesn't actually kill a user will be higher than Severity B. How long are bug reports in the "open" state? They never will be; they'll skip directly from "created" to "closed" (or will be transitioned as soon as the system allows). Track how many lines of code a programmer writes? You'll see a lot of comments. Measure lines of non-commentary code? There are tricks to bloat that one, too; you may even encourage copy-and-paste. :-( Measure how many bug reports a programmer does real work on? You'll see heavy concentration on trivial fixes, and an avoidance of doing anything complicated.

I've always mistrusted software metrics. All too often, all they measure is how well people can manipulate them. (I hate having this attitude! Improving software quality or software development without metrics is like losing weight without ever getting on a scale. Not that getting on a broken scale will help a dieter any....) --PaulChisholm
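

To make the lines-of-code example concrete, here is a naive sketch of the two counting rules (the rules and the sample snippet are made up for illustration, not taken from any particular tool):

    def count_lines(source: str) -> tuple[int, int]:
        """Return (total lines, non-commentary lines) for a chunk of C-style code.

        Naive on purpose: it only recognizes whole-line // comments and blank lines.
        """
        total = 0
        non_comment = 0
        for line in source.splitlines():
            total += 1
            stripped = line.strip()
            if stripped and not stripped.startswith("//"):
                non_comment += 1
        return total, non_comment

    gamed = """\
    // This function adds two numbers.
    // It takes two arguments, a and b.
    // It returns their sum.
    int add(int a, int b) { return a + b; }
    int add2(int a, int b) { return a + b; }  // copy-and-paste bloats the second count
    """

    print(count_lines(gamed))  # (5, 2): comments inflate the first number, duplication the second

Reward the first number and the comment block grows; reward the second and the copy-and-paste does.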


We've just introduced a crap metric which counts the number of bugs. The test team looks good if they find lots of bugs. The development team (maintaining a stovepipe product) looks good if they fix lots of bugs. Result: the most trivial of problems are being recorded as bugs so that they get tracked and counted, rather than anyone first confirming that they aren't misunderstandings or already recorded. Very, very silly.

If there's any reuse, can you get your testing team to report each bug in reused code as a separate bug for each screen that displays it? (I've seen this happen before! ;-)


Unfortunately, there is a lot of belief in metrics as black magic. "If you measure it, it will improve."

Measurements only track a specific quality that results from a process; changing the reported result, however, does not necessarily change the process, nor change it for the better. The more importance you place on a measurement, the more likely the reported measurement is to improve, but you don't know what was done to cause that improvement. If, on the other hand, you change a process, you can use measurements to determine whether an improvement has occurred. You can also use measurements to track a process and see whether unknown problems have crept in.

Do not use measurements to drive change; you don't know what you will get. Use measurements to track change; they will tell you when and whether something happened.


There is one software metric that I do believe has enormous importance: stress pass rate. (This depends, of course, on what kind of product you are working on -- whether your product even has stress tests, automated or not.) I was a Microsoft OS developer for several years. Win 98 had horrible stress pass rates -- daily stress results would range between 60% and 85%, and when the rate held at 90%+ for a couple of weeks, the product was deemed shippable. (This was due mainly to the incredibly long and tortured history of the DOS/3x/9x code base, not to any lack of diligence on the part of the Windows development team. Now, management, that's a different story.)

The NT code base, however, was considered in dire emergency if it dipped as low as 95% on the daily stress pass rates. (And it rarely did.) The NT test team had very hard requirements for stress pass rates -- and if you blew up a stress test, you had hell to pay until you fixed it. I firmly believe that the NT team's absolute devotion to stress numbers -- every day, not just in "ship mode" -- made all the difference, and it was a privilege to work in such an environment.

Well-chosen software metrics can be used to drive progress. Completely avoiding metrics is irrational. Stress pass rate is one of the most useful (and unambiguous) software metrics we have.

-- ArlieDavis
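

For concreteness, the stress pass rate is just the fraction of the day's stress runs that survived. A minimal sketch (the run counts are made up; the 90% and 95% bars are the ones Arlie quotes, not Microsoft's actual tooling):

    # Illustrative sketch: stress pass rate = passing runs / total runs for the day.
    def stress_pass_rate(passed: int, total: int) -> float:
        return passed / total if total else 0.0

    # Hypothetical results from one day of lab stress runs.
    daily_results = {"passed": 187, "total": 200}

    rate = stress_pass_rate(daily_results["passed"], daily_results["total"])
    print(f"{rate:.1%}")            # 93.5%

    # The teams differed mainly in where they drew the line:
    WIN9X_SHIP_BAR = 0.90           # held for a couple of weeks -> deemed shippable
    NT_EMERGENCY_BAR = 0.95         # dipping below this was treated as an emergency

    if rate < NT_EMERGENCY_BAR:
        print("NT-style response: drop everything and fix the stress failures")

Part of what makes the metric hard to game is that a stress failure points at a specific blow-up on a specific machine; the only way to move the number is to fix the failure.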


Another BehavioralEffectOfMetrics is that WhateverGetsMeasuredGetsOptimized.


See also HawthorneEffect PerformanceIndicators


CategoryMetrics | CategoryStory

