February 02, 2004 - Selecting Developer Testing Metrics

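The first step in deciding what metrics to use is to specify clearly what results we want to achieve and what behaviors we want to encourage. In the context of developer testing, the results and behaviors that most organizations should target are the following:

  • To start and grow a collection of self-sufficient and self-checking tests written by developers.
  • To have high-quality, thorough, and effective tests.
  • To increase the number of developers who are contributing actively and regularly to the collection of developer tests.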

I will tell you about prioritizing and setting targets for these objectives in the next article; right now I want to focus on picking the best metrics to measure progress and success in these areas.

This article addresses the following topics:

  • What Makes a Good Metric?
  • Misusing Metrics
  • Putting It All Together
  • Refining Your Metrics

What Makes a Good Metric?

One lesson I have learned over the years is that, in order to be useful and to actually get used, any metric you choose should be simple, positive, controllable, and automatable. Let me elaborate a bit on each of these properties.

Simple: Most software systems are quite complex and the people who work on them are usually quite smart, so it seems both reasonable and workable to use complex metrics - but this is wrong! Although complex metrics may be more accurate than simple ones, and most developers will be able to understand them (if they are willing to put the time into it), I have found that the popularity, effectiveness, and usefulness of most metrics (software or otherwise) are inversely proportional to their complexity. I suggest that you start with the simplest metrics that will do the job and refine them over time if needed.

The Dow Jones Industrial Average index is a good example of this effect. The DJIA is a very old metric and is necessarily simple because it was developed before computers could be used to calculate it and there weren't as many public companies to track anyway. Today there are thousands more stocks that can be tracked and the DJIA still takes into account only 30 blue-chip stocks, but because it's simple and it seems to track a portion of the stock market well enough, it's still the most widely reported, understood, and followed market index.

Positive: I consider a metric to be positive if I want the quantity it measures to go up. Code coverage is a positive metric because increases in code coverage are generally good. The number of test cases is a positive metric for the same reason. On the other hand, commonly used metrics based on bug counts (for example, number of bugs found, number of bugs outstanding, etc.) are considered negative metrics because I want those numbers to be as low as possible.

Yes, I know, it's good to find bugs; it means that the tests are working, but they are bugs nonetheless. You should file them, track them, and set goals to prevent and reduce them, but they are not a good basis for developer testing targets.

Controllable: You should tie the success of your developer testing program to metrics over which you have control. You can control the growth in code coverage and the number of test cases (that is, you can keep adding test code and test cases) but the number of bugs that will be found by the tests is much harder to control.

Automatable: If calculating a metric requires manual effort it will quickly turn into a chore and it will not be tracked as frequently or as accurately as it should be. Make sure that whatever you decide to measure can be easily automated and will require little or no human effort to collect the data and calculate the result.

Let's apply these criteria to come up with an initial set of metrics to measure the objectives we have listed. You can use the following list as is, or modify and extend it to match your specific needs and objectives.

Objective: To start and grow a collection of self-sufficient and self-checking tests written by developers.

The two simple metrics I recommend to get you started are:

  • Raw number of developer test programs.
  • Percentage of classes covered by developer tests.

Both metrics are simple, positive, controllable, and easy to automate (although you'll need to use a code coverage tool for the second one - more about that later).

Objective: To have high-quality, thorough, and effective tests.

If you implement and start measuring the metrics for the previous objective you will soon have a growing set of developer tests. In my experience, however, the quality, thoroughness, and effectiveness of those tests can vary widely. Some of the tests will be well thought-out and thorough, while others will be written quickly, without much thought, and will provide minimal coverage. The latter type of tests can give you a false sense of security, so you should augment the first two metrics with additional measurements that can give you some indication of test quality. As you might suspect, this is not an easy task; this is one of the objectives where you will have plenty of opportunity for adding and refining metrics as you progress (see note 1 at the end of this article). But you have to start somewhere, and as a first step I suggest focusing on test thoroughness, which can be measured with some objectivity using a code coverage tool.

There are many code coverage metrics that you can use, but for the sake of simplicity I recommend picking three or four of them and then, to further simplify, combining them into a single index. The specific metrics will vary depending on the programming language(s) used in your code; the following are my suggestions for code written in Java.

Basic code coverage metrics for Java:

Method coverage tells you whether a method has been called at least once by the tests, but does not tell you how thoroughly it has been exercised.

Outcome coverage is a seldom-used but very important test coverage metric. When a Java method is invoked it can either behave normally or throw one of several exceptions. To cover all possible behaviors of a method, a thorough test should trigger all possible outcomes or, at the very least, it should cause the method to execute normally at least once and throw each declared exception at least once.

Statement coverage tells you what percentage of the statements in the code have been exercised.

Branch coverage augments statement coverage by keeping track of whether all the possible branches in the code have been executed.
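The difference between statement coverage and branch coverage is easiest to see on an `if` with no `else`. The sketch below is a hand-instrumented toy, not how a real coverage tool works; the names `clamp_to_zero` and `branch_coverage` are made up for illustration.

```python
def clamp_to_zero(x, taken):
    """Return x, or 0 when x is negative.

    `taken` is a set used as hand-written instrumentation: it records
    which direction the `if` went on each call.
    """
    negative = x < 0
    taken.add(negative)  # True = branch taken, False = branch skipped
    if negative:
        x = 0
    return x


def branch_coverage(inputs):
    """Return (branches covered, total branches) after calling with inputs."""
    taken = set()
    for x in inputs:
        clamp_to_zero(x, taken)
    return len(taken), 2  # the single `if` has two possible directions


if __name__ == "__main__":
    # One negative input executes every statement in clamp_to_zero,
    # yet only one of the two branch directions is exercised.
    print(branch_coverage([-5]))
    # Adding a non-negative input exercises the other direction too.
    print(branch_coverage([-5, 3]))
```

A test suite that only ever passes a negative number reaches full statement coverage here while leaving the "skip the `if`" path untested, which is the gap branch coverage is meant to expose.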

Since we want to keep things as simple as possible (remember the Dow Jones Industrial Average index example), I recommend combining these four metrics into a single index. Let's call it the Test Coverage Index, or TCI for short. I am sure we all know enough math to come up with a very impressive-looking formula that uses all sorts of fancy symbols, Greek letters, and impressive terms such as weighted means, variances, quartiles, etc. We could spend days discussing the relative merits of method coverage vs. statement coverage vs. branch coverage vs. outcome coverage and how to weight each of them, but I will invoke the principle of simplicity once more and recommend the following relatively simple formula in which each coverage metric is weighted equally:

TCI = (MC/TM + OC/TO + SC/TS + BC/TB) * 25


MC = methods covered, TM = total methods
OC = outcomes covered, TO = total outcomes
SC = statements covered, TS = total statements
BC = branches covered, TB = total branches

I multiply the sum of the ratios (which will range between 0.0 and 4.0) by 25 in order to get a friendly, familiar, and intuitive TCI range of 0 to 100 (if you round it to the nearest integer, which I recommend).
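The formula is small enough to sketch in a few lines of a scripting language. The Python below is one possible reading of it; the function name `compute_tci`, the (covered, total) tuple layout, the sample counts, and the choice to reject a total of zero are all assumptions made for illustration, since the formula itself does not say what to do when, for example, the code has no branches.

```python
def compute_tci(methods, outcomes, statements, branches):
    """Return the Test Coverage Index (0-100) from four (covered, total) pairs.

    Each argument is a (covered, total) tuple, for example methods=(MC, TM).
    """
    ratios = []
    for covered, total in (methods, outcomes, statements, branches):
        # Assumption: a zero total is treated as an error rather than guessed at.
        if total <= 0 or not 0 <= covered <= total:
            raise ValueError("need total > 0 and 0 <= covered <= total")
        ratios.append(covered / total)
    # The four ratios sum to a value between 0.0 and 4.0; multiplying by 25
    # rescales that to 0-100 with every metric weighted equally.
    return round(sum(ratios) * 25)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100 methods, 30 of 150 outcomes,
    # 500 of 2000 statements, and 60 of 400 branches covered.
    print(compute_tci((40, 100), (30, 150), (500, 2000), (60, 400)))
```

With these sample counts the four ratios are 0.4, 0.2, 0.25, and 0.15, which sum to 1.0 and give a TCI of 25. Note that Python's `round` sends exact halves to the nearest even integer, which only matters if you care about values that land exactly on .5.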

The TCI is a bit more involved than the previous metrics but it still meets our key criteria: it is simple, positive, controllable, and automatable.

Is the TCI perfect? No. Is it good enough to get your developer testing program started and effective in helping you achieve your initial objectives? You bet.

Objective: To increase the number of developers who are contributing actively and regularly to the collection of developer tests.

The terms actively and regularly are key components in this objective. Having each developer contribute a few tests at the beginning of a developer testing program is a great start, but it cannot end there. The ultimate objective is to make the body of tests match the body of code and to keep that up as the code base grows - when new code is checked in, it should be accompanied by a corresponding set of tests.

Since we already have the TCI in our toolset, we can reuse it on a per-developer basis with the following metric:

  • Percentage of developers with a TCI > X for their classes.

Clearly, this metric only makes sense if there is a concept of class ownership, which I have observed is the case in most development organizations. Typically, class ownership is extracted from your source control system (for example, the class owner is the last developer who modified the code, or the one who created it, or the one who worked on it the most - whatever makes the most sense in your organization).
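One way the percentage of developers with a TCI above some threshold X could be computed is sketched below in Python. It assumes that a developer's TCI is calculated from the coverage counts summed over all the classes that developer owns, that ownership has already been extracted as a name per class, and that every owner has at least one method, outcome, statement, and branch; the function names and the sample data are hypothetical.

```python
from collections import defaultdict

METRICS = ("methods", "outcomes", "statements", "branches")


def tci(counts):
    """TCI from a dict mapping each metric name to a (covered, total) pair."""
    return round(sum(c / t for c, t in (counts[m] for m in METRICS)) * 25)


def percent_developers_above(classes, threshold):
    """Percentage of class owners whose TCI over their classes exceeds threshold.

    classes: iterable of (owner, counts) pairs, one per class, where counts
    maps each metric name to a (covered, total) pair for that class.
    """
    # Sum covered and total counts per owner (an assumption: the per-developer
    # TCI is computed from pooled counts, not averaged per class).
    per_owner = defaultdict(lambda: {m: (0, 0) for m in METRICS})
    for owner, counts in classes:
        for m in METRICS:
            covered, total = per_owner[owner][m]
            per_owner[owner][m] = (covered + counts[m][0], total + counts[m][1])
    above = sum(1 for counts in per_owner.values() if tci(counts) > threshold)
    return round(100 * above / len(per_owner))


if __name__ == "__main__":
    sample = [
        ("ana", {"methods": (8, 10), "outcomes": (6, 12),
                 "statements": (70, 100), "branches": (20, 40)}),
        ("ana", {"methods": (2, 10), "outcomes": (0, 8),
                 "statements": (10, 100), "branches": (0, 10)}),
        ("bo", {"methods": (0, 5), "outcomes": (0, 5),
                "statements": (0, 50), "branches": (0, 10)}),
    ]
    print(percent_developers_above(sample, 10))
```

In the sample data one owner has a pooled TCI of 40 and the other a TCI of 0, so half of the owners are above a threshold of 10. Developers who own no classes do not appear in the input at all, so you would need to decide separately whether they count toward the denominator.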

Misusing Metrics

Most metrics can be easily misused (either intentionally or unintentionally) both by managers and developers.

Managers might misuse the metrics by setting unrealistic objectives, or focusing on these metrics at the expense of other important deliverables (for example, meeting schedules, implementing new functionality). We will discuss the best way to use these metrics in future articles, but for the time being we should remind ourselves that metrics are just tools that provide us with some data to help us make decisions. Since metrics can't incorporate all the necessary knowledge and facts, they should not replace common sense and intuition in decision making.

Developers might misuse metrics by focusing too much on the numbers and too little on the intent behind the metric. To prevent unintentional misuse it's important to communicate to the team the details and, more importantly, the intent behind the metric.

Perhaps I have been very lucky, but in many years of managing software developers I have yet to experience a single case of malicious and intentional misuse of metrics (for example, creating trivial and very shallow tests just to increase the total test count). Once I have made the intent of each metric clear, I rely on the honor system, and when I do make the occasional check of the data behind the numbers (for example, by looking at a random test) I never do it with the expectation of catching intentional wrongdoing but to make sure that the intentions were successfully communicated and interpreted.

Putting It All Together

The following table summarizes the developer testing metrics we have come up with so far:

| Results and Behaviors We Want To Achieve | Metrics To Drive Desirable Results and Behaviors |
| --- | --- |
| To start and grow a collection of self-sufficient and self-checking tests written by developers. | Raw number of developer test programs; percentage of classes covered by developer tests. |
| To have high-quality, thorough, and effective tests. | Test Coverage Index (TCI), which summarizes method coverage, statement coverage, branch coverage, and outcome coverage. |
| To increase the number of developers contributing to the developer testing effort. | Percentage of developers with a TCI > X for their classes. |

If you already have a code coverage tool, a code management system, and an in-house developer who's handy with a scripting language, you should be able to automate the collection and reporting of these metrics.

Below is an example of a very basic developer testing dashboard you can use for reporting purposes. Note that in this dashboard I added some non-developer-testing related metrics (the total number of classes and the total number of developers) to add some perspective to the metrics I am actually interested in.

Developer Testing Dashboard

| Metric | Value |
| --- | --- |
| Total number of classes | 1776 |
| Total number of developers | 12 |
| Raw number of developer test programs | 312 |
| Percentage of classes covered by developer tests | 27% |
| Test Coverage Index (TCI) | 16 |
| Percentage of developers with a TCI > 10 for their classes | 50% |
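A dashboard like this one is the kind of thing a short script can produce once the raw counts are available. The Python sketch below only does the final formatting step; it assumes the counts have already been collected from your coverage tool and source control system. The function names are hypothetical, and the raw inputs of 480 covered classes and 6 developers above the threshold are values picked so that the percentages match the example, not numbers stated in the dashboard.

```python
def percent(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


def build_dashboard(total_classes, total_developers, test_programs,
                    classes_covered, tci, tci_threshold, developers_above):
    """Return the dashboard as a list of (metric, value) rows."""
    return [
        ("Total number of classes", str(total_classes)),
        ("Total number of developers", str(total_developers)),
        ("Raw number of developer test programs", str(test_programs)),
        ("Percentage of classes covered by developer tests",
         f"{percent(classes_covered, total_classes)}%"),
        ("Test Coverage Index (TCI)", str(tci)),
        (f"Percentage of developers with a TCI > {tci_threshold} for their classes",
         f"{percent(developers_above, total_developers)}%"),
    ]


def render(rows):
    """Format the rows as two left-aligned plain-text columns."""
    width = max(len(metric) for metric, _ in rows)
    return "\n".join(f"{metric:<{width}}  {value}" for metric, value in rows)


if __name__ == "__main__":
    # 480 and 6 are assumed raw counts chosen to give 27% and 50%.
    print(render(build_dashboard(1776, 12, 312, 480, 16, 10, 6)))
```

Keeping the raw counts as inputs and deriving the percentages in one place makes it harder for the reported numbers to drift out of step with each other when the script is rerun.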

This is a very simple dashboard to get you started, but if you get to this point you will have more information and insight about the breadth, depth, and adoption of your developer testing program than 99% of the software development organizations out there.

Refining Your Metrics

What we covered in this article is just a start. As your developer testing program evolves you will probably want to add, improve, or replace some of these metrics with others that better fit your needs and your organization.

The most important thing to remember when developing your own metrics is to always start with a clear description of the results or behaviors that you want to achieve, and then to determine how those results and behaviors can be objectively measured. The next critical step is to try to keep all your metrics simple, positive, controllable, and automatable. This might not be possible in all cases, but it is essential to understand that your chance of success with any metric is highly dependent on these four properties.

Note 1: One possible measure of test effectiveness, for example, is the ability to catch bugs. You can get some idea of a test's ability to catch certain categories of bugs by using a technique called mutation testing. In mutation testing you introduce artificial defects into the code under test (for example, replace a >= with a >) and then run the tests for that code to see whether the mutation causes any of them to fail. If the tests still pass, it means that they are not effective in catching that particular kind of error.
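The mutation testing idea in the note above can be shown with a toy example. In this Python sketch the mutant is written by hand, whereas real mutation testing tools generate mutants automatically; every function name here is hypothetical.

```python
def is_adult(age):
    """Code under test: adults are 18 or older."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the >= has been replaced with a >."""
    return age > 18


def weak_check(fn):
    """A shallow test: it never looks at the boundary value 18."""
    return fn(30) is True and fn(5) is False


def boundary_check(fn):
    """A more thorough test: it also checks age 18, where the two versions differ."""
    return weak_check(fn) and fn(18) is True


def kills_mutant(check, original, mutant):
    """A test catches the mutation if it passes on the original and fails on the mutant."""
    return check(original) and not check(mutant)


if __name__ == "__main__":
    # The shallow test passes on both versions, so it misses the mutation.
    print(kills_mutant(weak_check, is_adult, is_adult_mutant))
    # The boundary test fails on the mutant, so it catches the mutation.
    print(kills_mutant(boundary_check, is_adult, is_adult_mutant))
```

The shallow test would count toward the raw number of tests and toward coverage just as much as the thorough one, which is why this kind of check says something the earlier metrics cannot.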

Posted by Alberto Savoia at February 2, 2004 09:33 PM
