August 24, 2006 - Build Failures Policy

I just wrote a page on our internal wiki with our policy for dealing with build failures. We thought others might find it interesting, so I am sharing it here (the links will be broken, for obvious reasons).

Executive Summary

If the build fails, fix it.

Directly from the wiki...

Red lights attract bugs

Developers have a natural aversion to breaking the build, and most will do everything they can to avoid it. The red lava lamp is our early warning signal that the build is broken.

Unfortunately, if the red light is already on, we tend to ignore all the symptoms of a broken build because, after all, it's already broken. If the red light is on for more than an hour, build-is-broken becomes the natural state of affairs.

To address this problem, we are introducing a new policy. It's the same as the old one.

What to do if the red light comes on

Step 1 - See if it is your fault

Check your email. If it was your checkin that broke the build, you will almost certainly have an email from xxxxxxx@agitar.com that has "Build Failed" in the subject.

If it was not your fault, look at the modifications list in the CruiseControl email to find the culprit. Guide them gently towards Step 1. If it is not obvious who the culprit is - or if the culprit is unavailable - fix it anyway.

Step 2 - Fix it

If it was your fault, fix it quickly. If you don't understand why it broke, ask for help.

If there is a reason that you cannot fix it immediately, make the red light go out by other means (see below).

Either way, send an email to xxx@agitar.com briefly explaining whether or not you are fixing it.

If it was not your fault and the light is still on after several minutes, inquire gently whether anyone is dealing with the broken build and when it will be fixed.

I can't fix it right now

For acceptance-level tests (system test, tiger acceptance tests, dashboard acceptance tests), it might be a while before the test can be made to pass - the feature being tested might not even be written yet. Just make a call to bug() in the failing test.
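
For illustration, a call might look something like this - a hypothetical sketch, assuming a JUnit 3 style test and a bug(int) helper like the one described in the comments at the end of this post (the class, test, and bug number here are made up):

    import junit.framework.TestCase;

    public class DashboardAcceptanceTest extends TestCase {

        // Hypothetical example: the feature under test is not finished yet,
        // so the failure is tied to an open bug and will not break the build.
        public void testShouldShowCoverageTrend() throws Exception {
            bug(1234); // made-up bug number, for illustration only
            // ... the actual test body, which currently fails ...
        }

        // Stub for illustration - a sketch of the real helper (it records
        // the bug number and test name to a file) appears in the comments below.
        private void bug(int bugNumber) {
        }
    }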

The build will not fail for a failing test that has an open bug. The build generates a report that separates known issues from new failures. Bugs that are scheduled for the current release will be shown in red.

Unit tests should be fixed immediately.

What exactly causes the red light to come on?

Unit tests

All the unit tests (plus the smoke tests) run on http://cc-unit-test:8080/cruisecontrol/ and a failure in any of them will light the light.

Product build

The entire product (Eclipse client and tiger server) is built on http://cc-builder:8080/cruisecontrol/

Acceptance tests

Once the product is built, the server is installed and the acceptance tests run at http://cc-builder:8080/cruisecontrol/

A failure in any of them will light the light - unless there is an open bug (see above).

Antelope tests

Antelope tests will only break the build if they don't compile. Failing tests should still be fixed quickly, of course. You have three options:

  1. fix the test
  2. fix the bug in the product that causes the test to fail
  3. regenerate the test

Is there anything that doesn't make the red light come on?

System tests

System tests for agitator and dashboard run on http://cc-system-test:8080/cruisecontrol/ and respect the open bug rule. They don't light the light though because the cycle time is too long.

Nightly agitator

The nightly agitator runs on http://cc-agitation:8080/cruisecontrol/

The build will only fail for assertion failures (not outcome failures). It will not light the light.

Agitation for com.agitar.common

Runs on http://dragon:8080/cruisecontrol/ (until it finds a permanent home).

The build will only fail for assertion failures (not outcome failures). It will not light the light.

There are too many emails!

Set up two mail filters:

  1. send emails from xxxx@agitar.com to another folder
  2. mark emails as read if they have "Build Successful" in the subject

Posted by Kevin Lawrence at August 24, 2006 04:20 PM



Comments

Interesting policy - something I think we should adopt more rigidly too.

We have recently had the unfortunate problem of failing acceptance tests as we have rapidly built a suite of tests for our main product, and some are a bit unstable. This can sometimes mask genuine unit test failures in that product or in other products built by CruiseControl. Or, worse, a behavioral change in the software that the acceptance tests have highlighted but that is ignored because people just think "the acceptance tests are always failing...".

How does your bug() call work? I am interested in implementing something similar. At the moment we have a suite called "failingTests" that bad acceptance tests are added to and that is not run as part of the CC build. I'd much prefer the simpler mechanism of a call to bug() in the failing test that was still logged and tracked in the CC build reports...

Posted by: Patrick Myles on September 2, 2006 05:44 AM

I have bug() write the bug number and test name to a file bugs.txt.

    7720 test.harness.rabbit.selftest.ShouldLogBugNumbers.testShouldAssociateBugNumbersWithTest
    9020 test.generation.environment.CharacterEncodings.testShouldSupportMultibyteInClassNames
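
Stripped down, the helper is something like this - a rough sketch only; walking the stack is just one way to discover the test name, and the file location is an assumption:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class Bugs {
        // Appends one "<bug number> <fully qualified test name>" line
        // to bugs.txt, as in the sample lines above.
        public static void bug(int bugNumber) {
            // Index 0 is this method; index 1 is the test method that called bug().
            StackTraceElement caller = new Throwable().getStackTrace()[1];
            String testName = caller.getClassName() + "." + caller.getMethodName();
            try (PrintWriter out = new PrintWriter(new FileWriter("bugs.txt", true))) {
                out.println(bugNumber + " " + testName);
            } catch (IOException e) {
                throw new RuntimeException("could not record bug " + bugNumber, e);
            }
        }
    }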

Then I have a custom Ant task that generates an XML report by merging bugs.txt with the TEST-xxxx.xml files that the Ant JUnit runner generates. It annotates the bugs with data from Bugzilla. I have an XSLT file that generates an HTML report with separate sections for passing tests, failures with open bugs, and new failures.
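
Conceptually the merge is simple. A heavily simplified sketch - no Bugzilla annotation, a single report file, and an invented class name:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class MergeBugs {
        public static void main(String[] args) throws Exception {
            // Load "number testname" pairs from bugs.txt into a lookup map.
            Map<String, String> bugs = new HashMap<String, String>();
            for (String line : Files.readAllLines(Paths.get("bugs.txt"))) {
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length == 2) bugs.put(parts[1], parts[0]);
            }
            // Tag each failing <testcase> in a JUnit report with its bug number.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(args[0]); // a TEST-xxxx.xml file
            NodeList cases = doc.getElementsByTagName("testcase");
            for (int i = 0; i < cases.getLength(); i++) {
                Element tc = (Element) cases.item(i);
                boolean failed = tc.getElementsByTagName("failure").getLength() > 0
                        || tc.getElementsByTagName("error").getLength() > 0;
                String name = tc.getAttribute("classname") + "." + tc.getAttribute("name");
                if (failed && bugs.containsKey(name)) {
                    tc.setAttribute("bug", bugs.get(name)); // a known issue
                }
            }
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(System.out));
        }
    }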

Finally I have another ant task that reads the XML file and causes the build to fail iff there is a failure with no open bug.
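
The verifier itself is tiny - conceptually something like this sketch of a custom Ant task (the class and attribute names are invented for illustration):

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.tools.ant.BuildException;
    import org.apache.tools.ant.Task;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class VerifyKnownFailuresTask extends Task {
        private String reportFile; // the merged XML report

        public void setReportFile(String reportFile) {
            this.reportFile = reportFile;
        }

        // Fail the build iff the report contains a failure with no
        // "bug" attribute - that is, a failure with no open bug.
        public void execute() throws BuildException {
            try {
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(reportFile);
                NodeList cases = doc.getElementsByTagName("testcase");
                for (int i = 0; i < cases.getLength(); i++) {
                    Element tc = (Element) cases.item(i);
                    boolean failed = tc.getElementsByTagName("failure").getLength() > 0
                            || tc.getElementsByTagName("error").getLength() > 0;
                    if (failed && tc.getAttribute("bug").length() == 0) {
                        throw new BuildException("new failure with no open bug: "
                                + tc.getAttribute("classname") + "." + tc.getAttribute("name"));
                    }
                }
            } catch (BuildException e) {
                throw e;
            } catch (Exception e) {
                throw new BuildException(e);
            }
        }
    }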

Posted by: Kevin Lawrence [TypeKey Profile Page] on September 3, 2006 09:35 AM

Neat.

Does the annotated XML report that you create, and your XSLT file, work as part of the CruiseControl transforms, or is that HTML report independent of CruiseControl's output?

So I guess you don't use the Ant JUnit runner to fail the build, but rely on your last Ant task to do that instead. I hadn't thought of doing that...

I don't suppose you want to share some of your code do you? ;-)


Posted by: Patrick Myles on September 5, 2006 09:17 AM

The XML report generator and the verifier are separate Ant tasks. The XSLT runs in Ant and generates a report that I copy into CC's archive directory.

The verifier also writes a summary to Ant's logger with a link to the report.


I have an index card on my desk that says

integrate the xslt transform into CC's email

and another that says

write an ant task that does all the above in a single task.

I'd be delighted to share the code when I get a chance to disentangle it from the rest of my build code. Any day now ;-)

Posted by: Kevin Lawrence [TypeKey Profile Page] on September 5, 2006 12:02 PM
