Continuous integration at Mozilla

Mozilla's poor continuous-integration story is a major source of stress and wasted time for Mozilla developers. Andreas Gal, for example, recently lost two consecutive weekends tracking down test failures. The story is all too common:

Numb to everyday "random orange" from intermittent failures, nobody noticed the new test failures until there were multiple new test failures. Many patches occupied each regression window, due to long test cycle times, combined with many developers sharing few opportunities to land. Backing out the potentially-responsible changesets one at a time required merging and waiting through the long cycles again.

Fixing these problems will require more than incremental improvements to the Tinderbox display. It will require, at the very least, rethinking how we organize and visualize the data coming off of our build machines. It may even require changes to the way we allocate build machines and check in code.

Recent work

Over the last few months, there has been an explosion of client-side Tinderbox modifications and mashups. Here's what we have now:

Other people in the Mozilla community are also interested in improving the continuous integration experience:

Why we load Tinderbox

  • Developers
    • Can I pull safely?
    • Can I push now?
    • Did I avoid breaking the tree so far?
    • Can I go to sleep now?
  • Sheriffs / firefighters
    • Is any column missing?
    • Has anything gotten slower?
    • What patch caused that slowdown?
    • What patch caused that test failure?
    • How do I get the tree green again?
  • Build engineers
    • Is any machine missing or unhealthy?
    • Which columns have long cycle times?
    • What changes needed clobbers, indicating makefile bugs?
  • Test engineers
    • What tests do we have?
    • What tests especially good at catching real bugs?
    • What tests are we skipping?
    • What tests are unreliable?
    • What tests are unreasonably slow, and maybe not worth running on every build if they can't be sped up?

Tinderbox tries to answer all these questions with a single <table>. As a result, it answers none of them well.

We can design better systems for answering each of these questions, but first, I'd like to present an idea for eliminating the need to ask the questions in bold.

Try-to-push

By integrating the Try concept directly into the way we work, we can make the answer to the first four questions be always yes. Under this proposal, instead of pushing to mozilla-central, everyone pushes to a server that runs all the tests, and if the tests pass, automatically pushes the changes to mozilla-central.

Then it is always ok to pull or push, because mozilla-central is always green. If my patch breaks something, it's immediately clear that it was my patch, and I don't have to be around to back it out. I haven't wasted anyone else's time or prevented anyone else from checking in.

When there are multiple pending changesets, the automatic push should include a rebase (or hg merge, but that's probably more painful). If my patch's rebase fails, my patch doesn't end up on mozilla-central, but I claim that the occasional automatic-merge failure (requiring additional action from only one developer) is less painful than the frequent backouts we have now.

Developers should be warned up-front if a patch will require manual merging in the case that other pending changesets succeed. We should have the option of basing changes on any pending changeset in addition to mozilla-central, with the caveat that if the base fails, our changeset will not go in either. And of course, we should still be able to use Try to test a changeset without intending for it to go onto mozilla-central immediately.

Reducing cycle time

We can decrease cycle times drastically by splitting up the testing work more sanely. Instead of having four unit test machines each testing a different revision (reporting to the same column), have each machine run a quarter of the tests. When the tree is calm, results will be available in a quarter of the current amount of time; when the tree is busy, it won't be any slower.

The cycle time goal for most machines should be the same. Otherwise, we're using our resources inefficiently, because a checkin isn't done cycling until all the tests are finished. We can make exceptions for especially slow tests, such as Valgrind, static analysis, or code coverage, but these slow tests shouldn't be allowed to turn the tree orange in the same way as other tests.

Finding performance regressions

Don't make me load 53 graphs, each of which takes 6 seconds to load. Don't even make me eyeball 20 graphs. Instead, Just tell me if I make Firefox slower. (Also tell me if I increase the variance enough to make it hard to spot future slowdowns.)

Using an algorithm frees us to track more performance data, such as the time taken on of each component of SunSpider. If a patch makes a single SunSpider file 5% slower, that might be noise as part of the total SunSpider score, but obvious from looking at just that one file.

An algorithm might be slightly worse or slightly better than a human at noticing numeric changes, but it's a hell of a lot more patient.

Understanding test failures

Make it easier to see which test failed. The "short log" isn't really short, and even the summary often includes extraneous information. Just tell me which tests failed and show the log from the execution of those tests.

Ted Mielczarek is working on making Tinderboxen produce stack traces for crashes. This will make it possible to debug many issues quickly, even if we can't reproduce them. Samples for hangs would useful in the same way.

It would be great if at least some of the boxes used record-and-replay to make it possible to debug other issues, including timing issues such as non-hang timeouts and race conditions. Ideally, I'd be able to replay in a VM on my own machine and debug the issue as it unfolds.

Fixing unreliable tests

Random oranges don't have to happen. If IMVU can eliminate random oranges while testing with Internet Explorer, we can do it while testing our own products.

The first step to fixing unreliable tests is finding out which tests are unreliable. Currently, it takes a very observant sheriff to notice that the same test has failed twice in a week. Even then, we can't search to see when the test began failing, so we don't know whether to disable the test or hunt for a regressing bug. We need a searchable database of test failures.

The next step is figuring out whether the test or the code is unreliable. For now, the answer is often "mark the test as 'skip' and hope we can figure it out later", which is better than distracting everyone with random orange, but not ideal. Again, record-and-replay is probably the only reliable way to find out.

Another approach to fixing unreliable tests is to use tools designed to hunt for unreliability. Valgrind's memcheck can find the uninitialized-variable and use-after-free bugs that only lead to crashes occasionally. Valgrind's helgrind can detect many types of race conditions. Unless we build our own botnet, Valgrind tests will be too slow to be allowed to turn the tree orange, but they will give us insight into some types of bugs that cause random failures on normal test machines.

Design lunch

It's going to take a lot of work from designers and engineers to make continuous integration work well at Mozilla's new scale, but I think the potential payoff in increased developer productivity makes it worth the effort to get this stuff right.

I've only proposed ways to answer some of the questions more efficiently. I'm just an observer -- not really a Mozilla developer, and definitely not a build engineer -- so I might have missed some important points.

John O’Duinn will host a design lunch on this topic tomorrow (Thursday, February 19, noon in California).

14 Responses to “Continuous integration at Mozilla”

  1. Arthur Says:

    “Backing out the potentially-responsible changesets one at a time required merging and waiting through the long cycles again.”

    Tsts. If the patches are independent (which they often are as they would otherwise land together) you’d better do a binary search and back out always half of the possibly responsible checkins. Logarithmic versus linear runtime is always a nice thing…

  2. Preed Says:

    My read of the situation (having worked on it in lives-previous-to-Mozilla, and judging solely on the complaints I see expressed herein) is the number of developers working on the project has increased the rate of change on the code base to the point where continuous integration of a centralized repo isn’t providing necessary information fast enough.

    More tooling and optimizing the build and continuous integration process provides increasingly diminishing returns wrt investment in it (especially in an area that is constantly changing, due to the product itself changing).

    Not to say there isn’t room for improvement, but a better gain would likely be made in changing in the development and integration methodology.

    Having said that, I’d also wager such a discussion would be DoA, because of resistance to change in the checkin process, which would add some amount of overhead, remove some amount of individual developer control, and make others responsible for getting changes checked in. It also requires more buy in and support from cross-functional parties, which… takes more effort than throwing people at (re)writing tools.

    Ironically, distributed version control would support such a model quite handily, but it would require getting away from the mindset of “mozilla-central” (a name that betrays the fact that its use is a centralized source code repository on top of a decentralized version control system).

    You might ask some of your old Mudder classmates who do this how they solve[d] it for more developers, across more product lines, way more branches, juggling more shared code, and with a (*gasp!*) centralized version control system and DIDN’T have to rewrite tinderbox (which they use) and come up with four different websites presenting the same information to get it done.

  3. Arpad Borsos Says:

    Nice write-up.
    What would definitely increase my productivity is speeding up the tests. I do a lot of code-build-test cycles on my machine to make sure my patches don’t break anything. And I have to wait a long time for the tests to finish. And a lot of them are not reliable at all. There are quite a lot of tests requiring focus on the browser window. Those tests usually fail when testing on a machine I am currently working on.

    Another must-have thing is tests on the try-server. This way critical patches can be tested on more architectures that I don’t have access to.

    The try-to-push method you mentioned also seems to be very interesting. But that really requires fast cycle-times to be effective.

  4. Dirkjan Ochtman Says:

    Echoing preed’s comments about getting away from mozilla-central, what about splitting more things out into their own repo? For example, the JS guys now do a lot of their work in a separate tracemonkey branch. If the /js dir was actually pulled in as part of the build process (or through some hg extension), how would that lighten the CI load? Are any other components/products amenable to this kind of treatment? For example, splitting up XULRunner and Firefox repositories? A case could be made that these are separate products, even though obviously there are a bunch of interdependencies.

  5. Robert O'Callahan Says:

    Try-to-push won’t actually guarantee that the tree is always green, because of nondeterministic test failures.

    I’m not going to cringe at comparisons with IMVU or others who don’t have the variety of test platforms or the number and complexity of tests that we do.

  6. Axel Hecht Says:

    FWIW, I posted in the newsgroup about infrastructure improvements that are not vapour ware but running on my build system. Just sayin’.

    Try: builds are flukes, Mozilla as a system is just too complex to rule out side effects just because patches merge.

  7. Jesse Ruderman Says:

    Mozilla as a system is just too complex to rule out side effects just because patches merge.

    Making people rebase manually even when patches touch different files does not solve this problem. It just shifts the blame when two patches interact badly.

  8. Jesse Ruderman Says:

    FWIW, I posted in the newsgroup about infrastructure improvements that are not vapour ware but running on my build system. Just sayin’.

    I believe Axel is referring to this thread:

    http://groups.google.com/group/mozilla.dev.builds/browse_thread/thread/4639c7524606f1b4

  9. Robert Kaiser Says:

    “Then it is always ok to pull or push, because mozilla-central is always green. ” Well, that’s only half the truth, as tryservers aren’t testing all uses of the Mozilla tree. There’s platforms and applications using mozilla-central as a base that might actually have failures Fennec, Thunderbird, SeaMonkey, and whatever else will probably never all be covered by try setups. And then, I fail to see how the try-to-push setup is much different than the current setup other than the tree you pull is always green for Firefox tier-1 platforms. If one tried push fails and another slightly later pushed one doesn, I don’t think it’ll be trivial to automatically sort that out without breaking the flow of patches into the repo, if automatically resolving it is possible at all.

  10. Chad Austin Says:

    Hi Jesse,

    Nice post. I’ll elaborate on a few key differences between development at IMVU and development at Mozilla.

    When the build goes red, commits are automatically blocked. Every engineer who is blocked helps contribute to the fix. This provides a natural feedback loop that causes the builds to get better over time. Eventually, we noticed we were wasting a lot of time on the same types of failures, so we required that each person who broke the build send out a five whys ( http://startuplessonslearned.blogspot.com/2008/11/five-whys.html )

    BTW, we only have two types of builds: red and green. If there’s something you care about, make regressions break the build. Build with -Werror too. Heck, it’s crazy to me that reviewers spend their time nitting on style, when you could automate that and just make style regressions break the build.

    In short, keep finding and removing waste in your development processes.

    In these comments I hear an undercurrent of “wah wah testing mozilla is hard”, but if you keep pushing on it, it gets easier and easier. IMVU’s got PHP tests, perl tests, Flash tests, client acceptance tests, unit tests, integration with third-party services, integration with our own services, Python tests, C++ tests, assembly tests, selenium tests against internet explorer, selenium tests against firefox, various performance tests (augmented by reporting from the field) and some that I’m probably forgetting. You obviously can’t do all of that overnight, but if you keep improving the situation, it will get better.

    One brick at a time.

  11. Chad Austin Says:

    BTW, you may need hundreds of build machines to get a 10-minute turnaround on testing your entire suite, but that’s doable too. Three years ago, we had two machines and hardly any tests. Now we have 40-50 and piles of tests.

  12. Raoul Duke Says:

    @Chad: right on! you hiring?

    everybody owes it to themselves to read up on the whole Lean Kanban For Software thing, perhaps.

  13. Chad Says:

    In fact we are! chad@imvu.com http://www.imvu.com/jobs/

  14. links for 2009-02-22 | The Build Doctor Says:

    [...] Jesse Ruderman » Continuous integration at Mozilla CI at Mozilla – fascinating. [...]