Archive for the 'Tinderbox' Category

War on Orange update

Friday, September 17th, 2010

Clint Talbert organized a meeting today on the topic of the intermittent failures. It was well-attended by members of the Automation, Metrics, and Platform teams, but we forgot to invite the Firefox front-end team.

There was some discussion of culture and policy around intermittence. For example, David Baron promoted the idea of estimating regression ranges for intermittent failures, and backing out patches suspected of causing the failures. But most of the meeting focused on metrics and tools.

Joel Maher demonstrated Orange Factor, which calculates the average number of intermittent failures per push. It shows that the average number of oranges dropped from 5.5 in August to 4.5 in September.
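The metric itself is simple; a minimal sketch of the calculation (illustrative only, not Orange Factor's actual implementation):

```python
def orange_factor(failures_per_push):
    """Average number of intermittent failures ("oranges") per push.

    `failures_per_push` maps a push identifier to the count of starred
    intermittent failures observed for that push.
    """
    if not failures_per_push:
        return 0.0
    return sum(failures_per_push.values()) / len(failures_per_push)

# Hypothetical data: three pushes with 5, 4, and 6 oranges each.
print(orange_factor({"rev_a": 5, "rev_b": 4, "rev_c": 6}))  # 5.0
```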

Daniel Einspanjer is designing a database for storing information about Tinderbox failures. He wants to know the kinds of queries we will run so he can make the database efficient for common queries. Jeff Hammel, Jonathan Griffin, and Joel Maher will be working on a new dashboard with him.

Two key points were raised about the database. The first is that people querying "by date" are usually interested in the time of the push, not the time the test suite started running. There was some discussion of whether we need to take the branchiness of the commit DAG into account, or whether we can stick with the linearity of pushes to each central repository.

The second key point is that we don't consistently have one test result per test and push. We might have skipped the test suite because the infrastructure was overloaded, or because someone else pushed right away. Another failure (intermittent or not) might have broken the build or made an earlier test in the suite cause a crash. Contrariwise, we might have run a test suite multiple times for a single push in order to help track down intermittent failures! The database needs to capture this information in order to estimate regression ranges and failure frequencies accurately.
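One way to capture that is to key results by suite run rather than assuming one row per push. A hypothetical SQLite sketch (table and column names are illustrative, not Daniel's actual design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pushes (
    push_id   INTEGER PRIMARY KEY,
    push_time TEXT NOT NULL          -- time of the push, not of the test run
);
CREATE TABLE suite_runs (
    run_id  INTEGER PRIMARY KEY,
    push_id INTEGER NOT NULL REFERENCES pushes(push_id),
    suite   TEXT NOT NULL,
    status  TEXT NOT NULL            -- 'passed', 'failed', 'skipped', ...
);
""")

# One push whose mochitest suite ran twice (retriggered to chase an
# intermittent failure) and whose reftest suite was skipped entirely.
conn.execute("INSERT INTO pushes VALUES (1, '2010-09-17T10:00')")
conn.executemany("INSERT INTO suite_runs VALUES (?, ?, ?, ?)", [
    (1, 1, "mochitest", "failed"),
    (2, 1, "mochitest", "passed"),
    (3, 1, "reftest", "skipped"),
])

# Failure frequency must count runs, not pushes.
runs, fails = conn.execute(
    "SELECT COUNT(*), SUM(status = 'failed') FROM suite_runs "
    "WHERE suite = 'mochitest'").fetchone()
print(runs, fails)  # 2 1
```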

We also discussed the sources of existing data about failures: Tinderbox logs, "star" comments attached to the logs, and bug comments created by TBPLbot (example) when a bug number in a "star" comment matches a bug number that had been suggested based on its summary. Each source of data has its own types of noise and gaps.

A turning point in the war on orange

Friday, July 9th, 2010

Mozilla now runs over a million tests on each checkin. We're consistently including tests with new features, and many old features now have tests as well. We're running tests on multiple versions of Windows. We've upped the ante by considering assertion failures and memory leaks to be test failures. We're testing things previously thought untestable, on every platform, on every checkin.

One cost of running so many tests is that a few tests that each fail 1% of the time can quickly add up to 3-5 intermittent failures per checkin. Historically, this has been a major source of pain for Mozilla developers, who are required to identify all oranges before and after checking in.

Ehsan and I have pretty much eliminated the difficulty of starring intermittent failures on Tinderbox. Ehsan's assisted starring feature for TinderboxPushlog was a breakthrough and keeps getting better. The orange almost stars itself now. The public data fairy lives.

I'm only aware of two frequent oranges that are difficult to star, and we have fixes in hand for both.

But we should not forget the need to reduce the number of intermittent failures now that they are easy to ignore. They're still an annoyance, and many of them are real bugs in Firefox.

What makes it hard to diagnose and fix intermittent failures in Firefox's automated tests? Let's fix these remaining unnecessary difficulties.

Assertion stacks on Tinderbox

Monday, June 28th, 2010

Logs from Mozilla automated tests often include assertion failures. Now, on Linux and 32-bit Mac, the logs also include stack traces for those assertion failures. You can see an example assertion stack from a recent Tinderbox log.

When a debug build of Firefox hits a non-fatal assertion, an in-process stack walker prints out libraries and offsets. A new Python script post-processes the stack trace, replacing the library+offset with function names and line numbers that it gets from Breakpad symbol files. (Tinderbox strips native symbols from binaries, so the old scripts using atos/addr2line don't work on Tinderbox.)
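The real script lives in the Mozilla tree; here is a simplified sketch of the core lookup, assuming Breakpad's text symbol format, where function records look like `FUNC address size param_size name`:

```python
import bisect

def load_funcs(sym_lines):
    """Parse FUNC records from a Breakpad .sym file into a sorted list
    of (start_address, size, name) tuples."""
    funcs = []
    for line in sym_lines:
        if line.startswith("FUNC "):
            _, addr, size, _param, name = line.split(" ", 4)
            funcs.append((int(addr, 16), int(size, 16), name.strip()))
    funcs.sort()
    return funcs

def symbolize(funcs, offset):
    """Map a library offset to a function name, or None if no FUNC covers it."""
    starts = [f[0] for f in funcs]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0:
        start, size, name = funcs[i]
        if start <= offset < start + size:
            return name
    return None

# Hypothetical symbol data for illustration.
funcs = load_funcs([
    "FUNC 1000 40 0 NS_DebugBreak",
    "FUNC 1040 80 0 nsDocShell::LoadURI",
])
print(symbolize(funcs, 0x1050))  # nsDocShell::LoadURI
```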

The new script was added in bug 570287 and now runs on Linux64, Linux32, and Mac32 Tinderboxen. It will work on Mac64 soon. It could work on Windows if someone brave were to dive into nsStackWalk.cpp and improve its baseline output on Windows.

Better error summaries on Tinderbox

Sunday, June 13th, 2010

I recently landed a fix so that when Firefox crashes or hangs on Tinderbox, the error summary shows which test was running.

As we add more tests and platforms to Tinderbox, it's increasingly important for developers to be able to identify each test failure quickly and accurately. Good error summaries make assisted starring of random oranges possible, which greatly reduces the pain induced by intermittent failures. Good error summaries also make it possible to track which failures are most frequent, and therefore concentrate on fixing the most important ones.

Error summaries for crashes could be further improved by showing what kind of process crashed and a signature based on the stack trace.
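A signature could be built from the top few stack frames, skipping the generic crash-handling frames, the way crash-reporting systems commonly do. A hypothetical sketch (the skip list and frame names are illustrative):

```python
# Frames that say nothing about the cause and should be skipped
# (an illustrative list, not Mozilla's actual skip list).
BORING = {"abort", "raise", "NS_DebugBreak"}

def crash_signature(frames, depth=3):
    """Build a short signature from the top `depth` interesting frames."""
    interesting = [f for f in frames if f not in BORING]
    return " | ".join(interesting[:depth])

frames = ["raise", "abort", "nsDocShell::Destroy", "nsWebShell::Destroy",
          "DocumentViewerImpl::Destroy", "main"]
print(crash_signature(frames))
# nsDocShell::Destroy | nsWebShell::Destroy | DocumentViewerImpl::Destroy
```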

I'd also like to see better error summaries for memory leaks, compiler errors, python errors, and failures in smaller build steps.

If you see other error summaries on Tinderbox that could be improved, please file bugs. It's an easy way to help Mozilla scale across branches, and it's cheaper than cloning philor.

Continuous integration at Mozilla

Thursday, February 19th, 2009

Mozilla's poor continuous-integration story is a major source of stress and wasted time for Mozilla developers. Andreas Gal, for example, recently lost two consecutive weekends tracking down test failures. The story is all too common:

  • Numb to everyday "random orange" from intermittent failures, nobody noticed the new test failures until several had accumulated.
  • Many patches occupied each regression window, due to long test cycle times combined with many developers sharing few opportunities to land.
  • Backing out the potentially-responsible changesets one at a time required merging and waiting through the long cycles again.

Fixing these problems will require more than incremental improvements to the Tinderbox display. It will require, at the very least, rethinking how we organize and visualize the data coming off of our build machines. It may even require changes to the way we allocate build machines and check in code.

Recent work

Over the last few months, there has been an explosion of client-side Tinderbox modifications and mashups. Here's what we have now:

Other people in the Mozilla community are also interested in improving the continuous integration experience:

Why we load Tinderbox

  • Developers
    • Can I pull safely?
    • Can I push now?
    • Did I avoid breaking the tree so far?
    • Can I go to sleep now?
  • Sheriffs / firefighters
    • Is any column missing?
    • Has anything gotten slower?
    • What patch caused that slowdown?
    • What patch caused that test failure?
    • How do I get the tree green again?
  • Build engineers
    • Is any machine missing or unhealthy?
    • Which columns have long cycle times?
    • What changes needed clobbers, indicating makefile bugs?
  • Test engineers
    • What tests do we have?
    • What tests are especially good at catching real bugs?
    • What tests are we skipping?
    • What tests are unreliable?
    • What tests are unreasonably slow, and maybe not worth running on every build if they can't be sped up?

Tinderbox tries to answer all these questions with a single <table>. As a result, it answers none of them well.

We can design better systems for answering each of these questions, but first, I'd like to present an idea for eliminating the need to ask the first four questions on the developers' list.


By integrating the Try concept directly into the way we work, we can make the answer to the first four questions always be yes. Under this proposal, instead of pushing to mozilla-central, everyone pushes to a server that runs all the tests and, if the tests pass, automatically pushes the changes to mozilla-central.

Then it is always ok to pull or push, because mozilla-central is always green. If my patch breaks something, it's immediately clear that it was my patch, and I don't have to be around to back it out. I haven't wasted anyone else's time or prevented anyone else from checking in.

When there are multiple pending changesets, the automatic push should include a rebase (or hg merge, but that's probably more painful). If my patch's rebase fails, my patch doesn't end up on mozilla-central, but I claim that the occasional automatic-merge failure (requiring additional action from only one developer) is less painful than the frequent backouts we have now.

Developers should be warned up-front if a patch will require manual merging in the case that other pending changesets succeed. We should have the option of basing changes on any pending changeset in addition to mozilla-central, with the caveat that if the base fails, our changeset will not go in either. And of course, we should still be able to use Try to test a changeset without intending for it to go onto mozilla-central immediately.
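The gatekeeper loop this proposal implies could look something like the following sketch, where `run_tests`, `rebase_onto`, and `push_to_central` are hypothetical stand-ins for the real infrastructure:

```python
def process_queue(pending, central_tip, run_tests, rebase_onto, push_to_central):
    """Land each pending changeset on mozilla-central only if it rebases
    cleanly onto the current tip and passes the full test suite."""
    results = {}
    for cset in pending:
        rebased = rebase_onto(cset, central_tip)
        if rebased is None:                        # automatic merge failed:
            results[cset] = "needs-manual-merge"   # only this author acts
            continue
        if run_tests(rebased):
            central_tip = push_to_central(rebased)  # tree stays green
            results[cset] = "landed"
        else:
            results[cset] = "rejected"   # never reaches mozilla-central
    return results

# Toy stand-ins: changeset "b" fails tests, "c" fails to rebase.
res = process_queue(
    ["a", "b", "c"], "tip0",
    run_tests=lambda c: c != "b+",
    rebase_onto=lambda c, tip: None if c == "c" else c + "+",
    push_to_central=lambda c: c)
print(res)  # {'a': 'landed', 'b': 'rejected', 'c': 'needs-manual-merge'}
```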

Reducing cycle time

We can decrease cycle times drastically by splitting up the testing work more sanely. Instead of having four unit test machines each testing a different revision (reporting to the same column), have each machine run a quarter of the tests. When the tree is calm, results will be available in a quarter of the current amount of time; when the tree is busy, it won't be any slower.
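Splitting a suite into roughly equal chunks is straightforward; a sketch (a real scheduler would weight chunks by each test's runtime, which this ignores):

```python
def chunk_tests(tests, n_machines):
    """Partition tests into n_machines contiguous, roughly equal chunks."""
    per, extra = divmod(len(tests), n_machines)
    chunks, start = [], 0
    for i in range(n_machines):
        size = per + (1 if i < extra else 0)  # spread the remainder
        chunks.append(tests[start:start + size])
        start += size
    return chunks

tests = [f"test_{i}" for i in range(10)]
print(chunk_tests(tests, 4))  # chunk sizes: 3, 3, 2, 2
```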

The cycle time goal for most machines should be the same. Otherwise, we're using our resources inefficiently, because a checkin isn't done cycling until all the tests are finished. We can make exceptions for especially slow tests, such as Valgrind, static analysis, or code coverage, but these slow tests shouldn't be allowed to turn the tree orange in the same way as other tests.

Finding performance regressions

Don't make me load 53 graphs, each of which takes 6 seconds to load. Don't even make me eyeball 20 graphs. Instead, just tell me if I made Firefox slower. (Also tell me if I increased the variance enough to make it hard to spot future slowdowns.)

Using an algorithm frees us to track more performance data, such as the time taken by each component of SunSpider. If a patch makes a single SunSpider file 5% slower, that change might be lost in the noise of the total SunSpider score, but obvious from looking at just that one file.

An algorithm might be slightly worse or slightly better than a human at noticing numeric changes, but it's a hell of a lot more patient.
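One simple form such an algorithm could take: compare recent runs against a baseline window, flagging both mean shifts and variance increases. A sketch with illustrative, untuned thresholds:

```python
from statistics import mean, stdev

def check_regression(baseline, recent, slow_pct=3.0, var_ratio=2.0):
    """Flag a slowdown if the recent mean exceeds the baseline mean by
    slow_pct percent, and a noise regression if the standard deviation
    grows by var_ratio. Thresholds are illustrative, not tuned."""
    issues = []
    if mean(recent) > mean(baseline) * (1 + slow_pct / 100):
        issues.append("slower")
    if stdev(recent) > stdev(baseline) * var_ratio:
        issues.append("noisier")
    return issues

baseline = [100, 101, 99, 100, 100, 101]
print(check_regression(baseline, [106, 105, 107, 106]))  # ['slower']
print(check_regression(baseline, [100, 95, 106, 99]))    # ['noisier']
```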

Understanding test failures

Make it easier to see which test failed. The "short log" isn't really short, and even the summary often includes extraneous information. Just tell me which tests failed and show the log from the execution of those tests.

Ted Mielczarek is working on making Tinderboxen produce stack traces for crashes. This will make it possible to debug many issues quickly, even if we can't reproduce them. Samples for hangs would be useful in the same way.

It would be great if at least some of the boxes used record-and-replay to make it possible to debug other issues, including timing issues such as non-hang timeouts and race conditions. Ideally, I'd be able to replay in a VM on my own machine and debug the issue as it unfolds.

Fixing unreliable tests

Random oranges don't have to happen. If IMVU can eliminate random oranges while testing with Internet Explorer, we can do it while testing our own products.

The first step to fixing unreliable tests is finding out which tests are unreliable. Currently, it takes a very observant sheriff to notice that the same test has failed twice in a week. Even then, we can't search to see when the test began failing, so we don't know whether to disable the test or hunt for a regressing bug. We need a searchable database of test failures.

The next step is figuring out whether the test or the code is unreliable. For now, the answer is often "mark the test as 'skip' and hope we can figure it out later", which is better than distracting everyone with random orange, but not ideal. Again, record-and-replay is probably the only reliable way to find out.

Another approach to fixing unreliable tests is to use tools designed to hunt for unreliability. Valgrind's memcheck can find the uninitialized-variable and use-after-free bugs that only lead to crashes occasionally. Valgrind's helgrind can detect many types of race conditions. Unless we build our own botnet, Valgrind tests will be too slow to be allowed to turn the tree orange, but they will give us insight into some types of bugs that cause random failures on normal test machines.

Design lunch

It's going to take a lot of work from designers and engineers to make continuous integration work well at Mozilla's new scale, but I think the potential payoff in increased developer productivity makes it worth the effort to get this stuff right.

I've only proposed ways to answer some of the questions more efficiently. I'm just an observer -- not really a Mozilla developer, and definitely not a build engineer -- so I might have missed some important points.

John O’Duinn will host a design lunch on this topic tomorrow (Thursday, February 19, noon in California).


Saturday, February 23rd, 2008

The Firefox Tinderbox has been unmanageably wide lately. I wrote a Greasemonkey script, TidyBox, to fix it by moving build results from the table cells to popups that appear when hovering the table cells.

Looking at a screenshot with TidyBox, it's easy to see that exactly one box is orange and that the orange started after the last checkin. With the normal Tinderbox display at the same time, you would probably have to scroll both horizontally and vertically to figure that out.

If you want to see the information about a build while using TidyBox, just hover over the cell. To click links that appear in the popup, click the cell to lock the popup in place and then click the link.

Install TidyBox today and you might never have to scroll Tinderbox again!

Other recent efforts to improve Tinderbox: