Archive for the 'Mozilla' Category

Fuzzing in the pool

Tuesday, November 23rd, 2010

In mid-2009, John O'Duinn offered to let my DOM fuzzer run on the same pool of machines as Firefox regression tests. I'd have an average of 20 computers running my fuzzer across a range of operating systems, and I wouldn't have to maintain the computers. All I had to do was tweak my script to play nicely with the scheduler, and not destroy the machines.

Playing nicely with the scheduler

Counter-intuitively, to maximize the amount of fuzzing, I had to minimize the duration of each fuzz job. The scheduler tries to avoid delays in the regression test jobs so developers don't go insane watching the tree. A low-priority job will be allowed to start much more often if it only takes 30 minutes.

Being limited to 30 minutes means the fuzz jobs don't have time to compile Firefox. Instead, fuzz jobs have to download Tinderbox builds like the regression test jobs do. I fixed several bugs in mozilla-central to make Tinderbox builds work for fuzzing.

I also modified the testcase reducer to split its work into 30-minute jobs. If the fuzzer finds a bug and the reducer takes longer than 30 minutes, it uploads the partially-reduced testcase, along with the reduction algorithm's state, for a subsequent job to continue reducing. To avoid race conditions between uploading and downloading, I upload each file under a temporary name and then rename it into place with an "ssh mv" command.
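The "ssh mv" trick above boils down to a write-then-rename pattern: readers only ever look for the final name, so they never see a half-written file. This is a minimal local-filesystem sketch of that pattern (the real setup would copy the file to the remote host and run mv there over ssh; the function name is my own):

```python
import os
import tempfile

def atomic_upload(data: bytes, dest_path: str) -> None:
    """Write to a temporary name in the destination directory, then
    rename into place. Downloaders polling for dest_path never observe
    a partially-written file. Over ssh, the equivalent would be:
        scp state.tmp user@host:dir/state.tmp
        ssh user@host 'mv dir/state.tmp dir/state'
    """
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.rename(tmp_path, dest_path)  # atomic within one POSIX filesystem
```

The temporary file must live in the same directory (or at least the same filesystem) as the destination, since rename is only atomic within a filesystem.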

Not destroying the test slaves

I wasn't trying to fill up the disks on the test slaves, really!

Early versions of my script filled up /tmp. I had incorrectly assumed that /tmp would be cleared on each reboot. Luckily, Nagios caught this before it caused serious damage.

Due to a security bug in some debug builds of Firefox, the fuzzer created randomly-named files in the home directory. This security bug has been fixed, but I'm afraid RelEng will be finding files named "undefined" and "[object HTMLBodyElement]" for a while.

By restarting Firefox frequently, fuzzing accelerated the creation of gigantic console.log files on the slaves. We're trying to figure out whether to make debug-Firefox not create these files or make BuildBot delete them.

Results so far

Running in the test pool gets me a variety of operating systems. The fuzzer currently runs on Mac32 (10.5), Mac64 (10.6), Linux32, Linux64, and Win32. This allowed me to find a 64-bit-only bug and a Linux-only bug in October. Previously, I had mostly been testing on Mac.

The extra computational power also makes a difference. I can find regressions more quickly (which developers appreciate) and find harder-to-trigger bugs (which developers don't appreciate quite as much). I also get faster results when I change the fuzzer, such as the two convoluted testcases I got shortly after I added document.write fuzzing.

Unexpectedly, getting quick results from fuzzer changes makes me more inclined to tweak and improve the fuzzer. I know that the change will still be fresh in my mind when I learn about its effects. This may turn out to be the most important win.

With cross-platform testing and the boost to agility, I suddenly feel a lot closer to being able to share and release the fuzzer.

How my DOM fuzzer ignores known bugs

Sunday, November 21st, 2010

When my DOM fuzzer finds a new bug, I want it to make a reduced testcase and notify me so I can file a bug report. To keep it from wasting time finding duplicates of known bugs, I maintain several ignore lists:

Some bugs are harder to distinguish based on output. In those cases, I use suppressions based on the fuzzer-generated input to Firefox:

Fixing any bug on those lists improves the fuzzer's ability to find additional bugs. But I'd like to point out a few that I'd especially like fixed:

In rare cases, I'll temporarily tell the fuzzer to skip a feature entirely:

Several bugs interfere with my ability to distinguish one failure from another. Luckily, they're all platform-specific, so they don't prevent me from finding cross-platform bugs.

  • Bug 610311 makes it difficult to distinguish crashes on Linux, so I ignore crashes there.
  • Bug 612093 makes it difficult to distinguish PR_Asserts and abnormal exits on Windows. (It's fixed in NSPR and needs to be merged to mozilla-central.)
  • Bug 507876 makes it difficult to distinguish too-much-recursion crashes on Mac. (But I don't currently know of any, so I'm not ignoring them at the moment!)
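The output-based suppressions described above can be sketched as a simple signature check: if every substring for some known bug appears in Firefox's output, the fuzzer skips reduction and moves on. The signature strings below are invented for illustration; they are not the real ignore-list entries:

```python
# Hypothetical ignore list: maps a known-bug identifier to substrings
# that appear in Firefox's output when that bug is hit. The real lists
# also include suppressions keyed on the fuzzer-generated input.
OUTPUT_SUPPRESSIONS = {
    "bug 612093": ["PR_Assert", "abnormal exit"],   # made-up signature
    "bug 507876": ["too much recursion"],           # made-up signature
}

def known_bug(output, suppressions):
    """Return the first known bug whose signature substrings all occur
    in the output, or None if this looks like a new bug worth reducing
    and filing."""
    for bug, needles in suppressions.items():
        if all(n in output for n in needles):
            return bug
    return None
```

Fixing a bug on the list means its entry can be removed, which (as noted above) widens the set of new bugs the fuzzer can distinguish.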

Detecting leak-until-shutdown bugs

Sunday, November 14th, 2010

Most of Mozilla's leak-detection tools work on the premise that when the application exits, no objects should remain. This strategy finds many types of leak bugs: I've used tools such as trace-refcnt to find over a hundred. But it misses bugs where an object lives longer than it should.

The worst of these are bugs where an object lives until shutdown, but is destroyed during shutdown. These leaks affect users as much as any other leak, but most of our tools don't detect them.

After reading about an SVG leak-until-shutdown bug that the traditional tools missed, I wondered if I could find more bugs of that type.

A new detector

I started with the premise that if I close all my browser windows (but open a new one so Firefox doesn't exit), the number of objects held alive should not depend on what I did in the other windows. I retrofitted my DOM fuzzer with a special exit sequence:

  1. Open a new, empty window
  2. Close all other windows
  3. Wait until memory use stabilizes
  4. Count the remaining objects (should be constant)
  5. Continue with the normal shutdown sequence
  6. Count the remaining objects (should be 0)

If the first count of remaining objects depends on what I did earlier in the session, and the second count is 0, I've probably found a leak-until-shutdown bug.

To reduce noise, I had to disable the XUL cache and restrict the counting to nsGlobalWindow and nsDocument objects. On Linux, I normally count 4 nsGlobalWindows and 4 nsDocuments.
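The comparison at the heart of the detector is simple: anything above the known-good baseline is a leak-until-shutdown suspect. A sketch, using the Linux baseline counts mentioned above (the function name and shape of the counts are my own invention):

```python
# Baseline counts after closing all windows on Linux, per the text
# above: 4 nsGlobalWindows and 4 nsDocuments remain in a clean session.
BASELINE = {"nsGlobalWindow": 4, "nsDocument": 4}

def leak_suspects(counts, baseline=BASELINE):
    """Given per-class object counts taken after step 4 of the exit
    sequence, return how many objects of each tracked class survived
    beyond the baseline -- likely leak-until-shutdown candidates."""
    return {cls: n - baseline.get(cls, 0)
            for cls, n in counts.items()
            if n > baseline.get(cls, 0)}
```

If this returns a non-empty dict and the final post-shutdown count is 0, the session probably hit a leak-until-shutdown bug.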

So far, I've found two bugs where additional objects remain:

I'm glad we found the <video> leak before shipping Firefox 4!

Note that this tool can't find all types of leaks. It won't catch leak-until-page-close bugs or other leaks with relatively short lifetimes. It can't tell you if a cache is misbehaving or if cycle collection isn't being run often enough.

Next steps

Depending on how promising we think this approach is, we could:

  • Use it in more types of testing
    • Package it into a more user-friendly extension for Firefox debug builds
    • Make it a regular part of fuzzing
    • Use it for regression tests
  • Add something to Gecko that's similar but less kludgy
  • Expand the classes it will complain about
  • Debug the flakiness with smaller objects
  • Make the XUL cache respond to memory-pressure notifications

It's also possible that DEBUG_CC, and in particular its "expected to be garbage" feature, will prove itself able to find a superset of leaks that my tool can find.

War on Orange update

Friday, September 17th, 2010

Clint Talbert organized a meeting today on the topic of intermittent failures. It was well-attended by members of the Automation, Metrics, and Platform teams, but we forgot to invite the Firefox front-end team.

There was some discussion of culture and policy around intermittent failures. For example, David Baron promoted the idea of estimating regression ranges for intermittent failures, and backing out patches suspected of causing the failures. But most of the meeting focused on metrics and tools.

Joel Maher demonstrated Orange Factor, which calculates the average number of intermittent failures per push. It shows that the average number of oranges dropped from 5.5 in August to 4.5 in September.
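The metric itself is just intermittent failures divided by pushes; a minimal sketch (my own function name, not Orange Factor's actual code):

```python
def orange_factor(num_failures, num_pushes):
    """Average number of intermittent failures (oranges) per push
    over some date range."""
    if num_pushes == 0:
        raise ValueError("no pushes in range")
    return num_failures / num_pushes
```

By this measure, 45 oranges across 10 pushes gives the 4.5 figure quoted for September.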

Daniel Einspanjer is designing a database for storing information about Tinderbox failures. He wants to know the kinds of queries we will run so he can make the database efficient for common queries. Jeff Hammel, Jonathan Griffin, and Joel Maher will be working on a new dashboard with him.

Two key points were raised about the database. The first is that people querying "by date" are usually interested in the time of the push, not the time the test suite started running. There was some discussion of whether we need to take the branchiness of the commit DAG into account, or whether we can stick with the linearity of pushes to each central repository.

The second key point is that we don't consistently have one test result per test and push. We might have skipped the test suite because the infrastructure was overloaded, or because someone else pushed right away. Another failure (intermittent or not) might have broken the build or made an earlier test in the suite cause a crash. Contrariwise, we might have run a test suite multiple times for a single push in order to help track down intermittent failures! The database needs to capture this information in order to estimate regression ranges and failure frequencies accurately.
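One way to model the "zero or many results per test and push" point above is to key results by (push, test) and store a list, so a skipped suite and a retriggered suite are both representable. A sketch with hypothetical names, not the schema Daniel is designing:

```python
from collections import defaultdict

# results[(push_id, test_name)] -> list of outcomes. A test may have
# zero entries (suite skipped or build broken) or several entries
# (suite retriggered to chase an intermittent failure).
results = defaultdict(list)

def record(push_id, test_name, outcome):
    results[(push_id, test_name)].append(outcome)

def failure_rate(push_id, test_name):
    """Observed failure frequency for one test on one push, or None if
    the test never ran there -- which is not the same as a 0% rate."""
    runs = results.get((push_id, test_name))
    if not runs:
        return None
    return runs.count("fail") / len(runs)
```

Distinguishing "never ran" from "ran and passed" is exactly what lets regression-range estimates skip pushes with no data instead of treating them as green.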

We also discussed the sources of existing data about failures: Tinderbox logs, "star" comments attached to the logs, and bug comments created by TBPLbot (example) when a bug number in a "star" comment matches a bug number that had been suggested based on its summary. Each source of data has its own types of noise and gaps.

Untrusted text in security dialogs

Wednesday, July 14th, 2010

I just gave a 10-minute lightning talk at SOUPS on the topic of untrusted text in security dialogs.

I've been reading Firefox security bug reports over the years, and I've collected a list of things that can go wrong in security dialogs. New security dialogs should be tested against these attacks, or preferably designed to not be dialogs.

Fuzzing talk at the Mozilla Summit

Wednesday, July 14th, 2010

At the 2010 Mozilla Summit, I talked about my JavaScript engine and DOM fuzzers, which have each found many hundreds of bugs. I also talked about the automations that keep me sane when I fuzz these complex components.

My slides are in the S5 web-based presentation format. You can click the Ø button to view the presentation in "handout mode" and see what I planned to say while each slide was up.

I shared a presentation slot with Mozilla contractor Paul Nickerson, who has a separate slide deck. He wisely saved the best part of his talk for the end: a demo of his font fuzzer causing Windows 7 to blue-screen.

A turning point in the war on orange

Friday, July 9th, 2010

Mozilla now runs over a million tests on each checkin. We're consistently including tests with new features, and many old features now have tests as well. We're running tests on multiple versions of Windows. We've upped the ante by considering assertion failures and memory leaks to be test failures. We're testing things previously thought untestable, on every platform, on every checkin.

One cost of running so many tests is that a few tests that each fail 1% of the time can quickly add up to 3-5 intermittent failures per checkin. Historically, this has been a major source of pain for Mozilla developers, who are required to identify all oranges before and after checking in.

Ehsan and I have pretty much eliminated the difficulty of starring intermittent failures on Tinderbox. Ehsan's assisted starring feature for TinderboxPushlog was a breakthrough and keeps getting better. The orange almost stars itself now. The public data fairy lives.

I'm only aware of two frequent oranges that are difficult to star, and we have fixes in hand for both.

But we should not forget the need to reduce the number of intermittent failures now that they are easy to ignore. They're still an annoyance, and many of them are real bugs in Firefox.

What makes it hard to diagnose and fix intermittent failures in Firefox's automated tests? Let's fix these remaining unnecessary difficulties.

Assertion stacks on Tinderbox

Monday, June 28th, 2010

Logs from Mozilla automated tests often include assertion failures. Now, on Linux and 32-bit Mac, the logs also include stack traces for those assertion failures. You can see an example assertion stack from a recent Tinderbox log.

When a debug build of Firefox hits a non-fatal assertion, an in-process stack walker prints out libraries and offsets. A new Python script post-processes the stack trace, replacing the library+offset with function names and line numbers that it gets from Breakpad symbol files. (Tinderbox strips native symbols from binaries, so the old scripts using atos/addr2line don't work on Tinderbox.)
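The core of such post-processing is a lookup from a library offset to a function name using the Breakpad `.sym` text format, whose `FUNC` records are lines of the form `FUNC <address> <size> <param_size> <name>` with hex fields. This is an illustrative sketch under that assumption, not the actual script from bug 570287:

```python
def load_functions(sym_text):
    """Parse FUNC records from a Breakpad .sym file into a list of
    (start, size, name) tuples, sorted by start address."""
    funcs = []
    for line in sym_text.splitlines():
        if line.startswith("FUNC "):
            # FUNC <address> <size> <param_size> <name, may contain spaces>
            _, addr, size, _param, name = line.split(" ", 4)
            funcs.append((int(addr, 16), int(size, 16), name))
    return sorted(funcs)

def symbolicate(funcs, offset):
    """Map a library offset (from the in-process stack walker) to a
    function name, or leave it as a raw hex offset if no FUNC record
    covers it."""
    for start, size, name in funcs:
        if start <= offset < start + size:
            return name
    return hex(offset)
```

A real script would also pick the right `.sym` file per library, use the `FILE` and line records to recover source positions, and binary-search rather than scan.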

The new script was added in bug 570287 and now runs on Linux64, Linux32, and Mac32 Tinderboxen. It will work on Mac64 soon. It could work on Windows if someone brave were to dive into nsStackWalk.cpp and improve its baseline output on Windows.