GCC correctness fuzzing

November 3rd, 2010

In 2008 I wrote about generating random JavaScript to find differences between optimization modes and between JavaScript engines (rough list of bugs).

How do you do this kind of testing on a language like C, where the behavior of many programs is undefined per the spec? John Regehr explains how in his talk Exposing Difficult Compiler Bugs With Random Testing at GCC Summit 2010.
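The trick is to generate only programs whose behavior is fully defined, so that any disagreement between optimization levels (or between compilers) must be a compiler bug. Here's a minimal sketch of that loop, assuming a generator like Csmith is on the PATH and glossing over details such as the Csmith runtime include path:

    import subprocess
    import sys

    def checksum_at(opt_level, source):
        # Build and run at one optimization level. Csmith programs print a
        # checksum of their global state, which serves as the test output.
        subprocess.run(["gcc", opt_level, source, "-o", "prog"], check=True)
        return subprocess.run(["./prog"], capture_output=True, text=True,
                              timeout=60).stdout

    for i in range(1000):
        with open("random.c", "w") as f:
            # Csmith emits a random C program with no undefined behavior.
            f.write(subprocess.run(["csmith"], capture_output=True,
                                   text=True).stdout)
        checksums = {opt: checksum_at(opt, "random.c")
                     for opt in ("-O0", "-O1", "-O2", "-O3")}
        if len(set(checksums.values())) != 1:
            sys.exit("iteration %d: optimization levels disagree: %r"
                     % (i, checksums))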

War on Orange update

September 17th, 2010

Clint Talbert organized a meeting today on the topic of intermittent failures. It was well-attended by members of the Automation, Metrics, and Platform teams, but we forgot to invite the Firefox front-end team.

There was some discussion of culture and policy around intermittence. For example, David Baron promoted the idea of estimating regression ranges for intermittent failures, and backing out patches suspected of causing the failures. But most of the meeting focused on metrics and tools.

Joel Maher demonstrated Orange Factor, which calculates the average number of intermittent failures per push. It shows that the average number of oranges dropped from 5.5 in August to 4.5 in September.
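The metric itself is just an average. A sketch of the computation (the data shape here is an assumption for illustration, not Orange Factor's actual API):

    def orange_factor(orange_counts):
        # One intermittent-failure count per push in the time window.
        return sum(orange_counts) / len(orange_counts)

    # For example, 45 oranges across 10 pushes gives an Orange Factor of 4.5.
    print(orange_factor([5, 4, 6, 3, 5, 4, 4, 5, 4, 5]))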

Daniel Einspanjer is designing a database for storing information about Tinderbox failures. He wants to know the kinds of queries we will run so he can make the database efficient for common queries. Jeff Hammel, Jonathan Griffin, and Joel Maher will be working on a new dashboard with him.

Two key points were raised about the database. The first is that people querying "by date" are usually interested in the time of the push, not the time the test suite started running. There was some discussion of whether we need to take the branchiness of the commit DAG into account, or whether we can stick with the linearity of pushes to each central repository.

The second key point is that we don't consistently have one test result per test and push. We might have skipped the test suite because the infrastructure was overloaded, or because someone else pushed immediately afterward. Another failure (intermittent or not) might have broken the build or made an earlier test in the suite crash. Contrariwise, we might have run a test suite multiple times for a single push in order to help track down intermittent failures! The database needs to capture all of this in order to estimate regression ranges and failure frequencies accurately.
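A schema that captures both points might look something like the sketch below (table and column names are invented for illustration; this is not Daniel's actual design). Test results hang off a suite run, each run points at the push it covers, and "by date" queries go through the push timestamp:

    import sqlite3

    db = sqlite3.connect("oranges.db")
    db.executescript("""
    CREATE TABLE pushes (
        push_id   INTEGER PRIMARY KEY,
        repo      TEXT NOT NULL,      -- e.g. 'mozilla-central'
        changeset TEXT NOT NULL,
        push_time INTEGER NOT NULL    -- what 'by date' queries should use
    );
    CREATE INDEX pushes_by_time ON pushes (push_time);

    -- Zero, one, or several runs of a suite can exist for a single push.
    CREATE TABLE suite_runs (
        run_id     INTEGER PRIMARY KEY,
        push_id    INTEGER NOT NULL REFERENCES pushes,
        suite      TEXT NOT NULL,
        start_time INTEGER NOT NULL,  -- when the suite ran, not the push time
        status     TEXT NOT NULL      -- 'completed', 'skipped', 'build-busted'
    );

    CREATE TABLE test_results (
        run_id  INTEGER NOT NULL REFERENCES suite_runs,
        test    TEXT NOT NULL,
        outcome TEXT NOT NULL         -- 'pass', 'fail', 'crash', 'not-run'
    );
    """)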

We also discussed the sources of existing data about failures: Tinderbox logs, "star" comments attached to the logs, and bug comments created by TBPLbot (example) when a bug number in a "star" comment matches one of the bugs suggested from the failure summary. Each source of data has its own types of noise and gaps.
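The TBPLbot matching rule is simple to state in code (function and variable names here are invented for illustration):

    import re

    def bugs_to_comment_in(star_comment, suggested_bug_ids):
        # Comment in a bug only when a number the sheriff typed in the
        # "star" comment was also among the bugs suggested from the failure
        # summary; numbers matching neither list are treated as noise.
        starred = {int(n) for n in re.findall(r"\b\d{4,7}\b", star_comment)}
        return starred & set(suggested_bug_ids)

    print(bugs_to_comment_in("intermittent, bug 570287", {570287, 123456}))
    # -> {570287}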

Untrusted text in security dialogs

July 14th, 2010

I just gave a 10-minute lightning talk at SOUPS on the topic of untrusted text in security dialogs.

I've been reading Firefox security bug reports over the years, and I've collected a list of things that can go wrong in security dialogs. New security dialogs should be tested against these attacks, or preferably designed not to be dialogs at all.

Fuzzing talk at the Mozilla Summit

July 14th, 2010

At the 2010 Mozilla Summit, I talked about my JavaScript engine and DOM fuzzers, which have each found many hundreds of bugs. I also talked about the automations that keep me sane when I fuzz these complex components.

My slides are in the S5 web-based presentation format. You can click the Ø button to view the presentation in "handout mode" and see what I planned to say while each slide was up.

I shared a presentation slot with Mozilla contractor Paul Nickerson, who has a separate slide deck. He wisely saved the best part of his talk for the end: a demo of his font fuzzer causing Windows 7 to blue-screen.

A turning point in the war on orange

July 9th, 2010

Mozilla now runs over a million tests on each checkin. We're consistently including tests with new features, and many old features now have tests as well. We're running tests on multiple versions of Windows. We've upped the ante by considering assertion failures and memory leaks to be test failures. We're testing things previously thought untestable, on every platform, on every checkin.

One cost of running so many tests is that a few hundred tests that each fail 1% of the time add up to 3-5 intermittent failures per checkin. Historically, this has been a major source of pain for Mozilla developers, who are required to identify all oranges before and after checking in.
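A quick back-of-the-envelope check of that claim:

    # If N tests each fail independently 1% of the time, a checkin expects
    # N * 0.01 oranges. N = 400 is an assumed count, purely for illustration.
    flaky_tests = 400
    failure_rate = 0.01
    print(flaky_tests * failure_rate)  # 4.0 expected oranges, in the 3-5 range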

Ehsan and I have pretty much eliminated the difficulty of starring intermittent failures on Tinderbox. Ehsan's assisted starring feature for TinderboxPushlog was a breakthrough and keeps getting better. The orange almost stars itself now. The public data fairy lives.

I'm only aware of two frequent oranges that are difficult to star, and we have fixes in hand for both.

But we should not forget the need to reduce the number of intermittent failures now that they are easy to ignore. They're still an annoyance, and many of them are real bugs in Firefox.

What makes it hard to diagnose and fix intermittent failures in Firefox's automated tests? Let's fix the remaining unnecessary difficulties.

Assertion stacks on Tinderbox

June 28th, 2010

Logs from Mozilla automated tests often include assertion failures. Now, on Linux and 32-bit Mac, the logs also include stack traces for those assertion failures. You can see an example assertion stack from a recent Tinderbox log.

When a debug build of Firefox hits a non-fatal assertion, an in-process stack walker prints out libraries and offsets. A new Python script post-processes the stack trace, replacing each library+offset with a function name and line number that it gets from Breakpad symbol files. (Tinderbox strips native symbols from binaries, so the old scripts based on atos/addr2line don't work there.)
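In outline, the lookup works like the sketch below (a simplified reader that handles only FUNC records; real Breakpad symbol files also carry FILE, PUBLIC, and source-line records, which is where the line numbers come from):

    import bisect

    def load_funcs(sym_path):
        # Parse 'FUNC <address> <size> <param_size> <name>' records from a
        # Breakpad .sym file into a sorted list of (start, size, name).
        funcs = []
        with open(sym_path) as f:
            for line in f:
                if line.startswith("FUNC "):
                    _, addr, size, _, name = line.rstrip("\n").split(" ", 4)
                    funcs.append((int(addr, 16), int(size, 16), name))
        funcs.sort()
        return funcs

    def symbolicate(funcs, offset):
        # Map a library-relative offset (as printed by the in-process
        # stack walker) to the function containing it.
        i = bisect.bisect_right(funcs, (offset, float("inf"), "")) - 1
        if i >= 0:
            start, size, name = funcs[i]
            if start <= offset < start + size:
                return name
        return hex(offset)  # no symbol found; keep the raw offset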

The new script was added in bug 570287 and now runs on Linux64, Linux32, and Mac32 Tinderboxen. It will work on Mac64 soon. It could work on Windows if someone brave were to dive into nsStackWalk.cpp and improve its baseline output on Windows.

Better error summaries on Tinderbox

June 13th, 2010

I recently landed a fix so that when Firefox crashes or hangs on Tinderbox, the error summary shows which test was running.
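The idea is straightforward: the log processor remembers the most recent TEST-START line, so a crash or hang can be attributed to the test that was running. A sketch (the marker strings follow mochitest log conventions but are simplified here):

    def summarize(log_lines):
        # Remember the last test that started; attribute any crash or hang
        # marker to it.
        current = "(no test running)"
        for line in log_lines:
            if line.startswith("TEST-START") and "|" in line:
                current = line.split("|", 1)[1].strip()
            elif "PROCESS-CRASH" in line or "application timed out" in line:
                yield "TEST-UNEXPECTED-FAIL | %s | crash or hang" % current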

As we add more tests and platforms to Tinderbox, it's increasingly important for developers to be able to identify each test failure quickly and accurately. Good error summaries make assisted starring of random oranges possible, which greatly reduces the pain induced by intermittent failures. Good error summaries also make it possible to track which failures are most frequent, and therefore concentrate on fixing the most important ones.

Error summaries for crashes could be further improved by showing what kind of process crashed and a signature based on the stack trace.

I'd also like to see better error summaries for memory leaks, compiler errors, Python errors, and failures in smaller build steps.

If you see other error summaries on Tinderbox that could be improved, please file bugs. It's an easy way to help Mozilla scale across branches, and it's cheaper than cloning philor.

Simon Willison on phishing defense

March 2nd, 2010

If you want to stay safe from phishing and other forms of online fraud you need at least a basic understanding of a bewildering array of technologies—URLs, paths, domains, subdomains, ports, DNS, SSL as well as fundamental concepts like browsers, web sites and web servers. Misunderstand any of those concepts and you’ll be an easy target for even the most basic phishing attempts. It almost makes me uncomfortable encouraging regular people to use the web because I know they’ll be at massive risk to online fraud.

- Simon Willison