Integer overflows
Wednesday, November 1st, 2006
"What is a string library? It's a way to pretend that computers can manipulate strings just as easily as they can manipulate numbers."
-- Joel Spolsky, The Law of Leaky Abstractions.
Most C++ code uses the "integer mod 2^32 (or 2^64)" type that C++ calls "int" as if it were a true integer type. This is great for performance -- many operations on int32 are a single CPU instruction -- but dangerous for security and correctness when the numbers can be large. This can cause security holes in at least two ways.
First, code might use int32 arithmetic to decide how much memory to allocate. Consider an image decoder that allocates width * height * 4 bytes to store RGBA pixels and then decodes the image data into the structure. But since width and height are unsigned ints, it doesn't really allocate width*height*4 bytes; it allocates width*height*4 mod 2^32 bytes. If the integer used to decide how much memory to allocate has overflowed in such a way that it comes out as a small integer, the code is likely to overflow the buffer as it writes the decoded image into the structure.
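To make the wrap-around concrete, here is a minimal sketch of the size computation. The function names are hypothetical and not taken from any real decoder, and the sketch assumes the usual platform where unsigned int is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical helper: the size computation as a naive decoder might write it.
// Every operand is a 32-bit unsigned integer, so the result is reduced mod 2^32.
std::uint32_t naive_rgba_size(std::uint32_t width, std::uint32_t height) {
    return width * height * 4u;
}

// The same product computed in 64 bits, for comparison only. This is wide
// enough for the example values below, but it can still wrap for extreme inputs.
std::uint64_t wide_rgba_size(std::uint32_t width, std::uint32_t height) {
    return std::uint64_t{width} * height * 4u;
}
```

For a 65536 x 16385 image, wide_rgba_size returns 4295229440, which is 2^32 + 262144, while naive_rgba_size returns only 262144. A decoder that allocates the smaller number and then writes every pixel runs far past the end of its buffer.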
Second, code might use int32 arithmetic to decide when to deallocate an object. In code that uses reference counting, an extra call to "release" can obviously lead to a dangling pointer situation. But thanks to integer overflows, 2^32 unbalanced calls to "addref" followed by a normal "release" can have the same effect. (Luckily, you can't cause this situation by merely making 2^32 objects point to a specific object, because you'd run out of memory first. So this could be addressed by auditing for addref-without-release leak bugs rather than modifying the addref function to make it safer.)
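The reference-count wrap can be sketched with a toy class. Everything here is hypothetical and is not Gecko's or XPCOM's actual implementation. To keep the loop short, the demonstration uses an 8-bit counter, so 2^8 unbalanced addrefs stand in for the 2^32 that a 32-bit counter would need.

```cpp
#include <cstdint>

// Toy, deliberately unsafe reference count (hypothetical, for illustration).
template <typename Counter>
struct ToyRefCounted {
    Counter refs = 1;        // one legitimate owner at construction
    bool destroyed = false;  // stands in for "the memory was freed"

    // No overflow check: the count silently wraps around.
    void addref() { refs = static_cast<Counter>(refs + 1); }

    void release() {
        refs = static_cast<Counter>(refs - 1);
        if (refs == 0) destroyed = true;
    }
};

// Returns true if a single normal release destroys the object even though
// 256 other references were taken and never released.
bool wraps_into_dangling_pointer() {
    ToyRefCounted<std::uint8_t> obj;   // refs == 1
    for (int i = 0; i < 256; ++i) {    // 2^8 unbalanced addrefs
        obj.addref();
    }
    // The counter has wrapped back to 1; the extra holders are invisible to it.
    obj.release();                     // the owner's ordinary release
    return obj.destroyed;
}
```

wraps_into_dangling_pointer returns true: the object is torn down while the code paths that took the extra references still hold pointers to it.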
Explicit checks
Some code in Gecko has explicit checks to prevent overflows. (This must be done carefully -- "width*height*4 > 2^32" doesn't mean anything to a C++ compiler!) If you remember to think "integer mod 2^32 or 2^64" every time you see "int", you may be able to avoid introducing new security holes due to integer overflow when you write code.
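As a rough illustration of why the check has to be written carefully, the sketch below contrasts a comparison done after the multiplication with one done by division before it. Both function names are hypothetical, this is not code from Gecko, and the sketch assumes unsigned int is 32 bits wide.

```cpp
#include <cstdint>
#include <limits>

// Broken check: the product has already been reduced mod 2^32 by the time it
// is compared, so it can never exceed the 32-bit maximum. Always returns false.
bool naive_check_detects_overflow(std::uint32_t width, std::uint32_t height) {
    return width * height * 4u > std::numeric_limits<std::uint32_t>::max();
}

// Working check: decide whether width * height * 4 fits in 32 bits by
// dividing the limit instead of multiplying the inputs.
bool rgba_size_fits(std::uint32_t width, std::uint32_t height) {
    const std::uint32_t max = std::numeric_limits<std::uint32_t>::max();
    if (width == 0 || height == 0) return true;  // the product is 0
    // width * height * 4 <= max  is equivalent to  height <= (max / 4) / width
    return height <= (max / 4u) / width;
}
```

For width 65536, rgba_size_fits accepts height 16383 and rejects height 16384, whose true size is exactly 2^32. The naive version reports no overflow for either.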
Michael Howard at Microsoft advocates this approach, at least for C code that is used near things like allocation sizes and reference counts, and provides functions to do checked arithmetic operations. These functions return a boolean indicating whether the arithmetic operation succeeded. This leads to code where it is hard to see what calculation is being done but easy to see that each step of the calculation is done safely.
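A minimal sketch of the "checked helper" style might look like the following. The names checked_mul_u32 and rgba_buffer_size are made up for this example and are not the functions Howard provides; only the shape (a boolean result plus an output parameter) follows the description above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: multiplies a and b, storing the product in *out.
// Returns false, leaving *out untouched, if the product does not fit in 32 bits.
bool checked_mul_u32(std::uint32_t a, std::uint32_t b, std::uint32_t* out) {
    if (a != 0 && b > std::numeric_limits<std::uint32_t>::max() / a) {
        return false;
    }
    *out = a * b;
    return true;
}

// width * height * 4, with every step checked.
bool rgba_buffer_size(std::uint32_t width, std::uint32_t height,
                      std::uint32_t* bytes) {
    std::uint32_t pixels = 0;
    return checked_mul_u32(width, height, &pixels) &&
           checked_mul_u32(pixels, 4u, bytes);
}
```

The formula width * height * 4 is no longer visible at a glance, which is the readability cost mentioned above, but a caller only allocates when rgba_buffer_size returns true.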
Safe integer classes
Another strategy is to avoid using "int", at least in code used to compute allocation sizes, and instead use a "safe" integer class. A safe class might do correct arithmetic on large numbers, allocating extra memory when needed, but perhaps that is overkill for keeping allocations safe. A proponent of this approach might say "int is the new char *", referring to how string buffer overflows have been nearly eliminated through the use of string classes, and make fun of Joel Spolsky for the quote at the beginning of this post.
David LeBlanc, also at Microsoft, advocates a slightly different approach: using a class that treats overflow as an error and can throw exceptions. This keeps arithmetic formulas readable at the expense of having to design the function to handle exceptions correctly.
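The exception-throwing style can be sketched as below. CheckedU32 is a hypothetical toy class written for this post, not LeBlanc's actual class or its API, and it supports only multiplication.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Toy overflow-checking wrapper around a 32-bit unsigned integer (hypothetical).
class CheckedU32 {
public:
    CheckedU32(std::uint32_t v) : value_(v) {}  // implicit on purpose, so formulas stay short
    std::uint32_t value() const { return value_; }

    // Throws std::overflow_error instead of wrapping around.
    friend CheckedU32 operator*(CheckedU32 a, CheckedU32 b) {
        const std::uint32_t max = std::numeric_limits<std::uint32_t>::max();
        if (a.value_ != 0 && b.value_ > max / a.value_) {
            throw std::overflow_error("CheckedU32: multiplication overflow");
        }
        return CheckedU32(a.value_ * b.value_);
    }

private:
    std::uint32_t value_;
};

// The formula reads almost like the unchecked version.
std::uint32_t rgba_buffer_size_or_throw(std::uint32_t width, std::uint32_t height) {
    return (CheckedU32(width) * height * 4u).value();
}
```

The calculation stays readable, but every caller of rgba_buffer_size_or_throw now has to be exception-safe, which is the trade-off described above.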
Static analysis can be used to scan for calls to malloc that compute their size using "int" and need to be converted to use a safe integer class such as LeBlanc's SafeInt.
Will Gecko soon have a multitude of integer classes, each with different performance characteristics, signedness, and overflow behavior? Probably not, because numbers used to decide how much memory to allocate are almost always unsigned integers where overflows can be treated as errors. But I wouldn't be surprised to see different parts of the code use different strategies, with C code using the "explicit checks with helpers" strategy and XPCOM C++ code using another strategy.
Other languages
Many languages share C++'s behavior of exposing "integer mod 2^32" types as "int", but JavaScript and Python are two major exceptions. JavaScript has a hybrid "number" type that is sometimes stored as an integer and sometimes stored as a floating-point number. Overflowing integer arithmetic turns your numbers into floating-point numbers, while treating a floating-point number as a bit field tries to turn it back into an integer by computing its value mod 2^32. While JavaScript's behavior is more useful in most situations than wrapping around, you wouldn't want to use it for memory allocation.
Python instead takes advantage of its dynamic type system to make integers safe. Overflowed integers are replaced with a "long integer" type that is slower to operate on but has safe, correct behavior for integers of any size (until you run out of memory).