Squarefree succumbs to the Digg effect

Yesterday, at around 4pm, I noticed that the content on squarefree.com was missing, and the main page was an empty directory listing. I ssh'ed to my web server and noticed that the "squarefree.com" directory had been renamed to "squarefree.com_DISABLED_BY_DREAMHOST". Then I checked my email and saw a message from DreamHost support:

Hello,

I just had to disable your site squarefree.com as it's coming under some load and spawning countless php processes that are crashing the webserver. I wasn't able to figure out exactly what's going on, as leaving it up for more than a minute pretty much toasts the server. Please don't re-enable it until you've figured out what's going on, or disabled any possibly problematic php.

Thanks,

James

I jumped into #dreamhost on irc.freenode.net and started looking through my web server logs for suspicious requests. I was expecting to find that my blog had been DDoSed, perhaps by someone trying to leave comment spam. Instead, I found a large number of requests for nonexistent files, falling into two categories:

  • Requests for favicon.ico, a file that did not exist on my site. Some of these requests are expected: most browsers with tabs request favicon.ico to display it in the tab bar. But there were also hundreds of IP addresses that requested nothing but favicon.ico for the entire day, and some requested it many times. About 100 of these IPs were Internet Explorer users with the Google Toolbar, so apparently I was getting DDoSed by a bug in the Google Toolbar. Another 100 were Firefox users; I haven't figured out why Firefox would request nothing but favicon.ico over and over.
  • Requests due to people using my Real-time HTML Editor to edit pages that used relative URLs for images, iframes, etc. One user made dozens of requests for a file named "border=0". Another user made a request for 14 gif files every time the editor refreshed. I also saw from referrers that the Real-time HTML Editor had been featured on Digg, greatly increasing its traffic.
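
For anyone who wants to look for the same pattern in their own logs, here is a minimal sketch of the "requested nothing but favicon.ico" check. It assumes Apache's common or combined log format; the function name and the sample lines are made up for illustration and are not taken from my actual logs.

```python
import re
from collections import defaultdict

# Matches the start of an Apache common/combined log line:
# client IP, identity, user, [timestamp], "METHOD path protocol", status
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def favicon_only_clients(lines):
    """Return {ip: request_count} for clients whose every request was /favicon.ico."""
    paths_by_ip = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that do not look like log entries
            paths_by_ip[match.group("ip")].append(match.group("path"))
    return {
        ip: len(paths)
        for ip, paths in paths_by_ip.items()
        if all(path == "/favicon.ico" for path in paths)
    }


# Hypothetical sample lines using documentation-range IP addresses.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2006:16:00:00 -0800] "GET /favicon.ico HTTP/1.1" 404 512 "-" "ExampleAgent"',
    '192.0.2.1 - - [01/Jan/2006:16:00:05 -0800] "GET /favicon.ico HTTP/1.1" 404 512 "-" "ExampleAgent"',
    '192.0.2.2 - - [01/Jan/2006:16:00:06 -0800] "GET / HTTP/1.1" 200 1024 "-" "ExampleAgent"',
    '192.0.2.2 - - [01/Jan/2006:16:00:07 -0800] "GET /favicon.ico HTTP/1.1" 404 512 "-" "ExampleAgent"',
    '198.51.100.7 - - [01/Jan/2006:16:01:00 -0800] "GET /favicon.ico HTTP/1.1" 404 512 "-" "ExampleAgent"',
]

if __name__ == "__main__":
    print(favicon_only_clients(SAMPLE))
```

Grouping by IP is only an approximation of "one user", since several people can share an address, but it is enough to spot clients that hammer a single missing file.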

But why would 404 requests create PHP processes? Due to a recent change in WordPress, Apache was directing each 404 request to WordPress. WordPress used to put detailed rules in .htaccess -- for example, it would ask Apache to direct requests for http://www.squarefree.com/2005/ to WordPress using RewriteRule ^([0-9]{4})/?$. But newer versions of WordPress instead ask Apache to send it all requests for nonexistent files. I imagine this puts less strain on Apache when a site uses lots of WordPress Pages, but it hurts when a site gets lots of 404 requests. Several months ago, I had instructed WordPress to serve my custom 404 page for these requests, but WordPress still had to do a lot of work to determine that the requests should be treated as 404s.
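
To make the difference concrete, here is a toy Python model of the two routing styles. It is only a sketch, not how Apache or WordPress is implemented: the year-archive pattern is the one quoted above, while the function names and the set of existing files are hypothetical.

```python
import re

# The year-archive pattern quoted above. In .htaccess, Apache matches
# RewriteRule patterns against the path without its leading slash.
YEAR_ARCHIVE = re.compile(r"^([0-9]{4})/?$")

# Hypothetical set of files that really exist on disk.
EXISTING_FILES = {"style.css"}


def reaches_php_detailed(path):
    """Older style: only paths matching an explicit rule are sent to WordPress."""
    return YEAR_ARCHIVE.match(path) is not None


def reaches_php_catch_all(path):
    """Newer style: every request for a nonexistent file is sent to WordPress."""
    return path not in EXISTING_FILES


if __name__ == "__main__":
    for path in ["2005/", "favicon.ico", "border=0", "style.css"]:
        print(path, reaches_php_detailed(path), reaches_php_catch_all(path))
```

Under the detailed rule, a request for favicon.ico or "border=0" never starts a PHP process, so Apache can answer with a cheap 404 on its own. Under the catch-all, each of those requests costs a full WordPress startup.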

Once I realized what had happened, and determined that reconfiguring WordPress would be difficult, I did what I could to reduce the number of 404 requests WordPress would have to handle. I created a tiny favicon.ico file so those requests wouldn't be 404s, and I moved the Real-time HTML Editor onto its own subdomain so WordPress wouldn't handle the 404s it causes. My site was only down for 40 minutes, with the Real-time HTML Editor down a little longer while I waited for the new subdomain's DNS to propagate.

Some things DreamHost could have done better:

  • It would have been nice if James had disabled PHP for my domain instead of disabling my site entirely. Pornzilla did not need to be down due to PHP problems.
  • A per-user process limit might have allowed my site to send "503 Service Unavailable" in response to some requests instead of being down entirely. It would have also prevented my site from causing problems for other sites on the shared server.
  • Better performance diagnostics would have helped both James and me isolate the problem. For example, it would have been great to have a list of PHP processes showing the request URL that caused each PHP instance to be triggered, the lifetime of each process, and perhaps some performance information (CPU used, RAM used, number of database requests).

Some things DreamHost did right:

  • DreamHost allowed me to restore my site myself once I fixed the problems. All I had to do was rename "squarefree.com_DISABLED_BY_DREAMHOST" back to "squarefree.com".
  • Knowing about DreamHost's .snapshot feature kept me from panicking about data loss when my site appeared to have disappeared.
  • The employees in #dreamhost were helpful.

If anyone is wondering: yes, I still love DreamHost.

15 Responses to “Squarefree succumbs to the Digg effect”

  1. Gérard Talbot Says:

    Hello Jesse,

    “Requests for favicon.ico, a file that does not exist on my site. Some of these requests are expected: most browsers with tabs request favicon.ico to display it in the tab bar.”

    This is a known bug, I’d say.

    “Stop automatically requesting /favicon.ico and instead only request it when it’s explicitly linked to from the page.”
    Top 10 Bugs (Internet Explorer)
    http://tobyinkster.co.uk/web-bugs

    Gérard

  2. Gids Says:

    Not 100% on topic…

    …but I noticed the HTML editor puts a big strain on my back button, perhaps it would be better off in its own window.

    Gids

  3. Jesse Ruderman Says:

    You’re free to open it in a new window or tab ;)

  4. Matt Nordhoff Says:

    Gérard:

    As a user, I like automatically requesting favicon.ico. I’ll still see it even if I go directly to an image or something instead of an HTML page.

    As a webmaster (not that I have anything on my website), I dunno. I just made an empty favicon.ico so I wouldn’t see 404s for it in my error.log, and it doesn’t bother me.

    If browsers suddenly stopped automatically getting it, I bet a decent number of sites suddenly wouldn’t have their favicons displayed because they don’t have the proper <link> tags.

  5. Wladimir Palant Says:

    Jesse, how about using document.open("text/html", "replace") to open the preview document? It won’t create new history entries.

  6. greggles Says:

    Not to be a total drupal shill, but compare your experience to this other recent “post digg effect” story:

    http://linux.inet.hr/the_digg_effect_analyzed.html

  7. Pau Tomàs Says:

    Hi Jesse,

    “Another 100 were Firefox users; I haven’t figured out why Firefox would request nothing but favicon.ico over and over.”

    Maybe these are using some extension that requests the favicon. I developed one extension that has to do that, and I was not really sure whether the method used to request them could cause overload problems for servers.

  8. alanjstr Says:

    Apache gets the request before WordPress, right? So then why couldn’t you do your own 404 rule in .htaccess? Or would WordPress just blow it away?

  9. Jesse Ruderman Says:

    Without detailed rules to send requests for “WordPress Pages” (etc.) to WordPress, WordPress has to be the handler for nonexistent files so it can determine whether there’s a Page with the requested URL.

  10. Matt Nordhoff Says:

    Does Firefox cache 404s?

  11. Rowan Lewis Says:

    Screw WordPress I say… I don’t know what you people see in it.

    Matt, it is against the HTTP protocol to cache them.

  12. Matt Nordhoff Says:

    Rowan:

    Oh, really? Huh. Interesting.

    Wow, the HTTP/1.1 RFC is LONG. Has anyone ever read the whole thing? :P