Server Down! Server Down!

As many of you know, the site went down on Monday when the sale went live.  The load on the site tripled and a number of poorly coded portions started acting up.  For that, we apologise to everyone, especially those who missed out on games they wished to purchase because the site was down.

We’ve now fixed numerous issues, flagged some of the remaining problems for fixing, and introduced some new features to distribute the load.  These changes have actually improved overall site performance, even with the increased load we’re still seeing.

With that said, I’m going to get (mildly) technical about what happened.  If you are not interested in things like that, you should stop reading now.

The Problem

There were a few problems that happened on Monday.  These included:

  • Increased number of visitors – the traffic put too much bandwidth demand and load on the site, even with the CDN turned on.
  • Bad Topsellers Query – our bestsellers list was re-indexing / refreshing itself live, which is a problem when sales are coming in continuously.
  • Re-indexing of URLs – the site was set to reindex / recheck URLs on an hourly basis.  Unfortunately, with the continual refreshing and sales, the process kept having to restart.
  • Backend background scripts – basic reporting scripts were being run on the backend, which added to the server load.

The Solutions

On Monday, when the site first went down, we contacted our server company to see if they could help with the load.  They noticed the increased load and raised our bandwidth, which helped a bit.  They also shut down a few of the backend background scripts that we had running.  That got the site up and running briefly.

Secondly, we installed / upgraded our server caching and basically moved onto a distributed network.  This decreased the overall load on the original site, allowing some customers and orders to come through.  Unfortunately, if too many people hit the SSL pages and/or pulled data directly from the database, the cached pages on the distributed network were of little use and the site would slow down or go down.
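To make that limitation a little more concrete, here is a minimal sketch of the kind of decision a page cache makes.  This is not our actual configuration: the function name, the path prefixes and the rule that every SSL page bypasses the cache are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical prefixes for pages built per visitor straight from the database.
DYNAMIC_PREFIXES = ("/cart", "/checkout", "/account")


def can_serve_from_cache(url: str, method: str = "GET") -> bool:
    """Return True if a request could be answered with a cached page.

    Assumptions for this sketch: only GET requests are cacheable, SSL
    (https) pages always go to the origin server, and any path under
    DYNAMIC_PREFIXES needs a live database lookup.
    """
    if method.upper() != "GET":
        return False
    parts = urlsplit(url)
    if parts.scheme == "https":
        return False
    # str.startswith accepts a tuple and matches any of the prefixes.
    return not parts.path.startswith(DYNAMIC_PREFIXES)
```

Every request that returns False here still lands on the original server, which is why the cache alone could not keep the site up once enough people reached those pages.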

To fix that, we started reviewing the site scripts to look for additional problems.  That’s when we came across the Topsellers / Reindexing issues.

Reindexing was simple – we just turned off the feature and switched to a manual refresh.

Topsellers, on the other hand, was a bit more difficult and required us to redo the script.  It’s now a cached version of the topsellers, with the data refreshed at intervals instead of live on every sale.
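For the curious, the idea behind an interval-refreshed list looks roughly like the sketch below.  It is an illustration rather than our real script: the class name, the `compute` callback (standing in for the expensive topsellers query) and the refresh interval are all hypothetical.

```python
import time
from typing import Callable, List, Optional


class TopsellersCache:
    """Serve a stored topsellers list, rebuilding it at most once per interval."""

    def __init__(
        self,
        compute: Callable[[], List[str]],
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute            # hypothetical expensive query
        self._interval = interval_seconds  # assumed refresh interval
        self._clock = clock                # injectable so it can be tested
        self._value: Optional[List[str]] = None
        self._refreshed_at = 0.0

    def get(self) -> List[str]:
        now = self._clock()
        stale = self._value is None or now - self._refreshed_at >= self._interval
        if stale:
            # Only this branch touches the database; every other call
            # returns the stored list.
            self._value = self._compute()
            self._refreshed_at = now
        return self._value
```

With something like `TopsellersCache(run_topsellers_query, interval_seconds=300)`, where `run_topsellers_query` is a placeholder name, the query runs at most once every five minutes no matter how many orders or page views arrive in between.  The trade-off is that the list can be slightly out of date.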

Lastly, to speed up the entire site, we’re in the process of installing a new module that should shrink a number of files and cache a number of other processes.  Preliminary testing has shown that it speeds up the site by another 20%.  We want to make sure the new module doesn’t break anything else before we go live with it, though.

Final Thoughts

It’s been a stressful couple of days.  Having the entire site down during a sale is never fun, but at least it helped us find a number of problems that would likely have cropped up during Xmas.  It might be time to take a closer look at the scripts we have running to make sure that nothing else is as badly coded, if only to speed the site up further.  Overall, though, the latest round of changes seems to have made a big difference to both site speed and stability.

Once again, we apologise for the site going down.