Blog Database Corruption Solved

Posted by JD 08/09/2011 at 08:45

Sometime on Monday, the database our blog software runs on became corrupted to the point that accessing the blog wasn’t possible for hours, perhaps many, many hours.

I don’t know how long the error existed, just that I created a few new articles in the morning and didn’t check back until late afternoon, when I found the process eating 99.99% of the available CPU AND not serving any pages.

Open DB Transaction

For about an hour, I attempted to discover the issue and correct it. I checked that the processes were running, read the log files (which were worthless), looked for signs of an attack, and checked the load balancer and cluster server status before finally turning to the database itself. I could access the DB, but there was a journal file that seemed incomplete, as if a transaction had never finished. I couldn’t figure out whether that was really the issue, so I decided to sleep on the solution of restoring the DB backup from earlier in the day.
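
I didn’t dig much deeper than eyeballing that journal file, but a leftover journal is the sort of thing SQLite leaves behind when a transaction is interrupted. Assuming the blog’s database really is a SQLite file under /var/www/typo/db (the file name below is a guess on my part), a quick integrity check along these lines would have confirmed or ruled out corruption:

    # Look for a leftover -journal file next to the database file.
    ls -l /var/www/typo/db/
    # Ask SQLite to verify its own structures; anything other than "ok" means trouble.
    # (the file name here is a placeholder)
    sqlite3 /var/www/typo/db/production.sqlite3 "PRAGMA integrity_check;"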

The Quick Fix

Rather than just disable the service completely, I decided to load up an empty database, like when the blog was fresh and new around 2000, and post a single article saying, “We’re having issues. Please come back later.”

Abusive Search Engine

The last record was from Baidu, a Chinese search engine which I’ve blocked due to abuse. It does not honor robots.txt, so I dropped a rule into the load-balancer/reverse-proxy to redirect that user agent, Baiduspider, to an “unavailable due to permissions” (403) page. I hate blocking any end users, but when anyone becomes abusive, I don’t have any choice. There is only so much bandwidth available here. Play nice, please.
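
I won’t reproduce my exact rule here, but for anyone who wants to do the same, a user-agent match in the proxy is all it takes. A minimal sketch, assuming nginx is the reverse proxy in front of the blog:

    # Inside the server { } block: send Baidu's crawler a 403 instead of pages.
    if ($http_user_agent ~* "Baiduspider") {
        return 403;
    }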

Abusive RSS Readers

I block a few abusive RSS readers too. Sorry folks, there’s nothing on my blog that can’t wait 24 hours. You definitely do not need your RSS reader to be checking every 5 minutes. Ok?
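
Outright blocking is the blunt instrument; a gentler option is rate limiting the feed so over-eager readers get slowed down instead of cut off. Another rough sketch, again assuming nginx, with a placeholder feed path:

    # In the http { } block: allow roughly one feed request per minute per client IP.
    limit_req_zone $binary_remote_addr zone=feed:10m rate=1r/m;

    # In the server { } block: apply the limit to the feed URL (path is a placeholder).
    location /articles.rss {
        limit_req zone=feed burst=5;
    }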

Restore, Restore, Restore

I preach about backup, backup, backup. Today was about restore, restore, restore. It was pretty uneventful. I use rdiff-backup, and this was the first time I needed to restore from a backup more than 1 day old, i.e. not the latest mirror, with just a single file wanted, not the entire backup. Here were the steps:

  1. Log into the backup server – backups are kept on a different server than where processing happens.
  2. Create a temporary directory to hold the restored files, then chdir into it.
  3. # rdiff-backup -r 1D /backups/xen/rdiff/xen41/rdiff-backup-data/increments/var/www/typo/db .
    This will restore the backup from 24 hours ago for that specific directory into the current directory. Puzzle solved.
  4. Take the DB offline.
  5. Move the restored DB into the correct location and restart.
    Check that everything is working. Yep. Restore successful. I need to flush the static caches and rebuild them. Simple enough, done. Now to add back those few new articles.
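
Pulled together, the whole restore is only a handful of commands. What follows is a sketch rather than a transcript of exactly what I typed: the scratch paths, the blog server hostname, and the service name are placeholders, while the rdiff-backup repository path is the real one from step 3.

    # Steps 1-3, on the backup server: restore yesterday's copy of the db
    # directory; the restored contents land in the current (scratch) directory.
    mkdir /tmp/restore && cd /tmp/restore
    rdiff-backup -r 1D /backups/xen/rdiff/xen41/rdiff-backup-data/increments/var/www/typo/db .

    # Copy the restored files over to the blog server (hostname is a placeholder).
    scp -r /tmp/restore blog-vm:/tmp/db-restored

    # Steps 4-5, on the blog server: stop the blog, swap in the restored copy,
    # keep the corrupt one around just in case, then restart.
    # The service name and exact paths are placeholders, not my exact setup.
    service typo stop
    mv /var/www/typo/db /var/www/typo/db.corrupt
    mv /tmp/db-restored /var/www/typo/db
    service typo start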

Having a backup and knowing how to restore all of it (or parts) from a different backup than the most recent = priceless.

Result

Anyway, after restoring the backup from yesterday, about 1200 articles are back online. I’ve manually pulled the 2 lost articles back from the corrupted DB and reposted them.

I was encouraged to see that only 1 CPU was pegged during the issue. I don’t know whether that is luck or the design of the blog software. This virtual machine has access to 2 vCPUs, but other machines are also running on the same physical hardware, so playing nice with others is appreciated. The bandwidth here is pretty limited, so CPU will never really be the constraint for this blog; bandwidth will always be the real limiting factor.

Anyway, I’m happy to be back online.