Snow was flying past my vehicle as I drove to work during the 2015 snowstorm, and every school district had closed. Just a few weeks prior, during the Christmas break, I had finished moving all of our school district's websites to our new platform, which turned out to be both a blessing and a curse.
The district's websites now had several new features – banner notifications for school closings, modern responsive web design, and an integrated calendar, just to name a few – and these were great tools to have during a snow day.
But the curse was about to show itself.
At 7:41 AM my phone rang; it was my supervisor. I was caught off guard, but I answered the call. He informed me that our shiny new website was crashing often and crashing hard: 504 errors and high CPU usage were making our web server a liability rather than an asset.
Thousands of parents were checking the district website for information about makeup days, activity schedules, and remote learning instructions. But instead of the website, they were getting error messages.
Our third-party vendor's support team had assured us that the website was ready for production and heavy traffic. But they clearly hadn't tested it well.
As soon as I got to my computer, I logged into the web server and immediately started investigating. The app server and database server were both online and responding, but the MySQL database server was running hot, consuming 98% of available resources, and MySQL was throwing errors about corrupted tables.
Checking the logs, I could see that each time the database service was restarted, it would consistently start to eat up all of the CPU and RAM.
Then, looking through the database service logs, I noticed that some table locks were being held for a long time. These locks built up as the web server continued to open connections to the database, and they were causing both the high resource usage and the complete crashes. The websites were using the MyISAM storage engine, which doesn't handle heavy traffic well because it relies on table-level locks rather than row-level locks.
Worse, MyISAM does not handle crashes gracefully and is more likely to fail under heavy load. This is especially true when a write operation is interrupted, which can leave the entire table corrupted.
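If you want to watch this lock pile-up happen live, MySQL can show which tables are currently locked and which connections are stuck waiting. This is a diagnostic sketch, not the exact commands I ran that morning:

```sql
-- Tables with In_use > 0 currently hold or are waiting on a table lock
SHOW OPEN TABLES WHERE In_use > 0;

-- List every connection; under MyISAM contention you will see
-- many threads in the "Waiting for table level lock" state
SHOW FULL PROCESSLIST;
```

During a traffic spike, each slow write holds its table lock while every read behind it queues up, which is exactly the pattern these two commands expose.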
The vendor we contracted for design and load testing support was supposed to set up our database to handle heavy loads, but clearly they missed the mark.
As I dove into the servers, my supervisor contacted our vendor for support.
First I backed up the corrupted databases for each school website – even broken data is better than no data. Then I used MySQL's built-in MyISAM repair utility, running the repair command directly on the database files:
myisamchk --force --verbose dp_district*.MYI
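For context, the backup step before the repair looked roughly like this. It is a minimal sketch with hypothetical paths and names: MyISAM stores each table as raw files on disk (`.frm` for the schema, `.MYD` for data, `.MYI` for indexes), so you copy those files aside before letting `myisamchk --force` rewrite anything.

```shell
#!/bin/sh
# Sketch of the pre-repair backup (paths are illustrative assumptions).
# Copies a database's raw MyISAM table files into a timestamped
# backup directory, then prints where the backup landed.
backup_myisam_db() {
  src="$1"                                    # e.g. /var/lib/mysql/someDb
  dest="$2/$(basename "$src").$(date +%Y%m%d%H%M%S)"
  mkdir -p "$dest"
  cp -p "$src"/*.frm "$src"/*.MYD "$src"/*.MYI "$dest"/
  echo "$dest"
}
```

The server should be stopped (or the tables flushed and locked) while copying, since the files change underneath you otherwise.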
Once the data was recovered, I took another backup of the repaired files.
Then I logged into MySQL and converted the district site's tables to InnoDB using:
ALTER TABLE dp_districtSite ENGINE=InnoDB;
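Converting tables one at a time gets tedious across many school sites. One common approach – sketched here, with the schema exclusions as assumptions you may need to adjust – is to have MySQL generate the `ALTER` statements for every table still on MyISAM:

```sql
-- Emit one ALTER statement per remaining MyISAM table; paste the
-- output back into the mysql client to run the conversions.
SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name,
              '` ENGINE=InnoDB;') AS stmt
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
```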
Going from MyISAM to InnoDB would give us improved load handling, row-level locks instead of table locks, and far better crash recovery.
After the conversion, I made a quick test connection, then shut down the database server and updated the config to make sure the InnoDB engine would run cleanly. Then I started MySQL up again using:
mysqld --innodb_force_recovery=4
This allowed me to confirm that the recovery had worked.
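The config change itself would look something like the fragment below. The values are illustrative assumptions, not the district's actual settings; in particular, `innodb_force_recovery` should be removed (or set back to 0) once the server comes up clean, because nonzero levels restrict normal InnoDB operation.

```ini
# Illustrative my.cnf fragment – values are assumptions.
[mysqld]
default-storage-engine  = InnoDB

# Temporary: used only to bring the server up for verification.
# Remove once recovery is confirmed.
innodb_force_recovery   = 4

# Give InnoDB a reasonable buffer pool for the working set.
innodb_buffer_pool_size = 1G
```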
As the websites came back online, they were even faster than before. Within an hour of that initial call, all district websites were back up and serving content to the thousands of parents and staff trying to access them.
But I wasn't done yet. During my investigation I had noticed that the Apache web server configuration allowed too many persistent connections, which could overwhelm the database during high-traffic periods like snow days. I tuned the Apache worker parameters to limit concurrent database connections while still maintaining good performance for users.
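The tuning amounted to capping how many Apache workers – and therefore how many simultaneous database connections – could exist at once. Here is a sketch for the prefork MPM, with illustrative numbers rather than the values I actually used; the right cap depends on available RAM and on MySQL's `max_connections`.

```apache
# Illustrative prefork MPM tuning – numbers are assumptions.
<IfModule mpm_prefork_module>
    StartServers            5
    MinSpareServers         5
    MaxSpareServers        10
    MaxRequestWorkers     150    # hard cap on simultaneous requests
    MaxConnectionsPerChild 1000  # recycle workers periodically
</IfModule>

# Short keep-alives free workers quickly during traffic spikes.
KeepAlive On
KeepAliveTimeout 2
```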
Three hours later I had the databases restored, the websites online, and Apache tuned, bringing page loads down to roughly 1.5 to 2 seconds. That was an improvement over the roughly 10 seconds after the initial database repair, and the nearly 20 seconds under heavy load the day before. Now we had sub-two-second load times even during a traffic surge.
I emailed the vendor to tell them I had handled it. They seemed shocked; it appeared they had budgeted for an all-day recovery effort.
More importantly, we had prevented thousands of phone calls to already-understaffed school offices, and staff were able to communicate with students and parents without interruption.