It's been a while since I've had a chance to blog, and I really need to make a concerted effort to do it more frequently. It's certainly not for lack of things to blog about; it's more about making the time to do it. I think if I stop worrying about writing long blog posts and try to write smaller updates, it'll be easier for me to keep up with it.
We've been preparing to merge the PvP server Venekor into Nagafen for several months now. We announced it around the time of Fan Faire and set a date for some time in September. As we approached the date, we started experiencing performance problems on Nagafen's database that gave us some real concern about whether it would be able to handle the additional load from Venekor. Nagafen's performance has always been a little sluggish due to the higher population, but a few weeks before the merge the problem got much worse.
We had several meetings with our DBAs to discuss methods for improving the performance. Most of the options were band-aids or had no guarantee that they would solve the problem. Much to our surprise, the DBAs were able to offer a complete upgrade to the database hardware and software. This was something we had been wanting to do for a while but hadn't had the opportunity due to conflicting projects for the DBA team. While this was certainly the most desirable option, there just wasn't enough time to do it before the merge, which was only 2 weeks away. Although we knew that the decision would be met with some disappointment, we decided to delay the merge by another 4 weeks to give us time to upgrade the database hardware.
We scheduled the migration of the Nagafen data to the new database a week before the merge was to happen. The migration day came, the servers were brought down and the data was moved to the new database. Everything appeared to be perfect, even after we brought up the servers on the new hardware. Unfortunately, there was a small subset of characters who were not able to log in. After some research we found that the migration software had a problem moving certain large BLOB fields. At this point we had to roll back to the old database and come up with a new plan. After more meetings we decided to try the migration again, but with a different method of moving the data that was slower but more reliable. Several days later we tried again and everything worked great.
The DB migration happened on a Tuesday. Ideally we would have waited until the next Tuesday to perform the merge, but that would have meant delaying it another whole week and we just didn't want to do that. So we decided to kick off the merge process Wednesday night at midnight. We had a 28-hour maintenance window to complete the merge.
On the evening of the merge we kicked off the script to export the data from Venekor. The script was running much slower than during our tests and looked as if it would take 26 hours just to export the data, which was only half of the process. At this point we didn't have a lot of options, so we let it go. By 7am Thursday morning the script had been running for 7 hours when it crashed with a segmentation fault. You can imagine how we felt knowing that the script was already taking twice as long as it should have, and on top of that we had just lost 7 hours. We decided to run the script from another machine to rule out hardware problems and started it up again. That morning we discussed worst-case scenarios and pretty much decided that if the script crashed again we would have no choice but to abort and troubleshoot the problem. However, not only did the script continue to run, it actually started running faster. The export script completed after about 15 hours. The next step of the process was to bring down Nagafen for a final backup and then start the import of the data.
The import went a good deal faster since it was running against the newer database hardware. About 12 hours later the import script finished and all of the Venekor data was in Nagafen's database. There were still many things to do, and fortunately the timing worked out so that the import finished during the day and not in the middle of the night. This allowed us to tackle the remaining tasks from the office during normal work hours. We continued to execute scripts that copied shared banks, guild halls and more. These scripts had to resolve name conflicts and keep track of ID changes so we could notify the chat and mail servers of changing player data. The chat servers had to be brought down to update their data, which unfortunately impacted all game servers. (Sorry about that.)
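For the curious, here is a minimal sketch of what "resolve name conflicts and keep track of ID changes" can look like. This is not our actual merge code: the `Character` class, the `merge_characters` function, the `_V` rename suffix and the way new IDs are assigned are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Character:
    char_id: int
    name: str


def merge_characters(target, source, suffix="_V"):
    """Merge source characters into target.

    Returns (merged, changes), where changes maps each old source ID to
    (new_id, new_name) so other services (chat, mail) can be notified.
    """
    merged = list(target)
    # Names are compared case-insensitively (an assumption for this sketch).
    used_names = {c.name.lower() for c in target}
    # Hypothetical ID scheme: continue numbering after the highest target ID.
    next_id = max((c.char_id for c in target), default=0) + 1
    changes = {}

    for c in source:
        new_name = c.name
        # Hypothetical rename rule: append the suffix until the name is unique.
        while new_name.lower() in used_names:
            new_name += suffix
        used_names.add(new_name.lower())
        merged.append(Character(next_id, new_name))
        changes[c.char_id] = (next_id, new_name)
        next_id += 1

    return merged, changes


# Example data (invented): both servers have a character named "Aria".
nagafen = [Character(1, "Aria"), Character(2, "Brom")]
venekor = [Character(1, "Aria"), Character(2, "Cyd")]
merged, changes = merge_characters(nagafen, venekor)
```

The `changes` mapping is the important part: every record that pointed at an old Venekor ID or name has to be rewritten from it, and the same mapping is what would be handed to the chat and mail servers.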
As soon as all of the scripts were done, we were ready to bring up Nagafen and have QA run through the standard sanity checks to make sure everything looked OK. Our smoketest takes several people an hour or two to complete and covers most of the gaming functionality. Unfortunately, it's just not possible to test every detail and every code path. The smoketest passed and the server was unlocked.
For the most part, everything looks good. We haven't seen any problems with character data other than some issues with missing shared bank coin. We also had an issue with guild hall supply depot contents not transferring over correctly. These can be corrected with some additional time and overall I'd say the merge was a success. The server has only been up for about 5-6 hours at the time of this post, so I hope we don't discover any other issues. *Keeps fingers crossed*
Some of us will be busy fixing up these final issues over the next couple of days, but overall I'm very happy with how well things turned out. Everyone who had a part in the merge and DB migration did an excellent job and I'm proud to be working with a group of such intelligent and hard-working people.
We understand that the extended downtime was an inconvenience to those of you who played on Venekor and Nagafen. We apologize for that but hope this process will bring you much more enjoyment in the long run.