Friday Outage Post-Mortem

Discussion in 'Announcements' started by SeanieB, Apr 6, 2013.

  1. SeanieB


    Chief Server Monkey Administrator
    San Diego, CA
    Fixing Sonic Retro
    Hey there, everyone --

    I'm just gonna write up a quick thing about today's chain of events.

    We've been tracking a dying pair of the original Samsung 500GB SATA drives inside Retro's web server for a few days now. Around 1:30-2:00 AM Pacific this morning, the server dropped off the face of the earth. I wasn't notified about this because my Nagios was only set to notify me of Retro's status by e-mail and pager text message during work hours.
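
    To picture the alerting gap described above, here is a minimal Python sketch of a work-hours-only notification window. It is not the actual Nagios configuration: the 9:00-17:00 Monday-to-Friday window, the function name should_notify, and the example timestamp are all assumptions made for illustration.

```python
from datetime import datetime, time

# Assumed work-hours window; the real schedule is not stated in the post.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
WORK_DAYS = {0, 1, 2, 3, 4}  # Monday=0 .. Friday=4

def should_notify(when, always_on=False):
    """Return True if an alert raised at `when` would be sent out."""
    if always_on:
        # Models a 24/7 notification setting.
        return True
    in_work_day = when.weekday() in WORK_DAYS
    in_work_hours = WORK_START <= when.time() < WORK_END
    return in_work_day and in_work_hours

# Illustrative timestamp inside the 1:30-2:00 AM window (exact date assumed).
outage = datetime(2013, 4, 5, 1, 45)
print(should_notify(outage))                  # work-hours-only setting
print(should_notify(outage, always_on=True))  # 24/7 setting
```

    Under these assumptions an early-morning failure falls outside the window, so no alert goes out until the window reopens, while a 24/7 setting would have sent one right away.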

    I'm not entirely sure if it was a coincidence, a bit more disk failure, software failure, or maybe a hack gone wrong, but no logs of the problem were made, and the OS on the machine was definitely damaged in some way.

    For those familiar with Linux: conventional System V style init uses a series of init scripts to carry out the boot sequence and to handle run level requests. For a still-undiscovered reason, not all of the daemons start on the initial run level. However, if SSH is adjusted to start immediately after networking, manually changing the run level to one of the other unoccupied multiuser run levels on Debian starts the remaining daemons and allows logins to work as normal from the console.
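
    For readers less familiar with System V init, the ordering idea can be sketched roughly in Python. In a run level directory such as /etc/rc2.d, entries named S<NN><service> are started in ascending order of the two-digit number. The directory listing below is hypothetical (it is not taken from the affected server), and start_order is an illustrative helper, not part of any init system.

```python
import re

# Hypothetical contents of a run level directory such as /etc/rc2.d.
# "K" entries are stop links, "S" entries are start links.
RC2_D = ["S01rsyslog", "K01exampled", "S02ssh", "S01networking", "S03apache2"]

START_LINK = re.compile(r"^S(\d\d)(.+)$")

def start_order(entries):
    """Return service names in the order their S links would be run."""
    starts = []
    for name in entries:
        match = START_LINK.match(name)
        if match:
            # Sort key: sequence number first, then service name.
            starts.append((int(match.group(1)), match.group(2)))
    return [service for _, service in sorted(starts)]

print(start_order(RC2_D))
```

    In this toy listing ssh sorts directly after networking, which is the kind of adjustment described above: with SSH that early in the sequence, remote access comes up even if a later script never runs.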

    I verified the drives with my personal copy of Spinrite and they did turn up a couple more errors, but no outright failure. RAM tests also passed.

    Some of Retro's ancillary public and private services have been disabled until the restore to the new drives and OS is complete. Additional backup plans are being carried out to make sure we're prepared for catastrophe. Expect a lot of slowdown in the next 24 hours. I will be setting my Nagios to report problems 24/7 until Retro goes back to normal.

    New drives have been ordered, and tentatively we hope this process can get started by Wednesday night, but that's an estimate. Expect another full day's downtime at that time, but I will give full warning. I am going to fully test the new drives before putting them into service; I think it would be prudent to spare some time for confidence in their stability.

    As far as I know at this time, no information was lost in this incident. A backup finished very close to the time of the failure, and the backup run itself may also have been a root cause.

    Repo, FTP, the Scans service, and a couple of underutilized user accounts have been disabled for security, and will reappear later this evening or some time tomorrow.

    I'm willing to accept questions in this thread.

    Thanks for your understanding.
  2. Scarred Sun


    Be who you needed when you were younger Administrator
    Tower 8
    Welp, this.
    What he said. :v:

    In all seriousness, we had planned to replace these drives in a couple of weeks—we're just going to have to do so a little earlier than planned. It doesn't look like we lost anything in the backup, and we should be running at full steam by Friday at the absolute latest. I've also had SeanieB make an offsite backup in the event that one of our tertiary backups fails.

    So all in all, good to go.
  3. JaxTH


    Pudding Deity Oldbie
    Los Angeles
    Jack shit.
    That is what makes it hilarious, we were just talking about it in #retro mere days ago! :v:
  4. GeneHF


    SEGA-ier than you'll potentially ever be. Site Staff
    Scenic Studiopolis
    Complete Global Conquest
    So, as some of you clever girls pointed out in IRC when they first went down, yes, we "forgot to blow in the carts."

    But now we're cleaning the cart with alcohol and putting in a new pin slot for an even more reliable Retro and Segabits. :eng101:
  5. Josh


    Aye, excellent job getting things back up and running. It's very, very much appreciated, and I thank you all for keeping things together.


    Do you need one of these? I think I still have one somewhere.