Sonic and Sega Retro Message Board: Friday Outage Post-Mortem

Page 1 of 1
    Locked Forum

Friday Outage Post-Mortem
I hope you like talking about Linux.

#1 SeanieB

Posted 05 April 2013 - 09:32 PM

  • errno -1 (Not system error)
  • Posts: 442
  • Joined: 26-February 08
  • Gender:Male
  • Location:San Diego, CA
  • Project:Fixing Sonic Retro
  • Wiki edits:12
Hey there, everyone --

I'm just gonna write up a quick thing about today's chain of events.

We've been tracking a dying pair of the original Samsung 500GB SATA drives inside Retro's web server for a few days now. Around 1:30-2:00 AM Pacific this morning, the server dropped off the face of the earth. I wasn't notified because my Nagios was set to notify me of Retro's status by e-mail and pager text message only during work hours.
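For anyone wondering how a monitor can see an outage and still stay quiet, here is a rough Python sketch of a notification window check. It is not Nagios configuration and not how Nagios is implemented; the 09:00-17:00 weekday window, the names `WORK_HOURS`, `ALWAYS`, and `would_notify`, and the exact 1:45 AM timestamp are all assumptions made up for illustration.

```python
from datetime import datetime, time

# Hypothetical notification periods: weekday index (0 = Monday) -> (start, end) windows.
WORK_HOURS = {day: [(time(9, 0), time(17, 0))] for day in range(5)}
ALWAYS = {day: [(time(0, 0), time(23, 59, 59))] for day in range(7)}


def would_notify(when, period):
    """Return True if `when` falls inside one of the period's windows for that weekday."""
    for start, end in period.get(when.weekday(), []):
        if start <= when.time() <= end:
            return True
    return False


# Assumed local time somewhere inside the 1:30-2:00 AM window on Friday, 5 April 2013.
outage = datetime(2013, 4, 5, 1, 45)

print(would_notify(outage, WORK_HOURS))  # False: outside the assumed work hours
print(would_notify(outage, ALWAYS))      # True: a 24/7 period would have paged
```

With a work-hours-only period the alert is simply never sent, which is why a 24/7 period matters while the hardware is in a shaky state.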

I'm not entirely sure if it was a coincidence, a bit more disk failure, software failure, or maybe a hack gone wrong, but no logs of the problem were made, and the OS on the machine was definitely damaged in some way.

For those familiar with Linux, conventional System V style init uses a series of init scripts to handle the boot sequence and run level requests. For a still-undiscovered reason, not all of the daemons start on the initial run level. However, if SSH is adjusted to start immediately following networking, manually changing the run level to one of the other unoccupied multiuser run levels on Debian starts the remaining daemons and allows logins to work as normal from the console.
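To illustrate what "start immediately following networking" means, here is a small Python sketch of the classic SysV convention, where start links named `S` plus a two-digit sequence number run in sorted order and `K` links are kill links. The link names and numbers below are invented for the example and are not taken from Retro's server; newer Debian releases can also compute this ordering from dependency headers in the scripts, so treat this as a simplified model.

```python
def start_order(links):
    """Return service names in the order a SysV-style init would start them.

    Only links beginning with 'S' are start links; they run in sorted order
    of their names, so the two-digit sequence number decides the ordering.
    """
    starts = sorted(name for name in links if name.startswith("S"))
    return [name[3:] for name in starts]  # strip 'S' and the two digits


# Hypothetical run level directory contents, with ssh placed right after networking.
rc2_links = ["S20apache2", "S16ssh", "S15networking", "K01foo", "S18mysql"]

print(start_order(rc2_links))  # ['networking', 'ssh', 'mysql', 'apache2']
```

Giving SSH a low sequence number means remote logins come up even when something later in the list hangs or fails.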

I verified the drives with my personal copy of SpinRite, and they did turn up a couple more errors, but no outright failure. RAM tests also passed.

Some of Retro's ancillary public and private services have been disabled until the restore to the new drives and OS is complete. Additional backup plans are being carried out to make sure we're prepared for catastrophe. Expect a lot of slowdown in the next 24 hours. I will be setting my Nagios to report problems 24/7 until Retro goes back to normal.

New drives have been ordered, and we tentatively hope this process can get started by Wednesday night, but that's an estimate. Expect another full day's downtime at that time, but I will give full warning. I am going to fully test the new drives before putting them into service; I think it would be prudent to spare some time for confidence in their stability.

As far as I know at this time, no information was lost in this incident. A backup finished very close to the time of failure, and the backup run may itself have been a root cause.

Repo, FTP, the Scans service, and a couple of underutilized user accounts have been disabled for security, and will reappear later this evening or some time tomorrow.

I'm willing to accept questions in this thread.

Thanks for your understanding.

#2 Scarred Sun

Posted 05 April 2013 - 09:41 PM

  • In Defense of Lost Causes
  • Posts: 3793
  • Joined: 06-February 05
  • Gender:Female
  • Location:SD/LA/SF
  • Project:Staying woke
  • Wiki edits:36,091
What he said. :v:

In all seriousness, we had planned to replace these drives in a couple of weeks; we're just going to have to do so a little earlier than planned. It doesn't look like we lost anything in the backup, and we should be running at full steam by Friday at the absolute latest. I've also had SeanieB make an offsite backup in the event that one of our tertiary backups fails.

So all in all, good to go.

#3 JaxTH

Posted 05 April 2013 - 11:31 PM

  • Pudding Deity
  • Posts: 8308
  • Joined: 29-March 10
  • Gender:Male
  • Location:Los Angeles
  • Project:Jack shit.
  • Wiki edits:2

Scarred Sun, on 05 April 2013 - 09:41 PM, said:

In all seriousness, we had planned to replace these drives in a couple of weeks

That is what makes it hilarious: we were just talking about it in #retro mere days ago! :v:

#4 GeneHF

Posted 06 April 2013 - 01:09 AM

  • SEGA-ier than you'll potentially ever be.
  • Posts: 8384
  • Joined: 16-May 04
  • Gender:Male
  • Location:Scenic Studiopolis
  • Project:Complete Global Conquest
  • Wiki edits:381
So, as some of you clever girls pointed out in IRC when they first went down, yes, we "forgot to blow in the carts."

But now we're cleaning the cart with alcohol and putting in a new pin slot for an even more reliable Retro and Segabits. :eng101:

#5 Josh

Posted 06 April 2013 - 02:05 AM

  • TGC Warrior
  • Posts: 1393
  • Joined: 22-January 08
  • Gender:Male
  • Location:USA
  • Wiki edits:1
Aye, excellent job getting things back up and running. It's very, very much appreciated, and I thank you all for keeping things together.

Posted Image

Do you need one of these? I think I still have one somewhere.
