This morning, while listening to the podcast of yesterday’s broadcast of the Marketplace radio program on NPR, I heard the following during the “Datebook for July 31, 2009” segment of the program.
And bring some goodies for the IT department. You need those folks for the health of your server, firewall and computer stuff. It’s System Administrator Appreciation Day.
Considering that I was in the office until 2 AM last night, I find it particularly serendipitous. If you’re a system administrator, you already know (too well) the propensity for late nights and working weekends. For those of you not familiar, here’s a quick synopsis of my evening’s circumstances to help you understand how it all went down.
Yesterday afternoon, my office lost power when a car crash brought down a neighborhood utility pole. As we waited for utility power to be restored, our server room’s UPS kept battery power flowing to the network, the wireless access points, and the servers. The overhead lights were out, but the Internet was still on.
When the UPS reported 25 minutes of power remaining, we decided to shut down the non-critical servers. At 10 minutes of power remaining, we began to shut down all critical servers. Unfortunately, it was not enough. About 40 minutes after the outage started, all remaining equipment went dark. When the power was restored another 20 minutes after that, we discovered that the Cisco Catalyst switch was dead. The supervisor card on the Cat was unable to boot the switch properly, and it would be 4 hours before a replacement would be available on-site.
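For the curious, the staged shutdown we followed can be sketched roughly as below. This is only an illustration, not the tooling we actually used: the 25- and 10-minute thresholds come from the evening described above, while the `Server` class, the `servers_to_shut_down` function, and the server names are made up for the example.

```python
from dataclasses import dataclass

# Thresholds from the story above; everything else here is hypothetical.
NONCRITICAL_THRESHOLD_MIN = 25  # start shutting down non-critical servers
CRITICAL_THRESHOLD_MIN = 10     # start shutting down critical servers too


@dataclass
class Server:
    name: str
    critical: bool


def servers_to_shut_down(servers, minutes_remaining):
    """Return the servers that should be powered off at this UPS runtime estimate."""
    if minutes_remaining <= CRITICAL_THRESHOLD_MIN:
        # Battery is nearly exhausted: everything goes down.
        return list(servers)
    if minutes_remaining <= NONCRITICAL_THRESHOLD_MIN:
        # Shed load early by dropping anything non-critical.
        return [s for s in servers if not s.critical]
    # Plenty of runtime left: keep everything running.
    return []


# Example inventory (names are invented for illustration).
inventory = [Server("mail", critical=True), Server("build", critical=False)]
for minutes in (30, 25, 10):
    names = [s.name for s in servers_to_shut_down(inventory, minutes)]
    print(minutes, names)
```

As our evening showed, thresholds like these only help if a clean shutdown can finish within the runtime that is actually left, so they are worth revisiting after any real outage.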
Some of you might point out that we could have designed better redundancy into our office network. And yes, we could have. But it all comes down to balancing cost and risk. All of our production data centers (i.e., live website presence) are configured with redundant sets of A/B switches. However, our office network is not. A production site outage can potentially impact our website revenue to the tune of $1,000,000 or more, depending on the timing and duration. An office outage, on the other hand, represents lost productivity that is far less costly, by an order of magnitude. For this reason, we do not have a redundant Catalyst switch in the office, but we do have a Cisco support contract with a 4-hour response time.
While waiting for the new supervisor card to arrive, we began bypassing the larger Catalyst switch with a smaller 24-port switch, patching in the most critical servers (e.g., email, VPN, phones, and various other backend processing). The new card arrived later that evening, and we then began getting the larger Catalyst back online and reconfigured. We eventually moved all the bypassed servers and devices back to the original switch, brought up all remaining servers and storage, and fixed all the other little things that break during an unplanned outage.
When it was all done, the time was 2 AM.
I’m not complaining about my evening, because it’s what I do. It’s my job, and after all, I love my job. But it is the nature of the work. And I hope this story helps you to gain a little insight into what we system administrators do.
So, on the last Friday in July, put a smile on a system administrator’s face, and send a brief mention of thanks or appreciation for all the work they do.