Today’s a no-deploy Friday for me, like it is for many. However, also like many others, here I am deploying things. Small, minor things, but it would ruin my weekend if they broke anyway. Sometimes the worst does happen and we break things. Don’t worry, we’re professionals!
So, what happens if you do break something? First, don’t panic. Everyone’s broken something before, and that includes everyone above you in the food chain. The second step is to notify those above you according to your internal processes. In most cases, that means stopping what you are doing and giving your boss a paragraph summary of the issue, what it affects, and what you’re doing about it, then getting back to work. Third, don’t panic! I know I already said that, but since you’ve now gone and told your boss, they may have induced some panic – let it pass. The only way you’ll recover is if you don’t panic. Breathe.
Fourth, fix it! Use your mind to decide what was supposed to happen, what you did, and where things went wrong. Identify the steps required to either back things out or repair the situation so you can proceed. Document the steps and follow them. If you have a maintenance window you are operating under, put some time estimates down and set an alarm for when you need to make the go/no-go call. Though the situation is urgent, taking a few moments now to prepare will make you more efficient as you proceed. Give your management chain short updates throughout the event until it is cleared, and don’t let rising panic get to you.
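To make the go/no-go alarm concrete, here is a minimal sketch of the arithmetic. Everything in it is an assumption for illustration: the function names, the 15-minute safety margin, and the example date and times are hypothetical, so substitute your own window and estimates.

```python
from datetime import datetime, timedelta


def go_no_go_deadline(window_end, rollback_estimate,
                      safety_margin=timedelta(minutes=15)):
    """Latest moment you can decide to back out and still finish in the window.

    The 15-minute default margin is an arbitrary, hypothetical cushion.
    """
    return window_end - rollback_estimate - safety_margin


def should_back_out(now, window_end, rollback_estimate,
                    safety_margin=timedelta(minutes=15)):
    """True once the clock has reached the go/no-go deadline."""
    return now >= go_no_go_deadline(window_end, rollback_estimate, safety_margin)


if __name__ == "__main__":
    # Hypothetical window ending at 8:30PM with a one-hour rollback estimate.
    window_end = datetime(2024, 1, 5, 20, 30)
    rollback = timedelta(hours=1)
    deadline = go_no_go_deadline(window_end, rollback)
    print("Set the alarm for", deadline.strftime("%I:%M%p"))
```

The point is less the code than the habit: write the deadline down before you start fixing things, so the decision to back out is made by the clock and not by rising panic.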
I love the overnight shift!
Above, I said that everyone has broken something. I’d like to share two stories of some of my most significant failures so you can see that mistakes aren’t only natural, but are learning experiences.
The first situation was in 2001. My boss and I were doing an upgrade to a Novell Netware 5.0 server, either to 5.1 or some patch for 5.0. We had planned the window a few days earlier, had made a checklist, read the release notes, opened a pre-emptive support case with Novell, etc. – everything you’re supposed to do to prepare for such a significant change. We started the upgrade at 5:30PM on the dot as per the maintenance notice provided to the users. Up popped the first screen – but it wasn’t what the release notes said would happen. It was some warning about something something could do something to the something else and maybe you’d like to… and I have no idea what else it said, because I gleefully mashed the ‘Y’ key to continue.
My boss’s sole acknowledgement of my impertinence was to turn his head slowly toward me and say, “I was reading that.”
We turned back to the server and stared at it for a few minutes, willing the dots to move faster across the screen. When the whole screen was full of dots and there was no sign of when it would finish, we stepped out and went back to our PCs to surf the internet, checking back every 10 minutes. At around 6:30PM, it finished! There was no warning text this time, so I gleefully mashed the ‘Y’ key to reboot the server.
The only problem is, Netware never booted. I don’t remember if it hung or if some error finally displayed, but there was no joy to be found. I think we tried a second reboot and probably a cold boot, but by 7:00PM it was clear there was no easy fix to be had. Our window was supposed to end at 8:30PM.
“I told you that I was reading that screen,” my boss said as he started putting on his coat. “It was warning us about something, but we don’t know what because you clicked past it. We shouldn’t have done this, but you decided you really wanted to, so I guess you’ll have it fixed by the time I come in tomorrow, or you won’t come in at all.” Then he was gone.
The next 9 hours of my life went by in a blur. Aside from calling my wife and relating what my boss said, I don’t know what I did exactly. But by 4AM, the server was booted and it had the latest patch. I went home, got some sleep, and rolled in around 10AM with shadows under my eyes. No one said anything about my late arrival, and my boss never said a word to me about my decision again. I remained there for a few years and received great ratings the entire time.
My inbox is empty. That never happened before!
A few years later, we were moving, so I put in my two weeks’ notice. On my last Friday at the company, I decided to help my coworkers by running the email maintenance program. We were using a DOS-based predecessor to GroupWise and it was now 2002. A lot of users had large mailboxes (some of you may see where this is going already) and sometimes things got slow. I kicked off the maintenance program and walked away.
A couple of minutes later, the phones started ringing, with a common theme. People were having issues with their email. Suddenly, a light bulb went on in my head. I ran to the console where the maintenance program was running and started mashing Ctrl-C as hard as I could. See, there was a slight problem with this old DOS-based application: It didn’t like mailboxes over 2G in size. The email client ran fine, but that was it. The maintenance script would chug along on mailboxes just fine, cleaning up broken references and reclaiming empty space from each mailbox’s database file, until it ran into a database file over 2G. Once that happened, you were in uncharted waters since the program didn’t support that. Here be dragons! Sometimes it would just muck up the database and clean some extra emails; other times, it would delete EVERYTHING. I never saw such pristine inboxes! The worst part is … this was the second time we encountered it. I had just forgotten about it.
Thankfully, we had great backups and I caught the program relatively early, meaning we didn’t have to restore that much – and the tapes were still on-site so we didn’t have to wait, either. Later in the day, the big boss called me into his office to give me my recommendation letters. I must have been nervous, because he asked me to review it and I told him it looked great without really reading it. He asked me a few times if I was sure before letting out a big belly laugh. He had added a little flavor to his recommendation. It took a little bit before he could compose himself and hand me the real letters 🙂
As you can see, I still keep the letter around. It’s hilarious (now!), but it’s also a great reminder to pay more attention to what you’re doing, especially on your last day at a company! These two learning experiences taught me not only how to better plan to avoid such “teachable moments”, but also that I could recover from them with confidence.
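For the mailbox story, one way to plan around a known trap like that is a pre-flight check that refuses to start while any file is over the limit. This is only a sketch under stated assumptions: the function name and directory layout are hypothetical, and I'm assuming "2G" means 2 GiB, which the old program's documentation (not my memory) would have to confirm.

```python
from pathlib import Path

# Assumption: the "2G" limit is 2 GiB (2 * 1024**3 bytes).
LIMIT_BYTES = 2 * 1024**3


def oversized_mailboxes(mail_dir, limit=LIMIT_BYTES):
    """Return the files under mail_dir that are larger than limit, sorted.

    mail_dir is a hypothetical directory holding one database file per mailbox.
    """
    root = Path(mail_dir)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )


def safe_to_run_maintenance(mail_dir, limit=LIMIT_BYTES):
    """True only when no mailbox file exceeds the limit."""
    return not oversized_mailboxes(mail_dir, limit)
```

Wire a check like this in front of the maintenance job and the lesson no longer depends on someone remembering it on their last day.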
Please, share your own “learning experiences” in the comments!
Anyone who says that they haven’t screwed up like this is either lying or not telling the truth. After spending the better part of a day setting up a vote tabulation environment, my coworker asked me to format a floppy drive from the master server. This is a master server that took about 4 to 6 hours to set up and configure. I typed and I sat back and watched, wondering why it was taking so long. That was a long night.
Just as long as the night where I was upgrading a Windows NT server HAL to support multiple processors, but I didn’t have the correct HAL and it rendered the system unrecoverable. It was in the middle of a snowstorm and I was about 60 miles from home. The snow didn’t matter because by the time I finished up at 7AM the storm had finished and the roads were plowed anyway.
My command was deleted from the last message. Basically, I formatted the C: drive instead of the A: drive and watched the entire server get blown away.
I think I speak for all of us when I say, “Ooops!”
I’m fairly sure the statute of limitations hasn’t expired on my worst blunder.