I manage a small IT infrastructure team. We support, among other users, a development team that’s responsible for the application and database code that’s our primary product offering.
Due to a minor screwup earlier today, it was requested that we perform a full restore of their development server to last night’s backup. This server contains working copies of all of the stuff that’s at the core of the business, plus the version control system that keeps track of changes to those working files.
Normally I’d foist a job like this off onto my staff, but I have a couple of new people on the team and decided I’d better run through the procedure with them looking over my shoulder.
So I logged in to the storage system and pointed to the spot where we keep the nightly snapshots of the development system.
“See these snapshot files? With the names like dev20131201, 02, 03? We want dev20131204. I’m gonna select it and hit restore.”
And I did. The system flashed a warning message on the screen:
“All changes made after this snapshot copy was created will be lost.”
“No biggie,” I said. “Those are the changes we want to blow away.” I clicked OK.
Except those were not the only changes the system was referring to. It was also talking about destroying every snapshot taken after the one I had selected to restore from.
Of course, since I picked last night’s snapshot, there weren’t any subsequent snapshots to worry about, right?
Wrong. I hadn’t picked last night, 20131204. In my distraction while explaining the process to the team, I picked 20131104 - last month’s.
The system happily reset itself to a month ago, destroying any work that had been performed since. Eight developers. One month’s worth of effort. And all the backups of it.
Now, we might be able to recover some of this. People have local working copies of the stuff that’s in active development, and we have a good record of changes that have been deployed to production in the interim.
There’s also the fact that this system was set up exactly as my peer, the Development manager, specified. We asked him about offsite backups, etc., etc., and he didn’t think we needed any (because nobody anticipated anyone would be so stupid as to blow away a month’s worth of work).
And he has a habit of being a bit run-and-gun with his change management process - his team doesn’t log their change requests in any centralized control system, instead relying on CVS to be self-documenting.
And I’ve been the strongest advocate in the company against putting all of our eggs in one basket this way, thanks to previous screwups (mostly not my team’s fault) that caused actual customer-impacting problems. For my trouble I’ve been met with nothing but resistance from the same peer, who wants everything wrapped up in one nice, neat, monolithic, fragile package.
But still, I fucked up a month’s worth of productivity for eight people thanks to two seconds of inattention. So we’re going to spend probably the next week painfully retracing our steps for the past month before we can even start recreating any of the work. Thanks to a dumb cowboy mistake.
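In hindsight, even a trivial guard script that refuses to restore anything but the newest snapshot without making you retype its name would have caught this. Here’s a rough sketch in Python (the storage path and the devYYYYMMDD naming are stand-ins, not our actual setup):

    #!/usr/bin/env python3
    # Sanity check before a snapshot restore: warn if the target isn't the
    # newest snapshot, and make the operator retype the name instead of
    # just clicking OK. Path and naming convention below are hypothetical.
    import sys
    from pathlib import Path

    SNAPSHOT_DIR = Path("/storage/snapshots")  # hypothetical location

    def newest_snapshot():
        # devYYYYMMDD names sort lexicographically, so the last one is newest.
        snaps = sorted(p.name for p in SNAPSHOT_DIR.glob("dev????????"))
        return snaps[-1] if snaps else None

    def confirm_restore(target):
        latest = newest_snapshot()
        if target != latest:
            print(f"WARNING: {target} is NOT the newest snapshot ({latest}).")
            print("Restoring it will destroy every snapshot taken after it.")
        typed = input(f"Retype the snapshot name to confirm restoring {target}: ")
        return typed == target

    if __name__ == "__main__":
        target = sys.argv[1] if len(sys.argv) > 1 else ""
        if confirm_restore(target):
            print(f"OK, restoring {target} (actual restore command goes here).")
        else:
            print("Names didn't match; nothing restored.")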
I just wrote my apology speech, accepting full responsibility - none of that Mistakes Were Made shit. I called the CEO and delivered the short version over the phone (he seemed both pissed and bemused, which is good), and I’m going to deliver the full version to both teams tomorrow before we get down to business.
But first, I’m gonna drink myself to sleep and take comfort in your fuckups. Please share.