Today I royally fucked up at work. Commiserate with me

I’m sure you know this already (and if you didn’t before, you certainly do now): there is ***never*** a substitute for off-site backups.
So far, I’ve been lucky; the worst thing I’ve *personally* been responsible for is accidentally pulling the plug on our Exchange server…but there’s always tomorrow…

No, I am not dead, nor do I write SQL, but I did forget to either write, or to write clearly enough, the equivalent part in the functional specs.

It’s a good thing I’m in a line of work where SOP is to test everything to death, then resurrect it and test some more.

One with backups, not mine:
Back in the Pleistocene, we had these VAX boxes at work, and this boss who kept changing the settings for the backup tapes. For some reason he’d have a different person run backups every time and give different settings every time. After a few months we glommed onto this and started writing the settings used on the tape labels.

At one point Carlos needed to recover some files. He located the appropriate tape, stuck it in, wrote the read command appropriate for the settings listed on the tape. The VAX paid about as much attention as if we’d started singing lullabies. Well, strictly speaking, it gave some sort of message, but it didn’t read the tape. After a few tries, Carlos went to the boss’ office and came back looking like the sad smiley’s pink brother. Bossman had handed that particular tape to the newest person a couple of weeks back “to teach her how to run backups”: not only was the original backup overwritten, but the settings on that tape did not match its label any more. And since Bossman had been overwriting backup tapes for several weeks, who knew what was where any more.

Several of us happened to have partial backups, as we didn’t trust bossman further than we could throw him. Our coworkers didn’t laugh at our “paranoia” again and bought zip disks of their own (kept offsite, of course, with the reader in a drawer).

My only fuckup is entrusting critical work tasks to idiots who don’t know how to make & restore backups, write SQL, or save their work. :smiley:

I realise the horse has bolted, so I’m not quite going to recommend shutting the stable door.

But, actually, I am. For next time. As part of your apology letter, include a plan on how this will be prevented from happening again in future.

It may be as simple as two people reading, repeating and confirming the meaning of the digits before and as the restore process is kicked off.
That’s what I always do when restoring a snapshot. I write the date of the version we’re looking for in words (“Test environment backup from 4th September, 2013”), and I also write the expected shortened name for the backup record (something like “Test_20130904”). I pass ONE of these descriptions to a colleague and ask him/her to construct the other; then we both check that we agreed on both values, and we talk each other through the restore process, confirming the meaning of each segment of the date identifier before committing.
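
That cross-check is easy to script as well. A minimal sketch in Python, assuming the naming scheme from the example above (a prefix plus YYYYMMDD); the function name and the “Test” prefix are hypothetical, not any particular shop’s convention:

```python
from datetime import datetime
import re

def short_name(words: str, prefix: str = "Test") -> str:
    """Derive the short backup-record name from the written-out date,
    so the two human-produced descriptions can be checked against
    each other mechanically."""
    # Strip ordinal suffixes ("4th" -> "4") before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", words)
    parsed = datetime.strptime(cleaned, "%d %B, %Y")
    return f"{prefix}_{parsed:%Y%m%d}"

print(short_name("4th September, 2013"))  # Test_20130904
```

Either way, the point is the same: two independently constructed descriptions of the backup have to agree before anything destructive runs.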

I created a database for a law firm, from scratch.

Two horrible things happened that turned out to be really good things.

Early in development (thankfully), I accidentally DELETED the most recent version of the database! Luckily, I backed it up about every three days, so I was only missing two days of data input (by about 30 temps who had been entering the data), and luckily they still all had the documents on their desks. Two days of work shot to hell, but the good news was that this just happened to be very slow days and there wasn’t all that much to re-enter.

After that, I started to do manual backups, a minimum of twice a day!

The other glitch was when I “fixed” the database and added a new feature, which promptly erased the original entire section of that part of the database. I still don’t know how the hell that happened - but because I did save the database twice daily, we only lost a half day of data entry - and it was a Friday afternoon, so once again - not a huge deal to re-enter.

And then the crash…the entire server went down in the law firm - NOT my fault - but that was where we stored the database and where their IT stored everything else.

However, because of my paranoia after the previous two disasters, not only did I continue to back up the database twice daily, but every other night I made a backup on my own portable hard drive! The night before the huge crash, I had saved everything to my external hard drive and - voila - my database was the only thing we could work on the following day - unscathed and fully intact!

Those first two disasters had scared the bejesus out of me, and I am thankful it did - or the database would have been toast.

rm -r /etc is not the same as rm -r etc
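
For anyone who hasn’t been bitten yet, the difference is purely path resolution: a relative path is resolved against the current working directory, while an absolute path ignores it entirely. A harmless Python illustration (no deleting involved):

```python
import os

# Pretend we're sitting in /usr when we type the command.
os.chdir("/usr")
print(os.path.abspath("etc"))   # /usr/etc -- relative: resolved against the cwd
print(os.path.abspath("/etc"))  # /etc     -- absolute: the cwd is ignored
```

Same three letters, very different target.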

Here’s one I didn’t cause. I was apparently the first to run into it and report it to our software vendor.

I was a Berkeley Unix BSD sysadmin for a smallish high-tech firm. (We made laser printers in the days when hardly anybody had ever heard of laser printers.) We obtained Berkeley 4.x BSD and maintenance (both software and hardware) from a company that was in the business of doing that.

One fine day we upgraded from 4.2 BSD to 4.3 BSD (circa 1986). The procedure was basically:
[ol]
[li]Do full disk-to-tape backup of entire 4.2 BSD system, other than the root stuff.[/li]
[li]Wipe the disk completely.[/li]
[li]Install 4.3 BSD system from scratch (from distribution tape).[/li]
[li]Do full filesystem restore of all the rest.[/li]
[li]Recommended: Run fsck (file system consistency check; like Microsoft’s Scandisk) to make sure everything got installed/restored properly.[/li]
[/ol]

In that upgrade, they did a major overhaul of the internals of the file system, including a lot of the data structures and layout of the data on the disk. Since the backup and restore programs worked on the low-level data, they had to be largely re-written as well.

We got all the way to step 5. Well, surprise, surprise: running fsck, we found that the restored filesystems were largely garbage.

Turned out, the 4.3 restore program couldn’t restore dumps written by the 4.2 backup program. Oops.

We managed to patch up the trashed file system and get it running, but it took us many hours, plus the assistance of a more experienced outside consultant than I, whom we had to hire and bring in just for the purpose. The problem was all with symbolic links, which were totally different in 4.3 – we ended up just deleting them all manually, which took a long time, as there were hundreds (thousands?) of them. Then anybody who wanted those links just had to re-create them as they needed them.

I can’t believe neither Marley nor Hal has made an appearance yet; I know they both have excellent stories they could contribute to this thread.

The day after 9/11, I was the opening CSR at Blockbuster, and for some reason that I can only ascribe to stress, I angrily blew off a customer when he quite correctly pointed out that we hadn’t turned the lights on yet.

I have no such excuse for the next story: I responded to a customer’s request for something (that would have violated company policy) with the sotto voce inquiry at her back, “You want to pay my bills?!” Which turned out to be not nearly as quiet as I’d hoped. Boy, did I catch it from her. Thankfully, a manager made sure I didn’t dig myself any deeper.

Oh for sure. I’m well practiced in how to give a meaningful apology.

Ugh. Reminds me of the time we upgraded from JFS to JFS2 without considering the implications for random I/O.

That’s an excellent question that I’m sure as hell going to ask both the vendor and the consultants who validated our configuration. At the very least, I want to know why the hell the warning message isn’t more explicit.

A chap on Twitter recently asked for “work screw-up” tales, and collated the best on a Storify page. Read 'em and chuckle.

Marley #7 and Hal #4A & 4B? Those are great stories.

I didn’t forget the WHERE. I simply fucked it up. :smack:

We had an auto-load program that read data from a vendor for one of our state databases. But one time they sent it as CSV rather than pipe-delimited, and our program simply ignored the commas and smushed the first four numeric data fields together. So the new bad records were created with record numbers like 1286868685586585865865865861212331.

No problem for me, though. Simply delete all records with a record # greater than a billion; so easy I didn’t need to test, just throw it in the midnight cron and head off home Friday night.

So boys and girls, can anybody tell me what happens when you fat-finger < instead of >, but don’t notice? That’s right. You get a frantic call from the night support at 12:15 AM wondering why nobody can find Minnesota anymore, except a couple dozen screwed-up records.

I too developed the habit of writing the WHERE clause before the UPDATE statement.
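
A sketch of that habit against a throwaway SQLite table; the table name, the column names, and the billion-record cutoff are illustrative (loosely borrowed from the story above), not anyone’s actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, state TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(1, "MN"), (2, "MN"), (1286868685586585, "MN")])

# Step 1: write the WHERE clause first, and run it as a SELECT
# to see exactly which rows it matches.
where = "WHERE id > 1000000000"
matched = conn.execute(f"SELECT COUNT(*) FROM records {where}").fetchone()[0]
print(matched)  # 1 -- only the smushed junk record, as intended

# Step 2: only then attach the same clause to the destructive statement.
conn.execute(f"DELETE FROM records {where}")
remaining = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(remaining)  # 2 -- the real rows survive
```

If the SELECT in step 1 had matched every row in Minnesota, you’d find out at your desk instead of at 12:15 AM.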

Long ago, when the world was still DOS with nary a Window in sight, I was working at a large Georgia bank. I was just getting good enough with DOS commands to be comfortable with my work. I needed to delete a bunch of files from a particular directory on a hard drive. Because I’m brilliant, I had used a naming convention where each file started with the letter “O”. One simple DOS command would get rid of all of them (del c:\drctrynm\O*.* - delete every file in this directory that starts with the letter “O”). I typed it and hit <enter>. Only I forgot to type the directory. I entered “del c:\O*.*” (delete every file starting with “O” from the entire hard drive). On the computer that stored collection attempts for every overdrawn account for the previous 5 years. Guess what letter the overdraft files started with?

Fortunately, someone in IT had one of the first copies of Norton Utilities in the bank. We got back almost all of the files. My boss was livid, but couldn’t do much to me because he had to explain to his boss why he had not kept backups of such critical files. If you had an account closed for overdraft at a Georgia bank in the 80’s and said bank suddenly stopped trying to collect - you’re welcome.

cough

Oh yeah. Once. Now whenever I do an update I include a ridiculous number of WHERE conditions so that I’m guaranteed to only update what I want. I even do a select with the same conditions just to double check.

I also updated the Production database once when I thought I was in Development. Oops!

Isn’t that the quintessential typo?

Or, maybe the root typo, from which all others are descended?

I once sold a $600,000 mortgage for $60,000. Thankfully, it only cost us a couple thousand dollars to get fixed, but that was a “so should I pack up my stuff now or at the end of the day?” moment for me.

Uncommitted transactions don’t block queries on Oracle. They do block queries by default on MS SQL Server but you can fix that by turning on Read Committed Snapshot Isolation. I like Read Committed Snapshot Isolation because of this and several other issues it fixes.