Today I royally fucked up at work. Commiserate with me

Raises hand. Funny thing is I still get that blood-rushing-from-my-face feeling when executing commands that update large amounts of data, even if they go off without a hitch.

I have a story very similar to the OP’s, from back in my early twenties when I was a makeshift system admin with zero experience for a development shop. I had inherited a clunky database server, where all of our files were pointlessly hosted on a RAID5 array, and a cheap tape backup server that never quite worked right. My first job was to set up a nightly backup from the RAID array to the tape drive. I spent a lot of time setting it all up, documenting all the backup and recovery procedures, and even running some drills. We rotated tapes, including offsite storage. Once I was satisfied with the process, I promptly put it all in the back of my mind, until about a year later when the RAID array suddenly gave up the ghost. After a day spent rebuilding the array, I reached for the latest backup, only to discover, to my horror, that the cron job that created the backup files had stopped executing months prior. We’d been dutifully taking our tape backups offsite for months, without ever checking that the files we were copying to them weren’t three months old.
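The galling part, in hindsight, is how little it would have taken to catch: a freshness check in front of the tape job, something like this (the paths and address are made up, obviously):

# refuse to cut a tape if the newest backup file is more than a day old
if [ -z "$(find /backups -name '*.dump' -mtime -1)" ]; then
    echo "Nightly backup looks stale - aborting tape run" | mail -s "BACKUP STALE" admin@example.com
    exit 1
fi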

Somehow I didn’t get fired for that. But what a learning experience.

I forgot to cancel an advertising schedule with a television station once and the entire thing ran. I had been told to cancel it but just missed doing so for that station, and my agency had to eat all $40k of the schedule since the client refused to pay for something they’d directed us to cancel. Somehow kept my job.

Environmental consultant here. I still recall the time I did a soil sampling project for an electric utility at a couple of their equipment storage yards, years ago. I collected the samples and sent them to the lab to be analyzed for petroleum hydrocarbons, basically to see if any of their gear had leaked oil onto the ground. Got the results back, prepared my report, hand-delivered the report to their office, sent them the bill.

A week or so later I got a call from one of their attorneys, politely asking me why the samples hadn’t also been analyzed for PCBs. I confidently said something about how the petroleum hydrocarbon results would be sufficient to indicate whether anything had leaked, and I didn’t want them to have to pay for the expensive PCB analysis as well. He directed me to the relevant section of the state’s legal code where it specifies that PCB analysis is required for this type of investigation. With quite a bit less confidence I said, and if memory serves me correctly, this is an exact quote: “Ummmm… <gulp>.”

All in all, I decided, it seemed rather reasonable of them to expect that their environmental consultant would be up to speed on the requirements of the environmental investigation they needed.

Well, that’s one client that never called us back for another job. The best part is that, lo these many years later, I still drive by the office building to which I delivered the report at least once a week, and every time I see it I get to think “Hey, there’s that client whose project I fucked up!”

I’m a DBA on Oracle databases. I’ve screwed up a few times, but nothing too bad. Probably the worst was shutting down the production database instance when I thought I was on the customer testing database. I was able to get it up and running in a couple of minutes.

On production we have multiple backups. We use Data Guard to update our alternate server, which is located about 400 miles away. The most data we should ever lose is about 30 minutes’ worth. I also export a dump (exp dmp) file of all the production data to another server and my PC every night. Lastly, we do a full tape backup of the production server every Sunday morning and rotate out the tapes.
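The nightly dump is nothing fancy, just a cron job wrapped around Oracle’s exp utility. Something along these lines (the SID, paths, and schedule here are placeholders, and the password is masked):

# hypothetical 2 AM full export; the .dmp file gets copied to the other server and my PC afterwards
0 2 * * * exp system/********@PROD full=y file=/backups/prod_full.dmp log=/backups/prod_full.log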

We also do a tape backup of our Customer Testing, Testing, and Development servers every Sunday. We have a strong configuration control system so if something happens to the testing servers, we should be able to get them back up to speed pretty quickly with all recent changes.

Now the development server is a different beast. It’s not under CM control and the developers can do what they want there. It’s up to the developers to keep track of what they do and keep backups of their most recent work. For example, if I refresh the database with production data, any changes they made, such as adding a field to a table, get overwritten. It’s up to them to put things back to how they want them. If a developer asks me to refresh the data, then all the developers have to agree before I do it. It’s fun to watch developers bitch at each other!

Back when the air traffic controllers went on strike (1981?), I had just started a new job with a law firm representing the SEATAC controllers.

I was typing a huge brief on a DEC, a dedicated word processor, state of the art then, first time I’d used one – it had a system disk and another disk for data. One of the attorneys had dictated the brief, which he needed to file that day. (He always waited until the last minute.)

Anyway, got it all typed with time to spare but screwed up the Save and poof, it was gone. I got it retyped in time for filing but with no time for proofreading.

And in the same case, the same last-minute attorney had me send out notices to the controllers about a meeting a couple days away. I mailed the notices but neglected the postage.

In the same office, I deleted everything from a co-worker’s hard drive. She’d been complaining about unsorted files on her PC, and since I’d just taken a DOS class I decided to organize it for her.

All this happened early on, but they kept me for eight years.

Not me but a co-worker. He changed the latitude and longitude of every airport in the world to the same one. We would figure out mileage and fuel costs between airports. Suddenly they were all coming back as 0. :confused:

I’ll admit to making a data change but not issuing a commit statement afterwards. About a day later, someone complains that the data didn’t get changed.
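For the non-database folks, the gist (the login, table, and column below are all made up) is that nobody else can see the change until you explicitly commit it:

sqlplus -s scott/tiger@PROD <<EOF
UPDATE customer SET status = 'ACTIVE' WHERE customer_id = 42;
-- forget this next line in an interactive session and everyone else keeps reading the old data,
-- while the rows you touched stay locked
COMMIT;
EXIT;
EOF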

I do website support and got a real cock on the phone today. When I told him he would have to register at our website before we could see if the function he wanted would be supported, he complained about how his time was too valuable to be wasted before launching into a long-winded tirade telling me how to do my job (and also referenced his title of “Doctor”–typical self-important prick). I told him, “Well sir, if you know how to do my job better than I do, maybe you should come work here.” :eek: Kinda surprised myself.

I’ll probably hear some blowback next week, but nothing serious. Our calls aren’t recorded, but we don’t have to take shit from assholes. I still shouldn’t have said it, but I stand by it.

I did something similar at one of those traffic safety sessions when the instructor, in an effort to keep all us retirees from falling asleep, asked, “What are some of the things that could cause distracted driving?”

I suggested “insects in the car,” thinking of a recent bee incident.

There was uproarious laughter and people turning and staring at me.

“It certainly could!” said the instructor.

I was clueless until DH whispered in my ear what people had heard.

I’m sorry black rabbit. Time heals. Pats br on shoulder

I worked on the team that wrote that feature circa 2003 when the company was still LSI. I may have actually written the code that you used to destroy yourself. So… sorry? :frowning:

Also, what an amazingly small world!

(Edited to add: A former coworker who also reads The Dope brought this thread to my attention with the comment that “your code executed when this guy effed himself.”)

Did you also write the code that enables aggregate-level snapshots by default? Because after I kick your ass, I’d like to buy you a drink.

We fixed it, thanks to a mostly undocumented and little-used feature that creates one more low-level, short-term backup of the entire disk aggregate and is fortuitously enabled by default. In the end, we only lost about two hours’ worth of changes and a half-day of developers sitting around with their thumbs up their asses.

It would have been nice if the vendor had told me about that little feature when I called them in a panic last night. It would probably have saved me about four glasses of bourbon and twelve hours of chain smoking.

Whew.

With the caveats that:
* My recollections are dim because I left the company many, many years ago
* I’m being deliberately vague to avoid issues of proprietary knowledge
…given sequential snapshots A, B, and C, and the current time T, snapshots B and C lose meaning at the moment you restore to snapshot A. Once you’ve restored to A, you couldn’t do anything with B and C even if you had them.

Snapshots aren’t copies of your data at times A, B, or C; they’re a mechanism for describing the way your data has changed between A and T or B and T or C and T. And I can’t conjecture on warning messages; that’s the UX guys.

Nope, but I’m always open to free drinks. Glad it worked out for you.

Not really in line with the OP, but I sorta-kinda shot myself in the ass today.

Very long story, I don’t want to be too specific, but I got irritated by a situation in my town and poked a number of civic, administrative and business anthills. As a result, there will be a meeting of the town council, two town commissions, a special town corporation and several large businesses, along with assorted hangers-on and mover-shakers. And me.

When it all hit me, the feeling of satisfaction for having gotten this huge ball rolling quickly turned into, “Oh, shit, now I have to run and catch it.”

Years ago I worked at a small graphics company that had a small staff of in-house programmers who maintained a custom photo-retouching system for a team of in-house artists.

(It was kind of like Photoshop five years before desktop systems were powerful enough for that sort of thing. Imagine Photoshop running on rack-mounted minicomputers with custom hardware.)

We were each responsible for running our own back-ups. One guy (not me) didn’t. For about six months.

And then one day…

They couldn’t fire him because he was the only one who could recreate what was lost.

The scary thing was that the entire company depended on our photo-retouching system staying up and running. So for several months we were locked into running a version of the system for which the source code no longer existed. If anything had gone wrong – a major crash, corrupted data – there was literally no way for us to fix it.

Early ’90s, I’m running a mutual fund system on a VAX. We do before & after batch backups. We do the pre-batch backup & are running the nightly processing when BOOM, the disk crashes… HARD! It’ll take a few days to get a replacement from East Jabip or whatever distant locale it’s coming from, so we declare an emergency, take the backups & head off to the DR site & spend all night doing a restore. A few days in, I create a dividend tape that I now need to drive 50 miles to our office for posting to the mainframe, when the lightbulb goes off: we had already created the extract before the crash & then created a new one when we were up & running at the DR site, so we double-posted about 100 different dividends in 30,000+ accounts. Oh yeah, I need to add that the next day is month-end, when we produce statements!
I call the CIO, who is well known for screaming into his speakerphone using language that’ll make a sailor seem angelic. “What? I got bigger problems than your F$%&ing Piece o’ #!+ system!” I tell him what happened & there’s 5-10 seconds of silence (while it sinks in). “You’re right. Your F%&ing problem is F$%&ing bigger; get the F$%& in here F$%&ing NOW!”

Once the dust settled, the disaster plan was amended.

I would like to let you know that at least one person in this IT heavy thread appreciates what an incredibly bone-headed move that was.

(Not as bad as the poor bastard who ordered an unrequested PCB test as part of a major remediation project - that came back positive…)

Once our customer service team pulled me in to help them with an issue at a customer site. It turns out that this customer had upgraded the software on their system, which included an OS upgrade, but after the upgrade the system would not boot. They got the customer to boot the machine from a USB disk so that I could take a look.

It ended up being a problem that I’d seen before. It turns out that our imaging process in our manufacturing environment did not correctly wipe the drives before partitioning them and installing the OS. The problem was that if the disk happened to have GPT metadata on it, that metadata would survive the imaging process, but it would of course be complete garbage. We were still using MBR partitions for most things for various lame reasons that I don’t need to get into. GPT is actually designed to co-exist somewhat with BIOSes and other tools that only understand MBR, so the MBR metadata does not overlap with the GPT metadata. This meant that when we partitioned with MBR, the GPT metadata was still present on the disk.

As it happens, the version of the OS that we shipped on the box had GPT support disabled, so the system would boot just fine. But when they upgraded the OS in the field they got an OS that had GPT enabled, and it got completely confused by the incorrect GPT metadata and couldn’t mount the disk. I typed the following command onto the machine, intending to wipe out the stray GPT header:


dd if=/dev/zero of=/dev/da0 bs=512 oseek=1

I asked the other engineer on the call to confirm that this would do what I wanted, and he agreed, so I ran it.

Almost immediately I got an uneasy feeling, as the command did not complete immediately as it should have. I’m sure that many people in this thread can relate to the terror of seeing a destructive command take a lot longer than it should, implying that it’s changing a lot more data than it should.

Where did I go wrong? dd is a Unix command used to copy data around in blocks. if=/dev/zero tells it to read from a special file that contains all zeros. of=/dev/da0 selected the disk that I wanted to write to. bs=512 set it to copy data in blocks of 512 bytes, which is the block size on most hard drives. oseek=1 tells it to skip over the first block on the disk (that first block being the MBR metadata that I needed to keep).

What I forgot to say was count=1, which would have told it to copy only one block of data. Instead it was copying until it hit the end of the disk – in other words, I had blanked out all of this customer’s data.
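For the record, the command I should have typed was the same one with a single extra operand, limiting the write to one 512-byte block. A read-only peek at the stray header first wouldn’t have hurt either, since a GPT header at LBA 1 starts with the ASCII signature “EFI PART”. Using the same BSD-style dd operands, something like:

# read-only sanity check: prints "EFI PART" if a stray GPT header is sitting at LBA 1
dd if=/dev/da0 bs=512 iseek=1 count=1 2>/dev/null | head -c 8

# the fix, wiping only that one block instead of the whole disk
dd if=/dev/zero of=/dev/da0 bs=512 oseek=1 count=1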

Oops.

Customer service screw-up, not IT, but OMG my face still burns when I think of it.

A few years ago I was working for a dive operator - a very busy one in a resort town. Christmas time here is JAM PACKED. A group that I’d had many times before made a reservation way back over the summer, then changed the hotel they wanted, and I could have SWORN I’d made the change, sent a confirmation, etc. Unfortunately I’d canceled the first one but did not reserve the second.

Christmas eve rolls around and I’m taking a nap and anticipating a nice evening with my new boyfriend watching movies and having take out and I wake to a FURIOUS voicemail from my boss; the group is here (about 20 people) and the hotel they are supposed to be at has no record of them! WTF and curse curse curse and you better call right now and FIX THIS. Well, what on earth could I do?? You can’t get blood from a stone and you can’t invent rooms in sold out hotels. We had the idea to split the group onto several properties but a few of them brought their dogs with them thinking they were going to the animal friendly condos they reserved.

What a nightmare. Luckily the owner (much more level headed than my immediate boss) realized that I had never, ever, ever made a mistake like this before and we just had to FIX IT because seriously, what were they going to do, turn around and drive back to NC? We found a property nearby that had rooms at about 3x the cost (suites) that they agreed to share/split (only paying what they agreed to, we ate the rest of the cost) and after some phone calls the owner got the property owner to cut him a much better deal; in the end we came out about even rather than out thousands of dollars, thank God.

All I can say is that I double/triple check things now even when I am SURE I did them and I realized right then that I don’t have to work for someone who gives me an abusive tirade when I make a mistake no matter how big it is. I stopped working there that spring because I never got over being screamed at like that. It was so awful when I realized what had happened AND that the person that I usually worked best with had turned against me and was so crazy upset. It actually ruined my Christmas completely, I was so upset at the mistake and the call.

I crashed a mainframe. That is not supposed to be possible, but I did. Um, achievement unlocked??

In the early ’90s I was installing a fully redundant system that talked to the local mainframe at the client site. I had to write custom code changes for this site, and finally got them all working just before the mainframe went down at 6:00, per the nightly schedule. Yeah, that’s pathetic, but it’s how the client worked. They claimed they needed 12 hours to process a database. Anyway, between 6:00 and 8-ish that night, I documented my changes, backed everything up to tape, and copied everything from the primary computer to the stand-by/backup server. Then I went to my hotel, confident that everything would be great when the mainframe came back up at 6:00 AM. I could relax after my long day; I’d come in as soon after 6:00 AM as I could, giving me a full day with the mainframe up to (hopefully) complete my site-custom code changes.

Each machine accessed the mainframe through regular user accounts. We had a file for each machine’s mainframe passwords. I was careful - I had a backup of each machine’s password file on the other machine. During my long day I changed all the account passwords. What I neglected to do was save a current copy of the backup machine’s password file on the primary before I sprayed the files around. While I was copying files around, I had, without realizing it, saved the old password file to each machine. Neither machine had the correct password file, but the mainframe was down anyway, so the failure would not be visible until it came back up at 6:00 AM.

Cue scary music.

When I drifted back to the client site at 9-ish in the morning and walked to where my system was located, I could hear people whispering “He’s the one!” I got to my system and saw that both mainframe connections were down. I shut them down and checked the virtual screens. Each of the accounts had tried to log in with the old password. That failed, so my system was programmed to wait one minute and try to log in again. For each account. And again, and again. Lather, rinse, repeat for 48 accounts every minute from 6:00 AM until 9:00-ish. I realized what had happened, so I called the help desk.

I did not realize the magnitude of the consequences of my screw up. It turns out my machine’s 48 failed log-ins every minute for 3 hours had overflowed the mainframe’s security logs, and nobody could log in. That’s the part that’s not supposed to happen - the security logs should not have overflowed at all, let alone so easily, and even with some repeated bad logins other folks should still have been able to log in. I don’t know why someone at the client site didn’t call me at my hotel - they had all the information, and the mainframe security folks had ID’ed my system as the problem. The mainframe comms folks could have shut down those two lines, but instead they waited for me to come back in.

I called the help desk and said I needed to reset a number of account passwords, and the help desk person said “Darn right you do!” Anyway, it was all fixed in a few minutes, but I had crashed a mainframe.

My boss at the time was a former tech person, so she was very understanding. And I took pains in the future to make sure that the backup password file for the secondary machine was stored in a separate place, so I was less likely to over-write it by mistake.

I don’t know how much crap my boss took for me. Probably a lot. That client was very particular about keeping access up - the reason they’d insisted on our system being fully redundant.

Not computer related, and to this day I SWEAR the mistake wasn’t in the proof, but I okayed the printing of the announcement in our city’s newspaper for the appointment of our new CEO.

He had worked in The Netherlands. He was not from Neverland.