the biggest IT screwup you've seen

<<**This is to replace a thread that got lost in the rollback.

Post your survivor stories of IT screwups and disasters here.**>>
A client wants an updated version of their bill tracking and analysis software pushed to their offices and users. Along with this goes a change to their Oracle client, as the new software will only work on v8.05 or later. OK, fair enough.

What? You’ve no enterprise management software in place? Oh, you do have management software, it’s running on the box in the corner, the one that no one has looked at for over a year… Yeah, ok… What? No manuals, no installation notes, no passwords? Who installed this? He left six months ago? Fired? Took all his notes with him? Where does he work now? You don’t know? Can you get in touch with him in any way? Oh, I see… All of his supervisors were fired in the last two months, and none of them had updated the contacts list.
<sigh>
OK, I’ll do it the hard way: Script off the login. Thank God for high-capacity WAN links. WHAT!? Half your remote users are working from home over dial-up??? (Note to self: Kill the Account Executive that agreed to make this a one-person, fixed-price job) Right then, what kind of links? DSL or ISDN to the remote offices, 56k to the home users… <Sigh> Ooookay.

<scripting and testing on local machines>

OK, I need to deploy this to remote servers to minimize stress on the central facilities. I’ve designed the script to test for office location, and to connect to the closest server for down-load and update. Whattya mean I can’t test it across the WAN? Do you want to know if it’s going to work, or not? Thanks, I appreciate your trust, but I’ve earned my reputation by testing this stuff first, before releasing it on the whole user base. Well OK, but I’m telling you this is a bad idea (call Account Exec, chew his dumb ass for sticking me with this job, warn him about the client’s stupidity).
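
For what it’s worth, the logic was roughly along the lines of the sketch below. This is NT-style batch written for illustration only; every subnet, server name, and path in it is invented, not the client’s actual environment.

@echo off
rem Hypothetical location-aware update launcher (all names invented).
rem Pick the nearest distribution server from the workstation's subnet,
rem then pull the new Oracle client and the billing app from that share.
set DISTSRV=\\hq-dist01\update
ipconfig | find "10.1." >nul && set DISTSRV=\\east-dist01\update
ipconfig | find "10.2." >nul && set DISTSRV=\\west-dist01\update
rem Anyone not on a known office subnet (read: the dial-up home users)
rem falls through to the HQ server over their 56k link.
echo Updating from %DISTSRV% ...
xcopy "%DISTSRV%\ora805\*.*" c:\orant /s /e /i /y
xcopy "%DISTSRV%\billing\*.*" "c:\program files\billtrack" /s /e /i /y

The whole point of the design is to trade one big pull across the WAN for lots of small local pulls, which only pays off if the office links really are the DSL/ISDN the client described.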

<Go Live>
<staggered login across time zones>

It’s working! The updates are down-loading and installing…

It’s…

killing…

the…

WAN…

<WAN collapses under load>
<Many updates fail on time-out>
<2000 users unable to work (~ 80% of company)>

Whattya mean “dual-bonded 56K dial-up” is the same as ISDN? (BOZO!) Why didn’t you reveal that you were using dual-bonded instead of ISDN earlier!? The Phone Guy said it was the same? You let a Phone. Company. Salesman. Sell you dual-bonded in place of ISDN? You have nothing but dial-up? No ISDN, no DSL? You know, if you’d let me test, I’d have found out, even after you lied to me up-front. I’d have written a different script, and all of this would have been avoided. It’s not like clients have never lied before, but this is the first time a client has both lied and prevented me from testing. Here’s a novel concept: You could’ve told me the TRUTH…!

<12 hours of all-hands scrambling>

OK, the update is in place, and all users are able to go back to work now. I’m going to go home and have a heart attack. Why are you thanking me? My beautiful, lovely script just shut your company DOWN for a day. Yeah, I know that none of you could do any better (you feckless slackers!).
(Note to self: Tell the Account Executive I’m not willing to work with this client again: They violated the Full Disclosure clause, and are too stupid to breathe)

(I Wanna forget this ever happened)
<sigh>

I’m amazed at the stupidity of people.
In fact only about 80% of the population would be allowed to reproduce if I had my way.
I work at a hotel.
The ignorance of some people puts me in awe!
I cannot remember a single incident to share with you right offhand, but I’ll get back to you with one.

I wasn’t really involved in this one 'cause all I can do with computers is poke at the keys and hope something comes out right, but I was a victim. At the hospital where I work, the IM people decided to do some upgrades one day. System should only be down an hour or so. No prob. But apparently someone did something a leetle cockeyed and managed to erase the social security numbers of all the patients and employees in our database.

Did I mention that the SSN is the primary identifier for our patients? What a fun day. I’m sure someone had carpal tunnel syndrome before it was all fixed.

This was one I did all by myself.

I was filling in for the Asst. IT at work. One of the jobs is to input all the invoices that come in for the day. I thought I had put this certain problem invoice in just fine. However, I found out the next day that the system had totaled the invoice at 7 million dollars. The system decided it couldn’t handle it, and it crashed the IT director’s computer back at headquarters. I haven’t been asked to fill in for an IT since.

At least you wanted to test. Our programmers at work drive me decidedly batty sometimes. We regularly have to push out software updates via a dial-up connection and specialized software. And these updates can be huge, so we hate putting our end users through it. Most times, not all, but most times they will tell us a day before they are going to push them down. We will ask them if they tested the update. Oh yeah, yeah, of course we did, they reply. And then it turns out they didn’t, or just on their LAN PC, or just on one of the two operating systems we have in the field. And lo and behold, the update will fail on half the PCs and screw up the system beyond belief. The last time this happened, it screwed up the main program the end users use for 75% of our users, and we had to completely reinstall the OS for 25% of the people who got screwed up. And these people kept their jobs, because they are the only ones who can maintain this ancient proprietary software. But the joke is on them. By the end of the year, we should have all our software moved to the web, and they will be out of a job.

Jeeves

I used to work in IT for a Fortune 50 company that required all machines to have 620kB of conventional memory free, but they wouldn’t let us use MemMaker to do it because some IT manager didn’t understand how to use it and screwed up his computer. Never mind that a competent tech could have fixed it in less than five minutes; he wouldn’t let any tech touch his PC since he knew what he was doing. Imagine having a DOS 6.22/WFWG 3.11 machine with Token Ring and Ethernet cards installed and being threatened with termination for being caught running MemMaker on it. When I got shunted into laptop configuration, I insisted on my own cubicle away from the rest of the configuration staff so that I could run MM in secret, because 2/3 of the laptop users required configs for both interfaces (10B-T for ISDN at home), didn’t want to have to remove PCMCIA cards from the machines to make them work, and yet still insisted on 620kB RAM.
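
For anyone who never fought the 640 kB barrier: this is roughly the sort of thing MemMaker automated. A hand-tuned CONFIG.SYS and AUTOEXEC.BAT along these lines (the driver names below are illustrative, not that company’s actual build) push drivers into upper memory blocks so applications keep most of the first 640 kB.

rem ----- CONFIG.SYS (illustrative only) -----
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE NOEMS
DOS=HIGH,UMB
DEVICEHIGH=C:\NET\PROTMAN.DOS /I:C:\NET
DEVICEHIGH=C:\NET\ELNK3.DOS
DEVICEHIGH=C:\NET\IBMTOK.DOS

rem ----- AUTOEXEC.BAT (illustrative only) -----
LH C:\DOS\SMARTDRV.EXE
LH C:\WFW\NET START

Get one line of that wrong and the machine either hangs at boot or comes up short of the magic 620kB, which is the kind of tuning MemMaker did automatically.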

:eek:

I feel better now.

Let me see, there was the IT manager who was formulating policy for year 2000 purchases and then bought a non-Y2K-compliant server…after the policy came out. He still works here.

Then there was the IS director that looked at the network diagram, said “That’s way too complicated!” and cut out all of our backup links. He doesn’t work here anymore.

And then there was the other IS director that decided she was going to make us use “just-in-time” inventory practices. Which means, no spare equipment. No spare CPUs, monitors, cables, adapters, nada. I work in the Helpdesk, in the Yukon. We sometimes can’t get parts even when we order them waaaay in advance. She doesn’t work here anymore.

And my very favorite, the IS director that was responsible for converting us from a Novell 4.1 network to a mostly NT network? The one directly responsible for inflicting the hideous abomination known as Exchange Server upon us hapless technicians? The one who wouldn’t let me upgrade our Novell when the upgrade was $150 US?

He doesn’t work here anymore either. Now if we could just get rid of his debris…

tisiphone
Looks like you had a whole run of PHBs. Good they got sacked. Too bad you couldn’t cut them off at the pass.

sewalk
Preach it, brother!
MemMaker saved my butt more times than I can remember. I had a Supply Officer make me write a 2-page justification when I went to order it. He was still about to deny the request when I pointed out that it would stop his transaction journalling computer from crashing every 1 1/2 days. After that, I could do no wrong in the Supply world.

Jeeves
I’ve a reputation for being more than a little paranoid when it comes to making config changes. Now you know why. That was my first solo run at a major config change/migration/update. Despite trying to do it by the numbers, I was stymied at every turn. The IT management at that particular client had changed completely three whole times in three years. No wonder it was such a mess. The Help desk manager’s previous position had been as a manager for a bookstore! The Dept Head had fled the local mass-transit system just ahead of a major audit/reorg, and the unit manager responsible for my work had been his right-hand man there. The main management still, to this day, has no understanding of IT, despite it being utterly critical to their business model, as in: No WAN, no company. Period. They hire guest workers from overseas because they’re not willing to pay more than US$45K/yr for qualified Oracle DBAs. At the end of a year, when the contract is up, the DBAs invariably walk, taking a year of experience out the door with them. That IT dept is hemorrhaging money.

Lady Ice
Ouch.

dwyr
Ow. Owowowowowow. OUCH! My wrists hurt just thinking about it.

There was the time the head systems admin discovered that his backups had not been running…ever…after two disks in the RAID failed at the same time.
Another company:
Me to the head of computer operations: “We need to junk that printer, it has a bad power supply.”

“No it doesn’t, their tech said it was fine.”

“Their tech is a moron, it’s been screwing up for weeks. Why do you think we pulled it off the floor?”

“Here, I’ll show you.” (He plugs the printer in in the main computer lab…the printer fires up, starts to run…then…darkness…the sound of UPS alarms screaming in pain…hard drives spinning down because the UPSes hadn’t had their batteries replaced in years…oh, the humanity.)

I go back to my office and laugh my ass off. Thankfully, I had insisted on putting the NT domain controller and the Exchange server in my office.

This place was exciting. When the fire marshal came by, they got the prettiest girl they could find to lead him around and distract him right past the computer room so he wouldn’t see the long chain of power strips running all the servers. They had Cat 3 runs going down four stories, and wondered why they had trouble logging in on the ground floor. They had about 20 X terminals running off my poor little P166 Citrix server. I had to build the domain controller out of spare parts, and they wouldn’t buy me a BDC. Man, I’m glad I’m out of there.
Current job. One of my all-time favorite screwups.

We started getting calls about hundreds of users’ machines crashing. Some wouldn’t boot; on others the desktop disappeared and was reset to default. On investigation, the desktop folder or the system folder (IIRC) had been renamed to “john”. The fix was easy, but this appeared to be a new virus/trojan. We called McAfee; they said it was the first they had heard about it. Maybe an inside job. They wanted a specimen if possible. The hunt was on.

Whatever it was, it must have deleted itself after activation, because nothing was amiss and, once fixed, there was no further problem. It turned out this was only happening in the engineering dept. They had a different logon script than anyone else. The first name of the person who had last modified the script was John. He had signed his name to it in a remark statement, but he had substituted an “N” for the “M” in the remark statement at the end of his section of the script. So, where the line was supposed to start with “rem”, it started with “ren”, which renames the current directory on C: to john; depending on the last thing done in the script, that was either desktop or system.

I never got to look at the script in question personally, so I don’t know how the drive letter came to be right after the ren statement and before the guy’s name, but I found the whole situation amusing…
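
For anyone trying to picture it, here’s a hypothetical sketch of how that section of the logon script might have looked; the folder, the drive mappings, and everything except the one-letter slip are my own guesses, not the real script.

rem ===== engineering section (hypothetical reconstruction, names invented) =====
rem ... drive mappings, file copies, and so on ...
rem Assume the last folder the script works in is the user's Desktop on C:
cd \winnt\profiles\%username%\desktop
rem Intended: one last harmless comment line, signed by its author.
rem What actually got typed, one letter off, drive letter and all:
ren c: john
rem A bare "c:" refers to the current directory on drive C:, so the REN
rem ends up renaming whatever folder the script is sitting in, Desktop or
rem a system folder, to "john", which is exactly the behavior described above.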

Well, I’m going to nominate the old IT manager here, who once told me "I’m not that concerned about network security or viruses. The NT server goes to a UNIX server and then to the network connection, so anything that comes across will have to go across two platforms before it hits the users."

Like viruses in email attachments are going to care what kind of (unfiltered) server they get passed across. This person also didn’t believe in firewalls ("'cause they can shut down network traffic!"). Duh, that’s what they’re supposed to do: shut down the traffic you don’t want.
She didn’t think that updating virus definitions was a priority, because users were instructed to scan their machines weekly. Never mind that 85% of them couldn’t find the antivirus even with a shortcut on the desktop. I had to print out Symantec’s operating instructions to make her see the light on this, and I’m not sure she believed me.
Shortly after I arrived, she got fired and I got a LOT more work. Wonder why?

Bwa Hahahaha…!

That’s just about the most horrifying, funny screw-up I’ve ever heard. [sub]And I’ve heard plenty.[/sub]

At Biosphere 2 we had an HP micro 3000 “mighty mouse” system that ran the MPE operating system, one I had never heard of. It blew a disk, so the boss called HP to replace it. He was taking a class that week, so it was my task to restore the backup. Horrible slow system that used multiple tape cartridges in a built-in changer. Halfway through the fourth and final tape I find the tape is corrupt. I ask the boss where he has the previous day’s tapes. I figure we’ll lose a day’s transactions, but payday is tomorrow, so we’d better get something done. He returns from his class and tells me there are no previous tapes.

Huh?

Instead of cycling tapes or using fresh ones, he had simply been opening and closing the door on the drive so it would always reuse the same set of tapes.

NFBSK! says I

He also announced that he had found another job, had sold his house, and would be moving, but not to tell management. Mighty mouse was my problem now.

I pulled an all-nighter trying to recover individual files with the help of an HP tech in Australia. Only by the sheerest of coincidences did we find the last file on a single loose tape sitting on the shelf, a static file that completed the system. Payday was saved, and I went out that day and got a case of tapes to do proper backups.

NFBSK = Not For British Schoolkids

A relatively modest one, but one I just experienced and which demonstrates the perils of not testing your webpage design on somebody else’s computer, too.

I need some info on Pioneer’s plasma screens for home video. I go to their site and punch up the .pdf file of the manual. This comes up in a window that is roughly the full width of my 21" monitor but is only a couple inches high. Maybe nine lines of text. Half the elevation drawing of the front of the screen, which works out to about one inch when I print it. In other words, a slit. Can I enlarge the window? Nope. Instead, I have to PRINT OUT ON PAPER the drawings I need so I can see everything I need.

Were that window sizeable I could have the full page in front of me, perfectly legible. Instead I have to SLOWLY scroll through the document and print out what I need.

So, does the moron running the website have a place to leave comments and questions so I can point out this problem and how it reminds me what a nice plasma screen Philips makes? Nope. Neither does his boss. Guess he’s not as dumb as I thought.

It’s so tempting to start describing the incidents which make me want to crash a car bomb into our data center … except I’m at work now, so I’ll refrain from describing how I plan to destroy the building until I’m not using company hardware.

We had a QA guy here who had root access on a UNIX server (unfortunately he needed it to do some of his testing). My favorite script he ever wrote included the following two lines:

cd /
rm -r .

For those who are not Unix-savvy, the result is a complete erasure of all disks on the machine. The guy thought he was just erasing the local directory the script had created earlier. The first time he ran the script we were angry. But we all make scripting errors of one kind or another, so we forgave him. We had good backups, so it only cost us a couple of hours of downtime and turmoil. When he ran the same script a second time less than a week later, IT informed QA that he would no longer be granted the permissions necessary to do his job and perhaps they should find alternative work for him. He didn’t stay with us long after that. Screw with my machines once and you will be forgiven. Make the same idiotic mistake twice and you will be punished.

::sigh:: I guess I’ll post mine again.

I participated in a doozy of a screw-up here at my company. We had a system crash, and part of the recovery process is to start a filesave with a process called “transaction logging”. The two have to be done together. What we didn’t notice was that the backup was no good, and neither was the transaction logging. So when the system crashed again later that day, all 2000 users lost all of the work they’d put in since the last crash…:eek: :o

Dude! He couldn’t do his testing using sudo??? I mean, you can give sudoers rights to just about anything without opening root wide open to people like that!

At a former job there was a Microsoft network spread across three sites connected by WAN links. The NT 4.0 servers merrily chugged along, the NT 4.0 workstations worked fine.

The head IT managers decided that they wanted remote administration capabilities and implemented Microsoft SMS. Then came the rollout of NT 4.0 Service Pack 6. The managers had the idea to roll out the operating system’s Service Pack over SMS, across LAN and WAN links, to the several hundred workstations at once over the course of one evening. The rollout was scheduled for 6PM; the go-ahead was given, the rollout started, and everyone went home.

Come 8AM the following morning, the rollout was still in progress. Network traffic lights were going nuts, bandwidth was maxed, and no one could get anything done. Frustrated that they couldn’t use their machines and not understanding why they weren’t working properly, users sat down at their stations and rebooted them. In the middle of an operating system Service Pack rollout. End result: techs running around re-installing operating systems from scratch on all the machines that would no longer boot.

My previous job was as QA manager for a software company. Like a lot of smaller organizations, we tended not to update product spec and requirements docs as things changed. Most of our discussions of feature changes, meeting minutes, etc. were handled as e-mail threads, so we tended to keep nearly all of our messages for later reference.

One fine day, the marketing director managed to trash his inbox, taking all of his past messages with it. He asked the IT guys to restore from backup, which they did. Unfortunately, their most recent backup of the Exchange Server was from two months before – and they restored the whole bloody thing, overwriting every mailbox in the company. Two months, and thousands of messages, all full of irreplaceable information, all gone to the great bit-bucket in the sky. The only messages that were salvaged were those where individual users had done a local archive of their messages (a pitiful fraction of the total).

A couple of months later, in preparing to migrate to a new version of our defect tracking software, I discovered that our existing defect database hadn’t been backed up in four months; the explanation I received was that the server it was on was backed up every day, but that the database file was always open so the backup software was skipping it. Now even if they were using backup software so lame that it couldn’t deal with open files, they knew that this was happening and that it meant we stood to lose our entire repository of information about defects in a software product we’d been developing for nearly two years, and they didn’t think it was important enough to mention. Needless to say, I immediately moved the database to a machine I controlled and instituted an effective backup strategy until the IT guys demonstrated to me that they’d resolved the problem.

It was a great company to work for in so many ways, but their Achilles heel was that they believed in finding another place in the company for people who proved not to be well suited for the position they were hired for. I was promoted to QA Manager to replace one of them – who ended up as IT manager, and presided over both of the above debacles.

This exact same thing actually happened where I work. The sad fact is it wasn’t even considered a royal screwup, in fact it hardly merits mention in light of other ROYAL screwups that have gone on around here.