SCSI guru needed - drives flaking out

A little background: I’m the IT director for a very small company. What that means in reality is that I do pretty much all the system/network/database administration and much of the development work. But I’m in a nasty situation right now, I’m at my wit’s end, and I’m hoping someone here with some experience can help me out.

The short version of the story is that we bought a bunch of used enterprise-grade server equipment from another company. We work very closely with this other company and we know this equipment was in good working order when it was pulled out of service as part of an upgrade. But now it’s ours, it’s my job to get it running, and I’m having a horrible time.

What we got were 3 Dell 8450 servers along with more SCSI storage than I’ve ever seen in my life - a whole slew of 8- and 10-disk rackmount arrays chock full of 73G Seagate Cheetah drives, 10K and 15K RPM. They’re fast, and there are a lot of them. My problem is that I cannot seem to keep them running.

I started by loading Linux on one of the 8450s and connecting two of the 10-disk arrays to a Dell PERC RAID controller. At first everything was fine, but under heavy load the drives start failing. They’re usually fine again after I power-cycle them (pull out the hot-swappable tray and reinsert it), but they flake out way too often - always within less than a day. I can’t keep an array of any size stable.
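
To give a sense of what I mean by “heavy load”: the sketch below is purely illustrative (not my exact test, and /mnt/array is just a stand-in for wherever the RAID volume is mounted), but sustained sequential writes and reads along these lines are the sort of thing that knocks the drives offline within a day.

    #!/usr/bin/env python
    # Purely illustrative load generator: stream big sequential writes to the
    # array, fsync, read the file back, and print rough throughput per pass.
    # Ctrl-C to stop.  TARGET is a stand-in path - point it at the RAID volume.
    import os
    import time

    TARGET = "/mnt/array/loadtest.dat"   # stand-in for the real mount point
    CHUNK = b"\0" * (1024 * 1024)        # write 1 MB at a time
    PASS_MB = 4096                       # 4 GB per pass

    while True:
        start = time.time()
        f = open(TARGET, "wb")
        for _ in range(PASS_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())
        f.close()
        write_secs = time.time() - start

        start = time.time()
        f = open(TARGET, "rb")
        while f.read(1024 * 1024):
            pass
        f.close()
        read_secs = time.time() - start   # note: reads may be partly cached

        print("pass done: write %.0f MB/s, read %.0f MB/s"
              % (PASS_MB / write_secs, PASS_MB / read_secs))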

My first thought was heat - I know those drives run insanely hot under load. But I think I’ve finally eliminated that as a possibility. In my most recent test I used a stripped-down setup - a single rack-mount array with half the drives pulled to allow much greater airflow and cooling, and the output of an industrial-grade air conditioner pointed straight into the front of the array. The air coming out of the air conditioner is about 50 degrees F. I taped a temperature sensor to the side of one of the drives and it never exceeded about 72 degrees F. But in a RAID array of 4 drives (3-disk RAID 5 plus one hot-spare) I had a 2-disk failure in under a day.

I’ve tried everything I can think of. I’ve swapped disks. I’ve swapped cables. I’ve rebuilt arrays of various configurations from scratch more times than I can count. I tried another disk controller. I swapped enclosures. I upgraded the firmware on the PERC RAID card (which did eliminate another problem, but not this one). Nothing has made any difference.

This has to be hardware related somehow, and it has every sign of being a heat problem. But I don’t know where to turn for help, so I’m coming here and hoping someone has some serious SCSI experience. I have a whole heapin’ load of computer experience, but minimal exposure to SCSI and none in a true “enterprise-grade” environment.

The only other possible cause I can think of is that this equipment was driven from California to Florida in a rental truck by one of our employees and, while it was well-packaged, it did take a heck of a beating in that truck.

If anyone has any suggestions, I’m all ears. I can provide much more detail if requested. Hiring an outside consultant is certainly an option, but I don’t know where to even start looking for someone who can handle this kind of work. We’re a successful company but we’re not dripping with money (many people who see our operation are stunned at what we’ve accomplished with so little in the way of capital/people/equipment), so anything involving, say, tens of thousands of dollars isn’t an option. But I don’t think this is really that kind of problem.

The only thing I ask is that nobody clutter this thread with “Oh my god, a real company would never work that way! You’re going to go out of business if you keep working like that!” posts. I’ve seen that happen on other forums (coughSlashdotcough) and I’d appreciate it if nobody would resort to that here. Thanks. And double-thanks in advance to anyone who has any suggestions.

You say the drives “flake out” - it might help to know what exactly is going wrong. Look for an error message or an entry in the system log file.
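
If it’s not obvious where to look, something along these lines would pull the SCSI-related lines (and their timestamps) out of the syslog so you can see exactly when each drive drops and what the kernel says about it. Just a sketch - it assumes a stock syslog writing to /var/log/messages, so adjust the path if your distro logs somewhere else.

    #!/usr/bin/env python
    # Sketch: print syslog lines that look SCSI/disk related, with their
    # timestamps, so drive dropouts can be matched against time and load.
    # Assumes the default /var/log/messages location.
    import re
    import sys

    LOGFILE = "/var/log/messages"
    PATTERN = re.compile(r"scsi|sd[a-z]|i/o error|megaraid", re.IGNORECASE)

    path = sys.argv[1] if len(sys.argv) > 1 else LOGFILE
    for line in open(path):
        if PATTERN.search(line):
            print(line.rstrip())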

If the hardware worked for the previous owner, but not for you, that suggests that the system configuration has changed. Are you using the exact same systems software as the previous owner? If you can’t replicate the previous configuration, you might try testing the system with another operating system, like FreeBSD or Windows 2000/XP. That might tell you whether you have a hardware problem or a systems software problem.

IANA SCSI expert, but here are some relatively obvious questions that you have probably thought of. Are you carefully tracking and documenting the failures, both in terms of the specific drives involved and their locations in the racks? Could it be that a relatively small number of drives were damaged in the move, and that as you pull them and move them elsewhere, they fail again later? Or possibly that a certain rack location is hotter than expected, causing any drive placed there to have problems?
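
Even a dirt-simple script that appends one row per incident to a CSV file would answer that within a week or two - is it the same serial numbers failing every time, or the same slots? A minimal sketch (the failures.csv filename and the columns are just made up for illustration):

    #!/usr/bin/env python
    # Minimal failure log: run once per incident and it appends a timestamped
    # row.  Sort the file by serial or by slot later to spot the pattern.
    import csv
    import sys
    import time

    LOGFILE = "failures.csv"   # made-up name, keep it wherever is convenient

    def record(enclosure, slot, serial, note=""):
        f = open(LOGFILE, "a")
        csv.writer(f).writerow(
            [time.strftime("%Y-%m-%d %H:%M"), enclosure, slot, serial, note])
        f.close()

    if __name__ == "__main__":
        # usage: failure_log.py <enclosure> <slot> <serial> [note...]
        record(sys.argv[1], sys.argv[2], sys.argv[3], " ".join(sys.argv[4:]))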

Could you buy a couple of brand new drives, stick them into the mix and see if they have problems? If not, it would suggest the problem is in the old equipment. If they do, then something else is going on.

I assume these are Dell Powervault RAIDs?

Three questions:

How good is your power supply?
Are you using known good LVD cables? Get a kink in one and watch your RAID die.
Are your PERC adapters in cluster mode?

Oh dear. This is not good. How well packaged was it? Was each drive separated and packed in foam or bubble-wrap? How hot did it get inside the truck?

Thanks for the replies, folks. To answer some questions:

The machines originally ran Windows Server 2003, but they came to us with no OS. If I keep having no luck I might load Windows and completely eliminate Linux from the equation.

Some of the LVD cables were kinked near the connector that goes into the server - I’ve ordered a brand new one for testing.

I never did track the individual drives carefully. I did swap the failed ones out with others at one point, but it didn’t help. There’s no telling whether or not those were damaged, though. I don’t have any brand new drives to test with, but I can probably order a few if I have to.

The power supplies are good. The server has triple-redundant power supplies, the drive racks have double-redundant power, and the whole mess is plugged into a datacenter-grade high-capacity UPS.

When this equipment took a trip in the Truck of Death, it wasn’t in the dead of summer, so it shouldn’t have gotten too hot. The drives were still mounted in the cages, which were in the racks, but the racks themselves were in cardboard boxes and packed with thick foam cushioning custom-fit to the racks. That doesn’t mean they weren’t bounced around too much, though. I’m still thinking this is potentially a source of the problem.

The 8-disk arrays are Dell Powervaults. The 10-disk arrays are from some company I’ve never heard of before (I think they’re called “Forta” boxes). I think they may have been part of a Fibre Channel setup originally, but they’re basically just SCSI backplanes.

The errors in the system log are unhelpful. When the arrays are plugged into the PERC RAID controller, they appear to the OS as one big SCSI disk. Once enough drives fail to crash the array, the only errors I get are generic “I can’t access this drive any more” errors. The PERC RAID setup program, which I get to using CTRL-M at boot, will then show the drives as failed until I power-cycle them.

The PERC adaptors are not in cluster mode. I’m not even sure what cluster mode is. Heh… For testing I’ve just been creating various simple RAID 0 or 5 arrays.

At this point I’m thinking this is the result of damage sustained in transport, or bad cabling. The new cable should help me eliminate that as a possibility, and perhaps careful tracking of failed drives will ultimately let me find one set that works consistently. I went through that when we got some Promise IDE RAID boxes a couple years ago. They’re great, but some of the IDE drives just did not like being pushed hard and would flake out. Once I replaced enough of them I ended up with a stable set, and aside from occasionally power-cycling one of them (once a month, maybe, if that), they run without a hitch. It’s a shame they’re nowhere near as fast as these SCSI drives, 'cause it’s awful nice to have over a terabyte of storage in one little box.

Thanks again for the suggestions. I’ll keep at it and see what happens.

I haven’t used SCSI in years. This might be obvious, but are you properly terminating the drives?

The Dell Powervaults should be self-terminating. The biggest problem I had with Dell Powervaults was the backplane controller.

Yeah, the backplanes are self-terminating. The drives do work for hours, and in my experience they’d fail very, very quickly if they weren’t terminated properly.