A little background: I’m the IT director for a very small company. What that means in reality is that I do pretty much all the system/network/database administration and much of the development work. But I’m in a nasty situation right now, I’m at my wit’s end, and I’m hoping someone here with some experience can help me out.
The short version of the story is that we bought a bunch of used enterprise-grade server equipment from another company. We work very closely with this other company and we know this equipment was in good working order when it was pulled out of service as part of an upgrade. But now it’s ours, it’s my job to get it running, and I’m having a horrible time.
What we got were 3 Dell 8450 servers along with more SCSI storage than I’ve ever seen in my life - a whole slew of 8- and 10-disk rackmount arrays chock full of 73 GB Seagate Cheetah drives, 10K and 15K RPM. They’re fast, and there are a lot of them. My problem is that I cannot seem to keep them running.
I started by loading Linux on one of the 8450s and connecting two of the 10-disk arrays to a Dell PERC RAID controller. At first everything was fine, but under heavy load the drives started failing. They’re usually fine again after I power-cycle them (pull the hot-swappable tray out and reinsert it), but they flake out way too often - always in less than a day. I can’t keep an array of any size stable.
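If it helps anyone diagnose this, here’s roughly the sort of watchdog I could run during the load tests to catch the failures as they happen. It’s just a minimal sketch - the keyword list and 10-second poll interval are guesses for my setup, nothing official:

#!/usr/bin/env python
# Minimal sketch: watch the kernel ring buffer for SCSI errors while the
# array is under load. The keyword list and poll interval are guesses.
import subprocess
import time

KEYWORDS = ("scsi error", "sense key", "device offlined", "i/o error")
seen = set()

while True:
    # Dump the kernel ring buffer and flag any new SCSI-related error lines.
    output = subprocess.Popen(["dmesg"], stdout=subprocess.PIPE).communicate()[0]
    for line in output.decode("utf-8", "replace").splitlines():
        if line not in seen and any(k in line.lower() for k in KEYWORDS):
            seen.add(line)
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "|", line)
    time.sleep(10)

That at least gives me timestamps for when each drive drops, which I can line up against whatever load I was generating at the time.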
My first thought was heat - I know those drives run insanely hot under load. But I think I’ve finally eliminated that as a possibility. My most recent test used a stripped-down setup: one rackmount array, half the drives pulled to allow much greater airflow and cooling, and the output of an industrial-grade air conditioner pointed straight into the front of the array. The air coming out of the air conditioner is about 50 degrees F. I taped a temperature sensor to the side of one of the drives and it never exceeded about 72 degrees F. Yet in a 4-drive array (3-disk RAID 5 plus one hot spare) I had a 2-disk failure in under a day.
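If anyone wants to see the drives’ own internal temperature readings rather than my taped-on sensor, this is roughly how I could poll them with smartmontools. Again, just a sketch - it assumes smartctl is installed, /dev/sda is only an example device path, and the line I’m matching is the one smartctl prints for SCSI drives:

#!/usr/bin/env python
# Rough sketch: poll a SCSI drive's internal temperature via smartctl.
# Assumes smartmontools is installed; /dev/sda is just an example device.
import re
import subprocess
import time

DEVICE = "/dev/sda"

while True:
    proc = subprocess.Popen(["smartctl", "-a", "-d", "scsi", DEVICE],
                            stdout=subprocess.PIPE)
    out = proc.communicate()[0].decode("utf-8", "replace")
    # SCSI drives report a line like "Current Drive Temperature:  34 C"
    match = re.search(r"Current Drive Temperature:\s+(\d+)\s*C", out)
    if match:
        celsius = int(match.group(1))
        print("%s  %d C (%.0f F)" % (time.strftime("%H:%M:%S"),
                                     celsius, celsius * 9.0 / 5 + 32))
    time.sleep(60)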
I’ve tried everything I can think of. I’ve swapped disks. I’ve swapped cables. I’ve rebuilt arrays of various configurations from scratch more times than I can count. I tried another disk controller. I swapped enclosures. I upgraded the firmware on the PERC RAID card (which did eliminate another problem, but not this one). Nothing has made any difference.
This has to be hardware related somehow, and it has every sign of being a heat problem. But I don’t know where to turn for help, so I’m coming here and hoping someone has some serious SCSI experience. I have a whole heapin’ load of computer experience, but minimal exposure to SCSI and none in a true “enterprise-grade” environment.
The only other possible cause I can think of is that this equipment was driven from California to Florida in a rental truck by one of our employees and, while it was well-packaged, it did take a heck of a beating in that truck.
If anyone has any suggestions, I’m all ears. I can provide much more detail if requested. Hiring an outside consultant is certainly an option, but I don’t know where to even start looking for someone who can handle this kind of work. We’re a successful company but we’re not dripping with money (many people who see our operation are stunned at what we’ve accomplished with so little in the way of capital/people/equipment), so anything involving, say, tens of thousands of dollars isn’t an option. But I don’t think this is really that kind of problem.
The only thing I ask is that nobody clutter this thread with “Oh my god, a real company would never work that way! You’re going to go out of business if you keep working like that!” posts. I’ve seen that happen on other forums (coughSlashdotcough) and I’d appreciate it if nobody would resort to that here. Thanks. And double-thanks in advance to anyone who has any suggestions.