RAID: how perfectly matched do the drives have to be?

Now that you’ve brought it up again, I’d like to share my experience with the 5-bay Drobo. Yes, it can shift levels on the fly, but when I tried to go from one-disk protection to two-disk protection, it took 2 days, during which I was warned that I had no protection at all (at least that’s what the message said). So a UPS would be in order, and even that isn’t likely to protect for 2 days.

This was a nominal 15TB system, so I’m sure smaller ones would be faster.
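As a sanity check on that figure, some rough arithmetic (the throughput numbers are my assumptions, not anything Drobo publishes): moving to two-disk protection means essentially every occupied block gets read and rewritten, so the time is roughly data size divided by sustained throughput.

```python
# Back-of-envelope only: the 60 and 120 MB/s effective rates are assumed,
# not measured Drobo figures. Migrating to dual redundancy re-reads and
# re-writes essentially the whole occupied capacity.

def migration_hours(data_tb, effective_mb_per_s):
    return data_tb * 1e12 / (effective_mb_per_s * 1e6) / 3600

for rate in (60, 120):   # pessimistic vs. optimistic sustained throughput
    hours = migration_hours(15, rate)
    print(f"{rate} MB/s -> {hours:.0f} h ({hours / 24:.1f} days)")
```

At those rates a nominal 15TB box lands somewhere between roughly 1.5 and 3 days, so the 2 days I saw is about what the arithmetic predicts.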

Many years ago, we used to sell Compaq servers to our customers. Many price-conscious customers purchased the low-spec RAID controller, which did not do on-the-fly expansion. Eventually, those buyers did require additional disk space. We would turn up with a new disk and our spare RAID card, the expensive one with battery backup. Swap it in for the cheap one, and the system boots and recognises the existing array. Add the new disk, extend the array and wait for the rebuild to finish. Then swap the cards back, and the cheap card reads the new config from the disks and starts the extended array. Utterly wonderful, and impossible with servers from Dell or HP.

Most cheap array controllers (and most software RAID) don’t do array expansion. It is a shame, but the transitional state during a restripe has to survive a power loss: you can’t use flash for it because of the write cycles, and battery-backed DRAM is expensive.
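To illustrate why that transitional state matters, here’s a minimal sketch (plain Python, nothing like real controller firmware) of a resumable restripe: the progress watermark has to live somewhere that survives a power cut, which on a real card means NVRAM or battery-backed DRAM; a small file stands in for it here, and the list-based “layouts” are purely toy stand-ins.

```python
# Minimal sketch, not any vendor's firmware: why online expansion needs
# persistent "transitional" state. Restriping rewrites data into the new
# layout one stripe at a time; the progress watermark must survive a power
# cut, or the controller can't tell which layout a given block lives in.

import json
import os

CHECKPOINT = "restripe_checkpoint.json"   # stands in for NVRAM / battery-backed cache

def load_watermark() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_stripe"]
    return 0

def save_watermark(next_stripe: int) -> None:
    # Write then atomically rename, so the checkpoint itself is never half-written.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_stripe": next_stripe}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT)

def restripe(old_layout: list, new_layout: list) -> None:
    """Resumable restripe: after a power loss, rerunning resumes from the
    last persisted watermark instead of guessing which stripes moved."""
    for stripe in range(load_watermark(), len(old_layout)):
        new_layout[stripe] = old_layout[stripe]   # move one stripe to the new layout
        save_watermark(stripe + 1)                # persist progress before continuing

# Toy usage: "stripes" are just list entries here.
old = [f"stripe-{i}" for i in range(8)]
new = [None] * len(old)
restripe(old, new)
```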

Infant mortality, or the infamous bathtub curve. Sadly it is a fact of life. The modern reality is that the consumer’s pure focus on price means manufacturers tend to push down the cost of manufacture and QA, and wear the cost of warranty replacements that comes with the higher failure rate. There are higher-quality drives, but then you are talking enterprise-level systems, and if you baulk at $500 for a controller you are at least two digits in price away from that territory.

No one really believes it will happen to them, and no one ever realises how bad it is, until it happens to them.

Sort of. It is a difficult question to answer with anything more than general principles, and the simplistic answers are not all that good. RAID 5 gives good average read performance because it tends to distribute requests across the disks. Write performance is worse than a single disk, because the controller must calculate the contents of the parity block, and that requires a read as well as a write. However, a reasonably sized cache may allow the controller to aggregate writes into a full stripe (which avoids any need to read existing data) or to delay the write until there is a lull in activity. Such tricks usually need a persistent cache (battery backed or similar) to avoid problems if the power goes.

But these are broad-brush points. Journalling file systems can play well with RAID because the journal can be written in whole stripes, but the background journal truncation still requires random writes. The bottom line is that performance is complex, and a moving target. It depends greatly upon your specific workload, and many simple analyses are based upon constant average loads, which are not representative of a single-user PC. There is no doubt that an SSD is a game changer, but we don’t have enough experience to really understand the long-term reliability issues.
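To make the parity cost concrete, here’s a toy sketch (my own illustration, not any particular controller’s algorithm) of the two RAID 5 write paths: a small write has to read back the old data and old parity before it can write the new parity, whereas a cached full-stripe write computes parity from the new data alone.

```python
# Toy illustration of RAID 5 parity maths (single-byte "blocks" for brevity).
# Parity is the XOR of the data blocks in a stripe.

def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

def small_write(old_data, old_parity, new_data):
    """Partial-stripe update: read old data + old parity, then write data and
    parity back. new_parity = old_parity ^ old_data ^ new_data -> 2 reads + 2 writes."""
    new_parity = old_parity ^ old_data ^ new_data
    return new_data, new_parity

def full_stripe_write(new_blocks):
    """If the cache gathers a whole stripe, parity comes from the new data
    alone -- no reads of existing blocks are needed."""
    return new_blocks, parity(new_blocks)

# Example: a 4-data-disk stripe.
stripe = [0x11, 0x22, 0x33, 0x44]
p = parity(stripe)
new_d, new_p = small_write(stripe[1], p, 0x99)          # rewrite the second block
assert new_p == parity([0x11, 0x99, 0x33, 0x44])        # parity stays consistent
```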
Personally I don’t much like RAID 5. It doesn’t provide enough in either performance or resilience for my liking. Lose one disk and you are suddenly running with zero resilience, right up until the point the array has been healed onto a new disk. Better controllers will allow you to configure a hot spare and will initiate a rebuild automatically if a disk fails, but the rebuild time is not short: typically many hours or worse.

This is from someone who knows the pain of losing a disk in a large striped array and then discovering a bug in the (very expensive) backup software. I still have the receipt for the data recovery: $8k, and one of the worst weeks of my life. Do not underestimate how bad things can get, and how much it will cost to recover when things go bad. Boatloads of disks? Go RAID 60.
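For a feel for that exposure window, a back-of-envelope estimate (the 100 MB/s rebuild rate and 2% annualised failure rate are numbers I’ve picked for illustration, and it assumes independent failures, which is generous for drives bought in one batch):

```python
# Rough estimate only: how long is a degraded RAID 5 exposed during rebuild,
# and how likely is a second failure in that window? AFR and rebuild rate
# below are assumed figures, not measurements.

def rebuild_hours(drive_tb, rebuild_mb_per_s=100):
    """Time to resilver one replacement drive at a sustained rebuild rate."""
    return drive_tb * 1e12 / (rebuild_mb_per_s * 1e6) / 3600

def p_second_failure(surviving_drives, hours, afr=0.02):
    """Chance that any surviving drive dies before the rebuild finishes,
    assuming independent failures at a flat annualised failure rate."""
    per_drive = afr * hours / (365 * 24)
    return 1 - (1 - per_drive) ** surviving_drives

hrs = rebuild_hours(4)   # a 4 TB member rebuilt flat out: roughly 11 hours
print(f"rebuild ≈ {hrs:.0f} h, "
      f"P(second failure) ≈ {p_second_failure(4, hrs):.4%}")   # 5-drive array, one lost
```

The absolute probability looks small per incident, but it scales with drive size, with the number of surviving members, and with every hour the rebuild drags on while the array is still serving load, which is exactly why big pools of disks push you towards dual parity or RAID 60.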