When is a hard disk considered to be failed? Should I replace it? (RAID 6)

So I have a home server/NAS Synology DS416play.
I am running a configuration that is equivalent to RAID 6. It is called Synology Hybrid RAID 2 (SHR-2). It is not actually RAID 6, but I do have two-disk redundancy.
Anyway, one of my disks recently showed a bad sector, then a day later another bad sector. These disks are written to constantly, as I use the NAS for my home surveillance (2 cameras, rolling recording) in addition to general data storage. So they experience a fairly constant load.

When do I consider this hard drive to have failed?
Should I replace it now?
Should I wait until it completely dies before replacing it?

I realise the answer would be “depends on your risk tolerance”, but I am just trying to get a sense of what the risks are. What would you do?

Hard drives are so cheap that I would replace one as soon as it started having any errors.

What is the S.M.A.R.T. status of the drive? I think it will stay “Normal” until the bad sector count reaches a certain limit.
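
If you can get a shell on the NAS, smartctl will show you the raw counters behind that one-word status. A minimal sketch in Python - it assumes smartctl is installed on the box and that /dev/sda is the suspect drive, so adjust both for your system:

```python
# Minimal sketch: query overall SMART health plus the attribute table.
# Assumptions: smartctl is installed and /dev/sda is the suspect drive.
import subprocess

def smart_report(device: str) -> str:
    result = subprocess.run(
        ["smartctl", "-H", "-A", device],  # -H: health verdict, -A: attributes
        capture_output=True, text=True,
    )
    return result.stdout

print(smart_report("/dev/sda"))
```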

The SMART status is normal, and the overall status (apart from SMART) is normal. I tried to run an extended SMART test last night, but it got stuck at 90% for about 24 hours until I stopped it. I ran a regular SMART test, which completed in 2 minutes and didn’t change any of the status indicators.

Sorry, I misspoke: the status without SMART is “Warning”.

Better to replace than be sorry.

If you don’t mind losing all the data on it. :frowning:

One thing to try might be to remove that disk from service and run a complete disk wipe and low-level format on it. That will redo the formatting, check each sector, and mark the ones that fail as bad sectors not to be used. On an older disk this can easily find more bad sectors than when it was new, so you lose some of the disk’s capacity. But it remains usable, and you may see fewer bad sectors occur while it’s in service.
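
If you go that route, even a read-only pass over the whole disk will flush out unreadable sectors (a destructive write pass, like badblocks -w, is what actually lets the drive remap them). A rough sketch, with a made-up device path - double-check it, and run it as root against an idle disk:

```python
# Rough sketch: read-only surface scan of a whole disk, logging the
# offsets of any regions that return I/O errors. The device path is
# a placeholder; triple-check it before running (as root).
import os

DEV = "/dev/sdX"       # placeholder device
CHUNK = 1024 * 1024    # read 1 MiB at a time

fd = os.open(DEV, os.O_RDONLY)
offset, bad = 0, []
while True:
    try:
        data = os.read(fd, CHUNK)
    except OSError:
        bad.append(offset)                 # unreadable region
        offset += CHUNK
        os.lseek(fd, offset, os.SEEK_SET)  # skip past it and carry on
        continue
    if not data:                           # end of device
        break
    offset += len(data)
os.close(fd)
print(f"scan complete, {len(bad)} bad region(s) at offsets: {bad}")
```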

However, more frequent bad sectors are a sign of an aging disk, with an increased chance of complete failure. So it depends on how vital the data on this disk is, and how much you could recover with your RAID configuration.

But given how cheap hard disks are now, I’d begin looking for a replacement.

Do you have the Synology Assistant software? It should have a health sensor.

It’s RAID(esque). This particular drive doesn’t hold any unique data, and the OP wouldn’t lose anything unless a second and a third drive failed.

Why not buy a replacement disk or two right now in any case? Then you can hot-swap one into the array the moment the problematic drive dies, or whenever you feel like it. If the worn drive is years old, it isn’t worth the trouble of trying tricks like low-level formatting; the RAID software should automatically detect corrupt data and rewrite it onto a good sector. With data sets on the order of terabytes and petabytes these days, you should expect a few flipped bits in any case, which a good RAID will automatically correct.
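
That detect-and-rewrite pass is what Linux software RAID calls a “check” (on Synology, the supported way to run it is DSM’s Data Scrubbing). If you have shell access you can trigger it by hand; a minimal sketch, assuming the data array is md2, which is common on Synology units but worth verifying first:

```python
# Hedged sketch: manually trigger an md "check" pass, i.e. a scrub.
# Assumes the data array is /dev/md2 (verify with: cat /proc/mdstat).
# Needs root. During the pass, md reads every stripe and rewrites any
# sector that fails to read, using the array's redundancy.
from pathlib import Path

sync_action = Path("/sys/block/md2/md/sync_action")
sync_action.write_text("check\n")
print("array state:", sync_action.read_text().strip())
```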

BTW what is this mysterious format which is described as RAID 6 but not exactly?

Low-level formats haven’t really existed since the MFM or RLL days. You can wipe the partition table and reinitialize the MBR or GPT, but that’s about it.

It’s Synology’s semi-proprietary format, though not as restrictive as Drobo’s. You don’t HAVE to use it, but in theory it makes things easier than plain RAID.

Losing a couple of sectors in quick succession is usually a bad sign. The SMART status won’t necessarily tell you much - it can’t know the underlying cause of the failures. It could just be bad luck. But a failing drive often manifests as an accelerating number of bad sectors: as time goes on, the failure rate climbs until bad sectors appear in a constant cascade.

Having been burned in the past, there is no way I would hold off on replacing the disk. Drives are too cheap to worry about the money, and for all practical purposes you are now running with one-disk redundancy.

So I have a new disk ready to replace it, and will do so shortly.
The issue seems to be “Current Pending Sector” count = 2.
From what I understand, this means the drive failed to read those sectors and has flagged them, but has not remapped them yet.
I have no idea why they have not been remapped, or whether it is normal for that not to happen right away.
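
For reference, these are the counters I’m watching over SSH; a small sketch (the device path is an assumption):

```python
# Sketch: pull the relevant raw counters out of smartctl's attribute
# table. IDs 5, 197 and 198 are standard; the device path is assumed.
import subprocess

out = subprocess.run(
    ["smartctl", "-A", "/dev/sda"], capture_output=True, text=True,
).stdout

# 5: Reallocated_Sector_Ct, 197: Current_Pending_Sector,
# 198: Offline_Uncorrectable
WATCH = {"5", "197", "198"}
for line in out.splitlines():
    fields = line.split()
    if fields and fields[0] in WATCH:
        print(f"{fields[1]}: raw value {fields[-1]}")
```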

In the old days, we would all get together and cry about a bad sector, then go through great pains to reformat the drive and run all these fancy tools on it to keep it working. But that was because hard drives were so expensive, and these days they aren’t. Just to not have to be concerned about it, I’d replace it now, at the first sign of any trouble.

What I’d replace it with is a hard drive that has a 3- or 5-year warranty; I prefer to buy drives with a 5-year warranty. Not because I expect a free replacement if it fails within 4 years or whenever, but because the warranty tells me the manufacturer expects it to last 5 years without a problem. Even though drives with a 5-year warranty are more expensive, I consider the lowered risk of downtime and replacement hassle worth it.

I’ve run RAID arrays on Linux servers for many years now, and if a drive gives any indication of a problem and SMART suggests it might be failing, I replace it, because I don’t like unscheduled downtime.

As for RAID layout, I run a three-disk mirror with one hot spare. In the event of a failure, the hot spare gets synced into the array while the system keeps running on a two-disk mirror; once the sync completes, I replace the bad drive, and it becomes the new hot spare.
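
For concreteness, here is what that layout looks like built with Linux mdadm; a sketch only, with hypothetical device names, and note that --create destroys whatever is on those disks:

```python
# Illustrative only: a three-way mirror plus one hot spare via mdadm.
# Device names are hypothetical; --create wipes those disks.
import subprocess

subprocess.run([
    "mdadm", "--create", "/dev/md0",
    "--level=1",           # mirroring
    "--raid-devices=3",    # three live copies of the data
    "--spare-devices=1",   # one hot spare, synced in automatically on failure
    "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde",
], check=True)
```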

I guess it also depends on how you like to do things. I prefer to take our cars in for preventive maintenance because I don’t like unscheduled interruptions to service. I feel the same way about servers, and I’d rather know it is time to replace a hard drive, either by SMART or by age, before it becomes a problem. I know people will have stories about drives lasting 10 years or more without a problem, and I have drives like that in non-critical situations too, but the days are gone when I’d hesitate to replace a drive, considering how low the cost is.

The drive firmware is probably waiting for you to attempt to re-read, or write to, the bad sector(s), at which point they will be reallocated.
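
If you want to force the issue on a known LBA (taken from the SMART self-test log or the kernel log), overwriting that sector will usually make the firmware remap it. A hedged sketch using hdparm; the device and LBA below are placeholders, and this destroys the contents of that sector:

```python
# Hedged sketch: force reallocation of one suspect LBA by overwriting
# it with hdparm. DEV and BAD_LBA are placeholders; get the real LBA
# from the SMART self-test log. This wipes that sector's data.
import subprocess

DEV, BAD_LBA = "/dev/sdX", 123456789   # placeholders

subprocess.run([
    "hdparm", "--yes-i-know-what-i-am-doing",   # hdparm's required safety flag
    "--write-sector", str(BAD_LBA), DEV,
], check=True)
# Afterwards, Current_Pending_Sector should drop and, if the sector was
# truly bad, Reallocated_Sector_Ct should rise.
```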

Just to round out this discussion: if the bad sectors are not just bad luck (manufacturers quote unrecoverable read error rates on the order of 1 in 10^14 bits, and you are constantly rolling the dice) and the drive is truly failing, consider that the other drives in the array may be of similar age and manufacture, in which case their expected time to failure is shorter than for brand-new drives. Also, copying data onto a huge new drive is not instantaneous, and there is the possibility of a further failure during the operation. So the consensus in this thread is indeed better safe than sorry.
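
To put a number on that dice roll, here is a back-of-envelope calculation; the drive size is an assumed example, and since vendors quote the error rate as an upper bound, treat the result as pessimistic:

```python
# Back-of-envelope: odds of at least one unrecoverable read error (URE)
# while reading a full drive, e.g. during a rebuild. 8 TB is an assumed
# example size; 1e-14 is the vendors' quoted worst-case rate.
import math

BER = 1e-14                  # unrecoverable errors per bit read
bits_read = 8e12 * 8         # one full 8 TB drive, expressed in bits
expected_errors = bits_read * BER
p_at_least_one = 1 - math.exp(-expected_errors)  # Poisson approximation
print(f"expected UREs: {expected_errors:.2f}, "
      f"P(at least one): {p_at_least_one:.0%}")
# prints: expected UREs: 0.64, P(at least one): 47%
```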

This is why, when running a two-disk mirror, I decided to go to a three-disk mirror: if one of the disks fails in a two-disk mirror, you are down to one disk. And because the disks were purchased and put into service at the same time, they are all the same age, so you are risking it all on one disk.

These days, as soon as the disks come up on the end of their 5-year warranty, I replace them.

So why don’t you use RAID 6? You still have two-disk redundancy and you get more storage. I ask because I am wondering whether a mirroring RAID would be a good solution for me.

You chose SHR-2 over the other options, I presume, because you didn’t want to lose your data to drive failure. Waffling on that now doesn’t make sense IMO. I use SHR-2 after having experienced multiple (mirrored) drive failures in RAID-10 environments (entire weekends of no fun at all, I assure you).

Think about what you’re going to do when a different drive in your array fails completely. During the rebuild after replacement, you’ll be depending on this flaky drive to deliver its contents under a higher load that stresses all your other drives too. If you’re the kind of person with enough nails to bite through that entire process, then you can leave it to another day, but I’d personally welcome the opportunity to introduce some variety to the (I imagine) homogeneous mix of disks.
(SHR’s main advantages are that it makes better use of capacity when your disks are of different sizes, and that you can grow the array after creation by adding disks or by swapping existing disks for larger ones. That’s especially useful on the larger 5-, 8-, or however-many-bay models; see the comparison below.)
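
To make the capacity point concrete, a quick comparison using the rule of thumb behind Synology’s published RAID calculator (the drive sizes are made up, and this is the one-disk-redundancy case for simplicity):

```python
# Quick comparison of usable space with mixed drive sizes, using the
# rule of thumb behind Synology's RAID calculator (sizes are made up,
# one-disk-redundancy case for simplicity).
disks_tb = [2, 3, 4, 4]

raid5 = (len(disks_tb) - 1) * min(disks_tb)  # classic RAID 5: 3 * 2 = 6 TB
shr1  = sum(disks_tb) - max(disks_tb)        # SHR-1: 13 - 4 = 9 TB
print(f"RAID 5: {raid5} TB usable, SHR-1: {shr1} TB usable")
```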

Retired IT guy here. Yes, replace the drive now.