TL;DR: there was a fire at a data center in South Korea. Shit happens, but there was this notable detail about a drive system that hosted 858 TB of data:
According to a report from The Chosun, the drive was one of 96 systems completely destroyed in the fire, and there is no backup.
“The G-Drive couldn’t have a backup system due to its large capacity,” an unnamed official told The Chosun. “The remaining 95 systems have backup data in online or offline forms.”
There is no backup, because of its large capacity? Wait…what?
I’m not an IT guy, so I’ll ask the naive question: Is there something about a large drive system that makes it feasible to have one, but impossible to have a second one as a backup? Logistics? Cost? Complexity? Or was this really just the catastrophic lapse of judgment that it appears to be?
All of those factors play a part, but this is a major screw up by people. A petabyte of storage is fully capable of being backed up. Here is a white paper from Wasabi, a backup vendor. Their estimates are about USD$1.3M for 5 years.
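To sanity-check the order of magnitude, here is a quick back-of-the-envelope sketch in Python. The per-TB rate below is an assumed hot-cloud-storage list price, not Wasabi’s actual quote, and it covers storage charges only:

```python
# Rough back-of-the-envelope check, NOT Wasabi's actual quote: the per-TB
# rate below is an assumed list-style price for hot cloud storage.
CAPACITY_TB = 858          # reported size of the lost "G-Drive"
PRICE_PER_TB_MONTH = 7.0   # assumed $/TB/month; vendors vary
YEARS = 5

storage_only = CAPACITY_TB * PRICE_PER_TB_MONTH * 12 * YEARS
print(f"Storage charges alone over {YEARS} years: ${storage_only:,.0f}")
# -> roughly $360,000; white-paper figures like $1.3M typically add
#    egress, staffing, DR testing, and other overhead on top.
```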
I agree with the preceding that, if this is being reported accurately, somebody royally screwed the pooch. It might have been a little harder to put this into a backup scheme but it’s far from impossible.
The only thing I can think is that there’s some sort of mistranslation or misunderstanding — something like, maybe the data center was itself in part a backup facility, and this lost drive was already a backup repository for something else, in which the other systems had redundancy but this one didn’t. This is a stretch, but it’s hard to make the story sensible otherwise.
When you are getting into this much data, typical backups are not used. It’d be very difficult to do and very expensive.
What they do is called “continuous replication.” It is like it sounds and, generally, they would have at least three data centers and each is a clone of the others (near enough). Of course, you distribute them widely. Having them in the same building is an obviously bad idea.
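For anyone curious what “continuous replication” looks like conceptually, here is a deliberately naive sketch in Python with made-up paths. Real deployments do this at the storage-array or object-store layer with change tracking, not a script that rescans everything:

```python
# Conceptual sketch only: real deployments use storage-array or
# object-store replication, not a hand-rolled script like this.
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/g-drive")                         # hypothetical primary
REPLICAS = [Path("/mnt/site-b"), Path("/mnt/site-c")]  # hypothetical remote sites

def replicate_once() -> None:
    """Copy any file that is new or changed on the primary to every replica."""
    for src in SOURCE.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(SOURCE)
        for replica in REPLICAS:
            dst = replica / rel
            # Only copy if the replica is missing the file or has an older copy.
            if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)

if __name__ == "__main__":
    while True:            # "continuous": repeat forever with a short pause
        replicate_once()
        time.sleep(60)
```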
Of course, all of that costs money, so a cheap company might skip it and keep their fingers crossed. Not a good strategy.
How so? Maybe I’m missing something (and I’m sure I am), but I can go down to my local Best Buy and purchase a memory card with 1 TB of storage on it for less than $100, about the size of half a postage stamp. I can get an external USB drive with 24 TB for about $350. 858 of the former or about 36 of the latter would be expensive for a regular person, but it should be chump change for the South Korean government.
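Putting rough numbers on that comparison, using the retail prices quoted above as assumptions and ignoring speed, reliability, and the hassle of managing hundreds of devices:

```python
# Back-of-the-envelope consumer-hardware math from the post above.
# Prices are the rough retail figures quoted there, not real quotes.
import math

total_tb = 858

sd_card_tb, sd_card_price = 1, 100        # ~1 TB card, ~$100
usb_drive_tb, usb_drive_price = 24, 350   # ~24 TB external drive, ~$350

cards = math.ceil(total_tb / sd_card_tb)      # 858 cards
drives = math.ceil(total_tb / usb_drive_tb)   # 36 drives

print(f"{cards} cards  ~ ${cards * sd_card_price:,}")      # ~$85,800
print(f"{drives} drives ~ ${drives * usb_drive_price:,}")  # ~$12,600
```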
For one thing, those consumer drives tend to be slooow. Not ideal for this use.
For another, they have iffy reliability. Good enough for consumer use but not in a data center. Our camera guy always carries extra memory cards because it is common enough for one to go bad on him out in the field.
I remember in the old days Google data centers had guys running around with new hard drives replacing broken ones 24/7. Sure, a given drive might be rated for 20,000 hours, but when you have nearly that many drives in a data center, one is always failing. I imagine this still happens to some extent.
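Their point falls out of a rough steady-state approximation using those same numbers (real fleets are sized with MTBF/AFR figures, but the shape of the math is the same):

```python
# Rough illustration of the poster's point, using their own numbers.
drives = 20_000          # roughly "nearly that many" drives in a data center
rated_hours = 20_000     # nominal life of one drive

# If each drive lasts about rated_hours on average, the fleet as a whole
# fails at roughly drives / rated_hours failures per hour.
failures_per_hour = drives / rated_hours
print(f"~{failures_per_hour:.1f} drive failure(s) per hour, "
      f"~{failures_per_hour * 24:.0f} per day")   # ~1/hour, ~24/day
```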
If the data wasn’t backed up, what does the article in the OP mean by this, “As of Saturday, only 115 of 647 affected networks had been restored, a recovery rate of 17.8 percent. A full recovery is expected to take a month. The government says it will offer alternatives to the most important services.”
Thanks. I imagine that there’s one person in the office who never trusted this whole “cloud” thing and is going to be gloating that they were proven correct.
As others have said, “couldn’t” can’t be the right word in that article. That particular drive certainly could have been backed up if that was seen as necessary and desirable.
My hunch (which could be way off) is that this particular drive was meant to be used for a particular purpose that didn’t justify the cost of a robust backup system. Maybe it was for temporary files or archival use or remote access to files that should always be backed up somewhere else. I can think of plenty of use cases where you would create a large storage system that didn’t justify the cost of continuous backup.
But then these government workers realized it was handy to have all of their data online, so they started using the “G-Drive” to store lots of data that was probably never meant to live there, since it wasn’t reliably backed up. I’ve seen something similar happen in past jobs, where shared drives were being used to store critical system information and the users didn’t realize the “server” was just a shitty PC in a lab somewhere. Fortunately it was identified and moved to more robust systems before something catastrophic happened.
You’d think the South Korean government could swing that though; here in the US, large cities and companies do that sort of thing routinely, and it’s not even something that gets questioned.
I mean, if we’re talking about that triple backup with continuous replication and using the storage numbers from @FinsToTheLeft, we’re probably talking about what… 5 million a year? That’s maybe not chump change for a large scale IT organization, but it’s not a particularly large spend either. It probably is chump change for a state government or our Federal government though, so I’d think the SK government could do it without blinking.
The issue is more that they’ve got some combination of lazy, incompetent, and cheap people working in their IT department. I mean, referring to all that as a “G-Drive”? I’m assuming that’s some network share on a SAN or NAS and that’s administered through AD… to their whole organization? With nearly a petabyte on it? That’s where their problem really lies, not in whether or not it’s actually able to be backed up or replicated.
That’s almost certainly what was going on. That sort of thing has happened at every employer I’ve ever worked at. People just fundamentally like having a mapped drive or share that they can just dump stuff in and go on about their business. But as the IT folks, we have to be aware of that sort of thing and enforce policies and procedures that make sure they’re storing the right data in the right places, or that those drives are indeed being backed up. This gets back to the laziness and incompetence I referenced earlier.
It’s very easy these days. Just rent 200 GB of cloud space and you’re good. Unless you make a point of collecting a copious number of videos and pictures, you’ll never need even that much. If you do, you can rent even more space.
The problem, too, is timing. To back up, say, a giant database, you would have to pause updates while the data is copied. You cannot copy it while it is being updated, or you don’t get a consistent copy - you would have some pieces referring to other pieces that did not get copied, etc.
This can be done with a few strategies. If, say, it’s something like a DMV database, do it at night. Or freeze updates, build a file of pending updates, and apply them after the backup. Or, if access is critical, the database program will check the pending update file as well as the database during the backup, and correct the answer to any queries with the updated data. Usually (always), this functionality is built in as part of the database system. Sometimes it’s a matter of a full backup every week and a copy of the updates applied each day, so possibly they lose a week’s worth of updates that did not get redundantly copied offsite.
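As a rough illustration of that last “weekly full plus daily incrementals” pattern, here is a hypothetical file-level sketch in Python. Real database engines ship their own dump and log-shipping tooling for this, so treat the paths and helpers as placeholders:

```python
# Hypothetical sketch of a "weekly full + daily incremental" scheme;
# the paths below are placeholders, not a real product's layout.
import datetime
import tarfile
from pathlib import Path

DATA = Path("/data/db")             # hypothetical database file directory
BACKUPS = Path("/backups/offsite")  # hypothetical offsite-mounted target

def full_backup(day: datetime.date) -> None:
    """Weekly: archive everything into a dated tarball."""
    with tarfile.open(BACKUPS / f"full-{day}.tar", "w") as tar:
        tar.add(DATA, arcname="db")

def incremental_backup(day: datetime.date, since: datetime.datetime) -> None:
    """Daily: archive only files modified since the last backup."""
    with tarfile.open(BACKUPS / f"incr-{day}.tar", "w") as tar:
        for f in DATA.rglob("*"):
            if not f.is_file():
                continue
            mtime = datetime.datetime.fromtimestamp(f.stat().st_mtime)
            if mtime > since:
                tar.add(f, arcname=str(f.relative_to(DATA)))

# e.g. run full_backup() on Sundays and incremental_backup() the other days.
```

A restore is then the most recent full archive plus every incremental since, which is exactly why anything newer than the last copy that made it offsite can be lost.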
Then there’s the practical question of timing: how long will the database take to back up? To restore? How often is it done?
When they say “files” It makes me think this is a giant “archive” database, where people check in and out random files - spreadsheets, documents, etc. That can get huge.
Perhaps what they meant was that the timing of backups was such that anything updated in the last day or two (or week) was gone - something I recall from a few decades ago. “Last night’s backup failed, so we restored from the night before. Anything done yesterday or today is gone.”
I can think of plenty of scenarios where they were not totally negligent and this worst case scenario results in some data loss.
Another issue is how and where and when to get systems restored to an alternate site. One case I recall was a company whose data center was destroyed in a fire. They managed to get a replacement site up and running in a matter of days. But moving from the emergency recovery site to the permanent replacement data center would also require a few days of downtime. They could accept systems being down for a few days when the original was in ashes, but scheduling another multi-day outage without a fire was the difficult part. Over a year later, they were still in the emergency site.
That seems too high. For comparison, AWS charges about $1k-$2k/mo for a petabyte of backup (S3 Glacier), depending on region, and that already includes internal multi-zone backups (within a region). Double or triple those costs if you want to replicate it across regions (i.e., have backups in multiple continents). At the upper end, that’s $360k for 5 years, with multiple backups in multiple countries. Add in 20% internal staff and admin costs and that’s still under $500k for 5 years, or $100k/year — a single US IT person’s salary.
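For anyone who wants to redo that arithmetic, a quick sketch; the per-GB rate is an assumed archive-tier price, not a quote, and it varies by region and tier:

```python
# Sanity check on the Glacier-class numbers above; the per-GB rate is
# an assumed archive-tier price and varies by region and tier.
capacity_gb = 858_000            # ~858 TB expressed in GB
price_per_gb_month = 0.002       # assumed ~$2/TB/month archive-tier rate
replicas = 3                     # copies in three separate regions

monthly = capacity_gb * price_per_gb_month * replicas
print(f"${monthly:,.0f}/month, ${monthly * 12 * 5:,.0f} over 5 years")
# -> about $5,148/month, ~$309,000 over 5 years (storage charges only;
#    retrieval, egress, and staff time are extra).
```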
Even Wasabi’s cloud estimates are crazy expensive. Granted, AWS has economies of scale and ruthless pricing, but I mean… it’s not exactly an impossible problem to solve. These days, backups are mostly just a problem you throw money at. It’s probably just what @Jas09 and @bump said… they didn’t want to bother.
And maybe that’s why one of the data center employees killed themselves…
From the article:
Meanwhile, a government worker overseeing efforts to restore the data center has died after jumping from a building.
As reported by local media outlet The Dong-A Ilbo, the 56-year-old man was found in cardiac arrest near the central building at the government complex in Sejong City at 10.50am on Friday, October 3. He was taken to hospital, and died shortly afterwards.
The man was a senior officer in the Digital Government Innovation Office and had been overseeing work on the data center network.
That was the fully burdened on-premise cost including staffing. The S3 pricing for raw storage would just be a small part of the overall cost, with data ingress/egress, data validation, DR testing, and other factors included.
It’s true, but presumably (this being a data center and all, with other existing backups) those costs wouldn’t all have needed to be borne from scratch. It should just be the incremental cost of adding a few more terabytes of storage and the like.
And that S3 storage was for three sets of backups. Even if you add in various other charges, if Amazon could keep one set of backups for $24k/year (or even double that, to $48k/year), it shouldn’t cost anybody 10x that amount, much less 100x (5 mil a year). The various cloud providers all offer similar enough prices: https://www.starwindsoftware.com/blog/aws-vs-azure-vs-google-cloud-vs-backblaze-b2-vs-wasabi/. I can believe maybe 1.5x-2x the cost or thereabouts, maybe 3x if it involves a ton of government bureaucracy, but not 10x+. It just seems like an exaggerated estimate from Wasabi.
In any case, any of those prices is certainly cheaper than having to redo all that lost work from scratch…