AWS S3 backup has objects not files

Can anyone explain why AWS S3 doesn’t use a file system like Linux? My understanding is that it uses objects. What is the logic in doing this? What is this catering to?

Here is my concern about this. On Linux, for backups I like to run rsync. It preserves all the file info and permissions, so when it is restored it looks exactly like it did on the original system.
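
Roughly, my nightly job is just something like this (the host and paths are examples):

    # archive mode (-a) keeps permissions, ownership and timestamps; -H keeps hard links
    rsync -aH --delete /srv/data/ backuphost:/backups/data/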

And a better question is: how can you use a Linux file system to do a backup to AWS S3 so it can be easily restored without losing any of the file information?

I’ve also noticed that AWS S3 at the command line complains when doing backups of linked files.

I considered doing a compressed TAR file, but that seems wasteful since it doesn’t allow for delta changes for the backup like rsync does.

I’m sure I’m missing something here. For all the services and layers AWS has created and continues to create, I can’t believe something as simple as using rsync to do a backup isn’t part of what S3 allows.

Please enlighten me.

S3 doesn’t do files because it’s not a filesystem. It’s a distributed key-value store. Sometimes the keys may look a bit like filenames, but that’s just an illusion.

But there are ways to abuse S3 and make it behave kinda like a filesystem. s3fs is a FUSE module that will let you mount an S3 bucket locally and copy files to it. It will attempt to preserve file permissions and the like via S3’s object metadata system.
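
Very roughly, it looks like this (bucket name and paths are placeholders, and performance won’t be anywhere near a real filesystem):

    # mount the bucket (credentials stored in ~/.passwd-s3fs), then treat it like a directory
    s3fs my-backup-bucket /mnt/s3backup -o passwd_file=${HOME}/.passwd-s3fs
    rsync -a /srv/data/ /mnt/s3backup/data/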

But if you’re just looking for a cloud backup system, something like Dropbox may be more appropriate.

Thanks for the reply and explanation. On AWS, if you are running an EC2 instance, how are people backing up the files of the web server so they can be restored without a problem?

If you’re running EC2 instances, you can attach permanent virtual block storage devices (think virtual hard drives) to each instance. Those are designed with distributed redundancy built in (completely opaque to you, but it seems to work very well) so they will survive a reasonable amount of hardware failure at AWS. If you shut the instance down, the block storage device remains and can be attached to a new instance.

Oh, also you can make point-in-time snapshots of your block storage devices and save those to S3, so you can restore an old version very easily.
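
With the CLI that’s roughly this (the IDs here are placeholders):

    # take a point-in-time snapshot of an attached EBS volume
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly snapshot"
    # to restore later, build a fresh volume from the snapshot and attach it to an instance
    aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a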

So that’s the common way to back up an AWS EC2 instance that is hosting a bunch of CMSes: you put it on EBS (the virtual block storage) and leave it there? I’ve not looked into the pricing on that; I thought it would be much more expensive than using S3.

The other part of this that bothers me is the file size limit for AWS S3, where you need to do a multipart upload to get a large file there. Then how do you know it’s correct without downloading it and comparing an MD5 sum of the original against what you downloaded? Is there some sort of checksum available for what is in AWS S3?
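
In other words, I’d like to avoid having to verify every upload the roundabout way, something like this (the names are just examples):

    # hash the original, download the copy from S3, hash that too, and compare
    md5sum big-backup.tar
    aws s3 cp s3://my-backup-bucket/big-backup.tar ./big-backup.check.tar
    md5sum big-backup.check.tar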

I’m thinking about the company’s file server, which is hosted on-site and has about 3 TB of data. I wanted to put it on AWS S3 for backup, but I can’t see getting around the objects-instead-of-files issue without making a compressed tar file and putting it on there. But as a nightly backup, it would require transferring the entire 3 TB each time, and that too would need to be transferred as multipart. It seems like, since we are moving everything to AWS, the backup of the local on-site file server should be there too. Maybe I’m trying to use AWS S3 for something it wasn’t intended to do?

Where are the point-in-time snapshots stored, in EBS?

I’m guessing it isn’t practical to use snapshot backups of a 3 TB file server? Of course, in most situations I would only need to restore part of it and not the entire thing.

If you take periodic incremental snapshots of a filesystem, only the first, full backup will take 3 TB; the subsequent ones will only be as big as the amount of data changed since the previous snapshot.

That is a general consideration, independent of Amazon services. Now, if your storage is on Amazon EBS, those incremental snapshots will be dumped to Amazon S3. If you later need to mount an image based on a snapshot, you can recreate an EBS volume from it.

If your data is stored on your own site, not Amazon, then you can still use S3 to keep your snapshot data; it’s just up to you to transfer it to/from Amazon (say, using a script involving ‘zfs send’ or ‘cpio’ or ‘rsync’ or whatever command you use to dump the data). S3 has a maximum object size of 5 TB currently, so that should not present a problem.
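
As a rough sketch of the transfer step (bucket and dataset names are placeholders):

    # stream a snapshot straight into an S3 object; "-" means stdin/stdout
    # --expected-size (bytes) helps the CLI pick part sizes for very large streams
    zfs send tank/data@nightly | aws s3 cp - s3://my-backup-bucket/tank-data-nightly.zfs --expected-size 3298534883328
    # restore later by streaming it back
    aws s3 cp s3://my-backup-bucket/tank-data-nightly.zfs - | zfs receive tank/data-restored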

He was talking about EC2 instances, but you say you have your own server, so you have no need to store anything on EBS merely to store your backup data.

Also, the individual S3 object max size is bigger than your entire server storage, so even a full dump of all the files will fit into a single object - no need to split anything up or use one object per file.

I thought AWS required it, or at least I got the impression it was a best practice, to do this:

Otherwise I would start a compressed tar file at the end of the business day, which would be less than 3 TB (considering compression), and have it transferred to an AWS S3 bucket with the object named for that day’s backup. Then have this run from cron each day with a new object name, perhaps keeping only a 30-day rotation.
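
Something along these lines (paths and bucket name are placeholders):

    #!/bin/sh
    # nightly-s3-backup.sh - e.g. crontab entry: 30 18 * * 1-5 /usr/local/bin/nightly-s3-backup.sh
    DATE=$(date +%F)
    # stream the tar straight to S3; --expected-size (bytes) helps with very large multipart streams
    tar -czf - /srv/fileserver | aws s3 cp - "s3://my-backup-bucket/nightly/fileserver-$DATE.tar.gz" --expected-size 3298534883328
    # a 30-day expiration lifecycle rule on the nightly/ prefix would handle the rotation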

What I don’t like about this is that it uses much more storage on AWS S3, because it isn’t incremental like rsync on Linux. And in the event someone says “I need to recover that JPG from yesterday’s backup”, I would have to download the compressed tar file, or at least put it on an EC2 instance temporarily, to unpack it and retrieve the file.

All this makes me feel like I must be approaching the backup of the on-site Linux file server to AWS S3 wrong. It just seems like a lot more work and compromises. In the past I would have another Linux server remotely and would run a cron job to do an rsync, and that took care of everything. I don’t want to do that because, as we all know, all the “cool kids” are in the cloud, so I want to get this solved. But I don’t like the idea of giving up the file permissions and dates that Linux handles fine. I guess I would have it upload a compressed tar file of about 3 TB each night, if that’s what everyone is doing. But are they using the multipart thing to do it?

There is no way “everyone” is uploading a full backup every night. A full dump takes 3 TB, sure. But you are not going to upload 3 TB the following night: if only 40 GB of files got changed, an incremental backup would only take 40 GB.

Now obviously there is going to be some recommended best practice depending on how long your dump cycle is and how long you are going to keep old backups around, e.g. one could do a full dump every Sunday and then incremental dump level 9 on Monday, level 8 on Tuesday, and so on down to level 4 on Saturday before beginning again. See some examples here and here.

What tools to use… in the old days you would run the command “dump” yourself, or use a manager like Amanda or Bacula or something else. Instead of tapes you will be using AWS S3. I’m not recommending any specific backup tool or script yet, because it depends on your exact setup and there are entire books on the subject I haven’t read. I’m only saying that you are not going to be using the compressed-tar-file or rsync method but rather a more sophisticated tool, and it’s not going to take 30 days × 3 TB of space, because you are going to use an incremental/differential backup scheme that is appropriate for you.
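
For what it’s worth, the bare-bones version of that Sunday-to-Saturday schedule with the classic “dump” command looks roughly like this (assuming an ext4 filesystem on /dev/sda1; a manager like Amanda or Bacula automates the same idea):

    # Sunday: full (level 0) dump; -u records it in /etc/dumpdates
    dump -0u -f /backups/sun-level0.dump /dev/sda1
    # Monday: level 9 - everything changed since the most recent lower-level dump (Sunday's full)
    dump -9u -f /backups/mon-level9.dump /dev/sda1
    # Tuesday: level 8, Wednesday: level 7, and so on down to level 4 on Saturday
    dump -8u -f /backups/tue-level8.dump /dev/sda1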

Anyway, open-source software like Bacula et al seems pretty popular, already supports AWS/Azure/Google/whatever and seems to do what you want in terms of being able to restore files with minimal overhead so I would look into a few of these off-the-shelf solutions first before rolling my own or paying a lot for a custom “enterprise” version.

Thanks.

You’ve given me helpful options to consider.

You are welcome. This is the kind of problem with multiple valid solutions, so let us know what eventually worked in your case.

I’ve had some time, so I looked into this, and this is the plan so far, but it still has a problem, which I will get to at the end of this posting.

After looking closely at the file server, we’ve decided to take the legacy stuff and put it in a directory called COLD_STORAGE. This will be stored on AWS S3 Glacier Deep Archive. This is data we would only need to retrieve in the case of a real disaster that took out both the file server and the local on-site backup, such as a flood. There aren’t many objects/files, so we didn’t feel it was necessary to create a tar file of it.

The rest of the data, which is about 114 GB, we’ve decided to store on AWS S3’s Standard-IA (Infrequent Access) class. Creating a tar file of that whole thing doesn’t make sense, because to retrieve one file, the entire 114 GB tar file would need to be downloaded. Not practical. The 114 GB has over 600K objects/files.
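
The upload side of the plan is roughly this (bucket name and local paths are placeholders):

    # legacy data goes straight to Glacier Deep Archive
    aws s3 cp /srv/fileserver/COLD_STORAGE s3://my-backup-bucket/COLD_STORAGE --recursive --storage-class DEEP_ARCHIVE
    # the active 114 GB goes to Standard-IA
    aws s3 sync /srv/fileserver/active s3://my-backup-bucket/active --storage-class STANDARD_IA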

I have done some experimenting with s3cmd, which is an open-source AWS CLI-type tool. It has some good features; for example, it takes care of the multipart stuff for you on files over the chunk size, which defaults to 15 MB.
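
For example (bucket name is a placeholder):

    # s3cmd switches to multipart automatically for files above the chunk size (default 15 MB)
    s3cmd put --multipart-chunk-size-mb=100 big-file.tar s3://my-backup-bucket/big-file.tar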

What looked to be a very useful feature in s3cmd, which the aws cli is lacking, is the --preserve option on the command line, which will preserve the Linux file permissions and timestamps. It stores this in the metadata on AWS S3, so when you go to restore it, it will transfer the file and restore the Linux file permissions and timestamps. We rely on the file permissions and, more importantly, the timestamps when working with so many files. Having the timestamp changed to when the file was last transferred is kind of useless to us.

OK, now the very annoying part. s3cmd with the --preserve option forces it to encrypt the file. When you download it back from AWS S3, it then decrypts the file, but it entirely ignores the file permissions and timestamp that are in the metadata and instead writes it with the current time and sets the file permissions to whatever your umask gives. There is an option on the s3cmd command line to have it not encrypt, but it is ignored. I looked in the bug reports for s3cmd, and someone complained about the permissions issue years ago and it has not been addressed. I’m not knocking the people who worked on s3cmd, just stating my path to solving this problem. I love open-source, but people need time to work on things.

So at this point, I’m wondering if I need to write my own wrapper to put around the aws cli to do the backups, so that the Linux file permissions and timestamps are stored in the metadata and can be retrieved and restored when transferring the files out of AWS S3.

I’m still kind of surprised that more people aren’t concerned about the loss of the Linux file permissions and timestamps when things are stored on AWS S3 by default. I guess if they are mostly using it to serve the world many PDFs of product catalogs, they don’t care about these things. But for backup and restore of a file server, we find it important to have. I’m still open to hearing what others are doing, or whether they are just ignoring this, if that’s the case.

I can run my own test later, but this thread and the linked issue claim it was fixed by the end of 2017. At least make sure you are using the latest version of s3cmd and some of the options they recommend, like s3cmd sync --cf-invalidate --preserve --recursive --delete-removed, though according to the documentation --preserve is the default for the “sync” command.

Thanks. I just tried that s3cmd command, and while it looked like it completed, it also had an error message:

Using s3cmd version 2.0.2.

I think the problem might be the default encryption, because s3cmd writes the file locally and might be losing the permissions and timestamp when it writes the file out on the Linux server. I don’t want to use client-side or server-side encryption at rest, because I want a regular aws cli to be able to get the files without having to deal with the encryption. I’m really only concerned about things being encrypted/secure while they are in transit to and from AWS. I thought AWS S3 had some default at-rest encryption that was transparent anyway.
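
If I understand it right, that transparent at-rest encryption is the bucket-level SSE-S3 setting, which I believe can be checked and turned on with something like this (bucket name is a placeholder):

    # check whether the bucket already has default at-rest encryption
    aws s3api get-bucket-encryption --bucket my-backup-bucket
    # if not, enable SSE-S3 (AES256); it is transparent to normal GET/PUT requests
    aws s3api put-bucket-encryption --bucket my-backup-bucket \
        --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'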

The file permissions are still an issue, and the timestamps are off:

Access: 2020-01-10 18:40:17.000000000 -0500
Modify: 2020-01-10 18:40:17.000000000 -0500
Change: 2020-01-10 13:48:34.530200300 -0500

13:48:34 is the time it was transferred from AWS S3 to our file server, and the time is now 14:02, so the Access and Modify times are bogus because they are in the future.

Thanks for your help, but I think s3cmd isn’t ready for prime time.

I think if I write something in Python around the aws cli, without the encryption, that stores the mode and timestamps in the metadata, I will be able to get this done.
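
Roughly this idea, sketched here with plain shell and the aws cli (the bucket name and the metadata key names are my own invention):

    #!/bin/sh
    # backup: record mode, mtime and owner in the object's metadata
    FILE="$1"
    BUCKET="my-backup-bucket"
    MODE=$(stat -c %a "$FILE")      # octal permissions, e.g. 644
    MTIME=$(stat -c %Y "$FILE")     # mtime as a Unix epoch
    OWNER=$(stat -c %u:%g "$FILE")  # uid:gid
    aws s3 cp "$FILE" "s3://$BUCKET/$FILE" \
        --metadata "file-mode=$MODE,file-mtime=$MTIME,file-owner=$OWNER"

    # restore: fetch the object, read the metadata back, and reapply it
    aws s3 cp "s3://$BUCKET/$FILE" "$FILE"
    MODE=$(aws s3api head-object --bucket "$BUCKET" --key "$FILE" \
        --query 'Metadata."file-mode"' --output text)
    MTIME=$(aws s3api head-object --bucket "$BUCKET" --key "$FILE" \
        --query 'Metadata."file-mtime"' --output text)
    chmod "$MODE" "$FILE"
    touch -d "@$MTIME" "$FILE"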

I still can’t believe AWS doesn’t have this as a workable option in their aws cli, but there might be other reasons why they have avoided doing it.