I’ve had some time, so I looked into this. Here is the plan so far, but it still has a problem, which I will get to at the end of this posting.
After looking closely at the file server, we’ve decided to take the legacy stuff and put it in a directory called COLD_STORAGE, which will be stored in AWS S3 Glacier Deep Archive. This is data we would only need to retrieve if a real disaster, such as a flood, took out both the file server and the local on-site backup. There aren’t many objects/files, so we didn’t feel it was necessary to create a tar file of it.
The rest of the data, which is about 114 GB, we’ve decided to store in AWS S3 Standard-IA (Infrequent Access). Creating a tar file of that whole thing doesn’t make sense, because to retrieve one file, the entire 114 GB tar file would need to be downloaded. Not practical. The 114 GB contains over 600K objects/files.
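For reference, if we end up scripting this ourselves (more on that below), the storage class can be chosen per upload. A minimal sketch using boto3 (the Python SDK the AWS CLI is built on), with a hypothetical bucket name and file paths:

```python
import boto3

s3 = boto3.client("s3")

# Cold data: written once, only retrieved in a disaster.
s3.upload_file(
    "COLD_STORAGE/old_project.dat",        # hypothetical local path
    "my-backup-bucket",                    # hypothetical bucket
    "COLD_STORAGE/old_project.dat",
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)

# Active but infrequently read data: Standard-IA.
s3.upload_file(
    "projects/report.xlsx",
    "my-backup-bucket",
    "projects/report.xlsx",
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```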
I have done some experimenting with s3cmd, which is an open-source alternative to the AWS CLI. It has some good features; for example, it takes care of multipart uploads for you on files over the default chunk size, which is 15 MB.
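If we do end up wrapping the AWS tooling ourselves, boto3 handles multipart the same way, and the threshold is configurable. A minimal sketch, again with a hypothetical bucket and file name:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Anything over the threshold goes up as a multipart upload,
# roughly matching s3cmd's 15 MB default behavior.
config = TransferConfig(multipart_threshold=15 * 1024 * 1024)

s3.upload_file("projects/big_scan.tif", "my-backup-bucket",
               "projects/big_scan.tif", Config=config)
```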
What looked to be a very useful feature in s3cmd, which the AWS CLI is lacking, is the --preserve command-line option, which preserves the Linux file permissions and timestamps. It stores them in the object metadata on AWS S3, so when you go to restore a file, it transfers the file and restores the Linux file permissions and timestamps. We rely on the file permissions, and more importantly the timestamps, when working with so many files. Having the timestamp changed to whenever the file was last transferred is pretty much useless to us.
OK, now the very annoying part. s3cmd with the --preserve option forces the file to be encrypted. When you download it back from AWS S3, it decrypts the file, but it entirely ignores the file permissions and timestamp stored in the metadata; instead it writes the file with the current time and sets the permissions to whatever your umask gives. There is a command-line option to tell s3cmd not to encrypt, but it is ignored. I looked through the s3cmd bug reports, and someone complained about the permissions issue years ago, but it has not been addressed. I’m not knocking the people who work on s3cmd, just describing my path to solving this problem. I love open source, but people need time to work on things.
So at this point, I’m wondering if I need to write my own wrapper around the AWS CLI to do the backups, so that the Linux file permissions and timestamps are stored in the metadata and can be retrieved and restored when transferring the files back out of AWS S3.
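In case it helps anyone thinking along the same lines, here is a minimal sketch of what that wrapper could look like using boto3. The bucket name, paths, and metadata key names are all made up for illustration; this is a sketch of the idea, not a tested tool:

```python
#!/usr/bin/env python3
"""Sketch: keep Linux mode/uid/gid/mtime in S3 object metadata on upload,
and put them back on the file when downloading."""
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"   # hypothetical bucket name


def backup_file(path, key):
    st = os.stat(path)
    meta = {
        "file-mode": oct(st.st_mode & 0o7777),   # permission bits only
        "file-uid": str(st.st_uid),
        "file-gid": str(st.st_gid),
        "file-mtime": str(st.st_mtime),
    }
    s3.upload_file(path, BUCKET, key, ExtraArgs={"Metadata": meta})


def restore_file(key, path):
    s3.download_file(BUCKET, key, path)
    meta = s3.head_object(Bucket=BUCKET, Key=key)["Metadata"]
    if "file-mode" in meta:
        os.chmod(path, int(meta["file-mode"], 8))
    if "file-mtime" in meta:
        mtime = float(meta["file-mtime"])
        os.utime(path, (mtime, mtime))
    if "file-uid" in meta and "file-gid" in meta:
        try:
            # chown needs root; skip quietly if we don't have it
            os.chown(path, int(meta["file-uid"]), int(meta["file-gid"]))
        except PermissionError:
            pass


if __name__ == "__main__":
    backup_file("projects/report.txt", "projects/report.txt")
    restore_file("projects/report.txt", "/tmp/report.txt")
```

The point of the sketch is just that S3 user metadata rides along with the object, so nothing stops a wrapper from doing what s3cmd’s --preserve promises, minus the forced encryption.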
I’m still kind of surprised that more people aren’t concerned about losing the Linux file permissions and timestamps when things are stored on AWS S3 by default. I guess if they are mostly using it to serve the world piles of product-catalog PDFs over the internet, they don’t care about these things. But for backup and restore of a file server, we find it important. I’m still open to hearing what others are doing, or whether they just ignore this.