What programming approach for continuously building data file over weeks and years?

What approaches should I consider to write a program that will build a growing data file over the years, in a robust way?

In my original source, data appear every day in a pretty continuous way. This is kind of like stock prices or weather data - the same variables keep getting updated, second by second, day by day, over time. My source is an online data system to which I can submit SQL queries, though I can’t modify the system in any way. It holds a couple of years’ worth of data, which age out of the system. When I submit a query, in a little while I get back a .csv file.

I am doing this because I regularly run a proprietary analysis program, which works just fine. If you can live with the stock market analogy, let’s say my program creates buy and sell recommendations. My analysis method uses all the data I can get my hands on, several years’ worth, so I use data that have aged out of the source system in addition to data that are still currently available from the source system.

Because I want to use data that are no longer available, and because downloading much of the data set takes hours, I keep a local copy of the data, and occasionally download the stuff that’s new since the last download, and append it to my local copy. So far, so good.

But I want to automate this to run once a day. Extending my current approach would basically mean that, every day, I download yesterday’s data, append it to my multi-gigabyte master data file, and then do today’s analysis with the new master file.

Is this wise? It seems, I dunno, kinda amateurish to keep doing that. Shouldn’t I be updating the master file less frequently? Like maybe append yesterday to a data file for this month, then append the month to the master every time the month changes, or something like that?

Or am I being silly? I could keep backup copies of the master at various stages, and add some diagnostics that look for something going wrong (like trying to append data that are older than the last thing that was already appended).

Is there some kind of recommended practice for this sort of thing, that goes beyond fiddling with things until it looks like it’s working?

Your daily approach is fine. Just add a step to your script to rsync your data every day as well.

Are you storing the data as time/datestamped records in a common table, or are you storing each version as its own table/file?

The former is preferable, as it makes comparison of any two versions possible just by altering the parameters of a query, rather than switching data sources.
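
For instance, here is a minimal sketch of the single timestamped table using SQLite in Python (the table and column names are invented for illustration):

```python
# One common table keyed by timestamp; comparing any two periods is just two
# queries with different parameters. Table/column names are made up.
import sqlite3

conn = sqlite3.connect("master.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        ts     TEXT NOT NULL,      -- ISO-8601 timestamp, e.g. '2016-05-14T08:30:00'
        symbol TEXT NOT NULL,
        price  REAL,
        PRIMARY KEY (ts, symbol)   -- with INSERT OR IGNORE, re-loading a row is harmless
    )
""")

def rows_between(start, end):
    """Pull any slice of history just by changing the parameters."""
    return conn.execute(
        "SELECT ts, symbol, price FROM readings WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    ).fetchall()

last_week = rows_between("2016-05-01", "2016-05-08")
this_week = rows_between("2016-05-08", "2016-05-15")
```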

A lot depends on how big the file is (or will ever become), your programming chops, and the level of resiliency you need. Multiple gigabytes is getting a little unwieldy, but not unduly so.

If the file does start to get much larger, the usual tactics are things like using a journal/transaction log with periodic log truncation. Pretty much as you describe. However, it really doesn’t sound as if you are anywhere near that sort of activity. (You would need to be seeing that many updates per second, not per day, before this would matter.)

You are basically making the file system do the heavy lifting, and for your needs this is perfectly reasonable. Under the hood, many modern file systems are actually journal-based.

However, resiliency is where you do need to be more careful. You want to convince yourself that the file is always properly backed up, and that you will never lose data. Archive the file periodically, so that if there is a screw-up at some point you can roll back to an earlier valid copy and regenerate the missing/corrupted data. Archives need backing up as well as your master file. (Backups and archives are not the same thing.)

Belt and braces to avoid corruption could even involve keeping a pair of files that are updated alternately. Running a consistency check over the file is also a good idea.
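
A consistency check on a timestamp-ordered CSV can be as small as the sketch below (Python; the file name, the field count, and the assumption that an ISO-style timestamp sits in the first column are all placeholders):

```python
# A minimal consistency check, assuming a CSV whose first column is an ISO-style
# timestamp (so string comparison matches time order) and whose rows should never
# go backwards in time. File name and field count are illustrative only.
import csv

def check_master(path="master.csv", expected_fields=5):
    last_ts = ""
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_fields:
                raise ValueError(f"line {lineno}: expected {expected_fields} fields, got {len(row)}")
            ts = row[0]
            if ts < last_ts:
                raise ValueError(f"line {lineno}: timestamp {ts} goes backwards (after {last_ts})")
            last_ts = ts
    return True
```

Run it right after each append; if it throws, fall back to the other file of the pair.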

To aid with data management, are you using a local SQL database or a flat file?

Oh yeah: backups, backups, and more backups.

cron

Just in case nobody else mentioned it…backups :smiley:

Here is what we did for a large mainframe system:

We had 2 main files, called the odd-day & even-day files. Every day we copied the last 2 days’ data from every store to the current day’s file. Thus at all times we had a current master file, and a master file as of yesterday.

This doubled the download time, but gave us an automatic backup of almost-current data; plus the yesterday file was sufficient for a lot of our reporting & inquiry jobs, thus speeding up processing of, and access to, today’s master file.

Then the downloaded data was also appended to the weekly file, and to a month-to-date file.
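
Adapted to your situation (where re-downloading two days of everything is unattractive), the same rotation could look roughly like the Python sketch below; every file name here is a placeholder:

```python
# Alternate between an "odd" and an "even" master file so yesterday's copy is
# always left intact, and append the same new rows to a month-to-date file.
import os
import shutil
from datetime import date

def daily_files(today=None):
    today = today or date.today()
    odd = today.toordinal() % 2  # parity of the absolute day number, so it still alternates across month ends
    return (f"master-{'odd' if odd else 'even'}.csv",
            f"master-{'even' if odd else 'odd'}.csv")

def roll_forward(new_rows_path):
    current, previous = daily_files()
    # Start today's master from yesterday's copy; the other file stays untouched
    # as a ready-made "as of yesterday" backup. (The very first run starts fresh.)
    if os.path.exists(previous):
        shutil.copyfile(previous, current)
    new_rows = open(new_rows_path).read()
    with open(current, "a") as f:
        f.write(new_rows)
    with open(f"mtd-{date.today():%Y%m}.csv", "a") as f:
        f.write(new_rows)
```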

Plus a daily ‘reasonableness’ report was run against the downloaded data, giving statistics on the size of the file (number of transactions, $ amount, etc.) compared to the averages of the past week. This was checked by someone in the IT area, and it was the first report of the morning on the desks of the VP for Sales, several Regional Sales Managers, etc. An incomplete download would have been noticed very quickly!
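
A bare-bones version of that check for a daily CSV download might look like this (the stats file, the trailing-week window, and the 50% threshold are all arbitrary choices for illustration):

```python
# Compare today's row count against the average of the last seven downloads and
# complain loudly if it looks short. Stats file name and threshold are made up.
import json
import os

def reasonableness_check(download_path, stats_path="row_counts.json", threshold=0.5):
    rows = sum(1 for _ in open(download_path))
    history = []
    if os.path.exists(stats_path):
        with open(stats_path) as f:
            history = json.load(f)
    if history:
        recent = history[-7:]
        avg = sum(recent) / len(recent)
        if rows < threshold * avg:
            print(f"WARNING: only {rows} rows today vs. recent average of {avg:.0f}")
    history.append(rows)
    with open(stats_path, "w") as f:
        json.dump(history, f)
    return rows
```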

Plus all of these were automatically backed up daily, and a copy of the weekly one was physically transported to another site, which also had a copy of the master as of last Sunday.

Note: Skip this post if you can afford some slop in your data (e.g. it’s weather data so who cares if you miss a few records)…

Anyone can append daily deltas to a file or database table.

The real trick is getting it right.

If you have some sort of transaction sequence number that is guaranteed to increase, you are in pretty good shape. Just make sure you keep track of the sequence number retrieved on the last run, and do the next run based on that (with a subtle caveat below regarding delayed commits).

Don’t just use “today” blindly since your process might miss a few days if someone pulls the cord from your server when you are on vacation. Make sure you track the last key you pulled.
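
A sketch of that bookkeeping, assuming a monotonically increasing sequence column (the state file name, table name, and column name are placeholders for whatever the real source system uses):

```python
# Persist the highest sequence number pulled so far, and drive the next query
# from it rather than from today's date. Names here are illustrative only.
import os

STATE_FILE = "last_seq.txt"

def load_last_seq(default=0):
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    return default

def save_last_seq(seq):
    # Write-then-rename so a crash mid-write can't leave a half-written state file.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(seq))
    os.replace(tmp, STATE_FILE)

def build_query():
    last = load_last_seq()
    # Driven by the stored key, not by "today", so missed days catch up automatically.
    return f"SELECT * FROM source_table WHERE seq > {last} ORDER BY seq"
```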

If you don’t have keys to use, the next best choice is timestamps. Be careful with the granularity of the timestamp. If it is limited to a minute, understand that there could be records your data pull might miss because of that granularity.
This implies two things: you need to be able to compare records you get today with those you have in your local copy and discard duplicates (or update the records, if that is appropriate), and you need some degree of overlap. It is not unusual to have 4 hours of overlap for a daily batch job. The duplicate check is often a checksum of columns for a given key.

The overlap is often necessary because (especially with databases) the source system may have in-flight transactions that haven’t been committed but have already had their timestamp set - such as when the remote guy made some changes, got timestamped, and then got coffee before issuing “commit”. This can make records magically appear in the past, even though you already processed that date fully, so they get missed.
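
Here is a sketch of that overlap-plus-dedup pattern for appending a fresh CSV download to a local master copy (the 4-hour window and the whole-row checksum are assumptions, not anything from the original setup):

```python
# Re-request a window that overlaps the previous pull, then drop rows that are
# already in the master file by fingerprinting every column of each row.
import csv
import hashlib
from datetime import timedelta

OVERLAP = timedelta(hours=4)  # deliberate re-request of hours already pulled

def query_start(last_successful_pull):
    # Ask the source for everything since a little *before* the last pull, so
    # late-committed rows are still picked up; the dedup below discards repeats.
    return last_successful_pull - OVERLAP

def row_fingerprint(row):
    # Checksum over all columns; identical rows produce identical fingerprints.
    return hashlib.sha256("|".join(row).encode()).hexdigest()

def append_new_rows(download_path, master_path, seen):
    """`seen` is a set of fingerprints built from rows already in the master file."""
    with open(download_path, newline="") as src, open(master_path, "a", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            fp = row_fingerprint(row)
            if fp not in seen:
                writer.writerow(row)
                seen.add(fp)
```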

If possible, a sync-up job to balance the checkbook with the remote data is useful. In your case the remote data is eventually retired, but it would be nice to be able to occasionally sync the full dataset back to the oldest record they have.

One final note: if you are dealing with timezones be extra careful to convert all things to a common denominator such as EST or GMT. Getting TZ info stripped by some errant code is one sure way of losing records.
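
If the source hands back timestamps with an offset attached, normalizing them on the way in can be a one-liner; a sketch, assuming ISO-8601 strings:

```python
# Convert an offset-bearing ISO-8601 timestamp to UTC before it hits the master file.
from datetime import datetime, timezone

def to_utc(ts_string):
    # e.g. "2016-05-14T08:30:00-05:00" -> "2016-05-14T13:30:00+00:00"
    return datetime.fromisoformat(ts_string).astimezone(timezone.utc).isoformat()
```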

If you’re not downloading samples very frequently, in situations like this I like to store each sample in a separate file, named with the date and time of the sample, like 201605140830, so that lexicographic order is the same as chronological order. To make management a little easier you can store in a separate directory for each day (20160514/201605140830). This makes it easy to do things like select data from a given time range, delete data older than some threshold, take one random sample from each day, etc. However, if the number of samples gets very large, the file system overhead of having a large number of small files may make this method impractical.
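
A sketch of that layout (the directory root is invented; the name format follows the example above):

```python
# One file per sample, named so that string order equals time order; range
# selection is then just a string comparison on the file name.
import os
from datetime import datetime

def sample_path(ts: datetime, root="samples"):
    day_dir = os.path.join(root, ts.strftime("%Y%m%d"))
    os.makedirs(day_dir, exist_ok=True)                       # e.g. samples/20160514/
    return os.path.join(day_dir, ts.strftime("%Y%m%d%H%M"))   # e.g. .../201605140830

def samples_between(start, end, root="samples"):
    """Return sample files whose names fall in [start, end), e.g. ('201605140000', '201605150000')."""
    out = []
    for day in sorted(os.listdir(root)):
        for name in sorted(os.listdir(os.path.join(root, day))):
            if start <= name < end:
                out.append(os.path.join(root, day, name))
    return out
```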

–Mark

Advice: DON’T use any Microsoft database products - they come out with a new version, then suddenly all your programming needs to be redone!

Sounds like you’re downloading a comma-delimited ASCII file. You don’t give any info about what type of DB you are loading this into. Or are you just appending and making a huge flat file?

This isn’t going to be a problem since it seems that his data doesn’t come directly from the source database, but from the csv files he gets when he queries them. Timestamping them is not going to be an issue.
I have a database of manufacturing information which gets updated several times a day from csv files generated by translating downloads of manufacturing data. I think we have something like 500 Gig of data now, but I’m using an Oracle database so no worries.
We discard duplicate records to avoid issues when we update something more than once, and because we have meta-data which is almost always duplicated. I’d assume his queries would be for certain time periods (given that the data is continuous) so I don’t see why he’d get overlap. Perhaps the data has timestamps on it - I’d hope so. After all, you might want to sequence it.

It sounds like you have one giant csv file. Very inefficient. Putting it in a database will let you extract only the stuff you need for your analysis, and it will probably be faster than sequentially reading a giant file in a script. It would also make it easier to modify the analysis, say over different time ranges, by changing the query and not your code. I do queries in Perl all the time, and I assume other scripting languages have modules to talk to databases also.
I have lots of cron jobs to do all sorts of analysis automatically, and then email me back the results.
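
Here is the shape of that glue, sketched in Python with SQLite just to make it concrete (substitute Perl/DBI or whatever you prefer; every name, address, and table here is invented):

```python
# Glue for a once-a-day run (scheduled from cron): pull only the window the
# analysis needs from the local database, run the analysis, and mail the result.
# Every name, address, and query here is a placeholder.
import sqlite3
import smtplib
from email.message import EmailMessage

def my_analysis(rows):
    # Stand-in for the real (proprietary) analysis step.
    return f"{len(rows)} rows analysed"

def run_daily(db_path, start, end):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT ts, symbol, price FROM readings WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    ).fetchall()
    report = my_analysis(rows)

    msg = EmailMessage()
    msg["Subject"] = "Daily analysis"
    msg["From"] = "me@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(report)
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```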