I need to be able to take a random set of data (like point scores), look at it over a certain period of time (for instance, a week), and determine by how much, on average, the data is increasing or decreasing compared to yesterday, two days ago, or a week ago.
The data is taken by, essentially, a daily poll and should probably be assumed to have a relatively high level of randomness to it. Assuming a linear growth/decrease rate would be fine, though a curve would be better.
Are there any particularly good algorithms for this? Being able to do most of it within a SQL query would also be nice, so we can automatically ignore cases which don’t match the kind of data we are looking for.
It’s been a while, but it sounds like you’re talking about some form of ANOVA. I wouldn’t count on being able to do any significant part of the work in SQL.
You need a linear or quadratic regression, which you can’t do in pure SQL. You could write a bunch of code to do it, but Excel handles this very nicely (see the TREND function). You can write queries that output your data in comma-delimited format, then import that into Excel. Then you set up a column with the X data (e.g., the day number when the data point was collected). To make it quadratic, set up a second column that is the square of X. Then, in another column, you use the two columns you just created plus the column with your data in it and set up the TREND function.
You can also get the parameters for the trend with an add-in. Go to Tools, Add-Ins and check the box for Analysis ToolPak. Then in your Tools menu you will see Data Analysis at the bottom. It will bring up a list; select Regression. You’ll then get a dialog box where you can enter the information regarding your data, and it will create a new worksheet showing you the detailed statistical parameters for the data.
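If you would rather script the regression than go through Excel, here is a minimal sketch in plain Python of the same idea: a least-squares fit of a line (degree 1) or a quadratic (degree 2) to the daily values. The function name fit_polynomial and the sample scores are made up for illustration, and this is only one way to do it, not a replacement for a proper statistics package.

```python
def fit_polynomial(xs, ys, degree=1):
    """Least-squares polynomial fit.

    Returns coefficients lowest power first: [c0, c1, ...] for
    y = c0 + c1*x + c2*x^2 + ...
    (hypothetical helper, written for this example only)
    """
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i)
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]

    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]

    # Back substitution
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - known) / a[i][i]
    return coeffs


# Made-up week of daily scores; x is the day number.
days = [1, 2, 3, 4, 5, 6, 7]
scores = [12, 15, 11, 18, 17, 21, 20]

intercept, slope = fit_polynomial(days, scores, degree=1)
print("average change per day (linear fit):", slope)

c0, c1, c2 = fit_polynomial(days, scores, degree=2)
print("quadratic fit coefficients:", c0, c1, c2)
```

The slope of the linear fit is the "how much per day, on average" number; the quadratic version adds the x-squared column described above.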
How will you decide what cases to ignore? You really shouldn’t ignore data when you do statistical analysis just because it doesn’t fit what you’re trying to show.
If you’re going to avoid ANOVA (and I can’t blame you), a time series model might be more appropriate than a regression model.
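"Time series model" covers a lot of ground. As a rough sketch of the simplest end of it, assuming simple exponential smoothing is acceptable for noisy daily poll data, the Python below smooths the series and then averages the change against N days back. The function names, the alpha value of 0.3, and the sample scores are all assumptions picked for illustration.

```python
def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing; alpha closer to 1 follows the raw data more closely."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def mean_change(values, lag=1):
    """Average of value[i] - value[i - lag], e.g. lag=1 is 'compared to yesterday'."""
    diffs = [values[i] - values[i - lag] for i in range(lag, len(values))]
    return sum(diffs) / len(diffs)


# Made-up daily scores covering eight days, so a 7-day lag has one pair to compare.
scores = [12, 15, 11, 18, 17, 21, 20, 24]
smooth = exponential_smoothing(scores, alpha=0.3)

for lag in (1, 2, 7):
    print("average change vs", lag, "day(s) ago:", mean_change(smooth, lag))
```

Smoothing first keeps a single odd day from dominating the averages; whether that is appropriate depends on how noisy the poll really is.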