How to optimize a constant in an algorithm with multiple constants?

First off, I’m an interested amateur, obviously not a mathematician or statistician, so I apologize for making mistakes in terminology, and if you could please keep the replies as simplistic as possible, I’d appreciate it. Also, I find I’m having a hard time expressing the problem, so bear with me.

A few years ago, I went down the Sabermetric rabbit hole, to the point of creating my own run estimator based on Base Runs, only much, much more complicated.

The gist of the problem that I’m having is this-

Say that Singles = S, Doubles = D, Triples = T, Homeruns = HR and that -

RUNS = (S * X) + (D * Y) + (T * Z) + (HR * W)

Previously, in order to optimize the constants (X, Y, Z, W), I would tweak one of them, then use Root Mean Square Error to determine whether the tweak made the formula more or less accurate with regard to the given RUNS. Simple enough in one spreadsheet, but when you have to change the formula and record the results for multiple leagues across multiple years in multiple spreadsheets, you obviously run into a big efficiency problem.
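
For concreteness, that by-hand check boils down to the following (a rough sketch in R, a tool that comes up later in this thread; it assumes the stats live in a data frame called dataset, and the trial weights are made up):

# hypothetical trial weights for each event type
x <- 0.5; y <- 0.8; z <- 1.1; w <- 1.4

# predicted runs per row, then root mean square error against actual runs
predicted <- dataset$S * x + dataset$D * y + dataset$T * z + dataset$HR * w
sqrt(mean((dataset$RUNS - predicted)^2))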

So, how does one go about optimizing each of these constants for multiple data sets efficiently, without having to resort to a brute force method of, ‘make a change, record the result, repeat two dozen times,’ as I have previously done?

Put another way, is there a way to calculate the value of 'X' that's the most accurate across all the data sets combined, but not necessarily the most accurate for each individual data set, without physically combining the data into one spreadsheet?

Again, apologies for the lack of clarity. I’ll explain as I’m able.

I believe what you are seeking is called multivariable regression. This is a piece of cake for statistical packages such as R. There is an Excel add-in that will do the trick as well. Link

The output of the model is the set of coefficients that will minimize the difference between the actual and the predicted number of runs. As you requested, it is the overall best fit for your data.

Sweet. Unfortunately, I’m on a Mac, so I’ll have to look around for an alternative or do some other software finagling.

I was wondering if this was more of a math question or a process/software question anyway.

Thanks for the input, and any other insight you can give me into the practical side of performing a multivariable regression would be much appreciated.

Your particular optimization problem is quite simple, but others can be far more complicated. For many problems, you can end up with multiple local optima, and while it's easy to find one (or more) of those, it's nearly impossible to be certain that you've found the best of them.

Is it though? After a bit more consideration, it seems like a multiple regression approach would optimize each variable for the data set it was run upon, but I would need to perform the same analysis for each data set and then come up with a weighted average of the results. Does that sound correct?

Since we're into regression, would anyone care to explain the concept of performing a multivariable regression? Or a regression for variable X when multiple variables are in play?

And since this seems so dead easy to those in the know, exactly what sort of professional would I be able to speak to in order to get a hands on demonstration? Sounds like a statistician, but again, I’m out of my depth here.

Thanks all.

You can do it in two lines of code in R

dataset <- read.csv("baseball.csv")
model <- lm(RUNS ~ S + D + T + HR, data=dataset)

To get summary stats, use:

summary(model)

and

print(model)

I usually don't use lm that much - so double-check my syntax - but as a quick sanity test you can also use:

model <- lm(RUNS ~ ., data=dataset)

If your CSV file has no header row, add header=FALSE inside the read.csv() call; header=TRUE is the default, so with a normal header row you can leave it out.

R is great for this stuff, and you can easily try different types of machine learning models by tweaking a line or two of code.

To predict on a new dataset, you load it in from a CSV file the same way as the original (but call it, say, "newdataset" instead of "dataset") and simply do:

mypred <- predict(model, newdataset)

Then type

mypred

to see the predictions.

The book "R in a Nutshell" is pretty good.

And it has specific examples - including baseball examples (I'm guessing there are at least three in there on baseball - but don't hold me to that - I know it's at least one).

It's pretty fun to be able to do your own machine learning - and it was way easier than I expected.

One thing to be clear on - you don't put X, Y, Z and W in the formula; they aren't extra columns, they're the coefficients lm estimates for you. You can also just write the formula as

RUNS ~ .

The dot means "everything else".

I’m not sure what you mean by “multiple data sets”. If you have reason to believe that the values would be different for Little League and for professionals (for one thing, Little Leaguers hit a lot more triples than pros do), then you do the analyses separately, and get separate conclusions from them, and ne’er the twain shall meet. If, however, you want a single set of numbers for baseball as a whole, across all leagues, then you lump all of your data together into a single data set. But no matter what data set you’re working with, your problem is still going to be pretty simple.
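
Lumping separate files together is a one-liner in R, for what it's worth. A minimal sketch, assuming each league or season lives in its own CSV file with the same columns (the file names here are made up):

# read each file, stack the rows into one data frame, and fit once
al <- read.csv("al_batting.csv")
nl <- read.csv("nl_batting.csv")
combined <- rbind(al, nl)
model <- lm(RUNS ~ S + D + T + HR, data=combined)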

Or just a data set that’s sufficiently representative of what you’re trying to model.

Thanks for the insight, DataX. Frankly, the particulars went right over my head. I’m not even familiar with the concept of ‘Machine Learning’, nor coding for that matter, but I do find myself trying to generate algorithms that explain reality much more often than I have any business doing.

So is that what I need to look into? Machine learning? Would you recommend college courses or is it something a dedicated amateur could pick up on their own? Keep in mind I’m not trying to become an expert, just be able to defend myself and optimize the occasional algorithm…

Thanks, bump. Representative sampling, eh? It looks more and more like I need to have a pow-wow with a Stats guy at the local college, huh?

In theory you want to hold out a set of data and do stuff like cross-validation - this is why I don't like linear regression as much: you have to set up k-fold cross-validation yourself, and it can have some problems generalizing.

A method like random forest gets rid of all these problems, as you can use what are known as the "out of bag" samples to test each tree on data it hasn't seen.
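
A minimal sketch of that, assuming the randomForest package has been installed (it's an add-on, not part of base R):

# random forest regression; the printed summary reports the
# out-of-bag (OOB) error, so no separate hold-out set is needed
library(randomForest)
rf <- randomForest(RUNS ~ ., data=dataset)
print(rf)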

You can do what Chronos said - and lump them together - but you can also add another column for that - say "year", if that's the difference between your sets. There's gotta be something different. You can record these as a "factor" in R, but usually it isn't going to matter. I have tried to get this to matter - many, many, many times - but very rarely does it make any real discernible difference - even though it seems like it should.
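
For what it's worth, the factor trick looks like this (continuing the stacking sketch above, and assuming a year column was added to each file before combining):

# treat year as a category rather than a number, then refit
combined$year <- factor(combined$year)
model <- lm(RUNS ~ ., data=combined)  # the dot now picks up year as well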

Don’t make it more complicated than it needs to be. I could get that model up and working in like three minutes - get something working first - then add stuff.

That's where the fun begins (although I don't wanna ruin your enthusiasm - you're not gonna get very far down that path).

But it’s still fun to try.

I have no statistical training or math background.

It takes a special way of thinking, but it is possible; you have to be able to pick up on stuff to compensate for the lack of statistics - how to do cross-fold validation, for example.

As far as what you should read:

They have videos somewhere too - not sure exactly where - but I wouldn't do those first, as they make it more technical than it needs to be.

That book I mentioned has at least a chapter on machine learning. It is written for people like us. It just tries to teach you how to do it - no math needed.

If you promise not to reveal my secret identity, I'll send you the link to a video presentation I gave on R and Machine Learning at a place you've probably heard of :)

The example I gave was monstrously simple compared to the actual algorithm I’ve got, but it all basically boils down to needing to optimize about 6-12 variables (constants?) across a number of data sets.

I’ll look into that book when I have a chance…

I'm familiar with sabermetrics, and Nate Silver is my hero :)

I've got a couple of books on sabermetrics and find them good for inspiration.

I think you’ll enjoy it - it gives you power!

I never could program before R

Sweet, yeah, I like Nate, but I think I like him more for 538 than for his sabermetric contributions.

Most of his work seems to have been based on predicting player development and decline, whereas I'm more interested in the backwater of sabermetrics, namely ranking historical players accurately.

Are your dataset(s) laid out like this: Runs; Singles; Doubles; Triples; Homers; Yadda1; Yadda2; etc.?
If so, you’ve done all the hard work already. I’m happy to get you started with some R code that will work on your exact dataset if you want to email me a link to the data via Dropbox or GoogleDocs or whatever.

R is a free program, by the way.

No, the data isn't set up nearly so neatly. I'm in a spreadsheet, so the variables I'm trying to optimize are embedded within formulas that are performing calculations, referencing other cells, etc.

I could dig them all out into stand-alone cells fairly easily, but it would take some time.

Yeah, thanks to your and DataX’s recommendations, I’m definitely planning on getting into R for future models.

Speaking of which, right now the variables I want to optimize are defined as numbers within those formulas - let's say Singles = 0.75. Now, if I pull it out into its own cell, would the fact that it is concretely defined mess with R or a similar program's ability to adjust/optimize it?

IOW, does the fact that it is a concrete number interfere with the program's ability to see it as a variable, or is that something that's taken care of by the code/command I would enter?
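
For what it's worth, the hard-coded 0.75 never enters into it: you'd export just the raw counts (S, D, T, HR) and the actual RUNS for each row, and lm treats the constants as unknowns to be estimated from the data. A minimal sketch of reading them back out, assuming the model fit earlier in the thread:

# the optimized constants are the fitted model's coefficients
coef(model)  # named vector: (Intercept), S, D, T, HR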

Excel (part of MS Office) is available for Mac.

So is LibreOffice, which is free. However, I don’t know whether it has an equivalent feature. It’s generally pretty close to Excel in feature parity.

Yep, R is available for Mac. I've never used it, but my wife uses R and SAS (proprietary software similar to R in terms of functionality) in her line of work. It shouldn't be too difficult to set up.

In fact, it’s probably easier to set up R on a Mac than on a PC, owing to OSX’s Unix underpinnings.