True as far as it goes, but you do have reason to think some particular true percentages are more likely than others. If someone makes the one and only free throw he’s attempted so far, his true percentage is a lot more likely to be the league average than it is to be 100%.
What are you going on about?
Well, is there any reason not to think the “true” league average shooting percentage is close to the average shooting percentage from my sample of all players?
The sample size for an individual player might be too small to trust that their measured FT % reflects their true FT %, but for the league as a whole the sample is much larger.
I think the last two (on topic) posts might have missed what Quercus is saying. I’m pretty sure what he said agrees with what you guys are saying (and what I was saying). That is, one can use the full sample of players to learn about the underlying true distribution of percentages. This distribution, and a little integration, is all that is needed to rank players.
Quercus's first paragraph was not a statement of what one should do but rather a devil's-advocate setup for his second paragraph.
Since there seems to be some confusion, why don’t you just explain what these sentences mean:
“If you want to know who is likely to sink more free throws, you’ve just got to watch them try to sink free throws. No math formula is going to take its place.”
Are you suggesting that it is feasible to “watch them”? Are you suggesting that you watch all 1230 games each and every year and are capable of determining who, given any two players amongst the entire NBA rosterdom, is more likely to hit a FT than the other?
I am saying that if you want to rank players, you’ve got to have data on their performance. And that means having a robust number of data points. If you don’t have that, you cannot just meditate on the idea of professional basketball and come up with a math formula that will allow you to rank the players in the absence of a sufficient number of data points.
What are you saying? That you have some kind of basketball ESP that allows you to determine who the better player is without having data about their past performance?
Kimmy, no one is talking about not using datapoints. Extrapolating from data is exactly what I’m trying to do.
What you seem to be saying in this post, near as I can tell, is “Unless you have so much data on every player’s FT shooting that you can be confident his true FT % is virtually identical to his measured FT %, it’s impossible to say anything about how good a FT shooter he’s likely to be.” This is untrue, as Pasta’s excellent suggestion above demonstrates. Obviously more data is always better, but that doesn’t mean we can’t say anything without it.
Extreme example: Let’s say I’ve measured the FT % of every player in the league. Let’s say the league average is 70%, and the best player in the league is at 90%. Some new player joins the league. I have zero datapoints on the new player. I can still say with quite a bit of confidence that he’s probably worse than the guy with the 90% FT%, just based on what I know about the league-wide distribution of FT%.
But what you cannot do, and what you hope to do, is come up with an average for a player for whom you have a few, but not enough data points.
Let’s say the league average is 1/2 and you have a player for whom you have three data points, two of which were successes. Your best estimate of the proportion of free throws he will sink is 2/3. Naturally, this statistic will have larger confidence intervals at all confidence levels than those calculated for players on whom you have more data.
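Just to put numbers on how wide that interval is, here's a quick sketch (using scipy's beta distribution for an exact Clopper-Pearson interval; purely an illustration, not anyone's official method):

[code]
# Exact Clopper-Pearson 95% interval for 2 makes in 3 attempts (sketch only).
from scipy.stats import beta

made, attempts, alpha = 2, 3, 0.05
lower = beta.ppf(alpha / 2, made, attempts - made + 1)       # ~0.09
upper = beta.ppf(1 - alpha / 2, made + 1, attempts - made)   # ~0.99
print(f"point estimate {made/attempts:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
[/code]

The 95% interval runs from roughly 9% to roughly 99%, i.e., it tells you almost nothing.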
And this is where you must rest, because you cannot do better than this. But it seems you think that if only you could craft some model, perhaps some weighting of the league average and the observed proportion, you could happen upon the “real” proportion. There is no mathematical justification for this misguided faith.
If you have only three datapoints, guessing their true FT% is close to the league average would make a lot more sense than guessing their true FT % is close to their measured FT%.
If we watch a random NBA player sink three shots, and I say “I bet his FT% is 70%” (or whatever the league average is) and you say “I bet his FT% is 100%”, I’ll be closer to the truth more often than you. So clearly I can do better than just taking his measured FT% at face value.
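If it helps, here's a toy simulation of that bet (a pure sketch; the 70% league average and the spread of true percentages are made-up numbers, just to illustrate the idea):

[code]
# Toy simulation: draw "true" FT% from an invented league-wide spread,
# keep only the players who happen to go 3-for-3, then see which guess
# (league average vs. 100%) lands closer to the truth.
import numpy as np

rng = np.random.default_rng(0)
league_avg = 0.70
true_pct = np.clip(rng.normal(league_avg, 0.10, 1_000_000), 0.30, 0.95)
made_all_three = rng.random((3, true_pct.size)) < true_pct
observed = true_pct[made_all_three.all(axis=0)]      # players we saw go 3/3

avg_guess_wins = np.abs(observed - league_avg) < np.abs(observed - 1.0)
print(f"'league average' guess is closer {avg_guess_wins.mean():.0%} of the time")
[/code]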
Obviously, you can have more confidence that your guesses are good if you have more data, and no one is disputing this.
Everything drowns in the problem of the priors here, as with most statistics, so far as I can see. And it’s not like you’ll find One True Prior Probability Distribution to work from. As I see it, priors are choices we make about how to analyze a situation, not data we discover empirically.
Shouldn’t this come out to 0.5 P(obs | r[sub]1[/sub] > r[sub]2[/sub]) / P(obs)? What happened to the denominator? And, of course, P(obs) is the tricky term; we can expand it out to the sum/integral of P(obs | r[sub]1[/sub] = …, r[sub]2[/sub] = …) * P(r[sub]1[/sub] = …) * P(r[sub]2[/sub] = …), but we’re left worrying about the prior distributions of r[sub]1[/sub] and r[sub]2[/sub] (presumably identical).
You may not be able to find the “real” proportion, but you can combine information about the league averages with information about a player’s record to get a reasonable estimate. See T. Herzog’s “Introduction to Credibility Theory” for a very detailed treatment.
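The flavor of it, very roughly, is a Bühlmann-style weighted average of the player's record and the league average. A quick sketch (the league number and the k value below are invented for illustration, not taken from real data or from Herzog):

[code]
# Rough Buhlmann-style credibility estimate (sketch only; numbers invented).
def credibility_estimate(made, attempts, league_avg, k):
    """Blend a player's observed rate with the league average.
    k ~ (expected within-player variance) / (variance of true rates across
    players); a larger k means we lean harder on the league average."""
    z = attempts / (attempts + k)          # credibility weight in [0, 1)
    return z * (made / attempts) + (1 - z) * league_avg

# Player who is 2-for-3, league average 0.75, k = 30 (made up):
print(credibility_estimate(2, 3, 0.75, 30))   # ~0.742, barely moved off 0.75
[/code]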
Ah, yes, of course. Apologies for my haste. Yes, the missing denominator P(obs) can be calculated with an integral nearly identical to the one I included.
It does come down to determining a prior distribution, but I think the large collection of data at hand (all players) can reasonably be used to inform our prior.
In the limit of infinite players, we could use only those who shot the most often to construct the underlying distribution. The resulting distribution would still be good for ranking everyone, including those who have only shot once or twice.
In reality, assessing and propagating uncertainties on D(r) due to the (finite) data we use to inform it allows us to estimate uncertainties on the rankings. Or better, we can incorporate the assessed uncertainties on D(r) into our definition of the prior and integrate those out as part of the P(obs) calculation.
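Just to make that integral concrete, here is a toy version of the P(r[sub]1[/sub] > r[sub]2[/sub] | obs) calculation (the discretized prior below is a made-up Beta(25, 9), standing in for a D(r) built from the real sample, and the two players' lines are invented too):

[code]
# Sketch of P(r1 > r2 | obs) with a discretized prior D(r).
import numpy as np
from scipy.stats import beta, binom

r = np.linspace(0.005, 0.995, 199)
prior = beta.pdf(r, 25, 9)        # made-up stand-in for the empirical D(r)
prior /= prior.sum()              # discretized prior

def posterior(made, attempts):
    """P(r | obs) over the grid via Bayes with a binomial likelihood."""
    like = binom.pmf(made, attempts, r)
    post = like * prior
    return post / post.sum()      # the denominator here is P(obs)

# Player 1: 2-for-3.  Player 2: 150-for-200.
p1, p2 = posterior(2, 3), posterior(150, 200)
prob_1_better = np.sum(np.outer(p1, p2) * (r[:, None] > r[None, :]))
print(f"P(r1 > r2 | obs) ~ {prob_1_better:.2f}")
[/code]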
For the heck of it, here is the distribution of (FT made)/(FT attempted) for all player-years from 1990 to 2007 for which the player attempted at least 75 free throws:
A quick estimate of D(r).
ETA: I just slapped sqrt(N)-derived errors from the bin contents on the curve as a quick-n-dirty example of how one can also assess uncertainties on D(r) which can be integrated over when calculating P(obs|anything).
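In case anyone wants to cook up something similar, this is roughly the procedure (a sketch only; the CSV file and its column names are hypothetical):

[code]
# Build a D(r) histogram with sqrt(N) bin errors (sketch; 'ft_made' and
# 'ft_att' are hypothetical column names in a hypothetical file).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("player_year_ft.csv")
df = df[df["ft_att"] >= 75]                       # at least 75 attempts
pct = df["ft_made"] / df["ft_att"]

counts, edges = np.histogram(pct, bins=30, range=(0.3, 1.0))
centers = 0.5 * (edges[:-1] + edges[1:])
plt.errorbar(centers, counts, yerr=np.sqrt(counts), fmt="o")  # sqrt(N) errors
plt.xlabel("FT made / FT attempted (player-year)")
plt.ylabel("count")
plt.show()
[/code]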
You run into a big problem with a scheme like this. Any given confidence interval contains the true value of the quantity being estimated with probability 1 - a. Therefore, if you fit n confidence intervals independently, the probability that they all contain the true value is (1 - a)[sup]n[/sup]. For sufficiently large n, this is a small number, and it’s very likely that you’re dealing with bad numbers somewhere.
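For example, with 95% intervals (a = 0.05) fit independently for, say, 400 player-seasons, the chance that every interval covers its true value is only 0.95[sup]400[/sup] ≈ 1.2 × 10[sup]-9[/sup], so some of the intervals are essentially guaranteed to miss.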
This doesn’t have anything to do with the OP given the assumptions that are implicit in the thread, but while I’m playing with data… Here’s how the mean FT success rate varies with how often a player is at line, showing an expected trend (since attempts and success rate will both be positively correlated with how good the player is).
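Roughly the binning behind that, if anyone wants to replicate it (same hypothetical table and column names as the sketch above):

[code]
# Mean FT success rate grouped by number of attempts (sketch only).
import numpy as np
import pandas as pd

df = pd.read_csv("player_year_ft.csv")
df["pct"] = df["ft_made"] / df["ft_att"]
df["att_bin"] = pd.cut(df["ft_att"], bins=np.arange(0, 701, 50))
print(df.groupby("att_bin", observed=True)["pct"].agg(["mean", "count"]))
[/code]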