Comparing Internet Reviews More Accurately with Mathematics

I think reading the 2- and 4-star reviews is a better indicator than the 1- and 5-star reviews.

Yeah, kinda like how the Olympic judges throw out the high and low scores.

That sounds a bit like the Bayesian Average approach. Or is it something different?
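
For anyone curious, the Bayesian average mentioned here is usually computed by blending a product’s own average with a prior average, weighted as if some fixed number of “phantom” prior votes existed. A minimal sketch, where the prior of 3.0 stars and the weight of 5 votes are purely illustrative assumptions, not any site’s actual parameters:

    def bayesian_average(ratings, prior_mean=3.0, prior_weight=5):
        # Blend the product's own ratings with prior_weight "prior" votes
        # at prior_mean. Both parameters are illustrative assumptions.
        n = len(ratings)
        if n == 0:
            return prior_mean
        return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)

    # Two glowing reviews get pulled toward the prior; a large batch of
    # reviews keeps roughly its raw average.
    print(bayesian_average([5, 5]))       # ~3.57
    print(bayesian_average([4] * 200))    # ~3.98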

Pretending away all the problems with the data, I think you’re assigning far too much importance to the number of reviews.

If you have two products that each have 50+ honest reviews, you can assume with very high confidence that the reviews accurately reflect true quality. There is no statistical sense in which, e.g., a product with 700 reviews has a meaningfully more accurate score than a product with 200. So if one is, say, 3.8 and the other is 4.2, they’re both an accurate reflection of reality to the same tolerance. Even at 50 reviews vs. 500 reviews, the scores are accurate to within roughly the same tolerance.
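
For a sense of scale: the uncertainty in an average rating shrinks with the square root of the review count, so going from 200 to 700 reviews only tightens it by a factor of about 1.9. A rough sketch, assuming star ratings have a standard deviation of about 1.2 (a plausible guess, not a measured value):

    import math

    def margin_of_error(n, sd=1.2, z=1.96):
        # Approximate 95% margin of error for the mean of n ratings.
        # sd=1.2 stars is an assumed spread, not a measured one.
        return z * sd / math.sqrt(n)

    for n in (50, 200, 500, 700):
        print(n, round(margin_of_error(n), 2))
    # 50  -> ~0.33 stars
    # 200 -> ~0.17 stars
    # 500 -> ~0.11 stars
    # 700 -> ~0.09 stars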

Where it gets dicey is when the number of reviews of either or both is very small. Like 10 or fewer. OTOH, if the reviews are that few, just read them all and assign your own metric of reliability to the individual reviews. It’s only when the reviews get above (WAG) 30 or so that it’s too hard to read them all and you want to switch to a statistical approach to evaluating them.

Here’s a wiki on the process of deciding how big a sample is big enough. The math is probably too deep, but the qualitative point is that it takes a surprisingly small number before the results converge closely with reality.
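
The back-of-the-envelope version of that calculation: pick a margin of error you can live with and solve n = (z·σ / E)². Using the same illustrative 1.2-star spread as above, pinning the average down to ±0.25 stars at 95% confidence takes on the order of 90 reviews:

    import math

    def required_sample_size(margin, sd=1.2, z=1.96):
        # Reviews needed so the 95% margin of error on the mean rating
        # is at most `margin`. sd=1.2 stars is an assumed spread.
        return math.ceil((z * sd / margin) ** 2)

    print(required_sample_size(0.5))    # ~23 reviews
    print(required_sample_size(0.25))   # ~89 reviews
    print(required_sample_size(0.1))    # ~554 reviews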

Great post, thanks a lot!

All in all, it looks like sample size and score distribution are more indicative than the raw number of reviews.


I like that approach, it makes sense to me, at least intuitively. To be clear, you chose a “phantom 3” on purpose because it is right in the middle of the grading system, right?

I almost always give 4 stars to products as nothing is ever perfect, but I never thought of applying the same logic to how I read reviews. I’ll keep this in mind, thanks.

Not to mention the inevitable generous helping of “I ordered the wrong size, and I had to tape the box back together to return it” reviews…

To be clear, the “add a phantom review” method is just my intuition, not based on any specific mathematical process. It has the qualitatively right features, but then, so do a number of variants on it. It might be better to add two or three phantom reviews, or to add a number of reviews to each product proportional to the square root of the number of existing reviews, or something like that.
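
A quick sketch of the phantom-review idea and one of the variants mentioned, purely as illustration of the intuition (the phantom value of 3 and the phantom counts are assumptions, not a derived formula):

    import math

    def phantom_adjusted(ratings, phantom_value=3.0, phantom_count=1):
        # Add a fixed number of phantom mid-scale reviews before averaging.
        n = len(ratings)
        return (sum(ratings) + phantom_value * phantom_count) / (n + phantom_count)

    def sqrt_phantom_adjusted(ratings, phantom_value=3.0):
        # Variant: the number of phantom reviews grows with the square root
        # of the number of real reviews.
        n = len(ratings)
        k = math.sqrt(n)
        return (sum(ratings) + phantom_value * k) / (n + k)

    # Two 5-star reviews get pulled down hard; two hundred reviews barely move.
    print(phantom_adjusted([5, 5]))         # ~4.33
    print(phantom_adjusted([4.5] * 200))    # ~4.49
    print(sqrt_phantom_adjusted([5, 5]))    # ~4.17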

Almost: Because it’s right in the middle of the expected distribution that I was assuming. In other words, if a product had no reviews at all, that’s my best guess as to what the rating would be. Which, as folks have already mentioned, probably isn’t a reasonable assumption.

The ultimate takeaway in all this, I think, is that while the numerical ratings might not be entirely meaningless, they’re so unreliable that drawing any kind of conclusion from them is terribly dicey, no matter how they’re manipulated mathematically. The only sensible means of extracting value from the reviews is to actually read them and form your own subjective impression of the degree to which the various claims can be trusted.

Part of the problem with star ratings is that they mean different things to different people. To one person, 5 stars means “This did what it was supposed to do; I have no complaints,” while to another, it means “This was amazing! It couldn’t possibly be better than it was.”

It’s interesting to point out the Steam (video game store) review system here: Introducing Steam Reviews (or see an example page for a recent game).

Instead of a 1 to 5 star system, reviewers are:

  • Limited to a yes/no vote about whether they’d recommend the game
  • Given a chance to explain why in an accompanying textbox
  • Only able to leave a review if they bought the game. Free copies are marked as such.
  • Subject to having their reviews voted on by other players for helpfulness

Then the reviews are processed in aggregate:

  • Summarized into a rating like “Very Positive” or “Mixed” based on the proportion of up/down votes (see the sketch after this list)
  • Separated into “all time” and “recent” (most recent 30 to 90 days)
  • Displayed on a distribution graph over time
  • Ranked by helpfulness votes
  • Filtered to exclude off-topic reviews by default, though these can be shown manually if desired (reviews are marked off-topic by a combination of user reports, algorithms, and manual curation)
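
As a rough illustration of how that “Summarized into a rating” bullet could work, here’s a sketch mapping the share of positive votes (plus a minimum review count) to a summary label. The labels echo Steam’s wording, but the thresholds are made-up assumptions, not Valve’s actual cutoffs:

    def summarize(positive, negative):
        # Map up/down vote counts to a Steam-style summary label.
        # All thresholds below are illustrative guesses, not Valve's rules.
        total = positive + negative
        if total < 10:
            return "Not enough reviews"
        ratio = positive / total
        if ratio >= 0.95 and total >= 500:
            return "Overwhelmingly Positive"
        if ratio >= 0.80:
            return "Very Positive"
        if ratio >= 0.70:
            return "Mostly Positive"
        if ratio >= 0.40:
            return "Mixed"
        return "Mostly Negative"

    print(summarize(4800, 200))    # Overwhelmingly Positive
    print(summarize(850, 150))     # Very Positive
    print(summarize(300, 300))     # Mixed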

Altogether, it creates an ecosystem that’s much more reliable than Amazon’s, and also than other game review sources (websites, magazines, Metacritic, etc.).

Users really only get an up / down vote, with an accompanying text explanation that doesn’t get factored into the game’s summary in and of itself. But the texts still get read by humans and voted on, and the best ones float to the top by default. Sometimes that’s because they’re very detailed, or point out a major issue, or compare it with other games, or are just funny.

Amazon also has helpfulness votes for their reviews, but their aggregation system doesn’t seem anywhere near as effective, and they also do a much worse job of filtering out fake and irrelevant reviews. I think Valve spends a lot of time curating the reviews while Amazon doesn’t really care, taking a quantity-over-quality approach.

Also, and I’ve seen this a few times, there are those who invert the ratings and rate 1 star as #1, not as 1 out of 5, i.e. they treat the star ratings as ordinal rankings instead of a rating scale. I’ve noticed this several times when reading gushing one-star reviews, and figured out (or maybe even Googled around to find) what was going on. It makes up only a small percentage of reviews, but it’s a thing to look out for.

I generally find it’s most helpful to read a sampling of the critical reviews and see if there’s a common thread, and whether it’s something I care about. For example, I don’t really care if the service is inattentive as long as the food is good, so if I see the food being praised in most of the reviews, but the service being slammed, I don’t count it against the establishment. Or I note that some reviewers simply have no idea what they’re talking about. But, yes, as mentioned previously, it’s important to actually read the reviews in some capacity to get a fuller picture. Ratings on their own only take you so far, and I wouldn’t automatically choose a 4.7 item or establishment over a 4.3 one without reading and taking other things (price, distance, etc.) into consideration. Also, I am wary of anything rated too highly, like 4.9+. But I understand that’s not the purpose of this exercise.

Definitely. I know someone who wrote a massive, and expensive, two-volume book, which is supposed to be sold and shipped with both volumes together. The idiots at Amazon often screw it up, and he is infuriated at getting bad reviews for Amazon’s screw-up. He says those reviews are supposedly against policy, but I’ve never checked.

I was involved with a conference. When I was program chair, we had 300 papers submitted, each reviewed by 5 people. Our custom review system had an option to show the average review scores for reviewers (who typically reviewed several papers per year over many years) to balance scores. There was wide variance in the averages.
Might be nice to show the average review score per reviewer on Amazon, which would detect the all-5s and all-1s people pretty well. It wouldn’t be perfect; some people no doubt only write a review when they’re upset, which would skew the numbers.
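
A sketch of what showing per-reviewer averages could look like, given (reviewer, score) pairs; the “extreme” cutoffs are arbitrary choices for illustration:

    from collections import defaultdict

    def reviewer_averages(reviews):
        # reviews: iterable of (reviewer_id, score) pairs.
        # Returns each reviewer's average score, which makes the all-5s
        # and all-1s reviewers easy to spot.
        totals = defaultdict(lambda: [0.0, 0])
        for reviewer, score in reviews:
            totals[reviewer][0] += score
            totals[reviewer][1] += 1
        return {r: s / n for r, (s, n) in totals.items()}

    sample = [("alice", 5), ("alice", 5), ("alice", 5),
              ("bob", 1), ("bob", 2), ("bob", 1),
              ("carol", 4), ("carol", 3), ("carol", 5)]
    for reviewer, avg in reviewer_averages(sample).items():
        flag = " <- extreme" if avg >= 4.8 or avg <= 1.5 else ""
        print(reviewer, round(avg, 2), flag)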

If Amazon was serious about useful reviews, which they self-evidently are not, there would be three separate questions:

  1. Rate the product.
  2. Rate the seller.
  3. Rate the shipping.

Of course that still leaves the problem of people who don’t read directions, or can’t separate their anger at the shipper from venting their spleen about the product.


IMO it’s a real toss-up these days between:

  1. People! It’s why we can’t have nice things.
  2. Monster Corporations! It’s why we can’t have nice things.

They sort of do; it’s just that people all too often don’t answer the correct question.

Ah, I see. Thanks!

Or a deal breaker like:

be warned, This BT-speaker cannot be Used while being charged