Does anyone know of any research (findable on the net) that compares a person’s likes in one category to see how well they predict what he will like in an unrelated field? For instance, if I know what movies a person likes, and I’ve got a database of 5000 people and what movies each of them likes, and I’ve also got a database of what music those same 5000 people like, how likely is it that I’ll be able to match our test subject to some similar people in the movie database, and come up with music he would like by extrapolation?
Music and movies probably aren’t too far removed from one another, so that might succeed, but what if you extrapolated to cars or recipes?
I’m not aware of any. Amazon probably has the best database for this sort of thing, but I doubt they like to publish their findings. The technique you’re looking for is called collaborative filtering, if you’d like to Google it.
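To make the idea concrete, here is a minimal sketch of user-based collaborative filtering across two categories, along the lines the question describes: find the people whose movie likes overlap most with the test subject's, then let them vote on music. The names (`jaccard`, `recommend_music`), the Jaccard overlap measure, and the toy data are my own assumptions for illustration, not anything from a published system.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_music(target_movies, movie_likes, music_likes, k=3, top_n=3):
    """Suggest music using the k people whose movie tastes are closest.

    movie_likes and music_likes map a person id to a set of liked titles.
    """
    # Rank everyone by how much their movie likes overlap with the target's.
    neighbours = sorted(
        ((jaccard(target_movies, movies), person)
         for person, movies in movie_likes.items()),
        reverse=True,
    )[:k]

    # Each neighbour votes for their music, weighted by movie similarity.
    scores = defaultdict(float)
    for similarity, person in neighbours:
        for title in music_likes.get(person, set()):
            scores[title] += similarity

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, score in ranked[:top_n] if score > 0]


# Made-up toy data standing in for the 5000-person databases.
movie_likes = {
    "ann": {"Alien", "Blade Runner", "The Matrix"},
    "bob": {"Alien", "The Matrix", "Terminator"},
    "cat": {"Amelie", "Notting Hill"},
}
music_likes = {
    "ann": {"Kraftwerk", "Daft Punk"},
    "bob": {"Daft Punk", "Metallica"},
    "cat": {"Norah Jones"},
}

print(recommend_music({"Alien", "The Matrix"}, movie_likes, music_likes, k=2))
# -> ['Daft Punk', 'Kraftwerk', 'Metallica']
```

Whether the suggestions are any good depends entirely on whether movie taste actually carries information about music taste in your data; the code will happily produce a list either way.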
The discipline of statistics in general is the science of picking out patterns from randomness. Several areas of statistics deal with finding patterns in ‘likes’. In addition to collaborative filtering, you might google:
[ul]
[li]Cluster analysis (see the Wikipedia article on cluster analysis in marketing)[/li]
[li]Principal component analysis[/li]
[li]Factor analysis[/li]
[li]Data mining[/li]
[/ul]
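As a rough illustration of the first item, here is a bare-bones k-means clustering sketch in plain Python that groups people by their like/dislike vectors. The `kmeans` function, the choice of the first k rows as starting centroids, and the four-column toy data are all assumptions of mine; a real analysis would more likely use a statistics package and far more data.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as the starting centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Rows are people; columns are made-up likes (1 = likes it):
# [sci-fi, action, romance, comedy]
likes = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

print(kmeans(likes, k=2))
# -> [0, 1, 0, 1]
```

Here the first and third person land in one cluster and the second and fourth in the other. With real data the groups are rarely this clean, and the result can change with the starting centroids.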
In general, clustering and these other methods are ‘soft’, meaning that it’s difficult to derive exact, reliable rules from likes. In other words, given a sufficient sample size, there are clustering methods that will successfully find clusters for various genres in music or movies, but they can’t hope to have respectable precision across all genres. I think what you’re interested in, though, is using likes in one category (e.g., music) to predict likes in another (e.g., movies). Understand that extending soft clusters to predict other soft clusters is likely to be even softer, although the algorithms won’t complain when you apply them. How successful your predictions will be depends on the methodological approach, data quality, and sample size.
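Since the algorithms won’t tell you when the predictions are worthless, one way to check is to hide some people’s known likes, predict them, and compare against a naive baseline. The sketch below assumes a precision-at-k score and a "most popular titles" baseline; the function names, the toy data, and the `cross_domain_guess` list (standing in for the output of some movie-based recommender) are hypothetical, and the numbers it prints say nothing about real-world performance.

```python
from collections import Counter


def precision_at_k(recommended, actually_liked, k):
    """Fraction of the top-k recommendations the person really likes."""
    top = recommended[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in actually_liked) / len(top)


def popularity_baseline(music_likes, exclude_person, k):
    """Recommend the k most-liked titles among everyone else."""
    counts = Counter(
        title
        for person, titles in music_likes.items()
        if person != exclude_person
        for title in titles
    )
    # Most-liked first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


# Made-up data: pretend dan's music likes are hidden and we try to guess them.
music_likes = {
    "ann": {"Kraftwerk", "Daft Punk"},
    "bob": {"Daft Punk", "Metallica"},
    "cat": {"Norah Jones", "Daft Punk"},
    "dan": {"Kraftwerk", "Metallica"},
}
# Hypothetical output of a movie-based recommender for dan.
cross_domain_guess = ["Metallica", "Kraftwerk"]

baseline = popularity_baseline(music_likes, "dan", k=2)
print(precision_at_k(cross_domain_guess, music_likes["dan"], k=2))  # -> 1.0
print(precision_at_k(baseline, music_likes["dan"], k=2))            # -> 0.5
```

If the cross-category predictions don’t beat the popularity baseline over many held-out people, the movie data isn’t telling you much about music.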