I have collected some data - feedback from various people, who fall into three distinct categories. The first two groups each have roughly a dozen members, but the third group is three times as big. When I calculate averages, the opinions of the third group tend to dominate - how should I manipulate things to avoid this?
One thought I’ve had is to average the averages of each group - would this work?
I’m working in Excel if that makes any difference…
Calculate the three separate averages, the average of the averages and the global average.
Your subjects should have been chosen to represent the global population (i.e., “I asked more people from group #3 because I had more at hand” is bad design of experiments); having those five averages lets you compare the results from each distinct group among themselves and with those of the full group.
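For anyone who wants to see those five numbers side by side, here is a minimal sketch in Python (the same arithmetic works with AVERAGE in Excel). The ratings are invented purely for illustration, and the group names are placeholders, not the OP's data.

```python
from statistics import mean

# Hypothetical 1-5 ratings for a single option; values are made up for illustration.
ratings = {
    "group_1": [5, 4] * 6,        # 12 people
    "group_2": [4, 3, 4] * 4,     # 12 people
    "group_3": [2, 1, 2] * 12,    # 36 people, three times the size of the others
}

# Three separate averages, one per group.
group_means = {name: mean(values) for name, values in ratings.items()}

# Average of the averages: every group counts equally, whatever its size.
mean_of_means = mean(list(group_means.values()))

# Global average: every person counts equally, so the big group carries more weight.
all_ratings = [r for values in ratings.values() for r in values]
global_mean = mean(all_ratings)

for name, value in group_means.items():
    print(f"{name}: {value:.2f}")
print(f"average of averages: {mean_of_means:.2f}")
print(f"global average:      {global_mean:.2f}")
```

With this made-up data the average of averages sits well above the global average, because the low-scoring third group gets one vote in three instead of 36 votes in 60. Neither number is "the right one"; they answer different questions.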
I’m trying to set priorities for ongoing projects and asking various stakeholders where they think our priorities should be (by rating the various options from 1 to 5).
The groups are different sizes because that’s how big they are - I’m asking everyone’s opinion and there happen to be more people in category three.
That doesn’t make any sense. By using an average you are offsetting the difference in group sizes. The average of an average is a great way to distort results and come up with incorrect conclusions. Obviously the 3 groups have different priorities but the number in the group isn’t influencing this.
Exactly. It isn’t the fact that one group is larger or smaller; it is the values they have given.
I most commonly come across this error with people using monthly averages and then averaging them to get an annual average. Or doing satisfaction surveys among many groups and then averaging the average results. It is fun explaining to people how their segmentation allows the majority to be unhappy but averaging the average makes it look as though everyone is happy.
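A quick sketch of the satisfaction-survey case, with invented segment sizes and scores (none of these numbers come from a real survey), shows how the segmentation can hide an unhappy majority:

```python
from statistics import mean

# Hypothetical satisfaction scores (1 = unhappy, 5 = happy), made up for illustration.
segments = {
    "segment_a": [5] * 5,     # 5 very happy people
    "segment_b": [5] * 5,     # 5 very happy people
    "segment_c": [2] * 90,    # 90 unhappy people
}

segment_means = [mean(scores) for scores in segments.values()]
mean_of_segment_means = mean(segment_means)

all_scores = [s for scores in segments.values() for s in scores]
overall_mean = mean(all_scores)

# Share of respondents who scored 2 or lower.
unhappy_share = sum(1 for s in all_scores if s <= 2) / len(all_scores)

print(f"average of segment averages: {mean_of_segment_means:.2f}")
print(f"overall average:             {overall_mean:.2f}")
print(f"share scoring 2 or lower:    {unhappy_share:.0%}")
```

The average of the segment averages comes out at 4 out of 5 even though nine in ten respondents gave a 2.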
My favourite “how to lie with stats” example is the old Pepsi ads with the blind tasting. The reveal would show Coke drinkers picking Pepsi as their preferred drink. What was clever was the accompanying voiceover: “In recent blind taste tests the majority of Coke drinkers preferred Pepsi to Coke.”
It sounds like they are saying in blind taste tests Coke drinkers prefer Pepsi to Coke. What they are really saying is that in more than one blind test the majority of Coke drinkers preferred Pepsi to Coke. So if I ran 1,000 tests, each with a test group of 3, and in 2 of those groups it finished 2-1 or 3-0 to Pepsi I can in all honesty make the claim in the ad. This despite the fact that the result could be 2,996 to 4 in favour of Coke.
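The arithmetic behind that 2,996 to 4 figure can be checked in a few lines. This assumes the two Pepsi-majority groups each finish 2-1 and every other group goes 3-0 to Coke, which is the extreme case described above:

```python
tests = 1000                 # number of blind tests
group_size = 3               # tasters per test
pepsi_majority_groups = 2    # tests where Pepsi wins 2-1 (assumed split)

# Pepsi only gets votes in the two groups it wins, 2 votes each.
pepsi_votes = pepsi_majority_groups * 2

# Coke gets 1 vote in each of those groups and all 3 votes everywhere else.
coke_votes = pepsi_majority_groups * 1 + (tests - pepsi_majority_groups) * group_size

print(f"Coke {coke_votes} - Pepsi {pepsi_votes}")
```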
I was also going to suggest that - multiply each group result by a weight factor and divide the sum of these products by the sum of the factors. This leads to the problem, however, that you need to find adequate weight factors for every group result. How do you choose those? Based on group size? That gives exactly the same result as simply computing the overall average without regard to the groups.
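To make the weighting point concrete, here is a small sketch with invented ratings. The helper `weighted_mean` is just an illustrative name, not something from a library. Weighting the group averages by group size reproduces the overall average, while equal weights reproduce the average of averages:

```python
import math
from statistics import mean

# Hypothetical ratings for three groups; values are made up for illustration.
groups = [
    [5, 4, 5, 4],                 # 4 people
    [3, 4, 3, 4],                 # 4 people
    [2, 1] * 6,                   # 12 people
]

def weighted_mean(values, weights):
    """Sum of value*weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

group_means = [mean(g) for g in groups]
sizes = [len(g) for g in groups]

size_weighted = weighted_mean(group_means, sizes)                 # weights = group sizes
equal_weighted = weighted_mean(group_means, [1] * len(groups))    # every group counts once
overall = mean([r for g in groups for r in g])                    # plain average of everyone

print(math.isclose(size_weighted, overall))    # size weights give the overall average
print(math.isclose(equal_weighted, mean(group_means)))
```

So the only real decision is whether each person or each group should get one vote; any other choice of weights needs its own justification.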
Later on, the OP stated the entire population was sampled (everyone in every group provided an answer). A sample average is an estimate of the population mean (for a normal distribution, the value at its peak) - but when the entire population is sampled you don’t have to estimate anything or assume a distribution. The problem has moved out of statistics and into data analysis.
If I were doing this analysis, I would plot the data for each group and determine the mode. One could calculate the mode, but plotting it is informative. That is what the OP is trying to determine, isn’t it? What is the point of an average in this case anyway? The mode is the most common value in the sample, which in this case is the entire population. So I believe that is what is desired here.
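As a rough sketch of that idea, the snippet below finds the mode(s) for each group and prints a crude text histogram in place of a real chart. The ratings are invented, `text_histogram` is a hypothetical helper written for this example, and `statistics.multimode` needs Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Hypothetical 1-5 ratings per group; values are made up for illustration.
ratings = {
    "group_1": [5, 4, 5, 4, 5, 3],
    "group_2": [4, 4, 3, 4, 2, 3],
    "group_3": [2, 1, 2, 5, 2, 1, 5, 2, 1],
}

def text_histogram(values, scale=range(1, 6)):
    """Return one line per possible score, with one '#' per response."""
    counts = Counter(values)
    return [f"{score}: {'#' * counts[score]}" for score in scale]

# multimode returns a list, so ties between scores are not silently dropped.
modes = {name: multimode(values) for name, values in ratings.items()}

for name, values in ratings.items():
    print(f"{name} mode(s): {modes[name]}")
    print("\n".join(text_histogram(values)))
```

In this made-up data, group_3 has a small cluster of 5s alongside its 1s and 2s, which a single average would hide but the histogram shows at a glance.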
It depends very much on where the different samples/groups come from, and what you’re using the info for.
As a concrete example: if you were trying to determine average salary for an electrical engineer, and your three groups were three companies, then the third group (the very large one) will dominate – if that company tends to pay very high or very low, then your average is distorted if you don’t separate the groups. In this type of situation, the “average of the averages” is an “average of what companies are paying.” That’s probably what you want.
On the other hand, if the three groups are totally random samples from totally different studies that you want to combine, then you want to just use the average over all the numbers, and the larger group will dominate but so what?
And, as noted, do you really want the mean, or do you want the mode, or even the median?
Thanks for the advice, everyone. I’m thinking that I will need to present the three data sets separately and note the differences and similarities between them.