Approximating individual surveys from aggregate results

I’m working on a project where we collect employee surveys and report on them. Depending on role and location, the surveys may have up to 15 responses. Each response must be from one to five.

Responses are then grouped. For example:

Group 1: q1, q4, q9
Group 2: q5, q6, q7
Group 3: q6 and q8
Group 4: q2 and q3
Group 5: q10 and q11

(Yes, a question can be in more than one group, as q4 is in this example.)

Straightforward so far. However, it’s an inherited project and the surveys from before a certain date were not kept, nor was a count of the responses. I’d like to approximate the old surveys in order to determine a minimum number of responses for those old weeks. Usually there’s only 8-15 responses in any given week, so it shouldn’t require all that much crunching.

I’ve set up a spreadsheet, naturally, and tried running the Excel (XL07) Solver add-in over the data but no luck-- no surprise there, I’ve never got Solver to do much for me, although it probably works fine on simpler problems like textbook break-evens. I’m not even coming up with good search terms to look up. Any ideas?

Sample responses:

Group 1: 3.4
Group 2: 3.2
Group 3: 3.15
Group 4: 3.5
Group 5: 3.8

My gut feeling is that what you’re trying to do is impossible, but I’m still not 100% clear on what exactly it is. What exactly is involved in grouping the responses?

Yeah, sorry, but it isn’t terribly clear what you are trying to do. Back out the average response for each question?

Mainly to determine an estimated number of responses for each group in a reporting week. Secondly, I’d like to load dummy responses into the database that would calculate out to the same result as the aggregate data.

Formula for grouping the responses for a team would be (sum(questions in group))/(number of surveys*number of questions in group), three being the number of questions in Group 1.

So, say there were only two employees responding that week, employees A and B. Looking only at Group 1 responses, A responds:

q1=3, q4=5, q9=4 (on a 1 to 5 scale, of course, for a total of 12)

B is a little less sanguine and responds:

q1=2, q4=2, q9=1 (totalling 5)

(5+12)/(2*3)

17/6

or 2.83.

So it sounds like you have the averages from previous weeks, but not the actual data. What I am not clear on, is what you want to calculate from the previous weeks. I think an example as to what you have is input and what you want as output would be helpful. Also information as to how you are going to use the output might assist us in determining if there is an alternative way to get what you want.

Let k be the number of questions in a group, let n be the number of surveys, and let m be the average. If you round the product kmn to an integer, you’re just looking for an integer solution to the equation x[sub]1[/sub] + 2x[sub]2[/sub] + 3x[sub]3[/sub] + 4x[sub]4[/sub] + 5x[sub]5[/sub] = kmn, with the constraints that each x[sub]i[/sub] is positive. There are algorithms that can solve this (Google on solving Diophantine equations), but since you know that each x[sub]i[/sub] has to be between 0 and kmn, you can just brute force it. You will need to do some kind of programming for this, though; Excel’s solver will not be good enough.

Do you always get two digits of accuracy, or could that 2.833333 sometimes be reported as 2.8?

9 digits in the aggregate data, sample below:
3.476190476
3.333333333
3.428571429
3.285714286
3.571428571

That’s more than enough.

Yes, I have the averages. The output I’m hoping for is 1) an estimate of the number of responses and 2) dummy responses I can add to the database that will get the same average (or to one decimal place-- that would be good enough).

I’ll go ahead with a more detailed example of the input and output. Let’s say I’ve got the results:

Group 1: 2.433333333
Group 2: 2.6
Group 3: 2.9
Group 4: 2.65
Group 5: 3.466666667

I know that the employees answered 13 questions, grouped as below.

Group 1: q1, q2, q3
Group 2: q4, q5, q6
Group 3: q7, q8
Group 4: q9, q10
Group 5: q11, q12, q13

What I’d like for output is:


	q1	q2	q3	q4	q5	q6	q7	q8	q9	q10	q11	q12	q13
r1	5	1	3	1	2	3	5	5	3	4	4	3	3
r2	5	3	4	2	5	4	2	5	3	2	4	2	1
r3	2	4	5	2	3	5	4	1	4	1	2	1	5
r4	1	2	1	5	2	1	1	5	2	2	1	3	1
r5	1	4	4	3	4	1	1	2	5	3	5	5	1
r6	4	3	4	2	4	4	4	2	5	5	5	5	1
r7	2	1	3	5	4	3	5	3	1	3	4	2	3
r8	1	2	3	3	1	4	5	2	1	5	1	1	5
r9	4	5	2	5	4	1	4	5	3	3	5	5	4
r10	4	5	4	2	2	5	5	5	4	5	2	2	1

You could rearrange those in any way, switching respondent 1 and 10, say, or swapping r3 and r4’s answers for questions 1 and 2, no matter, so long as the averages come out the same. I’m not trying to determine the individual results, I’d like to fake up some data that will work out the same as the legacy averages, and thereby, hopefully, determine a minimum number of responses that fit the data.

Thanks all for replying. I had a look at Diophantine equations in Wikipedia but didn’t leap up and cry “Eureka”. Looks like I’ll back burner this aspect of the project and make do with the aggregate data. Nobody even asked me to do this. I just got curious if it could be done.

Realized too late that I posted bad survey results, should’ve been something like:


	q1	q2	q3	q4	q5	q6	q7	q8	q9	q10	q11	q12	q13
r1	3	1	1	4	3	1	3	5	5	2	5	5	3
r2	1	2	1	2	1	1	2	3	3	5	5	1	4
r3	2	1	3	3	2	4	1	1	3	1	5	1	1
r4	1	3	2	3	4	1	2	2	2	3	5	4	2
r5	2	5	3	1	1	5	4	4	4	1	1	5	5
r6	5	2	2	2	1	2	3	4	4	4	5	5	5
r7	4	1	1	1	1	1	4	1	1	2	4	3	1
r8	2	3	2	3	4	3	3	3	2	1	3	4	5
r9	5	3	5	1	5	5	3	3	1	3	4	4	1
r10	1	4	2	5	3	5	3	4	5	1	2	1	5

OK I understand the question, and could see that given the integer nature of the responses, and overlaps that it could be possible to solve, I realize that I haven’t had nearly enough number theory to necessarily solve it.:slight_smile:

The only help I can give is that its clear that if you have enough decimal places and look at the fractional component of your answer it would be possible to work out for each group a factor of the denominator that was used for that group. Fo that if there are N respondents and Q questions in a group then the AverageNQ must be an integer.

In group 1 we have 2.4333 which only equals and integer when multiplied by a multiple of 30. So N*3 is a multiple of 30, so N is a multiple of 10.

question 2: 2.6 will be an integer when multipled by 5, so 2*N must be a multiple of 5, so N must be a multiple of 5.

If you do this for all of your questions and then take the least common multiple, that should give you a good guess as to the number of people questioned.