PDA

View Full Version : Math geeks - need help

redtail23
01-26-2012, 01:21 PM
Is there a way to calculate the top n% of a weighted distribution (if that's the right term)? I've found a formula for calc'ing a weighted average, but that's not what I need.

I have a set of values (number of times X was done, grouped by X) and need to calculate the top n% of that, by the count rather than grouping.

Cherries 15
Bananas 10
Cantelope 7
Oranges 6
Apples 5
Kiwifruit 3
Persimmons 3
Grapes 2
Mangos 1
Papayas 1
Pineapples 1
Plums 1
Apricots 1
Blueberries 1
Raspberries 1
Blackberries 1

So I've got a total of 59 pieces of fruit. I need to know the list of fruits in the top 30% by the number of pieces.

59 * .3 = 17.7, so that'd be 18 pieces of fruit. That would be 15 Cherries and 3 Bananas.

Is there a way to calculate this rather than just going through the sorted list and adding until you get to the right number?

Thanks!

P.S., no it's not a homework assignment, I've actually got a much larger dataset I'm trying to crunch for work, but I'm no statistician, just a computer nerd.

standingwave
01-26-2012, 01:37 PM
Since you're talking about a spreadsheet, I create a column that shows the sum from the top to that row.

=SUM(A\$1:A6)

With the \$ symbol pinning the first element, you can copy it for every element in the column. Then it's simple to calculate the percentage (or to do it all in one step).

Xema
01-26-2012, 01:52 PM
Is there a way to calculate this rather than just going through the sorted list and adding until you get to the right number?
It will be challenging to come up with a calculation method that avoids summing the counts.

redtail23
01-26-2012, 01:53 PM
Yeah, it's not in a spreadsheet, that would make it too easy. ;)

It's a database, with multiple tables that have to be put back together after finding my top list.

I can get there calculating RBAR like a spreadsheet, I was just hoping for a better way since I'll have to run this multiple times for different breakdowns and datasets.

Thanks, though!

nivlac
01-26-2012, 01:54 PM
If you're going to end up ranking the fruits by highest frequency there's no way you can avoid a sort or some kind of search to find the highest frequency fruits. What's required is not a math formula but an algorithm and it's perfectly described by the OP.

redtail23
01-26-2012, 01:58 PM
It will be challenging to come up with a calculation method that avoids summing the counts.That's what I was afraid of.

I can do that, I was just hoping that mathier people than me had a better idea. :p

But you did make me think of another way to search and it looks like that may do it. I needed to be looking for solutions to create set-based running sums.

Thanks!

redtail23
01-26-2012, 02:01 PM
Yeah, sorting them into the proper order is easy, it was how to keep from adding each row and then comparing that to the grand total row-by-row that was hard.

SUM OVER will do it - you can compare partitioned sums to the total sum.

I've gotta get better at remembering the aggregations available.

Thanks for jogging the old brain into a new slot!

01-26-2012, 03:29 PM
If the flavor of SQL you're using supports analytic functions, you can do something like:

select fruit_name,
count(*) as fruit_count,
sum(fruit_count) over ( ) as total_count,
sum(fruit_count) over
(
order by fruit_count
rows between unbounded preceding and current row
) as rolling_count
from fruit_order_table
group by fruit_name
having rolling_count / total_count > 0.7;

to get the top 30%.

redtail23
02-01-2012, 04:57 PM
Just to close out, I did finally get it.

I'm in MS-SQL. I think the Analysis Services might do that analytical stuff, but I haven't gotten into it yet.

I ended up with a multi-part query to get the different pieces.

First one breaks out the sections, each with a list of items, count of those items, grand total for the section, and ranked in descending order by count. The second one calcs the percentage of grand total for each item. Third one calcs running totals on the item count and percentage. Final part selects all rows for each section with running total less than my goal.

Runs pretty well, and it will be fairly easy to change up the breakdowns as needed.

Thanks for all the suggestions!