Is there a way to calculate the top n% of a weighted distribution (if that’s the right term)? I’ve found a formula for calc’ing a weighted average, but that’s not what I need.
I have a set of values (number of times X was done, grouped by X) and need to calculate the top n% of that, by the count rather than grouping.
So I’ve got a total of 59 pieces of fruit. I need to know the list of fruits in the top 30% by the number of pieces.
59 * .3 = 17.7, so that’d be 18 pieces of fruit. That would be 15 Cherries and 3 Bananas.
Is there a way to calculate this rather than just going through the sorted list and adding until you get to the right number?
Thanks!
P.S., no it’s not a homework assignment, I’ve actually got a much larger dataset I’m trying to crunch for work, but I’m no statistician, just a computer nerd.
Since you’re talking about a spreadsheet, I create a column that shows the sum from the top to that row.
=SUM(A$1:A6)
With the $ symbol pinning the first element, you can copy it for every element in the column. Then it’s simple to calculate the percentage (or to do it all in one step).
Yeah, it’s not in a spreadsheet, that would make it too easy.
It’s a database, with multiple tables that have to be put back together after finding my top list.
I can get there calculating RBAR like a spreadsheet, I was just hoping for a better way since I’ll have to run this multiple times for different breakdowns and datasets.
If you’re going to end up ranking the fruits by highest frequency there’s no way you can avoid a sort or some kind of search to find the highest frequency fruits. What’s required is not a math formula but an algorithm and it’s perfectly described by the OP.
I can do that, I was just hoping that mathier people than me had a better idea.
But you did make me think of another way to search and it looks like that may do it. I needed to be looking for solutions to create set-based running sums.
Yeah, sorting them into the proper order is easy, it was how to keep from adding each row and then comparing that to the grand total row-by-row that was hard.
SUM OVER will do it - you can compare partitioned sums to the total sum.
I’ve gotta get better at remembering the aggregations available.
If the flavor of SQL you’re using supports analytic functions, you can do something like:
select fruit_name,
count(*) as fruit_count,
sum(fruit_count) over ( ) as total_count,
sum(fruit_count) over
(
order by fruit_count
rows between unbounded preceding and current row
) as rolling_count
from fruit_order_table
group by fruit_name
having rolling_count / total_count > 0.7;
I’m in MS-SQL. I think the Analysis Services might do that analytical stuff, but I haven’t gotten into it yet.
I ended up with a multi-part query to get the different pieces.
First one breaks out the sections, each with a list of items, count of those items, grand total for the section, and ranked in descending order by count. The second one calcs the percentage of grand total for each item. Third one calcs running totals on the item count and percentage. Final part selects all rows for each section with running total less than my goal.
Runs pretty well, and it will be fairly easy to change up the breakdowns as needed.