Can you describe my (new, original) sorting algorithm mathematically?

Looks like he’s picking the middle value, by post #3.

Agreed, which is why I was comfortable labeling its average case as the worst case. It is worth pointing out that merge sort’s best case is also pretty much the same as its average and worst case, whereas snake sort can easily improve upon its average case depending on the starting order.

No, I’m comparing it to the actual mergesort algorithm. You know, the one you find if you google it or look it up in a book. The code is as follows:


' Omit pvarMirror, plngLeft & plngRight; they are used internally during recursion
Public Sub MergeSort1(ByRef pvarArray As Variant, Optional pvarMirror As Variant, Optional ByVal plngLeft As Long, Optional ByVal plngRight As Long)
    Dim lngMid As Long
    Dim L As Long
    Dim R As Long
    Dim O As Long
    Dim varSwap As Variant

    If plngRight = 0 Then ' first (non-recursive) call: set the bounds and allocate the mirror array
        plngLeft = LBound(pvarArray)
        plngRight = UBound(pvarArray)
        ReDim pvarMirror(plngLeft To plngRight)
    End If
    lngMid = plngRight - plngLeft ' temporarily holds the section length minus 1
    Select Case lngMid
        Case 0 ' one element: already sorted
        Case 1 ' two elements: swap them if out of order
            If pvarArray(plngLeft) > pvarArray(plngRight) Then
                varSwap = pvarArray(plngLeft)
                pvarArray(plngLeft) = pvarArray(plngRight)
                pvarArray(plngRight) = varSwap
            End If
        Case Else
            lngMid = lngMid \ 2 + plngLeft ' midpoint of this section
            MergeSort1 pvarArray, pvarMirror, plngLeft, lngMid
            MergeSort1 pvarArray, pvarMirror, lngMid + 1, plngRight
            ' Merge the resulting halves
            L = plngLeft ' start of first (left) half
            R = lngMid + 1 ' start of second (right) half
            O = plngLeft ' start of output (mirror array)
            Do
                If pvarArray(R) < pvarArray(L) Then
                    pvarMirror(O) = pvarArray(R)
                    R = R + 1
                    If R > plngRight Then ' right half exhausted; copy the rest of the left half
                        For L = L To lngMid
                            O = O + 1
                            pvarMirror(O) = pvarArray(L)
                        Next
                        Exit Do
                    End If
                Else
                    pvarMirror(O) = pvarArray(L)
                    L = L + 1
                    If L > lngMid Then ' left half exhausted; copy the rest of the right half
                        For R = R To plngRight
                            O = O + 1
                            pvarMirror(O) = pvarArray(R)
                        Next
                        Exit Do
                    End If
                End If
                O = O + 1
            Loop
            For O = plngLeft To plngRight
                pvarArray(O) = pvarMirror(O)
            Next
    End Select
End Sub
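
For what it’s worth, a call looks like this; the three optional parameters take care of themselves on the first call:

    Dim varData As Variant
    varData = Array(5, 3, 8, 1, 9, 2)
    MergeSort1 varData
    ' varData is now 1, 2, 3, 5, 8, 9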

That’s not really a great approach. Ellis Dee, try picking five values at random and using their median as the pivot value. That makes it extremely unlikely that you’ll get the quadratic behavior that your current quicksort implementation shows in the worst case. If you have a subarray of less than five elements, either pick one element at random as the pivot, or just bubblesort it.

Picking 5 would require sorting those 5 to identify the median. Mathematically speaking, would 3 random values work just as well, or at least almost as well? It’s much easier / more efficient to identify the middle of 3 than the middle of 5.

As a general note, I’m reluctant to optimize any of the algorithms too much, because I want to compare the actual algorithms. And I don’t see the camel-hump worst case of quicksort as a big deal, since it’s extremely unlikely to arise unless somebody is purposely mounting an attack.

Also of importance to me is that using a median of random elements approach would make it impossible to reliably display the worst case for quicksort in my benchmark. No matter what you do, quicksort has a worst case of O(n^2). Making it so that I can’t reliably show it to the user kind of defeats the purpose.

It’s just a constant number of operations to get the median of five values; as your list size gets larger, this small constant overhead becomes negligible. Very quickly.

Also, you seem to maybe have the notion that there is one exact version of quicksort, one exact version of mergesort, etc. But that’s not really how it is. Little deviations here and there are still accepted as quicksort or as mergesort or such things; thus, maybe quicksort is in-place, maybe it isn’t, maybe it’s a stable sort, maybe it isn’t, depending on the details. There isn’t exactly one golden reference piece of code; there’s just a general idea.

If you select a pivot, decompose the original list around the pivot, recursively sort those two pieces, and then just snap it all back together, then you’re using a quicksort of some fashion or another, even though there’s a certain amount of leeway in the details. If you decompose the original list arbitrarily into two pieces, recursively sort those two pieces, and then merge those two pieces back together, then you’re using a mergesort of some fashion or another, even though there’s a certain amount of leeway in the details.

(I might have some sympathy to saying “A proper ‘quicksort’, as such, should not put any effort into pivot selection, but just take the first thing it sees”, but there’s still a lot of leeway in other details, and, besides, this wouldn’t really correspond to the way people actually speak, anyway.)

On second thought, I’ve been looking for an 18th algorithm to use, so I think I’ll use the median-of-3 quicksort variation as the 18th. It seems to work fine, and only adds two comparisons per recursion.

And let me say, this variation rocks. At first I figured the extra comparisons to determine the median would slow it down on random orders, but (as you no doubt already knew) it speeds it up on random orders because it’s generating better pivots.
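
For anyone who wants to see the shape of it, here’s a simplified sketch rather than my exact benchmark code (the median step here is the straightforward three-compare version):

' Called as: QuickSort3 varData, LBound(varData), UBound(varData)
Public Sub QuickSort3(ByRef pvarArray As Variant, ByVal plngLeft As Long, ByVal plngRight As Long)
    Dim lngMid As Long
    Dim L As Long
    Dim R As Long
    Dim varPivot As Variant
    Dim varSwap As Variant

    If plngLeft >= plngRight Then Exit Sub
    lngMid = plngLeft + (plngRight - plngLeft) \ 2

    ' Median-of-3: order the first, middle and last elements so the
    ' middle one holds the median of the three samples
    If pvarArray(lngMid) < pvarArray(plngLeft) Then
        varSwap = pvarArray(plngLeft): pvarArray(plngLeft) = pvarArray(lngMid): pvarArray(lngMid) = varSwap
    End If
    If pvarArray(plngRight) < pvarArray(plngLeft) Then
        varSwap = pvarArray(plngLeft): pvarArray(plngLeft) = pvarArray(plngRight): pvarArray(plngRight) = varSwap
    End If
    If pvarArray(plngRight) < pvarArray(lngMid) Then
        varSwap = pvarArray(lngMid): pvarArray(lngMid) = pvarArray(plngRight): pvarArray(plngRight) = varSwap
    End If
    varPivot = pvarArray(lngMid)

    ' Standard partition around the pivot value
    L = plngLeft
    R = plngRight
    Do
        Do While pvarArray(L) < varPivot
            L = L + 1
        Loop
        Do While varPivot < pvarArray(R)
            R = R - 1
        Loop
        If L <= R Then
            varSwap = pvarArray(L)
            pvarArray(L) = pvarArray(R)
            pvarArray(R) = varSwap
            L = L + 1
            R = R - 1
        End If
    Loop Until L > R
    QuickSort3 pvarArray, plngLeft, R
    QuickSort3 pvarArray, L, plngRight
End Sub

The three conditional swaps leave the smallest sample at the left end, the median in the middle, and the largest at the right end, which has the nice side effect that the partition scans can’t run off either end of the section.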

Anyone have any pseudocode on the most efficient way to pick the median of 3 values? My first pass at it is suboptimal.

If there is a legitimate reason to use a median of 5 instead of 3, then the most efficient way to determine that would be great.

Note that I can’t use any mathematical operations on the elements; these functions are written with variants so that they can be used for arrays of any data type: numbers, dates, strings, etc…

The simplest way to do it is just to bubblesort a 3-element array and pick out the middle value.

If you do it right, you’ll have a worst case of three comparisons, a best case of two, and an average (over the six possible permutations) of 8/3. All of these are the best possible.

(Not that any of this will be significant in comparison to the rest of your algorithm.)
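
Concretely, one decision tree that achieves this looks like the following (a sketch; the function name is just for illustration, and Variants keep it working for numbers, dates, strings, etc.):

Public Function MedianOf3(ByVal a As Variant, ByVal b As Variant, ByVal c As Variant) As Variant
    If a < b Then
        If b < c Then
            MedianOf3 = b ' a < b < c: done in two comparisons
        ElseIf a < c Then
            MedianOf3 = c ' a < c <= b
        Else
            MedianOf3 = a ' c <= a < b
        End If
    Else
        If a < c Then
            MedianOf3 = a ' b <= a < c: done in two comparisons
        ElseIf b < c Then
            MedianOf3 = c ' b < c <= a
        Else
            MedianOf3 = b ' c <= b <= a
        End If
    End If
End Function

Two of the six orderings finish after the second comparison; the other four need a third, which is where the 8/3 average comes from.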

Ellis… You’re thinking too small!!!

I’ve been working with sorts for nearly 30 years. I downloaded your code, analyzed it, and have to say “WOW!!!” The only problem you are having is that you are thinking too small, in number of items. Try 10,000,000 or more items and see what happens. It’s pretty incredible. The reason I spent the time on your algorithm is that it is similar to one of my own. Your algorithm is a linear one. I’ve not gotten to the exact math yet, but I’ve sorted 10,000,000 records in under one second on my Pentium M laptop. And yes, it smoked the fastest quicksort out there. My quicksort is nearly twice as fast as any I’ve seen on the net, and it got creamed by nearly 15 seconds on the 10-million-item comparison.

I have not found the worst case for your algorithm yet. I’ve got a pretty flexible testing setup, and I’ve thrown all the basics at it, and it hasn’t flinched yet. In fact, I had to check my setup, because it ran so fast the first time. The reason you are so close to quicksort in your tests is that you are at the point in the performance “curve” where your line crosses quicksort’s, somewhere between 50,000 and 100,000 items. If you increase the number of items, I’m sure you’ll see what you’re looking for.

I’ve been working on two algorithms in the 5-to-6-seconds-per-10,000,000 range, and had planned on announcing one or both soon. I hadn’t even dug up my old “Bullet Sort”. That’s what I called my version of the presorted-lists algorithm. The last time I had it out, my computer didn’t even have capacity for 10,000,000 of anything. I was pleased to see it at least compares to what you have. I went about it a little differently, but it’s interesting that the times are fairly close. Sorry, I didn’t come here to talk about my algorithms. I came here to tell you to stick to your guns, man. YOU HAVE SOMETHING HERE!!!

I’d be happy to provide you with a quicksort to test against, and rest assured, it’s been through the fire. The first thing I do when I find another quicksort is to compare it to mine. I keep thinking that one will outperform it, but so far it’s undefeated. It would be considered a combined algorithm, though, because it uses an insertion sort to complete the cleanup on little lists (fewer than 15 elements).

Braaaains!

I appreciate the kind words. Sadly, this algorithm didn’t turn out to be entirely original. The basic concept is called natural merge sort because it does a merge sort based on the initial (natural) contiguous blocks in the list.

Where my implementation is original is in the snaking back and forth between the original array and the mirror array, as opposed to conventional merge sorts which merge to the mirror array and then copy back to the original array before going on to the next set of merges. This greatly increases efficiency, but I don’t think it’s enough to merit the claim that this algorithm is original to me. Think of it as a natural merge sort on steroids.

As a comparison sort it still runs into the glass ceiling of all comparison sorts. Radix sort isn’t a comparison sort at all, and (almost by definition) ends up being orders of magnitude faster than even the fastest comparison sorts like quicksort.

To answer a question from upthread, I’ve since gotten some help implementing a true smooth sort, and natural merge sort isn’t even remotely related. Natural merge sort does merging to sort the list while smooth sort uses heap structure to sort the list. They are about as unrelated as it gets.

It’s tricky to pin down the worst case scenario for natural merge sort because it doesn’t really have one. Much like heap sort, the worst case is effectively the same as the average case. Unlike heap sort, though, natural merge sort has many initial orderings where it performs far better than the average case. Heap sort performs at the same speed regardless of initial ordering. This is why people thought it sounded like smooth sort, which has a smooth transition from best case to worst case as the starting order becomes more shuffled. (Natural merge sort does not have a smooth transition from best to worst.)

EDIT: Rereading the OP, another optimization that puts my implementation ahead of natural merge sort is that my implementation looks for ordered blocks in either direction. So zyxvuabcde will appear as two blocks to my implementation but six blocks to a conventional natural merge sort. So in addition to steroids, it’s also on HGH.
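
The run scanning itself is conceptually simple. Something like this (a sketch of the idea with an illustrative name, not my actual code):

Private Function FindRunEnd(ByRef pvarArray As Variant, ByVal lngStart As Long) As Long
    ' Returns the index of the last element of the ordered run
    ' beginning at lngStart; descending runs are reversed in place
    Dim lngEnd As Long
    Dim L As Long
    Dim R As Long
    Dim varSwap As Variant

    lngEnd = lngStart
    If lngEnd = UBound(pvarArray) Then
        FindRunEnd = lngEnd
        Exit Function
    End If
    If pvarArray(lngEnd + 1) < pvarArray(lngEnd) Then
        ' Strictly descending run: extend it as far as it goes...
        Do While lngEnd < UBound(pvarArray)
            If pvarArray(lngEnd + 1) < pvarArray(lngEnd) Then lngEnd = lngEnd + 1 Else Exit Do
        Loop
        ' ...then reverse it in place so it reads ascending
        L = lngStart
        R = lngEnd
        Do While L < R
            varSwap = pvarArray(L)
            pvarArray(L) = pvarArray(R)
            pvarArray(R) = varSwap
            L = L + 1
            R = R - 1
        Loop
    Else
        ' Ascending (or equal) run: just extend it
        Do While lngEnd < UBound(pvarArray)
            If pvarArray(lngEnd + 1) < pvarArray(lngEnd) Then Exit Do
            lngEnd = lngEnd + 1
        Loop
    End If
    FindRunEnd = lngEnd
End Function

Note that equal elements end a descending run but not an ascending one; that way, reversing a descending run can never reorder equal keys.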

I think I know what this means, but to be clear: When I’m sorting a stack of papers by name, and I see a name that starts with Y, I immediately put it on the bottom of the stack, because it’s probably the last name in the set, or at least close to it. Meanwhile, if I get an M name, I put it near the middle. Is this the basic idea behind radix sort?

Radix sort is built on counting sort, but it uses trickery to make it work with strings. I believe you have to customize this trickery for each implementation, but I could be wrong. I have code for both types of radix string sorts (left-to-right and right-to-left) but never bothered to look at it, because my only interest in sorting was benchmarking comparison sorts.

Counting sort only works on integers. First you go through and identify the lowest and highest value in your list. Then you allocate an array large enough to hold one “bucket” for each unique value. So if you have 10,000 numbers from 0-100, your bucket array should be 0-100.

The meat of counting sort is extremely simple. All you do is step through the array of 10,000 items, and for each item you increment that array element’s bucket. So if the first item in the list to be sorted is 57, you do something like bucket[57] = bucket[57] + 1.

After this single, linear pass, you have everything you need to rebuild the list fully sorted. Just make a single pass through the bucket array and fill in the original array as appropriate. If the 0 bucket has a total of 167, the first 167 elements of your list get set to 0. Let’s say bucket 1’s counter is 1260. That means after the 167 zeros, you set the next 1260 items to 1. Lather, rinse, repeat.

It ends up requiring 1 linear pass to identify the low and high values, 1 linear pass to count up all the bucket values, and 1 final linear pass to overwrite the original list with the new sorted order. Three linear passes is hugely faster than any comparison sort could ever dream of being. Note how no elements in counting sort are ever compared against each other, which is why it isn’t limited by the bounds of comparison sorts.
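
In code, the whole thing is roughly this (a sketch for an array of Longs, assuming the value range is small enough for the bucket array to fit comfortably in memory):

Public Sub CountingSort(ByRef plngArray() As Long)
    Dim lngLow As Long
    Dim lngHigh As Long
    Dim lngBucket() As Long
    Dim i As Long
    Dim j As Long
    Dim O As Long

    ' Pass 1: find the low and high values
    lngLow = plngArray(LBound(plngArray))
    lngHigh = lngLow
    For i = LBound(plngArray) To UBound(plngArray)
        If plngArray(i) < lngLow Then lngLow = plngArray(i)
        If plngArray(i) > lngHigh Then lngHigh = plngArray(i)
    Next

    ' Pass 2: tally one bucket per possible value
    ReDim lngBucket(lngLow To lngHigh)
    For i = LBound(plngArray) To UBound(plngArray)
        lngBucket(plngArray(i)) = lngBucket(plngArray(i)) + 1
    Next

    ' Pass 3: overwrite the original array in sorted order
    O = LBound(plngArray)
    For i = lngLow To lngHigh
        For j = 1 To lngBucket(i)
            plngArray(O) = i
            O = O + 1
        Next
    Next
End Sub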

Radix sort uses a similar concept. I can’t really visualize how it converts strings into bucket indexes, but I have seen code run that shows a radix string sort to be vastly superior to quicksort.

If you used a linked list, you could probably omit the third step: instead of incrementing a counter each time, just add a new node (probably from an array of them?) to that bucket value with the data, and at the end link the last node of each bucket to the first node of the adjacent one. Set original = new and you’re done. Though maybe I’m underestimating the time it takes to make another object (and the memory would probably be pretty ouch as well for larger ones, but whatever). Changing it into a 2-D array would work too, but that necessarily alters the original structure of the data, so it’s not the best option.

Incrementing a counter is possibly the fastest operation a computer can do. Allocating memory is possibly the slowest. I believe the counting sort implementation I described is the fastest possible way to sort large lists of a small range of numbers.

I think what you describe is similar to the trickery to make it work with strings, though, so I think you’re on the right track for conceptually turning counting sort (integers only) into radix sort. (Strings, numbers of any kind, dates, etc…)

Okay, it looks like I got it… kind of:

A radix sort goes like this:
Place the element in a bucket based on the LSD (rightmost digit; if you actually use LSD, it’s a bogosort).

Move left a digit; iterate through the buckets in order and throw the elements into new buckets based on their next digit to the left.

Rinse, wash, repeat until you hit the MSD. Your list will be sorted because of the order you threw them into the buckets. This does require padding of a sort (if you have numbers between 1 and 3 digits, you have to pad with zeroes out to the maximum number of digits, so 2 would become 002 for the sort).

The issue with strings seems to be the padding. I’m not sure of the best way to fix this; the only way I can think of is to reserve a certain character that won’t be used in the set (say, ^) and “hardwire” that as the first bucket. In addition, you’d pad on the right instead of the left, because a shorter string should come before a longer one that is otherwise the same up to its length.

Edit: You do, of course, have to iterate through once to get the longest string/highest digit number unless you know the max length already.
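
Something like this, for non-negative Longs anyway (a sketch; it uses a counting pass per digit instead of literal bucket lists to keep the redistribution stable, and strings would need the padding business described above):

Public Sub RadixSortLSD(ByRef plngArray() As Long)
    Dim lngMax As Long
    Dim lngDivisor As Long
    Dim lngCount(0 To 9) As Long
    Dim lngOutput() As Long
    Dim lngDigit As Long
    Dim lngPos As Long
    Dim lngTemp As Long
    Dim i As Long

    ' One pass to find the highest value (i.e. the most digits)
    lngMax = plngArray(LBound(plngArray))
    For i = LBound(plngArray) To UBound(plngArray)
        If plngArray(i) > lngMax Then lngMax = plngArray(i)
    Next

    ReDim lngOutput(LBound(plngArray) To UBound(plngArray))
    lngDivisor = 1
    Do
        ' Count how many keys land in each digit bucket
        For i = 0 To 9
            lngCount(i) = 0
        Next
        For i = LBound(plngArray) To UBound(plngArray)
            lngDigit = (plngArray(i) \ lngDivisor) Mod 10
            lngCount(lngDigit) = lngCount(lngDigit) + 1
        Next
        ' Turn the counts into each bucket's starting position
        lngPos = LBound(plngArray)
        For i = 0 To 9
            lngTemp = lngCount(i)
            lngCount(i) = lngPos
            lngPos = lngPos + lngTemp
        Next
        ' Stable scatter by the current digit, then copy back
        For i = LBound(plngArray) To UBound(plngArray)
            lngDigit = (plngArray(i) \ lngDivisor) Mod 10
            lngOutput(lngCount(lngDigit)) = plngArray(i)
            lngCount(lngDigit) = lngCount(lngDigit) + 1
        Next
        For i = LBound(plngArray) To UBound(plngArray)
            plngArray(i) = lngOutput(i)
        Next
        If lngDivisor > lngMax \ 10 Then Exit Do ' highest digit done; also avoids overflow
        lngDivisor = lngDivisor * 10
    Loop
End Sub

Each pass redistributes on one digit; because the scatter is stable, the ordering from the earlier (less significant) digits is preserved, which is why the list comes out sorted after the last pass.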

Ellis,
I mistakenly posted to a very old thread earlier on another site, giving you kudos. You have something here, even if it is an existing algorithm. I downloaded snakesort to analyze it. I had written something that used the sorted runs in “random” data some years back, and thought it would be interesting to see how you implemented it. I ran snakesort with as many as 30,000,000 data items in under 5 seconds on my old laptop (compared to 1 minute for quicksort). The algorithm is natural merge sort, though I think you’ve implemented it very well indeed. I’ve been working with sorts for nearly 30 years, and this is the fastest comparison sort by far… I’m not sure why it hasn’t gotten better publicity. Maybe because it takes getting past 100,000 items before it completely blows quicksort sideways.

Publicity? The only places it’s ever been posted are this thread and one other. The linked thread contains the source code to the graphical sorting program I wrote plus descriptions of over a dozen algorithms including snake sort. The second and third pages of that thread contain discussion about some of the algorithms and implementations. Of particular interest is post 55. The attached text file (SmoothSort.bas) contains an amazingly thorough explanation of every facet of smooth sort, which to me was a complete mystery.

I’m currently looking for a free attachment hosting website so I can attach my graphical sorter so people here can check it out. It’s an incredibly descriptive way to understand exactly how these algorithms work. No amount of explanation can come close to the impact of watching the algorithms sort visually. It’s especially useful for seeing exactly how related algorithms differ (heap sort and smooth sort) and how unrelated algorithms are similar (insertion sort and gnome sort).

On that site it is against the rules to post executable files, so while the complete source code there is great for the programmers over there, it’s of little use here. Anyone know of a place where I can post an attachment for free?

Ellis,

MY APOLOGIES.

PLEASE DISREGARD MY PREVIOUS POST; the data in it is WRONG.

Does anyone know how to delete my previous post, so that no one gets misinformed?

Which previous post, the one where you bumped the thread or the one earlier this page?

You can’t delete your own post, but a moderator can if you ask. I’m reporting this so it gets their attention.