When I wrote a piece of software some time ago to check the URL entries in a directory, I wanted to ‘dis-sort’ the list of URLs in order to space checks on URLs from the same site as far apart in time as possible (e.g. if I hit tens of thousands of URLs from the same site in sequence without interspersing other sites, the administrators of that site might well class it as a denial-of-service attack and block my IP).
Now, sorting a list is not a problem - good algorithms are published and well known, and a lot of languages such as Perl have efficient built-in sort functions.
But I have not found algorithms for dis-sorting (dis-ordering) a list, so that items end up the more distant from each other in the output list the nearer they are to each other in collation order.
Of course you can put the list in (pseudo) random order, but wouldn’t that be sub-optimal in that respect, as some items near to each other in collation order would also happen to be near to each other in the output list?
(For my original task I used an ad-hoc solution: sorting URLs into bins by domain, then taking items from each bin in turn until all bins were empty. That was good enough, but inelegant, and obviously not as disordered as possible.)
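For what it’s worth, that bin-and-round-robin approach is easy to write down. Here’s a minimal sketch in Python, assuming plain URL strings and treating the hostname as the “domain” (the function name is mine):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def round_robin_by_domain(urls):
    """Bin URLs by domain, then take one URL from each bin in turn until all bins are empty."""
    bins = defaultdict(deque)
    for url in urls:
        bins[urlparse(url).netloc].append(url)

    output = []
    queues = deque(bins.values())
    while queues:
        q = queues.popleft()
        output.append(q.popleft())
        if q:                       # only re-queue bins that still have URLs left
            queues.append(q)
    return output
```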
Since sorting methods are different in the way they move things about, I think the definition of ‘maximally dis-sorted’ must vary depending on the algorithm you’re going to use to sort them. A bubble sort (well-known, but not at all efficient), for example, will have to do the most work if the list is in perfect reverse order (I think), whereas some other sorting methods will just eat that right up.
Not everything can be as far away from everything else as possible, and still remain in an actual list, so given the list:
1
2
3
4
What would dis-sorted mean? - if we put 1 and 2 at opposite ends of the new list:
1
4
3
2
-then we’ve still got 3 and 4 right next to one another, but we can only move them apart by conceding to move 1 and 2 closer again - which value takes precedence in all of this, and why?
I don’t rightly know what a good metric for being maximally disordered would look like. (in the case of 1-2-3-4 there is no possibility of having an output list where no two consecutive numbers would be adjacent, right?)
To expand on that idea, you might define a “disorderedness measure” as the sum of the distances, in the output list, between values that are adjacent in collation order. For instance, 3142 and 2413 have measures of 7, as opposed to 1234 or 4321 having measures of 3. Now, what permutation maximizes that? Of course, I’m still not sure that it’s a reasonable criterion.
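In code that measure might look like this, assuming “pairs” means values that are adjacent in collation order and “distance” means how far apart they land in the output list (just a sketch):

```python
def disorder(perm):
    """Sum of output-list distances between values that are consecutive in sorted order."""
    pos = {value: index for index, value in enumerate(perm)}
    ordered = sorted(perm)
    return sum(abs(pos[a] - pos[b]) for a, b in zip(ordered, ordered[1:]))

# disorder([3, 1, 4, 2]) == 7, disorder([2, 4, 1, 3]) == 7, disorder([1, 2, 3, 4]) == 3
```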
It seems reasonable. If you take the second half of your list, reverse it, and interleave the result with the first half, you should come pretty close to maximizing it.
ETA: Maybe it’s better not to reverse the second half. I’ll let someone else check the numbers.
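A quick sketch of that interleaving, with the reversal left as an option so both variants can be fed to the measure above (the helper name is mine):

```python
def interleave_halves(items, reverse_second=True):
    """Interleave the first half of a sorted list with the (optionally reversed) second half."""
    mid = (len(items) + 1) // 2
    first, second = items[:mid], items[mid:]
    if reverse_second:
        second = second[::-1]
    out = []
    for i in range(mid):
        out.append(first[i])
        if i < len(second):
            out.append(second[i])
    return out

# interleave_halves([1, 2, 3, 4]) -> [1, 4, 2, 3]; pass reverse_second=False to check the ETA variant
```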
I think your ad hoc solution is close to what you actually want to do. What is your goal, after all? To avoid hitting any one server too often.
What I’d do is keep a list of all the servers you’ve accessed, along with a timestamp of when you last accessed each one. Don’t worry about randomizing your list (although a hash function would work well to that effect). Instead, before hitting a server, check its last access time and skip that URL if it’s too recent. After you’ve gone a certain distance down the list, restart back at the top.
The algorithm will take a little tuning to optimize for all your needs, so be sure to keep logs with enough information to be able to assess your performance.
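As a rough sketch of that loop (the delay constant and the function names are made up; tune them against your logs):

```python
import time
from urllib.parse import urlparse

MIN_GAP = 30.0   # seconds to leave between two hits on the same host (tunable)

def crawl_with_per_host_delay(urls, fetch):
    """Sweep the URL list repeatedly, skipping any URL whose host was hit too recently."""
    last_access = {}                          # host -> timestamp of the last request
    pending = list(urls)
    while pending:
        still_pending = []
        for url in pending:
            host = urlparse(url).netloc
            now = time.time()
            if now - last_access.get(host, 0.0) < MIN_GAP:
                still_pending.append(url)     # too recent, retry on the next pass
                continue
            last_access[host] = now
            fetch(url)                        # caller-supplied check/fetch function
        if len(still_pending) == len(pending):
            time.sleep(MIN_GAP)               # nothing was ready this pass; wait before retrying
        pending = still_pending
```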
If you sort a list but then place the element that is distance k from the beginning of the list into location k’, where k’ is k with its bits in reverse order, I think you’ll get a good dis-sorted list.
You mean, the list entries which are at collation-order positions #400, #401 and #402 (binary 110010000/110010001/110010010, respectively) get placed in locations #19 (000010011), #275 (100010011) and #147 (010010011)? It looks like at least adjacent list elements wind up well dispersed in the output list.
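As a sketch, assuming we pad the index out to the next power of two and simply drop the reversed indices that fall past the end of the list (which also covers the case where the count isn’t a power of two):

```python
def bit_reversed_order(items):
    """Emit items so the element at sorted position k lands at (roughly) the bit-reversal of k."""
    n = len(items)
    bits = max(1, (n - 1).bit_length())                      # enough bits to address every position
    out = []
    for target in range(1 << bits):
        source = int(format(target, f"0{bits}b")[::-1], 2)   # bit-reversal is its own inverse
        if source < n:                                       # skip the "holes" past the end of the list
            out.append(items[source])
    return out
```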
If you want to break them up, your maximum breakage will come from the following (see the code sketch after these notes):
1. Determine the total number of URLs (total_count)
2. Determine the total number of URLs from each host (host_total_count)
3. For each host, determine the frequency (host_frequency = host_total_count / total_count)
4. Create an array, curr_frequency_by_host, and initialise each entry to that host’s host_frequency
5. Create a linked list, hosts_by_frequency, sorted by curr_frequency_by_host
6. Take the host from the bottom (minimum frequency) of hosts_by_frequency
7. Use one URL from it
8. Set that host’s curr_frequency_by_host += host_frequency
9. Reinsert it into hosts_by_frequency, using curr_frequency_by_host as the sort key
10. Loop to #6 until all URLs have been used
In realistic terms, though, you’re better off just doing something stupid like inserting them all into a hash or a database and then iterating through (which will return them in semi-random order), or ordering the list alphabetically from the end of the URL or some such.
You will need to sort such that when items are reinserted, they will be placed as the last link of items with the same curr_frequency_by_host.
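Here’s one way to sketch that in Python, using a heap in place of the hand-maintained linked list but following the listed steps directly: take the host with the lowest current value, use one URL, bump its value by its frequency, reinsert, with ties going behind existing equal entries. The names are mine:

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

def spread_by_host_frequency(urls):
    """Interleave URLs across hosts following the frequency-bumping scheme described above."""
    by_host = defaultdict(deque)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)

    total_count = len(urls)
    heap = []
    for seq, (host, queue) in enumerate(by_host.items()):
        host_frequency = len(queue) / total_count
        # entries are (curr_frequency_by_host, tie-break counter, host, host_frequency)
        heapq.heappush(heap, (host_frequency, seq, host, host_frequency))

    output, seq = [], len(heap)
    while heap:
        curr, _, host, host_frequency = heapq.heappop(heap)
        output.append(by_host[host].popleft())                # use one URL from this host
        if by_host[host]:                                     # reinsert only if URLs remain
            heapq.heappush(heap, (curr + host_frequency, seq, host, host_frequency))
            seq += 1                                          # later reinserts sort after equal values
    return output
```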
Yeah, that’s what I meant. One problem with my idea is that if the number of entries is not a power of two, there will be holes in the resulting indices, but you at least have an ordering.
Here’s another method: Number your original list, from 0 to N-1. Create a floating point index variable, and initialize it to 0. Round your index to the nearest integer (but keep the floating point value), and use that element of the list. Now, increment your index by Phi*N modulo N (where Phi is the golden ratio, sqrt(5)/2 + 1/2), round it again, and repeat. I’m not absolutely certain that this will hit every list element before repeating, but I’m pretty sure.
Another option, which is guaranteed to work as long as the step doesn’t divide N, is to find the prime number closest to sqrt(N), and increment by that (modulo N) at each step.
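A sketch of that stepping idea; the key requirement is that the step shares no factor with N, so this version nudges the step upward until it is coprime (the starting value is just a stand-in for “the prime nearest sqrt(N)”):

```python
from math import gcd, isqrt

def stride_order(items, step=None):
    """Walk indices 0, step, 2*step, ... (mod N); visits every index once because the step is kept coprime to N."""
    n = len(items)
    if n <= 1:
        return list(items)
    if step is None:
        step = max(2, isqrt(n))          # stand-in for "the prime closest to sqrt(N)"
    while gcd(step, n) != 1:             # make sure the stride is coprime to N
        step += 1
    return [items[(i * step) % n] for i in range(n)]
```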
This sounds like a version of the Maximal Dispersion problem in Geography.
You want your entries to be maximally dispersed in list-space.
There are several (if not many) objective functions for defining maximal dispersion.
Can I suggest a rather simpler method? I’ll point out that the list doesn’t actually need to be dis-sorted, merely the elements picked out in random order. So compile a straight list - sorted is better than unsorted - and pick randomly within that list.
The OP knows random ordering will do a reasonable job, but is concerned that it could still pick the same website multiple times in a row. Also, if you’re choosing them randomly, it doesn’t matter whether the list is sorted or not.