Here’s my guess… from your description of the problem I’m assuming that multiple matches for a character are OK, and that each individual character can only be matched once.
Set up 2 arrays of length 36, initialized to 0. Let’s call them arr1 and arr2. The slots in the arrays will correspond to 0-9 and then A-Z.
Scan through the first string, and tally up the number of occurrences of each character in arr1. Do the same for the second string and arr2.
Now to get the total of matched characters, loop over arr1 and arr2, and at each step, add to the total min(arr1, arr2) where x is our index.
This has a running time of O(L+M) where L is the size of the largest string and M is the size of the alphabet.
What do y’all think of it? Sound right?
Obviously for larger alphabets this isn’t as good of an idea, but for this example I think it’s OK.
friedo, I’m no Perl expert, but am I right in understanding that your code is running a pattern-matching regex or two inside the loop? That might not be the most efficient way to go when dealing with long strings…
Two regexes, actually. One to see if the current character exists in the second string, and one to remove it if it does (so it doesn’t match next time.)
In terms of program efficiency, it’s very bad. You wouldn’t want to do that if you had to run this operation billions of times.
In terms of programmer efficiency, on the other hand, it can’t be beat. That took me less than a minute, and is a valid answer to the question IMHO.
Sort each string. For each character in the first string, count how many are in that string and how many are in the other one. Add the minimum of those two values to the total.
That sounds quadratic, but if you code it properly it’s linear, and the running time is dominated by the sort routine. If these are ASCII strings, each character is really a number, and you can use a couting sort, making the whole thing linear in the sum of the lengths of the input strings.
It might have been worthwhile to ask the interviewer what the context was, in terms of the criteria for ‘best algorithm’. If this is something that will be done over and over again with long strings, it’s worth putting a lot of effort into coming up with an efficient algorithm. Otherwise friedo’s comments are quite relevant.
The problem basically boils down to outputting the union between two multisets… since order doesn’t affect the result, it can be discounted.
This is a somewhat out there idea, but one notion might be to use some kind of fast sorting strategy to sort each string and then step through them.
Using the example given:
(After sorting)
STR-1: aabbcc
STR-2: accdxxy
Now, maintain pointers starting at the first character in each string. If they match, then push along both pointers and increment your matches variable. Otherwise, push along only the pointer that is resting on the earlier letter in order. To take a snapshot in the middle:
aabbcc
^
accdxxy
^
match count: 1
Since the first pointer is aimed at a “b”, and the second at an “a”, you move the first pointer on, (and next iteration you start cashing in on the C’s) Algorithm returns when one pointer passes the end of its string.
Efficiency calculations will depend on how long the expected strings are, but for strings length m and n, I get (m lg m) + (n lg n) + m + n , which could easily be less than m * n
on preview: ultrafilter is with me on the sorting routine… but with this ‘simultaneous stepping’ strategy it is perfectly clear that you can avoid a quadratic cost. Is this what you meant when you so blithely said “code it properly”? Or just step through the first and binary-search the second, which could also work.
It’s not because there’s only 36 of them. Even with a constant alphabet size, the running time of the sort algorithm depends on the length of the string. It’s easy to sort in linearithmic time on an arbitrary totally-ordered alphabet, but since ASCII characters are really integers, you can use counting sort, a specialized algorithm that doesn’t require comparability and runs in linear time.
If memory is constrained, you don’t want to use that, but you might be able to get away with one of the other specialized linear-time sorting algorithms.