String matching algorithm

Continuing with our silly interview questions, I had this one the other day.

Describe the algorithm you would use in a subroutine that compares two strings and returns the number of characters that are common in both strings.

What would be the best algorithm for this subroutine?

Notes on the input strings:

  1. They are not necessarily the same length.
  2. They both have at least one character.
  3. The characters come from the set of all numbers 0 to 9 and all uppercase letters A to Z.

One more note:

If STR-1 is “bbaacc” and STR-2 is “dcaxxcy” the subroutine would return 3.

Here’s my guess… from your description of the problem I’m assuming that multiple matches for a character are OK, and that each individual character can only be matched once.

Set up 2 arrays of length 36, initialized to 0. Let’s call them arr1 and arr2. The slots in the arrays will correspond to 0-9 and then A-Z.

Scan through the first string, and tally up the number of occurrences of each character in arr1. Do the same for the second string and arr2.

Now to get the total of matched characters, loop over arr1 and arr2, and at each step, add to the total min(arr1, arr2) where x is our index.

This has a running time of O(L+M) where L is the size of the largest string and M is the size of the alphabet.

What do y’all think of it? Sound right?
Obviously for larger alphabets this isn’t as good of an idea, but for this example I think it’s OK.



#!/usr/bin/perl

use strict;
use warnings;

my $str1 = "bbaacc";
my $str2 = "dcaxxcy";

my $total = 0;
my %match;
for(split '', $str1) {
    if ($str2 =~ /$_/) {
        $match{$_}++;
        $str2 =~ s/$_//;
    }
}

$total += $_ for values %match;

print "Total matches: $total
";
print "Matches:
";
print "$_: $match{$_}
" for sort keys %match;


Output:

Total matches: 3
Matches:
a: 1
c: 2

friedo, I’m no Perl expert, but am I right in understanding that your code is running a pattern-matching regex or two inside the loop? That might not be the most efficient way to go when dealing with long strings…

Two regexes, actually. One to see if the current character exists in the second string, and one to remove it if it does (so it doesn’t match next time.)

In terms of program efficiency, it’s very bad. You wouldn’t want to do that if you had to run this operation billions of times.

In terms of programmer efficiency, on the other hand, it can’t be beat. That took me less than a minute, and is a valid answer to the question IMHO.

Right tool for the right job and whatnot.

Sort each string. For each character in the first string, count how many are in that string and how many are in the other one. Add the minimum of those two values to the total.

That sounds quadratic, but if you code it properly it’s linear, and the running time is dominated by the sort routine. If these are ASCII strings, each character is really a number, and you can use a couting sort, making the whole thing linear in the sum of the lengths of the input strings.

It might have been worthwhile to ask the interviewer what the context was, in terms of the criteria for ‘best algorithm’. If this is something that will be done over and over again with long strings, it’s worth putting a lot of effort into coming up with an efficient algorithm. Otherwise friedo’s comments are quite relevant.

The problem basically boils down to outputting the union between two multisets… since order doesn’t affect the result, it can be discounted.

This is a somewhat out there idea, but one notion might be to use some kind of fast sorting strategy to sort each string and then step through them.

Using the example given:

(After sorting)

STR-1: aabbcc
STR-2: accdxxy

Now, maintain pointers starting at the first character in each string. If they match, then push along both pointers and increment your matches variable. Otherwise, push along only the pointer that is resting on the earlier letter in order. To take a snapshot in the middle:



aabbcc
   ^
accdxxy
 ^
match count: 1



Since the first pointer is aimed at a “b”, and the second at an “a”, you move the first pointer on, (and next iteration you start cashing in on the C’s) Algorithm returns when one pointer passes the end of its string.

Efficiency calculations will depend on how long the expected strings are, but for strings length m and n, I get (m lg m) + (n lg n) + m + n , which could easily be less than m * n

on preview: ultrafilter is with me on the sorting routine… but with this ‘simultaneous stepping’ strategy it is perfectly clear that you can avoid a quadratic cost. :slight_smile: Is this what you meant when you so blithely said “code it properly”? Or just step through the first and binary-search the second, which could also work.

Yeah. I’m not sure if it’s considered a standard technique, but I’ve seen it before, and besides, what’s the fun in giving away the whole answer?

As ultra pointed out, sorting the characters can be done in linear time because there are only 36 of them. So it’s O(n + m)

I went with the sorting idea too. Then after S1 and S2 are sorted, use P1 and P2 as pointers into S1 and S2.

/* psudo-C code for string match */

match_str( char * S1, char * S2)
{

int count = 0;
char * P1, P2;

CHR_SORT(S1);

CHR_SORT(S2);

P1 = S1;
P2 = S2;

while ((*P1 != NULL) && (*P2 != NULL)) {

if (*P1 == *P2) {
	count++;
	P2++;
}
elseif (*P1 > *P2) {
	P2++;
}
else {
	P1++;
}

}

return (count);
}

It’s not because there’s only 36 of them. Even with a constant alphabet size, the running time of the sort algorithm depends on the length of the string. It’s easy to sort in linearithmic time on an arbitrary totally-ordered alphabet, but since ASCII characters are really integers, you can use counting sort, a specialized algorithm that doesn’t require comparability and runs in linear time.

If memory is constrained, you don’t want to use that, but you might be able to get away with one of the other specialized linear-time sorting algorithms.