Fuzzy Text Search in SQL Server (or other database)

My database has a list of items for sale, eg digital cameras. I can at the moment run a full text search for items matching an exact string. For example, I can get a list of items with the string ‘Panasonic DMC-FS15’ in the description field.

However, it’s exact matches only. It won’t return items that *nearly *match. If someone has mistyped the camera description as ‘Panasonic DMC-F515’ then my search won’t return it.

Is there any add on that will allow this in SQL Server?

Or, failing that, any other DBMS that will allow it? It also must have a free version for home/ non-commercial use.

Try SOUNDEX.

Soundex doesn’t do what I want.

There are other things like the Postgres Trigram module and Lucene, but I very much doubt you’ll find anything that does what you want out of the box.

Simple proof: google “Panasonic DMC-F515” (with quotes). Looks like the results all contain F515, and not FS15. If Google doesn’t see through that typo, chances are nothing else will.

What you’ll probably need to do: record a log of all failed searches. Review it manually for typos, figure out what each one meant. Then tag a searchable field on each product with the typo text (so your DMC-FS15 product record has the text “DMC-F515” somewhere in it, for example). If the manual review part is too tedious, use Mechanical Turk.

Nothing useful to add because I’m such a beginner but -

Google does on occasion correct my misspelled words. And it even seems to work on some model numbers like if you enter things like Gibson ST-200 it replies with “Do you mean Gibson sj-200” (an acoustic guitar model).

I always wondered whether there is a method/algorithm for doing that in my databases.

My guess is that it searches for occurrences of all words, then single words and compares letter by letter left over words in the single search results against the unmatched words in the query subject.

I can’t imagine a cross reference table system on even a Google database level. It would be enormous.

It would be nice if there is a method/algorithm and somebody posted an example on one of the SQL discussion boards/

and Hambil - thanks for the Soundex reference - that could be helpful for some of my search issues

I’m not sure why. It would certainly handle the example you gave. It all depends on how you construct your query, and the amount of difference you allow for in order for two fields to be deemed matching.

awhile ago I built a “query correction” module in Lucene using Levenshtein distance - Wikipedia algorithm or some such. It’s doable but non trivial.

The project subsequently got axed and replaced with a bought off-the-shelf module - so consider doing this preemptively for yourself :slight_smile:

I don’t think you will find this stuff in a regular RDBMS. Find it, buy it or roll your own.

As I stated in my OP, it’s a full text search on a description. The description might be something like:
*
Digital camera for sale, perfect working order, made by Panasonic, model number DMC-FS15. Comes in the original box, with all accessories, including battery charger, manual, USB cable and software.
*
If I’ve understood SOUNDEX right, it won’t work on something like that.

You need keyword searching plus did-you-mean. I use Sphinx on my company wiki, which does have limited do-you-mean. I have it automatically add all the word tokens in the corpus to the potential suggestions.

Thanks for the suggestion, I’ll take a good look at Sphinx, and see if it meets my needs.

Any other recommendations for similar products still welcome.

Oracle has what are called “regular expression” searches. I’m not sure, but I think that originated in the Unix world.

There is a free non-commercial version of it, as well (I have it loaded on my laptop for playing around).

Your search would be something like this:
SELECT * FROM inventory WHERE REGEXP_LIKE(inventory.item_name,‘Panasonic515*’,‘i’);

This should (if I’ve formed it correctly) return anything with the word ‘Panasonic’, anywhere in the string, followed by zero or more of any other character, followed by ‘515’, followed by zero or more of any other character, and matching regardless of case.

For example “Camera: Panasonic model DVM515”. “Panasonic DVM-515 subtype A”, “panaSonic DVM 515…”.

Obviously you can get a lot more elaborate than that. And I don’t promise my code above is correct - I know of regexps, but have only used them a handful of times.

Well, it can get complicated, and there are always more than one way to do things, but consider this for an example:

You can construct a query (or more likely a stored procedure) that will convert the above phrase “Digital camera for sale, perfect working order, made by Panasonic, model number DMC-FS15. Comes in the original box, with all accessories, including battery charger, manual, USB cable and software.” to a SOUNDEX phrase that might look something like “5994 3211 7654 …” etc. Each word in the original phrase replaced by a SOUNDEX equivalent.

It would just be another column on the table, created at the time the record is added to the table and thus having almost no overhead. That column can be fulltext indexed instead of, or along with, the original column.

Thus, you can perform queries like SELECT FROM table.desc_soundex WHERE CONTAINS (table.desc_soundex, SOUNDEX(@search_value))

You can adjust the accuracy many ways by playing with the DIFFERENCE operator, creating stop lists, and tweaking the query such as:

SELECT FROM table.desc_soundex WHERE CONTAINS (table.desc_soundex, SOUNDEX(@search_value)) or CONTAINS (table.desc, @search_value)

It’s really the 80/20 rule of programming. It’s not hard to get 80% accuracy by that last 20% is why Google is a huge rich company and we are (assuming) not. :slight_smile: