Note: this article is a continuation of previous articles on search engines (part 1, part 2).

After testing out an alpha version of my search engine for a while, I realized that its greatest flaw (other than printing out the results in a downright ugly format) was that it couldn’t recognize “programs” as a variant of the word “program”. I briefly considered programming it to automatically check for a limited set of common variants, but I decided that this was probably too much effort for what would be a decidedly low-quality result. I needed to find a list of, for every word, all of its variants.

What I was looking for (I discovered after about 45 minutes of IRC chat, google, and man pages) was the ispell english dictionary. ISpell dictionaries have a list of ‘roots’, and then, for every root, they have a list of flags that describe how that root can be transformed to form valid words. I could enter this information, via perl script, into a mysql table, and then retrieve it quickly both while indexing and while searching.

jab/MS
jabbed
jabbing
jack/DGMRS
jacket/DS
jacketed/U
jade/DGMS
jaded/PY
jail/DGMRSZ
jam/MS
James

In another file, there are definitions of each of the ten or so flags that can be used. These definitions made extensive use of regex, so it was somewhat easy to translate them into a perl script to place these in the mysql database.

I originally planned to have a database with a ‘Word’ column and a boolean column for each flag. I realized, about halfway through creating these columns, that this meant I would have to perform word transformations, on the spot, for every word in a page that came up as a result, since I wanted to make the related words be bolded. Obviously, that was going to be a bit slow. MySQL can look things up pretty quickly if you have indexes, especially if the table is relatively small (only 50,000 entries for this one, although I’ll probably be appending other dictionaries later). I redesigned my table to have two columns, ‘Root’ and ‘Word’. This way,for every word in the search, I could just look up its root, and do the same thing for each page. The mysql query would not slow things down very much, since I would have to do queries for each word anyway just to get the flag list.

I have not yet performed an tests on this, but I expect it to be reasonably fast (hopefully under 0.1 seconds for each query) and to give accurate results.

Related posts: