Note: this article is a continuation of previous articles on search engines (part 1, part 2).
After testing out an alpha version of my search engine for a while, I realized that its greatest flaw (other than printing out the results in a downright ugly format) was that it couldn’t recognize “programs” as a variant of the word “program”. I briefly considered programming it to automatically check for a limited set of common variants, but I decided that this was probably too much effort for what would be a decidedly low-quality result. What I needed was a list that gave, for every word, all of its variants.
What I was looking for (I discovered after about 45 minutes of IRC chat, Google, and man pages) was the ispell English dictionary. Ispell dictionaries consist of a list of ‘roots’, each followed by a list of flags that describe how that root can be transformed to form other valid words. I could load this information into a MySQL table with a Perl script, and then retrieve it quickly both while indexing and while searching.
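To make the root-and-flag idea concrete, here is a minimal sketch in Python (not the Perl script described above) that expands entries written as `root/FLAGS` into their variants. The rule table is a hand-written, simplified stand-in for the affix file that ships with the dictionary, and the names `SUFFIX_RULES`, `expand_entry`, and `build_variant_index` are hypothetical. The resulting dictionary plays the role the MySQL table would play: look up any variant, get back its root.

```python
import re

# ASSUMPTION: a simplified, illustrative rule set. The real rules live in the
# affix file distributed with the ispell dictionary and cover more cases.
# Each rule: (regex the root must match, chars to strip from the end, suffix).
SUFFIX_RULES = {
    "S": [  # plurals / third person
        (r"[^aeiou]y$", 1, "ies"),
        (r"[aeiou]y$", 0, "s"),
        (r"[sxzh]$", 0, "es"),
        (r"[^sxzhy]$", 0, "s"),
    ],
    "G": [  # -ing forms
        (r"e$", 1, "ing"),
        (r"[^e]$", 0, "ing"),
    ],
    "D": [  # past tense
        (r"e$", 0, "d"),
        (r"[^aeiou]y$", 1, "ied"),
        (r"[^ey]$", 0, "ed"),
        (r"[aeiou]y$", 0, "ed"),
    ],
}


def expand_entry(entry):
    """Turn 'root/FLAGS' into [root, variant, ...]; the root comes first."""
    root, _, flags = entry.strip().partition("/")
    root = root.lower()
    variants = [root]
    for flag in flags:
        # Apply the first rule whose pattern matches; unknown flags are skipped.
        for pattern, strip, suffix in SUFFIX_RULES.get(flag, []):
            if re.search(pattern, root):
                stem = root[:len(root) - strip] if strip else root
                variants.append(stem + suffix)
                break
    return variants


def build_variant_index(entries):
    """Map every word form to its root, like a (word, root) lookup table."""
    index = {}
    for entry in entries:
        variants = expand_entry(entry)
        for word in variants:
            index[word] = variants[0]
    return index


if __name__ == "__main__":
    index = build_variant_index(["program/S", "try/SD", "bake/GD"])
    print(index["programs"])  # root for the query word "programs"
```

With an index like this, both the indexer and the search front end can reduce a word to its root before storing or matching it, so a query for “programs” finds documents containing “program”. In this sketch a word form that belongs to two roots keeps only the last one seen; a real table would need to store every (word, root) pair.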