I decided that the website had grown to the point where it needed a search engine. I didn’t want to use a google search or an embedded yahoo search – they look disgusting. I also didn’t want to use any of the third party searching scripts, since most of them were costly and all of them had commercial licenses. I like free software. So I set out to write my own.
General Considerations
Let me start off by saying that what I have below is not a magic, easy solution to writing a search engine. If you are planning to write the world’s “next google”, I have a recommendation: go to http://bing.com – Microsoft’s “next google”. Notice how “copycatted” it looks. Then search around (on google, please) to find out exactly how popular it is. Hint: not very. Microsoft tried and failed. Don’t waste your time. My problem is to build an internal search engine, which only needs to deal with a small number of pages, and is low traffic so it doesn’t have to be super fast. When I told my co-working friends about the project, the responses I got varied from “maniac” to “shoot yourself now rather than afterwards – save some time”. And that’s with a highly simplified version of the problem.
After studying up on what is required to write a good search engine (including looking at this excellent article by the ACM, on Why Writing Your Own Search Engine is Hard, I understood that there were three main tasks you had to complete, which corresponded to the three main ’stages’ of the search engine:
- Take down all of the pages
- Index the pages
- Build the actual searching system, which takes the user’s queries and does a match
In theory, part (1) requires a lot of bandwidth, and parts (2) and (3) require no bandwidth but do need a very, very good processor (and a large disk). When building an internal site search, no bandwidth is needed at all (obviously). That’s great! I don’t have much bandwidth! I don’t have a good CPU, either, but we’ll get around that. Unfortunately, the burden of part (1) is not entirely relieved – all of my pages are stored as php, and need to be translated into html, and then parsed, before I can search them. That’s the only “simplification” I get. Oh, and the site is really, really small.
Is’t Possible?
Before we go on to discuss how to code this, lets just decide whether or not it truly is feasible. I, personally, I have some constraints. First, I don’t want to spend a heck of a lot of time on this project. Second, I don’t have a super-fast processor to use – I have a quad-core, three-year-old processor that other people are using. I can only really count on being able to use one or two cores at any time – and if I have consistent usage of an entire core, I’m probably going to get my search engine deleted. So I need to keep processor power low. My only advantage is that hard disk space is really not much of a problem – right now, the content part of my website is in the 10 megabyte range, whereas the backups, old versions, wordpress installations, and other things I don’t want to search scale to half a gigabyte. So no one will mind, or even notice, if my search engine is rather inefficient as far as space consumption goes.
So I have to keep CPU usage small, although I’m practically unlimited as far as hard disk space. That’s not terrible – the first task is mainly a hard disk task – if there are 1000 pages to translate to HTML and parse (I have no where near that many), I could handle 1 a minute and completely re-parse the entire site three times per day.
The second and third tasks are more problematic. I figure that I can spread the second task out a lot, so that the only truly computation-intensive part is the third task. And I may be able to speed that up by doing more stuff beforehand.
Program Design
Getting and Parsing
The first part is almost trivial. I already have functions for fetching the pages from the database (yes, I store all my pages in a database) and running the php. The only thing I can’t do yet is extract the text.
I can’t just wipe out all of the tags, since some of them contain valuable information. For example, the “alt text” on anchor tags should be included in the search, as should the title text on abbr and other tags. So instead, I replace each tag by the contents of its title or alt attribute – bold – if it has one (and just delete the ending tags). I bold the contents of the attribute because I plan to take bold and italicized text into greater account, later on, than I will plain text. Thus, my pseudocode for the getter and parser is:
algorithm get_and_parse_pages
for every page
do
fetch page content to $php
evaluate $php as php, and place the result in $content
remove all end tags from $content
for each start tag
do
if tag has no title or alt attribute
then delete tag
else if tag has alt attribute but no title attribute
then replace tag with bolded contents of alt attribute
else if tag as title attribute but no alt attribute
then replace tag with bolded contents of title attribute
else do
replace tag with bolded contents of title attribute
concatenated with bolded contents of alt attribute
end if
end loop
end loop
That shouldn’t take too long to implement properly, using ob_start and related functions for fetching the output of the php script retrieved from the database.
Indexing and Lookups
Now for parts 2 and 3 – which are intimately linked as far as design is concerned. The output of the indexer is the input of the looker-upper.
Here we run into a problem – it is impossible to index every possible search phrase. Suppose that somebody remembers a quote form a particular page: “possible search phrase. Suppose that”. Obviously, this is not going to be in our index. The solution is to index only single words (and possibly command two- and three-word combinations, as I now suspect google does). For each one of the non-trivial words in a search query (and we can terminate at ten like google used to if we are having trouble), fetch all associated pages. Now, if there are too many pages, we can drop the ones that have only been ‘fetched’ once or twice, thus saving some time. Now, if necessary, we can do an actual search through those pages.
Uh-oh, you think. “We’re back where we started – trying to search through every page every time”. Well, no, we’re not. First of all, we’re only going to search through the top 30 pages – we can lump all the rest together in no particular order. Second, we have parsed HTML already, so we can save some time there. As a matter of fact, this problem pretty much reduces itself to a search through 300KB or less of information – easily done in 2 or 3 seconds if you don’t feel like implementing any further optimizations.
I’m not going to pseudocode these ideas yet – I will first putter around (not to mention getting a good night’s sleep), and see what seems feasible.
A small note in speed – I have no intention of doing much php coding for this search engine. Nearly all of it will be written in C and Perl. Where php is required by my serving system, I will call a C program or a Perl script. Both are rather faster than php, and perl has the added benefit of truly built-in regexp support.
If I wanted to be really fancy, I could also keep track of which pages linked where. Heck, I could even implement my own version of pagerank, or take into account which pages my visitors spent the most time on and/or visited the most. But I’ll keep things simple. (Not that simple, though. I intend to have the search tap into Wordpress’s own index, so it will also search the blog)
To be continued… here!