Note: this article continues a previous article on search engines, and is itself continued in part 3.

After a bit over a week of coding (and learning various Perl libraries), I have completed stages 1 and 2 of the search engine, although stage 2, the indexer, could do with a little improvement. Both are written in Perl, and as usual, the complete code listings are below. I decided to write the entire spider and indexer in Perl and optimize later as necessary, so that I could get the thing done and not get bogged down in C code. If the Perl turns out not to be fast enough as the site grows, I plan to port it to C. Likewise, the actual search part (stage 3) will be written in PHP to save time. If the PHP is not fast enough, I’ll rewrite it in C, but I expect no problems.

The Spider (Crawler)

The spider for my search engine, update-raw-pages.pl, is rather simple. It is a breadth-first traversal (conceptually recursive, but implemented with an array as a queue) that starts at the main page. For each page, it stores the page and its meta-information in the database, then collects all of the links on that page and pushes them onto the queue. A hash keeps track of which pages have been seen, and a page is not added to the queue if it is already in the hash. The hash is updated when a page is added to the queue, not when it is processed (otherwise a page linked from several places could be queued, and therefore indexed, multiple times). The algorithm terminates when the queue is empty.
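To make the queue-and-hash bookkeeping concrete, here is a minimal sketch that runs the same traversal over an in-memory link table instead of the live site. The %links_on table and the crawl_order name are made up for illustration; the real spider fetches each page over HTTP and writes it to the database where this sketch just records the visit order.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page maps to the links found on it.
my %links_on = (
    '/'      => [ '/about', '/news' ],
    '/about' => [ '/', '/news', '/team' ],
    '/news'  => [ '/about' ],
    '/team'  => [],
);

# Breadth-first crawl from $start; returns pages in the order they were processed.
sub crawl_order {
    my ($start) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);    # marked when queued, not when processed
    my @order;
    while (@queue) {
        my $page = shift @queue;
        push @order, $page;       # stand-in for "store the page in the database"
        for my $link (@{ $links_on{$page} || [] }) {
            next if $seen{$link}++;    # already queued or processed
            push @queue, $link;
        }
    }
    return @order;
}

print join(' ', crawl_order('/')), "\n";    # prints: / /about /news /team
```

Because /news is marked as seen the moment it is queued from the main page, the second link to it (from /about) is ignored rather than queued again.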

I connect to the website using LWP::UserAgent, like so:

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::UserAgent;

    # Identify the robot and give up on slow responses after 10 seconds.
    my $user_agent = LWP::UserAgent->new;
    $user_agent->agent('Robowebot (http://robot.mbhs.edu/ internal robot) (bytbox@gmail.com)');
    $user_agent->from('bytbox@gmail.com');
    $user_agent->timeout(10);

    # Fetch a URL; returns the body of a successful HTML response, else nothing.
    sub httpget {
        my ($url) = @_;
        my $response = $user_agent->request(HTTP::Request->new(GET => $url));
        return unless $response->is_success;
        return unless $response->content_type eq 'text/html';
        return $response->content;
    }

    my $html = httpget("http://robot.mbhs.edu/");

The interesting part of the code is the HTML parsing. Against the best advice of the Perl IRC folk, I chose to do my HTML parsing with regular expressions, without the help of any modules. (This was sheer laziness on my part: I have a good bit of experience with regexps and didn’t want to learn a new API.) As a result, several parts of the code depend on features specific to our site (and many parts depend on valid XHTML). The regexp for link detection was easy: href *= *\"([^\"]*)\". I then ignored any results that ended in .css, .png, or the like. Getting the title was trivial, of course. The meta tags were the first really site-dependent thing, since I didn’t bother to add any robustness:

    # Both helpers assume that name comes before content, with nothing in
    # between, and that both attributes use double quotes.
    sub content_get_description {
        my ($content) = @_;
        if ($content =~ /<meta name="description" content="([^"]*)"/) {
            return $1;
        }
        return "";
    }

    sub content_get_keywords {
        my ($content) = @_;
        if ($content =~ /<meta name="keywords" content="([^"]*)"/) {
            return $1;
        }
        return "";
    }
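For comparison, here is a sketch of what a little robustness might look like while still staying with regexps: it accepts the attributes in either order and with either quote style. The meta_content name and the sample markup are invented for this example, and it is still no substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Return the content attribute of the first <meta> tag whose name matches
# $name, regardless of attribute order or quote style; '' if none is found.
sub meta_content {
    my ($html, $name) = @_;
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        my %attr;
        # Collect name="value" and name='value' pairs from the tag.
        while ($attrs =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
            $attr{ lc $1 } = defined $2 ? $2 : $3;
        }
        next unless defined $attr{name} && lc($attr{name}) eq lc($name);
        return defined $attr{content} ? $attr{content} : '';
    }
    return '';
}

# Hypothetical markup: content before name, and single quotes, in the first tag.
my $page = q{<head><meta content='robots, team' name="keywords" />
<meta name="description" content="Team site"></head>};

print meta_content($page, 'keywords'), "\n";       # prints: robots, team
print meta_content($page, 'description'), "\n";    # prints: Team site
```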

And now for the bad part. As I said in my previous post, I wanted to remove all tags and replace them, where possible, with the contents of their alt or title attribute. I also wanted to ignore the linkbar and the footer. The result:

    sub content_get_simple {
        my $content = $_[0];
        # wipe out stuff we don't want: keep only the content div if there
        # is one, otherwise keep only the body
        if ($content =~ /<div id="content">/) {
            $content =~ s/.*<div id="content">//s;
            $content =~ s/<!--end-content-->.*<\/html>//s;
        } else {
            $content =~ s/.*<body[^>]*>//s;
            $content =~ s/<\/body>.*//s;
        }
        # now replace the tags with their alt or title text, if any
        $content =~ s/<[^>]*alt="([^"]*)"[^>]*>/ $1 /g;
        $content =~ s/<[^>]*title="([^"]*)"[^>]*>/ $1 /g;
        $content =~ s/<[^>]*>/ /g;
        return $content;
    }

Obviously, when we upload the new page UI, this will have to change. However, it works, and the whole script takes about 5 seconds to run. I’m happy.
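To show what the tag replacement does to a piece of markup, here is a small stand-alone sketch of just that step. The strip_tags name and the sample HTML are made up, and the final whitespace cleanup is an addition for readable output rather than something the spider does.

```perl
use strict;
use warnings;

# Replace tags with their alt/title text (if any), then tidy the whitespace.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*\balt="([^"]*)"[^>]*>/ $1 /g;
    $html =~ s/<[^>]*\btitle="([^"]*)"[^>]*>/ $1 /g;
    $html =~ s/<[^>]*>/ /g;
    $html =~ s/\s+/ /g;      # collapse runs of whitespace (not in the spider)
    $html =~ s/^ | $//g;     # trim the ends
    return $html;
}

my $sample = '<p>Meet <img src="bot.png" alt="our robot" />, built in '
           . '<a href="/shop" title="the shop">room 12</a>.</p>';

print strip_tags($sample), "\n";
# prints: Meet our robot , built in the shop room 12 .
```

Note that the title text lands in front of the link text, since it replaces the opening tag. For indexing purposes that is fine: the words end up in the page text either way.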

The Indexer

The indexer, interestingly, is a good bit shorter than the spider, even though it took longer to code and does more. This is probably because it contains few if any cheap hacks (although it still needs a bit of optimization in the MySQL commands). Its job is simple: go through the text, the title, and the meta tags (each is assigned an arbitrary weight: 2 for body text, 4 for the title, 3 for the description, and 5 for the keywords), and every time a word of at least four characters comes up, add that word and the URL to the table (or else add the weight to the Strength column of the existing row):

    # Called as: process_words_from_url @words, $url, $priority
    sub process_words_from_url {
        my $priority = pop;
        my $url = escape pop;
        my @words = @_;
        for my $word (@words) {
            $word = escape $word;
            # is there already a row for this word on this page?
            $conn->query("select Strength from search_IndexTemp where Term='$word' and URL='$url'");
            my $record_set = $conn->create_record_iterator;
            my $r = $record_set->each;
            if ($r) {
                # yes: add this occurrence's weight to the stored strength
                my $strength = $r->[0] + $priority;
                $conn->query("update search_IndexTemp set Strength=$strength where URL='$url' and Term='$word'");
            } else {
                # no: start a new row at this weight
                $conn->query("insert into search_IndexTemp (URL,Term,Strength) values('$url','$word',$priority)");
            }
        }
    }

I also had to write checks to keep the program from indexing whitespace. That sped it up a lot.
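One possible optimization, sketched here under my own assumptions rather than taken from the indexer: add up the weights for a page in a hash first, so the database only sees one row per word and page instead of a select plus an update for every occurrence. The add_words helper, the sample URL, and the sample text are invented for illustration; get_words applies the same four-character filter as the indexer.

```perl
use strict;
use warnings;

# Same filter the indexer applies: keep words of four or more characters.
sub get_words {
    my ($text) = @_;
    return grep { length($_) >= 4 && /\w/ } split /\b/, $text;
}

# Add $weight to the running strength of every word in $text for $url.
sub add_words {
    my ($strength, $url, $text, $weight) = @_;
    $strength->{$url}{$_} += $weight for get_words($text);
}

my %strength;    # $strength{$url}{$term} = accumulated weight
my $url = 'http://robot.mbhs.edu/about';    # hypothetical page
add_words(\%strength, $url, 'About the robot team', 4);                 # title
add_words(\%strength, $url, 'The team builds a robot that wins.', 2);   # body

for my $term (sort keys %{ $strength{$url} }) {
    # a single insert per row would go here
    print "$term $strength{$url}{$term}\n";
}
```

With the sample text, "robot" and "team" each end up at 6 (4 from the title plus 2 from the body), while "the" and "a" are dropped for being too short.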

MySQL Issues

After I had finished with the indexer, I noticed that it took 10 minutes to run without the pauses (the indexer sleeps between pages; see sleep_some in the listing below), and 40 minutes to run with them. Not good. I took out the MySQL commands, and the run-time dropped to under 5 seconds. Obviously, the database was being slow. Why? Well, it turned out I hadn’t bothered to index anything. I created an index on Term and URL, and that brought the run-time with pauses down to about 1 minute. Much better.

I also discovered that the MySQL module I was using in my Perl code, Net::MySQL, is slow and supports very few of the standard features (not to mention the fact that I was using a version I had modified because the original didn’t work in the first place). I was pointed to a different module, DBI, which is apparently the standard. If you look below, you will see that only the crawler has been updated to use that module. The indexer, as well as the administration tools, remain with the old one. For now.
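If the indexer does move to DBI, its select-then-update-or-insert sequence could probably shrink to a single statement. The sketch below is one way that might look, not what the indexer currently does. It assumes a UNIQUE key on (URL, Term) in search_IndexTemp, which the post never mentions, and relies on MySQL's "on duplicate key update" clause. The add_strength name is made up; $dbh stands for a handle returned by DBI->connect.

```perl
use strict;
use warnings;

# Assumes search_IndexTemp has a UNIQUE key on (URL, Term); without one,
# MySQL would insert duplicate rows instead of updating the existing row.
my $UPSERT_SQL = <<'SQL';
insert into search_IndexTemp (URL, Term, Strength) values (?, ?, ?)
on duplicate key update Strength = Strength + values(Strength)
SQL

# $dbh is a connected DBI handle, e.g. from DBI->connect('DBI:mysql:robot', ...).
sub add_strength {
    my ($dbh, $url, $term, $priority) = @_;
    # The ? placeholders are bound by the driver, so no manual escaping is needed.
    return $dbh->do($UPSERT_SQL, undef, $url, $term, $priority);
}
```

The placeholders also make the hand-written escape function unnecessary for this statement, since the values never get pasted into the SQL string.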

Code Listings

update-raw-pages.pl
    #!/usr/bin/perl

    #Written June 2009
    #Updates the search_Pages table in database 'robot'
    #by Scott Lawrence (sclawren@mbhs.edu, bytbox@gmail.com)

    package main;

    use lib '/var/www/robot/inc';
    use lib '/var/www/robot/inc/lib';

    use warnings;
    use strict;

    use Cwd;

    use DBI;
    use HTTP::Request;
    use LWP::UserAgent;
    use Roboweb::Passwords qw($mysql_password);

    #forward declarations, so the subs can be called without parentheses
    sub httpget;
    sub crawl;
    sub handle_content;
    sub content_get_title;
    sub content_get_description;
    sub content_get_keywords;
    sub content_get_simple;
    sub add_url;
    sub mesg;

    sub connect_rbwdb;
    sub close_rbwdb;

    $|=1;
    my $user_agent = LWP::UserAgent->new;
    $user_agent->agent('Robowebot (http://robot.mbhs.edu/ internal robot) (bytbox@gmail.com)');
    $user_agent->from('bytbox@gmail.com');
    $user_agent->timeout(10);
    my $queue_read = 0;
    my $queue_write = 0;
    my @queue;
    my %urls_done;
    my $conn;#the mysql connection
    my $conn_open;#is the mysql connection open?

    connect_rbwdb;
    $conn->do("truncate search_Pages");

    $queue[$queue_write++] = "http://robot.mbhs.edu/";
    $urls_done{"http://robot.mbhs.edu/"} = 1;
    $urls_done{"http://robot.mbhs.edu"} = 1;#avoid this duplicate
    while ($queue_write > $queue_read) {
        crawl $queue[$queue_read++];
    }
    close_rbwdb;

    #fetch data; returns nothing unless we got a successful HTML response
    sub httpget {
        my ($url) = @_;
        my $response = $user_agent->request(HTTP::Request->new(GET => $url));
        return unless $response->is_success;
        return unless $response->content_type eq 'text/html';
        return $response->content;
    }

    sub crawl {
        my ($url) = @_;
        my $content = httpget $url;
        return unless $content;
        handle_content $content, $url;
        #now get all the links
        while ($content =~ /href *= *"([^"]*)"/g) {
            my $linkto = $1;
            #skip stylesheets, images, and archives
            next if $linkto =~ /\.(?:css|png|gif|jpg|jpeg|tgz|tar\.gz|zip|ico)$/;
            add_url $linkto;
        }
    }

    sub handle_content {
        my ($content, $url) = @_;
        my $title = content_get_title $content;
        my $desc = content_get_description $content;
        my $keys = content_get_keywords $content;
        my $simple = content_get_simple $content;
        $simple =~ s/\r//g;
        #placeholders let DBI quote every field, so a stray apostrophe in a
        #title or URL cannot break the statement
        $conn->do("insert into search_Pages (URL,Title,Meta_Description,Meta_Keywords,Content) values (?,?,?,?,?)",
                  undef, $url, $title, $desc, $keys, $simple);
    }

    sub content_get_title {
        my ($content) = @_;
        if ($content =~ /<title>(.*?)<\/title>/s) {
            return $1;
        }
        return "Untitled";
    }

    sub content_get_description {
        my ($content) = @_;
        if ($content =~ /<meta name="description" content="([^"]*)"/) {
            return $1;
        }
        return "";
    }

    sub content_get_keywords {
        my ($content) = @_;
        if ($content =~ /<meta name="keywords" content="([^"]*)"/) {
            return $1;
        }
        return "";
    }

    sub content_get_simple {
        my $content = $_[0];
        #wipe out stuff we don't want
        if ($content =~ /<div id="content">/) {
            $content =~ s/.*<div id="content">//s;
            $content =~ s/<!--end-content-->.*<\/html>//s;
        } else {
            $content =~ s/.*<body[^>]*>//s;
            $content =~ s/<\/body>.*//s;
        }
        #now replace the tags with their content
        $content =~ s/<[^>]*alt="([^"]*)"[^>]*>/ $1 /g;
        $content =~ s/<[^>]*title="([^"]*)"[^>]*>/ $1 /g;
        $content =~ s/<[^>]*>/ /g;
        return $content;
    }

    sub add_url {
        my ($url) = @_;
        if ($url =~ m{^http://robot\.mbhs\.edu/}) {
            #do nothing
        } elsif ($url =~ m{^/}) {
            mesg $url;
            #just prepend robot.mbhs.edu stuff
            $url = "http://robot.mbhs.edu$url";
        } else {
            mesg "!! $url";
            #not in this site, or else fubar.  ditch it
            return;
        }
        $url =~ s{/$}{};
        return if $urls_done{$url};
        #stuff to ignore
        return if $url =~ m{http://robot\.mbhs\.edu/login};
        #for now, we'll exclude wordpress from the index
        return if $url =~ m{http://robot\.mbhs\.edu/wordpress};
        #mark the page as done now, so it is never queued twice
        $urls_done{$url} = 1;
        $queue[$queue_write++] = $url;
    }

    sub mesg {
        my ($message) = @_;
    #    print $message."\n";
    }

    #MySQL functions

    sub connect_rbwdb {
        return if $conn_open;
        $conn = DBI->connect('DBI:mysql:robot', 'roboweb', $mysql_password)
            or die "Could not connect to database: $DBI::errstr";
        $conn_open = 1;
    }

    sub close_rbwdb {
        return unless $conn_open;
        $conn_open = 0;
        $conn->disconnect;
    }

index-pages.pl
    #!/usr/bin/perl

    #Written June 2009
    #Indexes the roboweb pages
    #by Scott Lawrence (sclawren@mbhs.edu, bytbox@gmail.com)

    package main;

    use lib '/var/www/robot/inc';
    use lib '/var/www/robot/inc/lib';

    use warnings;
    use strict;

    use Cwd;

    use Net::MySQL;

    use Time::HiRes qw(usleep time);
    use IO::Socket;
    use Roboweb::Passwords qw($mysql_password);


    #indexing
    sub index_page;
    sub get_words;
    sub process_words_from_url;

    #utils
    sub sleep_some;
    sub mesg;
    sub escape;
    sub connect_rbwdb;
    sub close_rbwdb;

    $|=1;
    my $debug = 0;
    my $conn;
    my $conn_open;
    my $rest_ratio = 5;#rest for this many times as long as we work
    my $last_rest = time()*1000;#in milliseconds

    #begin execution

    connect_rbwdb;
    #holding this port acts as a lock: if it is already taken, another
    #indexer is running
    my $server = IO::Socket::INET->new(LocalHost=>'localhost', LocalPort=>'44944', Proto=>'tcp', Listen=>1, Reuse=>1);
    #we're already here!
    exit unless $server;

    $conn->query("truncate search_IndexTemp");

    #for every page
    $conn->query("select URL,Content,Title,Meta_Description,Meta_Keywords from search_Pages");
    my $record_set = $conn->create_record_iterator;
    while (my $record = $record_set->each) {
        sleep_some;
        index_page $record->[0], $record->[1], $record->[2], $record->[3], $record->[4];
    }

    #swap
    $conn->query("truncate search_Index");
    $conn->query("insert into search_Index select * from search_IndexTemp");

    close($server);
    close_rbwdb;

    #end execution


    sub index_page {
        my ($url, $content, $title, $desc, $keywords) = @_;
        mesg $url if $debug;
        my @content_words = get_words $content;
        my @title_words = get_words $title;
        my @desc_words = get_words $desc;
        my @key_words = get_words $keywords;
        #the last argument is the weight given to each kind of text
        process_words_from_url @content_words, $url, 2;
        process_words_from_url @title_words, $url, 4;
        process_words_from_url @desc_words, $url, 3;
        process_words_from_url @key_words, $url, 5;
    }

    sub get_words {
        my ($text) = @_;
        return () unless defined $text;
        #keep words of at least 4 characters; skip whitespace and punctuation
        return grep { length($_) >= 4 && /\w/ } split /\b/, $text;
    }

    sub process_words_from_url {
        my $priority = pop;
        my $url = escape pop;
        my @words = @_;
        for my $word (@words) {
            $word = escape $word;
            #is there already a row for this word on this page?
            $conn->query("select Strength from search_IndexTemp where Term='$word' and URL='$url'");
            my $record_set = $conn->create_record_iterator;
            my $r = $record_set->each;
            if ($r) {
                my $strength = $r->[0] + $priority;
                $conn->query("update search_IndexTemp set Strength=$strength where URL='$url' and Term='$word'");
            } else {
                $conn->query("insert into search_IndexTemp (URL,Term,Strength) values('$url','$word',$priority)");
            }
        }
    }

    #pause for $rest_ratio times as long as we have worked since the last pause
    sub sleep_some {
        return if $debug;
        my $length = (time()*1000 - $last_rest) * $rest_ratio;#milliseconds
        usleep $length*1000;#usleep takes microseconds
        $last_rest = time()*1000;
    }

    sub mesg {
        my ($message) = @_;
        print $message."\n";
    }

    #escape a string for use inside single quotes in a MySQL statement
    sub escape {
        my ($str) = @_;
        $str =~ s/\\/\\\\/g;
        $str =~ s/'/\\'/g;
        $str =~ s/\n/\\n/g;
        $str =~ s/\r//g;
        return $str;
    }

    #MySQL functions

    sub connect_rbwdb {
        return if $conn_open;
        $conn_open = 1;
        $conn = Net::MySQL->new(server=>'binx.mbhs.edu', database=>'robot', user=>'roboweb', password=>$mysql_password);
    }

    sub close_rbwdb {
        return unless $conn_open;
        $conn_open = 0;
        $conn->close;
    }
