Most sites will want to avoid “bad bots” – bots that harm a site more than they help it. Frequent traits of these bots are:

  1. Not pausing in between requests, causing server overload
  2. Ignoring robots.txt
  3. Using the ‘disallow’ lines of robots.txt to get more URLs to look at
  4. Harvesting email addresses for spam lists

Luckily for the webmaster, each of these traits forces a bot to reveal itself. I’ll focus here on #3, since it is the easiest to exploit, but I’ll also explain how to modify the scripts to take care of bots that don’t happen to use the robots.txt file to get more URLs.

The robots.txt file is ordinarily used by the webmaster to keep search engines (and bots in general) from looking at certain material. This is most often dynamic content like search pages, which is worthless to most bots. (If a webmaster wants to hide material from unauthorized users, it is best to use htaccess authorization, not just robots.txt.) A place a bot should not look is specified in the robots.txt file as follows:

Disallow: /hidden/address

However, some bots deliberately go to the URLs “disallowed” by the robots.txt file in hopes that they will yield juicy information, like email addresses to spam. You can catch these bots by creating a page that does nothing but log the IP address and user agent (browser or bot name) of the requester to a file, and then adding a reference to that page to the robots.txt file. For example, if your root web directory was /var/www/robot, you could create a file /var/www/robot/catch_that_bot.php:

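<?php
// Log the requester's user agent and IP address, tab-separated, one per line.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
// Strip tabs and newlines so a crafted user agent cannot forge log entries.
$agent = preg_replace('/[\t\r\n]+/', ' ', $agent);
$line = $agent . "\t" . $_SERVER['REMOTE_ADDR'] . "\n";
// Append under an exclusive lock so concurrent requests do not interleave.
file_put_contents('/var/www/robot/badbots', $line, FILE_APPEND | LOCK_EX);
?>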

Every time catch_that_bot.php is fetched from the web server, the client IP address and browser/bot name are logged to the file ‘badbots’. We can ensure that only bad bots get this file by not linking to it, and disallowing it in robots.txt:

Disallow: /catch_that_bot.php
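
Once the log has collected a few entries, it helps to see which user agents show up most often. The following is only a sketch: the function name tally_user_agents is made up for illustration, and it assumes each log line has the "user agent, tab, IP address" layout written by catch_that_bot.php.

```php
<?php
// Hypothetical helper: count how often each user agent appears in the log.
// Each log line is expected to look like "user agent<TAB>ip address".
function tally_user_agents(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        $parts = explode("\t", rtrim($line, "\r\n"));
        if (count($parts) !== 2 || $parts[0] === '') {
            continue; // skip blank or malformed lines
        }
        $agent = $parts[0];
        $counts[$agent] = ($counts[$agent] ?? 0) + 1;
    }
    arsort($counts); // most frequent first
    return $counts;
}

// Usage: read the log written by catch_that_bot.php, if it exists.
$log = '/var/www/robot/badbots';
if (is_readable($log)) {
    foreach (tally_user_agents(file($log)) as $agent => $hits) {
        echo $hits, "\t", $agent, "\n";
    }
}
```

The output is one line per user agent with its hit count first, which makes the repeat offenders easy to spot.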

To disallow the bad bot, all you have to do is look at the “user agent” and find some distinctive part of it. So if the user-agent string is “Mozilla 5.0 (evilguy)”, the distinctive part would be “evilguy”. Then, add to your .htaccess (near the beginning):

RewriteCond %{HTTP_USER_AGENT} evilguy
RewriteRule ^.* - [R=403]

Every time the bot named ‘evilguy’ visits, it will receive the ‘403’ error code in response: “Forbidden” (permission denied). For additional amusement, the robotics team site uses the ‘402’ error code: “payment required”. If the bot wants to pay us, we don’t mind!

What if the bot masquerades completely as a common browser, like Firefox? Then you’ll just have to block the IP address:

RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.89$
RewriteRule ^.* - [R=403]

Warning: you probably should avoid blocking an IP address unless that bot is really getting to be a problem. The bot could be sharing an IP with innocent users, and the innocent users will be blocked as well.
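
Because the RewriteCond pattern is a regular expression, an unescaped dot matches any character, so a hand-typed address can match more than intended. As a sketch, the rule can be generated from a validated address instead. The function name block_rule_for_ip is hypothetical, and only IPv4 addresses are handled here.

```php
<?php
// Hypothetical helper: build the two .htaccess lines that block one IPv4 address.
// Returns null when the input is not a valid IPv4 address.
function block_rule_for_ip(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    // Escape the dots and anchor the pattern so only this exact address matches.
    $pattern = '^' . str_replace('.', '\\.', $ip) . '$';
    return 'RewriteCond %{REMOTE_ADDR} ' . $pattern . "\n"
         . 'RewriteRule ^.* - [R=403]' . "\n";
}

// Prints the two rule lines for the example address used above.
echo block_rule_for_ip('123.45.67.89');
```

The printed lines can be pasted into .htaccess by hand. Writing to .htaccess automatically is deliberately left out, since a bad rule there can take the whole site down.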

But what if the bot doesn’t look in URLs disallowed by the robots.txt? The other three types of bad bots can be detected in equally automatable ways. The bot that sends requests too quickly (Google sends one every few minutes by default) is easy to detect: just make each page append the user agent and IP address to a file. Clear the file every 10 minutes, and when you clear it, go through and look for user agents that pop up more than 60 times. That user agent is probably a bot, so block it. The same script can be used to clear the file and block the bots.
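
The counting step described above might look roughly like this. It is a sketch under assumptions: the function name find_fast_agents and the log path /var/www/robot/requests are invented for illustration, the log uses the same "user agent, tab, IP address" layout as before, and the script is meant to be run every 10 minutes (for example from cron).

```php
<?php
// Hypothetical helper: return the user agents that appear more than $limit
// times in the request log. Each line is "user agent<TAB>ip address".
function find_fast_agents(array $lines, int $limit = 60): array
{
    $counts = [];
    foreach ($lines as $line) {
        $agent = explode("\t", rtrim($line, "\r\n"))[0];
        if ($agent === '') {
            continue; // skip blank lines
        }
        $counts[$agent] = ($counts[$agent] ?? 0) + 1;
    }
    $fast = [];
    foreach ($counts as $agent => $hits) {
        if ($hits > $limit) {
            $fast[] = (string) $agent;
        }
    }
    return $fast;
}

// Assumed log path; every page would append one line per request to it.
$log = '/var/www/robot/requests';
if (is_readable($log) && is_writable($log)) {
    foreach (find_fast_agents(file($log)) as $agent) {
        echo 'Candidate for blocking: ', $agent, "\n";
    }
    file_put_contents($log, '', LOCK_EX); // clear the log for the next window
}
```

Requests that arrive between reading and clearing the file are lost in this version, which should not matter for a rough threshold like 60. Since many real visitors share the same browser user agent, it is worth reviewing the candidates before blocking anything.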

For bots that simply ignore the robots.txt, the solution is equally simple. On your home page, put the following HTML code:

<div style="display:none"><a href="/catch_that_bot.php"></a></div>

and disallow catch_that_bot.php in your robots.txt. Then use the same script we used above, and the bad bots will be caught.

Catching the fourth type of bot, however, can actually be kind of tricky. You may not want to bother doing it, since most mail-harvesting bots will also ignore your robots.txt file. However, to catch those few that don’t, you can do the following (these instructions are neither particularly detailed nor reliable, since I’ve never bothered with them):

  1. Set up a mail server (SMTP/POP3) on your computer.
  2. Set up a page that has a hidden (see above) mailto: link to an email address on your new mail server (like badbots@mycomputer.homeip.org).
  3. When you get email at this address, look through the list of servers it went through. If the first server is something like gmail or yahoo, the best you can do is block the sending address. If, however, the first server seems to not be a public mail server, look up its IP and block all requests from that IP using the method above.
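
The “list of servers it went through” in step 3 is recorded in the Received: headers of the message. Each server adds its header at the top, so the first hop is the bottom-most one. The sketch below pulls out the bracketed IPv4 addresses in hop order. The function name received_ips is hypothetical, and Received: headers other than the one added by your own server can be forged, so treat the result as a hint only.

```php
<?php
// Hypothetical helper: list the IPv4 addresses found in the Received: headers
// of a raw email, ordered from the first hop (the sender side) to the last.
function received_ips(string $rawEmail): array
{
    // Headers end at the first blank line.
    $headers = preg_split('/\r?\n\r?\n/', $rawEmail, 2)[0];
    // Unfold continuation lines (lines that start with a space or tab).
    $headers = preg_replace('/\r?\n[ \t]+/', ' ', $headers);
    $ips = [];
    foreach (preg_split('/\r?\n/', $headers) as $line) {
        if (stripos($line, 'Received:') !== 0) {
            continue; // not a Received: header
        }
        // Many servers record the peer address in square brackets.
        if (preg_match('/\[(\d{1,3}(?:\.\d{1,3}){3})\]/', $line, $m)) {
            $ips[] = $m[1];
        }
    }
    // Each server prepends its header, so the oldest hop is at the bottom.
    return array_reverse($ips);
}
```

The first element of the returned array is the address to compare against known public mail servers before deciding whether to block it.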

Of course, you could also just use a captcha script.
