“Breadcrumb navigation” is a feature of many websites (including ours, now): a line of links showing the position of the current page in the overall hierarchy. This often corresponds directly to the URL. For example, if the URL were http://robot.mbhs.edu/resources/web/html, the breadcrumb bar might display something like Team 449 >> Resources >> Website Development >> HTML (except with links). This is useful to users, who get another navigational tool, and search engines like it too, because it results in a lot of links to your site within your site. As usual, I’ve put the source code used in the Blair Robot Project website below.
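To illustrate the URL-to-breadcrumb mapping described above, here is a minimal sketch in C. It is not the site's own code: the function name build_breadcrumbs is made up for this example, the raw path segments are used as link labels (a real site would look up a display title such as “Website Development” for each page), and the path is assumed to start with a slash and to contain no characters that need HTML escaping.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build a breadcrumb trail for a URL path such as "/resources/web/html".
 * Each path segment becomes a link to the cumulative path up to that segment.
 * Assumption: the raw segment text is used as the link label.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int build_breadcrumbs(const char *path, const char *root_label,
                      char *out, size_t out_size)
{
    const char *seg = path;
    size_t used;
    int n;

    /* The first crumb always points at the site root. */
    n = snprintf(out, out_size, "<a href=\"/\">%s</a>", root_label);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    used = (size_t)n;

    while (*seg != '\0') {
        const char *end;

        while (*seg == '/')          /* skip separators, including repeats */
            seg++;
        if (*seg == '\0')
            break;
        end = strchr(seg, '/');
        if (end == NULL)
            end = seg + strlen(seg);

        /* The href is the path from its start through the current segment. */
        n = snprintf(out + used, out_size - used,
                     " &gt;&gt; <a href=\"%.*s\">%.*s</a>",
                     (int)(end - path), path,
                     (int)(end - seg), seg);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
        seg = end;
    }
    return 0;
}
```

Calling build_breadcrumbs("/resources/web", "Team 449", buf, sizeof buf) fills buf with three links: the root, /resources, and /resources/web, separated by “&gt;&gt;”.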
Posts Tagged php
Note: this article is a continuation of a previous article on search engines, and has been continued with part 3.
After a bit over a week of coding (and learning various Perl libraries), I have completed stages 1 and 2 of the search, although stage 2, the indexer, could do with a little improvement. Both are written in Perl, and as usual, the complete code listings are below. I decided to write the entire spider and indexer in Perl and optimize as necessary later on, so that I could get done with the thing and not get bogged down in C code. If the Perl turns out not to be fast enough as the site grows, then I plan to port to C. Likewise, the actual search part (stage 3) will be written in PHP to save time. If the PHP is not fast enough, I’ll rewrite it in C – but I expect there to be no problems.
For any given weird, seemingly pointless action, you will be able to find at least 3 good, interesting reasons for wanting to take that action (provided you look around hard enough). In my case, the action was calling a cgi script from php, and the reason was to implement a good logging system. I run a bugzilla installation here, and I didn’t want to go editing every page to get a logger. I also didn’t want to use Apache’s own loggers, since a MySQL database is much easier to handle than a super-sized file.
My solution was to create an index.php file and modify the htaccess so that every request to the bugzilla installation went through index.php. Then, I could have index.php log the event, and call the appropriate cgi script.
I decided that the website had grown to the point where it needed a search engine. I didn’t want to use a google search or an embedded yahoo search – they look disgusting. I also didn’t want to use any of the third party searching scripts, since most of them were costly and all of them had commercial licenses. I like free software. So I set out to write my own.
General Considerations
Let me start off by saying that what I have below is not a magic, easy solution to writing a search engine. If you are planning to write the world’s “next google”, I have a recommendation: go to http://bing.com – Microsoft’s “next google”. Notice how “copycatted” it looks. Then search around (on google, please) to find out exactly how popular it is. Hint: not very. Microsoft tried and failed. Don’t waste your time. My problem is to build an internal search engine, which only needs to deal with a small number of pages, and is low traffic so it doesn’t have to be super fast. When I told my co-working friends about the project, the responses I got varied from “maniac” to “shoot yourself now rather than afterwards – save some time”. And that’s with a highly simplified version of the problem.
On websites with traffic outside of a group of coworkers, you may find it desirable to modify a copy of the website, and then periodically upload the new version. Both the main robotics website and TMS (“Team Management System”) are developed apart from the main website, and then the drafts are periodically “pushed” onto the main site.
While this technique may not seem particularly impressive to some, some problems come up when you actually try to implement it. The most major problem is that when you push, all the links are now broken. A link to /draft/page.html needs to become /page.html when the page is pushed, and this is hard to automate. (A simple regexp is not enough: what about favicons and stylesheets?) The more minor problem that comes up is design-based, and depends on how you plan to store your pages. If you use a simple filesystem-oriented storage method, there will be no problem.
Most sites will want to avoid “bad bots” – bots that harm a site more than they help it. Frequent traits of these bots are:
1. Not pausing in between requests, causing server overload
2. Ignoring robots.txt
3. Using the ‘disallow’ lines of robots.txt to get more URLs to look at
4. Harvesting email addresses for spam lists
Luckily for the webmaster, each of these traits forces a bot to reveal itself. I’ll focus here on #3, since it is the easiest to exploit, but I’ll also explain how to modify the scripts to take care of bots that don’t happen to use the robots.txt file to get more URLs.
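One common way to exploit the third trait is a honeypot: list a path in the ‘disallow’ lines of robots.txt that no real page links to, and treat any client that requests it as a bad bot. The sketch below, in C, shows only the matching step. It is an illustration under assumptions, not the scripts this post goes on to describe: the function name is_trap_request and the example path /bot-trap/ are hypothetical.

```c
#include <string.h>

/*
 * Return 1 if request_path falls under any of the trap prefixes, else 0.
 * Assumption: trap_prefixes holds paths that appear only in the Disallow
 * lines of robots.txt and are linked from no real page, so a well-behaved
 * visitor never requests them.
 */
int is_trap_request(const char *request_path,
                    const char *const trap_prefixes[], size_t trap_count)
{
    size_t i;

    for (i = 0; i < trap_count; i++) {
        size_t len = strlen(trap_prefixes[i]);

        /* An empty prefix would match everything, so ignore it. */
        if (len > 0 && strncmp(request_path, trap_prefixes[i], len) == 0)
            return 1;
    }
    return 0;
}
```

With a trap list containing "/bot-trap/", a request for /bot-trap/index.html is flagged while /resources/web/html is not. What to do with a flagged client (log it, block its address) is left to the caller.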
For the minification scripts mentioned here, I had to write a program that would go through every file of a certain type in a directory and process (in this case, minify) that file. At the heart of this problem was the need to get a directory listing.
The obvious way to get a directory listing in C is to call system(), as in system("ls");. But this is not only lame, it does not work on all systems, and the result cannot be used very easily. The better way to get a directory listing is to use the directory functions declared in dirent.h. A complete example is shown at the bottom of this post, as usual.
The functions used in getting a directory listing are opendir and readdir (note: these functions also exist in PHP, and the code to get a directory listing is in fact nearly the same in PHP and C). opendir() can be seen as something like an fopen() for directories. Following this analogy, readdir() is like an fgetc() for directories: each call returns the next entry. To get a directory listing, we first open the directory with opendir("/dir/name/here"), and then read the name of each file in the directory by calling readdir(directory_handle) in a loop.
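The opendir/readdir loop described above can be sketched as follows. This is a minimal illustration rather than the program from the post: the names has_extension and list_files_with_extension are invented for this example, and matching is done on the file name alone.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if name ends with the extension ext (for example ".js"). */
static int has_extension(const char *name, const char *ext)
{
    size_t name_len = strlen(name);
    size_t ext_len = strlen(ext);

    return name_len > ext_len &&
           strcmp(name + name_len - ext_len, ext) == 0;
}

/*
 * Write every entry in dir_path whose name ends in ext to out, one per line.
 * Returns the number of matches, or -1 if the directory cannot be opened.
 */
int list_files_with_extension(const char *dir_path, const char *ext, FILE *out)
{
    DIR *dir = opendir(dir_path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;

    /* readdir() returns NULL once there are no more entries. */
    while ((entry = readdir(dir)) != NULL) {
        if (has_extension(entry->d_name, ext)) {
            fprintf(out, "%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

For example, list_files_with_extension("/var/www/robot/js", ".js", stdout) would print the name of each JavaScript file in that directory. A minifier would replace the fprintf call with whatever processing each file needs.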
On a fresh installation of lighttpd (which I chose instead of Apache because the server was dreadfully old and slow), I discovered that although html and other client-side files worked fine, trying to browse php and cgi files resulted in a “403 Forbidden” message. Being an Apache veteran, I checked the htaccess (there was none), made sure the permissions were properly set (world-readable), and looked through the config file to make sure I had gotten rid of all of the lines that instructed the server to return 403 on .php requests (I had had a problem with those in Apache once). Nothing. I checked the error log – nothing noteworthy.
I next checked to make sure the modules were being loaded. Well, the instruction to load the modules was right there:
server.modules = (
"mod_access",
"mod_alias",
"mod_accesslog",
"mod_compress",
"mod_fastcgi",
# "mod_rewrite",
# "mod_redirect",
# "mod_evhost",
# "mod_usertrack",
# "mod_rrdtool",
# "mod_webdav",
# "mod_expire",
# "mod_flv_streaming",
# "mod_evasive"
)
Later on in the file, I found instructions to load php via fast-cgi:
static-file.exclude-extensions = ( ".php", ".pl", ".fcgi",".cgi")
The ultimate solution was trivial. Lighttpd apparently has pretty bad error reporting – the modules were not, in fact, being loaded. I had to move the appropriate files from /etc/lighttpd/conf-available to /etc/lighttpd/conf-enabled.
Minifying CSS and JS
May 27
Many web developers want to make use of large, flashy Javascript libraries that allow fancy effects. (Most such libraries also come with large CSS files.) Team 449’s website uses Prototype (an AJAX library) and Scriptaculous (a JS effects library built on top of Prototype). While many or most viewers of the website may enjoy the experience, others, who have slower connections or are farther away or have the bad luck to view the website during peak viewing time, get frustrated with the long load time.
The solution to this is twofold. First, install gzipping on the server – I’ll discuss this in a later entry. Second, use a minification program, such as the one Yahoo provides. These programs take the JS or CSS files and remove unnecessary whitespace and comments to decrease the file size. The end result – the load time is often cut in half or better.
Yahoo’s compressor must be called from the command line and told to actually compress the file. Naturally, most people will want to automate this process, so that they don’t have to remember to call the compressor every time they install a new version of Scriptaculous or Prototype, or make a change.
One possibility is to set up a serving script that first calls the compressor and then serves the page. This can be done in the htaccess like so:
Options +FollowSymlinks
RewriteEngine on
RewriteRule ^js/(.*)$ jserve.php/$1 [L]
RewriteRule ^style/(.*)$ cserve.php/$1 [L]
This is grossly inefficient. For a medium or high-traffic website, it will completely kill your server. The better solution (and the one we used at the Blair Robot Project) is to have the compressor be called in your crontab, and then have the serving script serve the minified file. The crontab entry would then look like:
*/30 * * * * /var/www/robot/scripts/manage-roboweb.sh
The above entry runs manage-roboweb.sh every 30 minutes. In manage-roboweb.sh, we have a call to a C minification program that automatically finds every file in a specified directory (in this case /var/www/robot/js) that has a specified extension, and processes that file with a call to YUI (the Yahoo compression application).
This can be improved by having the serving script first check to see if the minified file is up-to-date, by checking the last modified times of each. You can also have the managing script do the same, so that the files are only re-minified if they were modified. I’ve placed the code we used for both serving scripts below.
cserve.php
<?php
// Return the path of the minified copy if it exists and is at least as new
// as the original; otherwise fall back to the original file.
function minify_version($fn) {
    $min_fn = str_replace(".css", ".min.css", $fn);
    if (is_file($min_fn) && is_file($fn) &&
            filemtime($min_fn) >= filemtime($fn)) {
        return $min_fn;
    }
    return $fn;
}

// Gzip the output when the client advertises support for it.
if (isset($_SERVER['HTTP_ACCEPT_ENCODING']) &&
        substr_count($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip')) {
    ob_start("ob_gzhandler");
} else {
    ob_start();
}

$offset = 3600 * 48; // 48-hour cache, in seconds
header("Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT");
header("Cache-Control: max-age=$offset, must-revalidate");
$gmdate_mod = gmdate('D, d M Y H:i:s', time()) . ' GMT';
header("Last-Modified: $gmdate_mod");
header("Pragma: public");
header("Content-Type: text/css");

// Use only the path part of the request, so a query string is not
// mistaken for part of the file name.
$url = parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH);
$fn = minify_version("/var/www/robot/style/" .
    str_replace("/style/", "", $url));
$file = file_get_contents($fn);
echo $file;
jserve.php
<?php
// Gzip the output when the client advertises support for it.
if (isset($_SERVER['HTTP_ACCEPT_ENCODING']) &&
        substr_count($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip')) {
    ob_start("ob_gzhandler");
} else {
    ob_start();
}

// Return the path of the minified copy if it exists and is at least as new
// as the original; otherwise fall back to the original file.
function minify_version($fn) {
    $min_fn = str_replace(".js", ".min.js", $fn);
    if (is_file($min_fn) && is_file($fn) &&
            filemtime($min_fn) >= filemtime($fn)) {
        return $min_fn;
    }
    return $fn;
}

$offset = 3600 * 48; // 48-hour cache, in seconds
header("Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT");
header("Cache-Control: max-age=$offset, must-revalidate");
$gmdate_mod = gmdate('D, d M Y H:i:s', time()) . ' GMT';
header("Last-Modified: $gmdate_mod");
header("Pragma: public");
header("Content-Type: text/javascript");

// Use only the path part of the request, so a query string is not
// mistaken for part of the file name.
$url = parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH);
$fn = minify_version("/var/www/robot/js/" .
    str_replace("/js/", "", $url));
$contents = file_get_contents($fn);
echo $contents;
Things to Watch Out For: SEO
May 23
Over the past few months, I’ve noticed several aspects of the website that were damaging our ranking in Google. They’ve been fixed, with the fixes ranging from standard to hideously messy. Here they are.
-
Watch out for the ‘www.’ prefix. Most of the time, a website named ‘website.ext’ can be accessed both as ‘website.ext’ and ‘www.website.ext’. Google will think they are two different websites, index your site twice, and thus halve your page rank. I also suspect (but am not sure) that Google will penalize you for content duplication (suspected plagiarism and worthless content in any case). This can be fixed with the following addition to the htaccess file:
RewriteCond %{HTTP_HOST} !^robot.mbhs.edu$ [NC]
RewriteRule ^(.*)$ http://robot.mbhs.edu/$1 [L,R=301]
-
Watch out for duplicates of individual pages. For simple sites without dedicated serving scripts, this is almost never a problem, since Apache is intelligent enough to do the necessary redirects for you. If, however, you have serving scripts, you may discover that Google has indexed both http://robot.mbhs.edu/contact and http://robot.mbhs.edu/contact/. As noted above, this will decrease your pagerank, and I suspect Google may penalize you for content duplication. Fixing this is more involved, and really depends on your serving script. I managed to fix it by adding, to my htaccess, the following:
DirectorySlash off
RewriteCond %{REQUEST_FILENAME} -d
RewriteCond %{REQUEST_URI} !alumni
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ http://robot.mbhs.edu/$1/ [R=301,L]
and then adding to my serve.php:
if (preg_match('#.+/$#', $_SERVER["REQUEST_URI"])) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://robot.mbhs.edu$betterurl");
    return;
}
I have the special line for ‘alumni’ in the htaccess because that is a directory in addition to a page served by my serving script. The ‘$betterurl’ variable is just the URL with the trailing slash removed; all those 5 lines of code do is strip the slash at the end of the URL of the served page and issue a 301 redirect.
-
Page aliases cause trouble with links. When designing my serving script, I had thought, “oh, cool, I can make it so that the user can get to a page from multiple URLs! That way, they don’t have to remember the exact URL, and we can make allowances for mistakes!”. Hahaha. Internal links went that way too, which meant that at one point, Google had listed several pages 3 and 4 times. The solution to this is actually pretty simple – just use a 301 redirect in the PHP script. This is accomplished with:
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://robot.mbhs.edu$betterurl");