Automatic Handling for robots.txt

Have you, or someone on your team, ever accidentally launched the “don’t spider me” robots.txt file from the development environment into production and not noticed until your search engine rankings nose-dived? The former has happened to me, but luckily not the latter. When the file went live, our SEO guru noticed rather quickly because his background programs stopped working on that site as a result of the accidental change. After a scare like that, I decided that on the new server I was setting up, I would keep two master robots.txt files, one for production and one for development, and have Apache figure out which one to serve based on the request. Just like with my dynamic virtual hosting, I accomplish this through Apache’s Rewrite Module piped through a PHP script.

First, the PHP script, which I named robots.php:

#!/usr/bin/php
<?php
// RewriteMap "prg" script: Apache writes one hostname per line to stdin,
// and we answer each line with the path of the robots.txt file to serve.
$devDomain = "example"; // This is the domain name for the development environment
set_time_limit(0);
$input = fopen('php://stdin', 'r');
$output = fopen('php://stdout', 'w');
while (1) {
    $line = fgets($input);
    if ($line === false) break; // Apache closed the pipe, so exit cleanly
    $original = strtolower(trim($line));
    // Drop a leading "www." and reverse the labels so the TLD sits at
    // index 0 and the domain next to it at index 1.
    $request = array_reverse(explode('.', preg_replace("/^www\./", "", $original)));
    if (isset($request[1]) && $request[1] == $devDomain) {
        fwrite($output, "/path/to/robots.hidden.txt\n");
    } else {
        require "regconnect.php"; // Re-included each pass to reopen the connection
        $r = mysql_query("
            SELECT
                `spiders`
            FROM
                `apache_vhosts`
            WHERE
                `host`='" . mysql_real_escape_string(implode(".", array_reverse($request))) . "'
            LIMIT 1;");
        if ($r && ($robot = mysql_fetch_array($r))) {
            if (!$robot[0]) fwrite($output, "/path/to/robots.hidden.txt\n");
            else fwrite($output, "/path/to/robots.visible.txt\n");
        } else {
            // No row found (or the query failed): default to the visible file
            fwrite($output, "/path/to/robots.visible.txt\n");
        }
        mysql_close();
    }
    fflush($output); // RewriteMap programs must answer one line per lookup, unbuffered
}
?>
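
One piece not shown here is regconnect.php, the same include the dynamic virtual hosting setup uses to open the database connection. A minimal sketch of what it needs to do, with placeholder credentials rather than the real ones, would be:

<?php
// regconnect.php - minimal sketch; the host, user, password, and
// database name are placeholders, not the originals.
mysql_connect('localhost', 'vhost_reader', 'secret');
mysql_select_db('vhosts');
?>

If the connection ever fails, mysql_query() simply returns false and the lookup falls through to robots.visible.txt, so a database hiccup can’t accidentally hide a live site.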

So with this script, it first checks the domain name closest to the TLD (e.g. for subdomain.example.com, it extracts example). If that matches the domain name used in the development environment, it serves up the “don’t spider me” file, robots.hidden.txt. Otherwise, it queries the same vhosts table I used for the dynamic virtual hosting, but this time leverages a new column, “spiders.” This is a simple bit/bool telling whether or not the site should be spidered. The column defaults to 1/true, and since nearly all of the live sites have it set that way, any case where the query fails or finds no row falls back to serving robots.visible.txt.
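
The two master files themselves are nothing special. The post doesn’t show their contents, but a typical pair would look like this:

# robots.hidden.txt: keep every well-behaved spider out
User-agent: *
Disallow: /

# robots.visible.txt: an empty Disallow permits everything
User-agent: *
Disallow: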

Now comes the part that’s always harder to read because of the regular expressions involved: the Apache rewrite. This one is a lot easier than the dynamic virtual hosts, and if you are doing both at the same time (as I was), you can just add it to the same <IfModule> block and leave out the duplicate lines.

<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteLock /tmp/httpd.lock
    RewriteMap robots prg:/path/to/robots.php
    RewriteCond %{REQUEST_URI} ^/robots\.txt$
    RewriteRule ^/(.*)$ ${robots:%{HTTP_HOST}} [L]
</IfModule>

Real simple one here: if the request is for robots.txt, pipe the HTTP_HOST to the PHP script, and it outputs back to Apache the path to the proper robots.txt file for that site. After that, it was as simple as maintaining the two master robots.txt files, with no need to worry about the designers accidentally launching the development robots.txt to production.
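
Since the map program just reads hostnames on stdin and prints one path per line on stdout, you can sanity-check it from a shell before wiring it into Apache. With the placeholder development domain of example from the script above, a development hostname should come back with the hidden file:

echo "www.subdomain.example.com" | php robots.php
/path/to/robots.hidden.txt

Any other hostname goes through the database lookup instead, so what it prints depends on that host’s spiders column.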
