What goes through your mind when you read about the silly lawsuits against Google for accessing portions of websites? What do you think when you visit the Internet Wayback Machine and find hundreds of pages of your site, almost fully intact? Most of you wonder what is going on in the minds of these clueless people. Don't they understand how the web works?
That’s right, folks. In case you’re not in the know, the web works in a certain way. A brand new site generally does not get indexed in search engines for a period of months. Over time, the search spiders find your site and your interlinked pages begin getting crawled. Eventually, someone will search for something and your website will hopefully come up.
Not everyone is happy with these search results, and oddly, some people just don't want to be found. In fact, on relatively large websites, the spiders crawl so many pages at once that site owners have written some truly silly robots.txt files in response. For example, check the hilton.com robots.txt file (which I learned about during the robots.txt Summit at Search Engine Strategies last month). Note its first two lines.
# Daytime instructions for search engines
# Do not visit Hilton.com during the day!
Dear hilton.com webmaster: search engine spiders have no understanding of English. Look at the picture that accompanies this blog post. A spider runs through a site, grabbing pieces of data to spin into a web (the World Wide Web), and then it moves on. It does not pay attention to your personalized messages (though on behalf of the spider, thanks for the attention). Spiders do not understand anything. Spiders are robots.
If you want more information about robots.txt, you can visit one of the premier sites for the Robots Exclusion Standard. It's important to understand how the standard works before you rely on it. If you have personal information that you don't want the search engines to find, you can block it off.
Simply use the following:
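User-agent: *
Disallow: /mysecretdirectory/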
This code blocks all search engines from accessing content in "mysecretdirectory." This is also helpful if you have concerns about duplicate content. The typical example is that you have printer-friendly versions of pages on your site, but you don't want Google to penalize you for serving the same content at two URLs. You could create a robots.txt file with the following code:
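User-agent: *
Disallow: /printer-friendly/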
You'd obviously replace /printer-friendly with the directory where your printer-friendly documents actually reside.
There are additional applications of robots.txt. Some search engines, such as Google, will now let you specify the path of your sitemap in the robots.txt file, like so:
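Sitemap: http://www.example.com/sitemap.xml

Naturally, you'd substitute the actual URL of your own sitemap.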
You can also be selective and block off certain search engines, including the Internet Wayback Machine, as discussed earlier. This way, old versions of your site will no longer be accessible.
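The Wayback Machine's crawler has traditionally identified itself as ia_archiver, so a rule like this should do the trick:

User-agent: ia_archiver
Disallow: /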
There are a variety of other bots out there that crawl your site on a regular basis. It's not just about Google, MSN, Yahoo, or Ask. You can get an idea of what works and what doesn't by experimenting. Note that if you block content that has already been indexed, it takes time for those pages to drop out of the search results.
Fortunately, as I mentioned in my post about the Google Webmaster Central tool, you can change the crawl rate of the Google spider. Unfortunately, you can't get any more specific and invite the spider only during late-night hours. But you can set the spider to crawl your pages at a slower rate so that it doesn't negatively impact your web server or website performance.
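If you'd rather handle throttling in robots.txt itself, Yahoo's Slurp and MSN's crawler honor a non-standard Crawl-delay directive (Google ignores it), which sets the number of seconds to wait between requests:

User-agent: Slurp
Crawl-delay: 10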
Before you get mad at the search engines and start frivolous lawsuits, realize that it is also your responsibility to prevent the search engines from accessing your web pages, if that's your desire.