Robocops

The robots.txt protocol, also called the "robots exclusion standard," is designed to keep web spiders from accessing designated portions of a site. It is a privacy measure rather than a hard security control, the equivalent of hanging a "Keep Out" sign on your door.

Website administrators use the protocol when there are sections or files they would rather not make available to the rest of the world, such as employee lists or documents that circulate only internally. For example, the White House's site has used robots.txt to keep crawlers away from messages from the Vice President, a photo essay of the First Lady, and profiles of the 9/11 victims.

How does the protocol work? The site owner lists the files and directories that should not be scanned in a plain-text file named robots.txt and places that file in the top-level directory of the website. The protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). There is no official standards body or RFC for the protocol, so it is difficult to legislate or mandate that it be followed. In practice the file is treated as strictly advisory, not as an absolute assurance that the listed content will not be read.
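
For illustration, a minimal robots.txt might look like the sketch below. The directory names and the "BadBot" crawler are invented for this example; the essential pieces are the User-agent and Disallow lines and the file's location at the root of the site (for example, https://www.example.com/robots.txt).

    # Rules for every crawler
    User-agent: *
    Disallow: /internal/
    Disallow: /staff/

    # Rules for one named crawler: ask it to stay out of the whole site
    User-agent: BadBot
    Disallow: /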

In reality, robots.txt requires the cooperation of the web spider, and even of the reader, because anything uploaded to the web becomes publicly available. You are not locking crawlers out of these pages; you are only asking them to stay away, and it takes very little effort for them to ignore the instruction. Anyone can open the robots.txt file itself and use it as a map to the very content you wanted hidden. So the rule of thumb is: if it is that sensitive, it should not be on your site to begin with.
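
To see how thin the barrier is, here is a sketch of the check a well-behaved crawler performs before fetching a page, written with Python's standard urllib.robotparser module; a crawler that simply skips this call sees the page anyway. The example.com URL and the crawler name are placeholders.

    from urllib import robotparser

    # A polite crawler downloads robots.txt and asks it for permission.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's live robots.txt

    # The entire "barrier" is this one voluntary boolean check.
    if rp.can_fetch("ExampleBot", "https://www.example.com/internal/report.html"):
        print("Allowed to crawl")
    else:
        print("Asked to stay out")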

Care should be taken, however, to ensure that your robots.txt file does not block web robots from other areas of the website that you do want indexed. Over-blocking can greatly affect your search engine placement, because the search engines rely on their robots to count keywords, audit meta tags and titles, work through frames, and register hyperlinks.
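
A single over-broad pattern is enough to cause that kind of damage. The hypothetical file below was presumably meant to hide one directory, but the bare "Disallow: /" line asks every crawler to skip the entire site:

    User-agent: *
    # A lone "/" matches every URL on the site, not just one directory
    Disallow: /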

A single misplaced or missing character can have far-reaching consequences. Robots.txt patterns are matched by simple prefix comparison against the start of the URL path, so a pattern intended to match a directory should have a final '/' appended; otherwise every file whose name starts with that string will match, rather than just the files in the directory itself.
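
The effect of the trailing slash is easy to demonstrate with Python's standard urllib.robotparser module, which applies this same prefix matching; the rules and paths below are invented for the example.

    from urllib import robotparser

    def allowed(rules, path):
        # Parse an in-memory robots.txt and test one path against it.
        rp = robotparser.RobotFileParser()
        rp.parse(rules.splitlines())
        return rp.can_fetch("ExampleBot", "https://www.example.com" + path)

    # Without the trailing slash, the rule matches any path that merely
    # starts with "/private", including this unrelated file.
    print(allowed("User-agent: *\nDisallow: /private", "/private-notes.html"))    # False
    # With the trailing slash, only files inside /private/ are excluded.
    print(allowed("User-agent: *\nDisallow: /private/", "/private-notes.html"))   # True
    print(allowed("User-agent: *\nDisallow: /private/", "/private/report.html"))  # False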

To avoid these problems, consider running your site through a search engine spider simulator, also known as a search engine robot simulator. These simulators can be purchased or downloaded from the Internet; they use the same processes and strategies as the various search engines and give you a "dry run" of how your site will be read. They will tell you which pages are skipped, which links are ignored, and which errors are encountered. Because simulators also re-enact how robots follow your hyperlinks, you will see whether your robots.txt file is interfering with the search engine's ability to read through all of your pages.
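
If you prefer to script the check yourself, a rough sketch of the same idea is shown below: it downloads a site's live robots.txt and reports which of a hand-picked list of pages a rule-abiding crawler would skip. The domain, crawler name, and page list are placeholders.

    from urllib import robotparser

    SITE = "https://www.example.com"      # placeholder domain
    CRAWLER = "ExampleBot"                # placeholder crawler name
    PAGES = ["/", "/about.html", "/internal/report.html"]  # pages to test

    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()  # download and parse the live robots.txt

    # Report which pages a compliant crawler would visit or skip.
    for page in PAGES:
        verdict = "crawled" if rp.can_fetch(CRAWLER, SITE + page) else "skipped"
        print(page + ": " + verdict)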

It is also important to review your robots.txt file itself, which will allow you to spot potential problems and correct them before the real search engines encounter them.