Searchmarking with JRB





Obeying the Robots Exclusion Protocol 

Every once in a while I try to add a resource to this index only to discover that it has used the Robots Exclusion Protocol (REP) to discourage this action. This is par for the course, so to speak, and something I have done myself on innumerable occasions. That said, it's also highly, highly annoying.

REP is useful; in fact, it is necessary. To quote from the standard:

... there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

Fair enough. Used correctly, REP satisfies some basic requirements for content and distribution management, and for this I think it's a grand idea. The problem is that REP is rarely used with any meaningful precision. Pages or sites that make use of REP often treat all visiting agents in exactly the same fashion, making no distinctions between agents of different types or intents. Even when such distinctions are made, it is done based on the user-agent string of the visiting agent and not on any true evaluation of capability, behavior, or purpose. It's akin to saying "Since a guy named Jeff once broke a chair in my apartment I am banning all people named Jeff from making note of where I live".
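To make the name-matching point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets, the agent names (`BadBot`, `PersonalIndexer`), and the `example.com` URLs are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set that treats every visiting agent identically.
BLANKET_RULES = """\
User-agent: *
Disallow: /
"""

# Hypothetical rule set that singles out one agent by name and
# only fences off a scripts directory for everyone else.
PER_AGENT_RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://example.com/article.html"

# Blanket rules: the answer is the same no matter who is asking.
print(allowed(BLANKET_RULES, "PersonalIndexer/1.0", page))   # False
print(allowed(BLANKET_RULES, "BadBot/2.0", page))            # False

# Per-agent rules: the answer depends only on the name the agent announces.
print(allowed(PER_AGENT_RULES, "PersonalIndexer/1.0", page))  # True
print(allowed(PER_AGENT_RULES, "BadBot/2.0", page))           # False
```

Even in the more careful second rule set, the only thing being evaluated is the string the agent chooses to send. A badly behaved robot that announces a different name passes; a harmless one that happens to share the banned name does not.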

You see, my search engine does not swamp servers with rapid-fire requests, or request the same file repeatedly. Further, since the resources I have added were not discovered by blind link crawling, the engine is not traversing unsuitable servers or indexing duplicate information. The only resources it indexes are ones that are publicly available, ones that I can hit with a web browser at any moment I see fit.

The original point of this engine was to replace bookmarking. Misuse of REP imposes a direct limitation on this idea by assuming that a visit from my search engine is different from a visit from my browser. I have never encountered a browser that implements REP in its bookmarking system, and if anyone reading this has, I'd love to know about it. As far as I can tell I am free to bookmark any page I wish regardless of the values encoded in the relevant robots.txt file and/or <meta> element. I'm just not free to index it as a part of my search collection, regardless of the nature, scope, audience, and intent of that collection.

A similar paradox is that I can add a link to any e-mail, web page, or instant message I wish, presumably unless otherwise instructed by the legalese found on the page or site in question. In the absence of such instructions I feel it safe to assume I may link to whatever I please, regardless of the REP information in use. This raises the question: if I can link to a resource, cite it in communications, bookmark it, and visit it as many times as I like using a web browser, why again can I not index it with my very own personal search engine? Interesting question.

Sure, I could just ignore the REP directives altogether. After all, there are compelling arguments that the intent behind these directives is not contextually applicable to me, not to mention that my engine is acting in accordance with the Guidelines for Robot Writers, and is every bit as harmless as your typical browser. But this would be rude, and at least to some extent dishonest - call me a purist. The only safe assumption I can make is that the content provider has used REP in a particular way for a reason, albeit a reason unknown to me. To ignore it would be to willfully disrespect the desires of the content provider, and if it were my content that would really cheese me off.

The real solution lies in increased popular support for more precise content-negotiation protocols and agent/server functionality. Instead of the simple noindex, nofollow directives of REP, web content should be able to communicate meaningful terms of use information to any visiting agent or process. Part of these terms of use would describe what visiting software and agents could and could not do with the available content. Further to this, the agent must be able to announce its own capabilities and intent to the server so that the server can determine if the agent is acting in accordance with these terms of use.
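To show just how little the page-level directives can say, here is a rough sketch of how an indexer might read them, using only Python's standard `html.parser`. The class name `RobotsMetaParser`, the helper `may_index`, and the sample pages are my own inventions for illustration, not part of any standard or of my engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated tokens of any <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_index(html):
    """Return False if the page asks not to be indexed ("noindex" or "none")."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)


# Hypothetical pages for illustration.
restricted = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'
unrestricted = "<html><head><title>Hello</title></head></html>"

print(may_index(restricted))    # False
print(may_index(unrestricted))  # True
```

The whole vocabulary boils down to a yes or no on indexing and a yes or no on following links. There is nowhere in it to say who the restriction is aimed at or why, which is exactly the gap a richer terms-of-use mechanism would have to fill.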

Until then, I've still got the problem that every once in a while I try to add a resource to this index only to discover that it has used the Robots Exclusion Protocol (REP) to prevent this action. Perhaps I'll sleep on it.

