Monday, April 19, 2010

Blocking Bots and Crawlers without Blocking Search Engines

If you run a high-traffic website, it quickly becomes apparent that there are people out there trying to download your entire site (and there are many more of them than you would think). The question is: how do you block users with ill intent without blocking the major search engine bots (e.g., Yahoo, Google, and Bing)?

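The Google-recommended way is to do a reverse DNS lookup on the visitor's IP address and check that the resulting host name belongs to the search engine's domain. Then double-check that host name with a normal (forward) DNS lookup to confirm it resolves back to the same IP address. The explanation is found here. Note that this check will not work alone: if you simply block every visitor that fails it, all regular visitors will be blocked. You must also count the number of visits from each unique IP address and use that count together with the DNS check, for example by blocking only addresses that both make too many requests and fail the check.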
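As a rough illustration of the reverse-then-forward check described above, here is a minimal Python sketch. It is not the PHP solution referred to in this post. The function names and the `TRUSTED_SUFFIXES` list are assumptions made for the example; confirm the host name suffixes against each search engine's own documentation before relying on them. The two lookup functions are passed in as parameters only so that the logic can be exercised without touching the network.

```python
import socket

# Assumed allowlist of crawler host name suffixes (illustrative only; verify
# against each search engine's documentation). The leading dot prevents a
# host such as "evilgooglebot.com" from matching.
TRUSTED_SUFFIXES = (
    ".googlebot.com",
    ".google.com",
    ".search.msn.com",
    ".crawl.yahoo.net",
)


def _reverse_lookup(ip):
    """Return the PTR host name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_lookup(hostname):
    """Return the IPv4 addresses for hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, reverse=_reverse_lookup, forward=_forward_lookup,
                        suffixes=TRUSTED_SUFFIXES):
    """Return True only if ip passes both the reverse and forward checks."""
    host = reverse(ip)
    if not host:
        return False  # no PTR record at all
    host = host.rstrip(".").lower()
    if not host.endswith(suffixes):
        return False  # PTR record is not in a trusted domain
    # Anyone can publish a fake PTR record for their own IP range, so the
    # forward lookup must map the host name back to the same address.
    return ip in forward(host)
```

The forward lookup is what makes the check meaningful: the owner of an IP range controls its reverse DNS, but cannot make a host name under the search engine's domain resolve to their address.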

How is this done in code? Check out this great PHP solution. (I am not sure whether Bing still uses the old MSN search bot domain name.)
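For the visit-counting half of the approach, the following Python sketch shows one way it might be wired together. It is an assumption-laden illustration, not the linked PHP code: the `VisitLimiter` class, its default thresholds, and the `is_crawler` callback (which would perform the reverse and forward DNS check) are all hypothetical names chosen for this example.

```python
import time
from collections import defaultdict, deque


class VisitLimiter:
    """Count recent hits per IP; over the limit, only verified crawlers pass."""

    def __init__(self, max_hits=100, window_seconds=60.0,
                 is_crawler=lambda ip: False, clock=time.monotonic):
        self.max_hits = max_hits              # hits allowed per window (assumed value)
        self.window_seconds = window_seconds  # sliding window length (assumed value)
        self._is_crawler = is_crawler         # hypothetical DNS verification callback
        self._clock = clock                   # injectable so the logic can be tested
        self._hits = defaultdict(deque)       # ip -> timestamps of recent hits

    def allow(self, ip):
        """Record one hit from ip and return True if it should be served."""
        now = self._clock()
        hits = self._hits[ip]
        # Discard timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) <= self.max_hits:
            return True  # ordinary visitors never trigger a DNS lookup
        # Over the limit: serve the request only if the DNS check passes.
        return self._is_crawler(ip)
```

Running the DNS check only once an address exceeds the limit keeps regular visitors unblocked and avoids two DNS lookups on every request. A production version would also need to cache verification results and evict idle addresses from memory.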

There is one issue with an implementation like this: you end up blocking every bot and search engine crawler that is not on your list. In other words, you are helping ensure that no other search engine can index your site. While the major search engines can still compete on the same level (at least as far as your site is concerned), the approach is a bit anti-entrepreneurial and anti-competitive because it locks out small search engines and start-ups.
