Blocking Bad Robots


Few things are more frustrating than bad and nefarious actors out on the interwebs. I am sick of being hammered by crawler bots that purposely do not respect the robots.txt file. The robots.txt file is supposed to be fetched by a web crawler before it starts; if the crawler finds itself listed, it is supposed to stop. But I guess profits, spying, and intelligence gathering trump honesty. Well, at least these bad web crawlers send something called a user agent string, which you can spot in your server logs. Fortunately, blocking these idiots is not terribly difficult with a little magic. I think the hardest part is finding a good and accurate explanation.
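
For the record, here is the kind of robots.txt entry these crawlers are supposed to honor; BadBot is just a made-up name for illustration:

User-agent: BadBot
Disallow: /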

Fortunately, I have found something that really works, courtesy of Perishable Press. Perishable Press goes into great detail about how to block based on various criteria, but I found the user agent string to be the most effective way of stopping the bad actors from crawling my website. Your mileage will vary, but here it is in a nutshell. I have made some changes, including adding '.*' after a portion of each bot name. In regex, the '.' matches any single character and the '*' means zero or more of them, so '.*' matches anything that comes after the initial match. You have to create an .htaccess file in the root directory of your website. Here's an example taken right from mine.

# Turn on mod_rewrite for this directory
RewriteEngine On
RewriteBase /
# Block any request whose user agent contains one of these strings ([NC] = case-insensitive)
RewriteCond %{HTTP_USER_AGENT} "yak.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "paperli.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "baidu.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "sogou.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "yandex.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "searchatlas.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "masscan.*" [NC]
# Answer any matching request with 403 Forbidden
RewriteRule (.*) - [F,L]
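
If you want to verify the block works, send your site a request with a matching user agent; curl's -A flag sets the user agent string, and yourdomain.example below is a placeholder for your own site:

curl -A "yak" -I https://yourdomain.example/

A blocked request should come back with a 403 Forbidden status.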

That seems to get rid of these guys that just hammer me relentlessly. I will have to check my logs periodically for other ones that pop up. I know that SemrushBot used to be a bad one, but it's behaving okay now, so I won't block it. I will add more to this post if I find a need to block by referrer (a rough sketch is below). However, the user agent string seems to be working well. I hope a few others can make use of this!
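
In case it helps, a referrer-based block would follow the same pattern, just keying on HTTP_REFERER instead of the user agent; spammy-referrer.example is a placeholder, not a real offender I have seen:

RewriteEngine On
# Block requests claiming to come from a known spammy referrer
RewriteCond %{HTTP_REFERER} "spammy-referrer\.example" [NC]
RewriteRule (.*) - [F,L]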

