Imagine there are scrapers crawling my website. How can I ban them while still whitelisting Googlebot?
I think I can find the IP ranges of Google's bots, and I am thinking of using Redis to store all of the day's accesses; if I see too many requests from the same IP in a short time, I ban it.
My stack is Ubuntu Server, Node.js and Express.js.
The main problem I see is that this detection sits behind Varnish, so the Varnish cache would have to be disabled for it to see every request. Any better ideas, or good thoughts?
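For what it's worth, the Redis-counter idea could look roughly like the sketch below, written as Express middleware for this stack. The ioredis client, the hits: key prefix and the 100-requests-per-minute threshold are assumptions for illustration; trust proxy makes req.ip come from the X-Forwarded-For header that Varnish adds.

import express from "express";
import Redis from "ioredis";

const app = express();
app.set("trust proxy", true); // behind Varnish, take the client IP from X-Forwarded-For
const redis = new Redis();    // assumes a Redis instance on localhost:6379

// Hypothetical threshold: more than 100 requests per minute from one IP gets rejected.
const WINDOW_SECONDS = 60;
const MAX_REQUESTS = 100;

app.use(async (req, res, next) => {
  try {
    const key = `hits:${req.ip}`;
    const hits = await redis.incr(key);        // count this request
    if (hits === 1) {
      await redis.expire(key, WINDOW_SECONDS); // first hit starts a fresh one-minute window
    }
    if (hits > MAX_REQUESTS) {
      res.status(429).send("Too Many Requests");
      return;
    }
    next();
  } catch (err) {
    next(err); // if Redis is unreachable, pass the error on (fail open or closed as you prefer)
  }
});

app.get("/", (_req, res) => {
  res.send("ok");
});

app.listen(3000);

This fixed-window INCR + EXPIRE pattern is coarse (a burst straddling two windows gets through), but it is the simplest starting point; a sliding window or token bucket is smoother if you need it.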
You could stop the crawler using robots.txt:
User-agent: BadCrawler
Disallow: /
This solution only works if the crawler follows the robots.txt specification.
You can use a Varnish ACL [1]; it may be a bit harder to maintain than doing it in the application, but it will surely work:
acl bad_boys {
  "192.0.2.0"/24;   // your evil range (placeholder, replace with the real range)
  "198.51.100.66";  // another evil IP (placeholder)
}

// ...

sub vcl_recv {
  if (client.ip ~ bad_boys) {
    error 403 "Forbidden";
  }
  // ...
}

// ...
You can also whitelist: use the User-Agent header or other techniques to make sure you are not blocking Googlebot, but I would defend myself in Varnish rather than in the application.
[1] https://www.varnish-cache.org/docs/3.0/reference/vcl.html#acls
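A note on the whitelisting part: the User-Agent header alone is trivial to fake, so a common companion check is the reverse-then-forward DNS confirmation that Google describes for verifying Googlebot. Below is a sketch in Node.js/TypeScript for the asker's stack; the function name and its use are illustrative, not part of the answer above.

import { promises as dns } from "node:dns";

// Verify a claimed Googlebot: reverse-resolve the IP, require a hostname under
// googlebot.com or google.com, then forward-resolve that hostname and confirm
// it maps back to the same IP.
export async function isRealGooglebot(ip: string): Promise<boolean> {
  try {
    const hostnames = await dns.reverse(ip);
    const host = hostnames.find(
      (h) => h.endsWith(".googlebot.com") || h.endsWith(".google.com")
    );
    if (!host) return false;

    const { address } = await dns.lookup(host); // forward-confirm the hostname
    return address === ip;
  } catch {
    return false; // any DNS failure means the claim cannot be confirmed
  }
}

DNS round trips are slow, so in practice you would cache the verdict per IP (the same Redis instance works fine for that) instead of resolving on every request.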