Imagine there are scrapers crawling my website. How can I ban them while still whitelisting Googlebot?
I think I can find the IP ranges of Google's bots, and I am thinking of using Redis to store all of the day's accesses; if I see too many requests from the same IP in a short time, I ban it.
My stack is Ubuntu Server, Node.js and Express.js.
The main problem I see is that this detection sits behind Varnish, so the Varnish cache would have to be disabled for it to see every request. Any better ideas, or good thoughts?
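For what it's worth, the Redis-counter idea could look roughly like the sketch below, written as Express middleware for this stack. The ioredis client, the hits: key prefix and the 100-requests-per-minute threshold are assumptions for illustration; trust proxy makes req.ip come from the X-Forwarded-For header that Varnish adds.

import express from "express";
import Redis from "ioredis";

const app = express();
app.set("trust proxy", true); // behind Varnish, take the client IP from X-Forwarded-For
const redis = new Redis();    // assumes a Redis instance on localhost:6379

// Hypothetical threshold: more than 100 requests per minute from one IP gets rejected.
const WINDOW_SECONDS = 60;
const MAX_REQUESTS = 100;

app.use(async (req, res, next) => {
  try {
    const key = `hits:${req.ip}`;
    const hits = await redis.incr(key);        // count this request
    if (hits === 1) {
      await redis.expire(key, WINDOW_SECONDS); // first hit starts a fresh one-minute window
    }
    if (hits > MAX_REQUESTS) {
      res.status(429).send("Too Many Requests");
      return;
    }
    next();
  } catch (err) {
    next(err); // if Redis is unreachable, pass the error on (fail open or closed as you prefer)
  }
});

app.get("/", (_req, res) => {
  res.send("ok");
});

app.listen(3000);

This fixed-window INCR + EXPIRE pattern is coarse (a burst straddling two windows gets through), but it is the simplest starting point; a sliding window or token bucket is smoother if you need it.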
You could stop the crawler using robots.txt:
User-agent: BadCrawler
Disallow: /
This solution only works if the crawler follows the robots.txt specification.
You can use a Varnish ACL [1]; it may be a bit harder to maintain than doing it in the application, but it will surely work:
acl bad_boys {
  "192.0.2.0"/24;   // your evil range (placeholder, replace with the real range)
  "198.51.100.66";  // another evil IP (placeholder)
}

// ...

sub vcl_recv {
  if (client.ip ~ bad_boys) {
    error 403 "Forbidden";
  }
  // ...
}

// ...
You can also whitelist: use the User-Agent header or other techniques to make sure you are not blocking Googlebot, but I would defend myself in Varnish rather than in the application.
[1] https://www.varnish-cache.org/docs/3.0/reference/vcl.html#acls
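A note on the whitelisting part: the User-Agent header alone is trivial to fake, so a common companion check is the reverse-then-forward DNS confirmation that Google describes for verifying Googlebot. Below is a sketch in Node.js/TypeScript for the asker's stack; the function name and its use are illustrative, not part of the answer above.

import { promises as dns } from "node:dns";

// Verify a claimed Googlebot: reverse-resolve the IP, require a hostname under
// googlebot.com or google.com, then forward-resolve that hostname and confirm
// it maps back to the same IP.
export async function isRealGooglebot(ip: string): Promise<boolean> {
  try {
    const hostnames = await dns.reverse(ip);
    const host = hostnames.find(
      (h) => h.endsWith(".googlebot.com") || h.endsWith(".google.com")
    );
    if (!host) return false;

    const { address } = await dns.lookup(host); // forward-confirm the hostname
    return address === ip;
  } catch {
    return false; // any DNS failure means the claim cannot be confirmed
  }
}

DNS round trips are slow, so in practice you would cache the verdict per IP (the same Redis instance works fine for that) instead of resolving on every request.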