Maximizing urls / second in a parallel scraper

I have to scrape thousands of different websites, as fast as possible. On a single Node process I was able to fetch 10 URLs per second, but if I fork the task out to 10 worker processes, I can reach 64 reqs/sec.

Why is that? Why am I limited to 10 reqs/sec on a single process, and why do I have to spawn workers to reach 64 reqs/sec?

  • I am not reaching the max sockets per host (agent.maxSockets) limit: all URLs are from unique hosts (see the short agent sketch after this list).
  • I am not reaching the max file descriptors limit (AFAIK): my ulimit -n is 2560, and lsof shows that my scraper never uses more than 20 file descriptors.
  • I've increased settings for kern.maxfiles, kern.maxfilesperproc, kern.ipc.somaxconn, and kern.ipc.maxsockets in sysctl.conf, and rebooted. No effect.
  • Tried increasing ulimit -n. No change.
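
For context, this is the kind of agent setting I'm referring to, in simplified form (an illustration, not my actual scraper code):

```js
// Illustration of the agent.maxSockets limit mentioned above. Since every
// URL is on a unique host, the per-host connection pool should never fill
// up regardless of this value.
const http = require('http');
const https = require('https');

http.globalAgent.maxSockets = 1000;
https.globalAgent.maxSockets = 1000;
```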

Is there any limit I don't know about? I am on Mac OS X.

I don't think there is a hard limit of 10 requests per second; that just seems to be the highest speed at which node.js is able to crawl on a single process. The basics of crawling are as follows (a rough code sketch follows the list):

  1. Request HTML page.
  2. Parse HTML page.
  3. Execute JavaScript.
  4. Do some post-processing.
  5. Load the next URL candidate.
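
In node.js terms, that loop looks roughly like the sketch below; fetchPage and processPage are stand-ins for steps 1 and 2-4, not code taken from your scraper:

```js
const https = require('https');

// Step 1: request the HTML page.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Steps 2-4: parse the HTML, execute JavaScript, do post-processing.
function processPage(html) {
  return html;
}

// Step 5: load the next URL candidate, then repeat.
async function crawl(urls) {
  for (const url of urls) {
    const html = await fetchPage(url);
    processPage(html);
  }
}

crawl(['https://example.com/']);
```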

At 10 requests per second, you are executing the above steps 10 times in one second, which means each full pass through steps 1-5 takes roughly 100 ms. The fastest your crawler can crawl on a single process (thread) is the speed of your bandwidth connection, and that is only if you do nothing but step 1. If you're also doing steps 2 through 5, your crawl speed will be lower than your bandwidth allows, because you're doing other work in between web requests.

In order to maximize speed, you have to keep step 1 running constantly until you saturate your bandwidth connection, and the way to do that is to add more processes (threads). As a very simple example, summarize step 1 as Fetching and steps 2 through 5 as Processing. If you have 2 processes working concurrently, one can be fetching while the other is processing, which theoretically maximizes your throughput. In reality (as you have found out) you will need more than two processes, because the processing part has multiple steps.
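
As a sketch of that idea using Node's built-in cluster module (the worker count, the URL source, and the empty processing step are assumptions, not your code):

```js
const cluster = require('cluster');
const https = require('https');

const NUM_WORKERS = 10;
// Hypothetical source of the URL list; load yours however you already do.
const urls = require('./urls.json');

if (cluster.isMaster) {
  // Fork workers so that while some are busy processing, others keep fetching.
  for (let i = 0; i < NUM_WORKERS; i++) {
    cluster.fork({ WORKER_INDEX: i });
  }
} else {
  const index = Number(process.env.WORKER_INDEX);
  const slice = urls.filter((_, i) => i % NUM_WORKERS === index);

  (async () => {
    for (const url of slice) {
      // Fetching (step 1).
      const html = await new Promise((resolve, reject) => {
        https.get(url, (res) => {
          let body = '';
          res.on('data', (chunk) => { body += chunk; });
          res.on('end', () => resolve(body));
        }).on('error', reject);
      });
      // Processing (steps 2-4) would go here, using `html`.
    }
    process.exit(0);
  })();
}
```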

If you assume the average web page is about 128 KB, your bandwidth usage is going to be roughly 10 Mbps when you're making 10 requests per second. At 64 requests per second, you would need at least 64 Mbps. So is your bandwidth connection really 64 Mbps?
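
A quick back-of-the-envelope check of those numbers (the 128 KB average page size is the assumption above):

```js
// Convert requests per second into an approximate line rate in Mbps:
// pages/sec * bytes/page * 8 bits, divided by 1e6.
const avgPageKB = 128;
const mbps = (reqsPerSec) => (reqsPerSec * avgPageKB * 1024 * 8) / 1e6;

console.log(mbps(10).toFixed(1)); // ~10.5 Mbps at 10 req/s
console.log(mbps(64).toFixed(1)); // ~67.1 Mbps at 64 req/s
```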