I am scraping a bunch of data from a GET URL API in Node.js. I'm looping over the months of the year × a number of cities. I have a scrapeChunk() function that I call once for each combination of parameters, i.e. {startDate: ..., endDate: ..., location: ...}. Inside, I parse a table with jsdom, convert it to CSV, and append the CSV to a file. From the innermost of the nested asynchronous callbacks, I finally call scrapeChunk again with the next parameter set.
It all works, but the node instance grows and grows in RAM until I get a "FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory" error.
My Question: Am I doing something wrong or is this a limitation of JavaScript and/or the libraries I'm using? Can I somehow get each task to complete, FREE its memory, and then start the next task? I tried a sequence from FuturesJS and it seems to suffer from the same leak.
What is probably happening is that you're building a very deep call tree, and the upper levels of that tree keep references to their data, so the garbage collector never reclaims it.
One thing you can do in your own code is invoke the final callback via process.nextTick() rather than calling it directly. That way the calling function can return and release its variables. Also, make sure you're not piling all your data into a global structure that keeps those references alive forever.
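A minimal sketch of what that looks like, assuming your scrapeChunk takes a params object and a list of remaining parameter sets; fetchAndParse and the file name are placeholders, not your actual code:

var fs = require('fs');

function scrapeChunk(params, remaining) {
  // fetchAndParse stands in for your GET + jsdom + CSV conversion logic
  fetchAndParse(params, function (err, csv) {
    if (err) throw err;
    fs.appendFile('out.csv', csv, function (err) {
      if (err) throw err;
      if (remaining.length === 0) return;
      // Defer the next call so the current stack frames can unwind
      // and their local variables become collectable.
      process.nextTick(function () {
        scrapeChunk(remaining[0], remaining.slice(1));
      });
    });
  });
}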
Without seeing the code, it's a bit tricky to come up with good responses. But this is not a limitation of node.js or its approach (there are lots of long-running and complex applications out there that use it), but how you make use of it.
You may want to try cheerio instead of jsdom. The author claims it is leaner and about 8× faster.
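For reference, a minimal sketch of turning a table into CSV with cheerio; the selectors and function name are illustrative, not taken from your code:

var cheerio = require('cheerio');

function tableToCsv(html) {
  var $ = cheerio.load(html);
  var rows = [];
  $('table tr').each(function () {
    // Collect the text of each cell in this row
    var cells = $(this).find('td, th').map(function () {
      return $(this).text().trim();
    }).get();
    rows.push(cells.join(','));
  });
  return rows.join('\n');
}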
Assuming your description is correct, I think the cause of the problem is obvious - the recursive call to scrapeChunk(). Dispatch the tasks using a loop (or look into node's stream facilities), and ensure that they actually return.
What's going on here sounds something like this:
var list = [1, 2, 3, 4, ... ];
function scrapeChunk(index) {
  // allocate variables, do work, etc., etc.
  scrapeChunk(index + 1);
}
With a long enough list, you are guaranteed to exhaust memory, stack depth, or some other resource, depending on what you do in the function body. What I'd suggest is something like this:
var list = [1, 2, 3, 4, ... ];
list.forEach(function scrapeChunk(item, index) {
  // allocate variables, do work, etc., etc.
  return;
});
Frustratingly nested callbacks are an orthogonal problem, but I would suggest you take a look at the async library (in particular async/waterfall), which is both popular and useful for this class of task.
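As a rough sketch of that approach, here is async.eachSeries (from the same library) running one scrape at a time over the parameter combinations; cities, months, and scrapeOne are placeholders for your own data and logic:

var async = require('async');

var jobs = [];
cities.forEach(function (location) {
  months.forEach(function (m) {
    jobs.push({ startDate: m.start, endDate: m.end, location: location });
  });
});

async.eachSeries(jobs, function (params, done) {
  // scrapeOne stands in for your fetch/jsdom/CSV/append logic;
  // it must call done() (or done(err)) when it finishes, so only
  // one chunk is in flight at a time and nothing piles up.
  scrapeOne(params, done);
}, function (err) {
  if (err) console.error('Scrape failed:', err);
  else console.log('All chunks done');
});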