I am not very familiar with the inner workings of Node.js, but as far as I know, you get 'Maximum call stack size exceeded' errors when you make too many function calls.
I'm building a spider that follows links, and I started getting these errors after a random number of crawled URLs. Node doesn't give you a stack trace when this happens, but I'm pretty sure I don't have any infinite recursion.
I am using request to fetch URLs and I was using cheerio to parse the fetched HTML and detect new links. The stack overflows always happened inside cheerio. When I swapped cheerio for htmlparser2 the errors disappeared. htmlparser2 is much, much lighter, since it just emits events on each open tag instead of parsing whole documents and constructing a tree.
My theory is that cheerio was eating up the stack, but I'm not sure whether that's even possible.
Here's a simplified version of my code (it's for reading only, it won't run):
var _ = require('underscore');
var fs = require('fs');
var urllib = require('url');
var request = require('request');
var cheerio = require('cheerio');
var mongo = "This is a global connection to mongodb.";
var maxConc = 7;
var crawler = {
    concurrent: 0,
    queue: [],
    fetched: {},

    fetch: function(url) {
        var self = this;
        self.concurrent += 1;
        self.fetched[url] = 0;
        request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } }, function(err, response, body){
            self.concurrent -= 1;
            self.fetched[url] = 1;
            self.extract(url, body);
        });
    },

    extract: function(referrer, data) {
        var self = this;
        var urls = [];
        mongo.pages.insert({ _id: referrer, html: data, time: +(new Date) });
        /**
         * THE ERROR HAPPENS HERE, AFTER A RANDOM NUMBER OF FETCHED PAGES
         **/
        cheerio.load(data)('a').each(function(){
            var href = resolve(this.attribs.href, referrer); // resolves relative urls, not important
            // Save the href only if it hasn't been fetched, it's not already in the queue and it's not already on this page
            if(href && !_.has(self.fetched, href) && !_.contains(self.queue, href) && !_.contains(urls, href))
                urls.push(href);
        });

        // Check the database to see if we already visited some urls.
        mongo.pages.find({ _id: { $in: urls } }, { _id: 1 }).toArray(function(err, results){
            if(err) results = [];
            else results = _.pluck(results, '_id');
            urls = urls.filter(function(url){ return !_.contains(results, url); });
            self.push(urls);
        });
    },

    push: function(urls) {
        Array.prototype.push.apply(this.queue, urls);
        var url, self = this;
        while((url = self.queue.shift()) && this.concurrent < maxConc) {
            self.fetch(url);
        }
    }
};
crawler.fetch( 'http://some.test.url.com/' );
Looks like you've got some recursion going on there. Recursive function calls will eventually exceed the maximum call stack size, since the stack is where the frames of all pending calls are kept.
So here is how it happens:
fetch -> request callback -> extract -> mongo callback -> push -> fetch -> ...
This cycle seems to repeat until you run out of stack.
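As a rough illustration (a hypothetical sketch, not the code above): when each callback in such a chain fires synchronously, no frame is ever popped, and the stack keeps growing until it overflows.

// hypothetical sketch: the callback is invoked synchronously, so no caller ever returns
function doWorkSync(n, callback) {
    callback(n * 2);
}
function step(n) {
    doWorkSync(n, function (result) {
        step(n + 1); // like fetch -> extract -> push -> fetch -> ...
    });
}
step(0); // eventually throws RangeError: Maximum call stack size exceeded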
In your case the stack is already running very low by the time you call cheerio.load, which is why it runs out right then and there.
Although you most likely want to examine whether that is a bug or something you intended, the way to get the same effect in Node.js without using straight recursion is process.nextTick(functionToCall).
It lets the enclosing function return, which pops its frame off the stack, but still calls functionToCall on the next tick.
You can try it in the node REPL:
process.nextTick(function () { console.log('hello'); })
will print 'hello' immediately.
It's similar to setTimeout(functionToCall, 0), but is to be preferred over it.
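A quick way to see the difference in ordering (illustrative snippet only):

process.nextTick(function () { console.log('nextTick'); });
setTimeout(function () { console.log('setTimeout'); }, 0);
console.log('sync');
// prints: sync, nextTick, setTimeout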
Relating to your code, you could replace self.fetch(url) with process.nextTick(function () { self.fetch(url); }) and you should no longer run out of stack.
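Applied to the push method above, that would look roughly like this (a sketch of the suggested change, not tested against the full crawler):

push: function(urls) {
    Array.prototype.push.apply(this.queue, urls);
    var url, self = this;
    while((url = self.queue.shift()) && this.concurrent < maxConc) {
        (function(u){
            // defer the recursive call so the current stack can unwind first
            process.nextTick(function () { self.fetch(u); });
        })(url);
    }
}

The extra closure is only there to capture the current url for the deferred call. Note also that since fetch now runs on the next tick, self.concurrent isn't incremented until then, so the loop may schedule a few more fetches than maxConc before the counter catches up.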
That being said, as mentioned above, it is more likely that there is a bug in your code, so look into that first.
You are decrementing self.concurrent too early; you should decrement it inside the extract function, after all the async work is done. That's one gotcha that sticks out, though I'm not sure it would solve the problem on its own.
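Concretely, that adjustment would look roughly like this (a sketch, not tested):

fetch: function(url) {
    var self = this;
    self.concurrent += 1;
    self.fetched[url] = 0;
    request.get(url, { timeout: 10000, pool: { maxSockets: maxConc } }, function(err, response, body){
        self.fetched[url] = 1;
        self.extract(url, body); // decrement moved out of here
    });
},

and at the end of extract's toArray callback:

urls = urls.filter(function(url){ return !_.contains(results, url); });
self.concurrent -= 1; // all async work for this page is done
self.push(urls);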