memory leak in node regex parser?

The following code causes Node to consume a lot of RAM and crash when it runs out of memory. However, if I change the length of the found string from 13 to 12, everything is fine. It looks as if strings returned by a regex search hold a hidden reference to the original string that was searched, but only if the matched string is at least 13 characters long. Is this a bug, or is there a good reason for this behavior?

function randString(length) {
  var a = "a".charCodeAt(0),
      result = [];
  for(var i = 0; i < length; i++) {
    result.push(a + Math.floor(Math.random() * 26));
  }
  return String.fromCharCode.apply(null, result);
}


var arr = [];

for(var i = 0; i < 1000000; i++) {
  if(i % 1000 === 0) console.log(i);
  var str = randString(13);
  str = randString(5000) + "<" + str + ">" + randString(5000);
  var re = /<([a-z]+)>/gm;
  var next = re.exec(str);
  arr.push(next[1]);
}

I observe the same behavior in Chrome. I think the two (Node.js and Chrome) behave the same because they are based on the same JavaScript engine (V8).

There is no memory leak, but there is a problem with garbage management in JavaScript. I deduce this from the observation that the gigabytes of memory are freed when I force garbage collection in Chrome Dev Tools.

You could force the garbage collector to run, as explained here. That way, your Node.js process will not crash.
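A sketch of what that looks like (assuming Node is launched with the V8 flag `--expose-gc`, which exposes a global `gc()` function; without the flag the guard simply does nothing):

```javascript
// Run with: node --expose-gc script.js
function maybeGc() {
  // global.gc only exists when node was started with --expose-gc
  if (typeof global.gc === "function") {
    global.gc(); // force a full garbage collection
  }
}

var arr = [];
for (var i = 0; i < 100000; i++) {
  arr.push(String(i));
  if (i % 1000 === 0) maybeGc(); // periodically reclaim garbage
}
```

Forcing GC this way only papers over the retention problem; the retained parent strings are still collected late, but at least before the process runs out of memory.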

Edit

Testing further I can tell these things:

About your comment "But as long as there is still a reference to the array no memory gets freed":

It looks more complicated than that, but you are right: arr seems to occupy all that space. 1.1 GB for 100,000 items is about 10 kB per item. When you look at the match array next, it indeed has a size of roughly 10 kB (10015 bytes for next.input). If everything worked as expected, next[1] would be a simple string using only slightly more than its 13 data bytes, but this is not the case: keeping next[1] in the array arr prevents next from being garbage collected.
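The sizes quoted above can be checked directly on the match object (same string shape as the question's code; the retention itself is V8-internal, so only the visible lengths are shown):

```javascript
// Rebuild a string of the same shape as in the question:
// 5000 chars + "<" + 13-char match + ">" + 5000 chars = 10015 chars.
var str = new Array(5001).join("a") + "<abcdefghijklm>" + new Array(5001).join("b");
var next = /<([a-z]+)>/gm.exec(str);

// next.input is the full searched string, next[1] is the 13-char capture.
console.log(next.input.length); // 10015
console.log(next[1].length);    // 13
```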

As a solution, I came up with this modified code (fiddle):

function randString(length) {
  var a = "a".charCodeAt(0),
      result = [];
  for(var i = 0; i < length; i++) {
    result.push(a + Math.floor(Math.random() * 26));
  }
  return String.fromCharCode.apply(null, result);
}


var arr = [];

for(var i = 0; i < 100000; i++) {
  if(i % 1000 === 0) console.log(i);
  var str = randString(13);
  str = randString(5000) + "<" + str + ">" + randString(5000);
  var re = /<([a-z]+)>/gm;
  var next = re.exec(str);
  arr.push(next[1].split('').join(''));
}
console.log(arr);

The trick is to cut the reference between next and the string stored in arr by splitting the string and joining it again.
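The same trick, wrapped as a small helper (a sketch; the name flattenString is mine, not a standard API):

```javascript
// Rebuilding the string character by character yields a fresh, flat
// string that holds no hidden reference to the large parent string.
function flattenString(s) {
  return s.split("").join("");
}

var big = new Array(5001).join("x") + "<match>" + new Array(5001).join("y");
var m = /<([a-z]+)>/.exec(big);
var flat = flattenString(m[1]); // safe to keep; does not pin `big`
```

Note this costs O(n) time and a temporary array per string, which is negligible for 13-character matches.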

I don't know anything about the internals, but it looks like a bug in V8. Testing the same code in Firefox, everything works as expected, and there is no excessive memory usage.

I found the source of the problem. It's not the regex parser that's responsible for this, but the substring operation on strings. It's intended as an optimization that makes creating substrings cheaper: the substring is stored as a slice that references the original string. There is an open issue about this on the V8 bug tracker: https://code.google.com/p/v8/issues/detail?id=2869
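A minimal sketch of that retention pattern, isolated from the regex (the memory figures are assumptions about affected V8 versions, where each short substring can pin its roughly 10 kB parent string):

```javascript
var retained = [];
for (var i = 0; i < 1000; i++) {
  // ~10 kB parent string ending in the 6 characters we actually want.
  var parent = new Array(10000).join("x") + "needle";
  // `sub` is only 6 characters, but as a V8 "sliced string" it may
  // keep a reference to the entire parent alive.
  var sub = parent.substring(parent.length - 6);
  retained.push(sub);
}
// retained holds ~6 kB of visible characters, but can pin ~10 MB.
```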