Parse large JSON file in Nodejs

Question

Parse large JSON file in Nodejs

I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented a format:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure on how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function(item) {
   console.log("Woot, imported objects into the database!");
});*/

Note, I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak - I need a way that is guaranteed not to cause a memory overload, not matter how many objects are contained in the file.

I can choose to use FormatA or FormatB or maybe something else, just please specify in your answer. Thanks!

javascript
json
file
node.js

Answer 1

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        process(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function process(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices off the buffer from the beginning to the newline and hands it off to process. It then checks again if there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, process is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse one the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.

Because no more than a chunk of data will ever be in memory at a time, this will be extremely memory efficient. It will also be extremely fast. A quick test showed I processed 10,000 rows in under 15ms.

Answer 2

Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see if there's one already available.

Turns out there is.

JSONStream "streaming JSON.parse and stringify"

Since I just found it, I've obviously not used it, so I can't comment on its quality, but I'll be interested to hear if it works.

It does work consider the following CoffeeScript

stream.pipe(JSONStream.parse('*')) .on 'data', (d) -> console.log typeof d console.log "isString: #{_.isString d}"

This will log objects as they come in if the stream is an array of objects. Therefore the only thing being buffered is one object at a time.

Answer 3

As of October 2014, you can just do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

 var fs = require('fs'),
         JSONStream = require('JSONStream'),

    var getStream() = function () {
        var jsonData = 'myData.json',
            stream = fs.createReadStream(jsonData, {encoding: 'utf8'}),
            parser = JSONStream.parse('*');
            return stream.pipe(parser);
     }

     getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err){
        // handle any errors
     });

Update:

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
  JSONStream = require('JSONStream'),
  es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, {encoding: 'utf8'}),
        parser = JSONStream.parse('*');
        return stream.pipe(parser);
};

 getStream()
  .pipe(es.mapSync(function (data) {
    console.log(data);
  }));

node hello.js

hello world

Answer 4

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

UPDATE: You can use mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json

Answer 5

I solved this problem using the splt npm module. Pipe your stream into split, and it will "Break up a stream and reassemble it so that each line is a chunk".

Sample code:

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
linestream.on('data', function(chunk) {
    var json = JSON.parse(chunk);           
    // ...
});