I need to read a large zip file in Node.js and process each file (a roughly 100 MB zip file containing about 40,000 XML files, each ~500 KB uncompressed). I am looking for a 'streaming' solution that runs at acceptable speed and does not require keeping the whole dataset in memory (JSZip and node-zip worked for me, but they keep everything in RAM and the performance is not good enough). A quick attempt in C# shows that loading, unpacking and parsing the XML can be done in about 9 seconds on a two-year-old laptop (using DotNetZip). I don't expect Node.js to be as fast, but anything under one minute would be okay. Unpacking the file to local disk and then processing it is not an option.
I am currently attempting to use the unzip module (https://www.npmjs.org/package/unzip) but can't get it to work, so I don't know if the speed is acceptable, but at least it looks like I can stream each file and process it in the callback. (The problem is that I only receive the first 2 entries, then it stops calling the .on('entry', callback) callback. I don't get any error; it just silently stops after 2 files. It would also be good to know how I can get the full XML in one chunk instead of fetching buffer after buffer; see the sketch after the code below for what I mean.)
var fs = require('fs');
var unzip = require('unzip');

function openArchive() {
    fs.createReadStream('../../testdata/small2.zip')
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            var fileName = entry.path;
            var type = entry.type; // 'Directory' or 'File'
            var size = entry.size;
            console.log(fileName);
            entry.on('data', function (data) {
                console.log('received data');
            });
        });
}
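For reference, this is roughly how I imagine collecting each file's XML into a single string instead of handling chunk after chunk, assuming each entry behaves like an ordinary readable stream (parseXml is just a placeholder for my actual processing step):

function processEntry(entry) {
    var chunks = [];
    entry.on('data', function (chunk) {
        chunks.push(chunk); // accumulate the raw buffers as they arrive
    });
    entry.on('end', function () {
        // concatenate all chunks into one Buffer and decode it to a string
        var xml = Buffer.concat(chunks).toString('utf8');
        parseXml(xml); // placeholder for the XML processing
    });
}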
There are plenty of Node.js modules for working with zip files, so this question is really about figuring out which library is best suited for this scenario.