filtering large datasets in node/javascript/underscore

Question

I have a large data set of usage records for virtual machines in a private cloud. Every hour, a set of these records is generated for every VM running in my cloud. Each VM also has its own record that contains specs like RAM, memory, and an id: field that corresponds to virtualmachineid in the usage records. I can query the API and get XML or JSON data sets back, and I chose JSON, since it's much more lightweight on the wire.

This is one record; there are 13 types of usage, corresponding to things like disk usage, bandwidth, running time, etc.:

Usage Record:                                                                    
  { account: 'user_1',                                                           
    accountid: 'c22ed7ed-e51a-4782-83a7-c72e2883eb99',                             
    domainid: 'f88d8bbf-a83b-4be1-a788-e2ab51eb9973',                              
    zoneid: '4a7f62a8-3248-47ee-bf94-d63dac2a6668',                                
    description: 'VM2 running time (ServiceOffering: 11) (Template: 215)',         
    usage: '1 Hrs',                                                                
    usagetype: 1,                                                                  
    rawusage: '1',                                                                 
    virtualmachineid: 'f6661f34-4d03-4128-b738-38c330f2499c',                      
    name: 'VM2',                                                                   
    offeringid: 'f1d82c2e-25e3-4c97-bae8-b6f916860faf',                            
    templateid: '2bf2e295-fdd6-4326-a652-6d07581be070',                            
    usageid: 'f6661f34-4d03-4128-b738-38c330f2499c',                               
    type: 'XenServer',                                                             
    startdate: '2012-12-25\'T\'22:00:00-06:00',                                    
    enddate: '2012-12-25\'T\'22:59:59-06:00' }

What I'm trying to do:

I need to move through the list of VMs, of which there will be hundreds, and for each VM, build a usage report for the previous period, which is typically a month, but can be ad hoc too. So out of the 10,000+ usage records for each VM each month, I need to calculate the total for each usage type.

Is there a more efficient, novel way than the traditional loop-loop-loop, then loop-again method? In pseudocode:

for (each vm in vms)
    for (each usage_record in usage_records)
        if (vm.id === usage_record.vmid)
            switch usage_record.usage_type
                case 1: it's runtime
                case 2: it's disk usage
                case 3: it's some other type of usage
                ...

Using Underscore, here's what I have done so far:

_.each(virtualMachines.virtualmachine, function (vm) {
    var recs = _.filter(usageList.usagerecord, function (foo) {
        return (foo.virtualmachineid === vm.id && foo.usagetype === 1);
    });
    // recs now contains all the type 1 usage records for one VM
    console.log("recs count:" + recs.length);
});

This works fine for now, but I'm not convinced it's optimized, and I doubt it will scale as the VM count goes up. For every VM, there will be 10,000 additional usage records added to the data set.

javascript
node.js
underscore.js

Answer 1

Since you need to process each VM and need results per VM, I'd first sort the records by VM. After that you should only need a single loop and a single object for the "current VM stats". Once you encounter the next VM in the list, you know the current stats are complete.

sortRecordsByVM();
currentStats = { runtime: 0, disk: 0, other: 0 };
currentVM = null;
for each record
  if currentVM != null && currentVM != record.VM
    writeToOutput(currentStats);
    currentStats = { runtime: 0, disk: 0, other: 0 };
  currentVM = record.VM;
  addRecordTo(record, currentStats);
if currentVM != null
  writeToOutput(currentStats);
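
As a rough JavaScript sketch of the sort-then-scan idea above: the function name summarizeSorted and the shape of the output are my own invention, and I'm assuming every record has a virtualmachineid and that rawusage is a numeric string, as in the sample record from the question. Totals are keyed by usagetype rather than by named fields like runtime or disk.

```javascript
// Hypothetical helper: sort by VM id, then total rawusage per usagetype in one scan.
// Assumes every record has virtualmachineid, usagetype and a numeric-string rawusage.
function summarizeSorted(records) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = records.slice().sort((a, b) =>
    a.virtualmachineid < b.virtualmachineid ? -1 :
    a.virtualmachineid > b.virtualmachineid ? 1 : 0);

  const output = [];
  let currentVM = null;
  let currentStats = {};

  for (const record of sorted) {
    // A change of VM id means the previous VM's totals are complete.
    if (currentVM !== null && record.virtualmachineid !== currentVM) {
      output.push({ vm: currentVM, totals: currentStats });
      currentStats = {};
    }
    currentVM = record.virtualmachineid;
    const amount = parseFloat(record.rawusage);
    currentStats[record.usagetype] = (currentStats[record.usagetype] || 0) + amount;
  }

  // Flush the last VM, if there was any input at all.
  if (currentVM !== null) {
    output.push({ vm: currentVM, totals: currentStats });
  }
  return output;
}
```

The sort costs O(n log n), but after it each record is visited exactly once and only one stats object is alive at a time.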

That said: I don't think iterating over 10K records will cause problems for modern machines, so I'd start with the simplest possible approach and only optimize when there is a performance problem.

I'd just avoid the nested loop and leave the lookup to built-in data structures (which are typically better optimized than any code I tend to write on the first try):

allStats = {};
for each record
  stats = allStats[record.VM];
  if (!stats)
    stats = { runtime: 0, disk: 0, other: 0 };
  addRecordTo(record, stats);
  allStats[record.VM] = stats;
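
In real JavaScript, that single pass might look something like the sketch below. The name summarizeByVm is hypothetical, and I'm again assuming the field names from the sample record (virtualmachineid, usagetype, rawusage) and that rawusage parses as a number. Totals are keyed by usagetype, so all 13 types are covered without a switch.

```javascript
// Hypothetical helper: one pass over all records, grouping totals by VM id.
// Assumes rawusage is a numeric string, as in the sample record.
function summarizeByVm(records) {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const allStats = Object.create(null);

  for (const record of records) {
    const vmId = record.virtualmachineid;
    // Look up this VM's totals, creating an empty entry on first sight.
    const stats = allStats[vmId] || (allStats[vmId] = {});
    const amount = parseFloat(record.rawusage);
    stats[record.usagetype] = (stats[record.usagetype] || 0) + amount;
  }
  return allStats;
}
```

Afterwards, the report for a given vm from the VM list is just summarizeByVm(usageList.usagerecord)[vm.id], so the total work grows linearly with the number of usage records instead of with VMs times records.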