How to scale MongoDB-based analytics up to millions of events per day?

Following up on my previous question ( How to quickly build large scale analytics server? ), I'm now feeding my analytics data into MongoDB: every event (with a bunch of metadata) gets its own document in a views collection.

However, I've now hit the next roadblock: with inserts in place and analytics data flowing in, what would be the path of least resistance for running analyses against that data? The idea is that once the data is sharded, the specific views would run mapReduces (say, all events with a specific ID from the last month).

So, my question is: as I'm quite new to MongoDB, what steps would I need to take to make those mapReduces as fast as possible? Should I structure the raw data differently, or is one document per event correct? Are there Mongo-specific tricks I can use to make things flow faster when running against a dataset that gets millions of inserts per day?

I would like to keep my technology stack as simple as possible (Node.js + MongoDB), so I'd prefer to do this without introducing extra technology (like, say, Hadoop).

Example document of an event:

{
    id: 'abc',
    ip: '1.1.1.1',
    type: 'event1',
    timestamp: 1234,
    metadata: {
        client: 'client1'
    }
}

All main aggregations will be ID-centric, analysing events for a given ID, the most used being "get all events with a given ID for the last month". Minor aggregations will group things by metadata (what percentage used client1 vs. client2, etc.). All the aggregations will be system-defined, so users can't define their own, at this point at least. Thus, as far as I understand, the sharding should be done on ID, as the big majority of the aggregations will be ID-centric? Also, this should mean that the most recent events for any given ID are always in memory, as Mongo keeps the latest stuff in memory and only pushes the overflow to disk.
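For concreteness, the most-used aggregation in shell terms would be roughly the following (timestamps assumed to be Unix seconds as in the example; the compound index is just my guess at what the query would need):

// index so per-ID, time-ranged lookups don't scan the whole collection
db.views.ensureIndex({ id: 1, timestamp: 1 });

// "all events with a given ID for the last month"
var oneMonthAgo = Date.now() / 1000 - 30 * 24 * 60 * 60;
db.views.find({ id: 'abc', timestamp: { $gte: oneMonthAgo } });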

Also, real time is not a requirement, although of course it would be nice. :P

Edit: Added example data

Edit: Title should have been "...per day" not "...per page" + more spec about the aggregation sets

Steps to get mapreduce as fast as possible:

  • Optimize your data so that as much of it as possible sits in memory. Going to disk 'kills' (makes really slow) everything! (Please see further down for why I'd advise against using mapReduce at all.)

Should I structure the raw data differently?

  • If the dimensions you aggregate on are known and limited, why not store and increment your aggregates at insertion time rather than querying for them afterwards? (Obviously this puts more load/processing on insertion.) See the MongoDB documentation on pre-aggregation patterns, and the sketch below.
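A rough sketch of that pattern, assuming an event shaped like the example document in the question (the collection and field names here are just illustrative):

// a sample event, shaped like the example document in the question
var event = { id: 'abc', ip: '1.1.1.1', type: 'event1', timestamp: 1338508800, metadata: { client: 'client1' } };
var day = new Date(event.timestamp * 1000).toISOString().slice(0, 10);  // e.g. '2012-06-01'

// the raw event still gets its own document...
db.views.insert(event);

// ...and the same write path bumps a pre-aggregated daily counter;
// _id encodes the dimensions we know we'll query on (ID + day + client)
db.daily_stats.update(
    { _id: event.id + ':' + day + ':' + event.metadata.client },
    { $inc: { count: 1 } },
    { upsert: true }
);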

Are there Mongo-specific tricks I can use to make things flow faster when running against a dataset that gets millions of inserts per day?

  • Sharding scales your writes: if a single node is overloaded, add another. Pick a shard key that balances the writes so that one shard doesn't receive all of them (see the sketch below).
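A minimal sketch, assuming the data lives in analytics.views and that ID plus timestamp gives enough write distribution (whether id alone is granular enough depends on how many distinct IDs you have):

// run against a mongos
sh.enableSharding('analytics');

// compound shard key: keeps one ID's events together for the ID-centric
// reads, while the timestamp component lets a single hot ID still be
// split across chunks
sh.shardCollection('analytics.views', { id: 1, timestamp: 1 });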

Other things to consider:

MapReduce is slow and not suitable for online operations. It's a batch-processing framework, it is single-threaded (as the JS engine is), and it has overhead in translating to and from JavaScript. I use the aggregation framework to get around these limits and have found it to be much faster and more scalable than the mongodb mapReduce command.
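For example, the client1-vs-client2 breakdown for one ID over the last month can be expressed as a pipeline instead of a map-reduce. A sketch, using the field names from the question's example document and assuming Unix-second timestamps:

var oneMonthAgo = Date.now() / 1000 - 30 * 24 * 60 * 60;

db.views.aggregate([
    // $match first so the { id, timestamp } index can be used
    { $match: { id: 'abc', timestamp: { $gte: oneMonthAgo } } },
    // count events per client for that ID
    { $group: { _id: '$metadata.client', count: { $sum: 1 } } },
    { $sort: { count: -1 } }
]);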

Whether you use mapReduce or the aggregation framework, try to keep as much of your active dataset in memory as possible; going to disk kills performance for either. I found the MongoDB notes on a right-balanced index invaluable for keeping our working set in memory.
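A quick way to sanity-check whether the indexes and hot data can plausibly fit in RAM, using nothing but shell helpers:

db.views.totalIndexSize();   // total size of all indexes on the events collection, in bytes
db.views.stats();            // per-index sizes, data size and average document size
db.serverStatus().mem;       // resident / mapped memory the mongod process is actually using (MB)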

MapReduce won't be quick. For one thing, it uses the JavaScript engine, so the more code you need to run to query your data, the slower it will be. The more you can leverage the native DB engine with indexes, the better. Can you provide some examples of what your data structure might look like?
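To illustrate the difference: the first query below drops into the JavaScript engine and evaluates code against every document, while the second can be answered entirely by the native engine through an index (field names taken from the question):

// forces a collection scan plus per-document JS evaluation -- slow
db.views.find({ $where: "this.type == 'event1'" });

// the same filter expressed natively; with this index it never leaves
// the native query engine
db.views.ensureIndex({ type: 1 });
db.views.find({ type: 'event1' });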

To expand on the other answer with my own flavour: map-reduce is a solid option for a lot of your processing. Make no mistake, map-reduce will be a vital core of your analytics if you do it fully.

The best sort of MR is an incremental one that builds things like archival statistics on certain events and people, shrinking the overall database size and the working-set cost of pulling out all that old, dusty data.
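A sketch of that incremental pattern with the mapReduce command: only process events newer than the last run, and fold the results into an archive collection via out: { reduce: ... } (the collection names and lastRun bookkeeping are illustrative):

var lastRun = 1338508800;  // timestamp of the previous run, kept somewhere persistent

db.views.mapReduce(
    function () { emit(this.id, 1); },                     // map: one count per event, keyed by ID
    function (key, values) { return Array.sum(values); },  // reduce: sum the partial counts
    {
        query: { timestamp: { $gt: lastRun } },            // only the new, not-yet-archived events
        out: { reduce: 'archived_event_counts' }           // merge into the existing totals
    }
);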

As for real-time usage, I have found (from many discussions on the subject in the early days of MongoDB on the Google user group) that the best approach is to pre-aggregate your data so that a simple linear query like db.some_preaggregated_data.find({ date: { $gt: start, $lt: end } }) will get your results. This eases sharding the data and ranging over it, not to mention your working-set size, giving a much more performant operation overall.
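Concretely, the pre-aggregated collection ends up holding one small document per ID per day, and the real-time read is a plain indexed range scan (the names here are illustrative):

// one document per ID per day, maintained by $inc upserts at insert time, e.g.
// { _id: 'abc:2012-06-01', id: 'abc', date: ISODate('2012-06-01'), counts: { event1: 42, event2: 7 } }

db.some_preaggregated_data.ensureIndex({ id: 1, date: 1 });

db.some_preaggregated_data.find({
    id: 'abc',
    date: { $gt: ISODate('2012-05-01'), $lt: ISODate('2012-06-01') }
});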

I would recommend one thing: don't reach for the aggregation framework or complex queries if you really expect this to scale fully in real time. They will create too much work on such a hugely expansive data set. You will want the locking, working set, etc. on your side, satisfying your needs with simple find() queries across what is essentially a big-table layout.