I have a serialized collection of Wikipedia article edits that I am streaming and storing in MongoDB with Node.js. They look like this:
{ "time" : 1338144181565, "page" : "Pavol Országh Hviezdoslav", "url" : "http://es.wikipedia.org/w/index.php?diff=56528327&oldid=56521690", "delta" : -60, "_id" : ObjectId("4fc275b5cd08c22d31000001") }
{ "time" : 1338144183265, "page" : "Indian Premier League", "url" : "http://en.wikipedia.org/w/index.php?diff=494656175&oldid=494656151", "delta" : -12, "_id" : ObjectId("4fc275b7cd08c22d31000002") }
{ "time" : 1338144187346, "page" : "Dizz Knee Land", "url" : "http://en.wikipedia.org/w/index.php?diff=494656189&oldid=494656176", "delta" : -84, "_id" : ObjectId("4fc275bbcd08c22d31000003") }
The URL shows the diff for each edit. I will scrape the edited text with a Python script and then want to update each record with a new field, "edit_text", and possibly the img src ("image_url") of the main image from the Wikipedia article (if there is one).
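For example, the enrichment step could look roughly like the pymongo sketch below. The connection string, the "wikipedia" database and "edits" collection names, and the scrape_diff/scrape_main_image helpers are all assumptions standing in for whatever the Node.js stream and the scraper actually use:

from pymongo import MongoClient

# Hypothetical names: adjust the URI, database, and collection
# to match the ones the Node.js stream writes to.
client = MongoClient("mongodb://localhost:27017")
edits = client["wikipedia"]["edits"]

def scrape_diff(url):
    """Placeholder: return the edited text for a diff URL."""
    raise NotImplementedError

def scrape_main_image(url):
    """Placeholder: return the main image src for the article, or None."""
    raise NotImplementedError

# Only touch documents that have not been enriched yet.
for doc in edits.find({"edit_text": {"$exists": False}}):
    update = {"edit_text": scrape_diff(doc["url"])}
    image_url = scrape_main_image(doc["url"])
    if image_url:
        update["image_url"] = image_url
    # $set adds the new fields in place without disturbing the original document.
    edits.update_one({"_id": doc["_id"]}, {"$set": update})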
The idea is to ultimately stream the updated data out to a web application that shows the edited text in context, alongside the page title and image (if the latter exists).
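A minimal sketch of the query the web app's backend might run against the enriched documents (same assumed database/collection names and fields as above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
edits = client["wikipedia"]["edits"]

# Fetch only enriched documents, newest first, projecting just the fields the page needs.
cursor = edits.find(
    {"edit_text": {"$exists": True}},
    {"page": 1, "edit_text": 1, "image_url": 1, "time": 1},
).sort("time", -1)

for doc in cursor:
    print(doc["page"], doc.get("image_url"))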
How could I do this while keeping everything in the same collection, or would it be better to store the results in a new collection?
I would store the content of the scraped files in another collection, for a few reasons: