I have a serialized collection of Wikipedia article edits that I am streaming and storing in MongoDB with Node.js. They look like this:
{ "time" : 1338144181565, "page" : "Pavol Országh Hviezdoslav", "url" : "http://es.wikipedia.org/w/index.php?diff=56528327&oldid=56521690", "delta" : -60, "_id" : ObjectId("4fc275b5cd08c22d31000001") }
{ "time" : 1338144183265, "page" : "Indian Premier League", "url" : "http://en.wikipedia.org/w/index.php?diff=494656175&oldid=494656151", "delta" : -12, "_id" : ObjectId("4fc275b7cd08c22d31000002") }
{ "time" : 1338144187346, "page" : "Dizz Knee Land", "url" : "http://en.wikipedia.org/w/index.php?diff=494656189&oldid=494656176", "delta" : -84, "_id" : ObjectId("4fc275bbcd08c22d31000003") }
The URL shows the diff for each edit. I will scrape the edited text with a Python script and then want to update each record with a new field, "edit_text", and possibly the img src ("image_url") of the main image from the Wikipedia article (if there is one).
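For example, the enrichment step could look roughly like the pymongo sketch below. The connection string, the "wikipedia" database and "edits" collection names, and the scrape_diff/scrape_main_image helpers are all assumptions standing in for whatever the Node.js stream and the scraper actually use:

from pymongo import MongoClient

# Hypothetical names: adjust the URI, database, and collection
# to match the ones the Node.js stream writes to.
client = MongoClient("mongodb://localhost:27017")
edits = client["wikipedia"]["edits"]

def scrape_diff(url):
    """Placeholder: return the edited text for a diff URL."""
    raise NotImplementedError

def scrape_main_image(url):
    """Placeholder: return the main image src for the article, or None."""
    raise NotImplementedError

# Only touch documents that have not been enriched yet.
for doc in edits.find({"edit_text": {"$exists": False}}):
    update = {"edit_text": scrape_diff(doc["url"])}
    image_url = scrape_main_image(doc["url"])
    if image_url:
        update["image_url"] = image_url
    # $set adds the new fields in place without disturbing the original document.
    edits.update_one({"_id": doc["_id"]}, {"$set": update})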
The idea is to ultimately stream the updated data out to a web application that shows the edited text in context, alongside the page title and image (if the latter exists).
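A minimal sketch of the query the web app's backend might run against the enriched documents (same assumed database/collection names and fields as above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
edits = client["wikipedia"]["edits"]

# Fetch only enriched documents, newest first, projecting just the fields the page needs.
cursor = edits.find(
    {"edit_text": {"$exists": True}},
    {"page": 1, "edit_text": 1, "image_url": 1, "time": 1},
).sort("time", -1)

for doc in cursor:
    print(doc["page"], doc.get("image_url"))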
How could I do this while keeping everything in the same collection, or would it be better to store the results in a new collection?
I would store the content of the scraped files in another collection, for a few reasons: