This seems like a pretty simple thing but I can't find any discussions that really explain how to do it.
I'm building a scraper with MongoDB and Node.js. It runs once daily and scrapes several hundred urls and records to the database. Example:
url, img src, page title and domain name are saved to MongoDB.Here's what I'm trying to achieve:
mongodb record and update it.The bit I'm having trouble with is deleting entries that haven't been scraped. What's the best way to achieve this?
So far my code successfully checks whether entries exist, updates them. It's deleting records that are no longer relevant that I'm having trouble with. Pastebin link is here:
You either need to timestamp items (and update them on every scrape) and periodically delete items which haven't been updated in a while, or you need to associate items with a particular query. In the latter case, you would gather all of the items previously associated with the query, and mark them off as the new results come in. Any items not marked off the list at the end, need to be deleted.
another possibility is to use the new TTL index option in mongodb 2.4 allowing you to set time to live on documents
http://docs.mongodb.org/manual/tutorial/expire-data/
This will let the server expire them over time instead of having to perform big expensive remove executions.
Another optimization is to use the power of 2 option for collections to avoid the high fragmentation of memory that write, remove cycles create