Suggestions needed for using Node.js and MongoDB to detect platform changes of a site

Question

I need some advice on a project I am working on.

The project requests HTTP headers from websites; an example of a scraped response is below, as a MongoDB-style document:

{
    "url": "google.com",
    "statusCode": 301,
    "headers": {
        "location": "http://www.google.com/",
        "content-type": "text/html; charset=UTF-8",
        "date": "Mon, 25 Mar 2013 13:50:31 GMT",
        "expires": "Wed, 24 Apr 2013 13:50:31 GMT",
        "cache-control": "public, max-age=2592000",
        "server": "gws",
        "content-length": "219",
        "x-xss-protection": "1; mode=block",
        "x-frame-options": "SAMEORIGIN"
    }
}

This project uses Node.js, JavaScript, and MongoDB. I currently have a few thousand of these responses stored in MongoDB, and I would like to use some of the headers to detect platform changes. Headers such as server, x-powered-by, and x-aspnet-version could, in my opinion, be cross-referenced between runs. For example, if a website reports Microsoft-IIS/7.0 today and Microsoft-IIS/7.5 when I run the scraper again in two months, there is reason to believe the site's platform was upgraded.

My question is: what is the best way to do this?

Should I make two collections, collectionToday and collectionInTwoMonths?

Then do a regex search for version-number changes or increments in each of server, x-powered-by, and x-aspnet-version?

How would an implementation of this work?

Any suggestions will be appreciated.

javascript
regex
node.js
mongodb
document

Answer 1

There are a few ways you could do this. One, as you suggested, is to create a separate collection for each time period and store the full set of headers in each. You could then find differences by running find on the url in each collection, comparing the results on the application side, and reporting what changed.
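As a rough illustration of the application-side comparison, here is a minimal sketch. It assumes the two arrays stand in for the results of find on each collection; the function name compareSnapshots and the example.com data are made up for the example, and the tracked header names are the ones from the question.

```javascript
// Headers worth tracking, taken from the question.
const TRACKED_HEADERS = ['server', 'x-powered-by', 'x-aspnet-version'];

// Compare two snapshots (arrays of scraped documents) and list every
// tracked header whose value differs for the same url.
// compareSnapshots is a hypothetical helper, not part of any library.
function compareSnapshots(olderDocs, newerDocs, tracked = TRACKED_HEADERS) {
  // Index the older snapshot by url so each lookup is cheap.
  const olderByUrl = new Map(olderDocs.map((doc) => [doc.url, doc]));
  const report = [];

  for (const newer of newerDocs) {
    const older = olderByUrl.get(newer.url);
    if (!older) continue; // url was not scraped last time; nothing to compare

    for (const name of tracked) {
      const before = (older.headers || {})[name];
      const after = (newer.headers || {})[name];
      if (before !== after) {
        report.push({ url: newer.url, header: name, before, after });
      }
    }
  }
  return report;
}

// Made-up data: each array stands in for one collection's find() result.
const collectionToday = [
  { url: 'example.com', headers: { server: 'Microsoft-IIS/7.0', 'x-powered-by': 'ASP.NET' } },
];
const collectionInTwoMonths = [
  { url: 'example.com', headers: { server: 'Microsoft-IIS/7.5', 'x-powered-by': 'ASP.NET' } },
];

console.log(compareSnapshots(collectionToday, collectionInTwoMonths));
```

A plain string comparison is used here instead of a regex, on the assumption that any change in the header value is worth reporting; parsing out version numbers could be layered on top if only upgrades matter.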

Another way would be to keep a "differences" collection that holds, for each point in time, the differences between the headers at that time and the headers from the previous scrape. This requires more application logic each time you scrape the headers, but less work when you later query for the differences. This is what I would do.
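To give an idea of what a document in such a "differences" collection might look like, here is a small sketch. The document shape (url, date, changes with from/to) and the function name buildDiffDocument are assumptions made for this example, not a fixed schema.

```javascript
// Build the document to store in a "differences" collection: only the
// tracked headers whose value changed since the previous scrape.
// Returns null when nothing changed, so the caller can skip the insert.
// buildDiffDocument and the { from, to } shape are hypothetical.
function buildDiffDocument(url, previousHeaders, currentHeaders, tracked, date = new Date()) {
  const changes = {};
  for (const name of tracked) {
    const before = previousHeaders[name];
    const after = currentHeaders[name];
    if (before !== after) {
      // null marks a header that was absent on that scrape.
      changes[name] = {
        from: before === undefined ? null : before,
        to: after === undefined ? null : after,
      };
    }
  }
  return Object.keys(changes).length > 0 ? { url, date, changes } : null;
}

// Made-up example: only the server header changed between scrapes.
const tracked = ['server', 'x-powered-by', 'x-aspnet-version'];
const diffDoc = buildDiffDocument(
  'example.com',
  { server: 'Microsoft-IIS/7.0', 'x-powered-by': 'ASP.NET' },
  { server: 'Microsoft-IIS/7.5', 'x-powered-by': 'ASP.NET' },
  tracked,
  new Date('2013-05-25T00:00:00Z')
);
console.log(JSON.stringify(diffDoc, null, 2));
```

Storing both the old and the new value makes each document readable on its own, at the cost of a little redundancy.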

Edit

If those are the three headers you need, then I think that sounds good. Remember that when you query to find the differences, you need the value each header had the last time it changed, to compare against. That means finding the most recent entry in the collection that both matches the url and has a value for the header in question.

Pseudo-code for diffing:

for every url you want:
    query the collection by url, sorting by date (newest first)
    create a new, empty document
    for each header:
        find the most recent document with that field
        if the header value in that document and the current header are different:
            add the field to the new document
    if the new document has any fields:
        add the new document, holding the url, date, and all different fields, to the collection
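The pseudo-code above could be sketched in JavaScript roughly as follows. This is an in-memory illustration only: the allDiffs array stands in for the differences collection, the function names are made up, and with the MongoDB driver the filter-and-sort step would presumably become a find by url sorted by date descending.

```javascript
const TRACKED = ['server', 'x-powered-by', 'x-aspnet-version'];

// history: differences documents for one url, sorted newest first.
// Returns the value from the most recent document that has the field.
function lastKnownValue(history, name) {
  const doc = history.find((d) => Object.prototype.hasOwnProperty.call(d, name));
  return doc ? doc[name] : undefined;
}

// Build the next differences document for a url, or null if no tracked
// header changed. Both function names here are hypothetical.
function diffAgainstHistory(allDiffs, url, currentHeaders, date) {
  // Stand-in for "query collection by url, sorting by date".
  const history = allDiffs
    .filter((d) => d.url === url)
    .sort((a, b) => b.date - a.date);

  const next = { url, date };
  let changed = false;
  for (const name of TRACKED) {
    const current = currentHeaders[name];
    // Assumption: a header that disappeared is ignored in this sketch.
    if (current !== undefined && current !== lastKnownValue(history, name)) {
      next[name] = current;
      changed = true;
    }
  }
  return changed ? next : null;
}

// Made-up history: the first scrape stored every header it saw, and a
// later scrape recorded only the newly appearing x-aspnet-version.
const allDiffs = [
  { url: 'example.com', date: new Date('2013-03-25'), server: 'Microsoft-IIS/7.0', 'x-powered-by': 'ASP.NET' },
  { url: 'example.com', date: new Date('2013-04-25'), 'x-aspnet-version': '4.0.30319' },
];
const latestScrape = { server: 'Microsoft-IIS/7.5', 'x-powered-by': 'ASP.NET', 'x-aspnet-version': '4.0.30319' };

const nextDiff = diffAgainstHistory(allDiffs, 'example.com', latestScrape, new Date('2013-05-25'));
if (nextDiff) allDiffs.push(nextDiff); // stand-in for inserting into the collection
console.log(nextDiff);
```

Because each document holds only the fields that changed, the lookup has to walk back through the history per header, which is the point made in the edit above.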