Web Scraping Basics

I'm giving web page scraping a go, as I can see a lot of potential to do interesting stuff with it. I've spent a few hours researching what I need and I've decided to use node.js with the 'request' and 'cheerio' modules to perform the scrape.

So for a first project I thought I'd try and get a random sentence from this random sentence generator site: http://watchout4snakes.com/wo4snakes/Random/RandomSentence

The markup looks relatively simple, here's the bit I'm interested in:

<div class="resultBox">
    <table class="centeredResult">
        <tbody><tr>
            <td>
                <span id="result">An amateur regret slights the lust outside his contentious century.</span>
            </td>
        </tr>
    </tbody></table>

</div>

So the bit I want is in the span (obviously the sentence will be different each time the actual page is loaded). I wrote the following JavaScript file and ran it in node:

var request = require('request'),
    cheerio = require('cheerio');

request('http://watchout4snakes.com/wo4snakes/Random/RandomSentence', function(err, resp, body){

    if(!err && resp.statusCode == 200){

        console.log("connected...\n");

        // load the fetched HTML into cheerio so it can be queried jQuery-style
        var $ = cheerio.load(body);

        console.log($('#result').html());
    }
    else console.log("Failed To Connect...");
});

I get the "connected" notification, so I do some checks and determine that I've correctly scraped the page's data. Now all I want to do is select the text inside the span with the #result id. However, I'm just given a blank space; in fact, if I get cheerio to print the actual markup of that region, I'm given an empty <span id="result"></span> with no random sentence inside.

My initial guess is that node is scraping the markup before the random sentence script has finished running, but I don't know how to diagnose what's actually happening. Does anyone have an idea?

Yes, your intuition is correct: the request module grabs the markup before the random sentence script has had a chance to run. If you print out body, you'll see that it contains:

<table class="centeredResult">
    <tr>
        <td>
            <span id="result"></span>
        </td>
    </tr>
</table> 

In fact, the request module is never going to execute any JavaScript on the fetched page.

If you need JavaScript to run on the pages you are scraping, I'd recommend looking at a headless browser like PhantomJS, which gives you the ability to interact with the page via a JavaScript API.

Looking at the page:

<script>
    (function ($) {
        $(document).ready(function () {
            var options = {
                target: '#result',
                beforeSubmit: function () {
                    $('#result').empty();
                    $.fnWait();
                },
                success: function () {
                    $.unblockUI();
                }
            };
            $('#frmSentence').ajaxForm(options)
                             .find('input[type=submit]')
                             .click();
        });
    })(jQuery);
</script>

it looks like the #result span is being filled in with AJAX. When your library loads the page, it doesn't execute the JavaScript, so the sentence is never loaded.

It might be easiest if you just query the same URL they're pulling the sentence from. Otherwise, you'll need to use something that will execute the JavaScript on the page for you, like Selenium or similar.

Load the page in your browser, and look at the network requests. You'll see that the sentence is loaded asynchronously, after the initial HTML (which is all cheerio ever sees) has been delivered. There's a POST to http://watchout4snakes.com/wo4snakes/Random/NewRandomSentence that returns a plain text string (Content-Type: text/html; charset=utf-8) containing the sentence, which is then inserted into the DOM.

I don't know cheerio, but you can either (a) use a timer to wait a few seconds before reading the element, or (b) switch to a WebDriver client like wd, which can explicitly wait until that DOM element has been populated.
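Since that endpoint returns the sentence as plain text, a third option is to skip the page and its scripts entirely and POST to it with request. Here's a minimal sketch, assuming the endpoint accepts a POST with an empty body the way the page's own form submission appears to; if it expects form fields, they would need to be added to the request:

var request = require('request');

// POST to the endpoint the page itself calls; the response body is the sentence as plain text
request.post('http://watchout4snakes.com/wo4snakes/Random/NewRandomSentence', function(err, resp, body){
    if(!err && resp.statusCode == 200){
        console.log(body); // e.g. "An amateur regret slights the lust outside his contentious century."
    }
    else console.log("Failed To Connect...");
});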

So after some fiddling with my script, here's what I've ended up with:

var page = require('webpage').create();

console.log("connecting...");

page.open("http://watchout4snakes.com/wo4snakes/Random/RandomSentence", function(status){

    console.log('connected');

    // run this inside the page's context, after its scripts have executed
    var phrase = page.evaluate(function() {
        return document.getElementById("result").innerHTML;
    });

    console.log(phrase);

    phantom.exit();
});

Thanks to go-oleg for the prompt to use PhantomJS: it looks as if the headless-browser method lets the page's script run before the HTML content is grabbed. I then extracted the sentence from the page using page.evaluate().

PhantomJS seems to have some issues on my system, though. None of the processes ever exit on phantom.exit(), which, according to Google searches, has something to do with Nvidia graphics drivers. The script is also rather slow: because it waits for every element of the page to load, the connection can take up to 10 seconds, which isn't great for iterative use. But I managed to get the sentence, so I'll build on it from here. Thanks for the info, guys!
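One tweak that might shave some of that load time is telling PhantomJS not to download images, since only the text of #result is needed. This is just a sketch of the idea, not something I've benchmarked against this page:

var page = require('webpage').create();

// skip image downloads to reduce page load time; only the text of #result is needed
page.settings.loadImages = false;

page.open("http://watchout4snakes.com/wo4snakes/Random/RandomSentence", function(status){
    var phrase = page.evaluate(function() {
        return document.getElementById("result").innerHTML;
    });
    console.log(phrase);
    phantom.exit();
});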