I just wrote my first script for pjscrape, but it runs terribly slowly. I'm new to both pjscrape and PhantomJS, so I don't know which one is the culprit.
I am loading the file from localhost, so the bottleneck is definitely not in the transfer.
My config.js script looks like this:
    pjs.addSuite({
        url: 'http://localhost/file.html',
        scraper: function() {
            // One result object per <table class="person">
            var people = $('table.person');
            var results = [];
            $.each(people, function() {
                var $this = $(this);
                results.push({
                    firstName: $this.find('.firstName').text(),
                    lastName: $this.find('.lastName').text(),
                    age: $this.find('.age').text()
                });
            });
            return results;
        }
    });
Then I just run it with PhantomJS, following the documented command-line instructions:
~> phantomjs pjscrape.js config.js
When I run the same code (just the body of the scraper function) in Chrome, it is instant. In PhantomJS/pjscrape, it takes a good 30 seconds.
Any clue what is causing the slowness?
Is there a better way to do this DOM screen scraping? Maybe a nodejs solution?
If Node.js is an option, might I introduce you to cheerio? It's a great library for consuming questionably formed HTML documents. It gives you a jQuery-like API for working with a DOM-like representation of the page you're scraping. Paired with request, it makes for a pretty easy environment for scraping HTML.
Your example would end up looking something like this (error handling left as an exercise for the reader):
    var cheerio = require("cheerio"),
        request = require("request");

    request("http://localhost/file.html", function(err, res, data) {
        // Parse the response body into a jQuery-like document
        var $ = cheerio.load(data);
        var people = $('table.person');
        var results = [];
        // Iterate with the selection's own .each()
        people.each(function() {
            var $this = $(this);
            results.push({
                firstName: $this.find('.firstName').text(),
                lastName: $this.find('.lastName').text(),
                age: $this.find('.age').text()
            });
        });
        do_something_with(results);
    });
If the page you are scraping is served as fully formed HTML and does not need client-side JavaScript to manipulate the DOM into its final form, skip PhantomJS. Just fetch the page with an HTTP client library (Node core http, request, superagent, or hyperquest) and use cheerio to extract the data you need from the DOM.
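For the Node core option, the fetch step might look roughly like the sketch below. It uses only the built-in http module; `fetchHtml` is a hypothetical helper name I made up for illustration, and rejecting on any non-200 status is my own assumption about what you'd want, not something the libraries above prescribe.

```javascript
// Minimal sketch: fetch a page with Node's core http module and
// resolve with the response body as a string.
// `fetchHtml` is a hypothetical helper name, not part of any library.
const http = require("http");

function fetchHtml(url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (res) {
      // Assumption: anything other than 200 is treated as a failure.
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error("Unexpected status code: " + res.statusCode));
        return;
      }
      res.setEncoding("utf8");
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () { resolve(body); });
    }).on("error", reject); // connection-level errors (e.g. refused)
  });
}
```

The resolved string can then be handed to `cheerio.load()` exactly as in the request-based example, so the extraction code stays the same whichever client you pick.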