I'm doing some web scraping with Node.js. I'd like to use XPath, since I can generate it semi-automatically with several kinds of GUI. The problem is that I cannot find a way to do this effectively.

- `jsdom` is extremely slow. It parses a 500 KiB file in a minute or so, with full CPU load and a heavy memory footprint.
- `cheerio` neither supports XPath, nor exposes a W3C-compliant DOM.
- `phantom` or `casper` would be an option, but those require running in a special way, not just `node <script>`. I cannot rely on the risk implied by this change. For example, it's much more difficult to find out how to run `node-inspector` with `phantom`.
- `Spooky` is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath then?
You can do so in several steps.

1. Parse the page with `parse5`. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
2. Serialize it to XHTML with `xmlserializer`, which accepts the DOM-like structures of `parse5` as input.
3. Parse that XHTML again with `xmldom`. Now you finally have that DOM.
4. The `xpath` library builds upon `xmldom`, allowing you to run XPath queries. Be aware that XHTML has its own namespace, and queries like `//a` won't work.

Finally you get something like this.
```javascript
var fs = require('fs');
var xpath = require('xpath');
var parse5 = require('parse5');
var xmlser = require('xmlserializer');
var dom = require('xmldom').DOMParser;

fs.readFile('./test.htm', function (err, html) {
    if (err) throw err;
    // Older parse5 releases used new parse5.Parser().parse(...);
    // current versions expose parse() directly.
    var document = parse5.parse(html.toString());
    var xhtml = xmlser.serializeToString(document);
    var doc = new dom().parseFromString(xhtml);
    var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    var nodes = select("//x:a/@href", doc);
    console.log(nodes);
});
```
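Note that for an attribute query like the one above, `select()` returns DOM attribute nodes rather than plain strings. A minimal sketch of pulling the string values out (simulated here with plain objects shaped like `xmldom` attribute nodes, since the actual result depends on the parsed document):

```javascript
// select("//x:a/@href", doc) yields attribute nodes; each one exposes the
// attribute text via its .value property. Simulated with plain objects here.
var nodes = [
    { name: 'href', value: '/home' },
    { name: 'href', value: '/about' }
];
var hrefs = nodes.map(function (node) { return node.value; });
console.log(hrefs); // [ '/home', '/about' ]
```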
`libxmljs` is currently the fastest implementation (something like a benchmark), since it is only bindings to the libxml C library, which supports XPath 1.0 queries:

```javascript
var libxmljs = require('libxmljs');

// `xml` is a string containing your (already sanitized) XML markup
var xmlDoc = libxmljs.parseXml(xml);

// XPath queries
var gchild = xmlDoc.get('//grandchild');
```

However, you need to sanitize your HTML first and convert it to proper XML. For that you could either use the HTML Tidy command-line utility (`tidy -q -asxml input.html`), or if you want to keep it Node-only, something like `xmlserializer` should do the trick.
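To see why the sanitizing step matters: a strict XML parser rejects HTML void elements like `<br>` that are never closed. A toy illustration of the kind of fix-up involved (not a real sanitizer — use tidy or `xmlserializer` in practice; the function name is hypothetical):

```javascript
// Naive fix-up for a few known HTML void elements: rewrite them as
// self-closing tags so a strict XML parser will accept them.
// Illustration only; real-world HTML needs a proper tool like tidy.
function selfCloseVoids(html) {
    return html.replace(/<(br|hr|img|input|meta|link)([^>]*?)\s*\/?>/gi, '<$1$2/>');
}

console.log(selfCloseVoids('one<br>two<img src="a.png">'));
// one<br/>two<img src="a.png"/>
```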
I have just started using `htmlstrip-native` (`npm install htmlstrip-native`), which uses a native implementation to parse and extract the relevant HTML parts. It claims to be 50 times faster than the pure JS implementation (I have not verified that claim).

Depending on your needs, you can use `html-strip` directly, or lift the code and bindings to make your own C++ module, as used internally in `htmlstrip-native`.

If you want to use XPath, then use the wrapper already available here: https://www.npmjs.org/package/xpath