Update:
I found a <base> tag in the page where jQuery failed to run. If a page contains that tag, jsdom does not work, though I don't know why.
<base href="http://bbs.18183.com/" />
To verify this, I created a brand-new HTML file and put a <base> tag inside; jsdom then fails.
I am currently playing with Node.js, and after reading How to Scrape Web Pages with Node.js and jQuery I decided to build a scraper of my own.
So I installed express, jsdom and a lot of other stuff, and found it really convenient for scraping web pages. But later I ran into a weird situation: some particular pages cannot be scraped, and instead I get the following error:
var title = $('title').text();
^
TypeError: undefined is not a function
at H:\animalwar\personal\node\getter\app.js:82:23
at exports.env.exports.jsdom.env.scriptComplete (H:\animalwar\personal\node\getter\node_modules\jsdom\lib\jsdom.js:207:39)
at process.startup.processNextTick.process._tickCallback (node.js:244:9)
Here is my code:
request({
    url: 'http://bbs.18183.com/'
}, function (err, response, body) {
    // Bail out on a network error or a non-200 response
    if (err || response.statusCode !== 200) {
        console.log('Connection Failure! Fuck GFW');
        res.end('Connection Failure! Fuck GFW');
        return;
    }
    jsdom.env({
        html: body,
        scripts: ['jquery.js']
    }, function (err, window) {
        // Use jQuery just as in a regular HTML page
        var $ = window.jQuery;
        var title = $('title').text();
        console.log('SUCCESSFULLY GOT: ', title);
        res.end(title);
    });
});
The website "http://bbs.18183.com/" does not work in this case, but many other websites do. For example, if I change the URL to "http://www.18183.com/", it works.
At first I guessed it was due to some conflict over the definition of "$", but later I realized that with jsdom.env the page is just a DOM tree. Even when I changed $ to another name, it still didn't work.
Does anyone know anything about this?
I see what is going on here. This isn't quite a bug, but I can see why it's unexpected. Here's what's happening:
scripts: ['jquery.js'] translates into "insert a <script src="jquery.js">". When jsdom sees <script src="jquery.js">, it tries to load jquery.js relative to the current document's URL.
In documents without a <base> tag, when you load them explicitly with HTML fragment strings instead of via URLs, the document URL gets set to the file:// URL corresponding to your current script. And I bet jquery.js is right next to your current script, so that works great: <script src="jquery.js"> resolves just fine.
But in documents with a <base> tag, the document's URL gets set to that base, so <script src="jquery.js"> is resolved against the <base> href. With <base href="http://localhost/">, for instance, that translates to loading http://localhost/jquery.js, and I bet you don't have a jquery.js available on a server running on localhost port 80. So this fails.
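To make the resolution rule concrete, here is a minimal sketch using Node's built-in URL class. It only illustrates how a relative src resolves against two different bases; it is not jsdom's own code, and the file:// path and the resolveScript helper are made up for the example.

```javascript
// Resolve a script's src attribute against a base URL, the way a browser
// would. resolveScript is a hypothetical helper, not part of jsdom.
function resolveScript(src, baseUrl) {
  return new URL(src, baseUrl).href;
}

// No <base> tag: the base is a file:// URL for the running script
// (this path is an assumed example location).
const local = resolveScript("jquery.js", "file:///home/user/getter/app.js");

// With <base href="http://bbs.18183.com/"> the same src points at the remote host.
const remote = resolveScript("jquery.js", "http://bbs.18183.com/");

console.log(local);  // file:///home/user/getter/jquery.js
console.log(remote); // http://bbs.18183.com/jquery.js
```

The second URL is the one that fails to load when the remote site does not serve a jquery.js at that location.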
The fix is to be more explicit. I'd suggest something like
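var path = require("path");
jsdom.env({
    html: myHTML,
    // An absolute path does not depend on the document's base URL
    scripts: [path.resolve(__dirname, "jquery.js")],
    done: function (errors, window) {
        // Check errors here before using window.jQuery
    }
});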
Note that if you checked your errors variable, you probably would have seen an error that gave you a clue. You don't seem to have any such error handling code.
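As a rough sketch of what such error handling could look like, the helper below could be called from the jsdom.env callback. The name extractTitle and its return shape are made up for illustration, and treating errors as either a single value or an array is an assumption about what the callback receives.

```javascript
// Hypothetical helper: does the work of the jsdom.env callback, but reports
// load errors and a missing jQuery instead of throwing a TypeError.
function extractTitle(errors, window) {
  // Assumption: errors may be null, a single error, or an array of errors.
  const list = errors ? [].concat(errors) : [];
  if (list.length > 0) {
    return { ok: false, reason: list.map(String).join("; ") };
  }
  // If the script failed to load, window.jQuery is undefined.
  if (!window || typeof window.jQuery !== "function") {
    return { ok: false, reason: "jQuery did not load into the window" };
  }
  return { ok: true, title: window.jQuery("title").text() };
}

// Usage with a stand-in window object (no jsdom needed to try it out).
const fakeWindow = {
  jQuery: function () {
    return { text: function () { return "Example title"; } };
  }
};
console.log(extractTitle(null, fakeWindow));
console.log(extractTitle(null, {}));
```

Inside the real callback you would call extractTitle(errors, window) and send either the title or the reason to the response.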