Why is the 'base' tag preventing jsdom.env from working?

Update:

I found that the page where I failed to run jQuery uses a 'base' tag. If a page contains that tag, jsdom does not work, though I don't know why.

<base href="http://bbs.18183.com/" />

To verify this I created a brand-new HTML file and put a <base> tag inside; jsdom then fails on it as well.


I am currently playing with Node.js, and after reading "How to Scrape Web Pages with Node.js and jQuery" I decided to build a scraper of my own.

So I installed express, jsdom and a lot of other stuff, and found it really convenient for scraping web pages. But later I ran into a weird situation: some particular pages cannot be scraped, and instead I get an error as follows:

          var title = $('title').text();
                      ^
TypeError: undefined is not a function
    at H:\animalwar\personal\node\getter\app.js:82:23
    at exports.env.exports.jsdom.env.scriptComplete (H:\animalwar\personal\node\getter\node_modules\jsdom\lib\jsdom.js:207:39)
    at process.startup.processNextTick.process._tickCallback (node.js:244:9)

Here is my code:

request({
  url:'http://bbs.18183.com/'},
  function (err, response, body) {
    // Bail out on either a request error or a non-200 status
    if (err || response.statusCode !== 200) {
      console.log('Connection Failure! Fuck GFW');
      res.end('Connection Failure! Fuck GFW');
      return;
    }
    jsdom.env({
      html: body,
      scripts: ['jquery.js']
      }, function(err, window){
        //Use jQuery just as in a regular HTML page
        var $ = window.jQuery;
        var title = $('title').text();
        console.log('SUCCESSFULLY GOT: ', title );
        res.end(title);
      }
   );
});

The website "http://bbs.18183.com/" does not work in this case, but many other websites do. For example, if I change the URL to "http://www.18183.com/", it works.

At first I guessed it was due to some conflicting definition of "$", but later I realized that with jsdom.env the page is just a DOM tree. Even when I renamed $ to something else, it still didn't work.

Does anyone know anything about this?

This isn't quite a bug, but I can see why it's unexpected. Here's what's happening:

scripts: ['jquery.js'] translates into "insert a <script src="jquery.js">". When jsdom sees <script src="jquery.js">, it tries to load jquery.js relative to the current document's URL.

In documents without a <base> tag, when you load them explicitly with HTML fragment strings instead of via URLs, the document URL gets set to the file:// URL corresponding to your current script. And I bet jquery.js is right next to your current script, so that works great: <script src="jquery.js"> resolves just fine.

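But in documents with a <base> tag, the document's URL gets set to that base. So <script src="jquery.js"> gets resolved relative to the base href: with <base href="http://localhost/"> it translates to loading http://localhost/jquery.js, and I bet you don't have a jquery.js available on a server running on localhost port 80. Likewise, with <base href="http://bbs.18183.com/"> it translates to http://bbs.18183.com/jquery.js. So this fails.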
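To make the resolution rule concrete, here is a minimal sketch using Node's built-in URL class rather than jsdom itself. It only illustrates standard relative-URL resolution; the helper name resolveScriptSrc and the file name app.js are made up for the example, and this is not a claim about how jsdom implements it internally.

```javascript
const path = require("path");
const { pathToFileURL } = require("url");

// Hypothetical helper: resolve a script's src against a document base URL,
// following the standard relative-URL resolution rules.
function resolveScriptSrc(src, baseUrl) {
  return new URL(src, baseUrl).href;
}

// No <base> tag: assume the document URL is the file:// URL of the
// running script (app.js in the current directory is an assumption).
const fileBase = pathToFileURL(path.join(process.cwd(), "app.js")).href;
const withoutBase = resolveScriptSrc("jquery.js", fileBase);

// With <base href="http://bbs.18183.com/"> the same src resolves
// against the remote site instead of the local directory.
const withBase = resolveScriptSrc("jquery.js", "http://bbs.18183.com/");

console.log(withoutBase); // a file:// URL ending in /jquery.js
console.log(withBase);    // http://bbs.18183.com/jquery.js
```

The same relative string ends up pointing at two very different places, which is why an absolute path sidesteps the problem.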

The fix is to be more explicit. I'd suggest something like

var path = require("path");

jsdom.env({
  html: myHTML,
  scripts: [path.resolve(__dirname, "jquery.js")],
  done: function (errors, window) {
  }
});

Note that if you checked your errors variable, you probably would have seen an error that gave you a clue. You don't seem to have any such error handling code.
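As a rough sketch of what such error handling could look like, the helper below checks both arguments before touching jQuery. It assumes the (errors, window) callback signature shown above, and that errors is either empty or an array of error objects; extractTitle is a hypothetical name, not part of jsdom.

```javascript
// Hypothetical helper: validate the callback arguments before using jQuery.
function extractTitle(errors, window) {
  // Assumption: errors is null/undefined, a single error, or an array of them.
  const list = errors ? [].concat(errors) : [];
  if (list.length > 0) {
    throw new Error("jsdom reported errors: " + list.map(String).join("; "));
  }
  // If the script failed to load, window.jQuery is undefined, which is
  // exactly what produced "undefined is not a function" in the question.
  if (!window || typeof window.jQuery !== "function") {
    throw new Error("jQuery was not loaded; check the script path");
  }
  return window.jQuery("title").text();
}

// Intended use inside the done callback (not executed here):
//   done: function (errors, window) {
//     try { res.end(extractTitle(errors, window)); }
//     catch (e) { console.error(e.message); res.end("Scrape failed"); }
//   }
```

Failing loudly at this point gives a message about the script path instead of a TypeError a few lines later.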