Extract a string from HTML with NodeJS

Question

Here is the HTML:

<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>

I'm using Node.js and trying to extract the track ID, in this case the 11111111 that follows tracks%2F. What is the most reliable way to do this?

Should I use regex or some JS string method such as substring() or match()?

javascript
regex
node.js

Answer 1

If you know tracks%2F will only appear once, you could do:

var your_track_ID = src.split(/tracks%2F/)[1].split(/&amp/)[0];

There are probably better ways, but that should work fine for your purposes.
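
As a rough sketch, the same split approach can be wrapped in a function that returns null instead of throwing when the marker is missing. The name trackIdFromSplit is hypothetical, and src is assumed to hold the attribute text as it appears in the question.

```javascript
// Hypothetical helper: split-based extraction with a guard for a missing marker.
function trackIdFromSplit(src) {
  var parts = src.split('tracks%2F');
  if (parts.length < 2) {
    return null; // marker not present, nothing to extract
  }
  // Take everything up to the next "&" (covers both "&amp;" and a decoded "&").
  return parts[1].split('&')[0];
}

// Sample value taken from the question's src attribute (shortened).
var src = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false';

console.log(trackIdFromSplit(src)); // 11111111
console.log(trackIdFromSplit('http://example.com/')); // null
```

Splitting on a plain "&" rather than "&amp" means the helper also works if the attribute has already been entity-decoded by a parser.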

Answer 2

It's generally a bad idea to parse HTML with a regular expression, but this case might be forgivable. I'd anchor the pattern on the player's host name for safety:

var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/
  , trackID = (html.match(pattern) || [])[1]
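
For illustration, here is how that pattern might be applied to the markup from the question. The wrapper name trackIdFromHtml is hypothetical, and the sample HTML is shortened. Note that "." does not match line breaks, so this assumes the host name and tracks%2F sit on the same line of the input.

```javascript
// Sample input based on the question's iframe (shortened).
var html = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
  + 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false'
  + '&amp;show_artwork=true&amp;color=c3000d"></iframe>';

// Hypothetical wrapper: returns the captured digits, or null when nothing matches.
function trackIdFromHtml(html) {
  var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/;
  var match = html.match(pattern);
  return match ? match[1] : null;
}

console.log(trackIdFromHtml(html)); // 11111111
console.log(trackIdFromHtml('<p>no player here</p>')); // null
```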

Answer 3

You can extract the track ID with the Node modules url, jsdom, and qs:

var jsdom = require('jsdom');
var url = require('url');
var qs = require('qs');

var str = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
  + 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false'
  + '&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false'
  + '&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>';

jsdom.env({
  html: str,
  scripts: [
    'http://code.jquery.com/jquery-1.5.min.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    var src = $('iframe').attr('src');
    var aRes = qs.parse(decodeURIComponent(url.parse(src).query)).url.split('/');
    var track_id = aRes[aRes.length-1];

    console.log("track_id =", track_id);
  }
});

The result is:

track_id = 11111111

Answer 4

If the track ID is always 8 digits and the HTML doesn't change, you can do this (match() returns an array, or null when nothing matches, so take the first element):

var trackId = (html.match(/\d{8}/) || [])[0];

Answer 5

The Right™ way to do this is to parse the HTML with a proper parser, read the iframe's src URL from the parsed document, and then use a regular expression to parse the URL.

If for some reason you don't have an infinite amount of time and energy, one of the proposed purely regex-based solutions will work.
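
A minimal sketch of the second half of that approach, assuming an HTML parser has already handed you the iframe's src attribute (so "&amp;" is decoded to "&"). It uses the URL class built into current Node.js versions; the function name trackIdFromSrc is hypothetical.

```javascript
// Assumption: `src` is the attribute value as returned by an HTML parser,
// with entities already decoded.
var src = 'http://w.soundcloud.com/player/'
  + '?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111'
  + '&auto_play=false&show_artwork=true';

// Hypothetical helper: read the "url" query parameter, then match the ID in it.
function trackIdFromSrc(src) {
  // searchParams.get() percent-decodes the value, or returns null if absent.
  var trackUrl = new URL(src).searchParams.get('url');
  if (trackUrl === null) {
    return null;
  }
  // The decoded value looks like http://api.soundcloud.com/tracks/<digits>.
  var match = /\/tracks\/(\d+)$/.exec(trackUrl);
  return match ? match[1] : null;
}

console.log(trackIdFromSrc(src)); // 11111111
```

Because the query string is parsed rather than pattern-matched, the order of the other parameters no longer matters.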