I am trying to scrape a website and I'm having issues with both jsdom and cheerio dramatically changing the HTML they are given. Most notably, they remove some tags, such as table/tr/td tags.
Simply having a local file, say 1.html, and doing:

// with cheerio -> or equivalent with jsdom
var fs = require('fs');
var $ = require('cheerio').load(fs.readFileSync(path, 'utf8'));
fs.writeFileSync('2.html', $.html());
# bash
$> diff 1.html 2.html
.....
< <tr><td colspan="5"><a id="stats" name="stats"></a><div class="titlebar1" style="margin-top: 12px;margin-bottom: 4px;"><h2>Stats</h2><div class="element"><img src="img/element/10.png" /></div><div class="elementborder"><img src="img/elementborder.png" /></div></div></td></tr></table></td></div>
---
> <tr><td colspan="5"><a id="stats" name="stats"></a><div class="titlebar1" style="margin-top: 12px;margin-bottom: 4px;"><h2>Stats</h2><div class="element"><img src="img/element/10.png"></div><div class="elementborder"><img src="img/elementborder.png"></div></div></div></td></tr>
54,57c53,56
<
.....
EDIT: I realize that this is most likely due to invalid HTML. My question is: is there any way I can avoid this? If you view the page normally in a browser, the elements are there.
More precisely I'm trying to scrape this: http://www.puzzledragonx.com/en/monster.asp?n=1
EDIT: I realized that this is also some sort of browser behavior. If you download the page with wget and parse the HTML with cheerio, you do get different HTML, yes, but the browsers also remove the tags when parsing the DOM, which leads me to believe that cheerio/jsdom output similarly funky HTML.
I also ran that page through the W3C HTML validator, and there are a lot of errors about the doctype not allowing an element to be placed at a certain position, but nothing regarding invalid markup.
It looks like your input HTML is malformed. $.html() serializes the current DOM representation, which will not be identical to the input HTML unless that input was syntactically correct.
To understand why this happens, think about what happens under the covers. Cheerio parses HTML text into a normalized data structure. This data structure is what we refer to as the DOM: the Document Object Model. HTML is just a text representation of this model; after cheerio parses the HTML, it discards the input text, as it no longer needs it.
When you call $.html(), cheerio must convert the DOM data structure back into a text representation of the document. To do this, it recurses over the DOM tree and generates HTML for each node. The original input HTML string has nothing to do with the output HTML other than the fact that the DOM was populated with the input HTML.
At this point you should see why it is not possible for a library that parses HTML to later output the exact same HTML if the input HTML was malformed. The parsing and normalization of input text is necessarily lossy: a forgiving parser must throw out HTML text that doesn't make sense.
You can see this even in Chrome: do a diff of your page's source code and the string returned by document.documentElement.outerHTML. Here too we see numerous differences, especially around the malformed tables. (Some differences are a result of scripts running and mutating the DOM.) These artifacts occur for the same reason as they do with cheerio, jsdom, or any other HTML-parsing library.