How can I replicate Chrome's ability to 'resolve' a DOM from bad html?

I'm using cheerio and Node.js to parse a web page and then use CSS selectors to find data on it. Cheerio doesn't handle malformed HTML well. jsdom is more forgiving, but the two behave differently, and I've seen each one break in cases where the other works fine.

Chrome seems to do a fine job with the same malformed html in creating a DOM.

How can I replicate Chrome's ability to create a DOM from malformed HTML, then give the 'cleaned' html representation of this DOM to cheerio for processing?

This way I'll know the HTML it gets is well-formed. I tried phantomjs by setting page.content, but when I read page.content back, the HTML is still malformed.

You can use https://github.com/aredridel/html5/ which is a lot more forgiving and, in my experience, works where jsdom fails.

But the last time I tested it, a few months back, it was super slow. I hope it has gotten better. There is also the possibility of spawning a phantomjs process, having it print the data you want as JSON on stdout, and feeding that back to your Node process.
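A rough sketch of the Node side of that spawn-and-read-stdout approach, assuming the child process prints exactly one JSON document on stdout. The helper name runAndParseJson and the script name extract.js are made up for illustration; only child_process.execFile and JSON.parse are real APIs here.

```javascript
var execFile = require('child_process').execFile;

// Run an external command and parse whatever it prints on stdout as JSON.
// callback(err, data) follows the usual Node convention.
function runAndParseJson(command, args, callback) {
    // maxBuffer is raised because a serialized page can exceed the default limit.
    execFile(command, args, { maxBuffer: 10 * 1024 * 1024 }, function (err, stdout) {
        if (err) { return callback(err); }
        var data;
        try {
            data = JSON.parse(stdout);
        } catch (parseErr) {
            return callback(parseErr);
        }
        callback(null, data);
    });
}

// Hypothetical usage: extract.js would be a PhantomJS script you write yourself
// that loads the page and ends with console.log(JSON.stringify(result)).
// runAndParseJson('phantomjs', ['extract.js', 'http://example.com/'], function (err, data) {
//     if (err) { return console.error(err); }
//     console.log(data);
// });
```

Anything else the child writes to stdout (debug logging, for example) will break the JSON.parse step, so the script should keep stdout for the JSON payload only.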

This seems to do the trick, using phantomjs-node and jquery:

function cleanHtmlWithPhantom(html, callback){
    var phantom = require('phantom');
    phantom.create(
        function(ph){
            ph.createPage(
                function(page){
                    page.injectJs(
                        "/some_local_location/jquery_1.6.1.min.js",
                        function(){
                            page.evaluate(
                                function(){
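                                    // The placeholder below is swapped for a JSON-encoded string
                                    // literal before this function's source is sent to the page.
                                    $('html').html(newHtml);
                                    return $('html').html();
                                }.toString().replace(/newHtml/g, function(){ return JSON.stringify(html); }),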
                                function(result){
                                    callback("<html>" + result + "</html>")
                                    ph.exit();
                                }
                            )
                        }
                    );
                }
            )
        }
    )
}

cleanHtmlWithPhantom(
    "<p>malformed",
    function(newHtml){
        console.log(newHtml);
    }
)
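One thing to watch with this trick of splicing the HTML into the function's source: quotes, newlines, or `$` sequences in the input can corrupt the generated code unless the string is encoded first. Below is a small sketch of that encoding step on its own, runnable in plain Node without PhantomJS. The helper name embedString is hypothetical, and the extra escaping of U+2028/U+2029 is there on the assumption that the target engine is an older one that rejects those characters inside string literals.

```javascript
// Return the source of fn with every occurrence of placeholder replaced
// by a JavaScript string literal that evaluates to value.
function embedString(fn, placeholder, value) {
    var literal = JSON.stringify(value)
        .replace(/\u2028/g, '\\u2028')
        .replace(/\u2029/g, '\\u2029');
    // split/join does a literal substitution, so "$&" or "$1" in the
    // value is not interpreted the way String.prototype.replace would.
    return fn.toString().split(placeholder).join(literal);
}

// The template is never called directly; only its source text is used.
var template = function () { return newHtml; };
var input = '<p class="a">it\'s\nmalformed $& text';
var source = embedString(template, 'newHtml', input);

// Evaluate the generated source to check that the string survives intact.
var roundTripped = new Function('return (' + source + ')();')();
console.log(roundTripped === input);
```

The same encoded source string could be passed to page.evaluate in place of the quote-concatenated version.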