Parse HTML and preserve original content

Question

I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):

$('.header .title').text('my new content')

on the following HTML document:

<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

and have the following result:

<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:

<html>
  <head></head>
  <body>
    <div class=header><span class=title>my new content</span></div>
    <p>1</p><p>2</p>
    <table><tbody><tr><td>1</td></tr></tbody></table>
  </body>
</html>

That is, they add:

html, head and body elements
closing p tags
tbody

Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.

python
html
ruby
node.js
html-parsing

Answer 1

I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the extremely reliable lxml package, a Python binding to libxml2.

I believe this does exactly what you want, with a quite familiar interface.

from pyquery import PyQuery as pq
html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''
doc = pq(html)

doc('.header .title').text('my new content')
print(doc)

Output:

<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>

The closing p tags can't be helped. lxml keeps the content of the original document, not the quirks of its markup. A paragraph can be written with or without a closing tag, and lxml chooses the more standard form when serializing. I don't believe you'll find a (bug-free) parser that does better.
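
To see exactly which normalizations a serializer applied, one option is to diff the input against the output. The sketch below uses only the standard library's difflib; the serialized string is copied from the output above rather than regenerated, and the variable names are illustrative.

```python
import difflib

original = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
"""

# Copied from the output shown above (an assumption: not regenerated here).
serialized = """<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>
"""

# Line-by-line unified diff of the source against the serialized result.
diff = list(difflib.unified_diff(
    original.splitlines(),
    serialized.splitlines(),
    fromfile="original",
    tofile="serialized",
    lineterm="",
))
print("\n".join(diff))
```

Every line shows up as changed here, which makes the wrapper div, the quoted attributes and the added closing p tags easy to spot.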

Answer 2

Note: I'm on Python 3.

This will only handle a subset of CSS selectors, but it may be enough for your purposes.

from html.parser import HTMLParser

class AttrQuery():
    def __init__(self):
        self.repl_text = ""
        self.selectors = []

    def add_css_sel(self, seltext):
        sels = seltext.split(" ")

        for selector in sels:
            if selector[:1] == "#":
                self.add_selector({"id": selector[1:]})
            elif selector[:1] == ".":
                self.add_selector({"class": selector[1:]})
            elif "." in selector:
                html_tag, html_class = selector.split(".")
                self.add_selector({"html_tag": html_tag, "class": html_class})
            else:
                self.add_selector({"html_tag": selector})

    def add_selector(self, selector_dict):
        self.selectors.append(selector_dict)

    def match_test(self, tagwithattrs_list):
        for selector in self.selectors:
            for condition in selector:
                condition_value = selector[condition]
                if not self._condition_test(tagwithattrs_list, condition, condition_value):
                    return False
        return True

    def _condition_test(self, tagwithattrs_list, condition, condition_value):
        for tagwithattrs in tagwithattrs_list:
            try:
                if condition_value == tagwithattrs[condition]:
                    return True
            except KeyError:
                pass
        return False


class HTMLAttrParser(HTMLParser):
    def __init__(self, html, **kwargs):
        super().__init__(**kwargs)
        self.tagwithattrs_list = []
        self.queries = []
        self.matchrepl_list = []
        self.html = html

    def handle_starttag(self, tag, attrs):
        tagwithattrs = dict(attrs)
        tagwithattrs["html_tag"] = tag
        self.tagwithattrs_list.append(tagwithattrs)

        if debug:
            print("push\t", end="")
            for attrname in tagwithattrs:
                print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
            print("")

    def handle_endtag(self, tag):
        try:
            while True:
                tagwithattrs = self.tagwithattrs_list.pop()
                if debug:
                    print("pop \t", end="")
                    for attrname in tagwithattrs:
                        print("{}:{}, ".format(attrname, tagwithattrs[attrname]), end="")
                    print("")
                if tag == tagwithattrs["html_tag"]: break
        except IndexError:
            raise IndexError("Found a close-tag for a non-existent element.")

    def handle_data(self, data):
        if self.tagwithattrs_list:
            for query in self.queries:
                if query.match_test(self.tagwithattrs_list):
                    line, position = self.getpos()
                    length = len(data)
                    match_replace = (line-1, position, length, query.repl_text)
                    self.matchrepl_list.append(match_replace)

    def addquery(self, query):
        self.queries.append(query)

    def transform(self):
        split_html = self.html.split("\n")
        self.matchrepl_list.reverse()
        if debug:
            print("\nreversed list of matches (line, position, len, repl_text):\n{}\n".format(self.matchrepl_list))

        for line, position, length, repl_text in self.matchrepl_list:
            oldline = split_html[line]
            newline = oldline[:position] + repl_text + oldline[position+length:]
            split_html = split_html[:line] + [newline] + split_html[line+1:]

        return "\n".join(split_html)

See the example usage below.

html_test = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>1</div></td></tr></table>"""

debug = False
parser = HTMLAttrParser(html_test)

query = AttrQuery()
query.repl_text = "Bar"
query.add_selector({"html_tag": "div", "class": "header"})
query.add_selector({"class": "title"})
parser.addquery(query)

query = AttrQuery()
query.repl_text = "InTable"
query.add_css_sel("table tr td.hi #there")
parser.addquery(query)

parser.feed(html_test)

transformed_html = parser.transform()
print("transformed html:\n{}".format(transformed_html))

Output:

transformed html:
<div class=header><span class=title>Bar</span></div>
<p>1<p>2
<table><tr><td class=hi><div id=there>InTable</div></td></tr></table>
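
The core idea above, recording where matching text sits in the source and splicing replacements into the untouched string, can be sketched more compactly. This is a minimal illustration, not a replacement for the classes above: TextSplicer and replace_text are hypothetical names, it matches only on a single class of the immediately enclosing element, and it assumes the matched text sits on one line and contains no character references.

```python
from html.parser import HTMLParser


class TextSplicer(HTMLParser):
    """Record where the text of elements with a given class sits in the source."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.stack = []  # (tag, class list) for each currently open element
        self.spans = []  # (line index, column, length) of each matching text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed <p> tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Assumption: the text is on one line and has no character references,
        # so len(data) equals the length of the raw source span.
        if self.stack and self.target_class in self.stack[-1][1]:
            line, column = self.getpos()
            self.spans.append((line - 1, column, len(data)))


def replace_text(html, target_class, new_text):
    parser = TextSplicer(target_class)
    parser.feed(html)
    parser.close()
    lines = html.split("\n")
    # Splice from the last match to the first so earlier offsets stay valid.
    for line, column, length in reversed(parser.spans):
        old = lines[line]
        lines[line] = old[:column] + new_text + old[column + length:]
    return "\n".join(lines)


source = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>"""

result = replace_text(source, "title", "my new content")
print(result)
```

Because the document is never re-serialized, everything outside the replaced span stays byte-for-byte identical to the input.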

Answer 3

You can use Nokogiri HTML Fragment for this:

require 'nokogiri'

fragment = Nokogiri::HTML.fragment('<div class=header><span class=title>Foo</span></div>
                                    <p>1<p>2
                                    <table><tr><td>1</td></tr></table>')

fragment.css('.title').children.first.replace(Nokogiri::XML::Text.new('HEY', fragment))

fragment.to_s #=> "<div class=\"header\"><span class=\"title\">HEY</span></div>\n<p>1</p><p>2\n</p><table><tr><td>1</td></tr></table>"

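The problem with the p tags persists: omitting the closing tag is allowed in HTML, but Nokogiri serializes the parsed tree rather than the original markup, so the closing tags are written out. Still, this should return your document without the html, head, body and tbody tags.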

Answer 4

I have done this in a few languages, and the best parser I have seen that preserves whitespace and even HTML comments is Jericho, which is unfortunately Java.

Jericho knows how to parse and preserve fragments.

Yes, I know it's Java, but you could easily wrap it in a small RESTful service that takes the payload and converts it. Inside that Java service you could use JRuby, Jython, Rhino (JavaScript), etc. to coordinate with Jericho.

Answer 5

With Python, using lxml.html is fairly straightforward. It meets points 1 and 3 (no added html, head and body elements, no added tbody) and handles the unquoted class attributes, but I don't think much can be done about point 2 (the closing p tags).

import lxml.html

fragment = """<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
"""

page = lxml.html.fromstring(fragment)
for span in page.cssselect('.header .title'):
    span.text = 'my new content'
print(lxml.html.tostring(page, pretty_print=True, encoding='unicode'))

Result:

<div>
<div class="header"><span class="title">my new content</span></div>
<p>1</p>
<p>2
</p>
<table><tr><td>1</td></tr></table>
</div>

Answer 6

This is a slightly different approach, but if this is only for a few simple instances, then perhaps CSS is the answer.

Generated Content

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
  <head>
    <style type="text/css">
    #header.title1:first-child:before {
      content: "This is your title!";
      display: block;
      width: 100%;
    }
    #header.title2:first-child:before {
      content: "This is your other title!";
      display: block;
      width: 100%;
    }
    </style>

  </head>
  <body>
   <div id="header" class="title1">
    <span class="non-title">Blah Blah Blah Blah</span>
   </div>
  </body>
</html>

In this instance you could just have jQuery swap the classes and you'd get the change for free with CSS. I haven't tested this particular usage, but it should work.

We use this for things like outage messages.

Answer 7

If you're running a Node.js app, this module will do exactly what you want. It is a jQuery-style DOM manipulator: https://github.com/cheeriojs/cheerio

An example from their wiki:

var cheerio = require('cheerio'),
$ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <h2 class="title welcome">Hello there!</h2>