Any JavaScript/jQuery-based HTML data processor/beautifier?

I need to process "dirty" HTML data on the server side. Since I'm using Node.js and jQuery on the server, I can use the full power of JS and the jQuery DOM parser to process my HTML content.

"Dirty" data means something like the following:

<br ><br />Home <a href="http://habrahabr.ru/post/169139/"> gamy code </ a> 
<br>
Technique: <a href="http://habrahabr.ru/post/173903/"> Preparation methods </ a> <br>
<br>
In continuation, the technique based on the book Refactoring Refactoring <a href="http://www.ozon.ru/context/detail/id/1308678/">. Improvement of existing code by Martin Fowler. </ A> <br>
  <a href="http://habrahabr.ru/post/174779/#habracut"> Read more → </ a>

So it may have several <br>s at the beginning or in the middle, empty <p>s, and so on. I've tried using

$('*:empty').remove();

However, if the post begins with

Home <a href="http://habrahabr.ru/post/169139/"> gamy code </ a> <br>

everything before "<a href="http://habrahabr..." is deleted.

So, are there any reliable, production-ready JS/jQuery-based solutions for cleaning up HTML data: removing empty tags at the beginning, collapsing double <br>s and <p>s in the middle, and so on?

P.S. I don't want to use simple regexps, because there are too many different cases that can occur in such dirty content.

There's a plugin called jQuery-Clean that might be helpful in this scenario: https://code.google.com/p/jquery-clean/

This plugin performs the following operations:

  • fix self-closing tags
  • lower-case tags
  • remove non-standard attributes
  • remove inline style attributes
  • remove inline event attributes
  • optionally remove other attributes
  • tidy unnecessary white space and new lines
  • remove comments
  • remove proprietary Word formatting tags
  • replace tags, e.g. i=>em
  • optionally leave CSS classes in place
  • format and indent HTML

Unfortunately, I was unable to locate anything else. You may still need to write some regular expressions to accomplish exactly what you're looking for.
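As a rough illustration of what such regular expressions might look like, here is a minimal sketch in plain JavaScript. The function name tidyBreaks is hypothetical and is not part of jQuery-Clean or any other library. It only handles the cases visible in your sample (malformed closing tags like "</ a>", leading, trailing, and repeated <br>s, and empty <p>s), so treat it as a starting point rather than a complete cleaner:

```javascript
// Hypothetical helper, not part of any library: a regex-based sketch
// covering only the cases shown in the question's sample.
function tidyBreaks(html) {
  // Matches <br>, <br >, <br/> and <br />.
  const BR = '<br\\s*\\/?>';
  return html
    // Normalize malformed closing tags such as "</ a>" or "</ A>" to "</a>".
    .replace(/<\/\s+([a-zA-Z][a-zA-Z0-9]*)\s*>/g,
             (match, tag) => '</' + tag.toLowerCase() + '>')
    // Remove <p> elements containing only whitespace, &nbsp; or <br> tags.
    .replace(new RegExp('<p(?:\\s[^>]*)?>(?:\\s|&nbsp;|' + BR + ')*<\\/p>', 'gi'), '')
    // Collapse runs of two or more <br> tags into a single <br>.
    .replace(new RegExp('(?:' + BR + '\\s*){2,}', 'gi'), '<br>\n')
    // Drop <br> tags and whitespace at the very beginning...
    .replace(new RegExp('^(?:\\s|' + BR + ')+', 'i'), '')
    // ...and at the very end.
    .replace(new RegExp('(?:\\s|' + BR + ')+$', 'i'), '');
}

// Usage: leading <br>s are dropped, the doubled <br> is collapsed.
console.log(JSON.stringify(tidyBreaks('<br ><br />Home<br><br>End')));
// prints "Home<br>\nEnd"
```

Because this works on the raw string rather than the DOM, it cannot tell whether a <br> sits inside a <pre> or an attribute value, which is exactly the kind of edge case you were worried about with regexps.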

Yes, there's one called js-beautify. It will beautify JavaScript, HTML, CSS, and JSON.