I need to process "dirty" HTML data on the server side. Since I'm using Node.js and jQuery on the server, I can use the full power of JS and the jQuery DOM parser to process my HTML content.
By "dirty" data I mean the following:
<br ><br />Home <a href="http://habrahabr.ru/post/169139/"> gamy code </ a>
<br>
Technique: <a href="http://habrahabr.ru/post/173903/"> Preparation methods </ a> <br>
<br>
In continuation, the technique based on the book Refactoring Refactoring <a href="http://www.ozon.ru/context/detail/id/1308678/">. Improvement of existing code by Martin Fowler. </ A> <br>
<a href="http://habrahabr.ru/post/174779/#habracut"> Read more → </ a>
So it may have several br's at the beginning or in the middle, empty p's, etc. I've tried to use
$('*:empty').remove();
However, if the post begins with
Home <a href="http://habrahabr.ru/post/169139/"> gamy code </ a> <br>
everything before "<a href="http://habrahabr..." is deleted.
So, are there any reliable, production-ready JS/jQuery-based solutions for cleaning up HTML like this: removing empty tags at the beginning, double br's and empty p's in the middle, etc.?
P.S. I don't want to use simple regexps, because there are so many different cases that can occur in such dirty content.
There's a plugin called jQuery-Clean that might be helpful in this scenario: https://code.google.com/p/jquery-clean/
This plugin performs the following operations:
Unfortunately, I was unable to locate anything else. It may be necessary to write some regular expressions to accomplish what you're looking for.
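As a rough illustration of what such hand-written cleanup could look like, here is a minimal sketch in plain JavaScript. `cleanHtml` is a hypothetical helper name, not part of jQuery or jQuery-Clean. It assumes the input looks like the sample in the question: it repairs closing tags such as `</ a>`, drops whitespace-only `<p>` elements, collapses runs of `<br>` into one, and strips leading and trailing `<br>`s. It works on tag and text tokens rather than a real DOM, so it will not cope with every malformed case (for example a `>` inside an attribute value).

```javascript
// Hypothetical helper: a token-based sketch, not a full HTML parser.
function cleanHtml(html) {
  const normalized = html
    // Repair closing tags written with stray spaces, e.g. "</ a>" or "</ A>".
    .replace(/<\/\s*([a-zA-Z][a-zA-Z0-9]*)\s*>/g, (match, name) => `</${name.toLowerCase()}>`)
    // Drop paragraphs that contain nothing but whitespace.
    .replace(/<p(?:\s[^>]*)?>\s*<\/p>/gi, '');

  // Split into tag tokens and text tokens; the capturing group keeps the tags.
  const tokens = normalized.split(/(<[^>]*>)/);

  const out = [];
  let afterBreak = true; // the start of input counts as a break, so leading <br>s are dropped
  for (const token of tokens) {
    if (token === '') continue; // split() yields empty strings between adjacent tags
    if (/^<br\s*\/?>$/i.test(token)) {
      // Keep only the first <br> of a run and normalize its spelling.
      if (!afterBreak) out.push('<br>');
      afterBreak = true;
    } else if (token.trim() === '') {
      // Whitespace-only text: skip it right after a break, otherwise keep one space.
      if (!afterBreak) out.push(' ');
    } else {
      out.push(token);
      afterBreak = false;
    }
  }

  // Drop trailing breaks and whitespace.
  while (out.length > 0 && (out[out.length - 1] === '<br>' || out[out.length - 1] === ' ')) {
    out.pop();
  }
  return out.join('').trim();
}
```

Because leading text such as "Home" is just another token here, it is kept instead of being lost. The cleaned string can then be handed to whatever DOM-based processing you already have.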
There's also one called js-beautify. It will beautify JavaScript, HTML, CSS, and JSON.