Jericho – the incredible HTML parsing machine

February 22nd, 2009 by nadav

Visually, HTML is remarkably similar to XML. No wonder, since the two share a long history with SGML. But while XML documents are generally well formed, real-world HTML is far from perfect. As any (frustrated, man-loathing) web browser developer knows, writing a browser that displays real-world HTML requires a lifetime of work and a great deal of patience. After all, HTML is not primarily an achievement of working groups and standards organizations like the W3C. It is first and foremost the brainchild of a bunch of brilliant geeks, and the outcome of the famous browser wars.

Confronted with the task of parsing HTML, and reluctant to roll our own parser, we went looking for the most potent HTML parser out there – preferably one that’s written in Java, or perhaps one that targets the .NET Framework.

Of all the libraries we checked out back then, Jericho stood out from the crowd, for the following reasons:

  1. It’s not naive. Many libraries out there start out as the experiment of a naive programmer who witnesses the simplicity and elegance of XML and attempts to apply it to HTML. Jericho’s author, Martin Jericho, knows what’s out there.
  2. By default, it does nothing. Unlike JTidy and many other proactive libraries, Jericho only modifies the segments of the document it is instructed to modify. Web pages are generally written to be parsed by the most popular web browsers. Proactive libraries have to “understand” HTML the same way the popular browsers do; until they do, they will keep introducing unintended changes to the page when viewed in a browser.
  3. It does one thing, and does it well. Jericho does not attempt to fix broken XML. It doesn’t try to fit the latest Web 2.0 AJAX framework. It does one thing – parse and surgically modify HTML documents – and does it well.
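
To make the second point concrete, here is a minimal sketch of the idea in plain Java. It is not Jericho's API (Jericho covers this ground with its own Source and OutputDocument classes); the SurgicalEdit class and its replace and render methods are hypothetical names made up for this illustration. The sketch records replacements as character ranges against the original text and, on output, copies everything outside those ranges verbatim, so sloppy markup elsewhere in the page passes through untouched.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of "modify only what you are told to modify". */
public class SurgicalEdit {

    /** A replacement for the half-open character range [begin, end) of the source. */
    private static final class Edit {
        final int begin;
        final int end;
        final String text;

        Edit(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final String source;
    private final List<Edit> edits = new ArrayList<>();

    public SurgicalEdit(String source) {
        this.source = source;
    }

    /** Registers a replacement; the original text itself is never changed. */
    public void replace(int begin, int end, String text) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IllegalArgumentException("range out of bounds");
        }
        edits.add(new Edit(begin, end, text));
    }

    /** Emits the source verbatim, except inside the registered ranges. */
    public String render() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Edit e : sorted) {
            if (e.begin < pos) {
                throw new IllegalStateException("overlapping edits");
            }
            out.append(source, pos, e.begin); // untouched original text
            out.append(e.text);               // the requested change
            pos = e.end;
        }
        out.append(source, pos, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <P>, unquoted attribute, mixed case.
        String html = "<P>Unclosed paragraph<br><a href=http://example.com/old>link</A>";
        String oldUrl = "http://example.com/old";
        // indexOf stands in for a real parser locating the attribute value.
        int begin = html.indexOf(oldUrl);

        SurgicalEdit doc = new SurgicalEdit(html);
        doc.replace(begin, begin + oldUrl.length(), "http://example.com/new");
        System.out.println(doc.render());
        // prints: <P>Unclosed paragraph<br><a href=http://example.com/new>link</A>
    }
}
```

A tidying library would likely have closed the paragraph, quoted the attribute and normalized the tag case along the way. Here the only characters that differ between input and output are the ones inside the replaced range, and with no edits registered render() returns the input unchanged.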

Jericho is licensed under both the Eclipse Public License and the LGPL.
