Parsing HTML: A Guide to Select the Right Library

HTML is a markup language with a simple structure. It would be quite easy to build a parser for HTML with a parser generator. Actually, you may not even need to do that if you choose a popular parser generator like ANTLR, because there are already grammars available and ready to be used.

HTML is so popular that there is an even better option: using a library. It is better because it is easier to use and usually provides more features, such as a way to create an HTML document or support for easy navigation through the parsed document. For example, a library usually comes with CSS/jQuery-like selectors to find nodes according to their position in the hierarchy.

The goal of this article is to help you find the right library to process HTML. Whatever you are using, Java, C#, Python, or JavaScript, we have got you covered.

We are not going to see libraries for more specific tasks, such as article extraction or web scraping, like Goose. They typically have restricted uses, while in this article we focus on generic libraries to process HTML.

The Libraries We Considered

Java

Let’s start with the Java libraries to process HTML.

Lagarto and Jerry

Jodd is set of Java micro frameworks, tools and utilities

Among the many Jodd components available there are Lagarto, an HTML parser, and Jerry, defined as jQuery in Java. There are even more components that can do other things. For instance, there are CSSelly, which is a parser for CSS selector strings and powers Jerry, and StripHtml, which reduces the size of HTML documents.

Lagarto works as a traditional parser more than as a typical library. You have to implement a visitor, and the parser will call the proper function each time it encounters a tag or a piece of text. The interface is simple, and Lagarto is quite basic: it just does parsing. Even the building of the (DOM) tree is done by an extension, aptly called DOMBuilder.

While Lagarto could be very useful for advanced parsing tasks, usually you will want to use Jerry. Jerry tries to stay as close as possible to jQuery, but only to its static and HTML manipulation parts. It does not implement animations or ajax calls. Behind the scenes Jerry uses Lagarto and CSSelly, but it is much easier to use. Also, you are probably already familiar with jQuery.

The documentation of Jerry is good and includes a few examples.

HTMLCleaner

HTMLCleaner is a parser that is mainly designed to clean HTML for further processing. As the documentation explains:

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

This explanation also reveals that the project is old, given that in the last few years the broken HTML problem has become much less prominent than it was before. However, it is still updated and maintained. So the disadvantage of using HTMLCleaner is that the interface is a bit old and can be clunky when you need to manipulate HTML.

The advantage is that it works well even on old HTML documents. It can also write the documents in XML or pretty HTML (i.e., with the correct indentation). If you need JDOM and a product that supports XPath, or you even like XML, look no further.

The documentation offers a few examples and API documentation, but nothing more.

Jsoup

jsoup is a Java library for working with real-world HTML

Jsoup is a library with a long history, but a modern attitude:

  • it can handle old and bad HTML, but it is also equipped for HTML5
  • it has powerful support for manipulation, with support for CSS selectors, DOM Traversal and easy addition or removal of HTML
  • it can clean HTML, both to protect against XSS attacks and in the sense that it improves structure and formatting

There is little more to say about jsoup, because it does everything you need from an HTML parser and even more (e.g., cleaning HTML documents). It can be very concise.

For example, it can fetch an HTML document directly from a URL and select a few links. A nice option is the chance to automatically get the absolute URL even if the href attribute references a relative one. This is possible by using the proper setting, which is set implicitly when you fetch the URL with the connect method.

The documentation lacks a tutorial, but it provides a cookbook, which essentially fulfills the same function, and an API reference. There is also an online interactive demo that shows how jsoup parses an HTML document.

C#

Let’s move to the C# libraries to process HTML.

AngleSharp

The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.

AngleSharp is quite simply the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most used companions: CSS and SVG. There is also an extension to integrate scripting in the context of parsing HTML documents: both C# and JavaScript, based on Jint. This means that you can parse HTML documents after they have been modified by JavaScript, whether it is the JavaScript included in the page or a script you add yourself.

AngleSharp fully supports modern conventions for easy manipulation, like CSS selectors and jQuery-like constructs. But it is also well integrated in the .NET world, with support for LINQ for DOM elements. The author mentions that he may want to evolve it into something more than a parser. For the moment it can do simple things like submitting forms.

The documentation, which includes examples showing a few features of AngleSharp, may contain all the information you need, but it certainly could use better organization. For the most part it is delivered within the GitHub project, but there are also tutorials on CodeProject, by the author of the library.

HtmlAgilityPack

HtmlAgilityPack was once considered the default choice for HTML parsing with C#, although some say that was for the lack of better alternatives, because the quality of the code was low. In any case it was essentially abandoned for a few years, until it was recently revived by ZZZ Projects.

In terms of features and quality it is quite lacking, at least compared to AngleSharp. Support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are on the roadmap. The same document also plans a cleanup of the code.

If you need things like XPath, HtmlAgilityPack should be your best choice. In other cases, I do not think it is the best choice right now, unless you are already using it. That is especially true since there is no documentation. Still, the new maintainer and the prospect of better features are good reasons to keep using it, if you are already a user.

Python

Now it is the turn of the Python libraries.

HTML Parser of The Standard Library

The standard Python library is quite rich and even implements an HTML parser. The bad news is that the parser works like a simple and traditional parser, so there are no advanced functionalities geared to handle HTML. The parser essentially makes available a visitor with basic functions to handle the data inside tags, the beginning of tags and the ending of tags.

It works, but it does not really offer anything better than a parser generated by ANTLR or any other generic parser generator.
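To make the visitor style concrete, here is a minimal sketch using html.parser from the standard library. The class name LinkCollector and the sample markup are made up for illustration; only HTMLParser and its handle_* methods come from the library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical visitor that records link targets and visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        # Called for each piece of text between tags.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


collector = LinkCollector()
collector.feed('<p>Read the <a href="/docs">docs</a> or the <a href="/faq">FAQ</a>.</p>')
collector.close()

print(collector.links)
print(collector.text_chunks)
```

Notice that there is no tree and no selector here: anything beyond reacting to tags and text, such as keeping track of nesting, is left to your own code.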

Html5lib

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Html5lib is considered a good library to parse HTML5, but also a very slow one. This is partially because it is written in Python and not in C, like some of the alternatives.

By default the parsing produces an ElementTree tree, but it can be set to create a DOM tree, based on xml.dom.minidom. Html5lib provides walkers that simplify the traversing of the tree and serializers.

The documentation is sparse.

Html5-parser

Html5-parser is a parser for Python, but written in C. It is also just a parser that produces a tree. It exposes literally one function, named parse. The documentation compares it to html5lib, claiming that it is 30x quicker.

To produce the output tree, by default, it relies on the library lxml. The same library also allows you to pretty print the output. The documentation even refers to that of lxml to explain how to navigate the resulting tree.

Lxml

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Lxml is probably the most used low-level parsing library for Python, because of its speed, reliability and features. It is written in Cython, but it relies mostly on the C libraries libxml2 and libxslt. This does not mean that it is only a low-level library, though, but rather that it is also used by other HTML libraries.

The library is designed to work with the ElementTree API, a container for storing XML documents in memory. If you are not familiar with it, the important thing to know is that it is an old-school way of dealing with (X)HTML. Basically, you are going to search with XPath and work as if it were the golden age of XML.
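As a rough illustration of this style, the sketch below uses xml.etree.ElementTree from the standard library, which offers the ElementTree API that lxml is designed to work with. It is a stand-in under stated assumptions, not lxml itself: the standard module only accepts well-formed markup and supports a limited subset of XPath, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed XHTML sample; real-world HTML is rarely this tidy.
XHTML = """<html>
  <body>
    <div class="menu">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <div class="content">
      <a href="https://example.com/">External</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# XPath-style query: every <a> directly inside a <div> whose class is "menu".
menu_links = root.findall(".//div[@class='menu']/a")

hrefs = [a.get("href") for a in menu_links]
texts = [a.text for a in menu_links]

print(hrefs)
print(texts)
```

The query works, but you have to think in terms of paths and predicates rather than the selectors you would write for a stylesheet.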

Fortunately, there is also a specific package for HTML, lxml.html, that provides a few features specifically for parsing HTML. The most important one is that it supports CSS selectors to easily find elements.

There are also many other features, for example:

  • it can submit forms
  • it provides an internal DSL to create HTML documents
  • it can remove unwanted elements from the input, such as script content or CSS style annotations (i.e., it can clean HTML in the semantic sense, eliminating foreign elements)

In short: it can do many things, but not always in the easiest way you can imagine.

The documentation is very thorough and it is also available as one 496-page PDF. There is everything you can think of: tutorials, examples, explanations of the concepts used in the library…

AdvancedHTMLParser

AdvancedHTMLParser is a Python parser that aims to reproduce the behavior of raw JavaScript in Python. By raw JavaScript I mean without jQuery or CSS selector syntax. So, it builds a DOM-like representation that you can interact with.

If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.

The parser also adds a few additional features. For instance, it supports direct modification of attributes (e.g., tag.id = "nope") instead of using the JavaScript-like syntax (e.g., setAttribute function). It can also perform a basic validation of an HTML document (i.e., check for missing closing tokens) and output a prettified HTML.

The most important addition, though, is the support for advanced search and filtering methods for tags. The method find searches values and attributes, while filter is more advanced. The second one depends on another library called QueryableList, which is described as “ORM-style filtering to any list of items”. It is not as powerful as XPath or CSS selectors and it does not use a familiar syntax for HTML manipulation. However, it is similar to the one used for database queries.

The documentation is good enough, though it consists just of what you find in the README of the GitHub project and an example in the source code.

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

As the description on their website reminds you, technically Beautiful Soup is not properly a parser. In fact, it can use a few parsers behind the scenes, like the standard Python parser or lxml. However, in practical terms, if you are using Python and you need to parse HTML, you probably want to use something like Beautiful Soup to work with HTML.

Beautiful Soup is the go-to library when you need an easy way to parse HTML documents. In terms of features it might not provide all that you think of, but it probably gives all that you actually need to use.

While you can navigate the parse tree yourself, using standard functions to move around the tree (e.g., next_element, find_parent), you are probably going to use the simpler methods it provides.

The first of these are CSS selectors, to easily select the needed elements of the document. But there are also simpler functions to find elements according to their name or to directly access the tags (e.g., title). Both are quite powerful, but the first will be more familiar to users of JavaScript, while the second is more pythonic.

There are a few functions to manipulate the document and easily add or remove elements. For instance, there are functions to wrap an element inside a provided one or to do the inverse operation.

Beautiful Soup also gives functions to pretty print the output or get only the text of the HTML document.

The documentation is great: there are explanations and plenty of examples for all features. There is no official tutorial, but given the quality of the documentation it is not really needed.

JavaScript

Of course, we also need to see JavaScript libraries to process HTML. We are going to distinguish between parsing HTML in the browser and in Node.js.

Browser

The browser automatically parses the current HTML document, which means that a parser is always included.

Plain JavaScript or jQuery

HTML parsing is implicit in JavaScript, since the language was basically created to manipulate the DOM. The browser automatically parses HTML for you and makes it accessible in the form of a DOM, and you can access the same functionality yourself. The easiest way is to parse an HTML string into a new element of the current document. However, you can also create a new document altogether.

You can pick between plain JavaScript and using the jQuery library. jQuery offers great support for CSS selectors and a few selectors of its own to easily find DOM elements. Parsing HTML is also made easier: you just need a single function, parseHTML.

The library does more than make it easier to manipulate the DOM: it also deals with forms and asynchronous calls to the server. Given the environment in which it runs, it is also easy to add elements to the page and have them automatically parsed.

jQuery may be the most popular library in existence because it also deals with the issues of compatibility between different browsers. You might start using it because all the examples around the web are in jQuery, and not in plain JavaScript. But then you keep using it, because plain JavaScript is actually less portable between different browsers. There are inconsistencies between the APIs and the behavior of different browsers, which are masked by this wonderful library.

DOMParser

The native DOM manipulation capabilities of JavaScript and jQuery are great for simple parsing of HTML fragments. However, if you actually need to parse a complete HTML or XML source into a DOM document programmatically, there is a better solution: DOMParser. This is classified as an experimental feature, but it is available in all modern browsers.

By using DOMParser you can easily parse an HTML document. Without it, you usually have to resort to tricking the browser into parsing it for you, for instance by adding a new element to the current document.

Node.js

While Node.js can easily work with the web, it does not make parsing functionalities as easily accessible as the browser does. In this sense, JavaScript in Node.js works like a traditional language when it comes to parsing: you have to take care of it yourself.

Cheerio

Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

There is little more to say about Cheerio than that it is jQuery on the server. It should be obvious, but we are going to state it anyway: it looks like jQuery, but there is no browser. This means that Cheerio parses HTML and makes it easy to manipulate, but it does not make things happen. It does not interpret the HTML as if it were in the browser, both in the sense that it might parse things differently from a browser and in the sense that the results of the parsing are not sent directly to the user. If you need that, you will have to take care of it yourself.

The library also includes a few jQuery utility functions, such as slice and eq, to manipulate ranges. It can serialize the names and values of form elements into an array, but it cannot submit them to the server, as jQuery can. That is because Node.js runs on the server.

The developer created this library because they wanted a lightweight alternative to jsdom that was also quicker and less strict in parsing. The last point is needed to parse real and messy websites.

The syntax and usage of Cheerio should be very familiar to any JavaScript developer.

The documentation is limited to the long README of the project, but that is probably all that you need.

Jsdom

jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.

So jsdom is more than an HTML parser, it works as a browser. In the context of parsing, it means that it would automatically add the necessary tags, if you omit them from the data you are trying to parse. For instance, if there were no html tag it would implicitly add it, just like a browser would do.

The fact that it supports the DOM standard means that a jsdom object will have familiar properties, such as document or window, and that manipulating the DOM will be like using plain JavaScript.

You can also optionally specify a few properties, like the URL of the document, the referrer or the user agent. The URL is particularly useful if you need to parse links that contain relative URLs.

Since it is not really related to parsing, we just mention that jsdom has a (virtual) console, support for cookies, etc. In short, all you need to simulate a browser environment. It can also deal with external resources, even JavaScript scripts, which means that it can load and execute them, if you ask it to. Note, however, that there are security risks in doing so, just like when you execute any external code. All of that has a number of caveats that you should read about in the documentation.

One important thing to notice is that you can alter the environment before the parsing happens. For instance, you can add JavaScript libraries that simulate functionalities not supported by the jsdom parser. These libraries are usually called shims.

The documentation is good enough. It might be surprisingly short given the vastness of the project, but it can get away with little, because you can find documentation for using the DOM elsewhere.

Htmlparser2 and related libraries

Felix Böhm has made a few libraries for parsing HTML (as well as XML and RSS), parsing CSS selectors and building a DOM. They are successful and good enough to even power the Cheerio library. The libraries can be used separately, but they also work together.

The HTML parser is quick, but it is also really basic: it just allows you to execute functions when it meets tags or text elements.

They are powerful and great if you need to do advanced and complex manipulation of HTML documents. However, even together, they are somewhat clunky to use if you intend to simply parse HTML and do some simple manipulation of the DOM. In part this is due to the features themselves. For instance, the DOM library just builds the DOM, there are no helpers to manipulate it. In fact, to manipulate the DOM you need yet another library called domutils, for which there is literally zero documentation.

However, the real issue is that, though they work together, they do not provide functionalities on top of each other; they just work alongside each other. They are mostly designed for advanced parsing needs. For example, if you want to build a word processor that uses HTML behind the scenes, these are great. Otherwise you are probably going to look somewhere else.

The difficulty of using them is compounded by the limited documentation. The only good part is the one for the CSS selectors engine.

Parse5

parse5 provides nearly everything you may need when dealing with HTML.

Parse5 is a library meant to be used to build other tools, but it can also be used to parse HTML directly for simple tasks. However, it is somewhat limited in this second regard.

It is easy to use, but the issue is that it does not provide the methods that the browser gives you to manipulate the DOM (e.g., getElementById).

The difficulty is also increased by the limited documentation: it is basically a series of questions that are answered with an API reference (e.g., “I need to parse a HTML string” => Use parse5.parse method). So, it is feasible to use it for simple DOM manipulation, but you are probably not going to want to.

On the other hand, parse5 lists an impressive series of projects that adopt it: jsdom, Angular2 and Polymer. So, if you need a reliable foundation for advanced manipulation or parsing of HTML, it is clearly a good choice.

Summary

We have seen a few libraries for Java, C#, Python, and JavaScript. You might be surprised that, despite the popularity of HTML, there are usually few mature choices for each language. That is because while HTML is very popular and structurally simple, providing support for all the multiple standards is hard work.

On top of that, the actual HTML documents out there might be malformed according to the standard, but they still work in the browser. So, they must work with your library, too. Add to that the need to give an easy way to traverse an HTML document, and the shortage is readily explained.

While there might not always be that many choices, luckily there is always at least one good choice available for all the languages we have considered.
