A Guide to Parsing: Algorithms and Terminology

We have already introduced a few parsing terms, while listing the major tools and libraries used for parsing in Java, C#, Python and JavaScript. In this article we make a more in-depth presentation of the concepts and algorithms used in parsing, so that you can get a better understanding of this fascinating world.

We have tried to be practical in this article. Our goal is to help practitioners, not to explain the full theory. We just explain what you need to know to understand and build a parser.

After the definition of parsing, the article is divided into three parts:

  1. The Big Picture. A section in which we describe the fundamental terms and components of a parser.
  2. Grammars. In this part we explain the main formats of a grammar and the most common issues in writing them.
  3. Parsing Algorithms. Here we discuss all the most used parsing algorithms and say what they are good for.

Table of Contents

  1. Definition of Parsing
  2. The Big Picture
  3. Grammars
  4. Parsing Algorithms
  5. Summary

Definition of Parsing

The analysis of an input to organize the data according to the rules of a grammar

There are a few ways to define parsing. However the gist remains the same: parsing means finding the underlying structure of the data we are given.

In a way parsing can be considered the inverse of templating: identifying the structure and extracting the data. In templating instead we have a structure and we fill it with data. In the case of parsing you have to determine the model from the raw representation, while for templating you have to combine the data with the model to create the raw representation. The raw representation is usually text, but it can also be binary data.

Fundamentally parsing is necessary because different entities need the data to be in different forms. Parsing allows us to transform data in a way that can be understood by a specific piece of software. The obvious example are programs: they are written by humans, but they must be executed by computers. So humans write them in a form that they can understand, then software transforms them into a form that can be used by a computer.

However parsing might be necessary even when passing data between two pieces of software that have different needs. For instance, it is needed when you have to serialize or deserialize a class.

The Big Picture

In this section we are going to describe the fundamental components of a parser. We are not trying to give you formal explanations, but practical ones.

Regular Expressions

A sequence of characters that can be defined by a pattern

Regular expressions are often touted as the thing you should not use for parsing. This is not strictly correct, because you can use regular expressions for parsing simple input. The problem is that some programmers only know regular expressions. So they use them to try to parse everything, even things they should not. The result is usually a series of regular expressions hacked together that are very fragile.

You can use regular expressions to parse some simpler languages, but this excludes most programming languages, even ones that look simple enough, like HTML. In fact, languages that can be parsed with just regular expressions are called regular languages. There is a formal mathematical definition, but that is beyond the scope of this article.

Though one important consequence of the theory is that regular languages can also be parsed or expressed by a finite state machine. That is to say regular expressions and finite state machines are equally powerful. This is the reason why they are used to implement lexers, as we are going to see later.

A regular language can be defined by a series of regular expressions, while more complex languages need something more. A simple rule of thumb is: if a grammar of a language has recursive, or nested, elements it is not a regular language. For instance, HTML can contain an arbitrary number of tags inside any tag, therefore it is not a regular language and it cannot be parsed using solely regular expressions, no matter how clever.
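As a quick illustration, here is a minimal sketch in Python (the comma-separated list is our own example) of the kind of flat input regular expressions handle well, and a reminder of what they do not:

    import re

    # A flat, regular format: a comma-separated list of integers.
    numbers = re.findall(r"\d+", "12, 7, 42")
    print(numbers)  # ['12', '7', '42']

    # A nested format, like HTML tags inside tags, cannot be matched by a single
    # regular expression, because that requires counting how many levels are open.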

Regular Expressions in Grammars

The familiarity of the typical programmer with regular expressions means that their syntax is often used to define the rules of a lexer or a parser in a grammar. For example, the Kleene star (*) is used in a rule to indicate that a particular element can be present zero or an infinite amount of times.

The definition of the rule should not be confused with how the actual lexer or parser is implemented. You can implement a lexer using the regular expression engine provided by your language. However usually the regular expressions defined in the grammar are actually converted to a finite-state machine to gain better performance.

Structure of a Parser

Having now clarified the role of regular expressions, we can look at the general structure of a parser. A complete parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. The parser needs the lexer because it does not work directly on the text, but on the output produced by the lexer. Not all parsers adopt this two-step scheme: some parsers do not depend on a separate lexer and combine the two steps. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser then scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse an addition.

The lexer scans the text and finds 4, 3, 7 and then a space ( ). The job of the lexer is to recognize that the characters 437 constitute one token of type NUM. Then the lexer finds a + symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM. The input stream is transformed into tokens and then into an AST by the parser.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. In our example a lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.
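To make the division of labor concrete, here is a minimal sketch in Python. The token names NUM and PLUS come from the example above; the input 437 + 734, the regular expressions and the helper names are our own assumptions:

    import re
    from collections import namedtuple

    Token = namedtuple("Token", ["type", "value"])

    # Lexer rules: a sequence of digits is a NUM, a '+' is a PLUS, whitespace is skipped.
    TOKEN_SPEC = [("NUM", r"[0-9]+"), ("PLUS", r"\+"), ("SKIP", r"\s+")]

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            for name, pattern in TOKEN_SPEC:
                match = re.match(pattern, text[pos:])
                if match:
                    if name != "SKIP":
                        tokens.append(Token(name, match.group()))
                    pos += match.end()
                    break
            else:
                raise SyntaxError("Unexpected character: " + text[pos])
        return tokens

    # Parser rule: NUM PLUS NUM corresponds to a sum expression.
    def parse_sum(tokens):
        left, plus, right = tokens
        assert (left.type, plus.type, right.type) == ("NUM", "PLUS", "NUM")
        return ("sum", int(left.value), int(right.value))

    print(lex("437 + 734"))             # [Token(type='NUM', value='437'), ...]
    print(parse_sum(lex("437 + 734")))  # ('sum', 437, 734)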

It is now typical to find suites that can generate both a lexer and a parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. For example, this was the case of the venerable lex & yacc couple: using lex it was possible to generate a lexer, while using yacc it was possible to generate a parser.

Scannerless Parsers

Scannerless parsers operate differently because they process the original text directly, instead of processing a list of tokens produced by a lexer. That is to say, a scannerless parser works as a lexer and a parser combined.

While it is certainly important to know for debugging purposes whether a particular parsing tool is a scannerless parser or not, in many cases it is not relevant when defining a grammar. That is because you usually still define patterns that group sequences of characters to create (virtual) tokens; then you combine them to obtain the final result. This is simply because it is more convenient to do so. In other words, the grammar of a scannerless parser looks very similar to one for a tool with separate steps. Again, you should not confuse how things are defined for your convenience and how things work behind the scenes.

Grammar

A formal grammar is a set of rules that describes syntactically a language

There are two important parts in this definition: a grammar describes a language, but this description pertains only to the syntax of the language and not the semantics. That is to say, it defines its structure, but not its meaning. The correctness of the meaning of the input must be checked, if necessary, in some other way.

For instance, imagine that we want to define a grammar for the language shown in the Definition paragraph.
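A sketch of such a grammar, in an ANTLR-like notation (the exact rules are our own assumption), could be:

    // a sketch in an ANTLR-like notation; the exact rules are an assumption
    greeting : HELLO NAME ;
    HELLO    : 'Hello' ;
    NAME     : [A-Z][a-z]+ ;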

This grammar accepts input such as "Hello Michael" and "Hello Programming". They are both syntactically correct, but we know that "Programming" is not a name, thus it is semantically wrong. A grammar does not specify semantic rules, and they are not verified by a parser. You would need to ensure the validity of the provided name some other way, for example by comparing it to a database of valid names.

Anatomy of a Grammar

To define the elements of a grammar, let us look at an example in the most used format to describe grammars: the Backus-Naur Form (BNF). This format has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. Another notable variant is the Augmented Backus-Naur Form, which is mostly used to describe bidirectional communication protocols.

A typical rule in a Backus-Naur grammar looks like this:
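    <symbol> ::= __expression__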

The <symbol> is a nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like "Hello".

A rule can also be called a production rule. Technically it defines a transformation between the nonterminal on the left and the set of nonterminals and terminals on the right.

Types of Grammars

There are mainly two kinds of grammars used in parsing: regular grammars and context-free grammars. Usually one kind of grammar corresponds to one kind of language: a regular grammar defines a regular language and so on. However, there is also a more recent kind of grammar, called Parsing Expression Grammar (PEG), that is equally powerful as context-free grammars and thus defines a context-free language. The difference between the two is in the notation and the interpretation of the rules.

As we already mentioned the two kinds of languages are in a hierarchy of complexity: regular languages are simpler than context-free languages.

A relatively easy way to distinguish the two kinds of grammar is that the __expression__ of a regular grammar, that is to say the right side of the rule, can only be one of:

  • the empty string
  • a single terminal symbol
  • a single terminal symbol followed by a nonterminal symbol

In practice though this is harder to check, because a particular tool could allow you to use more terminal symbols in one definition. Then the tool itself will automatically transform this expression into an equivalent series of expressions all belonging to one of the three mentioned cases.

So you could write an expression that is incompatible with a regular language, but the expression will be transformed into the proper form. In other words, the tool could provide syntactic sugar for writing grammars.

In a later paragraph we are going to discuss more at length the different kinds of grammars and their formats.

Lexer

A lexer transforms a sequence of characters into a sequence of tokens

Lexers are also known as scanners or tokenizers. Lexers play a role in parsing because they transform the initial input into a form that is more manageable by the proper parser, which works at a later stage. Typically lexers are easier to write than parsers, although there are special cases when both are quite complicated, for instance in the case of C (see the lexer hack).

A very important part of the job of the lexer is dealing with whitespace. Most of the time you want the lexer to discard whitespace. That is because otherwise the parser would have to check for the presence of whitespace between every single token, which would quickly become annoying.

There are cases when you cannot do that, because whitespace is relevant to the language, like in the case of Python, where it is used to identify blocks of code. Even in these cases, though, it is usually the lexer that deals with the problem of distinguishing the relevant whitespace from the irrelevant one. This means that you want the lexer to understand which whitespace is relevant to parsing. For example, when parsing Python you want the lexer to check whether whitespace defines indentation (relevant) or space between the keyword if and the following expression (irrelevant).

Where the Lexer Ends and the Parser Begins

Given that lexers are almost exclusively used in conjunction with parsers, the dividing line between the two can be blurred at times. That is because the parsing must produce a result that is useful for the particular needs of a program. So there is not just one correct way of parsing something: you care only about the way that serves your needs.

For example, imagine that you are creating a program that must parse the logs of a server to save them in a database. For this goal, the lexer will identify the series of numbers and dots and transform them into an IPv4 token.

Then the parser will analyze the sequence of tokens to determine if it is a message, a warning, etc.

What would happen instead if you were developing software that had to use the IP address to identify the country of the visitor? Probably you would want the lexer to identify the octets of the address, for later use, and make IPv4 a parser element.

This is one example of how the same information can be parsed in different ways because of different goals.
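A hedged sketch of the two choices in Python (the log line, the patterns and the token names are assumptions for illustration):

    import re

    line = "192.168.0.1 GET /index.html"

    # Choice 1: the lexer emits a single IPV4 token, which the parser treats as atomic.
    ipv4 = re.match(r"\d{1,3}(?:\.\d{1,3}){3}", line).group()
    print(ipv4)    # 192.168.0.1

    # Choice 2: the lexer emits one NUM token per octet (plus DOT tokens), and a parser
    # rule NUM DOT NUM DOT NUM DOT NUM builds the IPv4 element, keeping the octets usable.
    octets = re.match(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", line).groups()
    print(octets)  # ('192', '168', '0', '1')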

Parser

In the context of parsing, parser can refer both to the software that performs the whole process and also just to the proper parser that analyzes the tokens produced by the lexer. This is simply a consequence of the fact that the parser takes care of the most important and difficult part of the whole process of parsing. By most important we mean the part that the user cares about the most and will actually see. In fact, as we said, the lexer works as a helper to facilitate the work of the parser.

In whichever sense you mean the word parser, its output is an organized structure of the code, usually a tree. The tree can be a parse tree or an abstract syntax tree. They are both trees, but they differ in how closely they represent the actual code written and the intermediate elements defined by the parser. The line between the two can be blurry at times; we are going to see their differences a bit better in a later paragraph.

The form of a tree is chosen because it is an easy and natural way to work with the code at different levels of detail. For example, a class in C# has one body, this body is made of one statement, the block statement, which is a list of statements enclosed in a pair of curly brackets, and so on…

Syntactic vs Semantic Correctness

A parser is a fundamental part of a compiler, or an interpreter, but of course can be part of a variety of other software. For example, in our article Generate diagrams from C# source code using Roslyn we parsed C# files to produce diagrams.

The parser can only check the syntactic correctness of a piece of code, but the compiler can use its output also in the process of checking the semantic validity of the same piece of code.

Let us see an example of code that is syntactically correct, but semantically incorrect.
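The original snippet is not shown here; a minimal Python stand-in with the same property would be:

    # Syntactically valid: the parser accepts it without complaint.
    # Semantically wrong: y is never defined, so running it raises a NameError.
    x = y + 3
    print(x)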

The problem is that one variable (y) is never defined and thus, if executed, the program will fail. However the parser has no way of knowing this, because it does not keep track of variables; it just looks at the structure of the code.

A compiler instead would typically traverse the parse tree a first time and keep a list of all the variables that are defined. Then it traverses the parse tree a second time and checks whether the variables that are used are all properly defined. In this example they are not and it will throw an error. So that is one way the parse tree can also be used to check the semantics by the compiler.

Scannerless Parser

A scannerless parser, or more rarely a lexerless parser, is a parser that performs the tokenization (i.e., the transformation of a sequence of characters into tokens) and the proper parsing in a single step. In theory having a separate lexer and parser is preferable because it allows a clearer separation of objectives and the creation of a more modular parser.

A scannerless parser is a better design for a language where a clear distinction between lexer and parser is difficult or unnecessary. An example is a parser for a markup language, where special markers are inserted in a sea of text. It can also facilitate the handling of languages where traditional lexing is difficult, like C. That is because a scannerless parser can more easily deal with complex tokenizations.

Issues With Parsing Real Programming Languages

In theory contemporary parsing is designed to handle real programming languages; in practice there are challenges with some of them. At least, it might be harder to parse them using normal parser generator tools.

Context-sensitive Parts

Parsing tools are traditionally designed to handle context-free languages, but sometimes languages are context-sensitive. This might be the case to simplify the life of programmers, or simply because of a bad design. I remember reading about a programmer that thought he could produce a parser for C in a week, but then he found so many corner cases that a year later he was still working on it…

A typical example of context-sensitive elements are the so-called soft keywords, i.e., strings that can be considered keywords in certain places, but otherwise can be used as identifiers.

Whitespace

Whitespace plays a significant role in some languages. The most well-known example is Python, where the indentation of a statement indicates whether it is part of a certain block of code.

In most places whitespace is irrelevant even in Python: spaces between words or keywords do not matter. The real problem is indentation, which is used to identify blocks of code. The simplest way to deal with it is to check the indentation at the start of the line and transform it into the proper token, i.e., to create a token when the indentation changes from the previous line.

In practice, a custom function in the lexer produces INDENT and DEDENT tokens when the indentation increases or decreases. These tokens play the role that in C-like languages is played by curly brackets: they indicate the start and end of code blocks.
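A simplified sketch of such a function in Python (real lexers, such as CPython's tokenizer, also keep a stack of indentation levels and handle tabs, blank lines and comments):

    def indentation_tokens(lines):
        # Emit INDENT/DEDENT tokens whenever the indentation level changes.
        tokens, previous = [], 0
        for line in lines:
            current = len(line) - len(line.lstrip(" "))
            if current > previous:
                tokens.append(("INDENT", current))
            elif current < previous:
                tokens.append(("DEDENT", current))
            tokens.append(("LINE", line.strip()))
            previous = current
        return tokens

    source = ["if x:", "    do_something()", "done()"]
    print(indentation_tokens(source))
    # [('LINE', 'if x:'), ('INDENT', 4), ('LINE', 'do_something()'), ('DEDENT', 0), ('LINE', 'done()')]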

This approach makes the lexing context-sensitive, instead of context-free. This complicates parsing and you would normally not want to do it, but you are forced to do it in this case.

Multiple Syntaxes

Another common issue is dealing with the fact that a language might actually include a few different syntaxes. In other words, the same source file may contain sections of code that follow a different syntax. In the context of parsing, effectively the same source file contains different languages. The most famous example is probably the C or C++ preprocessor, which is actually a fairly complicated language on its own and can magically appear inside any random C code.

An easier case to deal with are annotations, which are present in many contemporary programming languages. Among other things they can be used to process the code before it arrives at the compiler. They can command the annotation processor to transform the code in some way, for instance to execute a specific function before executing the annotated one. They are easier to manage, because they can appear only in specific places.

Dangling Else

The dangling else is a common problem in parsing linked to the if-then-else statement. Since the else clause is optional, a sequence of if statements can be ambiguous. For example, something like if one then if two then a() else b().

It is unclear if the else belongs to the first if or the second one.

To be fair, this is for the most part a problem of language design. Most of the solutions do not really complicate parsing that much, for example requiring the use of an endif or requiring the use of blocks to delimit the if statement when it includes an else clause.

However there are also languages that offer no solution, that is to say that are designed ambiguously, for example (you guessed it) C. The conventional approach is to associate the else with the nearest if statement, which resolves the ambiguity with an arbitrary but predictable rule.

Parsing Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: Parse Tree and Abstract Syntax Tree (AST). Technically the parse tree could also be called Concrete Syntax Tree (CST), because it should reflect more concretely the actual syntax of the input, at least compared to the AST.

Conceptually they are very similar. They are both trees: there is a root representing the whole source code, and the root has children nodes that contain subtrees representing smaller and smaller portions of code, until single tokens (terminals) appear in the tree.

The difference is in the level of abstraction: a parse tree might contain all the tokens that appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree, in which only the information relevant to understanding the code is maintained. We are going to see an example of an intermediate rule in the next section.

Some information might be absent both from the AST and the parse tree. For instance, comments and grouping symbols (i.e., parentheses) are usually not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

From Parse Tree to Abstract Syntax Tree

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually each rule corresponds to a specific type of node. A parse tree is usually transformed into an AST by the user, possibly with some help from the parser generator. A common facility is the ability to annotate some rules in the grammar, to exclude the corresponding nodes from the generated tree. Another one is an option to collapse some kinds of nodes if they have only one child.

This makes sense because the parse tree is easier for the parser to produce, since it is a direct representation of the parsing process. However the AST is simpler and easier to process by the following steps of the program. These typically include all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc…

Let us look at a simple example to show the difference between a parse tree and an AST. Let’s start with a look at an example grammar.
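    // an ANTLR-like sketch of such a grammar; the exact rules are an assumption
    sum      : NUM operator NUM ;
    operator : '+' | 'plus' ;
    NUM      : [0-9]+ ;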

In this grammar we can define a sum using either the symbol plus (+) or the string plus as operators. Imagine that you have to parse the following code.
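    10 plus 21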

These could be the resulting parse tree and abstract syntax tree.

A Parse Tree and Abstract Syntax Tree

In the AST the indication of the specific operator has disappeared and all that remains is the operation to be performed. The specific operator is an example of an intermediate rule.

Graphical Representation Of A Tree

The output of a parser is a tree, but the tree can also be represented in graphical ways, to make it easier for the developer to understand. Some parser generator tools can output a file in the DOT language, a language designed to describe graphs (a tree is a particular kind of graph). Then this file is fed to a program that can create a graphical representation starting from this textual description (e.g., Graphviz).

Let’s see a .DOT text, based on the previous sum example.
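    // a sketch of the kind of .DOT text a tool could emit for the sum example
    digraph sum {
         sum -> 10;
         sum -> 21;
    }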

The appropriate tool can create the following graphical representation.

Example image of a tree created by graphviz

If you want to see a bit more of DOT you can read our article Language Server Protocol: A Language Server For DOT With Visual Studio Code. In that article we show how to create a Visual Studio Code plugin that can handle DOT files.

Grammars

Grammars are a set of rules used to describe a language, so it comes naturally to study the formats of the rules. However there are also several elements of a typical grammar that could use further attention. Some of them are due to the fact that a grammar can also be used to define other duties or to execute some code.

Typical Grammar Issues

First we are going to talk about some special rules or issues you might encounter in parsing.

The Missing Tokens

If you read grammars you will probably encounter many in which only a few tokens are defined and not all of them. Like in this grammar:
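    // an ANTLR-like sketch: the token for "Hello" is never defined explicitly
    greeting : 'Hello' NAME ;
    NAME     : [A-Z][a-z]+ ;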

There is no definition for the token "Hello", but since you know that a parser deals with tokens, you may ask yourself how this is possible. The answer is that some tools generate the corresponding token for a string literal for you, to save you some time.

Be aware that this might be possible only under certain conditions. For instance, with ANTLR you must define all tokens yourself if you define separate lexer and parser grammars.

Left-recursive Rules

In the context of parsers, an important feature is the support for left-recursive rules. This means that a rule starts with a reference to itself. Sometimes this reference could also be indirect, that is to say it could appear in another rule referenced by the first one.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but the operands of the additions could be other additions.
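    // a sketch of such rules (indirectly left-recursive via addition and multiplication)
    expression     : addition | multiplication | [0-9]+ ;
    addition       : expression '+' expression ;
    multiplication : expression '*' expression ;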

In this example expression contains an indirect reference to itself via the rules addition and multiplication.

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) ('+') expression (4 + 3) (the rule addition: the first expression corresponds to the option [0-9]+, the second one is another addition). Then 4 + 3 itself can be divided into its two components: expression (4) ('+') expression (3) (the rule addition: both the first and the second expression correspond to the option [0-9]+).

The problem is that left-recursive rules may not be used with some parser generators. The alternative is a long chain of expressions, that takes care also of the precedence of operators. A typical grammar for a parser that does not support such rules would look similar to this one:
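    // a sketch of the usual non-left-recursive chain of expressions
    expression     : addition ;
    addition       : multiplication ('+' multiplication)* ;
    multiplication : atom ('*' atom)* ;
    atom           : [0-9]+ ;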

As you can see, the expressions are defined in the inverse order of precedence. So the parser would put the expressions with the higher precedence at the lowest level of the tree; thus they would be executed first.

Some parser generators support direct left-recursive rules, but not indirect ones. Notice that usually the issue is with the parsing algorithm itself, which does not support left-recursive rules. So the parser generator may transform rules written as left-recursive into the proper form to make them work with its algorithm. In this sense, left-recursive support may be (very useful) syntactic sugar.

How Left-recursive Rules Are Transformed

The specific way in which the rules are transformed varies from one parser generator to the other, but the logic remains the same. The expressions are divided in two groups: the ones with an operator and two operands and the atomic ones. In our example the only atomic expression is a number ([0-9]+), but it could also be an expression between parentheses ((5 + 4)). That is because in mathematics parentheses are used to increase the precedence of an expression.

Once you have these two groups, you maintain the order of the members of the second group and reverse the order of the members of the first group. The reason is that humans reason on a first come, first served basis: it is easier to write the expressions in their order of precedence.

Example expression parse tree

However the final form of the parsing is a tree, which operates on a different principle: you start working on the leaves and rise up. So at the end of this process the root node contains the final result. Which means that in the parse tree the atomic expressions are at the bottom, while the ones with operators appear in the inverse order in which they are applied.

Predicates

Predicates, sometimes called syntactic or semantic predicates, are special rules that are matched only if a certain condition is met. The condition is defined with code in a programming language supported by the tool for which the grammar is written.

Their advantage is that they allow some form of context-sensitive parsing, which is sometimes unavoidable to match certain elements. For instance, they can be used to determine if a sequence of characters that defines a soft keyword is used in a position where it would be a keyword (e.g., the previous token can be followed by the keyword) or where it is a simple identifier.

The disadvantages are that they slow down parsing and they make the grammar dependent on said programming language. That is because the condition is expressed in a programming language and must be checked.

Embedded Actions

Embedded actions identify code that is executed every time the rule is matched. They have the clear disadvantage of making the grammar harder to read, since the rules are surrounded by code. Furthermore, just like predicates, they break the separation between a grammar that describes the language and the code that manipulates the results of the parsing.

Actions are frequently used by less sophisticated parser generators as the only way to easily execute some code when a node is matched. Using these parser generators, the only alternative would be to traverse the tree and execute the proper code yourself. More advanced tools instead allow you to use the visitor pattern to execute arbitrary code when needed, and also to govern the traversing of the tree.

They can also be used to add certain tokens or to change the generated tree. While ugly, this might be the only practical way to deal with complicated languages, like C, or specific issues, like whitespace in Python.

Formats

There are two main formats for a grammar: BNF (and its variants) and PEG. Many tools implement their own variants of these ideal formats. Some tools use custom formats altogether. A frequent custom format consists of a three-part grammar: options together with custom code, followed by the lexer section and finally the parser one.

Given that a BNF-like format is usually the foundation of a context-free grammar you could also see it identified as the CFG format.

Backus-Naur Form and Its Variants

Backus–Naur Form (BNF) is the most successful format and was even the basis of PEG. However it is quite simple, and thus it is not often used in its base form. A more powerful variant is typically used.

To show why these variants were necessary, let's see an example in BNF: the description of a character.
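    <character> ::= <letter> | <digit>
    <letter>    ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
    <digit>     ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"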

The symbol <letter> can be transformed into any of the letters of the English alphabet, although in our example only lowercase letters are valid. A similar process happens for <digit>, which can indicate any of the alternative digits. The first issue is that you have to list all the alternatives individually; you cannot use character classes like you do with regular expressions. This is annoying, but usually manageable, unless of course you have to list all Unicode characters.

A much harder problem is that there is no easy way to denote optional elements or repetitions. So if you want to do that you have to rely on boolean logic and the alternative (|) symbol.
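    <text> ::= <character> | <text> <character>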

This rule states that <text> can be composed of a character or a shorter <text> followed by a <character>. This would be the parse tree for the word “dog”.

Parse tree for 'dog'

BNF has many other limitations: it makes it complicated to use empty strings or the symbols used by the format (e.g., ::=) in the grammar, it has a verbose syntax (e.g., you have to include nonterminals between < and >), etc.

Extended Backus-Naur Form

To solve some of these limitations Niklaus Wirth created the Extended Backus-Naur Form (EBNF), which includes some concepts from his own Wirth syntax notation.

EBNF is the most used form in contemporary parsing tools, although tools might deviate from the standard notation. EBNF has a much cleaner notation and adopts more operators to deal with concatenation or optional elements.

Let’s see how you would write the previous example in EBNF.
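    (* a sketch: letter and digit are defined as before, listing each alternative *)
    character = letter | digit ;
    text      = character , { character } ;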

The text symbol is now described more easily with the help of the concatenation operator (,) and the repetition one ({ .. }). The resulting parse tree would also be simpler. The standard EBNF still presents some problems, like the unfamiliar syntax. To overcome this issue most parsing tools adopt a syntax inspired by regular expressions and also support additional elements like character classes.

If you need an in-depth coverage of the format you can read our article on EBNF.

ABNF

Augmented BNF (ABNF) is another variant of BNF, mainly developed to describe bidirectional communication protocols and formalized by the IETF with several documents (see RFC 5234 and updates).

ABNF can be as productive as EBNF, but it has some quirks that have limited its adoption outside of internet protocols. For example, until recently the standard dictated that strings were matched in a case-insensitive way, which would have been a problem for matching identifiers in many programming languages.

ABNF has a different syntax from EBNF, for example the alternative operator is the slash (/), and sometimes it is plainly better. For instance, there is no need for a concatenation operator. It also has a few more things than standard EBNF. For instance, it allows you to define numeric ranges, such as %x30-39, which is equivalent to [0-9]. This is also used by the designers themselves to provide basic rules that work like standard character classes and that the end user can reuse. An example of such a rule is ALPHA, which is equivalent to [a-zA-Z].

PEG

Parsing Expression Grammar (PEG) is a format presented by Bryan Ford in a 2004 paper. Technically it derives from an old formal grammar called Top-Down Parsing Language (TDPL). However a simple way to describe it is: EBNF in the real world.

In fact it looks similar to EBNF, but also directly supports widely used things, like character ranges (character classes). Although it also has some differences that are not actually that pragmatic, like using the more formal arrow symbol (←) for assignment, instead of the more common equals symbol (=). The previous example could be written this way with PEG.
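    # a sketch in PEG notation, using character classes
    character ← [a-z] / [0-9]
    text      ← character+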

As you can see, this is the most obvious way a programmer would write it, with character classes and regular expression operators. The only anomalies are the alternative operator (/) and the arrow character, and in fact many implementations of PEG use the equals character.

The pragmatic approach is the foundation of the PEG formalism: it was created to describe programming languages more naturally. That is because context-free grammars have their origin in the work to describe natural languages. In theory, CFG is a generative grammar, while PEG is an analytic grammar.

The first should be a sort of recipe to generate only the valid sentences in the language described by the grammar. It does not have to be related to the parsing algorithm. Instead the second kind defines directly the structure and semantics of a parser for that language.

PEG vs CFG

The practical implications of this theoretical difference are limited: PEG is more closely associated to the packrat algorithm, but that is basically it. For instance, generally PEG (packrat) does not permit left recursion. Although the algorithm itself can be modified to support it, this eliminates the linear-time parsing property. Also, PEG parsers generally are scannerless parsers.

Probably the most important difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and the parser will usually return an error. By usually we mean that some parsers that adopt CFGs can deal with ambiguous grammars, for instance by providing all possible valid results to the developer and letting him sort it out. Instead PEG eliminates ambiguity altogether because the first applicable choice will be chosen, thus a PEG can never be ambiguous.

The disadvantage of this approach is that you have to be more careful in listing possible alternatives, because otherwise you could have unexpected consequences. That is to say some choices could never be matched.

In the following example doge will never be matched: since dog comes first, it will be picked each time.
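    # a sketch: with PEG's ordered choice, 'doge' is unreachable
    dog_rule ← 'dog' / 'doge'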

It is an open area of research whether PEG can describe all grammars that can be defined with CFG, but for all practical purposes they do.

Parsing Algorithms

In theory parsing is a solved problem, but it is the kind of problem that keeps being solved again and again. That is to say that there are many different algorithms, each one with strong and weak points, and they are still improved by academics.

In this section we are not going to teach how to implement every one of the parsing algorithms, but we are going to explain their features. The goal is that you can choose with more awareness which parsing tool to use, or which algorithm to study better and implement for your custom parser.

Overview

Let’s start with a global overview of the features and strategies of all parsers.

Two Strategies

There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are defined in relation to the parse tree generated by the parser. In a simple way:

  • a top-down parser tries to identify the root of the parse tree first, then it moves down the subtrees, until it finds the leaves of the tree.
  • a bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up until it determines the root of the tree.

Let’s see an example, starting with a parse tree.

Example Parse Tree

Example Parse Tree from Wikipedia

The same tree would be generated in a different order by a top-down and a bottom-up parser. In the following images the number indicate the order in which the nodes are created.

Top-down Parse Tree Order

Top-down order of generation of the tree (from Wikipedia)

Bottom-up Parse Tree Order

Bottom-up order of generation of the tree (from Wikipedia)
Traditionally top-down parsers were easier to build, but bottom-up parsers were more powerful. Now the situation is more balanced, mostly because of advancement in top-down parsing strategies.

The concept of derivation is closely associated to the strategies. Derivation indicates the order in which the nonterminal elements that appear in the rule, on the right, are applied to obtain the nonterminal symbol, on the left. Using the BNF terminology, it indicates how the elements that appear in __expression__ are used to obtain <symbol>. The two possibilities are: leftmost derivation and rightmost derivation. The first indicates that the rules are applied from left to right, while the second indicates the opposite.

A simple example: imagine that you are trying to parse the symbol result which is defined as such in the grammar.
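    <result> ::= <expr_one> <expr_two>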

You can apply first the rule for symbol expr_one and then expr_two or vice versa. In the case of leftmost derivation you pick the first option, while for rightmost derivation you pick the second one.

It is important to understand that the derivation is applied depth-first or recursively. That is to say, it is applied on the starting expression then it is applied again on the intermediate result that is obtained. So, in this example, if after applying the rule corresponding to expr_one there is a new nonterminal, that one is transformed first. The nonterminal expr_two is applied only when it becomes the first nonterminal and not following the order in the original rule.

Derivation is associated with the two strategies, because for bottom-up parsing you would apply rightmost derivation, while for top-down parsing you would choose leftmost derivation. Note that this has no effect on the final parsing tree, it just affects the intermediate elements and the algorithm used.

Common Elements

Parsers built with top-down and bottom-up strategies share a few elements that we can talk about.

Lookahead and Backtracking

The terms lookahead and backtracking do not have a different meaning in parsing than the one they have in the larger computer science field. Lookahead indicates the number of elements, following the current one, that are taken into consideration to decide which action to take.

A simple example: a parser might check the next token to decide which rule to apply now. When the proper rule is matched the current token is consumed, but the next one remains in the queue.

Backtracking is a technique of an algorithm. It consists of finding a solution to complex problems by trying partial solutions, and then continuing with the most promising one. If the one that is currently tested fails, then the parser backtracks (i.e., it goes back to the last position that was successfully parsed) and tries another one.

They are especially relevant to LL, LR and LALR parsing algorithms, because parsers for languages that just need one lookahead token are easier to build and quicker to run. The lookahead tokens used by such algorithms are indicated between parentheses after the name of the algorithm (e.g., LL(1), LR(k)). The notation (*) indicates that the algorithm can check an infinite amount of lookahead tokens, although this might affect the performance of the algorithm.

Chart Parsers

Chart parsers are a family of parsers that can be bottom-up (e.g., CYK) or top-down (e.g., Earley). Chart parsers essentially try to avoid backtracking, which can be expensive, by using dynamic programming. Dynamic programming, or dynamic optimization, is a general method to break down a larger problem into smaller subproblems.

A common dynamic programming algorithm used by chart parsers is the Viterbi algorithm. The goal of the algorithm is to find the most likely hidden states given the sequence of known events. Basically, given the tokens that we know, we try to find the most probable rules that have produced them.

The name chart parser derives from the fact that the partial results are stored in a structure called chart (usually the chart is a table). The particular technique of storing partial results is called memoization. Memoization is also used by other algorithms, unrelated to chart parsers, like packrat.

Automatons

Before discussing parsing algorithms we would like to talk about the use of automatons in parsing algorithms. Automatons are a family of abstract machines, among which there is the well known Turing machine.

When it comes to parsers you might hear the term (Deterministic) Pushdown Automaton (PDA), and when you read about lexers you will hear the term Deterministic Finite Automaton (DFA). A PDA is more powerful and complex than a DFA (although still simpler than a Turing machine).

Since they define abstract machines, usually they are not directly related to an actual algorithm. Rather, they describe in a formal way the level of complexity that an algorithm must be able to deal with. If somebody says that to solve problem X you need a DFA, he means that you need an algorithm as powerful as a DFA.

However, since DFAs are state machines, in the case of lexers the distinction is frequently moot. That is because state machines are relatively straightforward to implement (i.e., there are ready-to-use libraries), so most of the time a DFA is implemented with a state machine. That is why we are going to briefly talk about DFAs and why they are frequently used for lexers.

Lexing With a Deterministic Finite Automaton

A DFA is a (finite-)state machine, a concept with which we assume you are familiar. Basically, a state machine has many possible states and a transition function for each of them. These transition functions govern how the machine can move from one state to a different one in response to an event. When used for lexing, the machine is fed the input characters one at a time until it reaches an accepting state (i.e., it can build a token).
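For instance, a toy DFA that accepts NUM tokens (one or more digits) could be sketched in Python like this; the state names and the transition table are assumptions for illustration:

    # A toy DFA that accepts tokens made of one or more digits.
    TRANSITIONS = {
        ("start", "digit"): "number",
        ("number", "digit"): "number",
    }
    ACCEPTING = {"number"}

    def accepts_num(text):
        state = "start"
        for char in text:
            kind = "digit" if char.isdigit() else "other"
            state = TRANSITIONS.get((state, kind))
            if state is None:       # no transition defined: the input is rejected
                return False
        return state in ACCEPTING   # accepted only if we end in an accepting state

    print(accepts_num("437"))   # True
    print(accepts_num("43a7"))  # False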

They are used for a few reasons:

  • it has been proven that they recognize exactly the set of regular languages, that is to say they are equally powerful as regular expressions
  • there are a few mathematical methods to manipulate and check their properties (e.g., whether they can parse all strings or any strings)
  • they can work with an efficient online algorithm (see below)

An online algorithm is one that does not need the whole input to work. In the case of a lexer, it means that it can recognize a token as soon as its characters are seen by the algorithm. The alternative would be that it needed the whole input to identify each token.

In addition to these properties, it is fairly easy to transform a set of regular expressions into a DFA, which makes it possible to input the rules in a simple way that is familiar to many developers. Then you can automatically convert them into a state machine that can work on them efficiently.

Tables of Parsing Algorithms

We provide a table to offer a summary of the main information needed to understand and implement a specific parser algorithm. You can find more implementations by reading our articles that present parsing tools and libraries for Java, C#, Python and JavaScript.

The table lists:

  • a formal description, to explain the theory behind the algorithm
  • a more practical explanation
  • one or two implementations, usually one easier and the other a professional parser. Sometimes, though, there is no easier version or a professional one.
Algorithm | Formal Description | Explanation | Example Implementation
CYK | An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages (PDF) | CKY And Earley Algorithms (PDF) | CYK Algorithm in Golang / Chart-parsers
Earley | An Efficient Context-Free Parsing Algorithm (PDF) | Earley Parsing Explained | Nearley
LL | LL(*): The Foundation of the ANTLR Parser Generator (PDF) | LL Parser on Wikipedia / LL-parser.js | ll(1) parser / ANTLR
LR | On the Translation of Languages from Left to Right (PDF) | Compiler Course | Jison / Bison
Packrat (PEG) | Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking (PDF) | Packrat Parsing: Simple, Powerful, Lazy, Linear Time / Packrat Parsing in Scala (PDF) | Companion code of the thesis / Canopy
Parser Combinator | Constructing natural language interpreters in a lazy functional language (PDF) | Parser combinators explained | Understanding Parser Combinators / Sprache
Pratt | Top Down Operator Precedence | Simple Top-Down Parsing in Python | A Pratt Parser in Python / JSLint

To understand how a parsing algorithm works you can also look at the syntax analytic toolkit. It is an educational parser generator that describes the steps that the generated parser takes to accomplish its objective. It implements an LL and an LR algorithm.

The second table shows a summary of the main features of the different parsing algorithms and for what they are generally used.
Table of parsing algorithms: features and uses

Top-down Algorithms

The top-down strategy is the most widespread of the two strategies and there are several successful algorithms applying it.

LL Parser

LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which rule to apply. The parsing table uses nonterminals and terminals as rows and columns, respectively.

To find the correct rule to apply:

  1. firstly the parser looks at the current token and the appropriate amount of lookahead tokens
  2. then it tries to apply the different rules until it finds the correct match, as in the sketch below.
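A minimal sketch of a table-driven LL(1) parser in Python, for the toy grammar S → 'a' S | 'b' (both the grammar and the table are our own assumptions):

    # Parsing table: (nonterminal, lookahead) -> right-hand side to apply.
    TABLE = {
        ("S", "a"): ["a", "S"],
        ("S", "b"): ["b"],
    }

    def ll1_parse(tokens):
        stack, position = ["S"], 0          # start symbol on the stack
        while stack:
            top = stack.pop()
            lookahead = tokens[position] if position < len(tokens) else None
            if top in ("a", "b"):           # terminal: must match the input
                if top != lookahead:
                    return False
                position += 1
            else:                           # nonterminal: expand using the table
                rule = TABLE.get((top, lookahead))
                if rule is None:
                    return False
                stack.extend(reversed(rule))
        return position == len(tokens)

    print(ll1_parse(list("aab")))  # True
    print(ll1_parse(list("aba")))  # False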

The concept of an LL parser does not refer to a specific algorithm, but more to a class of parsers. They are defined in relation to grammars. That is to say an LL parser is one that can parse an LL grammar. In turn LL grammars are defined in relation to the number of lookahead tokens that are needed to parse them. This number is indicated between parentheses next to LL, so in the form LL(k).

An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar which needs k tokens of lookahead to be parsed. Effectively the concept of an LL(k) grammar is more widely employed than the corresponding parser, which means that LL(k) grammars are used as a yardstick when comparing different algorithms. For instance, you would read that PEG parsers can handle LL(*) grammars.

The Value Of LL Grammars

This use of LL grammars is due both to the fact that LL parsers are widely used and to the fact that they are a bit restrictive. In fact, LL grammars do not support left-recursive rules. You can transform any left-recursive grammar into an equivalent non-recursive form, but this limitation matters for a couple of reasons: productivity and power.

The loss of productivity depends on the requirement that you have to write the grammar in a special way, which takes time. The power is limited because a grammar that might need 1 token of lookahead when written with a left-recursive rule might need 2 or 3 tokens of lookahead when written in a non-recursive way. So this limitation is not merely annoying, but it limits the power of the algorithm, i.e., the grammars it can be used for.

The loss of productivity can be mitigated by an algorithm that automatically transforms a left-recursive grammar into a non-recursive one. ANTLR is a tool that can do that, but, of course, if you are building your own parser, you have to do that yourself.

There are two special kinds of LL(k) grammars: LL(1) and LL(*). In the past the first kind was the only one considered practical, because it is easy to build efficient parsers for it. Up to the point that many computer languages were purposefully designed to be described by an LL(1) grammar. An LL(*) parser, also known as LL-regular, can deal with languages using an infinite amount of lookahead tokens.

On StackOverflow you can read a simple comparison between LL parsers and Recursive Descent parsers or one between LL parsers and LR parsers.

Earley Parser

The Earley parser is a chart parser named after its inventor Jay Earley. The algorithm is usually compared to CYK, another chart parser, that is simpler, but also usually worse in performance and memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial results, it implements a prediction step to decide which rule it is going to try to match next.

The Earley parser fundamentally works by dividing a rule into segments, like in the following example.
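For a hypothetical rule sum → NUM '+' NUM, the segments (the Earley items) would be:

    sum → • NUM '+' NUM
    sum → NUM • '+' NUM
    sum → NUM '+' • NUM
    sum → NUM '+' NUM •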

Then, working on these segments, which can be connected at the dot (.), it tries to reach a completed state, that is to say one with the dot at the end.

The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages, while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no problem with left-recursive grammars. More generally, an Earley parser can also deal with nondeterministic and ambiguous grammars.

It can do that at the risk of a worse performance (O(n³)) in the worst case. However it has a linear time performance for normal grammars. The catch is that the set of languages parsed by more traditional algorithms are the ones we are usually interested in.

There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in a certain way the parsing can be more efficient. I.e., building an LL(1) grammar might be harder for the developer, but the parser can apply it very efficiently. With Earley you do less work, so the parser does more of it.

In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in terms of performance.

Earley Use Cases

So Earley parsers are easy to use, but the advantage, in terms of performance, in the average case might be non-existent. This makes the algorithm great for an educational environment or whenever productivity is more relevant than speed.

In the first case it is useful, for example, because most of the time the grammars your users write work just fine. The problem is that the parser will throw at them obscure and seemingly random errors. Of course the errors are not actually random, but they are due to the limitations of an algorithm that your users do not know or understand. So you are forcing the user to understand the inner workings of your parser to use it, which should be unnecessary.

An example of when productivity is more important than speed might be a parser generator to implement syntax highlighting for an editor that needs to support many languages. In a similar situation, being able to support new languages quickly might be more desirable than completing the task as soon as possible.

Packrat (PEG)

Packrat is often associated to the formal grammar PEG, since they were invented by the same person: Bryan Ford. Packrat was described first in his thesis: Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking. The title says almost everything that we care about: it has a linear time of execution, even though it supports backtracking.

The reason for its efficiency is memoization: the storing of partial results during the parsing process. The drawback, and the reason why the technique was not used until recently, is the quantity of memory it needs to store all the intermediate results. If the memory required exceeds what is available, the algorithm loses its linear time of execution.

Packrat also does not support left-recursive rules, a consequence of the fact that PEG requires always choosing the first option. Actually some variants can support direct left-recursive rules, but at the cost of losing linear complexity.

Packrat parsers can perform with an infinite amount of lookahead, if necessary. This influences the execution time, which in the worst case can be exponential.

Recursive Descent Parser

A recursive descent parser is a parser that works with a set of (mutually) recursive procedures, usually one for each rule of the grammar. Thus the structure of the parser mirrors the structure of the grammar.

The term predictive parser is used in a few different ways: some people mean it as a synonym for top-down parser, some as a recursive descent parser that never backtracks.

The opposite of this second meaning is a recursive descent parser that does backtrack. That is to say one that finds the rule that matches the input by trying each one of the rules in sequence, and then goes back each time it fails.

Typically recursive descent parsers have problems parsing left-recursive rules, because the algorithm would end up calling the same function again and again. A possible solution to this problem is using tail recursion. Parsers that use this method are called tail recursive parsers.

Tail recursion per se is simply recursion that happens at the end of the function. However tail recursion is employed in conjunction with transformations of the grammar rules. The combination of transforming the grammar rules and putting recursion at the end of the process allows the parser to deal with left-recursive rules.

Pratt Parser

A Pratt parser is a widely unused, but much appreciated (by the few that know it) parsing algorithm defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts with a polemic on BNF grammars, which the author argues have wrongly become the exclusive concern of parsing studies. This is one of the reasons for its lack of success. In fact the algorithm does not rely on a grammar, but works directly on tokens, which makes it unusual to parsing experts.

The second reason is that traditional top-down parsers work great if you have a meaningful prefix that helps distinguish between different rules. For example, if you get the token FOR you are looking at a for statement. Since this essentially applies to all programming languages and their statements, it is easy to understand why the Pratt parser did not change the parsing world.

Where the Pratt algorithm shines is with expressions. In fact, the concept of precedence makes it impossible to understand the structure of the input simply by looking at the order in which the tokens are presented.

Basically, the algorithm requires you to assign a precedence value to each operator token and a couple of functions that determine what to do, according to what is on the left and right of the token. Then it uses these values and functions to bind the operations together while it traverses the input.
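
The following Python sketch shows the idea with invented names and a token list of plain strings; it is not Pratt's original formulation, just an illustration of binding powers.

    BINDING_POWER = {"+": 10, "*": 20}          # precedence value for each operator

    def parse_expression(tokens, pos, min_bp=0):
        left = int(tokens[pos])                 # handle the token on its own ("nud")
        pos += 1
        while pos < len(tokens) and BINDING_POWER.get(tokens[pos], 0) > min_bp:
            op = tokens[pos]                    # combine with what is on the left ("led")
            right, pos = parse_expression(tokens, pos + 1, BINDING_POWER[op])
            left = left + right if op == "+" else left * right
        return left, pos

    print(parse_expression("1 + 2 * 3".split(), 0))   # (7, 5): * binds tighter than +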

While the Pratt algorithm has not been widely successful, it is used for parsing expressions. For example, it was adopted by Douglas Crockford (of JSON fame) for JSLint.

Parser Combinator

A parser combinator is a higher-order function that accepts parser functions as input and returns a new parser function as output. A parser function usually means a function that accepts a string and outputs a parse tree.

Parser combinators are modular and easy to build, but they are also slower (they have O(n⁴) complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by hand, but relying on the hard work done by whoever created the parser combinator.
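
Here is a minimal sketch of the idea in Python, with invented helper names: each parser is a function from (text, position) to a result, and the combinators build new parsers out of existing ones.

    def literal(s):
        def parse(text, pos):
            return (s, pos + len(s)) if text.startswith(s, pos) else None
        return parse

    def sequence(p1, p2):                 # combinator: run p1, then p2
        def parse(text, pos):
            r1 = p1(text, pos)
            if r1 is None:
                return None
            r2 = p2(text, r1[1])
            return None if r2 is None else ((r1[0], r2[0]), r2[1])
        return parse

    def either(p1, p2):                   # combinator: try p1, fall back to p2
        def parse(text, pos):
            return p1(text, pos) or p2(text, pos)
        return parse

    greeting = sequence(either(literal("hi"), literal("hello")), literal("!"))
    print(greeting("hello!", 0))          # (('hello', '!'), 6)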

Generally they do not support left-recursive rules, but there are more advanced implementations that do just that. See, for example, the paper Parser Combinators for Ambiguous Left-Recursive Grammars, which also describes an algorithm with polynomial execution time.

Many contemporary implementations are called monadic parser combinators, since they rely on the functional programming structure called a monad. Monads are a fairly complex concept that we cannot hope to explain here. However, basically a monad is able to combine functions and actions relying on a data type. The crucial feature is that the data type specifies how its different values can be combined.

The most basic example is the Maybe monad. This is a wrapper around a normal type, like integer, that returns the value itself when the value is valid (e.g., 567), but a special value Nothing when it is not (e.g., undefined or division by zero). Thus you can avoid using a null value and unceremoniously crashing the program. Instead, the Nothing value is managed normally, like the program would manage any other value.
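
A rough Python sketch of the idea (Python has no built-in Maybe; the class below is invented for illustration):

    class Maybe:
        """A minimal Maybe: wraps a value or represents its absence (Nothing)."""
        def __init__(self, value=None, is_nothing=False):
            self.value = value
            self.is_nothing = is_nothing

        def bind(self, fn):
            # combine computations: a Nothing short-circuits the whole chain
            return self if self.is_nothing else fn(self.value)

    NOTHING = Maybe(is_nothing=True)

    def safe_div(a, b):
        return NOTHING if b == 0 else Maybe(a / b)

    print(Maybe(567).bind(lambda x: safe_div(x, 0)).is_nothing)   # True: no crash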

Bottom-up Algorithms

The main success of the bottom-up strategy is the family of many different LR parsers. The reason for their relative unpopularity is that historically they have been harder to build, although LR parsers are also more powerful than traditional LL(1) parsers. So we mostly concentrate on them, apart from a brief description of CYK parsers.

This means that we avoid talking about the more generic class of Shift-reduce Parser, which also includes LR parsers.

We only say that shift-reduce algorithms work in two steps:

  • Shift: read one token from the input, that becomes a new (momentarily isolated) node
  • Reduce: once the proper rule is matched, join the resulting tree with a preceding existing subtree

Basically the shift step reads the input until completion, while the reduce step joins the subtrees until the final parse tree is built.
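
To illustrate the two steps, here is a toy shift-reduce recognizer in Python for an invented grammar (Expr -> Expr "+" NUM | NUM). A real LR parser would drive shifts and reductions with states and lookahead instead of the hard-coded checks used in this sketch.

    def shift_reduce(tokens):
        stack, pos = [], 0
        while True:
            if stack[-3:] == ["Expr", "+", "NUM"]:    # reduce the longer handle first
                stack[-3:] = ["Expr"]
            elif stack[-1:] == ["NUM"]:               # reduce: NUM -> Expr
                stack[-1] = "Expr"
            elif pos < len(tokens):                   # shift: read one token
                stack.append(tokens[pos])
                pos += 1
            else:
                break
        return stack == ["Expr"]

    print(shift_reduce(["NUM", "+", "NUM", "+", "NUM"]))   # True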

CYK Parser

The Cocke-Younger-Kasami (CYK) algorithm was formulated independently by the three authors. Its notability is due to its great worst-case performance (O(n³)), although it is hampered by a comparatively bad performance in most common scenarios.

However the real disadvantage of the algorithm is that it requires the grammars to be expressed in the Chomsky normal form.

That is because the algorithm relies on the properties of this particular form to be able to split the input in half and try matching all the possibilities. Now, in theory, any context-free grammar can be transformed into a corresponding CNF, but this is seldom practical to do by hand. Imagine being annoyed by the fact that you cannot use left-recursive rules and being asked to learn a special kind of form…
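
For illustration, here is a small Python implementation of the CYK membership test for an invented grammar already in Chomsky normal form; the dictionary format used for the grammar is an assumption of this sketch.

    def cyk_accepts(tokens, grammar, start="S"):
        # grammar in CNF: LHS -> list of right-hand sides, each either
        # a single terminal ("a",) or a pair of non-terminals ("A", "B")
        n = len(tokens)
        if n == 0:
            return False
        # table[length][i] = set of non-terminals deriving tokens[i:i+length]
        table = [[set() for _ in range(n)] for _ in range(n + 1)]
        for i, tok in enumerate(tokens):
            for lhs, rhss in grammar.items():
                if (tok,) in rhss:
                    table[1][i].add(lhs)
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):        # split the span in two parts
                    for lhs, rhss in grammar.items():
                        for rhs in rhss:
                            if (len(rhs) == 2
                                    and rhs[0] in table[split][i]
                                    and rhs[1] in table[length - split][i + split]):
                                table[length][i].add(lhs)
        return start in table[n][0]

    # CNF grammar for the language a^n b^n (n >= 1)
    GRAMMAR = {"S": [("A", "X"), ("A", "B")], "X": [("S", "B")],
               "A": [("a",)], "B": [("b",)]}
    print(cyk_accepts(list("aabb"), GRAMMAR))   # True
    print(cyk_accepts(list("aab"), GRAMMAR))    # False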

The CYK algorithm is used mostly for specific problems. For instance the membership problem: determining whether a string is compatible with a certain grammar. It can also be used in natural language processing to find the most probable parse among many options.

For all practical purposes, if you need to parse any context-free grammar with a great worst-case performance, you want to use an Earley parser.

LR Parser

LR (Left-to-right read of the input, Rightmost derivation) parsers are bottom-up parsers that can handle deterministic context-free languages in linear time, with lookahead and without backtracking. The invention of LR parsers is credited to the renowned Donald Knuth.

Traditionally they have been compared, and competed, with LL parsers. So there is a similar analysis related to the number of lookahead tokens necessary to parse a language. An LR(k) parser can parse grammars that need k tokens of lookahead to be parsed. However LR grammars are less restrictive, and thus more powerful, than the corresponding LL grammars. For example, there is no need to exclude left-recursive rules.

Technically, LR grammars are a superset of LL grammars. Another notable fact is that any LR(k) grammar can be transformed into an LR(1) grammar, so in practice you only need LR(1), and the (k) is usually omitted.

They are also table-based, just like LL parsers, but they need two complicated tables. In very simple terms:

  • one table tells the parser what to do depending on the current token, the state it is in, and the tokens that can possibly follow the current one (lookahead sets)
  • the other one tells the parser to which state to move next

LR parsers are powerful and have great performance, so where is the catch? The tables they need are hard to build by hand and can grow very large for normal computer languages, so they are mostly used through parser generators. If you need to build a parser by hand, you would probably prefer a top-down parser.

Simple LR and Lookahead LR

Parser generators avoid the problem of creating such tables, but they do not solve the issue of the cost of generating and navigating such large tables. So there are simpler alternatives to the Canonical LR(1) parser described by Knuth. These alternatives are less powerful than the original one. They are: the Simple LR parser (SLR) and the Lookahead LR parser (LALR). So in order of power we have: LR(1) > LALR(1) > SLR(1) > LR(0).

The names of the two parsers, both invented by Frank DeRemer, are somewhat misleading: one is not really that simple and the other is not the only one that uses lookahead. We can say that one is simpler, and the other relies more heavily on lookahead to make decisions.

Basically they differ in the tables they employ: mostly they change the “what to do” part and the lookahead sets. This in turn poses different restrictions on the grammars that they can parse. In other words, they use different algorithms to derive the parsing tables from the grammar.

An SLR parser is quite restrictive in practical terms and it is not much used. An LALR parser instead works for most practical grammars and is widely employed. In fact, the popular tools yacc and bison work with LALR parser tables.

Contrary to LR grammars, LALR and SLR grammars are not a superset of LL grammars. They are not easily comparable: some grammars can be covered by one class and not the other, or vice versa.

Generalized LR Parser

Generalized LR parsers (GLR) are more powerful variants of LR parsers. They were described by Bernard Lang in 1974 and implemented for the first time by Masaru Tomita in 1984. The reason for GLR's existence is the need to parse nondeterministic and ambiguous grammars.

The power of a GLR parser is not found in its tables, which are equivalent to those of a traditional LR parser. Instead, it can move to several different states. In practice, when there is an ambiguity it forks new parser(s) that handle that particular case. These parsers might fail at a later stage and be discarded.

The worst-case complexity of a GLR parser is the same as that of an Earley one (O(n³)), though it may have better performance in the best case of deterministic grammars. A GLR parser is also harder to build than an Earley one.

Summary

With this great (or at least large) article we hope to have solved most of the doubts about parsing terms and algorithms: what the terms mean and why to pick a certain algorithm instead of another one. We have not just explained them, but we have given a few pointers for common issues with parsing programming languages.

For reasons of space we could not provide a detailed explanation of all parsing algorithms. So we have also provided a few links to get a deeper understanding of the algorithms: the theory behind them and an explanation of how they work. If you are specifically interested in building things like compilers and interpreters, you can read another of our articles to find resources to create programming languages.

If you are just interested in parsing you may want to read Parsing Techniques, a book that is as comprehensive as it is expensive. It obviously goes into much more depth than we could, but it also covers less used parsing algorithms.

Parsing HTML: A Guide to Select the Right Library

HTML is a markup language with a simple structure. It would be quite easy to build a parser for HTML with a parser generator. Actually, you may not even need to do that, if you choose a popular parser generator like ANTLR. That is because there are already grammars available and ready to be used.

HTML is so popular that there is even a better option: using a library. It is better because it is easier to use and usually provides more features, such as a way to create an HTML document or support easy navigation through the parsed document. For example, usually it comes with a CSS/jQuery-like selector to find nodes according to their position in the hierarchy.

The goal of this article is to help you find the right library to process HTML. Whatever you are using (Java, C#, Python, or JavaScript), we have you covered.

We are not going to see libraries for more specific tasks, such as article extractors or web scraping, like Goose. They have typically restricted uses, while in this article we focus on the generic libraries to process HTML.

The Libraries We Considered

Java

Let’s start with the Java libraries to process HTML.

Lagarto and Jerry

Jodd is set of Java micro frameworks, tools and utilities

Among the many Jodd components available there are Lagarto, an HTML parser, and Jerry, defined as jQuery in Java. There are even more components that can do other things. For instance CSSelly, which is a parser for CSS-selectors strings and powers Jerry, and StripHtml, which reduces the size of HTML documents.

Lagarto works as a traditional parser, more than as the typical library. You have to build a visitor and then the parser will call the proper function each time a tag is encountered. The interface is simple: mainly you have to implement a visitor that will be called for each tag and for each piece of text. Lagarto is quite basic, it just does parsing. Even the building of the (DOM) tree is done by an extension, aptly called DOMBuilder.

While Lagarto could be very useful for advanced parsing tasks, usually you will want to use Jerry. Jerry tries to stay as close as possible to jQuery, but only to its static and HTML manipulation parts. It does not implement animations or ajax calls. Behind the scenes Jerry uses Lagarto and CSSelly, but it is much easier to use. Also, you are probably already familiar with jQuery.

The documentation of Jerry is good and it includes a few examples, such as the following one.

HTMLCleaner

HTMLCleaner is a parser that is mainly designed to clean HTML for further processing. As the documentation explains:

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

This explanation also reveals that the project is old, given that in the last few years the broken HTML problem has become much less prominent than it was before. However, it is still updated and maintained. So the disadvantage of using HTMLCleaner is that the interface is a bit old and can be clunky when you need to manipulate HTML.

The advantage is that it works well even on old HTML documents. It can also write the documents in XML or pretty HTML (i.e., with the correct indentation). If you need JDOM and a product that supports XPath, or you even like XML, look no further.

The documentation offers a few examples and API documentation, but nothing more. The following example comes from it.

Jsoup

jsoup is a Java library for working with real-world HTML

Jsoup is a library with a long history, but a modern attitude:

        • it can handle old and bad HTML, but it is also equipped for HTML5
        • it has powerful support for manipulation, with support for CSS selectors, DOM Traversal and easy addition or removal of HTML
        • it can clean HTML, both to protect against XSS attacks and in the sense that it improves structure and formatting

There is little more to say about jsoup, because it does everything you need from an HTML parser and even more (e.g., cleaning HTML documents). It can be very concise.

In this example it directly fetches an HTML document from a URL and selects a few links. On line 9 you can also see a nice option: the chance to automatically get the absolute URL even if the href attribute references a local one. This is possible by using the proper setting, which is set implicitly when you fetch the URL with the connect method.

The documentation lacks a tutorial, but it provides a cookbook, that essentially fulfills the same function, and an API reference. There is also an online interactive demo that shows how jsoup parses an HTML document.

C#

Let’s move to the C# libraries to process HTML.

AngleSharp

The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.

AngleSharp is quite simply the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most used companions: CSS and SVG. There is also an extension to integrate scripting in the context of parsing HTML documents: both C# and JavaScript, based on Jint. This means that you can parse HTML documents after they have been modified by JavaScript, both the JavaScript included in the page and scripts you add yourself.

AngleSharp fully supports modern conventions for easy manipulation, like CSS selectors and jQuery-like constructs. But it is also well integrated in the .NET world, with support for LINQ on DOM elements. The author mentions that they may want to evolve it into something more than a parser; for the moment it can do simple things like submitting forms.

The following example, from the documentation, shows a few features of AngleSharp.

The documentation may contain all the information you need, but it certainly could use a better organization. For the most part it is delivered within the GitHub project, but there are also tutorials on CodeProject, by the author of the library.

HtmlAgilityPack

HtmlAgilityPack was once considered the default choice for HTML parsing with C#, although some say that was due to the lack of better alternatives, because the quality of the code was low. In any case it was essentially abandoned for the last few years, until it was recently revived by ZZZ Projects.

In terms of features and quality it is quite lacking, at least compared to AngleSharp. Support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are on the roadmap. The same document also plans a cleanup of the code.

If you need things like XPath, HtmlAgilityPack may be your best choice. In other cases, I do not think it is the best choice right now, unless you are already using it. That is especially true since there is no documentation. Though the new maintainer and the prospect of better features are a good reason to keep using it, if you are already a user.

Python

Now it is the turn of the Python libraries.

HTML Parser Of The Standard Library

The standard Python library is quite rich and even implements an HTML parser. The bad news is that the parser works like a simple and traditional parser, so there are no advanced functionalities geared to handling HTML. The parser essentially makes available a visitor with basic functions for handling the data inside tags, and the beginning and ending of tags.
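
A minimal sketch of its callback-based interface, with a made-up input document:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):      # called for every opening tag
            if tag == "a":
                self.links.append(dict(attrs).get("href"))

        def handle_data(self, data):                # called for the text inside tags
            pass

    parser = LinkCollector()
    parser.feed('<p>See <a href="/docs">the docs</a></p>')
    print(parser.links)   # ['/docs']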

It works, but it does not really offer anything better than a parser generated by ANTLR.

Html5lib

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Html5lib is considered a good library for parsing HTML5, but a very slow one, partially because it is written in Python and not in C, like some of the alternatives.

By default the parsing produces an ElementTree tree, but it can be set to create a DOM tree, based on xml.dom.minidom. Html5lib provides walkers that simplify the traversing of the tree and serializers.

Below is a small example showing the parser, walker, and serializer in action.
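
This is a minimal sketch based on our reading of the html5lib API rather than an official example; the input document is made up.

    import html5lib
    from html5lib import treewalkers, serializer

    # parse() produces an ElementTree tree by default
    tree = html5lib.parse("<p>Hello <b>world")

    # a walker turns the tree back into a stream of tokens
    walker = treewalkers.getTreeWalker("etree")
    stream = walker(tree)

    # the serializer writes the token stream back out as HTML
    s = serializer.HTMLSerializer()
    print("".join(s.serialize(stream)))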

It has a sparse documentation.

Html5-parser

Html5-parser is a parser for Python, but written in C. It is also just a parser that produces a tree. It exposes literally one function, named parse. The documentation compares it to html5lib, claiming that it is 30x quicker.

To produce the output tree, by default, it relies on the library lxml. The same library also allows pretty printing the output. It even refers to the documentation of that library to explain how to navigate the resulting tree.
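
A short sketch based on the description above (the input is made up): parse() returns an lxml tree by default, and lxml takes care of the pretty printing.

    from html5_parser import parse
    from lxml import etree

    root = parse("<p>Hello <b>world</b></p>")
    print(etree.tostring(root, pretty_print=True).decode())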

Lxml

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Lxml is probably the most used low-level parsing library for Python, because of its speed, reliability and features. It is written in Cython, but it relies mostly on the C libraries libxml2 and libxslt. This does not mean that it is only a low-level library: it is also used by other HTML libraries.

The library is designed to work with the ElementTree API, a container for storing XML documents in memory. If you are not familiar with it, the important thing to know is that it is an old-school way of dealing with (X)HTML. Basically you are going to search with XPath and work as if it was the golden age of XML.

Fortunately there is also a specific package for HTML, lxml.html, that provides a few features specifically for parsing HTML. The most important one is support for CSS selectors, to easily find elements.
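
A minimal sketch of lxml.html with CSS selectors (the cssselect package must be installed; the document is made up):

    from lxml import html

    doc = html.fromstring(
        '<div id="menu"><a href="/home">Home</a><a href="/blog">Blog</a></div>')
    for link in doc.cssselect("div#menu a"):     # CSS selectors via lxml.html
        print(link.get("href"), link.text_content())
    print(doc.xpath("//a/@href"))                # XPath works on the same tree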

There are also many other features, for example:

        • it can submit forms
        • it provides an internal DSL to create HTML documents
        • it can remove unwanted elements from the input, such as script content or CSS style annotations (i.e., it can clean HTML in the semantic sense, eliminating foreign elements)

In short: it can do many things, but not always in the easiest way you can imagine.

The documentation is very thorough and it is also available as a single 496-page PDF. There is everything you can think of: tutorials, examples, explanations of the concepts used in the library…

AdvancedHTMLParser

AdvancedHTMLParser is a Python parser that aims to reproduce the behavior of raw JavaScript in Python. By raw JavaScript I mean without jQuery or CSS selector syntax. So, it builds a DOM-like representation that you can interact with.

If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.

The parser also adds a few additional features. For instance, it supports direct modification of attributes (e.g., tag.id = "nope") instead of using the JavaScript-like syntax (e.g., the setAttribute function). It can also perform a basic validation of an HTML document (i.e., check for missing closing tokens) and output prettified HTML.

The most important addition, though, is the support for advanced search and filtering methods for tags. The find method searches values and attributes, while filter is more advanced. The second one depends on another library called QueryableList, which is described as “ORM-style filtering to any list of items”. It is not as powerful as XPath or CSS selectors, and it does not use a syntax familiar from HTML manipulation. However, it is similar to the one used for database queries.

The documentation is good enough, though it consists of just what you find in the README of the GitHub project and the example included in the source code.

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

As the description on their website reminds you, technically Beautiful Soup is not properly a parser. In fact, it can use a few parsers behind the scenes, like the standard Python parser or lxml. However, in practical terms, if you are using Python and you need to parse HTML, probably you want to use something like Beautiful Soup to work with HTML.

Beautiful Soup is the go-to library when you need an easy way to parse HTML documents. In terms of features it might not provide all that you can think of, but it probably gives all that you actually need.

While you can navigate the parse tree yourself, using standard functions to move around the tree (e.g., next_element, find_parent), you are probably going to use the simpler methods it provides.

The first is CSS selectors, to easily select the needed elements of the document. But there are also simpler functions to find elements according to their name or by directly accessing the tags (e.g., title). They are both quite powerful, but the first will be more familiar to users of JavaScript, while the other is more pythonic.

There are a few functions to manipulate the document and easily add or remove elements. For instance, there are functions to wrap an element inside a provided one or to do the inverse operation.

Beautiful Soup also gives functions to pretty print the output or get only the text of the HTML document.
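
A small sketch (with a made-up document) of the typical usage described above:

    from bs4 import BeautifulSoup

    html_doc = ('<html><head><title>Example</title></head>'
                '<body><p class="intro">Hello <a href="/about">world</a></p></body></html>')

    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.title.string)          # direct tag access
    print(soup.find_all("a"))         # find elements by name
    print(soup.select("p.intro a"))   # CSS selectors
    print(soup.get_text())            # only the text of the document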

The documentation is great: there are explanations and plenty of examples for all features. There is no official tutorial, but given the quality of the documentation it is not really needed.

JavaScript

Of course we also need to see JavaScript libraries to process HTML. We are going to distinguish between parsing HTML in the browser and in Node.js.

Browser

The browser automatically parses the current HTML document, which means that a parser is always included.

Plain JavaScript or jQuery

HTML parsing is implicit in JavaScript, since it was basically created to manipulate the DOM. This means that the browser automatically parses HTML for you and makes it accessible in the form of a DOM. It also means that you can access the same functionality. The easiest way is by parsing HTML into a new element of the current document. However, you can also create a new document altogether.

You can pick between plain JavaScript and the jQuery library. jQuery offers great support for CSS selectors and a few of its own selectors to easily find DOM elements. Parsing HTML is also made easier: you just need a single function, parseHTML.

The library does other things besides making it easier to manipulate the DOM, such as dealing with forms and asynchronous calls to the server. Given the environment in which it runs, it is also easy to add elements to the page and have them automatically parsed.

jQuery may be the most popular library in existence because it also deals with the issues of compatibility between different browsers. You might start using it because all the examples are in jQuery, and not in plain JavaScript. But then you keep using it, because JavaScript is actually less portable between different browsers. There are inconsistencies between the APIs and the behavior of different browsers, which are masked by this wonderful library.

DOMParser

The native DOM manipulation capabilities of JavaScript and jQuery are great for simple parsing of HTML fragments. However, if you actually need to parse a complete HTML or XML source in a DOM document programmatically, there is a better solution: DOMParser. This is classified as an experimental feature, but it is available in all modern browsers.

By using DOMParser you can easily parse an HTML document. Otherwise you usually have to resort to tricking the browser into parsing it for you, for instance by adding a new element to the current document.

Node.js

While Node.js can easily work with the web, it does not make parsing functionality as easily accessible as the browser does. In this sense, JavaScript in Node.js works like a traditional language when it comes to parsing: you have to take care of it yourself.

Cheerio

Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

There is little more to say about Cheerio than that it is jQuery on the server. It should be obvious, but we are going to state it anyway: it looks like jQuery, but there is no browser. This means that Cheerio parses HTML and makes it easy to manipulate, but it does not make things happen. It does not interpret the HTML as if it were in the browser, both in the sense that it might parse things differently from a browser and in the sense that the results of the parsing are not sent directly to the user. If you need that, you will have to take care of it yourself.

The library also includes a few jQuery utility functions, such as slice and eq, to manipulate ranges. It can serialize the names and values of form elements into an array, but it cannot submit them to the server, as jQuery can. That is because Node.js runs on the server.

The developer created this library because they wanted a lightweight alternative to jsdom, one that was also quicker and less strict in parsing. The last quality is needed to parse real and messy websites.

The syntax and usage of Cheerio should be very familiar to any JavaScript developer.

The documentation is limited to the long README of the project, but that is probably all that you need.

Jsdom

 jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.

So jsdom is more than an HTML parser, it works as a browser. In the context of parsing, this means that it will automatically add the necessary tags if you omit them from the data you are trying to parse. For instance, if there were no html tag it would implicitly add it, just like a browser would do.

The fact that it supports the DOM standard means that a jsdom object will have familiar properties, such as document or window, and that manipulating the DOM would be like using plain JavaScript.

You can also optionally specify a few properties, like the URL of the document, the referrer or the user agent. The URL is particularly useful if you need to parse links that contain local URLs.

Since it is not really related to parsing, we just mention that jsdom has a (virtual) console, support for cookies, etc. In short, all you need to simulate a browser environment. It can also deal with external resources, even JavaScript scripts. Which means that it can load and execute them, if you ask it to. Note however that there are security risks in doing so, just like when you execute any external code. All of that has a number of caveats that you should read about in the documentation.

One important thing to notice is that you can alter the environment before the parsing happens. For instance, you can add JavaScript libraries that simulate functionalities not supported by the jsdom parser. These libraries are usually called shims.

The documentation is good enough. It might be surprisingly short given the vastness of the project, but it can get away with little, because you can find documentation for using the DOM elsewhere.

Htmlparser2 and related libraries

Felix Böhm has made a few libraries to parse HTML (as well as XML and RSS) and CSS selectors, and to build a DOM. They are successful and good enough to even power the Cheerio library. The libraries can be used separately, but they also work together.

The HTML parser is quick, but it is also really basic. The following example shows that it just allows you to execute functions when tags or text elements are encountered.

They are powerful and great if you need to do advanced and complex manipulation of HTML documents. However, even together, they are somewhat clunky to use if you intend to simply parse HTML and do some simple manipulation of the DOM. In part this is due to the features themselves. For instance, the DOM library just builds the DOM, there are no helpers to manipulate it. In fact, to manipulate the DOM you need yet another library called domutils, for which there is literally zero documentation.

However, the issue really is that though they work together, they do not provide functionalities on top of each other, they just work alongside each other. They are mostly designed for advanced parsing needs. For example, if you want to build a word processor that uses HTML behind the scenes, these are great. Otherwise you are probably going to look somewhere else.

The difficulty of using them is compounded by the limited documentation. The only well-documented part is the CSS selectors engine.

Parse5

parse5 provides nearly everything you may need when dealing with HTML.

Parse5 is a library meant to be used to build other tools, but can also be used to parse HTML directly for simple tasks. However it is somewhat limited in this second regard. This is shown by the following example.

It is easy to use, but the issue is that it does not provide the methods that the browser gives you to manipulate the DOM (e.g., getElementById).

The difficulty is also increased by the limited documentation: it is basically a series of questions that are answered with an API reference (e.g., “I need to parse a HTML string” => Use the parse5.parse method). So it is feasible to use it for simple DOM manipulation, but you are probably not going to want to.

On the other hand, parse5 lists an impressive series of projects that adopt it: jsdom, Angular2 and Polymer. So if you need a reliable foundation for advanced manipulation or parsing of HTML, it is clearly a good choice.

Summary

We have seen a few libraries for Java, C#, Python, and JavaScript. You might be surprised that, despite the popularity of HTML, there are usually few mature choices for each language. That is because while HTML is very popular and structurally simple, providing support for all the multiple standards is hard work.

On top of that, many of the actual HTML documents out there are in a wrong form, according to the standard, but they still work in the browser. So they must work with your library, too. Add to that the need to give an easy way to traverse an HTML document, and the shortage is readily explained.

While there might not always be that many choices, luckily there is always at least one good choice available for all the languages we have considered.

EBNF: How to describe the grammar of a language

EBNF: The extended Backus-Naur form

The EBNF is the most commonly used formalism to describe the structure of languages.

In this article we are going to see:

Does it sound like a plan?

Parsing: Tools and Libraries

Parsing_-_tools_and_libraries_-_cover

Receive the guide to your inbox to read it on all your devices when you have time. Learn about parsing in Java, Python, C#, and JavaScript

We won't send you spam. Unsubscribe at any time. Powered by ConvertKit

What is the EBNF?

The EBNF is a way to specify a formal language grammar. It can be considered a metalanguage because it is a language to describe other languages.

A formal language is a language with a precise structure, like programming languages, data languages, or Domain Specific Languages (DSLs). Java, XML, and CSS are all examples of formal languages.

A grammar can be used to define two opposite things:

  • how to recognize the different portions in a piece of code written in the formal language
  • the possible ways to build a valid piece of code in the formal language

For example, a simple grammar could tell us that a document in our language is composed of a list of declarations, and declarations are defined by the sequence of a keyword Xyz and a name.

Based on this we could:

  • recognize in the code sequences of the keyword Xyz and a name as instances of the declarations we have defined
  • generate valid documents in our language by writing a list of declarations, each time inserting the keyword Xyz and a name

While there are two possible usages for a grammar, we are typically interested only in the first one: recognizing if a piece of code is valid for a given language and identifying the different structures typical of the language (like functions, methods, classes, etc.).

What does EBNF mean?

Ok, but what does EBNF stand for? EBNF stands for Extended Backus-Naur Form. It will not surprise you to read that it is an extended version of the Backus-Naur form (BNF).

There is at least one other format derived from BNF, which is called ABNF, for Augmented Backus-Naur Form. ABNF's main purpose is to describe bidirectional communications protocols. EBNF is the most used one.

While there is a standard for EBNF, it is common to see different extensions or slightly different syntaxes being used. In the rest of the article we will add more comments when looking at specific parts of EBNF.

Examples of EBNF grammars

We are going to see some examples of grammars taken from a list available on GitHub. Later we may refer to them while explaining the rules.

Note also that for each language you can have different equivalent grammars. So for the same language you could find grammars which are not exactly the same and yet are all correct.

An EBNF grammar for HTML

An EBNF grammar for TinyC

TinyC is a simplified version of C. We picked it because a grammar for a common programming language would be way too complex to serve as an example. We would typically be looking at grammars longer than 1,000 lines.

How do we typically define a grammar using EBNF?

We define a grammar by specifying how to combine single elements in significant structures. As a first approximation we can consider single words to be elements. The structures correspond to sentences, periods, paragraphs, chapters, and entire documents.

A grammar tells us what are the correct ways to put together single words to obtain sentences, and how we can progress by combining sentences into periods, periods into paragraphs and so on until we get the entire document.

EBNF: Terminals and Non-Terminals

Terminals and non-terminals

In the approximation we used, single words correspond to terminals, while all the structures built on top of them (sentences, periods, paragraphs, chapters, and entire documents) correspond to non-terminals.

What terminals look like

Terminals are sometimes also called tokens. They are the smallest block we consider in our EBNF grammars.

A terminal could be either:

  1. a quoted literal
  2. a regular expression
  3. a name referring to a terminal definition. This option is not part of the EBNF standard, but it is used very frequently. Typically uppercase names are used for these terminal definitions, to distinguish them from non-terminal definitions. These latter definitions are the proper production rules

For example, in the grammar you could use "when" to indicate that exact string. Also regular expressions are used, like /[a-z]+/. Finally, we could group terminal definitions somewhere and then use their names to refer to them. Using definitions has the advantage of permitting their reuse multiple times.

Let’s see some typical terminals:

  • identifiers: these are the names used for variables, classes, functions, methods and so on. Typically most languages use different conventions for different names. For example, in Java class names start with an upper case letter, static constants are written using all upper case letters, while methods and variable names start with a lowercase letter. However these are just best practices: in Java there is just one type of identifier that can be used everywhere. This is not the case for all languages. In languages like Haskell, identifiers used for types must start with an uppercase letter. Another thing to consider is that the definition of identifiers typically overlaps with the definition of keywords. The latter should take precedence, i.e., if a token could be either an identifier or a keyword then it should be recognized as a keyword.
  • keywords: almost every language uses keywords. They are exact strings that are used to indicate the start of a definition (think about class in Java or def in Python), a modifier (public, private, static, final, etc.) or control flow structures (while, for, until, etc.)
  • literals: these permit defining values in our language. We can have string literals, numeric literals, char literals, boolean literals (but we could consider them keywords as well), array literals, map literals, and more, depending on the language
  • separators and delimiters: like colons, semicolons, commas, parentheses, brackets, braces
  • whitespaces: spaces, tabs, newlines. They are typically not meaningful and they can be used everywhere in your code
  • comments: most languages have one or more ways to define comments and commonly they can be used everywhere. Some languages could have more structured forms of documentation comments

Whitespace and comments are typically ignored in EBNF grammars. This is because they can usually appear anywhere in the language, so they would have to be repeated all over the grammar. Some tools have specific options to mark certain terminals as terminals to ignore.

How terminals are defined

Terminals are defined using string constants or regular expressions. Let’s see some typical definitions.

Category | Terminal | Definition | Note
Identifier | Java identifier | /[a-zA-Z$_][a-zA-Z0-9$_]*/ | This is a simplification, because it is technically possible to also use UTF escape sequences in Java identifiers
Identifier | Haskell type identifier | /[A-Z][a-zA-Z0-9$_']*/ |
Identifier | Haskell value identifier | /[a-z][a-zA-Z0-9_']*/ |
Keyword | Some Java keywords | "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class", "const"… | "const" is a keyword in Java even if it is not used by any construct. It is a reserved keyword for future usage (the same is true for "goto")
Keyword | Some Python 3 keywords | "def", "return", "raise", "from", "import", "as", "global", "nonlocal", "assert"… |
Literal | Java string literal | /'"' (~["\\] | '\\' [btnfr"'\\])* '"'/ | This is a simplification. We are ignoring the octal and unicode escape sequences
Literal | Java character literal | /'\'' (~['\\] | '\\' [btnfr"'\\]) '\''/ |
Literal | Java integer literal | /["0"-"9"]((["0"-"9","_"])*["0"-"9"])?/ | A Java integer literal can actually be expressed in decimal, hexadecimal, octal or binary format. We are just considering the decimal format here. This is true also for Java long literals
Literal | Java long literal | /["0"-"9"]((["0"-"9","_"])*["0"-"9"])?('l'|'L')/ |
Literal | Java float literal | /["0"-"9"]((["0"-"9","_"])*["0"-"9"])?'.'(["0"-"9"]((["0"-"9","_"])*["0"-"9"])?)?('f'|'F')/ | A Java float literal can actually be expressed in decimal or hexadecimal format. We are just considering the decimal format here. We are also ignoring the possibility of specifying the exponent
Literal | Java boolean literal | /"true"|"false"/ |
Separator/Delimiter | Some Java separators and delimiters | "(", ")", "{", "}", ",", ";"… |
Separator/Delimiter | Some Ruby separators and delimiters | ",", ";"… |
Whitespace | Java whitespace | /[ \t\r\n\u000C]+/ |
Whitespace | Ruby whitespace | /(' '|'\t')+/ |
Comment | Java line comment | /'//' ~[\r\n]*/ |
Comment | Java block comment | /'/*' .*? '*/'/ |
Comment | Python line comment | /'#' ~[\r\n\f]*/ |

What non-terminals look like

Non-terminals are obtained by grouping terminals and other non-terminals in a hierarchy. After all our goal is to obtain an Abstract Syntax Tree, which is a tree. Our tree will have a root: one non-terminal representing our entire document. The root will contain other non-terminals that will contain other non-terminals and so on.

The picture below shows how we can go from a stream of tokens (or terminals) to an AST, which groups terminals into a hierarchy of non-terminals.

EBNF – From stream of tokens to AST

In the grammar we will define the parser rules that determine how the AST is built.

We have seen that non-terminals represent structures at different levels. Examples of non-terminals are:

  • program/document: represents the entire file
  • module/classes: group several declarations together
  • functions/methods: group statements together
  • statements: these are the single instructions. Some of them can contain other statements. For example, loops have a body which is a list of other statements
  • expressions: these are typically used within statements and can be composed in various ways

How are non-terminals defined? We are going to see that in the next section.

Definition of production rules

An EBNF grammar is substantially a list of production rules. Each production rule tells us how a non-terminal can be composed. We are now going to see the different elements that we can use in such rules.

Terminal

We have seen that a terminal can be defined in-line, specifying a string or a regular expression, or can be defined elsewhere and simply referred to in a rule. The latter method is not technically part of EBNF, but it is commonly used.

Refer to a non-terminal

Similarly to references to terminals we can also refer to non-terminals. However this can lead to left-recursive rules that are typically forbidden. For more details see the paragraph on recursion in grammars.

Sequence

A sequence is simply represented by specifying two elements one after the other.

It is also called concatenation, and in the standard EBNF commas are used between the elements, while in practice you typically just use spaces to separate the elements of the sequence.

Alternative

There are some constructs, or portions of constructs, that can be defined in different ways. To represent this case we use alternatives. The different alternatives are separated by the pipe sign (“|”).

Optional (zero or one time)

An element can appear zero or one time. In other words it can be optional.

In the standard EBNF optional elements are represented inside square brackets. However it is more common to find them represented by a question mark following the optional element.

Zero or more times

An element can appear zero or more times (no upper limit).

In the standard EBNF repeated elements are represented inside curly brackets. However, it is more common to find them represented by an asterisk following the element to repeat.

One or more times

An element can appear one or more times (no upper limit).

In this example we are saying that a program should have at least one statement.

This is not present in standard EBNF. However, it is equivalent to a sequence in which the same element is present twice, with the second occurrence followed by an asterisk, so that the element is effectively present at least once.

Grouping

We can group multiple elements together by using round parentheses. This is typically used because a modifier (optional, zero-or-more, one-or-more) must be applied to a set of elements. It can also be used to control the precedence of operators.

Without using the grouping we would have an alternative between the sequence of SCRIPT_OPEN SCRIPT_BODY (first alternative) and SCRIPT_SHORT_BODY (second alternative).

A few things to consider

We have seen what constructs we can use to define production rules. Now let’s dive into some aspects that we should consider when writing grammars.

Recursion in grammars: left or right?

EBNF lets us define recursive grammars: grammars that have recursive production rules, i.e., production rules that refer to themselves, either at the beginning of the production rule (left-recursive grammars) or at the end (right-recursive grammars).

Left-recursive grammars are more common. Consider this example:

This production rule is left-recursive because the expression could start with… an expression.

The fact is that many tools that process EBNF grammars cannot deal with that, because they risk entering infinite loops. There are ways to refactor left-recursive rules, however they lead to less clear grammars. So, if you can, just use modern tooling that deals with both left- and right-recursive grammars.

Precedence

Precedence could refer to two things: precedence in the EBNF grammar (i.e., in which order operators in grammars are applied) and precedence in the languages defined by EBNF.

This is very important because it permits interpreting correctly expressions like:

In this case we all know that multiplication has precedence over addition. In the languages we are going to define we can have many different rules and we need to define a specific order in which to consider them.

Let’s talk about the second one.

The traditional way to manage precedence is to define a list of different rules that refers to each other. At the top level we will have the more abstract rules, like expression, at the bottom level we will have rules representing the smallest components, like a single term. The typical example is shown in TinyC:

The grammar as defined first parses single terms (id, integer, or expressions between parentheses). Later these can be combined using the plus or the minus sign. Then the less-than operator can be used, and this can be part of an expr, the top-level rule considered.

This is a relatively simple example, but in richer languages we could have 7-12 intermediate rules like test and sum, to represent different groups of operators with the same precedence. Think about multiplication, division, power, comparison operators, logical operators, array access, etc. There are many possible ways to build expressions and you need to define an order of precedence.

Now, some modern tools use just the order in which alternatives are defined to derive the precedence rules. For example in ANTLR you could write the previous example like:

Typical pattern: defining lists

A typical thing we want to define in an EBNF grammar is a list.

We have list of parameters, list of variables, all sorts of lists.

Typically we have a separator between elements (like a COMMA). We could have lists that must have at least one element and lists which can have zero elements. Let’s take a look at how we can implement both:

Syntactic vs Semantic validity

We have seen that EBNF can be used to define a grammar. However, we should consider that a grammar defines what is syntactically valid, but it tells us nothing about what is semantically valid.

What does it mean? Consider these examples:

  • a document containing multiple declarations of elements with the same name
  • a sum between two variables: an integer and a boolean
  • a reference to a variable that was not declared before

We can imagine a language where all these examples are syntactically correct, i.e., they are built according to the grammar of the language. However, they are all semantically incorrect, because they do not respect additional constraints on how the language should be used. These semantic constraints are applied in addition to the grammar, after the structures of the language have been recognized. In the grammar we can say things like "a declaration should be composed of a certain keyword and a name"; with semantic constraints we can say "two declarations should not have the same name". By combining syntactic and semantic rules we can express what is valid in our language.

How do you define semantic rules? By writing code that works on the AST. You cannot do that in the EBNF grammar.
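
As a sketch, here is such a check written in Python against hypothetical AST classes; the node and field names are invented for illustration.

    class Declaration:
        def __init__(self, name):
            self.name = name

    class Document:
        def __init__(self, declarations):
            self.declarations = declarations

    def check_unique_names(document):
        # semantic rule: two declarations should not have the same name
        seen, errors = set(), []
        for decl in document.declarations:
            if decl.name in seen:
                errors.append("duplicate declaration: " + decl.name)
            seen.add(decl.name)
        return errors

    doc = Document([Declaration("foo"), Declaration("bar"), Declaration("foo")])
    print(check_unique_names(doc))   # ['duplicate declaration: foo']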

Where to go from here?

An EBNF grammar is useful to support discussion and to communicate with other language designers. Typically you may want to do more with it: you want to build actual parsers and tools to process languages from your EBNF grammars. I personally like to work with ANTLR. My suggestion is to take a look at the ANTLR Mega Tutorial.

You could be interested in learning more about parsing. We have prepared different articles explaining how to proceed depending on the language you prefer to work with:

And if you are lost and unsure about how to move forward just let us know.

Summary

In this article we aimed to be practical: we have discussed what is done in practice and what you need to start writing grammars using EBNF. We have discussed the differences between typical usages and the standard and tried to give a complete and correct picture.

There are things we did not discuss: for example the history that led to the creation of EBNF, or its roots in the work of Chomsky and other scientists. We have not discussed what a context-free grammar is. The theory tells us that EBNF cannot be used to describe all possible forms of grammars. However, it is powerful enough to describe all the formal languages you should be interested in writing.

In the rest of this website you should find articles about the next steps, like building a compiler or an editor for your language. So, have fun!

Parsing In Python: Tools And Libraries

Parsing in Python: Tools and Libraries

This is an article similar to a previous one we wrote: Parsing in Java, so the introduction is the same. Skip to chapter 3 if you have already read it.

If you need to parse a language, or document, from Python there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • build your own custom parser by hand
  • use a tool or library to generate a parser: for example ANTLR, which you can use to build parsers for any language

Use An Existing Library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. This is typically more than what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building Your Own Custom Parser By Hand

You may need to pick the second option if you have particular needs: either because the language you need to parse cannot be parsed with traditional parser generators, or because you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A Tool Or Library To Generate A Parser

In all other cases the third option should be the default one, because it is the most flexible and has the shortest development time. That is why in this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation

Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from Python (and possibly from other languages)
  • Python libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler compilers. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them, and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each one. We are also concentrating on one target language: Python. This also means that (usually) the parser itself will be written in Python.

To list all possible parsing tools and libraries for all languages would be kind of interesting, but not that useful. That is because there would simply be too many options, and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that this list is accessible to all programmers, we have prepared a short explanation of terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step scheme: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ’. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM. The input stream is transformed into tokens and then into an AST by the parser.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.
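
To make the distinction concrete, here is a small Python sketch with invented rule names; the input string is assumed to be something like "437 + 734".

    import re

    TOKEN_SPEC = [("NUM", r"\d+"), ("PLUS", r"\+"), ("WS", r"\s+")]   # lexer rules

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            for name, pattern in TOKEN_SPEC:
                m = re.match(pattern, text[pos:])
                if m:
                    if name != "WS":              # whitespace is skipped
                        tokens.append((name, m.group()))
                    pos += m.end()
                    break
            else:
                raise SyntaxError("unexpected character: " + text[pos])
        return tokens

    def parse_addition(tokens):
        # parser rule: expression -> NUM PLUS NUM
        (t1, v1), (t2, _), (t3, v3) = tokens
        if (t1, t2, t3) != ("NUM", "PLUS", "NUM"):
            raise SyntaxError("expected NUM PLUS NUM")
        return ("ADD", int(v1), int(v3))

    print(lex("437 + 734"))                  # [('NUM', '437'), ('PLUS', '+'), ('NUM', '734')]
    print(parse_addition(lex("437 + 734")))  # ('ADD', 437, 734)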

Scannerless parsers are different because they process directly the original text, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes used interchangeably: parse tree and Abstract Syntax Tree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

 

Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc..

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms, it is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must start with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left and the right parenthesis were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
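```
<symbol> ::= __expression__
```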

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.
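In a BNF-like notation, such a left-recursive rule could be sketched like this, where <number> stands for a numeric literal (this is only an illustrative sketch, not a grammar taken from any specific tool):

```
<expression> ::= <expression> '+' <expression>
               | <number>
```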

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4 + 3). And then 4 + 3 itself can be divided in its two components.

The problem is that rules of this kind may not be usable with some parser generators. The alternative is a long chain of expressions that also takes care of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect ones.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if the grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually a kind of language corresponds to the same kind of grammar. That is to say there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead with PEG the first applicable choice will be chosen, and this automatically solves some ambiguities.

Another difference is that PEGs use scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG parsers have been unable to deal with left-recursive rules, but some tools have found workarounds for this: either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non-recursive way. Both approaches have downsides: making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Python code.

The parser might produce the AST, which you may have to traverse yourself or which you can traverse with additional ready-to-use classes, such as Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for Python and many other languages. There is also a beta version for TypeScript from the same guy that makes the optimized C# version. ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However, the real added value of its vast community is the large number of grammars available.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided in two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively lexer and parser grammars can be defined in separate files.
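As an illustration of this naming convention, a minimal ANTLR grammar for the additions used earlier could look like the following sketch (illustrative only, not one of the community grammars):

```antlr
grammar Sum;

// parser rules start with a lowercase letter
expression : expression PLUS NUM
           | NUM
           ;

// lexer rules start with an uppercase letter
NUM  : [0-9]+ ;
PLUS : '+' ;
WS   : [ \t\r\n]+ -> skip ;
```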

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.

Lark

A modern parsing library for Python, implementing Earley & LALR(1) and an easy interface

Lark is a parser generator that works as a library. You write the grammar in a string or a file and then use it as an argument to dynamically generate the parser. Lark can use two algorithms: Earley when you need to parse all grammars and LALR(1) when you need speed. Earley can also parse ambiguous grammars, and Lark offers the chance to automatically solve the ambiguity by choosing the simplest option or by reporting all options.

Lark grammars are written in an EBNF format. They cannot include actions. This means that they are clean and readable, but also that you have to traverse the resulting tree yourself, although there is a function that can help with that if you use the LALR algorithm. On the positive side you can also use specific notations in the grammar to automatically generate an AST: you can do that by dropping certain nodes, merging or transforming them.

The following example grammar shows a useful feature of Lark: it includes rules for common things, like whitespace or numbers.
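Here is a minimal sketch of such a grammar (not the article's original example); NUMBER and WS are imported from Lark's bundled "common" grammar.

```python
from lark import Lark

grammar = r"""
    start: sum
    sum: NUMBER "+" NUMBER

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar)
print(parser.parse("437 + 734").pretty())
```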

Lark comes with a tool to convert Nearley grammars into its own format. It also includes a useful function to transform the tree generated by the parser into an image.

It has sufficient documentation, with examples and tutorials available. There is also a small reference.

Lrparsing

lrparsing is an LR(1) parser hiding behind a pythonic interface

Lrparsing is a parser generator whose grammars are defined as Python expressions. These expressions are attributes of a class, each corresponding to a rule of a traditional grammar. They are usually generated dynamically, but the library provides a function to precompile a parse table beforehand.

Given that their format depends on Python, lrparsing grammars can be easy to read for Python developers, but they are harder to read than a traditional grammar.

Lrparsing also provides some basic functions to print the parse tree and the grammar rules for debugging purposes.

The documentation is really good: it explains everything you need to know about the library and it also provides some guidance on creating good grammars (e.g., solving ambiguities). There are also quite complex example grammars, like one for SQLite.

PLY

PLY doesn’t try to do anything more or less than provide the basic lex/yacc functionality. In other words, it’s not a large parsing framework or a component of some larger system.

PLY is a stable and maintained tool with a long history, starting from 2001. It is also quite basic, given that there are no tools for automatic creation of an AST, or anything that a C developer of the previous century would define as fancy stuff. The tool was primarily created as an instructional tool. This explains its simplicity, but it is also the reason why it offers great support for diagnostics and for catching mistakes in the grammar.

A PLY grammar is written in Python code in a BNF-like format. Lexer and parser functions can be used separately. The following example shows only the lexer, but the parser works in the same way.
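A minimal lexer along these lines (a sketch with illustrative token names, not the article's original example) could look like this:

```python
import ply.lex as lex

tokens = ('NUM', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'          # characters to skip silently

def t_NUM(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("437 + 734")
for tok in lexer:
    print(tok.type, tok.value)
```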

The documentation is extensive and clear, with abundant examples and explanations of parsing concepts. All that you need, if you can get past the ’90s looks.

There is a port for RPython called RPLY.

PlyPlus

Plyplus is a general-purpose parser built on top of PLY (LALR(1)), and written in Python. Plyplus features a modern design, and focuses on simplicity without losing power.

PlyPlus is a tool that is built on top of PLY, but it is very different from it. Even the authors and the way the names are written are different. Compared to its parent the documentation is lacking, but the features are many.

You can write a grammar in a .g file or in a string, but the parser is always generated dynamically. The format is based on EBNF, but a grammar can also include special notations to simplify the creation of an AST. These notations allow you to exclude or drop certain rules from the generated tree.

PlyPlus includes a function to draw an image of a parse tree, based upon pydot and graphviz. PlyPlus has a unique feature, too: it allows you to select nodes in the AST using selectors similar to the CSS selectors used in web development. For instance, if you want to find all terminal nodes that contain the letter ‘n’, you can find them like this:

This is a unique feature that can be useful, for example, if you are developing a static analysis or refactoring tool.

Pyleri

Python Left-Right Parser (pyleri) is part of a family of similar parser generators for JavaScript, Python, C and Go.

A grammar for Pyleri must be defined with Python expressions that are part of a class. Once defined, the grammar can be exported as a file defining the grammar in Python or in any other supported language. Apart from this peculiarity, Pyleri is a simple and easy-to-use tool.

PEG

After the CFG parsers, it is time to see the PEG parsers available for Python.

Arpeggio

Arpeggio is a recursive descent parser with backtracking and memoization (a.k.a. packrat parser). Arpeggio grammars are based on PEG formalism.

The documentation defines Arpeggio as a parser interpreter, since parsers are generated dynamically from a grammar. In any case it does not work any differently from many other Python parser generators. A peculiarity of Arpeggio is that you can define a grammar in a textual PEG format or using Python expressions. Actually there are two dialects of PEG, one with a cleaner Python-like syntax and the other the traditional PEG one.

Arpeggio generates a simple parse tree, but it supports the use of a visitor. The visitor can also include a second action to perform after all the tree nodes have been processed. This is used for post-processing; for instance, it can be used to deal with symbol references.

An Arpeggio grammar defined with either a PEG notation or the Python one is usually quite readable. The following example uses Python notation.
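A minimal sketch of a grammar in the Python notation (illustrative rules, not the article's original example): each rule is a plain Python function returning what it matches.

```python
from arpeggio import EOF, OneOrMore, ParserPython
from arpeggio import RegExMatch as _

def number():    return _(r'\d+')
def operation(): return number, OneOrMore(["+", "-"], number)
def calc():      return operation, EOF

parser = ParserPython(calc)
parse_tree = parser.parse("437 + 734 - 5")
print(parse_tree)
```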

There are a couple of options for debugging: verbose and informative output and the generation of DOT files of the parser. The DOT files can be used for creating a visualization of the parser, but you will have to call graphviz yourself. The documentation is comprehensive and well organized.

Arpeggio is the foundation of a more advanced tool for the creation of DSLs, called textX. TextX is made by the same developer that created Arpeggio and it is inspired by the more famous Xtext.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using action annotations to include custom code in the parser. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

The Python file containing the action code.

Parsimonious

Parsimonious aims to be the fastest arbitrary-lookahead parser written in pure Python—and the most usable. It’s based on parsing expression grammars (PEGs), which means you feed it a simplified sort of EBNF notation.

Parsimonious is a no-nonsense tool designed for speed and low usage of RAM. It is also a no-documentation tool: there are not even complete examples. Actually the short README file explains the basics and redirects you to the docstrings for more specific information.

In any case Parsimonious is a good working tool that allows you to dynamically create a parser from a grammar defined in a file or a string. You can also define a visitor to traverse and transform the parsing tree. So, if you are already familiar with the PEG format you do not need to know anything else to use it to its fullest.

A Parsimonious grammar is readable like any other PEG grammar.
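A minimal sketch (illustrative grammar, not the article's original example):

```python
from parsimonious.grammar import Grammar

grammar = Grammar(r"""
    sum    = number ws "+" ws number
    number = ~r"[0-9]+"
    ws     = ~r"\s*"
""")

# The first rule is the default starting rule.
print(grammar.parse("437 + 734"))
```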

pyPEG

pyPEG is a plain and simple intrinsic parser interpreter framework for Python version 2.7 and 3.x

PyPEG is a framework to parse and compose text, which means that you define a grammar in a syntax as powerful as PEG, but you do it in Python code. You then use this grammar to parse and/or compose a text based upon that grammar. Obviously if you compose a text you have to provide the data yourself; in this case it works as a template system.

The syntax for a PyPEG grammar is on the verbose side; frankly, it is too verbose to be productive if you just want to use it for simple parsing. But it is a cool library if you want to parse and manipulate some document in a specific format. For instance, you could use it to transform documentation from one format to another.

PyPEG does not produce a standard tree, but a structure based upon the defined grammar. Look at what happens for the previous example.

TatSu

TatSu (for grammar compiler) is a tool that takes grammars in a variation of EBNF as input, and outputs memoizing (Packrat) PEG parsers in Python.

TatSu is the successor of Grako, another parser generator tool, and it has a good level of compatibility with it. It can create a parser dynamically from a grammar or compile the grammar into a Python module.

TatSu generates PEG parsers, but grammars are defined in a variant of EBNF, though the order of rules matters as is usual for PEG grammars. So it is actually a sort of cross between the two. This variant includes support for dealing with associativity and for simplifying the generated tree or model (more on that later). Support for left-recursive rules is present, but experimental.

TatSu grammars cannot include actions; those can be defined in a separate Python class. Alternatively, you can annotate the grammar if you want to use an object model in place of semantic actions. An object model is a way to separate the parsing process from the entity that is parsed. In practical terms, instead of doing something when a certain rule is matched, you do something when a certain object is defined. This object may be defined by more than one rule.

The following extract example defines an object Multiply that corresponds to the rule multiplication.
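A rough sketch of what such a rule could look like, using TatSu's rule::ClassName annotation for object models; the rule and element names here are hypothetical and not taken from the article's original extract.

```python
import tatsu
from tatsu.semantics import ModelBuilderSemantics

grammar = r"""
    start = multiplication $ ;

    multiplication::Multiply
        = left:number '*' right:number
        ;

    number = /\d+/ ;
"""

parser = tatsu.compile(grammar)
model = parser.parse("3 * 5", semantics=ModelBuilderSemantics())
print(model)   # expected to be a Multiply node with 'left' and 'right' attributes
```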

The object model can then be used with what TatSu calls a walker (essentially a visitor for the model).

The same object model can also be used for code generation, for instance to transform one format into another one. In that case you obviously cannot reuse the walker; instead, you have to define a template class for each object.

TatSu also provides a tool to translate ANTLR grammars, complex trace output and a graphical representation of the tree using pygraphviz. ANTLR grammars may have to be manually adapted to respect PEG constraints.

The documentation is complete: it shows all the features, provides examples and even has a basic introduction to parsing concepts, like the AST.

Waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, Javascript, Python, Ruby and Scheme.

Waxeye can facilitate the creation of an AST by defining nodes in the grammar that will not be included in the generated tree. That is quite useful, but a drawback of Waxeye is that it only generates an AST, in the sense that there is no way to automatically execute an action when you match a node. You have to traverse the tree and execute what you need manually.

One positive side-effect of this limitation is that grammars are easily readable and clean. They are also independent from any language.

A particular feature of Waxeye is that it provides some help to compose different grammars together, and thus it facilitates modularity. For instance, you could create a common grammar for identifiers, which are usually similar in many languages.

Waxeye has a great documentation in the form of a manual that explains basic concepts and how to use the tool for all the languages it supports. There are a few example grammars.

Parser Combinators

They allow you to create a parser by combining different pattern matching functions, that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs.

In practice this means that they are very useful for all the little parsing problems you find. If the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Some readers have suggested funcparserlib to us, but we decided not to include it because it has been unmaintained for a few years.

Parsec.py, Parsy and Pyparsing

A universal Python parser combinator library inspired by Parsec library of Haskell.

That is basically the extent of the documentation on Parsec.py, though there are a couple of examples. If you already know how to use the original Parsec library or one of its many clones you can try to use it. It does not look bad, but the lack of documentation is a problem for new users.

Parsy is an easy way to combine simple, small parsers into complex, larger parsers. If it means anything to you, it’s a monadic parser combinator library for LL(infinity) grammars in the spirit of Parsec, Parsnip, and Parsimmon.

Parsy was an abandoned project for a while, but it was recently taken up by a new maintainer and it is now in good shape. Among other things, the new developer brought the project up to date with modern coding practices (e.g., test coverage).

The project might not be as powerful as an “industrial-strength” parser combinator such as Parsec (the original one), but it has a few nice features. For instance, you can create a generator function to create a parser. It now requires Python 3.3 or later, which should only be a problem for people stuck with Python 2.

The project now has ample documentation, examples and a tutorial. For instance, the documentation shows how to parse a date.
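A minimal sketch in the same spirit (not the documentation's exact example), assuming an ISO-8601 style YYYY-MM-DD date:

```python
from datetime import date
from parsy import regex, string, seq

year  = regex(r"[0-9]{4}").map(int)
month = regex(r"[0-9]{2}").map(int)
day   = regex(r"[0-9]{2}").map(int)
dash  = string("-")

# seq collects the three numbers, << discards the dashes, combine builds the date
fulldate = seq(year << dash, month << dash, day).combine(date)

print(fulldate.parse("2018-05-21"))   # 2018-05-21
```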

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The pyparsing module provides a library of classes that client code uses to construct the grammar directly in Python code.

Pyparsing is a stable and mature software, developed for more than 14 years, which has many examples but still confusing and lacking documentation. While Pyparsing is equally as powerful as a traditional parser combinator, it works a bit differently, and the lack of proper documentation makes this frustrating.

However, if you take the time to learn it on your own, the following example shows that it can be easy to use.
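A minimal sketch along the lines of pyparsing's classic “Hello, World!” example (not the article's original example):

```python
from pyparsing import Word, alphas

# a greeting is a word, a comma, another word and an exclamation mark
greet = Word(alphas) + "," + Word(alphas) + "!"
print(greet.parseString("Hello, World!"))
# -> ['Hello', ',', 'World', '!']
```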

Python Libraries Related To Parsing

Python offers also some other libraries or tools related to parsing.

Parsing Python Inside Python

There is one special case that could be managed in a more specific way: the case in which you want to parse Python code in Python. When it comes to Python, the best choice is to rely on the Python interpreter itself.

The standard reference implementation of Python, known as CPython, includes a few modules to access its internals for parsing: tokenize, parser and ast. You may also be able to use the parser in the PyPy interpreter.
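For instance, a quick sketch that uses the standard tokenize and ast modules on a one-line assignment:

```python
import ast
import io
import tokenize

code = "answer = 4 * 10 + 2"

# the lexer level: tokenize produces the token stream
for tok in tokenize.generate_tokens(io.StringIO(code).readline):
    print(tok.type, repr(tok.string))

# the parser level: ast produces an Abstract Syntax Tree
print(ast.dump(ast.parse(code)))
```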

Parsing With Regular Expressions And The Like

Usually you resort to parsing libraries and tools when regular expressions are not enough. However, there is a good library for Python that can extend the life and usefulness of regular expressions, or of elements of similar complexity.

Regular Expression based parsers for extracting data from natural languages [..]

This library basically just gives you a way to combine Regular Expressions together and hook them up to some callback functions in Python.

Reparse is a simple tool that can nonetheless be quite useful in certain scenarios. The author himself says that it is much simpler and has fewer features than PyParsing or Parboiled.

The basic idea is that you define regular expressions, the patterns in which they can be combined and the functions that are called when an expression or pattern is found. You must define the functions in Python, but expressions and patterns can be defined in YAML, JSON or Python.

In this example from the documentation expressions and patterns are defined in Yaml.

Fields like Matches are there for humans, but can be used for testing by Reparse.

An example function in Python for the pattern.

The file that puts everything together.

Parsing Binary Data: Construct

Instead of writing imperative code to parse a piece of data, you declaratively define a data structure that describes your data. As this data structure is not code, you can use it in one direction to parse data into Pythonic objects, and in the other direction, to build objects into binary data.

And that is it: Construct. You could parse binary data even with some parser generators (e.g. ANTLR), but Construct makes it much easier. It is a sort of DSL combined with a parser combinator to parse binary formats. It gives you a bunch of fields to manage binary data: apart from the obvious ones (e.g. float, string, bytes, etc.), there are a few specialized ones to manage sequences of fields (sequence), groups of them (struct) and a few conditional statements.

It also makes available functions to adapt or validate (test) the data and to debug any problem you find.

As you can see in the following example, it is quite easy to use.
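A minimal sketch with a hypothetical fixed-layout record (not the article's original example):

```python
from construct import Bytes, Const, Int8ub, Int16ub, Struct

record = Struct(
    "signature" / Const(b"REC"),
    "version"   / Int8ub,
    "length"    / Int16ub,
    "payload"   / Bytes(4),
)

raw = b"REC" + bytes([1]) + (4).to_bytes(2, "big") + b"abcd"

parsed = record.parse(raw)               # binary data -> Pythonic object
print(parsed.version, parsed.length, parsed.payload)

rebuilt = record.build(dict(version=1, length=4, payload=b"abcd"))
assert rebuilt == raw                    # the same declaration builds data too
```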

There is a nice amount of documentation and even many example grammars for different kinds of format, such as filesystems or graphics files.

Summary

Every programming language has a different community with its own peculiarities. These differences remain even when we compare the same interests across languages. For instance, when we compare parser tools we can see how Java and Python developers live in different worlds.

The parsing tools and libraries for Python for the most part use very readable grammars and are simple to use. But the most interesting thing is that they cover a very wide spectrum of competence and use cases. There seems to be an uninterrupted line of tools available, from regular expressions, passing through Reparse, to end with TatSu and ANTLR.

Sometimes this means that it can be confusing if you are a parsing expert coming from a different language, because few of these parser generators actually generate parsers: most of them interpret the grammar at runtime. On the other hand, with Python you can really find the perfect library, or tool, for your needs. And to help you with that, we hope that this comparison has been useful for you.

Parsing in JavaScript: Tools and Libraries

This is an article similar to a previous one we wrote: Parsing in Java, so the introduction is the same. Skip to chapter 3 if you have already read it.

If you need to parse a language, or document, from JavaScript there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • building your own custom parser by hand
  • a tool or library to generate a parser: for example ANTLR, that you can use to build parsers for any language


Use An Existing Library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. This is typically more than what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building Your Own Custom Parser By Hand

You may need to pick the second option if you have particular needs, either because the language you need to parse cannot be parsed with traditional parser generators, or because you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A Tool Or Library To Generate A Parser

In all other cases the third option should be the default one, because it is the most flexible and has the shortest development time. That is why in this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation

Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from JavaScript (and possibly from other languages)
  • JavaScript libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler-compilers. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each of them. We are also concentrating on one target language: JavaScript. This also means that (usually) the parser itself will be written in JavaScript.

Listing all possible parser tools and libraries for all languages would be kind of interesting, but not that useful, because there would simply be too many options and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that this list is accessible to all programmers, we have prepared a short explanation of the terms and concepts that you may encounter while searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step scheme: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM.

The input stream is transformed into tokens and then into an AST by the parser

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.

Scannerless parsers are different because they process directly the original text, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes used interchangeably: parse tree and Abstract Syntax Tree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

 

Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc..

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms, it is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must start with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left and the right parenthesis were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
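```
<symbol> ::= __expression__
```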

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4 + 3). And then 4 + 3 itself can be divided in its two components.

The problem is that rules of this kind may not be usable with some parser generators. The alternative is a long chain of expressions that also takes care of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect ones.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if the grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually a kind of language corresponds to the same kind of grammar. That is to say there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead with PEG the first applicable choice will be chosen, and this automatically solves some ambiguities.

Another difference is that PEGs use scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG parsers have been unable to deal with left-recursive rules, but some tools have found workarounds for this: either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non-recursive way. Both approaches have downsides: making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your JavaScript code.

The parser might produce the AST, which you may have to traverse yourself or which you can traverse with additional ready-to-use classes, such as Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically called lexers.

Lexer

Lexer is a lexer that claims to be modelled after flex, though a fairer description would be a short lexer based upon RegExp. The documentation seems minimal, with just a few examples, but the whole thing is 147 lines of code, so it is actually comprehensive.

However, in a few lines it manages to support a few interesting things, and it appears to be quite popular and easy to use. One is that it supports RingoJS, a JavaScript platform on top of the JVM. Another one is the integration with Jison, the Bison clone in JavaScript. If you temper your expectations it can be a useful tool.

There is no grammar: you just use a function to define the RegExp pattern and the action that should be executed when the pattern is matched. So it is a cross between a lexer generator and a lexer combinator. You cannot combine different lexer functions, like in a lexer combinator, but the lexer is only created dynamically at runtime, so it is not a proper lexer generator either.

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for JavaScript and many other languages. There is also a beta version for TypeScript from the same guy that makes the optimized C# version. ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However, the real added value of its vast community is the large number of grammars available.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided in two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively lexer and parser grammars can be defined in separate files.

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.

APG

APG is a recursive-descent parser using a variation of Augmented BNF, which they call Superset Augmented BNF. ABNF is a particular variant of BNF designed to better support bidirectional communications protocols. APG also supports additional operators, like syntactic predicates and custom user-defined matching functions.

It can generate parsers in C/C++, Java and JavaScript. Support for the last language seems superior and more up to date: it has a few more features and is more recently updated. In fact, the documentation says it is designed to have the look and feel of JavaScript RegExp.

Because it is based on ABNF, it is especially well suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.

An APG grammar is very clean and easy to understand.

Jison

Jison generates bottom-up parsers in JavaScript. Its API is similar to Bison’s, hence the name.

Despite the name, Jison can also replace flex, so you do not need a separate lexer, although you can use one or build your own custom lexer. All you need is an object with the functions setInput and lex. It can best recognize languages described by LALR(1) grammars, though it also has modes for LR(0) and SLR(1).

The generated parser does not require a runtime component: you can use it as standalone software.

It has good enough documentation, with a few examples and even a section to try your grammars online, although at times it relies on the Bison manual to cover the gaps in its own documentation.

A Jison grammar can be provided using a custom JSON format or a Bison-styled one. Both require you to use embedded actions if you want to do something when a rule is matched. The following example is in the custom JSON format.

It is very popular and used by many projects, including CoffeeScript and Handlebars.js.

Nearley

nearley uses the Earley parsing algorithm augmented with Joop Leo’s optimizations to parse complex data structures easily. nearley is über-fast and really powerful. It can parse literally anything you throw at it.

The Earley algorithm is designed to easily handle all grammars, including left-recursive and ambiguous ones. Essentially its main advantage is that it should never catastrophically fail. On the other hand, it could be slower than other parsing algorithms. Nearley itself is also able to detect some ambiguous grammars, and it reports multiple results in the case of an ambiguous input.

A Nearley grammar is written in a .ne file that can include custom code. A rule can include an embedded action, which the documentation calls a postprocessing function. You can also use a custom lexer.

Another interesting feature is that you can build custom tokens. You can define them using a tokenizing library, a literal or a test function. The test function must return true if the text corresponds to that specific token.

Nearley includes tools for debugging and understanding your parser. For instance, Unparser can automatically generate random strings that are considered correct by your parser. This is useful to test your parser against random noise or even to generate data from a schema (e.g. a random email address). It also includes a tool to generate SVG railroad diagrams: a graphical way to represent a grammar.


Railroad diagrams from the documentation demo

A Nearley parser requires the Nearley runtime.

Nearley documentation is a good overview of what is available and there is also a third-party playground to try a grammar online. But you will not find a complete explanation of all the features.

PEG

After the CFG parsers, it is time to see the PEG parsers available for JavaScript.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using action annotations to include custom code in the parser. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

The JavaScript file containing the action code.

Ohm

Ohm is a parser generator consisting of a library and a domain-specific language

Ohm grammars are defined in a custom format that can be put in a separate file or in a string. However, the parser is generated dynamically and not with a separate tool. In that sense it works like a parser library more than a traditional parser generator.

A helper function to create an AST is included among the extras.

In Ohm, a grammar defines a language, and semantic actions specify what to do with valid inputs in that language

A grammar is completely separated from semantic actions. This simplifies portability and readability and allows different languages to be supported with the same grammar. At the moment Ohm only supports JavaScript, but more languages are planned for the future. The typical grammar is then clean and readable.

Semantic actions can be implemented using a simple API. In practical terms this ends up working like the visitor pattern, with the difference that it is easier to define multiple groups of semantic actions.

To support debugging, Ohm has a text trace and a (work in progress) graphical visualizer. You can see the graphical visualizer at work and test a grammar in the interactive editor.

The documentation is not that bad, though you have to go under the doc directory to find it. There is no tutorial, but there are a few examples and a reference. In particular the documentation suggests reading a well commented Math example.

PEG.js

PEG.js is a simple parser generator for JavaScript that produces fast parsers with excellent error reporting.

PEG.js can work as a traditional parser generator and create a parser with a tool or can generate one using a grammar defined in the code. It supports different module loaders (e.g. AMD, CommonJS, etc.).

PEG.js has a neat online editor that allows you to write a grammar, test the generated parser and download it.

A PEG.js grammar is usually written in a .pegjs file and can include embedded actions, custom initializing code and semantic predicates, that is to say functions that determine whether a specific match is activated or not.

The documentation is good enough and there are a few example grammars, but no tutorials are available. The popularity of the project has led to the development of third-party tools, like one to generate railroad diagrams, and plugins, like one to generate TypeScript parsers.

Waxeye

Waxeye is a parser generator based on parsing expression grammars (PEGs). It supports C, Java, Javascript, Python, Ruby and Scheme.

Waxeye can facilitate the creation of an AST by defining nodes in the grammar that will not be included in the generated tree. That is quite useful, but a drawback of Waxeye is that it only generates an AST, in the sense that there is no way to automatically execute an action when you match a node. You have to traverse the tree and execute what you need manually.

One positive side-effect of this limitation is that grammars are easily readable and clean. They are also independent from any language.

A particular feature of Waxeye is that it provides some help to compose different grammars together, and thus it facilitates modularity. For instance, you could create a common grammar for identifiers, which are usually similar in many languages.

Waxeye has a great documentation in the form of a manual that explains basic concepts and how to use the tool for all the languages it supports. There are a few example grammars.

Waxeye seems to be maintained, but it is not actively developed.

Parser Combinators

They allow you to create a parser simply with JavaScript code, by combining different pattern matching functions, that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs. Given they are just JavaScript libraries you can easily introduce them into your project: you do not need any specific generation step and you can write all of your code in your favorite editor. Their main advantage is the possibility of being integrated in your traditional workflow and IDE.

In practice this means that they are very useful for all the little parsing problems you find. If the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Bennu, Parjs And Parsimmon

Bennu is a Javascript parser combinator library based on Parsec. The Bennu library consists of a core set of parser combinators that implement Fantasy Land interfaces. More advanced functionality such as detailed error messaging, custom parser state, memoization, and running unmodified parsers incrementally is also supported.

All of these libraries are inspired by Parsec. Bennu and Parsimmon are the oldest, and Bennu is the more advanced of the two. They all have extensive documentation, but no tutorial. Bennu seems to be maintained, but it is not actively developed.

An example from the documentation.

Parsimmon is a small library for writing big parsers made up of lots of little parsers. The API is inspired by parsec and Promises/A+.

Parsimmon is the most popular among the three, it is stable and updated.

The following is a part of the JSON example.

Parjs is a JavaScript library of parser combinators, similar in principle and in design to the likes of Parsec and in particular its F# adaptation FParsec.

It’s also similar to the parsimmon library, but intends to be superior to it.

Parjs is only a few months old, but it is already quite developed. It also has the advantage of being written in TypeScript.

All of these libraries have good documentation, but Parjs’ is great: it explains how to use the parser and also how to design good parsers with it. There are a few examples, including the following on string formatting.

JavaScript Libraries Related To Parsing

There are also some other interesting libraries related to parsing that are not part of a common category.

JavaScript Libraries That Parse JavaScript

There is one special case that could be managed in a more specific way: the case in which you want to parse JavaScript code in JavaScript. Contrary to what we have found for Java and C#, there is no definitive choice: there are many good options to parse JavaScript.

The three most popular libraries seem to be Acorn, Esprima and UglifyJS. We are not going to say which one is best, because they all seem to be awesome, updated and well supported.

One important difference is that UglifyJS is also a mangler/compressor/beautifier toolkit, which means that it also has many other uses. On the other hand, it is the only one that supports only up to ECMAScript version 5. Another thing to consider is that only Esprima has documentation worthy of projects of such magnitude.

Interesting Parsing Libraries: Chevrotain

Chevrotain is a very fast and feature rich JavaScript LL(k) Parsing DSL. It can be used to build parsers/compilers/interperters for various use cases ranging from simple configuration files, to full fledged programing languages.

There is another interesting parsing tool that does not really fit in the more common categories of tools, like parser generators or combinators: Chevrotain, a parsing DSL. A parsing DSL works as a cross between a parser combinator and a parser generator. You define a grammar in JavaScript code directly, but using the Chevrotain API and not a standard syntax like EBNF or PEG.

Chevrotain supports many advanced features typical of parser generators, like semantic predicates, a separate lexer and parser, and a grammar definition (optionally) separated from the actions. The actions can be implemented using a visitor, and thus you can reuse the same grammar for multiple projects.

The following is a partial JSON example grammar from the documentation. As you can see, the syntax is easier to understand for a developer inexperienced in parsing, but a bit more verbose than a standard grammar.

It is very fast, faster than any other JavaScript library, and can compete with a custom parser written by hand, depending on the JavaScript engine it runs on. You can see the numbers and get more details on the benchmark of parsing libraries developed by the author of the library.

It also supports features useful for debugging, like railroad diagram generation and custom error reporting. There are also a few features that are useful for building compilers, interpreters or tools for editors, such as automatic error recovery or syntactic content assist. The last one means that it can suggest the next token given a certain input, so it could be used as the building block for an autocomplete feature.

Chevrotain has good documentation, with a tutorial, example grammars and a reference. It also has a great online editor/playground.

Chevrotain is written in TypeScript.

Summary

As we said in the sister articles about parsing in Java and C#, the world of parsers is a bit different from the usual world of programmers. In the case of JavaScript, the language also lives in a different world from any other programming language. There is such a disparate level of competence among its developers that you could find the best ones working with people that just barely know how to put together a script. And both want to parse things.

So for JavaScript there are tools spread a bit all over this spectrum. We have serious tools developed by academics for their courses or in the course of their degrees, together with much simpler tools. Some of them blur the lines between parser generators and parser combinators. And all of them have their place. A further complication is that, while usually parser combinators are reserved for easier uses, with JavaScript it is not always the case: you can find very powerful and complex parser combinators and much easier parser generators.

So with JavaScript, more than ever, we cannot definitively suggest one software over another. What is best for one user might not be the best for somebody else. And we all know that the most technically correct solution might not be ideal in real life with all its constraints. So we wanted to share what we have learned about the best options for parsing in JavaScript.

We would like to thank Shahar Soel for having pointed us to Chevrotain and for having suggested some needed corrections.

Parsing in C#: Tools and Libraries

This is an article similar to a previous one we wrote: Parsing in Java, so the introduction is the same. Skip to chapter 3 if you have already read it.

If you need to parse a language, or document, from C# there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • building your own custom parser by hand
  • a tool or library to generate a parser: for example ANTLR, that you can use to build parsers for any language


Use An Existing Library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. This is typically more than what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building Your Own Custom Parser By Hand

You may need to pick the second option if you have particular needs, either because the language you need to parse cannot be parsed with traditional parser generators, or because you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A Tool Or Library To Generate A Parser

In all other cases the third option should be the default one, because it is the most flexible and has the shortest development time. That is why in this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation

Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from C# (and possibly from other languages)
  • C# libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler-compilers. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each of them. We are also concentrating on one target language: C#. This also means that (usually) the parser itself will be written in C#.

To list all possible parser tools and libraries for all languages would be kind of interesting, but not that useful. That is because there would simply be too many options, and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that this list is accessible to all programmers, we have prepared a short explanation of the terms and concepts that you may encounter while searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step scheme: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s imagine that we are trying to parse a mathematical operation, such as an addition whose first operand is the number 437.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM. The input stream is transformed into tokens and then into an AST by the parser.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.
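
To make the split more concrete, here is a minimal hand-written sketch in C# (our own illustration, not tied to any of the tools below) that turns an input like “437 + 734” into NUM and PLUS tokens and then groups them into an addition:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The two token types used in the example above.
enum TokenType { Num, Plus }

record Token(TokenType Type, string Text);

static class TinyLexer
{
    // The lexer: turns raw characters into a stream of tokens.
    public static IEnumerable<Token> Tokenize(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; }          // spaces are skipped
            else if (char.IsDigit(c))                   // a run of digits is one NUM token
            {
                int start = i;
                while (i < input.Length && char.IsDigit(input[i])) i++;
                yield return new Token(TokenType.Num, input[start..i]);
            }
            else if (c == '+') { yield return new Token(TokenType.Plus, "+"); i++; }
            else throw new Exception($"Unexpected character '{c}'");
        }
    }
}

static class TinyParser
{
    // The parser: recognizes the sequence NUM PLUS NUM and groups it into an addition.
    public static int ParseAddition(IReadOnlyList<Token> tokens)
    {
        if (tokens.Count == 3 &&
            tokens[0].Type == TokenType.Num &&
            tokens[1].Type == TokenType.Plus &&
            tokens[2].Type == TokenType.Num)
        {
            return int.Parse(tokens[0].Text) + int.Parse(tokens[2].Text);
        }
        throw new Exception("Expected NUM PLUS NUM");
    }
}

static class Demo
{
    static void Main()
    {
        var tokens = TinyLexer.Tokenize("437 + 734").ToList();
        Console.WriteLine(TinyParser.ParseAddition(tokens)); // prints 1171
    }
}
```

In a real project these two pieces would typically be generated by the tools described in the rest of the article, rather than written by hand.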

Scannerless parsers are different because they process the original text directly, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes used interchangeably: parse tree and Abstract Syntax Tree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. A parse tree is usually transformed into an AST by the user, possibly with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

Sometimes you may want to start by producing a parse tree and then derive an AST from it. This can make sense because the parse tree is easier for the parser to produce (it is a direct representation of the parsing process), but the AST is simpler and easier to process in the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc.
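
As a concrete illustration (again our own sketch, not tied to any tool), the AST for the NUM PLUS NUM expression above could be modeled in C# with a couple of node classes. Note that the ‘+’ token and any parentheses do not need nodes of their own, because they are implied by the structure of the tree:

```csharp
// A minimal AST sketch for arithmetic expressions.
public abstract class ExpressionNode { }

// A leaf node: a plain number.
public sealed class NumberNode : ExpressionNode
{
    public int Value { get; }
    public NumberNode(int value) => Value = value;
}

// An inner node: an addition with two child expressions.
public sealed class AdditionNode : ExpressionNode
{
    public ExpressionNode Left { get; }
    public ExpressionNode Right { get; }

    public AdditionNode(ExpressionNode left, ExpressionNode right)
    {
        Left = left;
        Right = right;
    }
}

// 437 + 734 becomes: new AdditionNode(new NumberNode(437), new NumberNode(734))
```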

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms, it is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must start with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if” and the left and right parentheses were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
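
```
<symbol> ::= __expression__
```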

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4 + 3). And then 4 + 3 itself can be divided into its two components.

The problem is that this kind of rule may not be used with some parser generators. The alternative is a long chain of expressions that also takes care of the precedence of operators.
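
For example, in an EBNF-like notation of our own, a left-recursive rule for additions and a non-recursive alternative based on repetition could look like this:

```
// left-recursive: the rule starts with a reference to itself
expression : expression '+' expression | NUM

// non-recursive alternative: a NUM followed by zero or more '+' NUM pairs
expression : NUM ('+' NUM)*
```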

Some parser generators support direct left-recursive rules, but not indirect ones.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.
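
As a quick illustration (in the same informal notation used above), a token made of digits is just a flat repetition, which a regular expression can describe, while nested parentheses require a rule that refers to itself, which regular expressions cannot express:

```
// regular: a repetition of digits, no recursion needed
NUM  : [0-9]+

// context-free: the rule references itself to allow arbitrary nesting
expr : NUM | '(' expr ')'
```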

Usually to each kind of language corresponds the same kind of grammar. That is to say, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead, with PEG the first applicable choice will be chosen, and this automatically solves some ambiguities.
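
For instance, in PEG notation (a sketch of our own) the `/` operator expresses an ordered choice: the parser tries the first alternative and falls back to the next one only if the first fails, so the grammar itself states how potential ambiguities are resolved:

```
# try to read a whole addition first, otherwise accept a single number
expression <- addition / number
addition   <- number "+" expression
number     <- [0-9]+
```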

Another difference is that PEG uses scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG algorithms have been unable to deal with left-recursive rules, but some tools have found workarounds for this: either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non-recursive way. Either approach has downsides: making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your C# code.

The parser might produce the AST, which you may have to traverse yourself or which you can traverse with additional ready-to-use classes, such as Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar, to be executed every time a specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically called lexers.

Gardens Point LEX

Gardens Point LEX (GPLEX) is a lexical analyzer (lexer) generator based upon finite state automata. The input language is similar to the original LEX, but it also implements some extensions of FLEX. Obviously it has better Unicode support. GPLEX can generate a C# lexer from a grammar file with a .lex extension.

The typical grammar is divided into three parts, separated by ‘%%’:

  1. options and C# definitions
  2. rules with embedded actions
  3. usercode

As you can see in the sketch below, in standalone programs the usercode section contains the Main function, so a .lex file can generate a complete functioning program, although this tends to make it quite messy and hard to read for the untrained reader.
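
The specific directives and actions below are placeholders of our own rather than exact GPLEX syntax, but they show the three-part layout:

```
%namespace Calculator    /* part 1: options and C# definitions */

%%

[0-9]+    { /* part 2: rules with embedded C# actions, e.g. return a NUM token */ }
"+"       { /* e.g. return a PLUS token */ }

%%

/* part 3: usercode; in a standalone program it can contain the Main function */
public static void Main(string[] args)
{
    /* create the generated scanner here and drive it over the input */
}
```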

It can be used as a standalone tool, but, being a lexer generator, it can also work with parser generators: it is designed to work with Gardens Point Parser Generator, but it has also been used with COCO/R and custom parsers.

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for C# and many other languages. A particularity of the C# target is that there are actually two versions: the original by sharwell and the new standard runtime. The original defined itself as C# optimized, while the standard one is included in the general distribution of the tool. Neither is a fork, since the authors work together and both are mentioned on the official website. It is more of a divergent path. The grammars are compatible, but the generated parsers are not.

If you are unsure which one to pick, I suggest the standard one, since it is slightly more updated. In fact, the standard one has a release version supporting .NET Core, while the original has only a pre-release.

ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It is quite popular for its many useful features: for instance, version 4 supports direct left-recursive rules. However, the real added value of a vast community is the large amount of grammars available.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first is suited for when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.
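
For example, a tiny combined grammar following this convention could look like the following (a sketch of our own, with arbitrary rule names):

```
grammar Sum;

// parser rules: lowercase initial letter
expression : expression PLUS expression
           | NUM
           ;

// lexer rules: uppercase initial letter
NUM  : [0-9]+ ;
PLUS : '+' ;
WS   : [ \t\r\n]+ -> skip ;
```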

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.

Coco/R

Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, which are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.

The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.

Technically all the grammars must be LL(1), that is to say the parser must be able to choose the correct rule by looking only one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.

A Coco/R grammar looks like this.
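
For example, a much simplified grammar for sums could have roughly the following shape (a sketch of our own; refer to the Coco/R documentation for the exact syntax of each section):

```
COMPILER Sum

CHARACTERS
  digit = "0123456789".

TOKENS
  number = digit {digit}.

PRODUCTIONS
  Sum = number {"+" number}.

END Sum.
```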

Coco/R has good documentation, with several example grammars. It supports several languages including Java, C# and C++.

Gardens Point Parser Generator

Gardens Point Parser Generator (GPPG) is a parser generator that produces parsers written in C# V2 or higher. The input language is YACC-like, and the parsers are LALR(1), with the usual automatic disambiguations. Designed to work with GPLEX.

There are some adaptations to make it work with C# and its tools (e.g. MPF). However, a particular feature of GPPG is the possibility of also generating an HTML report of the structure of the generated parser. This is meant to simplify understanding and analysis of the parser and grammar.

It is designed to work with its brother GPLEX, but it has also been used with COCO/R and custom lexers. The structure of the grammar is similar to that of its brother, but instead of .lex the files have the extension .y.

Grammatica

Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well-commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages and a clean separation between the grammar and the source code.

The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.

A typical Grammatica grammar is divided into three sections: header, tokens and productions. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.

Hime Parser Generator

Hime Parser Generator is a parser generator for .NET and Java, with a modern implementation of GLR with the RNGLR algorithm. It is a pragmatic tool with everything you need and nothing more. It does not reinvent the wheel, but it does improve it.

The grammar uses a custom language based on BNF with some enhancements. For instance, it supports context-sensitive lexing (useful for soft keywords), template rules to avoid repetition of similar rules and features to transform the parse tree into an AST. These features are called tree actions: drop and promote. One drops the node from the tree, the other substitutes the node with its children.

Instead of embedding code in the grammar, with Hime you can annotate a rule with what the documentation calls a semantic action. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

The grammar is put in a file with a .gram extension. The structure of the file resembles the classic one, with three sections: options, terminals and rules. But it is much cleaner and looks like a C# class.

The documentation is concise, but complete: there are tutorials and recipes to explain the practical usage of the tool. There is an equally good grammar repository. It has debug features, like the generation of DOT files.

LLLPG

LLLPG is not a dedicated tool the way ANTLR is. Instead, LLLPG is designed to be embedded inside another programming language.

So LLLPG is an LL(k) parser generator that is not actually a standalone tool, but part of a larger project called Enhanced C#. It is not really usable standalone, because it does not even generate a complete class: the tool only translates the parts of the input file that it recognizes.

It also bizarrely claims to be better than ANTLR2 (released in 2006), despite having been updated until much more recently. But we mention it because, for the very narrow objective of building a custom language on the .NET platform, it is a good tool designed just for that objective.

PEG

After the CFG parsers, it is time to see the PEG parsers available for C#.

IronMeta

The IronMeta parser generator provides a programming language and application for generating pattern matchers on arbitrary streams of objects. It is an implementation of Alessandro Warth’s OMeta system in C#.

Despite the potentially perplexing reference to being a programming language, IronMeta is a PEG parser generator that works just like any other one.

IronMeta improves upon the base OMeta system by allowing direct and indirect left recursion. On the other hand, error reporting and documentation are limited.

An IronMeta grammar can contain embedded actions and conditions. A condition is a boolean expression that controls the activation of the following rule: if the condition is true, the rule activates.

Pegasus

Pegasus is a PEG (Parsing Expression Grammar) parser generator for .NET that integrates with MSBuild and Visual Studio.

Pegasus is a simple parser generator with equally sparse documentation. It supports the formal definition of PEG and has basic features to simplify the management of indentation and debugging.

A Pegasus grammar is written in a .peg file that is quite clean, but can also include embedded actions.

Parser Combinators

They allow you to create a parser simply with C# code, by combining different pattern matching functions that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs. Given that they are just C# libraries, you can easily introduce them into your project: you do not need any specific generation step and you can write all of your code in your favorite editor. Their main advantage is the possibility of being integrated into your traditional workflow and IDE.

In practice this means that they are very useful for all the little parsing problems you find. When the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Sprache And Superpower

Sprache is a simple, lightweight library for constructing parsers directly in C# code.

It doesn’t compete with “industrial strength” language workbenches – it fits somewhere in between regular expressions and a full-featured toolset like ANTLR.

The documentation says it all, except how to use it. You can understand how to use it mostly by reading tutorials, including one we have written for Sprache. However, it is quite popular and it is cited in the credits of ReSharper.

There is no grammar: you use the functions it provides as you would for normal code.
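
As a small illustration (our own, covering only a fraction of what the library offers), this is how a parser for simple additions could be written with Sprache:

```csharp
using System;
using Sprache;

class SpracheDemo
{
    // A number is one or more digits, with surrounding whitespace ignored.
    static readonly Parser<int> Number =
        Parse.Digit.AtLeastOnce().Text().Token().Select(int.Parse);

    // An addition is a number, a '+' sign and another number.
    static readonly Parser<int> Addition =
        from left in Number
        from plus in Parse.Char('+').Token()
        from right in Number
        select left + right;

    static void Main()
    {
        Console.WriteLine(Addition.Parse("437 + 734")); // prints 1171
    }
}
```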