Parsing In C#: Tools And Libraries

Parsing in C#: Tools and Libraries

This is an article similar to a previous one we wrote: Parsing in Java, so the introduction is the same. Skip to chapter 3 if you have already read it.

If you need to parse a language, or document, from C# there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • building your own custom parser by hand
  • a tool or library to generate a parser: for example ANTLR, that you can use to build parsers for any language

Parsing: Tools and Libraries

Parsing   tools and libraries   cover

Receive the guide to your inbox to read it on all your devices when you have time. Learn about parsing in Java, Python, C#, and JavaScript

We won't send you spam. Unsubscribe at any time. Powered by ConvertKit

Use An Existing Library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually include also API to programmatically build and modify documents in that language. This is typically more of what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building Your Own Custom Parser By Hand

You may need to pick the second option if you have particular needs. Both in the sense that the language you need to parse cannot be parsed with traditional parser generators, or you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A Tool Or Library To Generate A Parser

In all other cases the third option should be the default one, because is the one that is most flexible and has the shorter development time. That is why on this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation

Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from C# (and possibly from other languages)
  • C# libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler compiler. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each of them. We are also concentrating on one target language: C#. This also means that (usually) the parser itself will be written in C#.

To list all possible tools and libraries parser for all languages would be kind of interesting, but not that useful. That is because there will be simple too many options and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that these list is accessible to all programmers we have prepared a short explanation for terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-steps schema: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and find ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM.The input stream is transformed in Tokens and then in an AST by the parser

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers or parser are called rules or productions. A lexer rule will specify that a sequence of digits correspond to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.

Scannerless parsers are different because they process directly the original text, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and Abstract SyntaxTree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

 

Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc..

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must starts with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left and the right parenthesis were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:

The <simbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contains other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could be also indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

This description also match multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression(4+3). And then 4 + 3 itself can be divided in its two components.

The problem is that this kind of rules may not be used with some parser generators. The alternative is a long chain of expressions that takes care also of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect one.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one need something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually to a kind of language correspond the same kind of grammar. That is to say there are regular grammars and context-free grammars that corresponds respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as Context-free grammars, but according to their authors they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead with PEG the first applicable choice will be chosen, and this automatically solve some ambiguities.

Another differnce is that PEG use scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG have been unable to deal with left-recursive rules, but some tools have found workarounds for this. Either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non recursive way. Either of these ways has downsides: either by making the generated parser less intelligible or by worsen its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your C# code.

The parser might produce the AST, that you may have to traverse yourself or you can traverse with additional ready-to-use classes, such Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically called lexers.

Gardens Point LEX

Gardens Point LEX (GPLEX) is a lexical analyzer (lexer) generator based upon finite state automata. The input language is similar to the original LEX, but it also implement some extensions of FLEX. Obviously it has better Unicode support. GPLEX can generate a C# lexer from a grammar file .lex.

The typical grammar is divided in three parts, separated by ‘%%’:

  1. options and C# definitions
  2. rules with embedded actions
  3. usercode

As you can see in the following example in standalone programs the usercode section contains the Main function, so a .lex file can generate a complete functioning program. Although this make it always quite messy and hard to read for the untrained reader.

It can be used as a standalone tool, but being a lexer generator can also work with parser generators: it is designed to work with Gardens Point Parser Generator, however it has also been used with COCO/R and custom parsers.

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for C# and many other languages. A particularity of the C# target is that there are actually two versions: the original by sharwell and the new standard runtime. The original defined itself as C# optimized, while the standard one is included in the general distribution of the tool. Neither is a fork, since the authors work together and both are mentioned in the official website. It is more of a divergent path. The grammars are compatible, but the generated parsers are not.

If you are unsure which one to pick I suggest the standard one, since it is slightly more updated. In fact the standard has a release version supporting .NET Core, while the original only a pre-release.

ANTLR is based on an new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However a real added value of a vast community it is the large amount of grammars available.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided in two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively lexer and parser grammars can be defined in separate files.

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.

Coco/R

Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, that are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.

The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.

Technically all the grammars must be LL(1), that is to say the parser must be able to choose the correct rule only looking one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.

A Coco/R grammar looks like this.

Coco/R has a good documentation, with several examples grammars. It supports several languages including Java, C# and C++.

Gardens Point Parser Generator

Gardens Point Parser Generator (GPPG) is a parser generator that produces parsers written in the C# V2 or higher. The input language is YACC-like, and the parsers are LALR(1), with the usual automatic disambiguations. Designed to work with GPLEX.

There are some adaptation to make it work with C# and its tools (e.g. MPF). However a particular feature of GPPG is the possibility of generating also an HTML report of the structure of the generated parser. This is meant to simplify understanding and analysis of the parser and grammar.

It is designed to work with its brother GPLEX, however it has also been used with COCO/R and custom lexers. The structure of the grammar is similar to the one of the brother, but instead of .lex it has the extension .y.

Grammatica

Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well- commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages and a clean separation between the grammar and the source code.

The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.

A typical grammar of Grammatica is divided in three sections: header, tokens and productions. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.

Hime Parser Generator

Hime Parser Generator is a parser generator for .NET and Java, with a modern implementation of GLR with the RNGLR algorithm. It is a pragmatic tool with everything you need and nothing more. It does not reinvent the wheel, but it does improve it.

The grammar uses a custom language based on BNF with some enhancement. For instance, it supports context-sensitive lexing (useful for soft keywords), template rules to avoid repetition of similar rules and feature to transform the parse tree in an AST. These feature are called tree actions: drop and promote. One drop the node from the tree, the other substitute the node with its children.

Instead of embedding code in a Hime grammar has could you can annotate a rule with something called semantic action in the documentation. In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

The grammar is put in a file with .gram extension. The structure of the file resemble the classic one with the three sections: options, terminals and rules. But it is much cleaner and looks like a C# class.

The documentation is concise, but complete: there are tutorials and recipes to explain practical usage of the tool. There is an equally good enough grammar repository. It has debug features, like the generation of DOT files.

LLLPG

LLLPG is not a dedicated tool the way ANTLR is. Instead, LLLPG is designed to be embedded inside another programming language

So LLLPG is a LL(k) parser generator, that is not actually a standalone parser generator, but a part of a larger project called Enhanced C#. It is not really usable standalone, because it does not even generate a complete class, but the tool only translate the parts of the input file that it recognizes.

It also bizarrely claims to be better than ANTLR2 (released in 2006), despite being updated until recently. But we mentioned it because for the very narrow objective of building a custom language on the .NET it is a good tool designed just for that objective.

PEG

After the CFG parsers is time to see the PEG parsers available for C#.

IronMeta

The IronMeta parser generator provides a programming language and application for generating pattern matchers on arbitrary streams of objects. It is an implementation of Alessandro Warth’s OMeta system in C#.

Despite the potentially perplexing reference to being a programming language IronMeta is a PEG parser generator that works just like any other one.

IronMeta improve upon base OMeta allowing direct and indirect left recursion. On the other hand error reporting and documentation are limited.

An IronMeta grammar can contain embedded actions and conditions. A condition is a boolean expression that controls the activation of the following rule. If the condition is true the rule activates.

Pegasus

Pegasus is a PEG (Parsing Expression Grammar) parser generator for .NET that integrates with MSBuild and Visual Studio.

Pegasus is a simple parser generator with an equally sparse documentation. It supports the formal definition of PEG and does have basic features to simplify the management of indentation and debugging.

A Pegasus grammar is written in a .peg file that is quite clean, but can also include embedded action.

Parser Combinators

They allow you to create a parser simply with C# code, by combining different pattern matching functions, that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs. Given they are just C# libraries you can easily introduce them into your project: you do not need any specific generation step and you can write all of your code in your favorite editor. Their main advantage is the possibility of being integrated in your traditional workflow and IDE.

In practice this means that they are very useful for all the little parsing problems you find. If the typical developer encounter a problem, that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.

Sprache And Superpower

Sprache is a simple, lightweight library for constructing parsers directly in C# code.

It doesn’t compete with “industrial strength” language workbenches – it fits somewhere in between regular expressions and a full-featured toolset like ANTLR.

The documentation says it all, except how to use it. You can understand how to use it mostly reading tutorials, including one we have written for Sprache. However it is quite popular and cited in the credits for ReSharper.

There is no grammar, you use the functions it provides as you would for normal code.

A parser combinator library based on Sprache. Superpower generates friendlier error messages through its support for token-based parsers.

Superpower comes from the same author and it is a slightly more advanced tool with an equal lack of documentation. Being newer there are also no tutorials.

Both Sprache and Superpower supports .NET Standard 1.0.

Parseq, Parsley and LanguageExt.Parsec

These are three ports of the famous Parsec Library in Haskell. They all have some reasons to chose one over the other.

Parseq is a monadic parser combinator library written for C#, It can parse context-sensitive, infinite-lookahead grammers.

Parseq seems to be a straight port of Haskell. But there is no documentation, so if you know how to use Parsec it might be a good choice, otherwise you are on your own.

Parsley is a monadic parser combinator library inspired by Haskell’s Parsec and F#’s FParsec. It can parse context-sensitive, infinite look-ahead grammars but it performs best on predictive (LL[1]) grammars.

Parsley is a parser combinator, but it has a separate lexer and parser phase. In practical terms it means that is simple to use, but it also familiar to experienced creators of parsers. There is a limited amount of documentation, but a complete JSON example used as an integration test.

Parsley supports .NET Standard 1.1.

The Parsec library is almost an exact replica of the Haskell Parsec library, it can be used to parse very simple blocks of text up to entire language parsers.

LanguageExt.Parsec is a port of the Haskell library in a larger library designed to bring functional features in C# 6. There is a minimum amount of documentation to get you started.

LanguageExt.Parsec supports .NET Standard 1.3.

It is obviously the best choice if you also need a bit of F# in your C#, but is quite good on its own.

Pidgin

Pidgin is a parser combinator library, a lightweight, high-level, declarative tool for constructing parsers.

Pidgin is a new parser combinator library that is already quite mature and useful. Like Sprache it is easy to use and supports a nice LINQ-like syntax. It also has a few advantages over Sprache: it is more actively maintained, is faster, consumes less memory, supports binary input and include support for advanced features such as recursive structures or operator precedence.

Recursive structures are made possible by a specific operator that allows to defer execution of a parser to another section of the code. The operator precedence is managed with a class made to deal with expressions.

The following is a partial JSON example from the repository.

The documentation is good and cover many aspects: a tutorial/reference, suggestions to speed up your code and a comparison with other parser combinator libraries. The repository also contains examples on JSON and XML. The tutorial/reference is not as deep as one would like, but it gets you started. The author also gave a talk at NDC that includes a tutorial about Pidgin.

Best Way To Parse C#: Roslyn

Roslyn provides open-source C# and Visual Basic compilers with rich code analysis APIs. It enables building code analysis tools with the same APIs that are used by Visual Studio.

There is one special case that could be managed in more specific way: the case in which you want to parse C# code in C#. In such cases you should use the .NET Compiler Platform which it is a compiler as a service better known as Roslyn. It is open source and also the official C# parser, so there is no better choice.

In practical terms it works as a library that you can use to parse C#, but also to generate C# and do everything a compiler can do. The only weak point may be the abundant, but somewhat badly organized documentation. Luckily you can read a few tutorials we have written for Roslyn.

Tools That We Cannot Recommend

We want to also list some tools that people usually mention and are interesting, but we could not include in this analysis for several reasons.

Irony

Irony is a development kit for implementing languages on .NET platform.
Irony is a parser generator that does not rely on a grammar, but on overloading operators in C# to express grammar constructs. It also include an interpreter. It has not been updated since a 2013 beta release and it does not seem it ever had a stable version. Although there is a recent modified version that supports .NET Core.

GOLD

GOLD is a free parsing system that is designed to support multiple programming languages.

In practical terms it is an IDE that supports the creation of BNF grammars to generate parsers in many languages, including Assembly, C, C#, D, Java, Pascal, Python, Visual Basic.NET and Visual C++. It has been relevant enough to have its own wikipedia article, but it is not updated since 2012.

TinyPG

Tiny Parser Generator is an interesting tool presented in a popular CodeProject article that also spawn a fork. It is a tool with a simple IDE that can generate lexer, scanner and parse trees representation. But it can also generate a syntax highlighter for a text box. It is neat, but we cannot recommend it because it was never really meant for professional use and it is not update anymore.

Summary

As we said in the sister article about parsing in Java, the world of parsers is a bit different from the usual world of programmers. That is because a lot of good tools come directly from academia, and in that sector Java is more popular than C#. So there are fewer parsing tool for C# than for Java. Also some, like ANTLR, are written in Java, but can produce C# code. This does not mean that there are not good options, but there are fewer of them.

In fact, if you need a complete parser generator for a .NET Core project your only option is using ANTLR. Though you have more choices available for parser combinators.

On the other hand, if you need to parse C# you have the chance to use the official compiler very easily, so that is a plus.

We cannot really say definitely what software you should use. What it is best for a user might not be the best for somebody else. And we all know that the most technically correct solution might not be ideal in real life with all its constraints. So we wanted to share what we have learned on the best options for parsing in C#.

Thanks to Lee Humphries for his feedback on this article and Benjamin Hodgson for having signalled to us Pidgin.

Download the guide with 68 resources on Creating Programming Languages

68resources

Receive the guide to your inbox to read it on all your devices when you have time

Powered by ConvertKit