
Parsing in Java: Tools and Libraries

If you need to parse a language, or document, from Java there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • build your own custom parser by hand
  • use a tool or library to generate a parser: for example ANTLR, which you can use to build parsers for any language

Use an existing library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. That is typically more than what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building your own custom parser by hand

You may need to go for the second option if you have particular needs: either because the language you need to parse cannot be parsed with traditional parser generators, or because you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A tool or library to generate a parser

In all other cases the third option should be the default one, because it is the most flexible and has the shortest development time. That is why in this article we concentrate on the tools and libraries that correspond to this option.

Note: the text in blockquotes describing a program comes from the respective documentation


Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from Java (and possibly from other languages)
  • Java libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler compilers. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them, and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each. We are also concentrating on one target language: Java. This also means that (usually) the parser itself will be written in Java.

To list all possible tools and libraries for all languages would be kind of interesting, but not that useful. That is because there would be simply too many options and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that this list is accessible to all programmers we have prepared a short explanation of the terms and concepts that you may encounter while searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step scheme: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation such as 437 + 734.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM. The input stream is transformed into tokens and then into an AST by the parser.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.
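
To make this concrete, here is a minimal sketch of a hand-written lexer for this example, in plain Java (assuming a recent Java version with records). The token types NUM and PLUS match the ones used above; everything else, like the class name SimpleLexer, is purely illustrative.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative hand-written lexer producing NUM and PLUS tokens
    public class SimpleLexer {
        record Token(String type, String text) {}

        static List<Token> lex(String input) {
            List<Token> tokens = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                char c = input.charAt(i);
                if (Character.isDigit(c)) {
                    int start = i;
                    while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                    tokens.add(new Token("NUM", input.substring(start, i)));
                } else if (c == '+') {
                    tokens.add(new Token("PLUS", "+"));
                    i++;
                } else {
                    i++; // skip whitespace and anything else
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // prints [Token[type=NUM, text=437], Token[type=PLUS, text=+], Token[type=NUM, text=734]]
            System.out.println(lex("437 + 734"));
        }
    }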

Scannerless parsers are different because they process the original text directly, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes used interchangeably: parse tree and Abstract Syntax Tree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. A parse tree is usually transformed into an AST by the user, with some help from the parser generator.
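
As an illustration, the AST for the arithmetic example could be modeled in Java with a few node classes like the following; the class names are illustrative and not taken from any specific tool.

    // Illustrative AST node classes for expressions like 437 + 734
    abstract class Expression {}

    class NumberLiteral extends Expression {
        final int value;
        NumberLiteral(int value) { this.value = value; }
    }

    class Addition extends Expression {
        final Expression left;
        final Expression right;
        Addition(Expression left, Expression right) {
            this.left = left;
            this.right = right;
        }
    }

    class AstDemo {
        public static void main(String[] args) {
            // The expression 437 + 734 is represented by this small tree
            Expression ast = new Addition(new NumberLiteral(437), new NumberLiteral(734));
            System.out.println(ast instanceof Addition); // prints true
        }
    }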

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm

Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier for the parser to produce (it is a direct representation of the parsing process) but the AST is simpler and easier to process in the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc.

Grammar

A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms, it is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must start with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left and the right parenthesis were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:

    <symbol> ::= __expression__

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.

Consider for example arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

This description also matches multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4 + 3). And then 4 + 3 itself can be divided into its two components.

The problem is that this kind of rule may not be usable with some parser generators. The alternative is a long chain of expressions that also takes care of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect ones.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually a kind of language corresponds to the same kind of grammar. That is to say there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors they describe programming languages more naturally.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead with PEG the first applicable choice will be chosen, and this automatically solves some ambiguities.

Another difference is that PEGs use scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG algorithms have been unable to deal with left-recursive rules, but some tools have found workarounds for this: either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non-recursive way. Either of these approaches has downsides: making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Java code.

The parser might produce the AST, which you may have to traverse yourself, or you can traverse it with additional ready-to-use classes, such as Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically lexers.

We are not going to talk about it in depth, because it is very basic, but Java includes a library to parse data with numbers and simple patterns: java.util.Scanner. It could be defined as a smart library to read streams of data. It might be worth checking out if you need to quickly parse some data.
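
For example, a minimal sketch of using java.util.Scanner to pull numbers and symbols out of a string could look like this (the input is just an example):

    import java.util.Scanner;

    public class ScannerExample {
        public static void main(String[] args) {
            // Scanner splits the input on whitespace by default
            Scanner scanner = new Scanner("437 + 734");
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    System.out.println("NUM: " + scanner.nextInt());
                } else {
                    System.out.println("SYMBOL: " + scanner.next());
                }
            }
            scanner.close();
        }
    }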

JFlex

JFlex is a lexical analyzer (lexer) generator based upon deterministic finite automata (DFA). A JFlex lexer matches the input according to the defined grammar (called spec) and executes the corresponding action (embedded in the grammar).

It can be used as a standalone tool, but, being a lexer generator, it is designed to work with parser generators: typically it is used with CUP or BYacc/J. It can also work with ANTLR.

The typical grammar (spec) is divided in three parts, separated by ‘%%’:

  1. usercode, that will be included in the generated class,
  2. options/macros,
  3. and finally the lexer rules.

Context Free

Let’s see the tools that generate Context Free parsers.

ANTLR

ANTLR is probably the most used parser generator for Java. ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF).

It can output parsers in many languages. But the real added value of a vast community is the large number of grammars available. Version 4 supports direct left-recursive rules.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided in two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively lexer and parser grammars can be defined in separate files.
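
To give an idea of how a generated parser is used from Java, here is a minimal sketch; the grammar name Expr, its start rule expr and the generated classes (ExprLexer, ExprParser, ExprBaseListener) are hypothetical and depend on your grammar, while the runtime calls shown are those of recent ANTLR 4 versions.

    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTree;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;

    public class AntlrExample {
        public static void main(String[] args) {
            // ExprLexer, ExprParser and ExprBaseListener are the classes ANTLR
            // would generate from a hypothetical grammar named Expr
            ExprLexer lexer = new ExprLexer(CharStreams.fromString("437 + 734"));
            ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
            ParseTree tree = parser.expr(); // start parsing from the 'expr' rule

            // Walk the tree with a listener; override the enter/exit methods you need
            ParseTreeWalker.DEFAULT.walk(new ExprBaseListener() { }, tree);
        }
    }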

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.

APG

APG is a recursive-descent parser generator using a variation of Augmented BNF, which they call Superset Augmented BNF. ABNF is a particular variant of BNF designed to better support bidirectional communications protocols. APG also supports additional operators, like syntactic predicates and custom user-defined matching functions.

It can generate parsers in C/C++, Java and JavaScript. Support for the last language seems superior and more up to date: it has a few more features and is more actively maintained. In fact, the documentation says it is designed to have the look and feel of JavaScript RegExp.

Because it is based on ABNF, it is especially well suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.

An APG grammar is very clean and easy to understand.

BYACC/J

BYACC is Yacc that generates Java code. That is the whole idea, and it defines its advantages and disadvantages. It is well known and it allows easier conversion of a Yacc and C program to a Java program, although you obviously still need to convert all the C code embedded in semantic actions into Java code. Another advantage is that you do not need a separate runtime: the generated parser is all you need.

On the other hand it is old and the parsing world has made many improvements. If you are an experienced Yacc developer with a code base to upgrade it is a good choice, otherwise there are many more modern alternatives you should consider.

The typical grammar is divided in three sections, separated by ‘%%’: DECLARATIONS, ACTIONS and CODE. The second one contains the grammar rules and the third one the custom user code.

Coco/R

Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, that are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.

The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.

Technically all the grammars must be LL(1), that is to say the parser must be able to choose the correct rule by looking only one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.


Coco/R has good documentation, with several example grammars. It supports several languages including Java, C# and C++.

CookCC

CookCC is an LALR(1) parser generator written in Java. Grammars can be specified in three different ways:

  • in Yacc format: it can read grammars defined for Yacc
  • in its own XML format
  • in Java code, by using specific annotations

A unique feature is that it can also output a Yacc grammar. This can be useful if you need to interact with a tool that supports a Yacc grammar, like some old C program with which you must maintain compatibility.

It requires Java 7 to generate the parser, but it can run on earlier versions.


By the standards of parser generators, using Java annotations is a peculiar choice. Compared to an alternative like ANTLR there is certainly a less clear division between the grammar and the actions. This could make the parser harder to maintain for complex languages. Also, porting to another language could require a complete rewrite.

On the other hand this approach permits mixing grammar rules with the actions to perform when they are matched. Furthermore it has the advantage of being integrated in the IDE of your choice, since it is just Java code.

CUP

CUP is the acronym of Construction of Useful Parsers and it is an LALR parser generator for Java. It just generates the proper parser part, but it is well suited to work with JFlex, although obviously you can also build a lexer by hand to work with CUP. The grammar has a syntax similar to Yacc and it allows you to embed code for each rule.

It can automatically generate a parse tree, but not an AST.

It also has an Eclipse plugin to aid you in the creation of a grammar, so effectively it has its own IDE.

The typical grammar is similar to YACC.

Grammatica

Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well-commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages and a clean separation between the grammar and the source code.

The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.

A typical grammar of Grammatica is divided in three sections: header, tokens and productions. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.

Jacc

Jacc is similar to BYACC/J, except that it is written in Java and thus it can run wherever your program can run. As a rule of thumb it is developed as a more modern version of Yacc. The author describes small improvements in areas like error messages, modularity and debug support.

If you know Yacc and you do not have any code base to upgrade, it might be a great choice.

JavaCC

JavaCC is the other widely used parser generator for Java. The grammar file contains actions and all the custom code needed by your parser.

Compared to ANTLR the grammar file is much less clean and includes a lot of Java source code.

Thanks to its long history it is used in important projects, like JavaParser. This has left some quirks in the documentation and usage. For instance, technically JavaCC itself does not build an AST, but it comes with a tool that does, JJTree, so for practical purposes it does.

There is a grammar repository, but it does not have many grammars in it. It requires Java 5 or later.

ModelCC

ModelCC is a model-based parser generator that decouples language specification from language processing [..]. ModelCC receives a conceptual model as input, along with constraints that annotate it.

In practical terms you define a model of your language, which works as a grammar, in Java, using annotations. Then you feed the model you have created to ModelCC to obtain a parser.

With ModelCC you define your language in a way that is independent from the parsing algorithm used. Instead, it should be the best conceptual representation of the language. Although, under the hood, it uses a traditional parsing algorithm. So the grammar per se uses a form that is independent from any parsing algorithm, but ModelCC does not use magic and produces a normal parser.

There is a clear description of the intentions of the authors of the tool, but limited documentation. Nonetheless there are examples available, including a model for a calculator.

SableCC

SableCC is a parser generator created for a thesis, with the aim of being easy to use and offering a clean separation between grammar and Java code. Version 3 should also offer a ready-to-use way to walk the AST using a visitor. But that is all in theory, because there is virtually no documentation and we have no idea how to use any of these things.

Also, a version 4 was started in 2015 and apparently lies abandoned.

UrchinCC

Urchin(CC) is a parser generator that allows you to define a grammar, called Urchin parser definition (UPD). Then you generate a Java parser from it. Urchin also generates a visitor from the UPD.

There is an exhaustive tutorial that is also used to explain how Urchin works and its limitations, but the manual is limited.

A UPD is divided in three sections: terminals, token and rules.

PEG

After the CFG parsers, it is time to see the PEG parsers available in Java.

Canopy

Canopy is a parser compiler targeting Java, JavaScript, Python and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using actions annotations to support custom code in the parser. In practical terms, you just write the name of a function next to a rule and then you implement that function in a Java file containing the action code.

Laja

Laja is a two-phase scannerless, top-down, backtracking parser generator with support for runtime grammar rules.

Laja is a code generator and a parser generator and it is mainly designed to create external DSLs. This means that it has some peculiar features. With Laja you must specify not just the structure of the data, but also how the data should be mapped into Java structures. These structures are usually objects in a hierarchy or flat organization. In short, it makes it very easy to parse data files, but it is less suitable for a generic programming language.

Laja options, like output directory or input file, are set in a configuration file.

A Laja grammar is divided in a rules section and a data mapping section.

Mouse

Mouse is a tool to transcribe PEG into an executable parser written in Java.

It does not use packrat parsing and thus it uses less memory than the typical PEG parser (the manual explicitly compares Mouse to Rats!).

It does not have a grammar repository, but there are grammars for Java 6-8 and C.

A Mouse grammar is quite clean. To include custom code, a feature called semantic predicates, you do something similar to what you do in Canopy: you include a name in the grammar and then, later, in a Java file, you write the custom code.

Rats!

Rats! is a parser generator part of xtc (eXTensible Compiler). It is based on PEG, but it uses “additional expressions and operators necessary for generating actual parsers”. It supports left-recursive productions. It can automatically generate an AST.

It requires Java 6 or later.

The grammar can be quite clean, but you can embed custom code after each production.

Parser Combinators

Parser combinators allow you to create a parser simply with Java code, by combining different pattern matching functions that are equivalent to grammar rules. They are generally considered best suited for simpler parsing needs. Given that they are just Java libraries, you can easily introduce them into your project: you do not need any specific generation step and you can write all of your code in your favorite Java editor. Their main advantage is the possibility of being integrated in your traditional workflow and IDE.

In practice this means that they are very useful for all the little parsing problems you find. If the typical developer encounters a problem that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.
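
To give an idea of the style, here is a minimal hand-rolled sketch of the parser-combinator idea in plain Java; it does not use any of the libraries described below and assumes a recent Java version (records).

    // A parser is a function from (input, position) to a result, or null on failure
    public class CombinatorSketch {

        record Result(String value, int next) {}

        interface Parser {
            Result parse(String input, int pos);
        }

        // Parser that matches a single given character
        static Parser ch(char c) {
            return (input, pos) -> pos < input.length() && input.charAt(pos) == c
                    ? new Result(String.valueOf(c), pos + 1)
                    : null;
        }

        // Combinator that runs two parsers in sequence and concatenates their results
        static Parser seq(Parser first, Parser second) {
            return (input, pos) -> {
                Result r1 = first.parse(input, pos);
                if (r1 == null) return null;
                Result r2 = second.parse(input, r1.next());
                return r2 == null ? null : new Result(r1.value() + r2.value(), r2.next());
            };
        }

        public static void main(String[] args) {
            Parser ab = seq(ch('a'), ch('b'));       // bigger parsers are built by combining smaller ones
            System.out.println(ab.parse("ab", 0).value()); // prints "ab"
        }
    }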

Jparsec

Jparsec is a port of the Parsec library of Haskell.

Parser combinators are usually used in one phase, that is to say without a lexer. This is simply because it can quickly become too complex to manage all the combinator chains directly in the code. Having said that, jparsec has a special class to support lexical analysis.

It does not support left-recursive rules, but it provides a special class for the most common use case: managing the precedence of operators.


Parboiled

Parboiled provides a recursive descent PEG parser implementation that operates on PEG rules you specify.

The objective of parboiled is to provide an easy to use and understandable way to create small DSLs in Java. It puts itself in the space between a simple bunch of regular expressions and an industrial-strength parser generator like ANTLR. A parboiled grammar can include actions with custom code, included directly into the grammar code or through an interface.

It does not build an AST for you, but it provides a parse tree and some classes to make it easier to build one. That is because its authors maintain that the AST is heavily dependent on your exact project needs, so they prefer to offer an “open and flexible approach”. It sounds quite appropriate to the project objective and one of our readers found the approach better than a straight AST.

The documentation is very good: it explains features, shows examples and compares the ideas behind parboiled with the other options. There are some example grammars in the repository, including one for Java.

It is used by several projects, including important ones like neo4j.

PetitParser

PetitParser combines ideas from scannerless parsing, parser combinators, parsing expression grammars and packrat parsers to model grammars and parsers as objects that can be reconfigured dynamically.

PetitParser is a cross between a parser combinator and a traditional parser generator. All the information is written in the source code, but the source code is divided in two files. In one file you define the grammar, while in the other one you define the actions corresponding to the various elements. The idea is that it should allow you to dynamically redefine grammars. While it is smartly engineered, it is debatable whether it is also smartly designed. The example JSON grammar is lengthier than one would expect.


There is a version written in Java, but there are also versions in Smalltalk, Dart, PHP, and TypeScript.

The documentation is lacking, but there are example grammars available.

Java Libraries That Parse Java: JavaParser

There is one special case that requires some more comments: the case in which you want to parse Java code in Java. In this case we have to suggest using a library named JavaParser. Incidentally we heavily contribute to JavaParser, but this is not the only reason why we suggest it. The fact is that JavaParser is a project with tens of contributors and thousands of users, so it is pretty robust.

A quick list of features:

  • it supports all versions of Java from 1 to 9
  • it supports lexical preservation and pretty printing: it means you can parse Java code, modify it and print it back either with the original formatting or pretty printed
  • it can be used with JavaSymbolSolver, which gives you symbol resolution, i.e., it understands which methods are invoked, to which declarations references are linked, it calculates the type of expressions, etc.
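
As an illustration, a minimal sketch of parsing a snippet with JavaParser could look like the following; it assumes a recent JavaParser 3.x version, where StaticJavaParser is the entry point (older releases expose a similar static parse method on JavaParser).

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.body.MethodDeclaration;

    public class JavaParserExample {
        public static void main(String[] args) {
            CompilationUnit cu = StaticJavaParser.parse(
                    "class A { void greet() { System.out.println(\"hi\"); } }");

            // The result is an AST that you can visit, query and modify
            cu.findAll(MethodDeclaration.class)
              .forEach(m -> System.out.println("Found method: " + m.getName()));
        }
    }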

Convinced? Do you still want to write your own parser for Java?

Summary

Parsing in Java is a broad topic and the world of parsers is a bit different from the usual world of programmers. You will find the best tools coming directly from academia, which is typically not the case with software. Some tools and libraries have been started for a thesis or a research project. The upside is that tools tend to be easily and freely available. The downside is that some authors prefer to have a good explanation of the theory behind what their tools do, rather than good documentation on how to use them. Also, some tools end up being abandoned as the original authors finish their master's or PhD.

We tend to use parser generators quite a lot: ANTLR is our favorite one and we use JavaCC extensively in our work on JavaParser. We do not use parser combinators very much. It is not because they are bad; they have their uses and in fact we wrote an article about one in C#. But for the problems we deal with, they typically lead to less maintainable code. However they could be easier to start with, so you may want to consider them. Especially if until now you have hacked something terrible using regular expressions and a half-baked parser written by hand.

We cannot really tell you definitively which software you should use. What is best for one user might not be the best for somebody else. And we all know that the most technically correct solution might not be ideal in real life with all its constraints. But we have searched and tried many similar tools in our work and something like this article would have helped us save some time. So we wanted to share what we have learned about the best options for parsing in Java.

The difference between a compiler and an interpreter

According to their definitions, the difference between a compiler and an interpreter seems clear enough:

  • an interpreter is a program that directly executes instructions written in a programming language
  • a compiler is a program that transforms source code into a low(er)-level language

If you dig deeper, though, you might find some blurring between the two.

In fact an interpreter could translate the source language into an intermediate form, to speed up execution. That is what usually happens with a language that relies on a virtual machine. This naturally leads to some questions:

Are all languages that use a virtual machine interpreted?

Are they all actually compiled?

You might say both: a language is first compiled into an intermediate form/language and then this intermediate thing is interpreted at run time. This also leads to another issue: a compiler and an interpreter should not be thought of as one program, but more as a group of programs, a system. What you, as a user, think of as a compiler may actually include more than one program. For instance, it may include a linker: a program that combines different object files into one file, so that it can be more easily used. Something similar could be said of an interpreter.


Can You Tell Me Everything About Compilers & Interpreters?

So, what are all the pieces that compose a compiler or an interpreter? You could look for a precise and technical answer to such questions in academia, or you can find discussions on these issues on StackOverflow.

What really matters to us as developers, or even as creators of a language, is what the differences are when working with them. Both have advantages and disadvantages, and in fact some languages can have both an interpreter and a compiler, or more than one. That is what we are going to see.

The main point still stands: an interpreter executes the code now, a compiler prepares the source code for an execution that comes later. All the practical differences descend from these different objectives.

How Do You Distribute Programs

In practical terms one important difference is that a compiler generates a stand-alone program, while an interpreted program always needs the interpreter to run.

Once you have a compiled program you can run it without needing to install anything else. This simplifies distribution. On the other hand the executable works on one specific platform: different operating systems and different processors need different compiled versions.

If you are going to interpret a program you can distribute the same copy to users on different platforms. However they will need an interpreter that runs on their specific platform. You could either distribute the original source code or an intermediate form. An intuitive way to look at an interpreter is this: it is like the eval function in JavaScript. It works wherever JavaScript works, but it needs a JavaScript interpreter for that platform to run.

Cross-Platform Support

This is a technical difference that leads to important real consequences: it is easier to make cross-platform programs with an interpreted programming language.

That is because, for the most part, you are just creating a program for the interpreter platform. It will be the interpreter itself that will translate it into the proper form for the real platform (e.g., Windows/Linux and x86/ARM). Of course, there are still some differences in each platform of which you must be aware. A common example is the directory separator character.

When you compile a program, instead, you need to take care of all the little differences between each platform yourself. This happens in part because compiled languages tend to be low(er)-level languages, such as C++, so they give you lower-level access to the system and thus more responsibility. But another reason is that all the libraries you are using need to support different platforms themselves. So if they do not support Windows, you cannot support Windows.

Speed Has Multiple Faces

Again, in the case of speed, we have a sort of paradox: a compiler is both faster and slower than an interpreter. Many people know that a compiled program is much faster than an interpreted one, but this is not the whole picture. A compiled program is faster to run than an interpreted program, but it takes more time to compile and run a program than to just interpret it.

A compiler indeed produces faster programs. This happens fundamentally because it must analyze each statement just once, while an interpreter must analyze it each time. Furthermore, a compiler can optimize the executable code it produces, both because it knows exactly where it will run and because it can take the time to optimize the code. Time that would make interpretation too slow.

Runtime Speed Versus Development Speed

You might think that this is nitpicking: if you compile a program it runs faster; the time that it takes to compile does not matter. This is usually the old school opinion. And without doubt the resulting program is run more times than it is compiled. So who cares if the development takes more time? Well, for sure it takes a “bring the pain” attitude to development that is somewhat admirable. But what if the gains in run time are not relevant, while the losses in development productivity are significant?

It is one thing if you are creating an operating system and another if you are making a selfie app. Even your users might prefer a not-even-noticeable loss in run time speed in exchange for getting features quicker. There is not a one-size-fits-all answer: in some contexts productivity matters more than speed, in others the inverse is true.

The Mysteries Of Debugging

There is another particular aspect that ends up being more uncertain than one would expect: debugging. On paper debugging is easier with an interpreter than with a compiler. That is true for several reasons:

  • with the interpreter there is one version of the executable; you don’t need a debug version for development and a release one for the end user
  • there are fewer platform specific bugs using an interpreter
  • since the interpreter transforms code on the fly, the information from the source code is still available
  • since the interpreter executes one statement at a time it is easier to find a mistake

The Difference That Development Tools Makes

While this is all true, in practice it might be less relevant than it seems. In fact, if you think about your experience, you would probably find that debugging JavaScript is harder than debugging C++. Why is that? In part it is the design of the languages themselves: JavaScript uses dynamic typing, while C++ uses static typing. The latter makes it easier to catch mistakes early. But in the end it comes down to the development tools. Compiling C++ by hand is hard, so most people use IDEs to develop with it. On the other hand, you can easily use a text editor and command line tools to develop in JavaScript.

Having said that, if we put them in the same context, each one with a great IDE and supporting tools, the situation evens out. Indeed many interpreted environments are used by people that want to learn how to use a new language. It is easier to test and find what is right and wrong by looking at what happens line by line and in real time.

Conclusions

We have seen the main differences that matter between a compiler and an interpreter. More importantly, we have seen that the consequences of the different philosophies are more important than the technical ones. In short, there are cultures that come with certain technical choices and that end up being relevant on their own. If you want speed and ease of development you are going to pick all the technologies for speed and not just one. And your users are going to follow your lead.

This is a crucial aspect to think through, especially if you want to create your own programming language. Rasmus Lerdorf created PHP to be easy to use. And indeed it was incredibly easier to use compared to the alternatives, at least at the time of its creation. But it started more as a library than as a language. And while it has improved a lot, it still suffers from its beginnings. You can still create good PHP code, but fewer of its users than usual do. Because if you just need something that works, everything else, like security and maintenance, comes later.

If you want to know how you can practically build an interpreter or a compiler for your language, you may want to take a look at the resources to create a programming language. If you want to learn that, and all that you need to create your own language, you just need to pick a great book, loved by kids and adults alike, on how to create pragmatic, lightweight languages.

68 Resources To Help You To Create Programming Languages

Here is a new guide, to collect and organize all the knowledge that you need to create your programming language from scratch.

Creating a programming language is one of the most fascinating challenges you can dream of as a developer.

The problem is that there are a lot of moving parts, a lot of things to do right, and it is difficult to find a well detailed map to show you the way. Sure, you can find a tutorial on writing half a parser here, a half-baked list of advice on language design there, an example of a naive interpreter. To find those things you will need to spend hours navigating forums and following links.

We thought we could save you some time by collecting relevant resources, evaluating them and organizing them. So you can spend time using good resources, not looking for them.

We organized the resources around the three stages in the creation of a programming language: design, parsing, and execution.


Section 1: Designing the Language

When creating a programming language you need to take ideas and transform them into decisions. This is what you do during the design phase.

Before You Start…

Some good resources to beef up your culture on language design.


Books

  • Design Concepts in Programming Languages, if you want to make deliberate choices in the creation of your programming language, this is the book you need. Otherwise, if you don’t already have the necessary theoretical background, you risk doing things the way everybody else does them. It’s also useful to develop a general framework to understand how the different programming languages behave and why.
  • Practical Foundations for Programming Languages, this is, for the most part, a book about studying and classifying programming languages. But by understanding the different options available it can also be used to guide the implementation of your programming language.
  • Programming Language Pragmatics, 4th Edition, this is the most comprehensive book to understand contemporary programming languages. It discusses different aspects of everything from C# to OCaml, and even the different kinds of programming languages such as functional and logical ones. It also covers the several steps and parts of the implementation, such as an intermediate language, linking, virtual machines, etc.
  • Structure and Interpretation of Computer Programs, Second Edition, an introduction to computer science for people that already have a degree in it. A book widely praised by programmers, including Paul Graham directly on the Amazon page, that helps you develop a new way to think about programming languages. It’s quite abstract and examples are proposed in Scheme. It also covers many different aspects of programming languages including advanced topics like garbage collection.

Type Systems

Long discussions and infinite disputes are fought around type systems. Whatever choices you end up making, it makes sense to know the different positions.

Articles

  • These are two good introductory articles on the subject of type systems. The first discusses the Static/Dynamic dichotomy and the second one dives into introspection.
  • What To Know Before Debating Type Systems, if you already know the basics of type systems this article is for you. It will permit you to understand them better by going into definitions and details.
  • Type Systems (PDF), a paper on the formalization of type systems that also introduces more precise definitions of the different type systems.

Books

  • Types and Programming Languages, a comprehensive book on understanding type systems. It will impact your ability to design programming languages and compilers. It has strong theoretical support, but it also explains the practical importance of individual concepts.
  • Functional programming and type systems, an interesting university course on type systems for functional programming. It is used in a well known French university. There are also notes and presentation material available. It is as advanced as you would expect.
  • Type Systems for Programming Languages, is a simpler course on type systems for (functional) programming languages.

Section 2: Parsing

Parsing transforms the concrete syntax into a form that is more easily manageable by computers. This usually means transforming text written by humans into a more useful representation of the source code, an Abstract Syntax Tree.

An abstract syntax tree for the Euclidean algorithm

There are usually two components in parsing: a lexical analyzer and the proper parser. Lexers, which are also known as tokenizers or scanners, transform the individual characters into tokens, the atoms of meaning. Parsers instead organize the tokens into the proper Abstract Syntax Tree for the program. But since they are usually meant to work together, you may use a single tool that does both tasks.

Tools

  • Flex, as a lexer generator, and (Berkeley) Yacc or Bison, for the generation of the proper parser, are the venerable choices to generate a complete parser. They are a few decades old and they are still maintained as open source software. They are written in and designed for C/C++. They still work, but they have limitations in features and support for other languages.
  • ANTLR, is both a lexer and a parser generator. It’s also more actively developed as open source software. It is written in Java, but it can generate both the lexer and the parser in languages as varied as C#, C++, Python, Javascript, Go, etc.
  • Your own lexer and parser. If you need the best performance you can create your own parser. You just need to have the necessary computer science knowledge.

Tutorials

  • Flex and Bison tutorial, a good introduction to the two tools with bonus tips.
  • Lex and Yacc Tutorial, at 40 pages this is the ideal starting point to learn how to put together lex and yacc in a few hours.
  • Video Tutorial on lex/yacc in two parts, in an hour of video you can learn the basics of using lex and yacc.
  • ANTLR Mega Tutorial, the renowned and beloved tutorial that explains everything you need to know about ANTLR, with bonus tips and tricks and even resources to know more.

Books

  • lex & yacc, despite being a book written in 1992 it’s still the most recommended book on the subject. Some people say because of the lack of competition, others because it is good enough.
  • flex & bison: Text Processing Tools, the best book on the subject written in this millennium.
  • The Definitive ANTLR 4 Reference, written by the main author of the tool this is really the definitive book on ANTLR 4. It explains all of its secrets and it’s also a good introduction about how the whole parsing thing works.
  • Parsing Techniques, 2nd edition, a comprehensive, advanced and costly book to know more than you possibly need about parsing.

Section 3: Execution

To implement your programming language, that is to say to actually make something happen, you can build one of two things: a compiler or an interpreter. You could also build both of them if you want. Here you can find a good overview if you need it: Compiled and Interpreted Languages.

The resources here are dedicated to explaining how compilers and/or interpreters are built, but for practical reasons often they also explain the basics of creating lexers and parsers.

Compilers

A compiler transforms the original code into something else, usually machine code, but it could also simply be any lower-level language, such as C. In the latter case some people prefer to use the term transpiler.

Tools

  • LLVM, a collection of modular and reusable compiler and toolchain technologies used to create compilers.
  • CLR, is the virtual machine part of the .NET technologies, that permits executing different languages transformed into a common intermediate language.
  • JVM, the Java Virtual Machine that powers the Java execution.
  • Babel, a JavaScript compiler. It has a very modern structure, based upon plugins, which means that it can be easily adapted, but it doesn’t do anything by default (really, that is what the documentation says). Its creators present it as a tool to support new features of JavaScript in old environments, but it can do much more. For instance, you can use it to create JavaScript-based languages, such as LightScript.

Articles & Tutorials

  • Building Domain Specific Languages on the CLR, an article on how to build internal DSL on the CLR. It’s slightly outdated, since it’s from 2008, but it’s still a good presentation on the subject.
  • The digital issue of MSDN Magazine for February 2008 (CHM format), contains an article on how to Create a Language Compiler for the .NET Framework. It’s still a competent overview of the whole process.
  • Create a working compiler with the LLVM framework, Part 1 and Part 2, a two-part series of articles on creating a custom compiler by IBM, from 2012 and thus slightly outdated.
  • A few series of tutorials from the LLVM Documentation, these are three great linked series of tutorials on how to implement a language, called Kaleidoscope, with LLVM. The only problem is that some parts are not always up-to-date.
  • My First LLVM Compiler, a short and gentle introduction to the topic of building a compiler with LLVM.
  • Creating an LLVM Backend for the Cpu0 Architecture, a whopping 600-page tutorial to learn how to create an LLVM backend, also available in PDF or ePub. The content is great, but the English is lacking. On the positive side, if you are a student, they feel your pain of transforming theoretical knowledge into practical applications, and the book was made for you.
  • A Nanopass Framework for Compiler Education, a paper that presents a framework to teach the creation of a compiler in a simpler way, transforming the traditional monolithic approach into a long series of simple transformations. It’s an interesting read if you already have some theoretical background in computer science.
  • An Incremental Approach to Compiler Construction (PDF), a paper that is also a tutorial: it develops a basic Scheme compiler with an easier-to-learn approach.
  • The Super Tiny Compiler!: “This is an ultra-simplified example of all the major pieces of a modern compiler written in easy to read JavaScript” (and using Babel). There is a short introduction, but this is more of a walkthrough of the code than a proper tutorial. This may be a good or a bad thing, depending on your preference.

Books

  • Compilers: Principles, Techniques, and Tools, 2nd Edition, this is the widely known Dragon book (because of the cover) in the 2nd edition (purple dragon). There is a paperback edition, which probably costs less but it has no dragon on it, so you cannot buy that. It is a theoretical book, so don’t expect the techniques to actually include a lot of reusable code.
  • Modern Compiler Implementation in ML, this is known as the Tiger book and a competitor of the Dragon book. It is a book that teaches the structure and the elements of a compiler in detail. It’s a theoretical book, although it explains the concepts with code. There are other versions of the same book written using Java and C, but it’s widely agreed that the ML one is the best.
  • Engineering a Compiler, 2nd edition, is another compiler book with a theoretical approach, but one that covers a more modern approach and is more readable. It’s also more dedicated to the optimization of the compiler. So if you need a theoretical foundation and an engineering approach this is the best book to get.

Interpreters

An interpreter directly executes the language without transforming it in another form.


Books

  • Writing An Interpreter In Go, despite the title it actually shows everything from parsing to creating an interpreter. It’s a contemporary book both in the sense that it is recent (a few months old), and in that it is a short one with a learn-by-doing attitude, full of code and testing, and without third-party libraries. We have interviewed the author, Thorsten Ball.
  • Crafting Interpreters, a work-in-progress and free book that already has good reviews. It is focused on making interpreters that work well, and in fact it will build two of them. It plans to have just the right amount of theory to be able to fit in at a party of programming language creators.

Section 4: General

These are resources that cover a wide range of the process of creating a programming language. They may be comprehensive or just give a general overview.

Tools

In this section we include tools that cover the whole spectrum of building a programming language and that are usually used as standalone tools.

  • Xtext, is a framework part of several related technologies to develop programming languages and especially Domain Specific Languages. It allows you to build everything from the parser, to the editor, to validation rules. You can use it to build great IDE support for your language. It simplifies the whole language building process by reusing and linking existing technologies under the hood, such as the ANTLR parser generator.
  • JetBrains MPS, is a projectional language workbench. Projectional means that the Abstract Syntax Tree is saved on disk and a projection is presented to the user. The projection could be text-like, or be a table or diagram or anything else you can imagine. One side effect of this is that you will not need to do any parsing, because it is not necessary. The term Language Workbench indicates that Jetbrains MPS is a whole system of technologies created to help you create your own programming language: everything from the language itself to IDE and supporting tools designed for your language. You can use it to build every kind of language, but the possibility and need to create everything makes it ideal to create Domain Specific Languages that are used for specific purposes, by specific audiences.
  • Racket, is described by its authors as “a general-purpose programming language as well as the world’s first ecosystem for developing and deploying new languages”. It’s a pedagogical tool developed with practical ambitions that has even a manifesto. It is a language made to create other languages that has everything: from libraries to develop GUI applications to an IDE and the tools to develop logic languages. It’s part of the Lisp family of languages, and this tells everything you need to know: it’s all or nothing and always the Lisp-way.

Books

  • How to create pragmatic, lightweight languages, the focus here is on making a language that works in practice. It explains how to generate bytecode, target the LLVM, build an editor for your language. Once you read the book you should know everything you need to make a usable, productive language. Incidentally, we have written this book.
  • How To Create Your Own Freaking Awesome Programming Language, it’s a 100-page PDF and a screencast that teach how to create a programming language using Ruby or the JVM. If you like the quick-and-dirty approach this book will get you started in little time.
  • Writing Compilers and Interpreters: A Software Engineering Approach, 3rd edition, it’s a pragmatic book that still teaches the proper approach to compilers/interpreters. Only that instead of an academic focus, it has an engineering one. This means that it’s full of Java code and there is also UML sprinkled here and there. Both the techniques and the code are slightly outdated, but this is still the best book if you are a software engineer and you need to actually do something that works correctly right now, that is to say in a few months after the proper review process has completed.
  • Language Implementation Patterns, this is a book from the author of ANTLR, which is also a computer science professor. So it’s a book with a mix of theory and practice, that guides you from start to finish, from parsing to compilers and interpreters. As the name implies, it focuses on explaining the known working patterns that are used in building this kind of software, more than directly explaining all the theory followed by a practical application. It’s the book to get if you need something that really works right now. It’s even recommended by Guido van Rossum, the designer of Python.
  • Build Your Own Lisp, it’s a very peculiar book meant to teach you how to use the C language and how to build your own programming language, using a mini-Lisp as the main example. You can read it for free online or buy it. It’s meant to teach you about C, but you have to be already familiar with programming. There is even a picture of Mike Tyson (because… lisp): it’s all so weird, but fascinating.
  • Beautiful Racket: how to make your own programming languages with Racket, it’s a good and continually updated online book on how to use Racket to build a programming language. The book is composed of a series of tutorials and parts of explanation and reference. It’s the kind of book that is technically free, but you should pay for it if you use it.
  • Programming Languages: Application and Interpretation, an interesting book that explains how to create a programming language from scratch using Racket. The author is a teacher, but of the good and understandable kind. In fact, there is also a series of recordings of the companion lectures, which sometimes have questionable audio. There is an updated version of the book and of the recordings, but the new book has a different focus, because it also wants to teach about programming languages in general. It also doesn’t use Racket. If you don’t know any programming at all you may want to read the new version; if you are an experienced programmer you may prefer the old one.
  • Implementing Domain-Specific Languages with Xtext and Xtend, 2nd edition, is a great book for people that want to learn with examples and using a test-driven approach. It covers all levels of designing a DSL, from the design of the type system, to parsing and building a compiler.
  • Implementing Programming Languages, is an introduction to building compilers and interpreters with the JVM as the main target. There are related materials (presentations, source code, etc.) in a dedicated webpage. It has a good balance of theory and practice, but it’s explicitly meant as a textbook. So don’t expect much reusable code. It’s the typical textbook also in the sense that it can be a great and productive read if you already have the necessary background (or a teacher), otherwise you risk ending up confused.
  • Implementing functional languages: a tutorial, a free book that explains how to create a simple functional programming language from the parsing to the interpreter and compiler. On the other hand: “this book gives a practical approach to understanding implementations of non-strict functional languages using lazy graph reduction”. Also, expect a lot of math.
  • DSL Engineering, a great book that explains the theory and practice of building DSLs using language workbenches, such as MPS and Xtext. This means that other than traditional design aspects, such as parsing and interpreters, it covers things like how to create an IDE or how to test your DSL. It’s especially useful to software engineers, because it also discusses software engineering and business related aspects of DSLs. That is to say it talks about why a company should build a DSL.
  • Lisp in Small Pieces, an interesting book that explains in detail how to design and implement a language of the Lisp family. It describes “11 interpreters and 2 compilers” and many advanced implementation details such as the optimization of the compiler. It’s obviously most useful to people interested in creating a Lisp-related language, but it can be an interesting read for everybody.

Section 5:

Summary

Here you have the most complete collection of high-quality resources on creating programming languages. You just have to decide what you are going to read first.

At this point we have two pieces of advice for you:

  1. Get started. It does not matter how many amazing resources we send you: if you do not take the time to practice, try things, and learn from your mistakes, you will never create a programming language.
  2. If you are interested in building programming languages you should subscribe to our newsletter. You will receive updates on new articles, more resources, ideas, and advice, and ultimately become part of a community that shares your interest in building languages.

You should have all you need to get started. If you have questions, advice or ideas to share feel free to write at federico@tomassetti.me. We read and answer every email.

Our thanks to Krishna and the people on Hacker News for a few good suggestions.

Language Server Protocol: A Language Server For DOT With Visual Studio Code

A Language Server For DOT With Visual Studio Code

In a previous post we have seen how the Language Server Protocol can be a game changer in language development: we can now build support for one language and integrate it with all the IDEs compatible with this protocol.

In this article we are going to see how easy it is to build support for the DOT Language in Visual Studio Code.

Note that Visual Studio Code now runs also on Mac and Linux

To do this we are going to build a system composed of two elements:

  • A server which will provide support for our language (DOT)
  • A very thin client that will be integrated in Visual Studio Code

If you want to support another editor you must create a new client for that editor, but you can reuse the same server. In fact, the server is nothing more than a Node.js app.

The code for this article is in its companion repository. In this tutorial we are not going to show every little detail. We are going to focus instead on the interesting parts. We will start slow and explain how to setup the client, and then only the parts necessary to understand how everything works. So you have to refer to the repository if you want to see all the code and the configuration details.

Language Server Protocol Recap

If you haven’t read the previous article there are a few things that you may want to know. The protocol is based upon JSON-RPC, which means it is lightweight and simple both to use and to implement. In fact there are already editors that support it, such as Visual Studio Code and Eclipse, and libraries for many languages and formats.

The client informs the server when a document is opened or changed, and the server must maintain its own representation of the open documents. The client can also send requests that must be fulfilled by the server, such as requesting hover information or completion suggestions.

DOT Language

The DOT Language lets you describe graphs. A tool named Graphviz can read descriptions in DOT and generate nice pictures from them. Below you can see an example. It does not matter if you are not familiar with DOT. It is a very simple language that is perfect to show you how to use the Language Server Protocol in practice.

 

From this graph Graphviz will generate this image:

 

Setup

Before starting, we have to install a couple of things:

  1. VS Code extension generator: to generate a skeleton extension
  2. VS Code extension for the DOT language: to register the DOT language

VS Code Extension Generator

The generator is a Yeoman generator: you can install it with npm (npm install -g yo generator-code) and then run it with yo code.

The generator will guide you through the creation of an extension. In our case, we need it for the client, which is the proper VS Code extension.

VSCode Extension Generator

VS Code Extension For The DOT Language

The Language Support for the DOT language is an extension to register the DOT language with Visual Studio Code.

You can navigate to the extensions page of Visual Studio Code, search for dot and install the extension.

You may wonder why we need an extension for the DOT Language. I mean, aren’t we building one right now?

We are building a Language Server for the DOT Language, to implement things such as verifying the correctness of the syntax or suggesting terms for autocompletion. The extension we are going to install provides instead basic stuff like syntax highlighting. It also associates the .dot file extension with the DOT language. These kinds of extensions are quite easy to create: they consist just of configuration files.

Structure Of The Project

The project will be composed of:

  • a Language Protocol client
  • a Language Protocol server
  • a backend in C# and .NET Core that will provide the validation of the code

So under the project there will be a folder client, that will contain the client code, a folder server, that will contain the server code, and a folder csharp, that will contain the C# service. There will also be a data folder, that will contain two files with a list of shapes and colors.

Language Server Protocol architecture

The architecture of our application

Since client and server are actually node projects you have to install the node packages. And for the .NET Core project you have to restore the nuget packages using the well known commands inside the respective folders. More precise information is available in the repository.

You have also to remember to start the C# project first (dotnet run) and then run the extension/client after that.

1. The Client For The Language Server Protocol

Configuration Of The Client

Our client is a Visual Studio Code extension, and a very simple one, given the integrated support in VS Code for the protocol.

We are writing it using TypeScript, a language that compiles to JavaScript. We are using TypeScript for two reasons: 1) because it provides some useful features, such as strong typing, and 2) because it’s the default language for a VS Code extension.

First we are going to setup how TypeScript will be compiled into Javascript, by editing tsconfig.json.

In addition to the usual node_modules, we exclude the folder server, because it will contain the already compiled server. In fact, to simplify the use of the extension, the client will also contain the server. That is to say, we are going to create the server in another project, but we will output the source under the client/server folder, so that we could launch the extension with one command.

To do just that you can use Ctrl+Shift+B, while you are in the server project, to automatically build and output the server under the client, as soon as you edit the code.

Setting Up The Extension

In package.json we include the needed packages and the scripts necessary to automatically compile, install the extension in our development environment, and run it.

For the most part it’s the default one, we just add a dependency.

The same file is also used to configure the extension.

The setting activationEvents is used to configure when the extension should be activated, in our case for DOT files. The particular value (onLanguage:dot) is available because we have installed the extension for GraphViz (Dot) files. This is also the only thing that we strictly need from the DOT extension. Alternatively, we could avoid installing the extension by adding the following to the contributes section. But doing it this way we would lose syntax highlighting.

The contributes section can also contain custom properties to modify the configuration of the extension. As an example we have a maxNumberOfProblems setting.

This is the time to call npm install so that the vscode-languageclient library gets installed.

Setting Up The Client

Let’s see the code for the client. Copy it (or type it) in src/extension.ts
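What follows is a minimal sketch of what such a client can look like, based on the vscode-languageclient library; the identifier names (such as dotLanguageServer) and the settings section are illustrative, and the exact API can differ between versions of the library, so refer to the companion repository for the real code.

```typescript
'use strict';

import * as path from 'path';
import { ExtensionContext } from 'vscode';
import { LanguageClient, LanguageClientOptions, ServerOptions, TransportKind } from 'vscode-languageclient';

export function activate(context: ExtensionContext) {
    // The compiled server lives inside the extension, under the server folder
    const serverModule = context.asAbsolutePath(path.join('server', 'server.js'));
    const debugOptions = { execArgv: ['--nolazy', '--debug=6009'] };

    // Run the server as a node module, both for normal and debug sessions
    const serverOptions: ServerOptions = {
        run: { module: serverModule, transport: TransportKind.ipc },
        debug: { module: serverModule, transport: TransportKind.ipc, options: debugOptions }
    };

    // Call the server only for DOT documents and synchronize our custom settings
    const clientOptions: LanguageClientOptions = {
        documentSelector: ['dot'],
        synchronize: { configurationSection: 'dotLanguageServer' }
    };

    // Create and start the client; it is disposed when the extension is deactivated
    const client = new LanguageClient('dotLanguageServer', 'DOT Language Server', serverOptions, clientOptions);
    context.subscriptions.push(client.start());
}
```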

This is the entire client: if you remove the comments it is just a bunch of lines long.

We first create and setup the server, for normal and debug sessions.

Then we take care of the client: with the documentSelector option we tell the client to call the server only for DOT files.

Finally we create the client with all the options and the information it needs.

This is quite easy, given the integrated support for the Language Server Protocol in Visual Studio Code, but if you had to create a client from scratch you would have to implement the JSON-RPC protocol and configure the client to call the server whenever needed: for instance when the document changes or when the user asks for a completion suggestion.

2. The Server For The Language Server Protocol

Creating the Server

The basics of a server are equally easy: you just need to setup the connection and find a way to maintain a model of the documents.

The first thing to do is to create the connection and start listening.

Then we create an instance of TextDocuments, a class provided by Visual Studio Code to manage the documents on the server. In fact, for the server to work, it must maintain a model of the documents the client is working on. This class listens on the connection and updates the model when the server is notified of a change.

On initialization we inform the client of the capabilities of the server. The Language Server Protocol can work in two different ways: either sending only the portions of the document that have changed or sending the whole document each time. We choose the latter and inform the client to send the complete document every time.

We communicate to the client that our server supports autocompletion. In particular, we say that it should only ask for suggestions after the equals character (‘=’). This makes sense for the DOT language, but in other cases you could choose to not specify any character or to specify more than one character.

We also support hover information: when the user leaves the mouse pointer over a token for some time we can provide additional information.

Finally we support validation, but we don’t need to tell the client about it. The rationale is that when we are informed of changes on the document we inform the client about any issue. So the client itself doesn’t have to do anything special, apart from notifying the server of any change.
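The following sketch shows how this setup can look with the vscode-languageserver library; the exact imports and classes can differ slightly depending on the version of the library, so take it as an illustration of the steps just described.

```typescript
'use strict';

import {
    createConnection, IConnection, IPCMessageReader, IPCMessageWriter,
    TextDocuments, InitializeResult
} from 'vscode-languageserver';

// Create the connection over Node's IPC channel; all requests and notifications go through it
const connection: IConnection = createConnection(new IPCMessageReader(process), new IPCMessageWriter(process));

// The document manager keeps an in-memory model of the open documents in sync with the client
const documents: TextDocuments = new TextDocuments();
documents.listen(connection);

connection.onInitialize((): InitializeResult => {
    return {
        capabilities: {
            // Ask the client to send the full document on every change
            textDocumentSync: documents.syncKind,
            // Suggestions are requested after '='; resolve provides extra details on demand
            completionProvider: { resolveProvider: true, triggerCharacters: ['='] },
            // We also answer hover requests
            hoverProvider: true
        }
    };
});

connection.listen();
```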

Implement Autocompletion

The suggestions for autocompletion depend on the position of the cursor. For this reason the client sends to the server the document and the position for each autocompletion request.

Given the simplicity of the DOT language there aren’t many elements to consider for autocompletion. In this example we consider the values for colors and shapes. In this article we are going to see how to create suggestions for colors, but the code in the repository also contains suggestions for shapes, which are created in the same way.

In the first few lines we set the proper values, and load the list of possible colors.

In particular, the first thing we do is to get the text of the document. We do that by calling the document manager, using the document URI that we are given as input. In theory, we could also read the document directly from disk, using the provided document URI, but this way we would have only the version that is saved on disk. We would miss any unsaved changes in the current version.

Then we find the position of the equals (=) character. You may wonder why we don’t just use position.character - 1: since the completion is triggered by that character don’t we already know the relative position of the symbol? The answer is yes, if we are starting a suggestion for a new item, but this isn’t always true. For instance it’s not true if there is already a value, but we want to change it.

VSCode Complete for existing items

Autocomplete for an existing value

By making sure to find the position of the equals sign, we can always know if we are assigning a value to an option, and which option it is.

Sending The Suggestions For Autocomplete

These suggestions are used when the user typed “color=”.

Now that we know the option name, if the option is color we send the list of colors. We provide them as an array of CompletionItems. As you can see in the sketch below, each of them needs a label and a CompletionItemKind. The “kind” value, at the moment, is only used for displaying an icon next to the suggestion.

The last element, data, is a field meant to contain custom data chosen by the developer. We are going to see later what it is used for.

We always send all the colors to the client, Visual Studio Code itself will filter the values considering what the user is typing. This may or may not be a good thing, since it makes it impossible to use abbreviations or nicknames to trigger a completion suggestion for something else. For example, you can’t type “Bill” to trigger “William Henry Gates III”.
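Here is a sketch of the whole completion handler, building on the connection and documents objects created above; the color list and the string handling used to find the option name are simplified compared to the code in the repository.

```typescript
import { TextDocumentPositionParams, CompletionItem, CompletionItemKind } from 'vscode-languageserver';

// Illustrative subset; the real extension loads the color names from the data folder
const colors: string[] = ['aliceblue', 'antiquewhite', 'aqua', 'azure', 'beige'];

connection.onCompletion((textDocumentPosition: TextDocumentPositionParams): CompletionItem[] => {
    // Read the current (possibly unsaved) text from the document manager, not from disk
    const document = documents.get(textDocumentPosition.textDocument.uri);
    const text = document.getText();
    const offset = document.offsetAt(textDocumentPosition.position);

    // Walk back to the '=' sign: this also covers the case in which a value is already present
    const equalsOffset = text.lastIndexOf('=', offset - 1);
    if (equalsOffset === -1) {
        return [];
    }

    // The word right before '=' is the option being assigned (e.g. color or shape)
    const optionName = text.substring(0, equalsOffset).trim().split(/[\s\[,;]+/).pop();

    if (optionName === 'color') {
        // label and kind are required; data carries custom info used later by onCompletionResolve
        return colors.map(color => ({
            label: color,
            kind: CompletionItemKind.Text,
            data: { type: 'color' }
        }));
    }
    return [];
});
```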

Giving Additional Information For The Selected CompletionItem

You may want to give additional information to the user, to make it easier to choose the correct suggestion, but you can’t send too much information at once. The solution is to use another event to give the necessary information once the user has selected one suggestion.

The method onCompletionResolve is the one we need to use for doing just that.

It accepts a CompletionItem and it adds values to it. In our case if the suggestion is a color we give a link to the DOT documentation that contains the whole list of colors and we specify which color scheme it is part of. Notice that the specification of DOT also supports color schemes other than the X11 one, but our autocomplete doesn’t.
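A sketch of what the resolve handler can look like; the detail and documentation texts are illustrative, not the exact ones used in the repository.

```typescript
connection.onCompletionResolve((item: CompletionItem): CompletionItem => {
    // Enrich only the selected suggestion, to avoid sending too much data up front
    if (item.data && item.data.type === 'color') {
        item.detail = 'X11 color scheme';
        item.documentation = 'A color supported by DOT; the whole list is at '
            + 'https://www.graphviz.org/doc/info/colors.html';
    }
    return item;
});
```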

Validating A Document

Now we are going to see the main feature of our Language Server: the validation of the DOT file.

But first we make sure that the validation is triggered after every change of each document.

We do just that by calling the validation when we receive a notification of a change. Since there is a line of communication between the server and the client we don’t have to answer right away. We first receive notification of the change and once the verification is complete we send back the errors.
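In code this is just a matter of registering a callback on the document manager; the validateDotDocument function it calls is sketched a bit further down.

```typescript
// Re-validate a document every time its content changes; the diagnostics are sent back
// asynchronously by validateDotDocument once the backend has answered
documents.onDidChangeContent(change => {
    validateDotDocument(change.document);
});
```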

The function validateDotDocument takes the changed document as argument and then computes the errors. In this particular case we use a C# backend to perform the actual validation. So we just have to proxy the request through the Language Server and format the results for the client.

While this may be overkill for our example, it’s probably the best choice for big projects. If you want to easily use many linters and libraries, you are not going to maintain a specific version of each of these just for the Language Server. By integrating other services you can maintain a lean Language Server.

The validation is the best phase in which to integrate such services, because it’s not time sensitive. You can send back any errors at any time within a reasonable timeframe.

From the backend we receive back both our errors and other values computed by our service. We take advantage of the validation to also compute the names of nodes and graphs, which we store in a global variable. We are going to use these names later, to satisfy requests for hover information.

The rest of the code shows the gist of the communication with the client:

  • we gather the diagnostic messages
  • we set them up so that they are easier for the user to understand
  • we send them to the client.

We can choose the severity; for instance we can also simply communicate information or warnings. We must also choose a range to which the message applies, so that the user can deal with it. Sometimes a precise range doesn’t make sense or isn’t available; in such cases we choose as the range the rest of the line, starting from the character that indicates the beginning of the error.
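The following sketch shows the shape of such a validation function. The call to the C# backend is abstracted behind a hypothetical callDotService helper, and the structure of its answer (errors and names) is an assumption of this sketch.

```typescript
import { Diagnostic, DiagnosticSeverity, TextDocument } from 'vscode-languageserver';

// Hypothetical helper: posts the document text to the C# backend and returns its JSON answer
declare function callDotService(text: string): Promise<{
    errors: { message: string; line: number; character: number }[];
    names: { name: string; type: string; line: number; character: number }[];
}>;

// Names of nodes and graphs found during validation, reused later for hover requests
let names: { name: string; type: string; line: number; character: number }[] = [];

async function validateDotDocument(textDocument: TextDocument): Promise<void> {
    const result = await callDotService(textDocument.getText());
    names = result.names;

    const diagnostics: Diagnostic[] = result.errors.map(error => ({
        severity: DiagnosticSeverity.Error,
        range: {
            start: { line: error.line, character: error.character },
            // When we cannot compute a precise range we extend it to the end of the line
            end: { line: error.line, character: Number.MAX_VALUE }
        },
        message: error.message,
        source: 'dot'
    }));

    // The client (the editor) will typically underline the offending text
    connection.sendDiagnostics({ uri: textDocument.uri, diagnostics: diagnostics });
}
```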

The editor will then take care of communicating the mistake to the user, usually by underlining the text. But nothing forbids the client from doing something else; for instance, if the client is a log manager, it could simply store the errors in some way.

How Visual Studio Code shows an error

We are going to see the actual validation later; now we are going to see the last feature of our server, the Hover Provider.

A Simple Hover Provider

A Hover Provider’s job is to give additional information about the text that the user is hovering over, such as the type of an object (e.g. “class X”), documentation about it (e.g. “The method Y is […]”) or the signature of a method.

For our language server we choose to show which kinds of elements can be used in an edge declaration: node, graph or subgraph. To find that information we simply use a listener when we validate the DOT document on the service.

To communicate this information to the user, we search among the names that we have saved on the Language Server. If we find one at the position that the user is hovering over we tell the client, otherwise we show nothing.
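A sketch of the hover handler, reusing the names gathered during validation in the previous sketch (their exact structure is an assumption made here):

```typescript
import { TextDocumentPositionParams, Hover } from 'vscode-languageserver';

connection.onHover((textDocumentPosition: TextDocumentPositionParams): Hover => {
    const pos = textDocumentPosition.position;

    // Look for a saved name that spans the hovered position
    const found = names.find(n =>
        n.line === pos.line &&
        pos.character >= n.character &&
        pos.character <= n.character + n.name.length);

    // An empty string means there is nothing to show for this position
    return { contents: found ? `(${found.type}) ${found.name}` : '' };
});
```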

VS Code Hover Information

3. The C# Backend

Choosing One ANTLR Runtime Library

Since we are creating a cross platform service we want to use the .NET Core Platform.

The new ANTLR 4.7.0 supports .NET Core, but at the moment the nuget package has a configuration problem and still doesn’t work with it. Depending on when you read this the problem might have been solved, but for now you have two choices: you compile the ANTLR Runtime for C# yourself, or you use the “C# optimized” version.

The problem is that this version is still in beta, and the integration to automatically create the C# files from the grammar is not there yet. So to generate the C# files you have to download the latest beta nuget package for the ANTLR 4 Code Generator. Then you have to decompress the .nupkg file, which is actually a zip file, and then run the included ANTLR4 program.

You can’t use the default ANTLR4 tool because, while it generates valid C# code, that generated code is not compatible with the “C# optimized” runtime (don’t ask…).

The ANTLR Service

We are going to rapidly skim through the structure of the C# ANTLR Service. It’s not that complicated, but if you don’t understand something you can read our ANTLR Mega Tutorial.

We setup ANTLR with a Listener, to gather the names to use for the hover information, and an ErrorListener, to collect any error in our DOT document. Then we create a simple ASP .NET Core app to communicate with the Language Server in TypeScript.

In Program.cs (not shown) we configure the program to listen on port 3000, then we set up the main method, to communicate with the server, in Startup.cs.

We use a RouteBuilder to configure which path we answer to. Since we only answer to one question we could have directly answered from the root path, but this way is cleaner and it’s easier to add other services.

You can see that we actually use two ErrorListener(s), one each for the lexer and the parser. This way we can give better error information in the case of parser errors. The rest is the standard ANTLR program that uses a Listener:

  • we create the parser
  • we try to parse the main element (ie. graph)
  • we walk the resulting tree with our listener.

The errors are found when we try to parse the main element, while the names are gathered when we walk the tree with the LanguageListener.

The rest of the code simply prepares the JSON output that must be sent to the server of the Language Server Protocol.

Finding The Names

Let’s see where we find the names of the elements that we are providing for the hover information. This is achieved by listening to the firing of the id rule.

In our grammar the id rule is used to parse every name, attribute and value. So we have to distinguish between each case to find the ones we care about and categorize them.

We do just that by looking at the type of the parent of the Id node that has fired the method. We also subtract 1 from the line number, because ANTLR counts lines as humans do, starting from 1, while Visual Studio Code counts as a developer does, starting from 0.

Summary

In this tutorial we have seen how to create a client for Visual Studio Code, a Language Server (as a Node.js app) and a backend in C#.

We have seen how easily you can add features to your Language Server and one way to integrate it with another service that you already are using. This way you can create your language server as lightweight as you want. You can make it useful from day one and improve it along the way.

Of course we have leveraged a few things ready to be used:

  • an ANTLR grammar for our language
  • a client and a server implementation of the Language Server Protocol

What is amazing is that with one server you can leverage all the clients, even the ones you don’t write yourself.

If you need more inspiration you can find additional examples and information in the Visual Studio Code documentation. There is also the description of the Language Server Protocol itself, if you need to implement everything from scratch.

If you run into any issues following the tutorial or if you want to provide any feedback please feel free to write at federico@tomassetti.me

The ANTLR mega tutorial

Parsers are powerful tools, and using ANTLR you can write all sorts of parsers usable from many different languages.

In this complete tutorial we are going to:

  • explain the basics: what a parser is, what it can be used for
  • see how to setup ANTLR to be used from Javascript, Python, Java and C#
  • discuss how to test your parser
  • present the most advanced and useful features present in ANTLR: you will learn all you need to parse all possible languages
  • show tons of examples

Maybe you have read some tutorial that was too complicated or so partial that seemed to assume that you already know how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That’s it.

At the end of this tutorial:

  • you will be able to write a parser to recognize different formats and languages
  • you will be able to create all the rules you need to build a lexer and a parser
  • you will know how to deal with the common problems you will encounter
  • you will understand errors and you will know how to avoid them by testing your grammar.

In other words, we will start from the very beginning and when we reach the end you will have learned all you could possibly need to learn about ANTLR.

ANTLR Mega Tutorial Giant List of Content

What is ANTLR?

ANTLR is a parser generator, a tool that helps you to create parsers. A parser takes a piece of text and transforms it into an organized structure, such as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or also as its logical representation created by putting together the various pieces.

An abstract syntax tree for the Euclidean algorithm

Graphical representation of an AST for the Euclidean algorithm

What you need to do to get an AST:

  1. define a lexer and parser grammar
  2. invoke ANTLR: it will generate a lexer and a parser in your target language (e.g., Java, Python, C#, Javascript)
  3. use the generated lexer and parser: you invoke them passing the code to recognize and they return to you an AST

So you need to start by defining a lexer and parser grammar for the thing that you are analyzing. Usually the “thing” is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.

Aren’t regular expressions enough?

If you are the typical programmer you may ask yourself why can’t I use a regular expression? A regular expression is quite useful, such as when you want to find a number in a string of text, but it also has many limitations.

The most obvious is the lack of recursion: you can’t find a (regular) expression inside another one, unless you code it by hand for each level. Something that quickly becomes unmaintainable. But the larger problem is that it’s not really scalable: if you are going to put together even just a few regular expressions, you are going to create a fragile mess that will be hard to maintain.

It’s not that easy to use regular expressions

Have you ever tried parsing HTML with a regular expression? It’s a terrible idea, for one you risk summoning Cthulhu, but more importantly it doesn’t really work. You don’t believe me? Let’s see, you want to find the elements of a table, so you try a regular expression like this one: <table>(.*?)</table>. Brilliant! You did it! Except somebody adds attributes to their table, such as style or id. It doesn’t matter, you do this <table.*?>(.*?)</table>, but you actually cared about the data inside the table, so you then need to parse tr and td, but they are full of tags.

So you need to eliminate that, too. And somebody even dares to use comments like <!-- my comment -->. Comments can be used everywhere, and that is not easy to handle with your regular expression. Is it?

So you forbid the internet to use comments in HTML: problem solved.

Or alternatively you use ANTLR, whatever seems simpler to you.

ANTLR vs writing your own parser by hand

Okay, you are convinced, you need a parser, but why use a parser generator like ANTLR instead of building your own?

The main advantage of ANTLR is productivity

If you actually have to work with a parser all the time, because your language, or format, is evolving, you need to be able to keep the pace, something you can’t do if you have to deal with the details of implementing a parser. Since you are not parsing for parsing’s sake, you must have the chance to concentrate on accomplishing your goals. And ANTLR makes it much easier to do that, rapidly and cleanly.

Second, once you have defined your grammar you can ask ANTLR to generate multiple parsers in different languages. For example you can get a parser in C# and one in Javascript to parse the same language in a desktop application and in a web application.

Some people argue that by writing a parser by hand you can make it faster and you can produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. You can tweak them and improve both performance and error handling by working on your grammar, if you really need to. And you can do that once you are happy with your grammar.

Table of Contents or ok, I am convinced, show me what you got

Two small notes:

  • in the companion repository of this tutorial you are going to find all the code with testing, even where we don’t see it in the article
  • the examples will be in different languages, but the knowledge would be generally applicable to any language

Setup

  1. Setup ANTLR
  2. Javascript Setup
  3. Python Setup
  4. Java Setup
  5. C# Setup

Beginner

  1. Lexers and Parsers
  2. Creating a Grammar
  3. Designing a Data Format
  4. Lexer Rules
  5. Parser Rules
  6. Mistakes and Adjustments

Mid-Level

  1. Setting Up the Chat Project in Javascript
  2. Antlr.js
  3. HtmlChatListener.js
  4. Working with a Listener
  5. Solving Ambiguities with Semantic Predicates
  6. Continuing the Chat in Python
  7. The Python Way of Working with a Listener
  8. Testing with Python
  9. Parsing Markup
  10. Lexical Modes
  11. Parser Grammars

Advanced

  1. The Markup Project in Java
  2. The Main App.java
  3. Transforming Code with ANTLR
  4. Joy and Pain of Transforming Code
  5. Advanced Testing
  6. Dealing with Expressions
  7. Parsing Spreadsheets
  8. The Spreadsheet Project in C#
  9. Excel is Doomed
  10. Testing Everything

Final Remarks

  1. Tips and Tricks
  2. Conclusions

Setup

In this section we prepare our development environment to work with ANTLR: the parser generator tool, the supporting tools and the runtimes for each language.

1. Setup ANTLR

ANTLR is actually made up of two main parts: the tool, used to generate the lexer and parser, and the runtime, needed to run them.

The tool will be needed just by you, the language engineer, while the runtime will be included in the final software using your language.

The tool is always the same no matter which language you are targeting: it’s a Java program that you need on your development machine. While the runtime is different for every language and must be available both to the developer and to the user.

The only requirement for the tool is that you have installed at least Java 1.7. To install the Java program you need to download the latest version from the official site.

Instructions

  1. copy the downloaded tool where you usually put third-party java libraries (ex. /usr/local/lib or C:\Program Files\Java\lib)
  2. add the tool to your CLASSPATH. Add it to your startup script (ex. .bash_profile)
  3. (optional) add also aliases to your startup script to simplify the usage of ANTLR

Executing the instructions on Linux/Mac OS

Executing the instructions on Windows

Typical Workflow

When you use ANTLR you start by writing a grammar, a file with extension .g4 which contains the rules of the language that you are analyzing. You then use the antlr4 program to generate the files that your program will actually use, such as the lexer and the parser.

There are a couple of important options you can specify when running antlr4.

First, you can specify the target language, to generate a parser in Python or JavaScript or any other target different from Java (which is the default one). The other ones are used to generate visitor and listener (don’t worry if you don’t know what these are, we are going to explain it later).

By default only the listener is generated, so to create the visitor you use the -visitor command line option, and -no-listener if you don’t want to generate the listener. There are also the opposite options, -no-visitor and -listener, but they are the default values.

You can optionally test your grammar using a little utility named TestRig (although, as we have seen, it’s usually aliased to grun).

The filename(s) are optional and you can instead analyze the input that you type on the console.

If you want to use the testing tool you need to generate a Java parser, even if your program is written in another language. This can be done just by selecting a different option with antlr4.

Grun is useful when testing manually the first draft of your grammar. As it becomes more stable you may want to rely on automated tests (we will see how to write them).

Grun also has a few useful options: -tokens, to show the tokens detected, and -gui, to generate an image of the AST.

2. Javascript Setup

You can put your grammars in the same folder as your Javascript files. The file containing the grammar must have the same name of the grammar, which must be declared at the top of the file.

In the following example the name is Chat and the file is Chat.g4.

We can create the corresponding Javascript parser simply by specifying the correct option with the ANTLR4 Java program.

Notice that the option is case-sensitive, so pay attention to the uppercase ‘S’. If you make a mistake you will receive a message like the following.

ANTLR can be used both with node.js and in the browser. For the browser you need to use webpack or require.js. If you don’t know how to use either of the two you can look at the official documentation for some help or read this tutorial on antlr in the web. We are going to use node.js, for which you can install the ANTLR runtime simply by using the following standard command.

3. Python Setup

When you have a grammar you put that in the same folder as your Python files. The file must have the same name of the grammar, which must be declared at the top of the file. In the following example the name is Chat and the file is Chat.g4.

We can create the corresponding Python parser simply by specifying the correct option with the ANTLR4 Java program. For Python, you also need to pay attention to the version of Python, 2 or 3.

The runtime is available from PyPi so you just can install it using pip.

Again, you just have to remember to specify the proper python version.

4. Java Setup

To setup our Java project using ANTLR you can do things manually. Or you can be a civilized person and use Gradle or Maven.

Also, you can look in ANTLR plugins for your IDE.

4.1 Java Setup Using Gradle

This is how I typically setup my Gradle project.

I use a Gradle plugin to invoke ANTLR and I also use the IDEA plugin to generate the configuration for IntelliJ IDEA.

I put my grammars under src/main/antlr/ and the Gradle configuration makes sure they are generated in the directory corresponding to their package. For example, if I want the parser to be in the package me.tomassetti.mylanguage it has to be generated into generated-src/antlr/main/me/tomassetti/mylanguage.

At this point I can simply run:

And I get my lexer & parser generated from my grammar(s).

Then I can also run:

And I have an IDEA Project ready to be opened.

4.2 Java Setup Using Maven

First of all we are going to specify in our POM that we need antlr4-runtime as a dependency. We will also use a Maven plugin to run ANTLR through Maven.

We can also specify whether we want ANTLR to generate visitors or listeners. To do that we define a couple of corresponding properties.

Now you have to put the *.g4 files of your grammar under src/main/antlr4/me/tomassetti/examples/MarkupParser.

Once you have written your grammars you just run mvn package and all the magic happens: ANTLR is invoked, it generates the lexer and the parser and those are compiled together with the rest of your code.

If you have never used Maven you can look at the official ANTLR documentation for the Java target or also the Maven website to get you started.

There is a clear advantage in using Java for developing ANTLR grammars: there are plugins for several IDEs and it’s the language that the main developer of the tool actually works on. So there are tools, like org.antlr.v4.gui.TestRig, that can be easily integrated into your workflow and are useful if you want to easily visualize the AST of an input.

5. C# Setup

There is support for .NET Framework and Mono 3.5, but there is no support for .NET core. We are going to use Visual Studio to create our ANTLR project, because there is a nice extension for Visual Studio 2015 created by the same author of the C# target, called ANTLR Language Support. You can install it by going in Tools -> Extensions and Updates. This extension will automatically generate parser, lexer and visitor/listener when you build your project.

Furthermore, the extension will allow you to create a new grammar file, using the well known menu to add a new item. Last, but not least, you can setup the options to generate listener/visitor right in the properties of each grammar file.

Alternatives If You Are Not Using Visual Studio 2015

At the moment the extension only supports up to Visual Studio 2015, so if you are using another version of the IDE, or another IDE entirely, you need an alternative.

You can use the usual Java tool to generate everything, even for C#. You can do that just by indicating the right language. In this example the grammar is called “Spreadsheet”.

Notice that the ‘S’ in CSharp is uppercase.

Picking The Right Runtime

In both cases, whether you are using the extension or the Java tool, you also need an ANTLR4 runtime for your project, and you can install it with the good ol’ nuget. But you have to remember that not all runtimes are created equal.

The problem is that recently the C# version has been integrated in the usual java tool, but the author of the extension has chosen to maintain a “C# optimized” version, that is incompatible with the usual ANTLR4 tool. This is not strictly a fork, since the same person continues to be a core contributor to the main ANTLR4 tool, but it’s more of a parallel development.

So, if you are using the Visual Studio Extension you need to use the most popular ANTLR4 runtime, authored by sharwell. If you are using the ANTLR4 tool to generate your C# lexer and parser, you need to use the recently created ANTLR4 standard runtime.

Beginner

In this section we lay the foundation you need to use ANTLR: what lexer and parsers are, the syntax to define them in a grammar and the strategies you can use to create one. We also see the first examples to show how to use what you have learned.  You can come back to this section if you don’t remember how ANTLR works.

6. Lexers and Parsers

Before looking into parsers, we need to first look into lexers, also known as tokenizers. They are basically the first stepping stone toward a parser, and of course ANTLR allows you to build them too. A lexer takes the individual characters and transforms them into tokens, the atoms that the parser uses to create the logical structure.

Imagine this process applied to a natural language such as English. You are reading the single characters, putting them together until they make a word, and then you combine the different words to form a sentence.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. So it knows that the first characters actually represent a number. Then it finds a ‘+’ symbol, so it knows that it represents an operator, and lastly it finds another number.

How the input stream is analyzed

How does it know that? Because we tell it.

This is not a complete grammar, but we can already see that lexer rules are all uppercase, while parser rules are all lowercase. Technically the rule about case applies only to the first character of their names, but usually they are all uppercase or lowercase for clarity.

Rules are typically written in this order: first the parser rules and then the lexer ones, although logically they are applied in the opposite order. It’s also important to remember that lexer rules are analyzed in the order that they appear, and they can be ambiguous.

The typical example is the identifier: in many programming languages it can be any string of letters, but certain combinations, such as “class” or “function”, are forbidden because they indicate a class or a function. So the order of the rules solves the ambiguity by using the first match and that’s why the tokens identifying keywords such as class or function are defined first, while the one for the identifier is put last.

The basic syntax of a rule is easy: there is a name, a colon, the definition of the rule and a terminating semicolon

The definition of NUMBER contains a typical range of digits and a ‘+’ symbol to indicate that one or more matches are allowed. These are all very typical indications that I assume you are familiar with; if not, you can read more about the syntax of regular expressions.

The most interesting part is at the end, the lexer rule that defines the WHITESPACE token. It’s interesting because it shows how to indicate to ANTLR to ignore something. Consider how ignoring whitespace simplifies parser rules: if we couldn’t say to ignore WHITESPACE we would have to include it between every single subrule of the parser, to let the user put spaces wherever they want. Like this:

And the same typically applies to comments: they can appear everywhere and we do not want to handle them specifically in every single piece of our grammar so we just ignore them (at least while parsing) .

7. Creating a Grammar

Now that we have seen the basic syntax of a rule, we can take a look at the two different approaches to define a grammar: top-down and bottom-up.

Top-down approach

This approach consists of starting from the general organization of a file written in your language.

What are the main sections of a file? What is their order? What is contained in each section?

For example a Java file can be divided in three sections:

  • package declaration
  • imports
  • type definitions

This approach works best when you already know the language or format that you are designing a grammar for. It is probably the strategy preferred by people with a good theoretical background or people who prefer to start with “the big plan”.

When using this approach you start by defining the rule representing the whole file. It will probably include other rules, to represent the main sections. You then define those rules and you move from the most general, abstract rules to the low-level, practical ones.

Bottom-up approach

The bottom-up approach consists of focusing on the small elements first: defining how the tokens are captured, how the basic expressions are defined and so on. Then we move to higher level constructs until we define the rule representing the whole file.

I personally prefer to start from the bottom, the basic items, which are analyzed with the lexer. And then you grow naturally from there to the structure, which is dealt with by the parser. This approach lets you focus on a small piece of the grammar, build tests for it, ensure it works as expected and then move on to the next bit.

This approach mimics the way we learn. Furthermore, there is the advantage of starting with real code that is actually quite common among many languages. In fact, most languages have things like identifiers, comments, whitespace, etc. Obviously you might have to tweak something, for example a comment in HTML is functionally the same as a comment in C#, but it has different delimiters.

The disadvantage of a bottom-up approach rests on the fact that the parser is the thing you actually care about. You weren’t asked to build a lexer, you were asked to build a parser that could provide a specific functionality. So by starting with the lexer you might end up doing some refactoring, if you don’t already know how the rest of the program will work.

8. Designing a Data Format

Designing a grammar for a new language is difficult. You have to create a language that is simple and intuitive for the user, but also unambiguous enough to make the grammar manageable. It must be concise, clear, natural and it shouldn’t get in the way of the user.

So we are starting with something limited: a grammar for a simple chat program.

Let’s start with a better description of our objective:

  • there are not going to be paragraphs, and thus we can use newlines as separators between the messages
  • we want to allow emoticons, mentions and links. We are not going to support HTML tags
  • since our chat is going to be for annoying teenagers, we want to allow users an easy way to SHOUT and to format the color of the text.

Finally teenagers could shout, and all in pink. What a time to be alive.

9. Lexer Rules

We start with defining lexer rules for our chat language. Remember that lexer rules actually are at the end of the file.

In this example we use rule fragments: they are reusable building blocks for lexer rules. You define them and then you refer to them in lexer rules. If you define them but do not include them in lexer rules they simply have no effect.

We define a fragment for the letters we want to use in keywords. Why is that? Because we want to support case-insensitive keywords. Apart from avoiding the repetition of character cases, fragments are also used when dealing with floating point numbers, to avoid repeating the digits before and after the dot/comma, as in the following example.

The TEXT token shows how to capture everything, except for the characters that follow the tilde (‘~’). We are excluding the closing square bracket ‘]’, but since it is a character used to identify the end of a group of characters, we have to escape it by prefixing it with a backslash ‘\’.

The newlines rule is formulated that way because there are actually different ways in which operating systems indicate a newline, some include a carriage return ('\r') others a newline ('\n') character, or a combination of the two.

10. Parser Rules

We continue with parser rules, which are the rules with which our program will interact most directly.

The first interesting part is message, not so much for what it contains, but the structure it represents. We are saying that a message could be anything of the listed rules in any order. This is a simple way to solve the problem of dealing with whitespace without repeating it every time. Since we, as users, find whitespace irrelevant we see something like WORD WORD mention, but the parser actually sees WORD WHITESPACE WORD WHITESPACE mention WHITESPACE.

Another way of dealing with whitespace, when you can’t get rid of it, is more advanced: lexical modes. Basically it allows you to specify two lexer parts: one for the structured part, the other for simple text. This is useful for parsing things like XML or HTML. We are going to show it later.

The command rule is obvious; you just have to notice that you cannot have a space between the two options for command and the colon, but you need one WHITESPACE after. The emoticon rule shows another notation to indicate multiple choices: you can use the pipe character ‘|’ without the parentheses. We support only two emoticons, happy and sad, with or without the middle line.

Something that could be considered a bug, or a poor implementation, is the link rule: as we already said, TEXT captures everything apart from certain special characters. You may want to allow only WORD and WHITESPACE inside the parentheses, or to force a correct format for a link inside the square brackets. On the other hand, this allows the user to make a mistake in writing the link without making the parser complain.

You have to remember that the parser cannot check for semantics

For instance, it cannot know if the WORD indicating the color actually represents a valid color. That is to say, it doesn’t know that it’s wrong to use “dog”, but it’s right to use “red”. This must be checked by the logic of the program, which knows which colors are available. You have to find the right balance when dividing enforcement between the grammar and your own code.

The parser should only check the syntax. So the rule of thumb is that when in doubt you let the parser pass the content up to your program. Then, in your program, you check the semantics and make sure that the content actually has a proper meaning.

Let’s look at the rule color: it can include a message, and it itself can be part of a message; this ambiguity will be solved by the context in which it is used.

11. Mistakes and Adjustments

Before trying our new grammar we have to add a name for it, at the beginning of the file. The name must be the same as that of the file, which should have the .g4 extension.

You can find how to install everything, for your platform, in the official documentation. After everything is installed, we create the grammar, compile the generated Java code and then run the testing tool.

Okay, it doesn’t work. Why is it expecting WORD? It’s right there! Let’s try to find out, using the option -tokens to make it show the tokens it recognizes.

So it only sees the TEXT token. But we put it at the end of the grammar, what happens? The problem is that the lexer always tries to match the longest possible token. And all this text is a valid TEXT token. How do we solve this problem? There are many ways; the first, of course, is just getting rid of that token. But for now we are going to see the second easiest.

We have changed the problematic token to make it include a preceding parenthesis or square bracket. Note that this isn’t exactly the same thing, because it would allow two series of parenthesis or square brackets. But it is a first step and we are learning here, after all.

Let’s check if it works:

Using the option -gui we can also have a nice, and easier to understand, graphical representation.

ANTLR4 Parse Tree

The dot in mid air represents whitespace.

This works, but it isn’t very smart or nice, or organized. But don’t worry, later we are going to see a better way. One positive aspect of this solution is that it allows us to show another trick.

This is an equivalent formulation of the token TEXT: the ‘.’ matches any character, ‘*’ says that the preceding match can be repeated any number of times, ‘?’ indicates that the previous match is non-greedy. That is to say, the previous subrule matches everything except what follows it, allowing us to match the closing parenthesis or square bracket.

Mid-Level

In this section we see how to use ANTLR in your programs, the libraries and functions you need to use, how to test your parsers, and the like. We see what a listener is and how to use it. We also build upon our knowledge of the basics by looking at more advanced concepts, such as semantic predicates. While our projects are mainly in Javascript and Python, the concepts are generally applicable to every language. You can come back to this section when you need to remember how to get your project organized.

12. Setting Up the Chat Project with Javascript

In the previous sections we have seen how to build a grammar for a chat program , piece by piece. Let’s now copy that grammar we just created in the same folder of our Javascript files.

We can create the corresponding Javascript parser simply by specifying the correct option with the ANTLR4 Java program.

Now you will find some new files in the folder, with names such as ChatLexer.js, ChatParser.js and there are also *.tokens files, none of which contains anything interesting for us, unless you want to understand the inner workings of ANTLR.

The file you want to look at is ChatListener.js,  you are not going to modify anything in it, but it contains methods and functions that we will override with our own listener. We are not going to modify it, because changes would be overwritten every time the grammar is regenerated.

Looking into it you can see several enter/exit functions, a pair for each of our parser rules. These functions will be invoked when a piece of code matching the rule will be encountered. This is the default implementation of the listener that allows you to just override the functions that you need, on your derived listener, and leave the rest to be.

The alternative to creating a Listener is creating a Visitor. The main differences are that you can neither control the flow of a listener nor return anything from its functions, while you can do both with a visitor. So if you need to control how the nodes of the AST are entered, or to gather information from several of them, you probably want to use a visitor. This is useful, for example, with code generation, where some information that is needed to create new source code is spread around many parts. Both the listener and the visitor use depth-first search.

A depth-first search means that when a node is accessed its children will be accessed, and if one of the child nodes has its own children, they will be accessed before continuing on with the other children of the first node. The following image will make it simpler to understand the concept.

Depth-first search

So in the case of a listener an enter event will be fired at the first encounter with the node and an exit one will be fired after having exited all of its children. In the following image you can see the example of what functions will be fired when a listener meets a line node (for simplicity only the functions related to line are shown).

ANTLR Listener Example

With a standard visitor the behavior will be analogous except, of course, that only a single visit event will be fired for every single node. In the following image you can see the example of what function will be fired when a visitor meets a line node (for simplicity only the function related to line is shown).

ANTLR Visitor Example

Remember that this is true for the default implementation of a visitor and it’s done by returning the children of each node in every function. If you override a method of the visitor it’s your responsibility to make it continue the journey or stop it right there.

13. Antlr.js

It is finally time to see how a typical ANTLR program looks.

At the beginning of the main file we import (using require) the necessary libraries and files: antlr4 (the runtime), our generated parser, and the listener that we are going to see later.

For simplicity we get the input from a string, while in a real scenario it would come from an editor.
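A minimal sketch of that wiring is shown below; it assumes the ChatLexer, ChatParser and HtmlChatListener files generated/written for the Chat grammar and the classic antlr4 JavaScript runtime API (newer runtime versions expose a slightly different API), so treat the names and the sample input as illustrative.

```typescript
const http = require('http');
const antlr4 = require('antlr4');
const { ChatLexer } = require('./ChatLexer');
const { ChatParser } = require('./ChatParser');
const { HtmlChatListener } = require('./HtmlChatListener');

http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/html' });

    // In a real scenario the input would come from an editor or a chat client
    const input = 'john SAYS: hello @michael this will not work\n';

    const chars = new antlr4.InputStream(input);       // the stream of characters
    const lexer = new ChatLexer(chars);                // characters -> tokens
    const tokens = new antlr4.CommonTokenStream(lexer);
    const parser = new ChatParser(tokens);             // tokens -> tree
    parser.buildParseTrees = true;                     // redundant: it is already the default

    const tree = parser.chat();                        // chat is the root rule of our grammar

    // Walk the tree with our listener, which writes HTML into the response
    antlr4.tree.ParseTreeWalker.DEFAULT.walk(new HtmlChatListener(res), tree);

    res.end();
}).listen(1337);
```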

Right after the input we find the foundation of every ANTLR program: you create the stream of chars from the input, you give it to the lexer and it transforms the characters into tokens, which are then interpreted by the parser.

It’s useful to take a moment to reflect on this: the lexer works on the characters of the input, a copy of the input to be precise, while the parser works on the tokens generated by the lexer. The lexer doesn’t work on the input directly, and the parser doesn’t even see the characters.

This is important to remember in case you need to do something advanced like manipulating the input. In this case the input is a string, but, of course, it could be any stream of content.

Setting buildParseTrees is redundant, since the option already defaults to true, but that could change in future versions of the runtimes, so you are better off specifying it.

Then we invoke the chat rule on the parser, which sets the root node of the tree to a chat rule. You want to invoke the parser specifying a rule, which typically is the first rule. However, you can actually invoke any rule directly, like color.

Once we get the AST from the parser typically we want to process it using a listener or a visitor. In this case we specify a listener. Our particular listener takes a parameter: the response object. We want to use it to put some text in the response to send to the user. After setting the listener up, we finally walk the tree with our listener.

14. HtmlChatListener.js

We continue by looking at the listener of our Chat project.

After the require calls we make our HtmlChatListener extend ChatListener. The interesting stuff starts with the enter/exit functions.

The ctx argument is an instance of a specific context class for the node that we are entering/exiting. So for enterName it is NameContext, for exitEmoticon it is EmoticonContext, etc. This specific context will have the proper elements for the rule, which makes it possible to easily access the respective tokens and subrules. For example, NameContext will contain fields like WORD() and WHITESPACE(); CommandContext will contain fields like WHITESPACE(), SAYS() and SHOUTS().

These functions, enter* and exit*, are called by the walker every time the corresponding nodes are entered or exited while it’s traversing the AST that represents the program. A listener allows you to execute some code, but it’s important to remember that you can’t stop the execution of the walker or skip the execution of the functions.

On line 18, we start by printing a strong tag because we want the name to be bold, then on exitName we take the text from the token WORD and close the tag. Note that we ignore the WHITESPACE token: nothing says that we have to show everything. In this case we could have done everything either in the enter or in the exit function.

In the function exitEmoticon we simply transform the emoticon text into an emoji character. We get the text of the whole rule because there are no tokens defined for this parser rule. On enterCommand, instead, there could be either of two tokens, SAYS or SHOUTS, so we check which one is defined. Then we alter the following text, transforming it into uppercase, if it’s a SHOUTS. Note that we close the p tag at the exit of the line rule, because the command, semantically speaking, alters all the text of the message.

All we have to do now is launch Node.js, with nodejs antlr.js, point our browser at its address, usually http://localhost:1337/, and we will be greeted with the following image.

ANTLR4 Javascript

So all is good, we just have to add all the different listeners to handle the rest of the language. Let’s start with color and message.

15. Working with a Listener

We have seen how to start defining a listener. Now let’s get serious and see how to evolve it into a complete, robust listener. Let’s start by adding support for color and checking the results of our hard work.

ANTLR4 Javascript output

Except that it doesn’t work. Or maybe it works too much: we are writing some part of the message twice (“this will work”): first when we check the specific nodes, children of message, and then at the end.

Luckily with Javascript we can dynamically alter objects, so we can take advantage of this fact to change the *Context objects themselves.

Only the modified parts are shown in the snippet above. We add a text field to every node, containing its transformed text; then, at the exit of every message, we print that text if it’s the primary message, the one that is a direct child of the line rule. If it’s a message that is instead a child of color, we add the text field to the node we are exiting and let color print it. We check this on line 30, where we look at the parent node to see if it’s an instance of the object LineContext. This is also further evidence of how each ctx argument corresponds to the proper type.

Between lines 23 and 27 we can see another field of every node of the generated tree: children, which obviously contains the child nodes. You can observe that if the text field exists we add it to the proper variable, otherwise we use the usual function to get the text of the node.

16. Solving Ambiguities with Semantic Predicates

So far we have seen how to build a parser for a chat language in Javascript. Let’s continue working on this grammar but switch to Python. Remember that all code is available in the repository. Before that, we have to solve an annoying problem: the TEXT token. The solution we have is terrible, and furthermore, if we tried to get the text of the token we would have to trim the edges, parentheses or square brackets. So what can we do?

We can use a particular feature of ANTLR called semantic predicates. As the name implies, they are expressions that produce a boolean value. They selectively enable or disable the following rule and thus permit us to solve ambiguities. Another reason they could be used is to support different versions of the same language, for instance a version with a new construct and an older one without it.

Technically they are part of the larger group of actions, which allow you to embed arbitrary code into the grammar. The downside is that the grammar is no longer language independent, since the code in the action must be valid for the target language. For this reason, it is usually considered a good idea to only use semantic predicates when they cannot be avoided, and to leave most of the code to the visitor/listener.

We restored link to its original formulation, but we added a semantic predicate to the TEXT token, written inside curly brackets and followed by a question mark. We use self._input.LA(-1) to check the character before the current one: if this character is a square bracket or an open parenthesis, we activate the TEXT token. It’s important to repeat that this must be valid code in our target language; it’s going to end up in the generated Lexer or Parser, in our case in ChatLexer.py.

This matters not just for the syntax itself, but also because different targets might have different fields or methods; for instance, LA returns an int in Python, so we have to convert the char to an int.
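
For reference, this is a sketch of what such a token rule with a Python-target predicate can look like (the exact rule in the companion repository may differ slightly):

```
// only match the free text when the previous character was '[' or '('
TEXT : {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}? ~[\])]+ ;
```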

The equivalent form in other languages follows the same pattern; only the reference to the input stream changes with the target (for instance _input in Java or this._input in Javascript).

If you want to test for the preceding token, you can use _input.LT(-1), but you can only do that for parser rules; for example, if you want to enable the mention rule only when it is preceded by a WHITESPACE token.

17. Continuing the Chat in Python

Before seeing the Python example, we must modify our grammar and put the TEXT token before the WORD one. Otherwise ANTLR might assign the incorrect token in cases where the characters between parentheses or brackets are all valid for WORD, for instance if it were [this](link).

Using ANTLR in Python is not more difficult than with any other platform; you just need to pay attention to the version of Python, 2 or 3.
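
For reference, generating the Python parser is a single invocation of the ANTLR tool; here we assume that antlr4 is an alias for running the ANTLR jar:

```
# -Dlanguage selects the target; use Python2 for the older version
antlr4 -Dlanguage=Python3 Chat.g4
```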

And that’s it. So once you have run the command inside the directory of your Python project, there will be a newly generated parser and lexer. You may find it interesting to look at ChatLexer.py and in particular the function TEXT_sempred (sempred stands for semantic predicate).

You can see our predicate right in the code. This also means that you have to check that the correct libraries, for the functions used in the predicate, are available to the lexer.

18. The Python Way of Working with a Listener

The main file of a Python project is very similar to a Javascript one, mutatis mutandis of course. That is to say we have to adapt libraries and functions to the proper version for a different language.

We have also changed the input and output to files; this avoids the need to launch a server in Python or the problem of using characters that are not supported in the terminal.
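
A minimal sketch of such a main file could look like the following; the file names input.chat and output.html, and the idea of passing the output file to the listener’s constructor, are assumptions made for this example, while the complete listing lives in the repository:

```
from antlr4 import FileStream, CommonTokenStream, ParseTreeWalker

from ChatLexer import ChatLexer
from ChatParser import ChatParser
from HtmlChatListener import HtmlChatListener

def main():
    # read the chat from a file instead of a string
    input_stream = FileStream('input.chat', encoding='utf-8')
    lexer = ChatLexer(input_stream)
    tokens = CommonTokenStream(lexer)
    parser = ChatParser(tokens)
    tree = parser.chat()  # chat is the root rule of the grammar

    # write the HTML produced by the listener to a file
    with open('output.html', 'w') as output:
        listener = HtmlChatListener(output)
        walker = ParseTreeWalker()
        walker.walk(listener, tree)

if __name__ == '__main__':
    main()
```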

Apart from lines 35-36, where we introduce support for links, there is nothing new. Though you might notice that Python syntax is cleaner and, while it has dynamic typing, it is not as loosely typed as Javascript. The different types of *Context objects are explicitly written out. If only Python tools were as easy to use as the language itself. But of course we cannot just fly over Python like this, so we also introduce testing.

19. Testing with Python

While Visual Studio Code has a very nice extension for Python that also supports unit testing, we are going to use the command line for the sake of compatibility.
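
Assuming the tests follow the standard test_*.py naming convention, the built-in unittest runner is enough to run them from the command line:

```
python3 -m unittest
```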

That’s how you run the tests, but before that we have to write them. Actually, even before that, we have to write an ErrorListener to manage the errors that we could find. While we could simply read the text output by the default error listener, there is an advantage in using our own implementation, namely that we can more easily control what happens.

Our class derives from ErrorListener and we simply have to implement syntaxError. We also add a property, symbol, to easily check which symbol might have caused an error.
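
A minimal sketch of such a listener, using the names adopted in the rest of this section, could be:

```
from antlr4.error.ErrorListener import ErrorListener

class ChatErrorListener(ErrorListener):
    def __init__(self):
        super().__init__()
        self.symbol = ''

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # remember the offending symbol so that a test can inspect it
        if offendingSymbol is not None:
            self.symbol = offendingSymbol.text
```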

The setup method is used to ensure that everything is properly set; on lines 19-21 we also set up our ChatErrorListener, but first we remove the default one, otherwise it would still output errors on the standard output. We are listening to errors in the parser, but we could also catch errors generated by the lexer. It depends on what you want to test. You may want to check both.

The two proper test methods check for a valid and an invalid name. The checks are linked to the property symbol that we have previously defined: if it’s empty everything is fine, otherwise it contains the symbol that created the error. Notice that on line 28 there is a space at the end of the string, because we have defined the rule name to end with a WHITESPACE token.
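
A condensed sketch of the test class (the real tests in the repository use a setUp method and also check an invalid name):

```
import unittest
from antlr4 import InputStream, CommonTokenStream

from ChatLexer import ChatLexer
from ChatParser import ChatParser
from ChatErrorListener import ChatErrorListener

class TestChatParser(unittest.TestCase):
    def make_parser(self, text):
        lexer = ChatLexer(InputStream(text))
        parser = ChatParser(CommonTokenStream(lexer))
        # replace the default console error listener with our own
        self.error_listener = ChatErrorListener()
        parser.removeErrorListeners()
        parser.addErrorListener(self.error_listener)
        return parser

    def test_valid_name(self):
        parser = self.make_parser("john ")  # trailing space: name ends with WHITESPACE
        parser.name()
        self.assertEqual(self.error_listener.symbol, '')
```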

20. Parsing Markup

ANTLR can parse many things, including binary data; in that case tokens are made up of non-printable characters. But a more common problem is parsing markup languages such as XML or HTML. Markup is also a useful format to adopt for your own creations, because it allows you to mix unstructured text content with structured annotations. Such documents fundamentally represent a form of smart document, containing both text and structured data. The technical term that describes them is island languages. This type is not restricted to markup, and sometimes it’s a matter of perspective.

For example, you may have to build a parser that ignores preprocessor directives. In that case, you have to find a way to distinguish proper code from directives, which obey different rules.

In any case, the problem with parsing such languages is that there is a lot of text that we don’t actually have to parse, but that we cannot ignore or discard, because the text contains useful information for the user and is a structural part of the document. The solution is lexical modes, a way to parse structured content inside a larger sea of free text.

21. Lexical Modes

We are going to see how to use lexical modes, by starting with a new grammar.

Looking at the first line you can notice a difference: we are defining a lexer grammar, instead of the usual (combined) grammar. You simply can’t define a lexical mode in a combined grammar: you can use lexical modes only in a lexer grammar. The rest is not surprising; as you can see, we are defining a sort of BBCode markup, with tags delimited by square brackets.

On lines 3, 7 and 9 you will find basically all that you need to know about lexical modes. You define one or more tokens that can delimit the different modes and activate them.

The default mode is already implicitly defined; if you need to define your own, you simply use mode followed by a name. Other than for markup languages, lexical modes are typically used to deal with string interpolation: when a string literal can contain more than simple text, such as arbitrary expressions.

When we used a combined grammar we could define tokens implicitly: when in a parser rule we used a string like ‘=’, that is what we did. Now that we are using separate lexer and parser grammars we cannot do that. That means that every single token has to be defined explicitly. So we have definitions like SLASH or EQUALS which typically could just be used directly in a parser rule. The concept is simple: in the lexer grammar we need to define all tokens, because they cannot be defined later in the parser grammar.
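
A condensed sketch of such a lexer grammar follows; the grammar in the repository is richer, and the line numbers mentioned above refer to it rather than to this sketch:

```
lexer grammar MarkupLexer;

OPEN   : '[' -> pushMode(BBCODE) ; // an open bracket activates the BBCODE mode
TEXT   : ~('[')+ ;                 // everything else is plain text

mode BBCODE;

CLOSE  : ']' -> popMode ;          // a closed bracket returns us to the default mode
SLASH  : '/' ;
EQUALS : '=' ;
STRING : '"' .*? '"' ;
ID     : [a-zA-Z]+ ;
WS     : [ \t\r\n] -> skip ;
```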

22. Parser Grammars

We look at the other side of a lexer grammar, so to speak.

On the first line we define a parser grammar. Since the tokens we need are defined in the lexer grammar, we need to use an option to tell ANTLR where it can find them. This is not necessary in combined grammars, since the tokens are defined in the same file.

There are many other options available; you can find them in the documentation.

There is almost nothing else to add, except that we define a content rule so that we can more easily manage the text that we find later in the program.
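
Putting it together, a sketch of the parser grammar could be something along these lines (again, an approximation of the one in the repository):

```
parser grammar MarkupParser;

options { tokenVocab=MarkupLexer; } // tells ANTLR where the tokens are defined

file      : element* ;
attribute : ID '=' STRING ;
content   : TEXT ;
element   : content | tag ;
tag       : '[' ID attribute? ']' element* '[' '/' ID ']' ;
```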

I just want to point out that, as you can see, we don’t need to explicitly use the tokens every time (e.g. SLASH), but instead we can use the corresponding text (e.g. ‘/’).

ANTLR will automatically transform the text into the corresponding token, but this can happen only if the tokens are already defined. In short, it is as if we had written:
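
For instance, the attribute and tag rules of the sketch above, rewritten with explicit token names:

```
attribute : ID EQUALS STRING ;
tag       : OPEN ID attribute? CLOSE element* OPEN SLASH ID CLOSE ;
```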

But we could not have used the implicit way if we hadn’t already explicitly defined them in the lexer grammar. Another way to look at this is: when we define a combined grammar ANTLR defines for us all the tokens that we have not explicitly defined ourselves. When we use a separate lexer and parser grammar, we have to define every token ourselves. Once we have done that, we can use them in any way we want.

Before moving to actual Java code, let’s see the AST for a sample input.

Sample AST of the Markup parser

You can easily notice that the element rule is sort of transparent: where you would expect to find it there is always going to be a tag or content. So why did we define it? There are two advantages: it avoids repetition in our grammar and it simplifies managing the results of the parsing. We avoid repetition because if we did not have the element rule we would have to repeat (content|tag) everywhere it is used. What if one day we add a new type of element? In addition to that, it simplifies the processing of the AST because it makes both the nodes representing tag and content extend a common ancestor.

Advanced

In this section we deepen our understanding of ANTLR. We will look at more complex examples and situations we may have to handle in our parsing adventures. We will learn how to perform more advanced testing, to catch more bugs and ensure a better quality for our code. We will see what a visitor is and how to use it. Finally, we will see how to deal with expressions and the complexity they bring.

You can come back to this section when you need to deal with complex parsing problems.

23. The Markup Project in Java

You can follow the instructions in Java Setup or just copy the antlr-java folder of the companion repository. Once the file pom.xml is properly configured, this is how you build and execute the application.

As you can see, it isn’t any different from any typical Maven project, although it’s indeed more complicated than a typical Javascript or Python project. Of course, if you use an IDE you don’t need to do anything different from your typical workflow.

24. The Main App.java

We are going to see how to write a typical ANTLR application in Java.

At this point the main Java file should not come as a surprise; the only new development is the visitor. Of course, there are the obvious little differences in the names of the ANTLR classes and such. This time we are building a visitor, whose main advantage is the chance to control the flow of the program. While we are still dealing with text, we don’t want to display it; we want to transform it from pseudo-BBCode to pseudo-Markdown.

25. Transforming Code with ANTLR

The first issue to deal with in our translation from pseudo-BBCode to pseudo-Markdown is a design decision. Our two languages are different and frankly neither of the two original ones is that well designed.

BBCode was created as a safety precaution, to make it possible to disallow the use of HTML while giving some of its power to users. Markdown was created to be an easy to read and write format, that could be translated into HTML. So they both mimic HTML, and you can actually use HTML in a Markdown document. Let’s start looking into how messy a real conversion would be.

The first version of our visitor prints all the text and ignores all the tags.

You can see how to control the flow, either by calling visitChildren, or any other visit* function, and deciding what to return. We just need to override the methods that we want to change. Otherwise, the default implementation would just behave like visitContent, on line 23: it visits the children nodes and allows the visitor to continue. Just like for a listener, the argument is the proper context type. If you want to stop the visitor just return null, as on line 15.

26. Joy and Pain of Transforming Code

Transforming code, even at a very simple level, comes with some complications. Let’s start easy with some basic visitor methods.

Before looking at the main method, let’s look at the supporting ones. First, we have changed visitContent by making it return its text instead of printing it. Second, we have overridden visitElement so that it prints the text of its child, but only if it’s a top element, and not inside a tag. In both cases, it achieves this by calling the proper visit* method. It knows which one to call because it checks whether it actually has a tag or content node.

VisitTag contains more code than any other method, because a tag can also contain other elements, including other tags that have to be managed themselves, and thus they cannot simply be printed. We save the content of the ID on line 5; of course we don’t need to check that the corresponding end tag matches, because the parser will ensure that, as long as the input is well formed.

The first complication starts at lines 14-15: as often happens when transforming one language into a different one, there isn’t a perfect correspondence between the two. While BBCode tries to be a smarter and safer replacement for HTML, Markdown wants to accomplish the same objective as HTML: to create a structured document. So BBCode has an underline tag, while Markdown does not.

So we have to make a decision.

Do we want to discard the information, or directly print HTML, or something else? We choose something else and instead convert the underline to an italic. That might seem completely arbitrary, and indeed there is an element of choice in this decision. But the conversion forces us to lose some information, and since both are used for emphasis, we choose the closest thing in the new language.

The following case, on lines 18-22, forces us to make another choice. We can’t maintain the information about the author of the quote in a structured way, so we choose to print the information in a way that will make sense to a human reader.

On lines 28-34 we do our “magic”: we visit the children and gather their text, then we close with the endDelimiter. Finally we return the text that we have created.

That’s how the visitor works:

  1. every top element visits each child
    • if it’s a content node, it directly returns the text
    • if it’s a tag, it sets up the correct delimiters and then checks its children. It repeats step 2 for each child and then returns the gathered text
  2. it prints the returned text

It’s obviously a simple example, but it shows how you can have great freedom in managing the visitor once you have launched it. Together with the patterns that we have seen at the beginning of this section, you can see all of the options: return null to stop the visit, return the children to continue, return something to perform an action ordered at a higher level of the tree.

27. Advanced Testing

The use of lexical modes permits handling the parsing of island languages, but it complicates testing.

We are not going to show MarkupErrorListener.java because we did not change it; if you need it you can see it in the repository.

You can run the tests by using the following command.
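
For a Maven project like this one, that typically means:

```
mvn test
```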

Now we are going to look at the test code. We are skipping the setup part, because that is also obvious: we just copy the process seen in the main file, but we simply add our error listener to intercept the errors.

The first two methods are exactly as before: we simply check that there are no errors, or that there is the correct one because the input itself is erroneous. On lines 30-32 things start to get interesting: the issue is that by testing the rules one by one we don’t give the parser the chance to switch automatically to the correct mode. So it always remains in the DEFAULT_MODE, which in our case makes everything look like TEXT. This obviously makes the correct parsing of an attribute impossible.

The same lines also show how you can check the current mode that you are in, and the exact type of the tokens that are found by the parser, which we use to confirm that indeed all is wrong in this case.

While we could use a string of text to trigger the correct mode each time, that would make the tests intertwined with several pieces of code, which is a no-no. So the solution is seen on line 39: we trigger the correct mode manually. Once you have done that, you can see that our attribute is recognized correctly.

28. Dealing with Expressions

So far we have written simple parser rules; now we are going to see one of the most challenging parts in analyzing a real (programming) language: expressions. While rules for statements are usually larger, they are quite simple to deal with: you just need to write a rule that encapsulates the structure with all the different optional parts. For instance a for statement can include all other kinds of statements, but we can simply include them with something like statement*. An expression, instead, can be combined in many different ways.

An expression usually contains other expressions. For example, the typical binary expression is composed of an expression on the left, an operator in the middle and another expression on the right. This can lead to ambiguities. Think, for example, of the expression 5 + 3 * 2: for ANTLR this expression is ambiguous because there are two ways to parse it. It could either parse it as 5 + (3 * 2) or (5 + 3) * 2.

Until this moment we have avoided the problem simply because markup constructs surround the object to which they are applied. So there is no ambiguity in choosing which one to apply first: it’s the most external. Imagine if the grouping in this expression were made explicit.
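
For instance, a fully parenthesized version could look like this:

```
5 + (3 * 2)
```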

That would make it obvious to ANTLR how to parse it.
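
In a grammar, the kind of expression rule we are talking about looks something like the following sketch:

```
expression : expression '*' expression
           | expression '+' expression
           | NUMBER
           ;
```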

These types of rules are called left-recursive rules. You might say: just parse whatever comes first. The problem with that is semantic: the addition comes first, but we know that multiplication has precedence over addition. Traditionally the way to solve this problem was to create a complex cascade of specific expressions like this:
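
For example, something along these lines, where each rule of the cascade handles one level of precedence:

```
addition       : multiplication ('+' multiplication)* ;
multiplication : atom ('*' atom)* ;
atom           : NUMBER ;
```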

This way ANTLR would have known to search first for a number, then for multiplications and finally for additions. This is cumbersome and also counterintuitive, because the last expression is the first to be actually recognized. Luckily ANTLR4 can create a similar structure automatically, so we can use a much more natural syntax.
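
With ANTLR4 we can keep the left-recursive form and simply list the alternatives in order of precedence, as in this sketch:

```
expression : expression '*' expression  // listed first: it binds tighter
           | expression '+' expression
           | NUMBER
           ;
```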

In practice ANTLR considers the order in which we define the alternatives to decide the precedence. By writing the rule in this way we are telling ANTLR that multiplication has precedence over addition.

29. Parsing Spreadsheets

Now we are prepared to create our last application, in C#. We are going to build the parser for an Excel-like application. In practice, we want to manage the expressions you write in the cells of a spreadsheet.
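
The grammar could look roughly like the following sketch; the actual grammar in the repository differs in some details, and the line numbers and ordering mentioned below refer to that original file:

```
grammar Spreadsheet;

expression : '(' expression ')'                       #parenthesisExp
           | <assoc=right> expression '^' expression  #powerExp
           | expression ('*'|'/') expression          #mulDivExp
           | expression ('+'|'-') expression          #addSubExp
           | NAME '(' expression ')'                  #functionExp
           | NUMBER                                   #numericAtomExp
           | ID                                       #idAtomExp
           ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;

NAME       : LETTER+ ;                 // a function name such as sqrt
ID         : LETTER+ DIGIT+ ;          // a cell reference such as A1
NUMBER     : DIGIT+ ('.' DIGIT+)? ;
WHITESPACE : [ \t\r\n]+ -> skip ;
```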

With all the knowledge you have acquired so far everything should be clear, except for possibly three things:

  1. why the parentheses are there,
  2. what’s the stuff on the right,
  3. that thing on line 6.

The parentheses alternative comes first because its only role is to give the user a way to override the precedence of operators, if they need to do so. This graphical representation of the AST should make it clear.

Parentheses to change the operator precedence

The things on the right are labels; they are used to make ANTLR generate specific functions for the visitor or listener. So there will be a VisitFunctionExp, a VisitPowerExp, etc. This makes it possible to avoid using one giant visitor function for the expression rule.

The rule for exponentiation is different because there are two possible ways to group two sequential expressions of the same type. The first one is to execute the one on the left first and then the one on the right; the second one is the inverse. This is called associativity. Usually the one that you want to use is left-associativity, which is the default option. Nonetheless exponentiation is right-associative, so we have to signal this to ANTLR.

Another way to look at this is: if there are two expressions of the same type, which one has the precedence: the left one or the right one? Again, an image is worth a thousand words.

Associativity of an expression

We also have support for functions, alphanumeric variables that represent cells, and real numbers.

30. The Spreadsheet Project in C#

You just need to follow the C# Setup: install a NuGet package for the runtime and an ANTLR4 extension for Visual Studio. The extension will automatically generate everything whenever you build your project: parser, listener and/or visitor.

Update. Notice that there are two small differences between the code for a project using the extension and one using the Java tool. These are noted in the README for the C# project at the repository.

After you have done that, you can also add grammar files just by using the usual menu Add -> New Item. Do exactly that to create a grammar called Spreadsheet.g4 and put in it the grammar we have just created. Now let’s see the main Program.cs.

There is nothing much to say, apart from the fact that, of course, you have to pay attention to yet another slight variation in the naming of things: note the casing. For instance, AntlrInputStream, in the C# program, was ANTLRInputStream in the Java program.

Also you can notice that, this time, we output the result of our visitor on the screen, instead of writing it to a file.

31. Excel is Doomed

We are going to take a look at our visitor for the Spreadsheet project.