We extensively use ANTLR in our work, so we have seen what works and what does not. This article represents the summary of our experience. It explains the principles and patterns that we use, or avoid, when designing ANTLR parsers.
We have written about many different aspects of ANTLR: from teaching you everything you need to know about ANTLR in our mega tutorial to a performance guide. So, if you have already read them and are looking for higher level guidance on how to create your parser, this is the article for you.
We have also created pragmatic howtos for using ANTLR in a custom editor or to power a language server for the much appreciated Visual Studio Code. These are all great articles that work well if you have a passion project involving ANTLR: you read them and then you get back to programming. If instead you need to use ANTLR professionally, you might want to look at our complete course about ANTLR: Using ANTLR Like a Professional. The course explains how to use ANTLR in a more structured way and can therefore also cover more advanced content.
The Rules
There are a few rules that you should keep in mind when creating an ANTLR parser. They range from practical tips for writing grammar rules to general principles in using your parser:
- Parse the whole input
- Label everything you can
- A parser can only parse one language
- Use as little code as possible in your grammar
- Testing
- Use an AST
This is the list of the six rules that we have found work well in our experience. Let’s see them in depth.
Parsing the Whole Input
You should make sure you are parsing the whole input. This is valid both in the sense that your grammar rules should cover all possible input and that you should effectively consume all input. Fundamentally, this is to make your life easier in writing and debugging your grammar.
In theory, this rule could just be: you should know exactly what input you are parsing. The issue is that in practice it is hard to check that you are always parsing just the part of the input that you care about. If there is a mistake in your parser rules, you might parse more in some parts and less in others. Let’s take this simple example:
rule: first_statement second_statement third_statement;
If there is a mistake in parsing the first_statement, either because the input or the rule is wrong, then you have no idea what second_statement is going to see. These windows of parsing might shift depending on the input. There is no simple way to reliably check that you are parsing exactly a specific section, no more and no less.
So, while the principle is to know exactly what you are parsing, the practical application is to make sure you are parsing the whole input.
Parse Everything, Even if You Do Not Need It
The first sense of the rule means that, even when you just need to parse some statements or parts of the language, as in our Parsing SQL article, you should create rules that cover all the input. Even in such cases, it is better to parse the whole input, but relegate all the uninteresting parts to rules whose content you can ignore.
This is the best approach, because we have two conflicting needs: to ignore or drop some information and to keep all the necessary context to parse what we need. The more we ignore, the harder it is to parse what we need. This is particularly tricky for lexing, since the lexer has less information available to make a decision. So, by parsing the whole input we reduce errors and eliminate uncertainty in defining our rules.
The exact way to do that, when you are just interested in some parts of the input, varies from case to case. A general pattern is to have the main rule of our grammar also reference an ignore rule that contains the parts that will be discarded later. You can see the complete example in our linked article; here we just show how it looks.
statements : (statement | ignore)+ EOF ;
statement  : createStmt ;
ignore     : .*? SEMICOLON ;
The main rule statements can accept either a statement (i.e., what we care about) or the parts to ignore. What is captured by the rule ignore will simply be ignored by the users of the parser.
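For example, in a visitor you can walk only the statement subtrees and never touch what the ignore rule matched. Here is a minimal sketch in C#, assuming the generated parser and base visitor are called ExampleParser and ExampleBaseVisitor (the names are assumptions):
public class StatementProcessor : ExampleBaseVisitor<object>
{
    public override object VisitStatements(ExampleParser.StatementsContext context)
    {
        // only the statement children matter; whatever the ignore rule matched is never visited
        foreach (var stmt in context.statement())
        {
            Visit(stmt);
        }
        return null;
    }
}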
Consume All the Input
The second meaning, consuming all the input, is important for a couple of reasons. The first one is that any unexpected input can slow down parsing. The second one is that unexpected input might shift the attribution of errors or even hide them. ANTLR might not always be able to report errors in the way you might expect. For instance, imagine you are writing a statement like a class definition in a C-like language. This is the input you get:
class Example // missing { here
    string Field
    [..]
A human should recognize that this input is missing a curly bracket {. However, depending on how you define the rules, ANTLR might report different errors or no errors at all. If class Example and string Field are both valid statements, and a newline is a valid separator, ANTLR might just stop parsing after string Field and report no errors. Why? Because you defined a grammar like this one:
file: statements+;
statements: class_definition
          | variable
          [..]
This grammar defines a rule file that captures one or more statements. It does not say anywhere that it is an error to also have something more after the sequence of statements. For example, this could be valid input.
class Example // missing { here
    string Field
I see Cthulhu dancing right behind you
As long as there is a sequence of at least one statement, and the lexer does not see illegal characters, the input is valid and correct according to the grammar.
And when you are dealing with unfamiliar languages, or complex statements, even you might have a hard time understanding where the error is. This makes debugging harder.
So, to make your life easier you want to be sure that you are parsing what you expect, in order to remove one level of uncertainty and errors.
The Grammar Rules to Consume All Input
The first part of achieving that is adding an EOF token to the end of your main parser rule. The EOF token is automatically defined by ANTLR to capture the end of file.
file: statements EOF;
This means that your parser must handle the whole input. In other words, it ensures that you will see an error if you are not parsing the whole input.
The second step is adding an ANY rule at the end of the lexer.
ANY: . ;
It is vital that this is the last rule, otherwise it could capture content meant for other rules; it must act only as a fallback for characters that no other rule matches. This lexer rule matches any single character. So, if you find it somewhere in the output of the parser, it means that your grammar is not handling some part of the input. The parsing might still succeed, because ANTLR can overcome typos and small errors, but these leftovers will impact performance or lead the parser to recognize the input in unexpected ways.
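To catch these cases early, you can scan the token stream for ANY tokens before parsing. Here is a minimal sketch in C#, assuming a generated lexer called ExampleLexer with the ANY rule above (the lexer name is an assumption):
using System;
using System.IO;
using Antlr4.Runtime;

var lexer = new ExampleLexer(new AntlrInputStream(File.ReadAllText("input.example")));
var tokens = new CommonTokenStream(lexer);
tokens.Fill();
foreach (var token in tokens.GetTokens())
{
    // any ANY token is a character that no other lexer rule handled
    if (token.Type == ExampleLexer.ANY)
    {
        Console.WriteLine($"Unhandled character '{token.Text}' at line {token.Line}");
    }
}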
Label Everything You Can
You should label every part you can. This is valid both for parts of the rules and for an entire alternative. Labels facilitate accessing the content you care about and navigating the parse tree.
When you write a grammar, your first worry is to write rules that parse the language correctly. Of course, that is a good first step. The second one is making it easy to work with the parser. To achieve that, it helps to use labels. ANTLR has two kinds of labels: alternative labels and rule element labels; both can be useful. We assume you are familiar with these two kinds of labels, but here is an example.
expression : left=expression '*' right=expression #multiplication
           | expression '+' expression             #addition
           | NUMBER                                 #atom
           ;
Alternative labels are the ones that follow a #, while rule element labels are the ones preceding the = sign. They serve two different purposes: the first ones make it easy to act differently for each alternative, while the second ones make it easy to access the different parts of the rule.
Why Use Alternative Labels
Alternative labels create different nodes, that is, different *Context objects in the parse tree. If you label one alternative, you have to label all alternatives, because there will be no base node. For example, with the previous rule expression, there will be no expression node in the parse tree.
The consequence is that it is easier to recognize the different alternatives and to act differently depending on the alternative the parser found. Imagine you are navigating the parse tree with a visitor. Without alternative labels, you would need some kind of heuristic inside visitExpression to understand what kind of expression you are in. For example, you would have to check which kind of sign there is, or the length of the expression array (the array that contains the subexpressions, such as the left and right parts of the expression). Instead, by using alternative labels you automatically get visitAddition, visitMultiplication, etc. methods.
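Here is a minimal sketch of a visitor that relies on the labeled alternatives above; it assumes the grammar is called Example, so the C# target generates ExampleParser and ExampleBaseVisitor (the same names used in the snippet in the next section, where VisitMultiplication is shown):
public class EvalVisitor : ExampleBaseVisitor<double>
{
    // one method per labeled alternative: no heuristics needed to tell them apart
    public override double VisitAddition(ExampleParser.AdditionContext context)
        => Visit(context.expression(0)) + Visit(context.expression(1));

    public override double VisitAtom(ExampleParser.AtomContext context)
        => double.Parse(context.NUMBER().GetText());
}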
Why Use Rule Element Labels
Rule element labels instead provide an alternative way to access the content parsed by the subrule to which the label is assigned. Basically, they create a field in the corresponding *Context object, but you can still access the same content in the traditional way. So, in the previous example, you can access the left part of the expression using the left field or by accessing the first element of the expression array.
public override double VisitMultiplication(ExampleParser.MultiplicationContext context)
{
    // with rule element labels
    double left = Visit(context.left);
    double right = Visit(context.right);
    // always valid: the same subtrees accessed by position
    double leftByPosition = Visit(context.expression(0));
    double rightByPosition = Visit(context.expression(1));
    return left * right;
}
Rule element labels are usually good to improve readability and ease of use, but they can be a real lifesaver when dealing with optional parts. For instance, imagine this rule.
CASE case_expr=expression? (ADD add_expr+=expression)+ (FINALLY final_expr=expression)? END
You could not easily get the case condition (case_expr) or the finally expression (final_expr) without labels. You could get them by comparing the position of the individual expressions to elements such as the ADD or FINALLY tokens, but that would be much more cumbersome.
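With the labels in place, the optional parts become simple null checks. A minimal sketch, assuming the alternative above belongs to a rule called case_stmt in the same Example grammar (both names are assumptions):
public override double VisitCase_stmt(ExampleParser.Case_stmtContext context)
{
    // optional labeled parts are simply null when the input omits them
    double result = 0;
    if (context.case_expr != null)
    {
        result = Visit(context.case_expr);
    }
    if (context.final_expr != null)
    {
        result = Visit(context.final_expr);
    }
    return result;
}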
A Parser Can Only Parse One Language
A parser can only parse one language. Designing software is complicated because of the immense freedom you have. You can do almost everything, therefore you have to set your own limitations to create order and build software that is understandable and manageable. This need has created different principles, like separation of concerns: every piece of software should deal with one need. Or encapsulation: the idea of grouping related data in a class and hiding the inner workings of the class from the outside.
This principle reveals itself also in designing languages and therefore in designing grammars. A parser should parse one language. This seems an obvious principle, but it is not always trivial to understand how it applies to your case. For example, you may think of C++ as one language. However, C++ is made up of the language itself and its preprocessor.
The preprocessor is usually a separate program that manipulates the initial source code and produces the actual source code that will be fed to the compiler. The preprocessor can do things like removing or adapting some parts depending on the platform (e.g., conditional compilation allows using some parts of the source code on Windows and other ones on Linux). Essentially, it tweaks the code to adapt it to the circumstances.
In short, the preprocessor does something different from the rest of the code in a C++ file. It serves different needs and operates according to different rules. It is a separate language and therefore you need to build a separate parser specifically for that. This is an issue of syntax as much as of behavior.
Look at the following example.
#include <stdio.h>

int main(void) {
    printf("Hello, world!\n");
    return 0;
}
You may think that it would be easy to adapt the C++ parser to also handle its preprocessor language: you just need to parse and ignore lines starting with a hash #. It works in this case, but what about this one?
#include <stdio.h>

int main(void) {
    printf("Hello, world!\n");
#ifdef VERBOSE
    printf("Printing a trace message");
#endif
    return 0;
}
Now, you have a problem. What is the actual C++ source code that you have to execute/interpret/transpile? Should it include the trace message or not? There is no good answer to that. You might think that you can just discard everything between preprocessor directives. But can you?
#include <stdio.h>

#ifdef WORLD
int main(void) {
    printf("Hello, world!\n");
    return 1;
}
#else
int main(void) {
    printf("Hello, USA!\n");
    return 0;
}
#endif
Depending on the preprocessor execution, you effectively get two different programs. If you ignore the preprocessor instead, you get no valid program at all.
Sometimes you can get away with ignoring a language or hacking a partial parser for the other language on top of your existing one. This might be possible, depending on your input. So, this is a viable option if you can control and require the input to exclude the foreign language. However, this is not always possible and it is never pretty. You are building technical debt that might bankrupt your software later on.
Now, to be fair, I understand your pain. Mixing different languages in the same source file can now be considered a bad practice. It makes it harder to understand what the code means, and it also changes what the code actually does depending on conditions that could be hard to control or test. Nowadays you would probably either design entirely separate languages or integrate the functionality in the main language. Or you may even see a combination of both approaches. However, you cannot fight bad with bad. You can only build an effective parser for one language at a time, because different languages require different rules.
The correct way of dealing with this situation is realizing that you have two languages and therefore you need two parsers that must be properly used. You should parse the preprocessor language and handle it with an interpreter. This interpreter should generate the source code to give to the C++ parser.
Different Lexers Can Peacefully Coexist
A parser can only parse one language, and a lexer can only lex one language. However, you can make one parser use different lexers. In fact, this is encouraged by ANTLR with the much-appreciated feature called lexical modes.
Okay, it is true that lexical modes are not literally different lexers. However, they come very close. With lexical modes you can effectively define separate sets of rules that will be triggered by what the lexer sees in the code. For example, the lexer can apply different rules inside a string than the ones outside a string. In fact, they are commonly used to implement string interpolation. Another common use case is to parse markup languages.
Why is mixing different lexical modes good, while trying to mix different parsers is bad? Well, lexical modes are an effective way to get around the stupidity of lexers. A lexer is necessarily less smart than the proper parser, because it has less information, less context to make its decisions. The lexer does a hard job exactly to make the job of the parser easier. In some sense, the lexer creates the context that the parser will be able to use to make its decisions. That is why the lexer can see different worlds but creates one world for the parser to see. The different lexical modes will work differently, but their output will work the same. In other words, the output of different lexers can still follow the same syntax. Look at this example.
simple_text = "hello number 1";
inter_text = $"hello number {number+1}";
result = number+1;
Usually all text inside double quotes is treated as one string. The lexer just takes everything it sees and makes one token out of it (e.g., STRING). It does not matter if it sees spaces, numbers, symbols, etc.: it is all the same. However, inside an interpolated string, everything between the curly brackets ({ and }) is treated differently, as if it were outside the double quotes. So, the lexer will produce tokens for the variable, the sign and the number.
For the parser, on the other hand, the origin of the tokens makes no difference: whether they come from inside or outside an interpolated string, it will see an expression with the same structure and it can handle them the same way. So, you should take advantage of different lexical modes and use them for the same parser. Effectively, this is another way of following good separation of concerns. Each lexical mode will cater to one concern, one different set of rules. This will allow you to create simpler lexers that are easier to manage and have fewer bugs.
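String interpolation needs a bit more machinery, so here is a minimal sketch of lexical modes using the markup use case mentioned above (all names are illustrative). The same characters are tokenized by completely different rules inside and outside a tag, yet the parser just sees one stream of tokens:
lexer grammar MarkupLexer;

// default mode: free text outside of tags
OPEN : '<' -> pushMode(INSIDE_TAG) ;
TEXT : ~'<'+ ;

mode INSIDE_TAG;

// inside a tag, the lexer switches to a different set of rules
CLOSE  : '>' -> popMode ;
SLASH  : '/' ;
EQUALS : '=' ;
STRING : '"' ~'"'* '"' ;
NAME   : [a-zA-Z]+ ;
WS     : [ \t\r\n]+ -> skip ;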
Use as Little Code as Possible in Your Grammar
Use as little code as possible in your grammar. An ANTLR grammar is mainly composed of lexer and parser rules. In addition to that, you can optionally add custom code written in one of the supported programming languages (e.g., Java, C#, etc.). This custom code will be added to the generated lexer and/or parser. Technically, the main advantage of avoiding custom code is portability. You also get more predictability, since you can apply tested patterns to your grammar rules that will behave the way you expect.
Personally, I think that the technical advantage is of limited value in practical terms. Yes, you can automatically use one parser in multiple languages, but this is useful only in a few cases. The most common one is when you are building libraries in multiple languages to support an API of a product of your company. To me, the main advantage of avoiding custom code is in the quality of your parser. You can use tested patterns and you do not risk getting unpredictable results when you change one rule because of side effects in your custom code.
With ANTLR you automatically get two ways to work on the parse tree generated by the parser: listeners and visitors. These two methods cover most of the common needs to perform operations on the parse tree. They are also easy to use and, in the case of the visitor, flexible enough.
To be fair, the need to use semantic predicates to cover some corner case is fairly common. You lose the portability of your parser and some predictability, which is bad. However, you still keep the overall ANTLR feel of the parser, so you can still apply your well-earned experience to understand how the parser will behave. You pay a small price for a good gain in functionality.
There Is One Good Reason to Write Custom Code
Therefore, there is only one good reason to write custom code: you need to fundamentally change the behavior of the parser itself in order to correctly parse your language. A common class of languages in which this happens is positional languages and languages in which whitespace is meaningful. In positional languages the position of individual elements is meaningful, so each element must appear at a certain position in a line.
The common characteristic that they share is that they need a smart lexer, that is, a lexer that needs contextual information to do its job. ANTLR could never support such use cases out of the box, so you have to implement it yourself.
Aside from the issues mentioned earlier, this also requires a more in-depth knowledge of how ANTLR works internally. In the larger scope of the whole program, the custom code of the parser is rarely significant in size. For example, the custom base lexer class for the Python grammar on the ANTLR grammars repository is only 138 lines long. However, it certainly makes it more cumbersome to maintain a series of parsers for different targets, each of which must have some custom code.
If you are using multiple languages because you need cross-platform support, a potential solution is to use just Kotlin for your custom code. We have developed the ANTLR Kotlin target also for this reason. You can write your custom code in Kotlin and then create a Kotlin multiplatform project to run it on the JVM, JavaScript and native. It is not exactly trivial to create a truly multiplatform Kotlin project, because some Kotlin libraries might be available only on one platform (usually the JVM). However, it is doable with good organization and with a careful separation between the parser and the rest of your program.
Testing
In recent years the world of development has seen a massive increase in the importance of testing. Actually, testing was always important, but the spread of open source and the practice of using many different libraries have made it essential. The opportunity of taking ready-to-use libraries has massively increased productivity, but also the chance of bugs. To combat this phenomenon, testing is paramount.
This is even more true for parsers, if that is even possible. That is because parsers, and development tools in general, are used by literally every program that needs them. They are also sort of invisible: people expect them to work, so they are the last thing people suspect of having bugs. If you have ever experienced discovering a compiler bug, you know how frustrating it is. It is therefore your duty to make sure that you employ a robust testing regimen. This does not necessarily require testing every single rule, but you need to have a good strategy:
- test the main rule
- test all the most important statements rules
- create a test for any bug you solved after the initial design
This is the usual strategy that we employ in our projects. The last point is simply a good practice to cover the case of recurring bugs. In fact, if you discover a bug because some code is particularly complex, there is a higher than average chance that the bug will re-appear later.
In addition to these tests for parser rules, you should create a few tests of common patterns to ensure that the parse tree looks like the one you expect. This is especially important to check that the parsing of expressions is correct, given that looking at a single rule is not enough to verify it. Basically you should:
- find a few typical programs for your language, some of which should have expressions
- manually check that the parse tree created by your parser is correct
- create a test to check that the parse tree does not change unexpectedly when you change your parser (see the sketch after this list)
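To make the last point concrete, here is a minimal sketch of such a test using xUnit. It assumes a grammar called Example with the expression rule shown earlier plus a WS -> skip rule (all names are assumptions); the expected string is the kind of value you capture once from a known-good run of the parser:
using Antlr4.Runtime;
using Xunit;

public class ParserTests
{
    [Fact]
    public void ExpressionTreeDoesNotChange()
    {
        var lexer = new ExampleLexer(new AntlrInputStream("1 + 2 * 3"));
        var parser = new ExampleParser(new CommonTokenStream(lexer));
        var tree = parser.expression();

        // the whole input must be parsed without errors
        Assert.Equal(0, parser.NumberOfSyntaxErrors);

        // the tree shape, captured from a known-good run, guards against regressions
        Assert.Equal(
            "(expression (expression 1) + (expression (expression 2) * (expression 3)))",
            tree.ToStringTree(parser));
    }
}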
This is our base testing strategy. On top of that we usually also add tests for the code that directly uses the parser. For example, we add tests for visitor or listener methods. These are a bit more complex to test because they can depend on previous methods. Usually, to test them properly you must use a mock library.
These are the general principles of testing parsers. If you need more in-depth and practical examples of how to write these tests, you can find them in our course.
What Is an AST and Why You Should Create One
We have often talked about the parse tree in this article; however, in our work we almost always end up creating an Abstract Syntax Tree (AST). They are similar: they are both trees and they both remove some information from the source code. For instance, you do not need parentheses to represent grouping in a tree: the groups are implicitly defined by the structure of the tree itself. For example, the arguments of a function are already children of the function node in a parse tree.
They differ in the level of abstraction and their purpose:
- a parse tree represents how you parsed the code
- the AST represents the code in a way that makes sense for the needs of your program
Compared to the parse tree, the AST will drop some information and change its structure. In practical terms, you are going to start from a parse tree and you will modify it to obtain an AST.
The AST fundamentally hides the details of parsing, which makes it easier to:
- optimize parsing without making it harder to work with the parser
- update the parser without changing the rest of the program
- let people that are unfamiliar with the grammar use the parser, because the parser might do things in a way that is not obvious (for example, to improve performance)
So, there are a few good reasons to create an AST.
The Stuff You Are Going to Change in an AST
Let’s look at the major things that you are going to change in an AST, compared to a parse tree.
Remove All Tokens
Tokens are usually absent from an AST and replaced with other elements.
sum: left=expression op=(PLUS|MINUS) right=expression ;
In this example the PLUS or MINUS tokens will be used to create different nodes in the AST. Depending on what we find, we can create an addition or subtraction node. The two expressions use the same syntactical structure but they mean different things. So, the rest of the program will see and treat them differently.
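To see how the operator token disappears, here is a hypothetical sketch of the corresponding AST nodes and of the visitor method that builds them, again assuming the C# target and the usual Example grammar names (the other expression alternatives would be mapped by their own Visit* methods):
public abstract record Expression;
public record Addition(Expression Left, Expression Right) : Expression;
public record Subtraction(Expression Left, Expression Right) : Expression;

public class AstBuilder : ExampleBaseVisitor<Expression>
{
    public override Expression VisitSum(ExampleParser.SumContext context)
    {
        var left = Visit(context.left);
        var right = Visit(context.right);
        // the operator token only decides which node we build; the token itself is not kept
        return context.op.Type == ExampleParser.PLUS
            ? new Addition(left, right)
            : new Subtraction(left, right);
    }
}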
select: K_SELECT (K_DISTINCT | K_ALL)? column K_FROM expression ;
Tokens are also removed because they are used to indicate a specific statement, but then they may not add any more information. We can see that in this example of a simplified select statement. The developer needs to type the word SELECT to indicate that they are writing a select statement, but then all select statements have it, so we do not really need to keep that information. It is implicit in the fact that the AST has a SelectStmt node.
Simplify the Structure of the Parse Tree
Sometimes the parse tree has different structures for what represents the same feature. Think about a version of a statement that accepts some options.
option_clause : name EQUALS value
              | OPTIONS L_PAREN name EQUALS value (COMMA name EQUALS value)* R_PAREN
              ;
In this example, we see two different ways to indicate some options. The simplified version has just the name and value of the option pair. The complete version is preceded by the keyword OPTIONS and the options are surrounded by parentheses.
In such cases we do not need to create different nodes to record the specific format that was used. So, the AST will just have an options node with a list of values for each option. The list might have only one element, but that is not relevant to us or to the users of the parser.
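For instance, the AST node could look like the following sketch (hypothetical C# types): both syntactic forms of the rule above map to the same shape.
using System.Collections.Generic;

// both syntactic forms produce the same node:
//   name = value              -> OptionsNode with one Option
//   OPTIONS (a = 1, b = 2)    -> OptionsNode with two Options
public record Option(string Name, string Value);
public record OptionsNode(IReadOnlyList<Option> Options);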
There are more complex transformations needed to make the AST more accessible. Common examples are identifiers/variable references and expressions. The rules to parse variable references can become quite complex when you consider scopes and hierarchies of objects. Consider the following example.
element      : element_part (PERIOD element_part)* ;
element_part : element_name* (AT_SIGN link_name=element_name)? arguments? ;
element_name : id_expr (PERIOD id_expr)* ;
id_expr      : name
             | delimited_id // soft keywords
             ;
The fact that these rules are referenced by many other rules and used in multiple ways also makes them hard to define well. For example, these rules would also be used to parse calls to class methods.
The parse tree will therefore have a hierarchy of nodes which may not always make sense to a person that has never seen the parser. Think about it: what is element_part? It is just an odd consequence of the need to distinguish between a reference to a variable with a link and one without.
To make it easy to work with your parser, you will have to transform and simplify the parse tree to create a cleaner, more meaningful AST.
We do not have the room to show a complete example of a transformation from parse tree to AST here. However, you can look at our library kolasu. We use it professionally to do exactly that: to create an AST. It also includes a few niceties, like a way to serialize the AST in JSON or XML. Kolasu is used in the open source project JaRIKo: a JAva virtual machine Rpg Interpreter written in KOtlin, so you can see a real world example there.
Summary
In this article we have seen the golden rules we use in our professional parsing projects with ANTLR. This article represents both the summary of many of our articles about ANTLR and the missing piece of our series. It contains the best practices we have found to help the development of ANTLR parsers, and it references the rest of our articles for detailed explanations and examples.
Read more:
If you want to know how to use ANTLR you can read our article The ANTLR Mega Tutorial