Getting started with ANTLR in C#

The code for this article is available on GitHub.

Readers of this website will know that ANTLR is a great tool to quickly create parsers, whether you are working with a known language or creating your own DSL. While the tool itself is written in Java, it can also be used to generate parsers in several other languages, for instance Python, C# or JavaScript (with more languages supported by the newly released 4.6 version).

If you want to use C# you can integrate ANTLR in your favorite IDE, as long as that IDE is any recent edition of Visual Studio. The runtime itself also works on Mono and can be used standalone; you can look at the issues for the official C# target for ANTLR 4 to see if you can make it work with other setups. The easiest way, though, is to use Visual Studio and the provided extension to integrate the generation of the parser from the grammar into your C# project.

Setup

The first step is to install the ANTLR Language Support extension for Visual Studio: you just have to search for it in Visual Studio by going to Tools > Extensions and Updates. This will allow you to easily integrate ANTLR into your workflow by automatically generating the parser and, optionally, the listener and visitor, starting from your grammar. Now you can add a new ANTLR 4 Combined Grammar or an ANTLR 4 Lexer/Parser in the same way you add any other new item. Then, for each one of your projects, you must add the NuGet package for Antlr4. If you want to manage options and, for instance, disable the visitor/listener generation, you can see the official GitHub project.

Create the Grammar

For our simple project we are going to create a grammar that parses two lines of text that represent a chat between two people. This could be the basis for a chat program or for a game in which whoever says the shortest word gets beaten up with a thesaurus. This is not relevant for the grammar itself, because it handles only the recognition of the various elements of the program. What you choose to do with these elements is managed through normal code. Add a new ANTLR 4 Combined Grammar with the name Speak. You will see that there is already some text in the new file; delete it all and replace it with the following text.
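A grammar that matches the description in this article could look like the following sketch. The fragment names (A, S, Y, LOWERCASE, UPPERCASE) and the NEWLINE at the end of the line rule are assumptions, so treat it as an illustration rather than the definitive file.

```antlr
grammar Speak;

/*
 * Parser rules
 */
chat : line line EOF ;
line : name SAYS word NEWLINE ;   // assumption: every line ends with a NEWLINE
name : WORD ;
word : WORD ;

/*
 * Lexer rules
 */

// Fragments are building blocks that can only be used inside lexer rules
fragment A         : ('A'|'a') ;
fragment S         : ('S'|'s') ;
fragment Y         : ('Y'|'y') ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;

// SAYS is defined before WORD, otherwise WORD would hide it
SAYS       : S A Y S ;
WORD       : (LOWERCASE | UPPERCASE)+ ;
WHITESPACE : (' ' | '\t')+ -> skip ;
NEWLINE    : ('\r'? '\n' | '\r')+ ;
```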

While you can create separate lexer and parser grammars, for a simple project you will want to use a combined grammar and, by convention, put the parser rules before the lexer rules. The order of the lexer rules matters: when more than one lexer rule matches the same text, ANTLR uses the one that is defined first. So it’s important to put the more specific tokens first and the generic ones, like WORD or ID, later. In this example, if we had inverted SAYS and WORD, SAYS would have been hidden by WORD. Another thing to notice is that you can’t use fragments outside of lexer rules.

Having said that, the lexer part is pretty straightforward: we identify a SAYS, which can be written in uppercase or lowercase, a WORD, which can be composed of any uppercase or lowercase letters, and a NEWLINE. Any text that is WHITESPACE, meaning spaces and tabs, is simply ignored. While this is clearly a simple case, lexer rules will hardly be more complicated than this. Usually the worst thing that can happen is having to use semantic predicates. These are essentially statements that evaluate to true or false, and when they are false they disable the following rule. For instance, you may want to treat a ‘/’ as the beginning of a comment only if it is the first character of a line; otherwise it should be considered an arithmetic operator.

The parser is usually where things get more complicated, although that’s not the case this time. Every document given to the Speak grammar must contain a chat, which in turn is equal to two line rules followed by an end-of-file (EOF) marker. The line must contain a name, the SAYS keyword and a word. Name and word are identical rules, but they have different names because they correspond to different concepts, and they could easily change in a real program.

Visiting the tree

Just like we have seen for Roslyn, ANTLR will automatically create a tree and a base visitor (and/or listener). We can create our own visitor class and change what we need. Let’s see an example.
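A visitor along the lines described in this section might be sketched as follows. It assumes that the grammar is called Speak, so that ANTLR generates SpeakBaseVisitor and SpeakParser, and that the parser has name and word rules. SpeakLine is included only to make the sketch complete, and the namespace of the generated classes depends on your project.

```csharp
using System.Collections.Generic;

// Custom class holding who spoke and what they said
public class SpeakLine
{
    public string Person { get; set; }
    public string Text { get; set; }
}

// SpeakBaseVisitor<T> is generated by ANTLR from the Speak grammar
public class SpeakVisitor : SpeakBaseVisitor<object>
{
    // Collected while visiting the tree, read later by the main program
    public List<SpeakLine> Lines { get; } = new List<SpeakLine>();

    // Called for every node produced by the 'line' parser rule
    public override object VisitLine(SpeakParser.LineContext context)
    {
        SpeakParser.NameContext name = context.name();
        SpeakParser.WordContext word = context.word();

        SpeakLine line = new SpeakLine
        {
            Person = name.GetText(),
            Text = word.GetText()
        };
        Lines.Add(line);

        // Returning the object directly makes the visitor easy to unit test;
        // base.VisitLine(context) would instead keep visiting the children
        return line;
    }
}
```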

The first line shows how to create a class that inherits from the SpeakBaseVisitor class, which is automatically generated by ANTLR. If you need to, you can restrict the type parameter; for a calculator grammar, for instance, you could use something like int or double. SpeakLine is a custom class that contains two properties: Person and Text. The VisitLine method shows how to override the function that visits the specific type of node you want: you just need to use the appropriate type for the context, which contains the information provided by the parser generated by ANTLR. At the end of the method we return the SpeakLine object that we just created. This is unusual, but it is useful for the unit tests that we will create later. Usually you would want to return base.VisitLine(context) so that the visitor can continue its journey across the tree.

This code simply populates a list of SpeakLine objects, each holding the name of a person and the word they have spoken. The Lines property will be used by the main program.

Putting it all together

As you can see, there is nothing particularly complicated. The first part shows how to create the lexer and the parser, and then the tree. The subsequent lines show how to launch the visitor that you have created: you have to get the context for whichever starting rule you use, in our case chat, and then order the visitor to visit the tree from that node.
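The steps just described can be sketched as a small console program. The hard-coded input and the output format are assumptions made for illustration; SpeakLexer and SpeakParser are the classes generated by ANTLR, and SpeakVisitor is the visitor class from the previous section.

```csharp
using System;
using Antlr4.Runtime;

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical sample input: two lines, each ending with a newline
        string text = "john says hello\nmichael says world\n";

        // Characters -> tokens -> parser
        AntlrInputStream inputStream = new AntlrInputStream(text);
        SpeakLexer lexer = new SpeakLexer(inputStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SpeakParser parser = new SpeakParser(tokens);

        // Invoking the starting rule builds the tree and returns its root context
        SpeakParser.ChatContext chatContext = parser.chat();

        // Walk the tree from the chat node
        SpeakVisitor visitor = new SpeakVisitor();
        visitor.Visit(chatContext);

        foreach (SpeakLine line in visitor.Lines)
        {
            Console.WriteLine("{0} has said {1}", line.Person, line.Text);
        }
    }
}
```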

The program itself simply outputs the information contained in the tree. It would be trivial to modify the grammar to allow any number of lines to be added; neither the visitor nor the main program would need to be changed.

Unit testing

Testing is useful in all cases, but it is absolutely crucial when you are creating a grammar, to check that everything is working correctly. If you are creating a grammar for an existing language you probably want to check many working source files, but in any case you want to start by unit testing the single rules. Luckily, since the creation of the Community edition of Visual Studio, there is a free version of Visual Studio that includes a unit testing framework. All you have to do is create a new Test Project, add all the necessary NuGet packages and add a reference to the project assembly you need to test.

There is nothing unexpected in these tests. One observation is that we can create a test to check the single line visitor or we can test the matching of the rule itself. You obviously should do both. You may wonder how the last test works, since we are trying to match a rule with input that doesn’t match, but we still get the correct type of context as a return value and some correct matching values. This happens because ANTLR is quite robust and we are checking only one rule. There are no alternatives, and since the input starts the correct way it is considered a match, although a partial one.
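Tests of this kind could be sketched as follows, assuming the MSTest framework that ships with Visual Studio and the generated SpeakLexer and SpeakParser classes, plus a SpeakVisitor whose VisitLine returns a SpeakLine. The test names and inputs are hypothetical. The last test asserts only what is certain for an invalid line, because the remaining values depend on how ANTLR recovers from the error.

```csharp
using Antlr4.Runtime;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class ParserTest
{
    // Builds a parser for the given text, as the main program does
    private static SpeakParser Setup(string text)
    {
        AntlrInputStream inputStream = new AntlrInputStream(text);
        SpeakLexer lexer = new SpeakLexer(inputStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        return new SpeakParser(tokens);
    }

    [TestMethod]
    public void TestChat()
    {
        SpeakParser parser = Setup("john says hello \n michael says world \n");

        SpeakVisitor visitor = new SpeakVisitor();
        visitor.Visit(parser.chat());

        Assert.AreEqual(2, visitor.Lines.Count);
    }

    [TestMethod]
    public void TestLine()
    {
        SpeakParser parser = Setup("john says hello \n");

        SpeakVisitor visitor = new SpeakVisitor();
        SpeakLine line = (SpeakLine)visitor.VisitLine(parser.line());

        Assert.AreEqual("john", line.Person);
        Assert.AreEqual("hello", line.Text);
    }

    [TestMethod]
    public void TestWrongLine()
    {
        // "sayan" is lexed as a WORD, not as the SAYS keyword
        SpeakParser parser = Setup("john sayan hello \n");

        SpeakParser.LineContext context = parser.line();

        // The rule still returns its context and the part matched before the error
        Assert.IsInstanceOfType(context, typeof(SpeakParser.LineContext));
        Assert.AreEqual("john", context.name().GetText());
    }
}
```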

Conclusions

Integrating an ANTLR grammar in a C# project is quite easy with the provided Visual Studio extension and NuGet packages, making it the best way to quickly create a parser for your DSL. No more piles of fragile RegEx(s), but don’t forget the tests.
