Getting Started With ANTLR in C#

The code for this article is on GitHub: getting-started-with-antlr-in-csharp

Readers of this website will know that ANTLR is a great tool to quickly create parsers and help you in working with a known language or creating your DSL. While the tool itself is written in Java, it can also be used to generate parsers in several other languages, for instance Python, C# or JavaScript (with more languages supported by newer releases).

If you want to use C#, there are two options: one is the official version of ANTLR, the other is the special C#-optimized version of ANTLR by Sam Harwell. There are two options because in the past the official ANTLR tool did not include the ability to generate C#, so you had to use the second option.

Previous versions of this tutorial used the second option because it offered better integration with Visual Studio. However, in the years since the initial publication, best practices have changed. Now the best option for ANTLR development with C# relies on the standard official version and Visual Studio Code. The standard version is updated more promptly, and the ANTLR extension for Visual Studio Code provides more features. In addition, Visual Studio Code works cross-platform and is more popular. There is really no standard for development tools, but Visual Studio Code is the closest thing there is to one.

This is why we updated the tutorial to use the standard ANTLR C# runtime and Visual Studio Code. You can still get the old version of the code in the commit tagged as visual-studio in the repository. Switching between the two options is simple, but not without issues: the generated C# parsers are not compatible with each other and there are a few differences in the API. You can look at the documentation of the C#-optimized version on its official page.

You may still want to opt for the C#-optimized version if you care more about older versions of Visual Studio and Windows. However, for most readers, the new setup is to be preferred.

Setup

The first step is to install the ANTLR grammar syntax support extension for Visual Studio Code. This lets you easily integrate ANTLR into your workflow by automatically generating the parser and, optionally, the listener and visitor from your grammar. The extension also has many useful features to understand and debug your ANTLR grammar, such as visualizations, code completion and formatting.

Then, for each one of your projects, you must add the NuGet package Antlr4.Runtime.Standard. Keep in mind that the extension also comes with its own embedded ANTLR command line tool. This is both good and bad. You do not need to have ANTLR installed on your system to use it. However, the embedded tool might not be the latest ANTLR release, so you must make sure that the runtime package you use matches the version of the tool included in the extension. Alternatively, you can configure the extension to use an external ANTLR command line tool.

For example, at the time of writing of this article the latest runtime is on version 4.13.0, while the extension embeds version 4.9. This is not an issue for our example, but you might want to change it for your use case.

If you want to manage options and, for instance, disable the visitor or listener generation, you can configure them in Visual Studio Code, as you would for any other extension option.

These are the values we are going to set for this project in the settings.json file (or wherever you prefer to write your settings).

"antlr4.generation": {
    "mode": "external",
    "language": "CSharp",
    "listeners": false,
    "visitors": true
}

We set the mode to external so that the extension generates the parser files for use by the whole project. The default value is internal, in which the extension generates the files only for its own use (e.g., to draw the diagrams you can see in the extension). We also disable the generation of the listener (which is enabled by default) and enable the generation of the visitor instead.

For the rest of this tutorial we assume you are going to take this route and just install the Visual Studio Code extension, which also gives you the ANTLR command line tool. This is the easiest way to get started with ANTLR in C#. If you prefer to install the ANTLR command line tool separately, we suggest reading the setup section of our ANTLR Mega Tutorial.

Creating the Project

You start by creating a standard dotnet project. At the time of writing of this article, the latest version is .NET 6.0.1 and these are the commands you can use to create a solution with a console project and a test project.

# this creates a new Solution
dotnet new sln
# these commands create a console project and an MSTest project
dotnet new console -o AntlrCSharp
dotnet new mstest -o AntlrCSharpTests
# these commands add the newly created projects to the solution
# (forward slashes work on Windows, Linux and macOS)
dotnet sln add AntlrCSharp/AntlrCSharp.csproj
dotnet sln add AntlrCSharpTests/AntlrCSharpTests.csproj

Then we need to add the ANTLR4 runtime to the main console project and a reference to the main console project in the test project.

# installing the ANTLR4 runtime package in the main console project
dotnet add AntlrCSharp package Antlr4.Runtime.Standard
# adding a reference to the main project in the test one
dotnet add AntlrCSharpTests reference AntlrCSharp

Creating the Grammar

We are going to create a grammar that parses two lines of text that represent a chat between two people. This could be the basis for a chat program or for a game in which whoever says the shortest word gets beaten up with a thesaurus. What the program does is not relevant for the grammar itself, because the grammar handles only the recognition of the various elements of the input. What you choose to do with these elements is managed through normal code.

Add a new file called Speak.g4 and insert the following text.

grammar Speak;

/*
 * Parser Rules
 */

chat                : line line EOF ;
line                : name SAYS opinion NEWLINE;
name                : WORD ;
opinion             : TEXT ;

/*
 * Lexer Rules
 */

fragment A          : ('A'|'a') ;
fragment S          : ('S'|'s') ;
fragment Y          : ('Y'|'y') ;

fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;

SAYS                : S A Y S ;
WORD                : (LOWERCASE | UPPERCASE)+ ;
TEXT                : '"' .*? '"' ;
WHITESPACE          : (' '|'\t')+ -> skip ;
NEWLINE             : ('\r'? '\n' | '\r')+ ;

While you may create separate lexer and parser grammars, for a simple project you will want to use a combined grammar and put the parser rules before the lexer rules. It is important to put the more specific tokens first and the generic ones, like WORD or ID, later. That is because ANTLR picks the token that matches the longest input and, when several tokens match input of the same length, the one defined first. In other words, if two tokens can match the same text, ANTLR picks the first one; if a token defined later matches more text, it picks that one.

In this example, if we had inverted SAYS and WORD, SAYS would have been hidden by WORD. Another thing to notice is that you cannot use fragments outside of lexer rules. That is because fragments are not proper lexer rules: they are just syntactic sugar, shortcuts that help you avoid repetition in defining lexer rules.
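The tie-breaking described above can be imitated with a few lines of plain C#. This is only an illustrative sketch built on regular expressions, not how ANTLR implements its lexer; the names TokenPriorityDemo, SaysFirst, WordFirst and Classify are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class TokenPriorityDemo
{
    // Hypothetical rule tables. This one follows Speak.g4: SAYS before WORD.
    public static readonly (string Name, string Pattern)[] SaysFirst =
    {
        ("SAYS", "^[Ss][Aa][Yy][Ss]"),
        ("WORD", "^[A-Za-z]+"),
    };

    // The inverted order discussed in the text: WORD before SAYS.
    public static readonly (string Name, string Pattern)[] WordFirst =
    {
        ("WORD", "^[A-Za-z]+"),
        ("SAYS", "^[Ss][Aa][Yy][Ss]"),
    };

    // Returns the name of the rule that would win at the start of the input.
    public static string Classify(string input, (string Name, string Pattern)[] rules)
    {
        string best = "NONE";
        int bestLength = 0;
        foreach (var (name, pattern) in rules)
        {
            Match match = Regex.Match(input, pattern);
            // A strictly longer match wins; on a tie the earlier rule is kept.
            if (match.Success && match.Length > bestLength)
            {
                best = name;
                bestLength = match.Length;
            }
        }
        return best;
    }
}
```

With SaysFirst, Classify("says", ...) returns SAYS because both rules match four characters and SAYS is defined first, while Classify("saysomething", ...) returns WORD because WORD matches more text. With WordFirst, even "says" is classified as WORD, which is the hiding effect mentioned above.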

Having said that, the lexer part is pretty straightforward:

we identify a SAYS, which can be written in any mix of uppercase and lowercase;
a WORD, which can be composed of any uppercase or lowercase letters;
a TEXT, which includes everything between two double quote (") marks;
a NEWLINE.

Any text that is WHITESPACE, that is spaces and tabs, is simply ignored. While this is clearly a simple case, individual lexer rules will rarely be much more complicated than this. The complexity in defining lexer rules lies more in the overall design of the set of rules than in writing difficult individual rules.

The only slightly complicated rule is the one for TEXT. The beginning and end symbols (i.e., the double quotes) are clear. The stuff in the middle is composed of two parts: the dot (.) and the *? operator. The dot means that any single character is allowed, while *? is a non-greedy repetition: it matches as few characters as possible and stops as soon as whatever is on its right can match. In this case the closing double quote is on the right, so the end result is that everything between a pair of double quotes is included in TEXT.
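The difference between a greedy and a non-greedy repetition can be seen with .NET regular expressions, which use the same *? notation. This is a small sketch for illustration only: the class name NonGreedyDemo and the sample input are made up, and .NET's dot does not match a newline by default, while the ANTLR dot matches any character.

```csharp
using System.Text.RegularExpressions;

public static class NonGreedyDemo
{
    // Sample input with two quoted strings: "hello" and "world"
    public const string Input = "\"hello\" and \"world\"";

    // Non-greedy, like the TEXT rule: stops at the first closing quote.
    public static string Lazy() => Regex.Match(Input, "\".*?\"").Value;

    // Greedy: keeps going up to the last closing quote in the input.
    public static string Greedy() => Regex.Match(Input, "\".*\"").Value;
}
```

Lazy() returns only "hello" (with its quotes), whereas Greedy() returns the whole input, from the first quote to the last one. The non-greedy form is what lets TEXT match one quoted string at a time.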

Usually the worst thing that could happen is having to use semantic predicates. These are essentially statements that evaluate to true or false; when they evaluate to false, they disable the alternative they guard. For instance, you may want to use a / as the beginning of a comment, but only if it is the first character of a line; otherwise it should be considered an arithmetic operator.

The parser is usually where things get more complicated, although that is not the case this time. Every document given to the Speak grammar must contain a chat, which in turn is equal to two line rules followed by an End Of File marker. A line must contain a name, the SAYS keyword, an opinion and a NEWLINE. Name and opinion are similar rules, in the sense that they both correspond to individual lexer rules. However, they have different names because they correspond to different concepts, and they could easily change in a real program. For example, you may want the concept of a name to change into a username that could contain underscores (_).

You can confirm that your setup works by saving the file. If everything is set up correctly, a few generated files will appear, as in the following image.

Visiting the Tree

ANTLR will automatically create a tree and a base visitor (and/or listener). We can then create our own visitor class, which inherits from this base class, and change what we need. Let’s see an example.

Confusingly, ANTLR generates a file named SpeakVisitor.cs that contains the interface ISpeakVisitor<Result>. So we could create a custom visitor class named SpeakVisitor, but we would have to save it in a file with a different name. In our example, to keep things simple, we choose the name BasicSpeakVisitor for the class and BasicSpeakVisitor.cs for the file. You may choose differently.

public class BasicSpeakVisitor : SpeakBaseVisitor<object>
{
    public List<SpeakLine> Lines = new List<SpeakLine>();

    public override object VisitLine(SpeakParser.LineContext context)
    {            
        SpeakParser.NameContext name = context.name();
        SpeakParser.OpinionContext opinion = context.opinion();

        SpeakLine line = new SpeakLine() { Person = name.GetText(), Text = opinion.GetText().Trim('"') };
        Lines.Add(line);

        return line;
    }
}

The first line shows how to create a class that inherits from the SpeakBaseVisitor class, which is automatically generated by ANTLR. Here the type parameter is object; if you need to, you can use a more specific type. For instance, for a calculator grammar you could use something like int or double.

SpeakLine (not shown) is a custom class that contains two properties: Person and Text. The VisitLine method (line 5 of the listing) shows how to override the function that visits the specific type of node you want: you just need to use the appropriate context type, which contains the information provided by the parser generated by ANTLR.
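Since the article does not show SpeakLine, here is a minimal sketch of what such a class might look like. Only the class name and the two property names come from the text; the string types and the empty-string defaults are assumptions.

```csharp
// A hypothetical SpeakLine: a plain data holder for one line of the chat.
public class SpeakLine
{
    // Who spoke, taken from the name rule.
    public string Person { get; set; } = string.Empty;

    // What was said, taken from the opinion rule without the surrounding quotes.
    public string Text { get; set; } = string.Empty;
}
```

The default values just keep the compiler quiet when nullable reference types are enabled; the visitor always sets both properties through the object initializer.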

At line 13 we return the SpeakLine object that we just created. This is unusual, but it is useful for the tests that we will create later. Usually you would return base.VisitLine(context) so that the visitor continues its journey across the tree.

This code simply populates a list of SpeakLine objects, each of which holds the name of a person and the opinion they have spoken. The Lines list will be used by the main program.

Putting It All Together

private static void Main(string[] args)
{
    try
    {
        string input = "";
        StringBuilder text = new StringBuilder();
        Console.WriteLine("Input the chat.");
        
        // to type the EOF character and end the input: use CTRL+D, then press <enter>
        while ((input = Console.ReadLine()) != null && input != "\u0004")
        {
            text.AppendLine(input);
        }
        
        AntlrInputStream inputStream = new AntlrInputStream(text.ToString());
        SpeakLexer speakLexer = new SpeakLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
        SpeakParser speakParser = new SpeakParser(commonTokenStream);

        SpeakParser.ChatContext chatContext = speakParser.chat();
        BasicSpeakVisitor visitor = new BasicSpeakVisitor();        
        visitor.Visit(chatContext);

        foreach(var line in visitor.Lines)
        {
            Console.WriteLine("{0} has said {1}", line.Person, line.Text);
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine("Error: " + ex);                
    }
}

As you can see, there is nothing particularly complicated. Lines 15-18 show how to create the input stream, the lexer, the token stream and the parser; calling the method for the starting rule then creates the tree. The subsequent lines show how to launch the visitor that you have created: you get the context for whichever starting rule you use, in our case chat, and tell the visitor to visit the tree from that node.

The program itself simply outputs the information contained in the tree. It would be trivial to modify the grammar to allow any number of lines (for instance, by changing the chat rule to line+ EOF): neither the visitor nor the main program would need to be changed.

You can check that the program works as expected by running it in the usual way, with the following command.

dotnet run --project AntlrCSharp

Unit Testing

Testing is useful in all cases, but it is absolutely crucial when you are creating a grammar, to check that everything is working correctly. If you are creating a grammar for an existing language, you probably want to check it against many working source files. In any case, you want to start by unit testing the individual rules.

Luckily, since the creation of the Community edition of Visual Studio, there is a free version of Visual Studio that includes a unit testing framework. All you have to do is create a new test project, add all the necessary NuGet packages and add a reference to the project you need to test. We did all of that from the command line in the Creating the Project section, so we can just write the code.

[TestClass]
public class ParserTest
{
    private SpeakParser Setup(string text)
    {
        AntlrInputStream inputStream = new AntlrInputStream(text);
        SpeakLexer speakLexer = new SpeakLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
        SpeakParser speakParser = new SpeakParser(commonTokenStream);

        return speakParser;   
    }

    [TestMethod]
    public void TestChat()
    {
        SpeakParser parser = Setup("john says \"hello\" \n michael says \"world\" \n");

        SpeakParser.ChatContext context = parser.chat();
        BasicSpeakVisitor visitor = new BasicSpeakVisitor();
        visitor.Visit(context);

        Assert.AreEqual(2, visitor.Lines.Count);
    }
    
    [TestMethod]
    public void TestLine()
    {
        SpeakParser parser = Setup("john says \"hello\" \n");

        SpeakParser.LineContext context = parser.line();
        BasicSpeakVisitor visitor = new BasicSpeakVisitor();
        SpeakLine line = (SpeakLine) visitor.VisitLine(context);            
        
        Assert.AreEqual("john", line.Person);
        Assert.AreEqual("hello", line.Text);
    }

    [TestMethod]
    public void TestWrongLine()
    {
        SpeakParser parser = Setup("john sayan \"hello\" \n");

        var context = parser.line();
        
        Assert.IsInstanceOfType(context, typeof(SpeakParser.LineContext));
        Assert.AreEqual("john", context.name().GetText());
        Assert.IsNull(context.SAYS());
        Assert.AreEqual("johnsayan\"hello\"\n", context.GetText());
    }      
}

There is nothing unexpected in these tests. One observation is that we can create a test to check the single line visitor or we can test the matching of the rule itself. You obviously should do both.

You may wonder how the last test works, since we are trying to match a rule that does not match, yet we still get the correct type of context as a return value and some correct matching values. This happens because ANTLR is quite robust and we are checking only one rule. There are no alternatives and, since the input starts the correct way, it is considered a match, although a partial one.

You can see that the tests pass with flying colors (actually in green) by running the usual command.

dotnet test

Summary

Integrating an ANTLR grammar in a C# project is quite easy with the available Visual Studio Code extension and NuGet package. This makes it the best way to quickly create a parser for your DSL with ANTLR in C#. Finally, you can easily use a better alternative to piles of fragile RegEx(s), but do not forget to implement testing.

Originally written in December 2016 – Last revision and update in July 2023
