Building and testing a parser with ANTLR and Kotlin


This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it into a book titled How to create pragmatic, lightweight languages.

Code

Code is available on GitHub. The code described in this post is associated with the tag 02_parser.

A change to the lexer

Compared with the lexer we saw in the first article, we need to make one minor change: we still want to recognize whitespace, but we do not want to deal with it in our parser rules. So we will instruct the lexer to throw all the whitespace tokens away instead of passing them to the parser. To do that we need to change one line in SandyLexer.g4, so that the whitespace rule uses ANTLR's skip command:
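
A minimal sketch of that change, assuming the whitespace rule in the lexer is named WS and matches runs of tabs and spaces (both are assumptions, so check them against the rule in your own SandyLexer.g4):

```antlr
// Assumed rule name and character set. The important part is "-> skip",
// which makes the lexer discard the matched text instead of emitting a token.
WS : [\t ]+ -> skip ;
```

Newlines are deliberately not part of this rule, because the parser relies on the NEWLINE token to terminate lines.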

Thanks to Joseph for reporting this one!

The parser

The parser is simply defined as an ANTLR grammar. We have previously built a separate lexer. Here we reuse those terminals (NEWLINE, VAR, ID, etc.) to build rules such as statement, assignment, etc.

Here is our new ANTLR grammar:

  • we reuse the existing lexer (tokenVocab=SandyLexer)
  • we start by defining the rule representing the whole file: sandyFile. It is defined as a list of at least one line
  • each line is composed of a statement terminated either by a newline or the end of the file
  • a statement can be a varDeclaration or an assignment
  • an expression can be defined in many different ways. The order of the alternatives is important because it determines operator precedence: multiplication is listed before addition, so it binds more tightly
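
The points above can be sketched as the following parser grammar. The rule names sandyFile, line, statement, varDeclaration, assignment and expression, the tokenVocab option, and the tokens NEWLINE, VAR and ID come from the text. The grammar name SandyParser, the remaining token names (ASSIGN, PLUS, MINUS, ASTERISK, DIVISION, LPAREN, RPAREN, INTLIT, DECLIT) and the alternative labels after each # are assumptions, so adapt them to what your lexer actually defines:

```antlr
parser grammar SandyParser;

// Reuse the token types defined by the lexer grammar.
options { tokenVocab=SandyLexer; }

// The whole file: one or more lines.
sandyFile : lines=line+ ;

// A statement terminated by a newline or by the end of the file.
line : statement (NEWLINE | EOF) ;

statement : varDeclaration # varDeclarationStatement
          | assignment     # assignmentStatement
          ;

varDeclaration : VAR assignment ;

assignment : ID ASSIGN expression ;

// Earlier alternatives bind more tightly: * and / are listed before + and -.
expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
           | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
           | LPAREN expression RPAREN                                      # parenExpression
           | ID                                                            # varReference
           | MINUS expression                                              # minusExpression
           | INTLIT                                                        # intLiteral
           | DECLIT                                                        # decimalLiteral
           ;
```

Labelling the alternatives makes ANTLR generate a distinct context class for each one (for example a BinaryOperationContext), which is convenient later when inspecting the tree.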

To build it we simply run ./gradlew generateGrammarSource. Please refer to the build.gradle file in the repository or take a look at the previous post of the series.

Testing

OK, we have defined our parser; now we need to test it. In general, I think we need to test a parser in three ways:

  • Verify that all the code we need to parse is parsed without errors
  • Ensure that code containing errors is not parsed
  • Verify that the shape of the resulting AST is the one we expect

In practice the first point is the one on which I tend to insist the most. If you are building a parser for an existing language the best way to test your parser is to try parsing as much code as you can, verifying that all the errors found correspond to actual errors in the original code, and not errors in the parser. Typically I iterate over this step multiple times to complete my grammars.
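
A rough sketch of how such a bulk check could be automated in Kotlin. Nothing here is taken from the repository: the function name, the "examples" directory and the countSyntaxErrors hook are all hypothetical, and the hook stands in for whatever code runs your parser on a piece of text and returns the number of errors it reported.

```kotlin
import java.io.File

// Runs a parser over every file with the given extension under `root` and
// returns only the files for which at least one syntax error was reported.
// `countSyntaxErrors` is a hypothetical hook: it receives the file content
// and returns the number of errors the parser found in it.
fun filesWithSyntaxErrors(
    root: File,
    extension: String,
    countSyntaxErrors: (String) -> Int
): Map<File, Int> =
    root.walkTopDown()
        .filter { it.isFile && it.extension == extension }
        .associateWith { countSyntaxErrors(it.readText()) }
        .filterValues { it > 0 }

fun main() {
    // Stand-in for a real parser: treats any file containing "??" as broken.
    val fakeParser = { code: String -> if ("??" in code) 1 else 0 }
    val report = filesWithSyntaxErrors(File("examples"), "sandy", fakeParser)
    report.forEach { (file, errors) -> println("${file.path}: $errors syntax error(s)") }
}
```

With ANTLR, the hook would typically build the lexer and the parser, invoke the start rule, and return the parser's numberOfSyntaxErrors. Each file in the report then has to be inspected by hand to decide whether the error is in the code or in the grammar.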

The second and third points are refinements on which I work once I am sure my grammar can recognize everything.

In this simple case, we will write simple test cases to cover the first and the third point: we will verify that some examples are parsed and we will verify that the AST produced is the one we want.

It is a bit cumbersome to verify that the AST produced is the one you want. There are different ways to do that, but in this case I chose to generate a string representation of the AST and verify that it is the same as the expected one. It is an indirect way of testing that the AST is the one I want, but it is much easier for simple cases like this one.

This is how we produce a string representation of the AST:
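
A minimal sketch of one way to do this, assuming the ANTLR 4 runtime is on the classpath. The class and function names (ParseTreeElement, ParseTreeLeaf, ParseTreeNode, toParseTree, multiLineString) are illustrative and not necessarily those used in the repository. Each rule context becomes a node named after its class, each token becomes a leaf, and the tree is printed with one element per line, indented by depth:

```kotlin
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.tree.TerminalNode

// A small tree model, independent of ANTLR, that is easy to print and compare.
sealed class ParseTreeElement {
    abstract fun multiLineString(indentation: String = ""): String
}

// A token: printed as T[text].
class ParseTreeLeaf(val text: String) : ParseTreeElement() {
    override fun multiLineString(indentation: String): String = "${indentation}T[$text]\n"
}

// A rule invocation: printed as its name, followed by its children indented by two spaces.
class ParseTreeNode(val name: String) : ParseTreeElement() {
    val children = mutableListOf<ParseTreeElement>()

    fun child(c: ParseTreeElement): ParseTreeNode {
        children.add(c)
        return this
    }

    override fun multiLineString(indentation: String): String {
        val sb = StringBuilder("$indentation$name\n")
        children.forEach { sb.append(it.multiLineString("$indentation  ")) }
        return sb.toString()
    }
}

// Converts an ANTLR rule context into the model above. The node name is the
// generated class name without its "Context" suffix (SandyFileContext -> SandyFile).
fun toParseTree(node: ParserRuleContext): ParseTreeNode {
    val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
    // `children` is null for a context that matched nothing.
    for (c in node.children.orEmpty()) {
        when (c) {
            is ParserRuleContext -> res.child(toParseTree(c))
            is TerminalNode -> res.child(ParseTreeLeaf(c.text))
        }
    }
    return res
}
```

Calling toParseTree(root).multiLineString() on the context returned by the start rule then yields a plain string that can be compared with an expected value in a test.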

And these are some test cases:
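
A sketch of what one such test could look like, written with kotlin.test. It relies on several assumptions: SandyLexer and SandyParser are the classes generated by ANTLR (their package import is omitted), toParseTree and multiLineString are hypothetical names for the string-rendering helper described above, and the expected text presumes that the grammar labels its alternatives assignmentStatement, binaryOperation and intLiteral.

```kotlin
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream
import kotlin.test.Test
import kotlin.test.assertEquals

// Assumptions: SandyLexer and SandyParser are generated by ANTLR, and
// toParseTree(...).multiLineString() is a helper (hypothetical names) that
// renders the tree as indented text, one element per line.
class SandyParserTest {

    // Parses a snippet and fails the test if the parser reported syntax errors.
    private fun parse(code: String): SandyParser.SandyFileContext {
        val parser = SandyParser(CommonTokenStream(SandyLexer(CharStreams.fromString(code))))
        val root = parser.sandyFile()
        assertEquals(0, parser.numberOfSyntaxErrors, "unexpected syntax errors in: $code")
        return root
    }

    @Test
    fun parseAdditionAssignment() {
        // Expected shape for "a = 1 + 2"; node names come from the assumed grammar labels.
        val expected = """
            SandyFile
              Line
                AssignmentStatement
                  Assignment
                    T[a]
                    T[=]
                    BinaryOperation
                      IntLiteral
                        T[1]
                      T[+]
                      IntLiteral
                        T[2]
                T[<EOF>]
        """.trimIndent() + "\n"
        assertEquals(expected, toParseTree(parse("a = 1 + 2")).multiLineString())
    }
}
```

The parse helper covers the first kind of check (the example is parsed without errors), while the comparison of the two strings covers the third (the tree has the expected shape).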

Simple, isn’t it?

Conclusions

We have seen how to build a simple lexer and a simple parser. Many tutorials stop there. We are instead going to move on and build more tools from our lexer and parser. We have laid the foundations; now we have to move on to the rest of the infrastructure. Things will start to get interesting.

In the next post we will see how to build an editor with syntax highlighting for our language.

9 replies
  1. Joseph Verron says:

    i’ve noticed an error in your tutorial. As long as the whitespace token is not sent to another channel, the unit tests for the parser will fail. best regards. thanks for that series.

  2. Federico Tomassetti says:

    Hi Joseph, thank you so much for your comment.

    You are right: I changed the lexer to skip the whitespace token but I did not report that in the article.

    I normally first write the code and then the article and sometimes I miss some changes I have done. I double checked the code on GitHub (tag 02_parser) and all tests pass there.

    I will correct the article. Thank you and please keep sharing your feedback!
