Building a compiler for your own language: from the parse tree to the Abstract Syntax Tree

From the parse tree to the abstract syntax tree

In this post we are going to see how process and transform the information obtained from the parser. The ANTLR parser recognizes the elements present in the source code and build a parse tree. From the parse tree we will obtain the Abstract Syntax Tree which we will use to perform validation and produce compiled code.

Note that the terminology can vary: many would call the tree obtained from ANTLR an Abstract Syntax Tree. I prefer to mark the difference from this two steps. To me the parse tree is the information as meaningful to the parser, the abstract syntax tree is the information reorganized to better support the next steps.

This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified into this book titled
How to create pragmatic, lightweight languages


Code is available on GitHub under the tag 05_ast

Spice up our language

In this series of post we have been working on a very simple language for expressions. It is time to make our language slightly more complex introducing:

  • casts for example: 10 as Decimal or (1 * 2.45) as Int
  • print statement for example: print(3 + 11)

To do so we need to revise our lexer and parser grammar. The syntax highlighting and autocompletion which we have built in previous posts will just keep working.

The new lexer grammar:

And the new parser grammar:

The Abstract Syntax Tree metamodel

The Abstract Syntax Tree metamodel is simply the structure of the data we want to use for our Abstract Syntax Tree (AST). In this case we are defining it by defining the classes which we will use for our AST.

The AST metamodel will look reasonably similar to the parse tree metamodel, i.e., the set of classes generated by ANTLR to contain the nodes.

There will be a few key differences:

  • it will have a simpler and nicer API than the classes generated by ANTLR (so the classes composing the parse tree). In next sections we will see how this API could permit to perform transformations on the AST
  • we will remove elements which are meaningful only while parsing but that logically are useless: for example the parenthesis expression or the line node
  • some nodes for which we have separate instances in the parse tree can correspond to a single instance in the AST. This is the case of the type references Int and Decimal which in the AST are defined using singleton objects
  • we can define common interfaces for related node types like BinaryExpression
  • to define how to parse a variable declaration we reuse the assignement rule. In the AST the two concepts are completely separated
  • certain operations have the same node type in the parse tree but are separated in the AST. This is the case of the different types of binary expressions

Let’s see how we can define our AST metamodel using Kotlin.

We start by defining Node. A Node represents every possible node of an AST and it is general. It could be reused for other languages also. All the rest is instead specific of the language (Sandy on our case). In our specific language we need three important interfaces:

  • Statement
  • Expression
  • Type

Each of these interfaces extends Node.

We then declare the two types we use in our language. They are defined as singleton objects. It means that we have just one instance of these classes.

We then have the BinaryExpression interfacewhich extends Expression. For classes implements it, one for each of the basic arithmetic expressions.

Most of the expressions have as children other nodes. A few have instead simple values. They are VarReference (which has a property varName of type String), and Intlit and DecLit (both have a property value of type String).

Finally we have the three classes implementing Statement.

Note that we are using data classes so we can get for free the hashCode, equals and toString methods. Kotlin generates for us also constructors and getters. Try to imagine how much code that would be in Java.

Mapping the parse tree to the abstract syntax tree

Let’s see how we can get the parse tree, produced by ANTLR, and map it into our AST classes.

To implement this we have taken advantage of three very useful features of Kotlin:

  • extension methods: we added the method toAst to several existing classes
  • the when construct, which is a more powerful version of switch
  • smart casts: after we check that an object has a certain class the compiler implicitly cast it to that type so that we can use the specific methods of that class

We could come up with a mechanism to derive automatically this mapping for most of the rules and just customize it where the parse tree and the AST differs. To avoid using too much reflection black magic we are not going to do that for now. If I were using Java I would just go for the reflection road to avoid having to write manually a lot of redundant and boring code. However using Kotlin this code is compact and clear.

Testing the mapping

Of course we need to test this stuff. Let’s see if the AST we get for a certain piece of code is the one we expect.

This would be all nice: we have a clean model of the information present in the code. The metamodel and the mapping code looks very simple and clear. However we would need to add a little detail: the position of the nodes in the source code. This would be needed while showing errors to the user. We want to have the possibility to specify the positions of our AST nodes but we do not want to be forced to do so. In this way depending on the operations we need to do we can ignore or not the positions. Consider the tests we have written so far: wouldn’t be cumbersome and annoying having to specify fake positions for all the nodes? I think so.

This is the new Node definition and a few supporting class:

We need also to add position as an optional parameter to all the classes. It would have the default value null. For example this is how SandyFile looks now:

The parse tree contains the information organized in the most convenient way for the parser. It is typically not the most convenient way for the steps which follow. Think about the variable declaration rule being implemented by reusing the assignment rule: sure, this make the grammar shorter and it makes sense for the parse tree. However from the logical point of view the two elements are separated, and in the AST they are indeed.

Most of the rest of our tools will operate on the AST so it is better to spend some time working on an AST that makes sense.

Download the guide with 68 resources on Creating Programming Languages


Receive the guide to your inbox to read it on all your devices when you have time

Powered by ConvertKit
1 reply

Trackbacks & Pingbacks

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply