Turin Programming Language for the JVM: Building Advanced Lexers with ANTLR

Turin Programming Language for the JVM: building advanced lexers with ANTLR

As I wrote in my last post, I recently started working on a new programming language named Turin. A working compiler for an initial version of the language is available on GitHub. I am currently improving the language and working on a Maven and an IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the Compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax to a representation easier to process. For example we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
  3. Perform semantic checks. For example we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequences of tokens, while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why Using ANTLR?

ANTLR is a very mature tool for writing lexer and parsers. It can generate code for several languages and has decent performance. It is well mantained and I was sure it had all the features I could possible need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes possible to write simple grammars because it solves left recursive definition for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we will look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have ANTLR while building a framework for Model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason for looking into alternatives.

The Current Lexer Grammar

This is the current version of the lexer grammar.

A few choices I have made:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This permits to have less ambiguity in the grammar because values and types can be easily distinguished. In Java instead when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parenthesis) or a cast to the type foo. We need to look at what follows to understand it. In my opinion this is rather stupid because in practice everyone is using capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage from it
  • newlines are relevant in Turin, so we have tokens for them we basically want to have statements terminated by newlines but we accept optional newlines after commas
  • whitespaces (but newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but we can retrieve them when needed. For example we need them for syntax highlighting and in general for the IntelliJ plugin because it requires to define tokens for each single character in the source file, without gaps
  • the most tricky part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
  • I had to collapse the relational operators in one single token type, so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation need to be executed. Nothing too awful but you have to know how to fix these kinds of issues.

Testing the Lexer

I wrote a bunch of tests specific for the lexer. In particular I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

As you can see I just test the token on a string and verify it produces  the correct list of tokens. Easy and straight to the point.

Conclusions

My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators in a single token type is not nice. Having to repeat several token definitions for different lexer modes is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not elegant as desired but there is one. So I can use it and move on on more interesting parts of the compiler.

The ANTLR Mega Tutorial as a PDF

Antlr mega tutorial

Get the Mega Tutorial delivered to your email and read it when you want on the device you want

Powered by ConvertKit
2 replies
  1. Jeshan Babooa
    Jeshan Babooa says:

    Good overview of the process; I like it.

    we do not want to sum characters, right?
    Well, not if you’re implementing a language like Java! 😀
    The following is a valid Java statement (I won’t blame you if can’t guess what it evaluates to; feel free to try it!):

    char c = '4' + 'A';

    (see also this SO answer)

    This kind of test code:

    String code = "Void toString() = \"foo\"";
    verify(code, TurinLexer.VOID_KW, TurinLexer.VALUE_ID, TurinLexer.LPAREN, TurinLexer.RPAREN, TurinLexer.ASSIGNMENT, TurinLexer.STRING_START, TurinLexer.STRING_CONTENT, TurinLexer.STRING_STOP);

    Is that the ANTLR way of testing? It’s not that bad but it looks a bit cumbersome!

  2. Federico Tomassetti
    Federico Tomassetti says:

    Hi Jeshan!
    The sum of characters really scared me but using multiplication on characters scared me even more. I definitely do not want that to be legal in my language 🙂

    Well, I do not think there is an official or suggested way to test ANTLR grammars so I come up with this one. There is definitely space for improvements but for now it protects me from regressions

Comments are closed.