The science, tools, strategies and patterns behind the creation and processing of languages

68 Resources To Help You To Create Programming Languages

Here is a new guide that collects and organizes all the knowledge you need to create your programming language from scratch.

Creating a programming language is one of the most fascinating challenges you can dream of as a developer.

The problem is that there are a lot of moving parts, a lot of things to get right, and it is difficult to find a well-detailed map to show you the way. Sure, you can find a tutorial on writing half a parser here, a half-baked list of advice on language design there, an example of a naive interpreter. But to find those things you will need to spend hours navigating forums and following links.

We thought we could save you some time by collecting relevant resources, evaluating them and organizing them, so you can spend your time using good resources rather than looking for them.

We organized the resources around the three stages in the creation of a programming language: design, parsing, and execution.


Section 1:

Designing the Language

When creating a programming language you need to take ideas and transform them into decisions. This is what you do during the design phase.

Before You Start…

Some good resources to beef up your knowledge of language design.



  • Design Concepts in Programming Languages, if you want to make deliberate choices in the creation of your programming language, this is the book you need. Otherwise, if you don’t already have the necessary theoretical background, you risk doing things the way everybody else does them. It’s also useful to develop a general framework to understand how the different programming languages behave and why.
  • Practical Foundations for Programming Languages, this is, for the most part, a book about studying and classifying programming languages. But by understanding the different options available it can also be used to guide the implementation of your programming language.
  • Programming Language Pragmatics, 4th Edition, this is the most comprehensive book for understanding contemporary programming languages. It discusses different aspects of everything from C# to OCaml, and even the different kinds of programming languages, such as functional and logical ones. It also covers the several steps and parts of the implementation, such as intermediate languages, linking, virtual machines, etc.
  • Structure and Interpretation of Computer Programs, Second Edition, an introduction to computer science for people that already have a degree in it. A book widely praised by programmers, including Paul Graham directly on the Amazon page, that helps you develop a new way of thinking about programming languages. It's quite abstract and the examples are given in Scheme. It also covers many different aspects of programming languages, including advanced topics like garbage collection.

Type Systems

Long discussions and infinite disputes are fought around type systems. Whatever choices you end up making, it makes sense to know the different positions.


  • These are two good introductory articles on the subject of type systems. The first discusses the static/dynamic dichotomy, while the second dives into introspection.
  • What To Know Before Debating Type Systems, if you already know the basics of type systems this article is for you. It will let you understand them better by going into definitions and details.
  • Type Systems (PDF), a paper on the formalization of Type Systems that also introduces more precise definitions of the different type systems.


  • Types and Programming Languages, a comprehensive book on understanding type systems. It will improve your ability to design programming languages and compilers. It has strong theoretical underpinnings, but it also explains the practical importance of individual concepts.
  • Functional programming and type systems, an interesting university course on type systems for functional programming. It is used in a well known French university. There are also notes and presentation material available. It is as advanced as you would expect.
  • Type Systems for Programming Languages is a simpler course on type systems for (functional) programming languages.

Section 2:

Parsing

Parsing transforms the concrete syntax into a form that is more easily managed by computers. This usually means transforming text written by humans into a more useful representation of the source code, an Abstract Syntax Tree.

An abstract syntax tree for the Euclidean algorithm

There are usually two components in parsing: a lexical analyzer and the parser proper. Lexers, which are also known as tokenizers or scanners, transform the individual characters into tokens, the atoms of meaning. Parsers then organize the tokens into the proper Abstract Syntax Tree for the program. Since they are usually meant to work together, you may use a single tool that performs both tasks.
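To make the two roles concrete, here is a minimal hand-written lexer sketch in TypeScript (a hypothetical illustration, not tied to any of the tools below): it turns source text into the tokens that a parser would then assemble into a tree.

```typescript
// A tiny hand-written lexer: it recognizes identifiers, integers and a few symbols.
// Unknown characters simply stop the scan, which is enough for a sketch.
type Token = { kind: "ident" | "number" | "symbol"; text: string };

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  // Each alternative captures one token kind; leading whitespace is skipped.
  const pattern = /\s*(?:([A-Za-z_]\w*)|(\d+)|([=+\-*/()]))/y;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (match[1]) tokens.push({ kind: "ident", text: match[1] });
    else if (match[2]) tokens.push({ kind: "number", text: match[2] });
    else tokens.push({ kind: "symbol", text: match[3] });
  }
  return tokens;
}
```

Running `tokenize("x = 42")` yields an identifier, a symbol and a number; a parser would then group such token sequences into tree nodes.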

Tools

  • Flex, as a lexer generator, and (Berkeley) Yacc or Bison, for the generation of the parser proper, are the venerable choices for generating a complete parser. They are a few decades old and they are still maintained as open source software. They are written in, and designed for, C/C++. They still work, but they have limitations in features and in support for other languages.
  • ANTLR is both a lexer and a parser generator. It's also more actively developed as open source software. It is written in Java, but it can generate both the lexer and the parser in languages as varied as C#, C++, Python, JavaScript, Go, etc.
  • Your own lexer and parser. If you need the best performance, you can create your own parser by hand. You just need to have the necessary computer science knowledge.

Articles & Tutorials

  • Flex and Bison tutorial, a good introduction to the two tools with bonus tips.
  • Lex and Yacc Tutorial, at 40 pages this is the ideal starting point to learn how to put together lex and yacc in a few hours.
  • Video Tutorial on lex/yacc in two parts, in an hour of video you can learn the basics of using lex and yacc.
  • ANTLR Mega Tutorial, the renowned and beloved tutorial that explains everything you need to know about ANTLR, with bonus tips and tricks and even resources to learn more.

Books

  • lex & yacc, despite being a book written in 1992 it's still the most recommended book on the subject. Some say that is because of the lack of competition, others because it is good enough.
  • flex & bison: Text Processing Tools, the best book on the subject written in this millennium.
  • The Definitive ANTLR 4 Reference, written by the main author of the tool this is really the definitive book on ANTLR 4. It explains all of its secrets and it’s also a good introduction about how the whole parsing thing works.
  • Parsing Techniques, 2nd edition, a comprehensive, advanced and costly book to know more than you possibly need about parsing.

Section 3:

Execution

To implement your programming language, that is to say, to actually make something happen, you can build one of two things: a compiler or an interpreter. You could also build both if you want. Here you can find a good overview if you need one: Compiled and Interpreted Languages.

The resources here are dedicated to explaining how compilers and/or interpreters are built, but for practical reasons they often also explain the basics of creating lexers and parsers.

Compilers

A compiler transforms the original code into something else, usually machine code, but it could also simply be a lower level language, such as C. In the latter case some people prefer to use the term transpiler.

Tools

  • LLVM, a collection of modular and reusable compiler and toolchain technologies that can be used to create compilers.
  • CLR, the virtual machine of the .NET platform, which executes different languages after they have been transformed into a common intermediate language.
  • JVM, the Java Virtual Machine that powers the execution of Java programs.
  • Babel, a JavaScript compiler. It has a very modern structure, based upon plugins, which means that it can be easily adapted, but by default it really doesn't do anything, as its own documentation says. Its creators present it as a tool to support new features of JavaScript in old environments, but it can do much more. For instance, you can use it to create JavaScript-based languages, such as LightScript.

Articles & Tutorials

  • Building Domain Specific Languages on the CLR, an article on how to build internal DSL on the CLR. It’s slightly outdated, since it’s from 2008, but it’s still a good presentation on the subject.
  • The digital issue of MSDN Magazine for February 2008 (CHM format), contains an article on how to Create a Language Compiler for the .NET Framework. It’s still a competent overview of the whole process.
  • Create a working compiler with the LLVM framework, Part 1 and Part 2, a two-part series of articles on creating a custom compiler by IBM, from 2012 and thus slightly outdated.
  • A few series of tutorials from the LLVM documentation, these are three great linked series of tutorials on how to implement a language, called Kaleidoscope, with LLVM. The only problem is that some parts are not always up to date.
  • My First LLVM Compiler, a short and gentle introduction to the topic of building a compiler with LLVM.
  • Creating an LLVM Backend for the Cpu0 Architecture, a whopping 600-page tutorial on how to create an LLVM backend, also available as PDF or ePub. The content is great, but the English is lacking. On the positive side, if you are a student, they feel your pain of transforming theoretical knowledge into practical applications, and the book was made for you.
  • A Nanopass Framework for Compiler Education, a paper that presents a framework for teaching the creation of a compiler in a simpler way, transforming the traditional monolithic approach into a long series of simple transformations. It's an interesting read if you already have some theoretical background in computer science.
  • An Incremental Approach to Compiler Construction (PDF), a paper that is also a tutorial: it develops a basic Scheme compiler with an easier-to-learn approach.
  • The Super Tiny Compiler!: “This is an ultra-simplified example of all the major pieces of a modern compiler written in easy to read JavaScript” (and using Babel). There is a short introduction, but this is more of a walkthrough of the code than a proper tutorial. This may be a good or a bad thing, depending on your preference.

Books

  • Compilers: Principles, Techniques, and Tools, 2nd Edition, this is the widely known Dragon book (because of the cover) in the 2nd edition (purple dragon). There is a paperback edition, which probably costs less but it has no dragon on it, so you cannot buy that. It is a theoretical book, so don’t expect the techniques to actually include a lot of reusable code.
  • Modern Compiler Implementation in ML, this is known as the Tiger book, a competitor of the Dragon book. It teaches the structure and the elements of a compiler in detail. It's a theoretical book, although it explains the concepts with code. There are other versions of the same book written in Java and C, but it's widely agreed that the ML one is the best.
  • Engineering a Compiler, 2nd edition, another compiler book with a theoretical bent, but one that covers a more modern approach and is more readable. It's also more dedicated to the optimization of the compiler. So if you need a theoretical foundation with an engineering approach, this is the best book to get.

Interpreters

An interpreter directly executes the language without transforming it into another form.
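As a rough sketch of the idea (hypothetical code, not taken from the books below), a tree-walking interpreter evaluates the Abstract Syntax Tree directly, node by node:

```typescript
// A minimal tree-walking interpreter for arithmetic expressions.
type Expr =
  | { kind: "num"; value: number }
  | { kind: "binary"; op: "+" | "*"; left: Expr; right: Expr };

function evaluate(expr: Expr): number {
  switch (expr.kind) {
    case "num":
      return expr.value;
    case "binary": {
      // Evaluate children recursively, then apply the operator.
      const l = evaluate(expr.left);
      const r = evaluate(expr.right);
      return expr.op === "+" ? l + r : l * r;
    }
  }
}

// The AST for "2 + 3 * 4", as a parser might produce it.
const ast: Expr = {
  kind: "binary",
  op: "+",
  left: { kind: "num", value: 2 },
  right: {
    kind: "binary",
    op: "*",
    left: { kind: "num", value: 3 },
    right: { kind: "num", value: 4 },
  },
};
```

A compiler would instead walk the same tree and emit code for another target; the interpreter simply computes the result on the spot.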

Articles & Tutorials


  • Writing An Interpreter In Go, despite the title it actually shows everything from parsing to creating an interpreter. It's a contemporary book, both in the sense that it is recent (a few months old) and in its learn-by-doing attitude: it is short, full of code and tests, and uses no third-party libraries. We have interviewed the author, Thorsten Ball.
  • Crafting Interpreters, a work-in-progress and free book that already has good reviews. It is focused on making interpreters that work well, and in fact it builds two of them. It plans to have just the right amount of theory to be able to fit in at a party of programming language creators.

Section 4:


These are resources that cover a wide range of the process of creating a programming language. They may be comprehensive or just give a general overview.


In this section we include tools that cover the whole spectrum of building a programming language and that are usually used as standalone tools.

  • Xtext is a framework, part of a group of several related technologies for developing programming languages and especially Domain Specific Languages. It allows you to build everything from the parser, to the editor, to validation rules. You can use it to build great IDE support for your language. It simplifies the whole language building process by reusing and linking existing technologies under the hood, such as the ANTLR parser generator.
  • JetBrains MPS, is a projectional language workbench. Projectional means that the Abstract Syntax Tree is saved on disk and a projection is presented to the user. The projection could be text-like, or be a table or diagram or anything else you can imagine. One side effect of this is that you will not need to do any parsing, because it is not necessary. The term Language Workbench indicates that JetBrains MPS is a whole system of technologies created to help you create your own programming language: everything from the language itself to IDE and supporting tools designed for your language. You can use it to build every kind of language, but the possibility and need to create everything makes it ideal for creating Domain Specific Languages that are used for specific purposes, by specific audiences.
  • Racket, is described by its authors as “a general-purpose programming language as well as the world’s first ecosystem for developing and deploying new languages”. It's a pedagogical tool developed with practical ambitions that even has a manifesto. It is a language made to create other languages and it has everything: from libraries to develop GUI applications, to an IDE, to the tools to develop logic languages. It's part of the Lisp family of languages, and that tells you everything you need to know: it's all or nothing and always the Lisp way.


Books

  • How to create pragmatic, lightweight languages, the focus here is on making a language that works in practice. It explains how to generate bytecode, target the LLVM, and build an editor for your language. Once you read the book you should know everything you need to make a usable, productive language. Incidentally, we have written this book.
  • How To Create Your Own Freaking Awesome Programming Language, it's a 100-page PDF and a screencast that teaches you how to create a programming language using Ruby or the JVM. If you like the quick-and-dirty approach, this book will get you started in little time.
  • Writing Compilers and Interpreters: A Software Engineering Approach, 3rd edition, a pragmatic book that still teaches the proper approach to compilers/interpreters. Only instead of an academic focus, it has an engineering one. This means that it's full of Java code, and there is also UML sprinkled here and there. Both the techniques and the code are slightly outdated, but this is still the best book if you are a software engineer and you need to actually do something that works correctly right now, that is to say in a few months, after the proper review process has been completed.
  • Language Implementation Patterns, this is a book from the author of ANTLR, who is also a computer science professor. So it's a book with a mix of theory and practice, that guides you from start to finish, from parsing to compilers and interpreters. As the name implies, it focuses on explaining the known working patterns that are used in building this kind of software, rather than directly explaining all the theory followed by a practical application. It's the book to get if you need something that really works right now. It's even recommended by Guido van Rossum, the designer of Python.
  • Build Your Own Lisp, a very peculiar book meant to teach you how to use the C language and how to build your own programming language, using a mini-Lisp as the main example. You can read it for free online or buy it. It's meant to teach you about C, but you have to already be familiar with programming. There is even a picture of Mike Tyson (because… Lisp): it's all so weird, but fascinating.
  • Beautiful Racket: how to make your own programming languages with Racket, it’s a good and continually updated online book on how to use Racket to build a programming language. The book is composed of a series of tutorials and parts of explanation and reference. It’s the kind of book that is technically free, but you should pay for it if you use it.
  • Programming Languages: Application and Interpretation, an interesting book that explains how to create a programming language from scratch using Racket. The author is a teacher, but of the good and understandable kind. In fact, there is also a series of recordings of the companion lectures, which sometimes have questionable audio. There is an updated version of the book and of the recordings, but the new book has a different focus, because it also wants to teach about programming languages in general. It also doesn't use Racket. If you don't know any programming at all you may want to read the new version; if you are an experienced developer you may prefer the old one.
  • Implementing Domain-Specific Languages with Xtext and Xtend, 2nd edition, is a great book for people that want to learn with examples and using a test-driven approach. It covers all levels of designing a DSL, from the design of the type system, to parsing and building a compiler.
  • Implementing Programming Languages, is an introduction to building compilers and interpreters with the JVM as the main target. There are related materials (presentations, source code, etc.) in a dedicated webpage. It has a good balance of theory and practice, but it’s explicitly meant as a textbook. So don’t expect much reusable code. It’s the typical textbook also in the sense that it can be a great and productive read if you already have the necessary background (or a teacher), otherwise you risk ending up confused.
  • Implementing functional languages: a tutorial, a free book that explains how to create a simple functional programming language, from the parsing to the interpreter and compiler. Be warned, in its own words: “this book gives a practical approach to understanding implementations of non-strict functional languages using lazy graph reduction”. Also, expect a lot of math.
  • DSL Engineering, a great book that explains the theory and practice of building DSLs using language workbenches, such as MPS and Xtext. This means that other than traditional design aspects, such as parsing and interpreters, it covers things like how to create an IDE or how to test your DSL. It’s especially useful to software engineers, because it also discusses software engineering and business related aspects of DSLs. That is to say it talks about why a company should build a DSL.
  • Lisp in Small Pieces, an interesting book that explains in detail how to design and implement a language of the Lisp family. It describes “11 interpreters and 2 compilers” and many advanced implementation details, such as the optimization of the compiler. It's obviously most useful to people interested in creating a Lisp-related language, but it can be an interesting read for everybody.

Section 5:


Here you have the most complete collection of high-quality resources on creating programming languages. You just have to decide what you are going to read first.

At this point we have two pieces of advice for you:

  1. Get started. It does not matter how many amazing resources we send you: if you do not take the time to practice, to try, and to learn from your mistakes, you will never create a programming language.
  2. If you are interested in building programming languages you should subscribe to our newsletter. You will receive updates on new articles, more resources, ideas and advice, and ultimately become part of a community that shares your interest in building languages.

You should have all you need to get started. If you have questions, advice or ideas to share, feel free to write to us. We read and answer every email.

Our thanks to Krishna and the people on Hacker News for a few good suggestions.

Language Server Protocol: A Language Server For DOT With Visual Studio Code

In a previous post we have seen how the Language Server Protocol can be a game changer in language development: we can now build support for one language and integrate it with all the IDEs compatible with this protocol.

In this article we are going to see how easy it is to build support for the DOT language in Visual Studio Code.

Note that Visual Studio Code now also runs on Mac and Linux.

To do this we are going to build a system composed of two elements:

  • A server which will provide support for our language (DOT)
  • A very thin client that will be integrated in Visual Studio Code

If you want to support another editor you must create a new client for that editor, but you can reuse the same server. In fact, the server is nothing more than a node app.

The code for this article is in its companion repository. In this tutorial we are not going to show every little detail; we are going to focus on the interesting parts instead. We will start slowly, explaining how to set up the client, and then cover only the parts necessary to understand how everything works. So refer to the repository if you want to see all the code and the configuration details.

Language Server Protocol Recap

If you haven’t read the previous article, there are a few things that you may want to know. The protocol is based upon JSON-RPC, which means it is lightweight and simple both to use and to implement. In fact, there are already editors that support it, such as Visual Studio Code and Eclipse, and libraries for many languages and formats.

The client informs the server when a document is opened or changed, and the server must maintain its own representation of the open documents. The client can also send requests that must be fulfilled by the server, such as requesting hover information or completion suggestions.
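To give a feel for the wire format, here is a TypeScript sketch of how one such message is framed. The Content-Length header and JSON-RPC envelope follow the protocol specification, while the params shown are just an illustrative example:

```typescript
// Sketch: frame a JSON-RPC notification the way the Language Server Protocol expects.
interface JsonRpcNotification {
  jsonrpc: "2.0";
  method: string;
  params?: unknown;
}

function frame(message: JsonRpcNotification): string {
  const payload = JSON.stringify(message);
  // Each message is preceded by a Content-Length header and a blank line.
  // (payload.length equals the byte length here because the content is ASCII.)
  return `Content-Length: ${payload.length}\r\n\r\n${payload}`;
}

// The kind of notification a client sends when a document is opened.
const framed = frame({
  jsonrpc: "2.0",
  method: "textDocument/didOpen",
  params: { textDocument: { uri: "file:///example.dot", languageId: "dot" } },
});
```

In practice the libraries for Visual Studio Code handle this framing for you; the sketch only shows what travels over the connection.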

DOT Language

The DOT language lets you describe graphs. A tool named Graphviz can read descriptions in DOT and generate nice pictures from them. Below you can see an example. It does not matter if you are not familiar with DOT: it is a very simple language that is perfect for showing how to use the Language Server Protocol in practice.
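For instance, a small directed graph might be described like this (a made-up example, not the one from the repository):

```dot
digraph example {
  a -> b;
  a -> c;
  b -> d;
  c -> d;
}
```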


From a graph description like this one, Graphviz generates a rendered image of the graph.



Before starting, we have to install a couple of things:

  1. VS Code extension generator: to generate a skeleton extension
  2. VS Code extension for the DOT language: to register the DOT language

VS Code Extension Generator

The generator can be installed via npm and then run from the command line to scaffold the extension.

The generator will guide you through the creation of an extension. In our case, we need one for the client, which is the proper VS Code extension.

VSCode Extension Generator

VS Code Extension For The DOT Language

The Language Support for the DOT language is an extension to register the DOT language with Visual Studio Code.

You can navigate to the extensions page, search for dot, and install it.

You may wonder why we need an extension for the DOT language. I mean, aren’t we building one right now?

We are building a Language Server for the DOT language, to implement things such as verifying the correctness of the syntax or suggesting terms for autocompletion. The extension we are going to install instead provides basic stuff, like syntax highlighting. It also associates the .dot file extension with the DOT language. These kinds of extensions are quite easy to create: they consist of just configuration files.

Structure Of The Project

The project will be composed of:

  • a Language Protocol client
  • a Language Protocol server
  • a backend in C# and .NET Core that will provide the validation of the code

So under the project there will be a folder client, that will contain the client code, a folder server, that will contain the server code, and a folder csharp, that will contain the C# service. There will also be a data folder, that will contain two files with a list of shapes and colors.

Language Server Protocol architecture

The architecture of our application

Since the client and the server are actually node projects, you have to install the node packages, and for the .NET Core project you have to restore the NuGet packages, using the well known commands inside the respective folders. More precise information is available in the repository.

You also have to remember to start the C# project first (dotnet run) and then run the extension/client after that.

1. The Client For The Language Server Protocol

Configuration Of The Client

Our client is a Visual Studio Code extension, and a very simple one, given the integrated support in VS Code for the protocol.

We are writing it using TypeScript, a language that compiles to JavaScript. We are using TypeScript for two reasons: 1) because it provides some useful features, such as strong typing, and 2) because it’s the default language for a VS Code extension.

First we are going to set up how TypeScript will be compiled into JavaScript, by editing tsconfig.json.

In addition to the usual node_modules, we exclude the folder server, because it will contain the already compiled server. In fact, to simplify the use of the extension, the client will also contain the server. That is to say, we are going to create the server in another project, but we will output the source under the client/server folder, so that we could launch the extension with one command.
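A plausible tsconfig.json for this setup might look like the following sketch; the exact compiler options in the repository may differ, but the key part is the exclude list:

```json
{
  "compilerOptions": {
    "module": "commonjs",
    "target": "es6",
    "outDir": "out",
    "sourceMap": true
  },
  "exclude": ["node_modules", "server"]
}
```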

To do just that you can use Ctrl+Shift+B, while you are in the server project, to automatically build and output the server under the client as soon as you edit the code.

Setting Up The Extension

In package.json we include the needed packages and the scripts necessary to automatically compile, install the extension in our development environment, and run it.

For the most part it’s the default file; we just add a dependency.

The same file is also used to configure the extension.

The activationEvents setting is used to configure when the extension should be activated, in our case for DOT files. The particular value (onLanguage:dot) is available because we have installed the extension for Graphviz (DOT) files. This is also the only thing that we strictly need from the DOT extension. Alternatively, we could avoid installing the extension by adding the following to the contributes section, but doing it this way we would lose syntax highlighting.
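As a sketch, the relevant fragment of package.json could look like this; the languages contribution shown is the hypothetical alternative to installing the DOT extension, and the property namespace is an assumption, not the repository's actual name:

```json
{
  "activationEvents": ["onLanguage:dot"],
  "contributes": {
    "languages": [
      { "id": "dot", "aliases": ["DOT"], "extensions": [".dot", ".gv"] }
    ],
    "configuration": {
      "properties": {
        "dotLanguageServer.maxNumberOfProblems": {
          "type": "number",
          "default": 100
        }
      }
    }
  }
}
```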

The contributes section can also contain custom properties to modify the configuration of the extension. As an example we have a maxNumberOfProblems setting.

Now is the time to call npm install, so that the vscode-languageclient package will be installed.

Setting Up The Client

Let’s see the code for the client. Copy it (or type it) into src/extension.ts.

This is the entire client: if you remove the comments it is just a bunch of lines long.

We first create and setup the server, for normal and debug sessions.

Then we take care of the client: on line 27 we tell the client to call the server only for DOT files.

On line 37 we create the client with all the options and the information it needs.

This is quite easy, given the integrated support for the Language Server Protocol in Visual Studio Code, but if you had to create a client from scratch you would have to implement the JSON-RPC protocol and configure the client to call the server whenever needed: for instance, when the document changes or when the user asks for a completion suggestion.

2. The Server For The Language Server Protocol

Creating the Server

The basics of a server are equally easy: you just need to set up the connection and find a way to maintain a model of the documents.

The first thing to do is to create the connection and start listening.

Then we create an instance of TextDocuments, a class provided by Visual Studio Code to manage the documents on the server. For the server to work, it must maintain a model of the documents on which the client is working. This class listens on the connection and updates the model when the server is notified of a change.

On initialization we inform the client of the capabilities of the server. The Language Server Protocol can work in two different ways: either sending only the portions of the document that have changed, or sending the whole document each time. We choose the latter and, on line 13, inform the client that it should send the complete document every time.

We communicate to the client that our server supports autocompletion. In particular, on line 17, we say that it should only ask for suggestions after the equals character (‘=’). This makes sense for the DOT language, but in other cases you could choose not to specify any character, or to specify more than one.

We also support hover information: when the user leaves the mouse pointer over a token for some time we can provide additional information.

Finally we support validation, but we don’t need to tell the client about it. The rationale is that when we are informed of changes to the document we inform the client of any issues. So the client itself doesn’t have to do anything special, apart from notifying the server of any change.

Implement Autocompletion

The suggestions for autocompletion depend on the position of the cursor. For this reason the client specifies to the server the document and the position of each autocompletion request.

Given the simplicity of the DOT language there aren’t many elements to consider for autocompletion. In this example we consider the values for colors and shapes. In this article we are going to see how to create suggestions for colors, but the code in the repository also contains suggestions for shapes, which are created in the same way.

In the first few lines we set the proper values, and load the list of possible colors.

In particular, on line 2, we get the text of the document by calling the document manager with the document URI that we are given as input. In theory, we could also read the document directly from disk, using the provided URI, but that way we would have only the version saved on disk and would miss any unsaved changes in the current version.

Then, on line 11, we find the position of the equals (‘=’) character. You may wonder why we don’t just use position.character - 1: since the completion is triggered by that character, don’t we already know the relative position of the symbol? The answer is yes if we are starting a suggestion for a new item, but this isn’t always true. For instance, it’s not true if there is already a value and we want to change it.

VSCode Complete for existing items

Autocomplete for an existing value

By making sure to find the position of the equals sign, we can always know if we are assigning a value to an option, and which option it is.
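The logic can be sketched in isolation (with illustrative names, not the repository's actual code): starting from the cursor, look back for the nearest equals sign on the line and read the word that precedes it.

```typescript
// Given a line of DOT source and the cursor offset, find which option
// (e.g. "color" or "shape") the user is assigning a value to.
function optionAtCursor(line: string, cursor: number): string | null {
  // Search backwards from just before the cursor for the equals sign.
  const equalsIndex = line.lastIndexOf("=", cursor - 1);
  if (equalsIndex < 0) return null; // not inside an assignment
  // The option name is the word immediately before the equals sign.
  const before = line.slice(0, equalsIndex).trimEnd();
  const match = /(\w+)$/.exec(before);
  return match ? match[1] : null;
}
```

This works whether the cursor sits right after the equals sign or further along, inside an existing value.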

Sending The Suggestions For Autocomplete

These suggestions are used when the user has typed “color=”.

Now that we know the option name, if the option is color we send the list of colors. We provide them as an array of CompletionItems. On lines 26-28 you can see how they are created: you need a label and a CompletionItemKind. The “kind” value, at the moment, is only used to display an icon next to the suggestion.

The last element, data, is a field meant to contain custom data chosen by the developer. We are going to see later what it is used for.

We always send all the colors to the client; Visual Studio Code itself will filter the values considering what the user is typing. This may or may not be a good thing, since it makes it impossible to use abbreviations or nicknames to trigger a completion suggestion for something else. For example, you can’t type “Bill” to trigger “William Henry Gates III”.
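The items themselves are plain objects. A sketch of how the color suggestions might be built — the stand-in type declarations mimic the Language Server Protocol shapes (in the LSP, CompletionItemKind.Color is 16), and the layout of the data payload is our own assumption:

```typescript
// Minimal stand-ins for the vscode-languageserver types, so the sketch
// is self-contained; the real server imports them from the library.
enum CompletionItemKind { Text = 1, Color = 16 }
interface CompletionItem { label: string; kind: CompletionItemKind; data?: unknown; }

const colors = ['aliceblue', 'antiquewhite', 'aqua', 'red', 'blue'];

// One CompletionItem per color; "kind" only controls the icon shown,
// while "data" is a custom field we can resolve later.
const colorItems: CompletionItem[] = colors.map((c, i) => ({
  label: c,
  kind: CompletionItemKind.Color,
  data: { type: 'color', index: i },
}));

console.log(colorItems.length); // 5
```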

Giving Additional Information For The Selected CompletionItem

You may want to give additional information to the user, to make it easier to choose the correct suggestion, but you can’t send too much information at once. The solution is to use another event to provide the necessary information once the user has selected one suggestion.

The method onCompletionResolve is the one we need to use for doing just that.

It accepts a CompletionItem and adds values to it. In our case, if the suggestion is a color we give a link to the DOT documentation that contains the whole list of colors, and we specify which color scheme it is part of. Notice that the DOT specification also supports color schemes other than the X11 one, but our autocomplete doesn’t.
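The resolve step can be sketched as a pure function over the item. The exact detail text and link used by the real server may differ; this only illustrates the pattern of enriching the selected item on demand:

```typescript
// Stand-in for the LSP CompletionItem type used by the sketch.
interface CompletionItem {
  label: string;
  data?: { type?: string };
  detail?: string;
  documentation?: string;
}

// Only when the selected item is a color do we attach the extra
// information, so the full payload is never sent upfront.
function resolveItem(item: CompletionItem): CompletionItem {
  if (item.data && item.data.type === 'color') {
    item.detail = 'X11 color scheme';
    item.documentation = 'See https://www.graphviz.org/doc/info/colors.html';
  }
  return item;
}

const resolved = resolveItem({ label: 'red', data: { type: 'color' } });
console.log(resolved.detail); // "X11 color scheme"
```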

Validating A Document

Now we are going to see the main feature of our Language Server: the validation of the DOT file.

But first we make sure that the validation is triggered after every change of each document.

We do just that by calling the validation when we receive a notification of a change. Since there is a line of communication between the server and the client, we don’t have to answer right away: we first receive the notification of the change and, once the verification is complete, we send back the errors.

The function validateDotDocument takes the changed document as an argument and then computes the errors. In this particular case we use a C# backend to perform the actual validation, so we just have to proxy the request through the Language Server and format the results for the client.

While this may be overkill for our example, it’s probably the best choice for big projects. If you want to easily use many linters and libraries, you are not going to maintain a specific version of each of them just for the Language Server. By integrating other services you can keep the Language Server lean.

Validation is the best phase in which to integrate such services, because it’s not time sensitive: you can send back any errors at any time within a reasonable timeframe.

On lines 5-6 we receive back both our errors and other values computed by our service (line 4). We take advantage of the validation to compute the names of nodes and graphs, which we store in a global variable. We are going to use these names later, to satisfy requests for hover information.

The rest of the marked lines show the gist of the communication with the client:

  • we gather the diagnostic messages
  • we set them up for easier use by the user
  • we send them to the client.

We can choose the severity; for instance, we can also simply communicate information or warnings. We must also choose the range to which the message applies, so that the user can deal with it. Sometimes this doesn’t make sense or isn’t possible; in such cases we choose as the range the rest of the line, starting from the character that indicates the beginning of the error.
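As a sketch, this is what building one diagnostic per error might look like. The stand-in types mirror the Language Server Protocol shapes (severity values 1-4 come from the LSP spec); the shape of the error coming back from the validation service is our own assumption:

```typescript
// Minimal stand-ins for the LSP diagnostic types.
enum DiagnosticSeverity { Error = 1, Warning = 2, Information = 3, Hint = 4 }
interface Position { line: number; character: number; }
interface Range { start: Position; end: Position; }
interface Diagnostic { severity: DiagnosticSeverity; range: Range; message: string; source?: string; }

// Hypothetical error shape returned by the validation service.
interface ValidationError { line: number; character: number; message: string; }

// Build one diagnostic per error; since we don't always have a precise
// end position, we extend the range to the end of the offending line.
function toDiagnostic(err: ValidationError, lineLength: number): Diagnostic {
  return {
    severity: DiagnosticSeverity.Error,
    range: {
      start: { line: err.line, character: err.character },
      end: { line: err.line, character: lineLength },
    },
    message: err.message,
    source: 'dot',
  };
}

const d = toDiagnostic({ line: 2, character: 4, message: 'mismatched input' }, 20);
console.log(d.range.end.character); // 20
```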

The editor will then take care of communicating the mistake to the user, usually by underlining the text. But nothing forbids the client from doing something else; for instance, if the client is a log manager, it could simply store the error in some way.

How Visual Studio Code shows an error


We are going to see the actual validation later. Now, instead, let’s look at the last feature of our server: the Hover Provider.

A Simple Hover Provider

A Hover Provider’s job is to give additional information on the text that the user is hovering over, such as the type of an object (e.g., “class X”), documentation about it (e.g., “The method Y is […]”) or the signature of a method.

For our language server we chose to show which elements can be used in an edge declaration: node, graph or subgraph. To find that information we simply use a listener when we validate the DOT document on the service.

To communicate this information to the user, we search among the names that we have saved on the Language Server. If we find one at the position that the user is hovering over we tell the client, otherwise we show nothing.
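The lookup itself is simple. A sketch with an assumed record shape for the names saved during validation (the real server’s field names may differ):

```typescript
// Hypothetical record saved during validation: each named element with
// the position where its name appears and what kind of element it is.
interface NamedElement {
  name: string;
  line: number;
  start: number;
  end: number;
  kind: 'node' | 'graph' | 'subgraph';
}

const names: NamedElement[] = [
  { name: 'a', line: 1, start: 4, end: 5, kind: 'node' },
  { name: 'g', line: 0, start: 8, end: 9, kind: 'graph' },
];

// Return the hover text for the element under the cursor, or undefined
// so the server can answer with an empty hover.
function hoverAt(line: number, character: number): string | undefined {
  const el = names.find(e =>
    e.line === line && character >= e.start && character < e.end);
  return el ? `(${el.kind}) ${el.name}` : undefined;
}

console.log(hoverAt(1, 4)); // "(node) a"
console.log(hoverAt(5, 0)); // undefined
```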

VS Code Hover Information


3. The C# Backend

Choosing One ANTLR Runtime Library

Since we are creating a cross platform service we want to use the .NET Core Platform.

The new ANTLR 4.7.0 supports .NET Core, but at the moment the nuget package has a configuration problem and still doesn’t. Depending on when you read this the problem might have been solved, but for now you have two choices: you compile the ANTLR runtime for C# yourself, or you use the “C# optimized” version.

The problem is that this version is still in beta, and the integration to automatically create the C# files from the grammar is still to come. So, to generate the C# files, you have to download the latest beta nuget package for the ANTLR 4 Code Generator. Then you have to decompress the .nupkg file, which is actually a zip file, and run the included ANTLR4 program.

You can’t use the default ANTLR4 tool because, while it generates valid C# code, that code is not compatible with the “C# optimized” runtime (don’t ask…).

The ANTLR Service

We are going to rapidly skim through the structure of the C# ANTLR Service. It’s not that complicated, but if you don’t understand something you can read our ANTLR Mega Tutorial.

We set up ANTLR with a Listener, to gather the names to use for the hover information, and an ErrorListener, to collect any errors in our DOT document. Then we create a simple ASP.NET Core app to communicate with the Language Server in TypeScript.

In Program.cs (not shown) we configure the program to listen on port 3000; then we set up the main method, to communicate with the server, in Startup.cs.

We use a RouteBuilder to configure the path we answer from. Since we only answer one question we could have answered directly from the root path, but this way is cleaner and it’s easier to add other services.

You can see that we actually use two ErrorListener(s), one each for the lexer and the parser. This way we can give better error information in case of parser errors. The rest is the standard ANTLR program that uses a Listener:

  • we create the parser
  • we try to parse the main element (ie. graph)
  • we walk the resulting tree with our listener.

The errors are found when we try to parse (line 22); the names are gathered when we walk the tree with the LanguageListener (line 24).

The rest of the code simply prepares the JSON output that must be sent back to the Language Server.

Finding The Names

Let’s see where we find the names of the elements that we are providing for the hover information. This is achieved by listening to the firing of the id rule.

In our grammar the id rule is used to parse every name, attribute and value. So we have to distinguish between the cases to find the ones we care about and categorize them.

We do just that by looking at the type of the parent of the Id node that fired the method. On line 18 we subtract 1 from the line number because ANTLR counts lines as humans do, starting from 1, while Visual Studio Code counts as a developer does, starting from 0.
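As a sketch (with assumed field names), converting an ANTLR token position into the 0-based range that the editor expects could look like this. Note that ANTLR columns are already 0-based, so only the line needs adjusting:

```typescript
// Hypothetical token position as reported by ANTLR: the line starts
// from 1, while the column (charPositionInLine) starts from 0.
interface TokenPosition { line: number; column: number; text: string; }

// Convert to an LSP-style range, where both line and character are 0-based.
function toLspRange(tok: TokenPosition) {
  return {
    start: { line: tok.line - 1, character: tok.column },
    end: { line: tok.line - 1, character: tok.column + tok.text.length },
  };
}

console.log(toLspRange({ line: 1, column: 6, text: 'graph' }).start.line); // 0
```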


In this tutorial we have seen how to create a client for a Language Server in Visual Studio Code, the server itself, and a backend in C#.

We have seen how easily you can add features to your Language Server, and one way to integrate it with another service that you are already using. This way you can keep your language server as lightweight as you want: you can make it useful from day one and improve it along the way.

Of course we have leveraged a few things ready to be used:

  • an ANTLR grammar for our language
  • a client and a server implementation of the Language Server Protocol

What is amazing is that with one server you can leverage all the clients, even the ones you don’t write yourself.

If you need more inspiration you can find additional examples and information in the Visual Studio Code documentation. There is also the description of the Language Server Protocol itself, if you need to implement everything from scratch.

If you run into any issues following the tutorial, or if you want to provide any feedback, please feel free to write to us.

The ANTLR mega tutorial

Parsers are powerful tools, and using ANTLR you could write all sorts of parsers usable from many different languages.

In this complete tutorial we are going to:

  • explain the basics: what a parser is, what it can be used for
  • see how to setup ANTLR to be used from Javascript, Python, Java and C#
  • discuss how to test your parser
  • present the most advanced and useful features present in ANTLR: you will learn all you need to parse all possible languages
  • show tons of examples

Maybe you have read some tutorial that was too complicated, or so partial that it seemed to assume you already know how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That’s it.

At the end of this tutorial:

  • you will be able to write a parser to recognize different formats and languages
  • you will be able to create all the rules you need to build a lexer and a parser
  • you will know how to deal with the common problems you will encounter
  • you will understand errors and you will know how to avoid them by testing your grammar.

In other words, we will start from the very beginning, and when we reach the end you will have learned all you could possibly need to learn about ANTLR.

ANTLR Mega Tutorial Giant List of Content


What is ANTLR?

ANTLR is a parser generator, a tool that helps you create parsers. A parser takes a piece of text and transforms it into an organized structure, such as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or as its logical representation, created by putting together the various pieces.


Graphical representation of an AST for the Euclidean algorithm

What you need to do to get an AST:

  1. define a lexer and parser grammar
  2. invoke ANTLR: it will generate a lexer and a parser in your target language (e.g., Java, Python, C#, Javascript)
  3. use the generated lexer and parser: you invoke them passing the code to recognize and they return to you an AST

So you need to start by defining a lexer and parser grammar for the thing that you are analyzing. Usually the “thing” is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.

Aren’t regular expressions enough?

If you are the typical programmer you may ask yourself: why can’t I use a regular expression? A regular expression is quite useful, for instance when you want to find a number in a string of text, but it also has many limitations.

The most obvious is the lack of recursion: you can’t find a (regular) expression inside another one, unless you code it by hand for each level, something that quickly becomes unmaintainable. The larger problem, though, is that it’s not really scalable: if you put together even just a few regular expressions, you are going to create a fragile mess that is hard to maintain.

It’s not that easy to use regular expressions

Have you ever tried parsing HTML with a regular expression? It’s a terrible idea: for one, you risk summoning Cthulhu, but more importantly it doesn’t really work. You don’t believe me? Let’s see: you want to find the elements of a table, so you try a regular expression like this one: <table>(.*?)</table>. Brilliant! You did it! Except somebody adds attributes to their table, such as style or id. It doesn’t matter, you write <table.*?>(.*?)</table>. But you actually cared about the data inside the table, so you then need to parse tr and td, and they are full of tags.
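You can see the problem concretely in a few lines (a quick sketch in TypeScript; the regex is the one from the paragraph above):

```typescript
// The regex with attribute support: it matches the table, but the
// captured group is still full of markup that needs further parsing.
const tableRe = /<table.*?>(.*?)<\/table>/s;

const html = '<table style="width:100%"><tr><td>data</td></tr></table>';
const match = html.match(tableRe);

console.log(match![1]); // "<tr><td>data</td></tr>" — still full of tags
```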

So you need to eliminate those, too. And somebody even dares to use comments like <!-- my comment -->. Comments can appear everywhere, and that is not easy to handle with a regular expression, is it?

So you forbid the internet to use comments in HTML: problem solved.

Or alternatively you use ANTLR, whatever seems simpler to you.

ANTLR vs writing your own parser by hand

Okay, you are convinced: you need a parser. But why use a parser generator like ANTLR instead of building your own?

The main advantage of ANTLR is productivity

If you actually have to work with a parser all the time, because your language, or format, is evolving, you need to be able to keep the pace, something you can’t do if you have to deal with the details of implementing a parser. Since you are not parsing for parsing’s sake, you must have the chance to concentrate on accomplishing your goals. And ANTLR makes it much easier to do that, rapidly and cleanly.

Second, once you have defined your grammar you can ask ANTLR to generate multiple parsers in different languages. For example, you can get a parser in C# and one in Javascript to parse the same language in a desktop application and in a web application.

Some people argue that by writing a parser by hand you can make it faster and produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. You can tweak them and improve both performance and error handling by working on your grammar, if you really need to. And you can do that once you are happy with your grammar.

Table of Contents or ok, I am convinced, show me what you got

Two small notes:

  • in the companion repository of this tutorial you are going to find all the code with testing, even where we don’t see it in the article
  • the examples will be in different languages, but the knowledge is generally applicable to any language


  1. Setup ANTLR
  2. Javascript Setup
  3. Python Setup
  4. Java Setup
  5. C# Setup


  1. Lexers and Parsers
  2. Creating a Grammar
  3. Designing a Data Format
  4. Lexer Rules
  5. Parser Rules
  6. Mistakes and Adjustments


  1. Setting Up the Chat Project in Javascript
  2. Antlr.js
  3. HtmlChatListener.js
  4. Working with a Listener
  5. Solving Ambiguities with Semantic Predicates
  6. Continuing the Chat in Python
  7. The Python Way of Working with a Listener
  8. Testing with Python
  9. Parsing Markup
  10. Lexical Modes
  11. Parser Grammars


  1. The Markup Project in Java
  2. The Main
  3. Transforming Code with ANTLR
  4. Joy and Pain of Transforming Code
  5. Advanced Testing
  6. Dealing with Expressions
  7. Parsing Spreadsheets
  8. The Spreadsheet Project in C#
  9. Excel is Doomed
  10. Testing Everything

Final Remarks

  1. Tips and Tricks
  2. Conclusions


In this section we prepare our development environment to work with ANTLR: the parser generator tool, the supporting tools and the runtimes for each language.

1. Setup ANTLR

ANTLR is actually made up of two main parts: the tool, used to generate the lexer and parser, and the runtime, needed to run them.

The tool will be needed just by you, the language engineer, while the runtime will be included in the final software using your language.

The tool is always the same no matter which language you are targeting: it’s a Java program that you need on your development machine. While the runtime is different for every language and must be available both to the developer and to the user.

The only requirement for the tool is that you have installed at least Java 1.7. To install the Java program you need to download the latest version from the official site, which at the moment is:


  1. copy the downloaded tool where you usually put third-party java libraries (ex. /usr/local/lib or C:\Program Files\Java\lib)
  2. add the tool to your CLASSPATH. Add it to your startup script (ex. .bash_profile)
  3. (optional) add also aliases to your startup script to simplify the usage of ANTLR

Executing the instructions on Linux/Mac OS

Executing the instructions on Windows

Typical Workflow

When you use ANTLR you start by writing a grammar, a file with extension .g4 which contains the rules of the language that you are analyzing. You then use the antlr4 program to generate the files that your program will actually use, such as the lexer and the parser.

There are a couple of important options you can specify when running antlr4.

First, you can specify the target language, to generate a parser in Python or JavaScript or any other target different from Java (which is the default one). The other ones are used to generate visitor and listener (don’t worry if you don’t know what these are, we are going to explain it later).

By default only the listener is generated, so to create the visitor you use the -visitor command line option, and -no-listener if you don’t want to generate the listener. There are also the opposite options, -no-visitor and -listener, but they are the default values.

You can optionally test your grammar using a little utility named TestRig (although, as we have seen, it’s usually aliased to grun).

The filename(s) are optional and you can instead analyze the input that you type on the console.

If you want to use the testing tool you need to generate a Java parser, even if your program is written in another language. This can be done just by selecting a different option with antlr4.

Grun is useful when manually testing the first draft of your grammar. As it becomes more stable you may want to rely on automated tests (we will see how to write them).

Grun also has a few useful options: -tokens, to show the detected tokens, and -gui, to generate an image of the AST.

2. Javascript Setup

You can put your grammars in the same folder as your Javascript files. The file containing the grammar must have the same name of the grammar, which must be declared at the top of the file.

In the following example the name is Chat and the file is Chat.g4.

We can create the corresponding Javascript parser simply by specifying the correct option with the ANTLR4 Java program.

Notice that the option is case-sensitive, so pay attention to the uppercase ‘S’. If you make a mistake you will receive a message like the following.

ANTLR can be used both with node.js and in the browser. For the browser you need to use webpack or require.js. If you don’t know how to use either of the two you can look at the official documentation for some help or read this tutorial on antlr in the web. We are going to use node.js, for which you can install the ANTLR runtime simply by using the following standard command.

3. Python Setup

When you have a grammar you put that in the same folder as your Python files. The file must have the same name of the grammar, which must be declared at the top of the file. In the following example the name is Chat and the file is Chat.g4.

We can create the corresponding Python parser simply by specifying the correct option with the ANTLR4 Java program. For Python, you also need to pay attention to the version of Python, 2 or 3.

The runtime is available from PyPi, so you can just install it using pip.

Again, you just have to remember to specify the proper python version.

4. Java Setup

To setup our Java project using ANTLR you can do things manually. Or you can be a civilized person and use Gradle or Maven.

Also, you can look in ANTLR plugins for your IDE.

4.1 Java Setup Using Gradle

This is how I typically setup my Gradle project.

I use a Gradle plugin to invoke ANTLR and I also use the IDEA plugin to generate the configuration for IntelliJ IDEA.

I put my grammars under src/main/antlr/ and the gradle configuration makes sure they are generated in the directory corresponding to their package. For example, if I want the parser to be in the package me.tomassetti.mylanguage it has to be generated into generated-src/antlr/main/me/tomassetti/mylanguage.

At this point I can simply run:

And I get my lexer & parser generated from my grammar(s).

Then I can also run:

And I have an IDEA Project ready to be opened.

4.2 Java Setup Using Maven

First of all we are going to specify in our POM that we need antlr4-runtime as a dependency. We will also use a Maven plugin to run ANTLR through Maven.

We can also specify whether we want ANTLR to generate visitors or listeners. To do that we define a couple of corresponding properties.

Now you have to put the *.g4 files of your grammar under src/main/antlr4/me/tomassetti/examples/MarkupParser.

Once you have written your grammars you just run mvn package and all the magic happens: ANTLR is invoked, it generates the lexer and the parser and those are compiled together with the rest of your code.

If you have never used Maven you can look at the official ANTLR documentation for the Java target or also the Maven website to get you started.

There is a clear advantage in using Java for developing ANTLR grammars: there are plugins for several IDEs and it’s the language that the main developer of the tool actually works on. So there are tools, like org.antlr.v4.gui.TestRig, that can be easily integrated in your workflow and are useful if you want to easily visualize the AST of an input.

5. C# Setup

There is support for .NET Framework and Mono 3.5, but there is no support for .NET Core. We are going to use Visual Studio to create our ANTLR project, because there is a nice extension for Visual Studio 2015, created by the same author of the C# target, called ANTLR Language Support. You can install it by going to Tools -> Extensions and Updates. This extension will automatically generate the parser, lexer and visitor/listener when you build your project.

Furthermore, the extension will allow you to create a new grammar file, using the well known menu to add a new item. Last, but not least, you can setup the options to generate listener/visitor right in the properties of each grammar file.

Alternatives If You Are Not Using Visual Studio 2015

At the moment the extension only supports up to Visual Studio 2015, so if you are using another version of the IDE, or another IDE entirely, you need an alternative.

You can use the usual Java tool to generate everything, even for C#. You can do that just by indicating the right language. In this example the grammar is called “Spreadsheet”.

Notice that the ‘S’ in CSharp is uppercase.

Picking The Right Runtime

In both cases, whether you are using the extension or the Java tool, you also need an ANTLR4 runtime for your project, and you can install it with the good ol’ nuget. But you have to remember that not all runtimes are created equal.

The problem is that recently the C# target has been integrated into the usual Java tool, but the author of the extension has chosen to maintain a “C# optimized” version, which is incompatible with the usual ANTLR4 tool. This is not strictly a fork, since the same person continues to be a core contributor to the main ANTLR4 tool; it’s more of a parallel development.

So, if you are using the Visual Studio extension you need to use the most popular ANTLR4 runtime, authored by sharwell. If you are using the ANTLR4 tool to generate your C# lexer and parser, you need to use the recently created ANTLR4 standard runtime.


In this section we lay the foundation you need to use ANTLR: what lexer and parsers are, the syntax to define them in a grammar and the strategies you can use to create one. We also see the first examples to show how to use what you have learned.  You can come back to this section if you don’t remember how ANTLR works.

6. Lexers and Parsers

Before looking into parsers, we first need to look into lexers, also known as tokenizers. They are basically the first stepping stone toward a parser, and of course ANTLR allows you to build them too. A lexer takes the individual characters and transforms them into tokens, the atoms that the parser uses to create the logical structure.

Imagine this process applied to a natural language such as English. You are reading the single characters, putting them together until they make a word, and then you combine the different words to form a sentence.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and finds ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. So it knows that the first characters actually represent a number. Then it finds a ‘+’ symbol, so it knows that it represents an operator, and lastly it finds another number.

How the input stream is analyzed

How does it know that? Because we tell it.

This is not a complete grammar, but we can already see that lexer rules are all uppercase, while parser rules are all lowercase. Technically the rule about case applies only to the first character of their names, but usually they are all uppercase or lowercase for clarity.

Rules are typically written in this order: first the parser rules and then the lexer ones, although logically they are applied in the opposite order. It’s also important to remember that lexer rules are analyzed in the order that they appear, and they can be ambiguous.

The typical example is the identifier: in many programming languages it can be any string of letters, but certain combinations, such as “class” or “function”, are forbidden because they indicate a class or a function. So the order of the rules solves the ambiguity by using the first match, and that’s why the tokens identifying keywords such as class or function are defined first, while the one for the identifier is put last.

The basic syntax of a rule is easy: there is a name, a colon, the definition of the rule and a terminating semicolon.

The definition of NUMBER contains a typical range of digits and a ‘+’ symbol to indicate that one or more matches are allowed. These are all very typical notations with which I assume you are familiar; if not, you can read more about the syntax of regular expressions.

The most interesting part is at the end: the lexer rule that defines the WHITESPACE token. It’s interesting because it shows how to tell ANTLR to ignore something. Consider how ignoring whitespace simplifies parser rules: if we couldn’t ignore WHITESPACE we would have to include it between every single subrule of the parser, to let users put spaces wherever they want. Like this:
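The grammar snippet this section refers to isn’t shown here; a minimal reconstruction along the lines described (a sketch, not necessarily the article’s exact code) could be:

```antlr
grammar Math;

// With WHITESPACE skipped, the parser rule stays clean
operation  : NUMBER '+' NUMBER ;

// Without '-> skip' we would be forced to write something like:
// operation : WHITESPACE* NUMBER WHITESPACE* '+' WHITESPACE* NUMBER WHITESPACE* ;

NUMBER     : [0-9]+ ;
WHITESPACE : ' ' -> skip ;
```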

And the same typically applies to comments: they can appear everywhere and we do not want to handle them specifically in every single piece of our grammar, so we just ignore them (at least while parsing).

7. Creating a Grammar

Now that we have seen the basic syntax of a rule, we can take a look at the two different approaches to define a grammar: top-down and bottom-up.

Top-down approach

This approach consists in starting from the general organization of a file written in your language.

What are the main sections of a file? What is their order? What is contained in each section?

For example a Java file can be divided in three sections:

  • package declaration
  • imports
  • type definitions

This approach works best when you already know the language or format that you are designing a grammar for. It is probably the strategy preferred by people with a good theoretical background or people who prefer to start with “the big plan”.

When using this approach you start by defining the rule representing the whole file. It will probably include other rules, to represent the main sections. You then define those rules and you move from the most general, abstract rules to the low-level, practical ones.

Bottom-up approach

The bottom-up approach consists in focusing on the small elements first: defining how the tokens are captured, how the basic expressions are defined and so on. Then we move to higher level constructs until we define the rule representing the whole file.

I personally prefer to start from the bottom, with the basic items, which are analyzed with the lexer. Then you grow naturally from there to the structure, which is dealt with by the parser. This approach lets you focus on a small piece of the grammar, build tests for it, ensure it works as expected and then move on to the next bit.

This approach mimics the way we learn. Furthermore, there is the advantage of starting with real code that is actually quite common among many languages. In fact, most languages have things like identifiers, comments, whitespace, etc. Obviously you might have to tweak something; for example a comment in HTML is functionally the same as a comment in C#, but it has different delimiters.

The disadvantage of the bottom-up approach is that the parser is the thing you actually care about. You weren’t asked to build a lexer, you were asked to build a parser that provides a specific functionality. So by starting with the last part, the lexer, you might end up doing some refactoring, if you don’t already know how the rest of the program will work.

8. Designing a Data Format

Designing a grammar for a new language is difficult. You have to create a language that is simple and intuitive for the user, but also unambiguous enough to keep the grammar manageable. It must be concise, clear and natural, and it shouldn’t get in the way of the user.

So we are starting with something limited: a grammar for a simple chat program.

Let’s start with a better description of our objective:

  • there are not going to be paragraphs, and thus we can use newlines as separators between the messages
  • we want to allow emoticons, mentions and links. We are not going to support HTML tags
  • since our chat is going to be for annoying teenagers, we want to allow users an easy way to SHOUT and to format the color of the text.

Finally teenagers could shout, and all in pink. What a time to be alive.

9. Lexer Rules

We start by defining the lexer rules for our chat language. Remember that lexer rules actually go at the end of the file.

In this example we use rule fragments: they are reusable building blocks for lexer rules. You define them and then refer to them in lexer rules. If you define them but do not include them in any lexer rule they simply have no effect.

We define a fragment for each letter we want to use in keywords. Why is that? Because we want to support case-insensitive keywords. Besides avoiding the repetition of upper- and lowercase characters, fragments are also used when dealing with floating-point numbers, to avoid repeating the digits rule before and after the dot/comma, as in the following example.

The TEXT token shows how to capture everything except the characters that follow the tilde (‘~’). We are excluding the closing square bracket ‘]’, but since it is a character used to identify the end of a group of characters, we have to escape it by prefixing it with a backslash ‘\’.

The newline rule is formulated that way because there are actually different ways in which operating systems indicate a newline: some include a carriage return ('\r'), others a newline ('\n') character, or a combination of the two.
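The lexer rules described in this section could be reconstructed along these lines (a sketch; check the companion repository for the exact grammar):

```antlr
// Fragments: reusable pieces, never producing tokens on their own
fragment A          : ('A'|'a') ;
fragment S          : ('S'|'s') ;
fragment Y          : ('Y'|'y') ;
fragment H          : ('H'|'h') ;
fragment O          : ('O'|'o') ;
fragment U          : ('U'|'u') ;
fragment T          : ('T'|'t') ;
fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;

// Case-insensitive keywords built from the fragments
SAYS                : S A Y S ;
SHOUTS              : S H O U T S ;
WORD                : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE          : (' ' | '\t') ;
// Either '\r\n', a lone '\n', or a lone '\r'
NEWLINE             : ('\r'? '\n' | '\r')+ ;
// '~' negates the set: any run of characters that are not ']' or ')'
// (the ']' must be escaped inside the set)
TEXT                : ~[\])]+ ;
```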

10. Parser Rules

We continue with parser rules, which are the rules with which our program will interact most directly.

The first interesting part is message, not so much for what it contains, but for the structure it represents. We are saying that a message could be anything from the listed rules, in any order. This is a simple way to solve the problem of dealing with whitespace without repeating it every time. Since we, as users, find whitespace irrelevant, we see something like WORD WORD mention, but the parser actually sees WORD WHITESPACE WORD WHITESPACE mention WHITESPACE.

Another, more advanced way of dealing with whitespace, when you can't get rid of it, is lexical modes. Basically they allow you to specify two lexer parts: one for the structured part of the input, the other for simple text. This is useful for parsing things like XML or HTML. We are going to show it later.

The command rule is obvious; you just have to notice that you cannot have a space between the two options for command and the colon, but you need one WHITESPACE after. The emoticon rule shows another notation to indicate multiple choices: you can use the pipe character '|' without parentheses. We support only two emoticons, happy and sad, with or without the middle line.
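A sketch of the parser rules just described (names follow the chat grammar discussed; details may differ from the original file):

```antlr
// A message is any of the alternatives, in any order; WHITESPACE is
// listed as an alternative so we don't have to repeat it everywhere.
message  : (emoticon | link | color | mention | WORD | WHITESPACE)+ ;

// No space allowed between the keyword and the colon,
// but one WHITESPACE is required after it.
command  : (SAYS | SHOUTS) ':' WHITESPACE ;

// Happy and sad emoticons, with or without the middle line.
emoticon : ':' '-'? ')'
         | ':' '-'? '('
         ;
```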

Something that could be considered a bug, or a poor implementation, is the link rule; as we already said, TEXT captures everything apart from certain special characters. You may want to allow only WORD and WHITESPACE inside the parentheses, or to force a correct format for a link inside the square brackets. On the other hand, this allows the user to make a mistake in writing the link without making the parser complain.

You have to remember that the parser cannot check for semantics

For instance, it cannot know whether the WORD indicating the color actually represents a valid color. That is to say, it doesn't know that it's wrong to use “dog” but right to use “red”. This must be checked by the logic of the program, which can access the list of available colors. You have to find the right balance in dividing enforcement between the grammar and your own code.

The parser should check only the syntax. So the rule of thumb is that, when in doubt, you let the parser pass the content up to your program. Then, in your program, you check the semantics and make sure that the rule actually has a proper meaning.
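The division of labor can be sketched in plain Python: the parser accepts any WORD where a color is expected, and the program checks the semantics (the color names and function below are illustrative):

```python
# The parser happily accepts any WORD where a color is expected.
# Checking that the word is actually a valid color is semantics,
# so it belongs in the program, not in the grammar.
VALID_COLORS = {"red", "green", "blue", "pink"}

def check_color(word):
    """Return the normalized color if valid, otherwise raise a semantic error."""
    if word.lower() not in VALID_COLORS:
        raise ValueError(f"'{word}' is not a valid color")
    return word.lower()

print(check_color("RED"))   # the parser passed it up; semantics accepts it
```

The grammar stays simple, and the error message can be as friendly as your program wants it to be.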

Let's look at the rule color: it can include a message, and it itself can be part of message; this ambiguity will be solved by the context in which it is used.

11. Mistakes and Adjustments

Before trying our new grammar we have to add a name for it, at the beginning of the file. The name must be the same as the file name, which should have the .g4 extension.

You can find how to install everything, for your platform, in the official documentation. After everything is installed, we create the grammar, compile the generated Java code and then run the testing tool.

Okay, it doesn't work. Why is it expecting WORD? It's right there! Let's try to find out, using the option -tokens to make it show the tokens it recognizes.

So it only sees the TEXT token. But we put that token at the end of the grammar, so what happens? The problem is that the lexer always tries to match the longest possible token, and all this text is a valid TEXT token. How do we solve this problem? There are many ways: the first, of course, is just getting rid of that token. But for now we are going to see the second easiest.

We have changed the problematic token to make it include a preceding parenthesis or square bracket. Note that this isn't exactly the same thing, because it would allow two consecutive series of parentheses or square brackets. But it is a first step, and we are learning here, after all.

Let’s check if it works:

Using the option -gui we can also have a nice, and easier to understand, graphical representation.

ANTLR4 Parse Tree

The dot in mid air represents whitespace.

This works, but it isn't very smart, nice, or organized. Don't worry, later we are going to see a better way. One positive aspect of this solution is that it allows us to show another trick.

This is an equivalent formulation of the token TEXT: the '.' matches any character, '*' says that the preceding match can be repeated any number of times, and '?' indicates that the previous match is non-greedy. That is to say, the preceding subrule matches everything except what follows it, allowing it to match the closing parenthesis or square bracket.
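A sketch of that non-greedy formulation (the exact shape of the rule in the original file may differ):

```antlr
// '.' matches any character, '*' repeats it, and '?' makes the
// repetition non-greedy: it stops as soon as the following ']'
// or ')' can be matched.
TEXT : ('[' .*? ']') | ('(' .*? ')') ;
```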


In this section we see how to use ANTLR in your programs, the libraries and functions you need to use, how to test your parsers, and the like. We see what a listener is and how to use it. We also build on our knowledge of the basics by looking at more advanced concepts, such as semantic predicates. While our projects are mainly in Javascript and Python, the concepts are generally applicable to every language. You can come back to this section when you need to remember how to get your project organized.

12. Setting Up the Chat Project with Javascript

In the previous sections we have seen how to build a grammar for a chat program, piece by piece. Let's now copy the grammar we just created into the same folder as our Javascript files.

We can create the corresponding Javascript parser simply by specifying the correct option with the ANTLR4 Java program.

Now you will find some new files in the folder, with names such as ChatLexer.js and ChatParser.js, and there are also *.tokens files, none of which contains anything interesting for us, unless you want to understand the inner workings of ANTLR.

The file you want to look at is ChatListener.js. You are not going to modify anything in it, but it contains the methods and functions that we will override with our own listener. We are not going to modify it because any changes would be overwritten every time the grammar is regenerated.

Looking into it you can see several enter/exit functions, a pair for each of our parser rules. These functions will be invoked when a piece of code matching the rule is encountered. This is the default implementation of the listener, which allows you to override just the functions that you need, in your derived listener, and leave the rest as they are.

The alternative to creating a Listener is creating a Visitor. The main differences are that you can neither control the flow of a listener nor return anything from its functions, while you can do both with a visitor. So if you need to control how the nodes of the AST are entered, or to gather information from several of them, you probably want to use a visitor. This is useful, for example, with code generation, where the information needed to create new source code is spread around many parts. Both the listener and the visitor use a depth-first traversal.

A depth-first search means that when a node is accessed, its children are accessed, and if one of the child nodes has children of its own, they are accessed before continuing with the other children of the first node. The following image will make the concept simpler to understand.

Depth-first search

So in the case of a listener, an enter event will be fired at the first encounter with the node, and an exit one will be fired after having exited all of its children. In the following image you can see an example of which functions will be fired when a listener meets a line node (for simplicity only the functions related to line are shown).

ANTLR Listener Example
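The firing order just described can be simulated in plain Python, without ANTLR (the node names are illustrative):

```python
# A sketch of how a listener walk fires enter/exit events
# in depth-first order over a tiny hand-built tree.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def walk(node, events):
    events.append("enter " + node.name)
    for child in node.children:          # children are visited before siblings
        walk(child, events)
    events.append("exit " + node.name)   # fired only after ALL children exited

# line -> name, message; message -> word
tree = Node("line", [Node("name"), Node("message", [Node("word")])])
events = []
walk(tree, events)
print(events)
# -> ['enter line', 'enter name', 'exit name', 'enter message',
#     'enter word', 'exit word', 'exit message', 'exit line']
```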

With a standard visitor the behavior will be analogous, except, of course, that only a single visit event will be fired for each node. In the following image you can see an example of which function will be fired when a visitor meets a line node (for simplicity only the function related to line is shown).

ANTLR Visitor Example

Remember that this is true for the default implementation of a visitor, and it's done by visiting the children of each node in every function. If you override a method of the visitor, it's your responsibility to make it continue the traversal or stop it right there.
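The visitor's control flow can also be sketched in plain Python: each visit method returns a value, and an overridden method decides whether the children are visited at all (class and node names are illustrative):

```python
# Sketch of the visitor pattern: visit* methods return values, and an
# override can stop the traversal by not visiting the children.
class Node:
    def __init__(self, kind, text="", children=()):
        self.kind, self.text, self.children = kind, text, list(children)

class Visitor:
    def visit(self, node):
        # dispatch to visit_<kind> if defined, else the default behavior
        return getattr(self, "visit_" + node.kind, self.visit_children)(node)

    def visit_children(self, node):
        # default implementation: visit the children and gather results
        return "".join(self.visit(c) for c in node.children)

class TextVisitor(Visitor):
    def visit_content(self, node):
        return node.text            # a leaf returns something to its parent

    def visit_secret(self, node):
        return ""                   # override: children are never visited

tree = Node("message", children=[
    Node("content", "hello "),
    Node("secret", children=[Node("content", "hidden")]),
    Node("content", "world"),
])
print(TextVisitor().visit(tree))    # -> "hello world"
```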

13. Antlr.js

It is finally time to see how a typical ANTLR program looks.

At the beginning of the main file we import (using require) the necessary libraries and files: antlr4 (the runtime) and our generated parser, plus the listener that we are going to see later.

For simplicity we get the input from a string, while in a real scenario it would come from an editor.

Lines 16-19 show the foundation of every ANTLR program: you create the stream of chars from the input, you give it to the lexer, which transforms them into tokens, which are then interpreted by the parser.

It's useful to take a moment to reflect on this: the lexer works on the characters of the input, a copy of the input to be precise, while the parser works on the tokens generated by the lexer. The lexer doesn't work on the input directly, and the parser doesn't even see the characters.

This is important to remember in case you need to do something advanced like manipulating the input. In this case the input is a string, but, of course, it could be any stream of content.

Line 20 is redundant, since the option already defaults to true, but that could change in future versions of the runtimes, so you are better off specifying it.

Then, on line 21, we set the root node of the tree to be the chat rule. You invoke the parser by specifying a rule, which typically is the first rule. However, you can actually invoke any rule directly, such as color.

Once we get the AST from the parser, we typically want to process it using a listener or a visitor. In this case we use a listener. Our particular listener takes a parameter: the response object. We want to use it to put some text in the response to send to the user. After setting the listener up, we finally walk the tree with it.

14. HtmlChatListener.js

We continue by looking at the listener of our Chat project.

After the require calls, we make our HtmlChatListener extend ChatListener. The interesting stuff starts at line 17.

The ctx argument is an instance of a specific context class for the node that we are entering/exiting. So for enterName it is NameContext, for exitEmoticon it is EmoticonContext, etc. This specific context has the proper elements for the rule, which makes it possible to easily access the respective tokens and subrules. For example, NameContext contains fields like WORD() and WHITESPACE(), while CommandContext contains fields like WHITESPACE(), SAYS() and SHOUTS().

These functions, enter* and exit*, are called by the walker every time the corresponding nodes are entered or exited while it's traversing the AST that represents the program. A listener allows you to execute some code, but it's important to remember that you can't stop the execution of the walker and the execution of the functions.

On line 18, we start by printing a strong tag because we want the name to be bold; then, in exitName, we take the text from the token WORD and close the tag. Note that we ignore the WHITESPACE token: nothing says that we have to show everything. In this case we could have done everything in either the enter or the exit function.

In the function exitEmoticon we simply transform the emoticon text into an emoji character. We get the text of the whole rule because there are no tokens defined for this parser rule. In enterCommand, instead, there could be either of two tokens, SAYS or SHOUTS, so we check which one is defined. We then alter the following text, transforming it to uppercase, if it's a SHOUT. Note that we close the p tag at the exit of the line rule, because the command, semantically speaking, alters all the text of the message.

All we have to do now is launch node, with nodejs antlr.js, and point our browser at its address, usually http://localhost:1337/, and we will be greeted with the following image.

ANTLR4 Javascript

So all is good, we just have to add all the different listeners to handle the rest of the language. Let's start with color and message.

15. Working with a Listener

We have seen how to start defining a listener. Now let's get serious and see how to evolve it into a complete, robust listener. Let's start by adding support for color and checking the results of our hard work.

ANTLR4 Javascript output

Except that it doesn't work. Or maybe it works too much: we are writing some parts of the message twice (“this will work”): first when we check the specific nodes, children of message, and then at the end.

Luckily with Javascript we can dynamically alter objects, so we can take advantage of this fact to change the *Context objects themselves.

Only the modified parts are shown in the snippet above. We add a text field to every node that transforms its text, and then at the exit of every message we print the text if it's the primary message, the one that is directly a child of the line rule. If it's a message that is also a child of color, we add the text field to the node we are exiting and let color print it. We check this on line 30, where we look at the parent node to see if it's an instance of the object LineContext. This is also further evidence of how each ctx argument corresponds to the proper type.

Between lines 23 and 27 we can see another field of every node of the generated tree: children, which obviously contains the child nodes. You can observe that if a text field exists we add it to the proper variable; otherwise we use the usual function to get the text of the node.

16. Solving Ambiguities with Semantic Predicates

So far we have seen how to build a parser for a chat language in Javascript. Let's continue working on this grammar, but switch to Python. Remember that all the code is available in the repository. Before that, we have to solve an annoying problem: the TEXT token. The solution we have is terrible, and furthermore, if we tried to get the text of the token we would have to trim the edges, parentheses or square brackets. So what can we do?

We can use a particular feature of ANTLR called semantic predicates. As the name implies, they are expressions that produce a boolean value. They selectively enable or disable the following rule and thus permit us to solve ambiguities. Another reason to use them is to support different versions of the same language, for instance a version with a new construct or an old one without it.

Technically they are part of the larger group of actions, which allow you to embed arbitrary code into the grammar. The downside is that the grammar is no longer language independent, since the code in the action must be valid for the target language. For this reason, it's usually considered a good idea to use semantic predicates only when they can't be avoided, and to leave most of the code to the visitor/listener.

We restored link to its original formulation, but we added a semantic predicate to the TEXT token, written inside curly brackets and followed by a question mark. We use self._input.LA(-1) to check the character before the current one: if this character is a square bracket or an open parenthesis, we activate the TEXT token. It's important to repeat that this must be valid code in our target language: it's going to end up in the generated Lexer or Parser, in our case the lexer.

This matters not just for the syntax itself, but also because different targets might have different fields or methods: for instance, LA returns an int in Python, so we have to convert the char to an int.
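A sketch of what such a predicate can look like with the Python target (the exact rule in the original grammar may differ; note the ord() conversions, since LA returns an int in the Python runtime):

```antlr
// The TEXT token is enabled only when the previous character
// is '[' or '('. The code inside {...}? must be valid Python,
// because it ends up in the generated lexer.
TEXT : {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}? ~[\])]+ ;
```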

Let’s look at the equivalent form in other languages.

If you want to test for the preceding token, you can use _input.LT(-1), but you can only do that in parser rules. For example, you may want to enable the mention rule only if it is preceded by a WHITESPACE token.

17. Continuing the Chat in Python

Before seeing the Python example, we must modify our grammar and put the TEXT token before the WORD one. Otherwise ANTLR might assign the incorrect token in cases where the characters between parentheses or brackets are all valid for WORD, for instance if it were [this](link).

Using ANTLR in Python is no more difficult than on any other platform; you just need to pay attention to the version of Python, 2 or 3.

And that's it. So when you have run the command, inside the directory of your Python project, there will be a newly generated parser and lexer. You may find it interesting to look at them, and in particular at the function TEXT_sempred (sempred stands for semantic predicate).

You can see our predicate right in the code. This also means that you have to check that the correct libraries, for the functions used in the predicate, are available to the lexer.

18. The Python Way of Working with a Listener

The main file of a Python project is very similar to a Javascript one, mutatis mutandis of course. That is to say, we have to adapt the libraries and functions to the proper version for a different language.

We have also changed the input and output to be files; this avoids the need to launch a server in Python and the problem of using characters that are not supported in the terminal.

Apart from lines 35-36, where we introduce support for links, there is nothing new. Though you might notice that Python syntax is cleaner and, while it has dynamic typing, it is not as loosely typed as Javascript. The different types of *Context objects are explicitly written out. If only Python tools were as easy to use as the language itself. But of course we cannot just fly over Python like this, so we also introduce testing.

19. Testing with Python

While Visual Studio Code has a very nice extension for Python, which also supports unit testing, we are going to use the command line for the sake of compatibility.

That's how you run the tests, but before that we have to write them. Actually, even before that, we have to write an ErrorListener to manage the errors that we may encounter. While we could simply read the text output by the default error listener, there is an advantage in using our own implementation, namely that we can more easily control what happens.

Our class derives from ErrorListener, and we simply have to implement syntaxError. We also add a property, symbol, to easily check which symbol might have caused an error.
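The idea can be sketched in standalone Python (the real class would derive from the antlr4 runtime's ErrorListener; the class name and the exact handling of the offending symbol below are illustrative):

```python
# Standalone sketch of the error listener idea: record which symbol
# caused an error so tests can assert on it easily.
class ChatErrorListener:
    def __init__(self):
        self.symbol = ""            # empty means "no error so far"

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # remember the offending symbol for easy checking in tests
        self.symbol = offendingSymbol

listener = ChatErrorListener()
listener.syntaxError(None, "?", 1, 0, "token recognition error", None)
print(listener.symbol)              # -> "?"
```

A test can then simply assert that symbol is empty for valid input and non-empty otherwise.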

The setup method is used to ensure that everything is properly set up; on lines 19-21 we also set up our ChatErrorListener, but first we remove the default one, otherwise it would still output errors on the standard output. We are listening to errors in the parser, but we could also catch errors generated by the lexer. It depends on what you want to test; you may want to check both.

The two proper test methods check for a valid and an invalid name. The checks are linked to the symbol property that we previously defined: if it's empty, everything is fine; otherwise it contains the symbol that caused the error. Notice that on line 28 there is a space at the end of the string, because we have defined the rule name to end with a WHITESPACE token.

20. Parsing Markup

ANTLR can parse many things, including binary data, in which case tokens are made up of non-printable characters. But a more common problem is parsing markup languages such as XML or HTML. Markup is also a useful format to adopt for your own creations, because it allows you to mix unstructured text content with structured annotations. Markup documents fundamentally represent a form of smart document, containing both text and structured data. The technical term that describes them is island languages. This type is not restricted to markup, and sometimes it's a matter of perspective.

For example, you may have to build a parser that ignores preprocessor directives. In that case, you have to find a way to distinguish proper code from directives, which obey different rules.

In any case, the problem with parsing such languages is that there is a lot of text that we don't actually have to parse, but that we cannot ignore or discard, because the text contains useful information for the user and is a structural part of the document. The solution is lexical modes, a way to parse structured content inside a larger sea of free text.

21. Lexical Modes

We are going to see how to use lexical modes, by starting with a new grammar.

Looking at the first line you can notice a difference: we are defining a lexer grammar, instead of the usual (combined) grammar. That is because you can use lexical modes only in a lexer grammar, not in a combined grammar. The rest is not surprising: as you can see, we are defining a sort of BBCode markup, with tags delimited by square brackets.

On lines 3, 7 and 9 you will find basically all that you need to know about lexical modes. You define one or more tokens that can delimit the different modes and activate them.

The default mode is already implicitly defined; if you need to define your own, you simply use mode followed by a name. Other than for markup languages, lexical modes are typically used to deal with string interpolation, when a string literal can contain more than simple text, such as arbitrary expressions.
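A sketch of the kind of lexer grammar described, with a mode switch on the square brackets (token names and character sets are illustrative):

```antlr
lexer grammar Markup;

// Default mode: free text, until an open bracket switches mode.
OPEN : '[' -> pushMode(BBCODE) ;
TEXT : ~('[')+ ;

// Structured mode for the inside of a tag.
mode BBCODE;
CLOSE  : ']' -> popMode ;
SLASH  : '/' ;
EQUALS : '=' ;
ID     : [a-zA-Z]+ ;
```

pushMode activates the BBCODE mode when '[' is seen, and popMode restores the default mode on ']'.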

When we used a combined grammar we could define tokens implicitly: when we used a string like '=' in a parser rule, that is what we did. Now that we are using separate lexer and parser grammars we cannot do that. That means every single token has to be defined explicitly. So we have definitions like SLASH or EQUALS, which typically could just be used directly in a parser rule. The concept is simple: in the lexer grammar we need to define all tokens, because they cannot be defined later in the parser grammar.

22. Parser Grammars

We look at the other side of a lexer grammar, so to speak.

On the first line we define a parser grammar. Since the tokens we need are defined in the lexer grammar, we need to use an option to tell ANTLR where it can find them. This is not necessary in combined grammars, since the tokens are defined in the same file.

There are many other options available; you can find them in the documentation.

There is almost nothing else to add, except that we define a content rule so that we can more easily manage the text that we find later in the program.

I just want to note that, as you can see, we don't need to explicitly use the tokens every time (e.g., SLASH); instead we can use the corresponding text (e.g., '/').

ANTLR will automatically transform the text into the corresponding token, but this can happen only if the tokens are already defined. In short, it is as if we had written:

But we could not have used the implicit way if we hadn't already explicitly defined the tokens in the lexer grammar. Another way to look at this is: when we define a combined grammar, ANTLR defines for us all the tokens that we have not explicitly defined ourselves. When we use separate lexer and parser grammars, we have to define every token explicitly ourselves. Once we have done that, we can use them in every way we want.

Before moving to actual Java code, let’s see the AST for a sample input.

Sample AST of the Markup parser

You can easily notice that the element rule is sort of transparent: where you would expect to find it, there is always going to be a tag or content. So why did we define it? There are two advantages: it avoids repetition in our grammar and it simplifies managing the results of the parsing. We avoid repetition because without the element rule we would have to repeat (content|tag) everywhere it is used. What if one day we add a new type of element? In addition, it simplifies the processing of the AST because it makes both the nodes representing tag and content extend a common ancestor.


In this section we deepen our understanding of ANTLR. We will look at more complex examples and situations we may have to handle in our parsing adventures. We will learn how to perform more advanced testing, to catch more bugs and ensure better quality for our code. We will see what a visitor is and how to use it. Finally, we will see how to deal with expressions and the complexity they bring.

You can come back to this section when you need to deal with complex parsing problems.

23. The Markup Project in Java

You can follow the instructions in Java Setup or just copy the antlr-java folder of the companion repository. Once the file pom.xml is properly configured, this is how you build and execute the application.

As you can see, it isn't any different from any typical Maven project, although it's indeed more complicated than a typical Javascript or Python project. Of course, if you use an IDE you don't need to do anything different from your typical workflow.

24. The Main

We are going to see how to write a typical ANTLR application in Java.

At this point the main Java file should not come as a surprise; the only new development is the visitor. Of course, there are the obvious little differences in the names of the ANTLR classes and such. This time we are building a visitor, whose main advantage is the chance to control the flow of the program. While we are still dealing with text, we don't want to display it; we want to transform it from pseudo-BBCode to pseudo-Markdown.

25. Transforming Code with ANTLR

The first issue to deal with, in our translation from pseudo-BBCode to pseudo-Markdown, is a design decision. Our two languages are different, and frankly neither of the two original ones is that well designed.

BBCode was created as a safety precaution, to make it possible to disallow the use of HTML while giving users some of its power. Markdown was created to be an easy-to-read-and-write format that could be translated into HTML. They both mimic HTML, and you can actually use HTML in a Markdown document. Let's start to look into how messy a real conversion would be.

The first version of our visitor prints all the text and ignores all the tags.

You can see how to control the flow, either by calling visitChildren, or any other visit* function, and deciding what to return. We just need to override the methods that we want to change. Otherwise, the default implementation would just do what visitContent does on line 23: visit the children nodes and allow the visitor to continue. Just as for a listener, the argument is the proper context type. If you want to stop the visitor, just return null, as on line 15.

26. Joy and Pain of Transforming Code

Transforming code, even at a very simple level, comes with some complications. Let’s start easy with some basic visitor methods.

Before looking at the main method, let's look at the supporting ones. First, we have changed visitContent by making it return its text instead of printing it. Second, we have overridden visitElement so that it prints the text of its child, but only if it's a top element, and not inside a tag. In both cases, it achieves this by calling the proper visit* method. It knows which one to call because it checks whether it actually has a tag or content node.

VisitTag contains more code than any other method, because a tag can also contain other elements, including other tags that have to be managed themselves, and thus cannot simply be printed. We save the content of the ID on line 5; of course we don't need to check that the corresponding end tag matches, because the parser will ensure that, as long as the input is well formed.

The first complication starts at lines 14-15: as often happens when transforming a language into a different one, there isn't a perfect correspondence between the two. While BBCode tries to be a smarter and safer replacement for HTML, Markdown wants to accomplish the same objective as HTML: to create a structured document. So BBCode has an underline tag, while Markdown does not.

So we have to make a decision

Do we want to discard the information, or directly print HTML, or something else? We choose something else and convert the underline to an italic. That might seem completely arbitrary, and indeed there is an element of choice in this decision. But the conversion forces us to lose some information, and since both are used for emphasis, we choose the closest thing in the new language.

The following case, on lines 18-22, forces us to make another choice. We can't maintain the information about the author of the quote in a structured way, so we choose to print the information in a way that will make sense to a human reader.

On lines 28-34 we do our “magic”: we visit the children and gather their text, then we close with the endDelimiter. Finally we return the text that we have created.

That’s how the visitor works

  1. every top element visits each of its children
    • if it's a content node, it directly returns the text
    • if it's a tag, it sets up the correct delimiters, then visits its own children, repeating this step for each child, and returns the gathered text
  2. it prints the returned text
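The steps above can be sketched in plain Python (a toy tree stands in for the ANTLR contexts; the tag-to-delimiter mapping is illustrative):

```python
# Plain-Python sketch of the visitor steps: content returns its text;
# a tag wraps the gathered text of its children in delimiters.
DELIMITERS = {"b": "**", "i": "*", "u": "*"}   # underline becomes italic, as discussed

class Content:
    def __init__(self, text):
        self.text = text

class Tag:
    def __init__(self, name, children):
        self.name, self.children = name, children

def visit(node):
    if isinstance(node, Content):
        return node.text                       # content node: return the text
    delim = DELIMITERS[node.name]              # tag node: set up the delimiters
    inner = "".join(visit(c) for c in node.children)
    return delim + inner + delim               # gather children and wrap

tree = Tag("b", [Content("bold and "), Tag("u", [Content("underlined")])])
print(visit(tree))                             # -> "**bold and *underlined***"
```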

It's obviously a simple example, but it shows how much freedom you can have in managing the visitor once you have launched it. Together with the patterns that we have seen at the beginning of this section, you can see all of the options: return null to stop the visit, return the children to continue, or return something to perform an action ordered at a higher level of the tree.

27. Advanced Testing

The use of lexical modes permits the handling of island languages, but it complicates testing.

We are not going to show the error listener, because we did not change it; if you need it, you can see it in the repository.

You can run the tests by using the following command.

Now we are going to look at the test code. We are skipping the setup part, because that is also obvious: we just copy the process seen in the main file, and simply add our error listener to intercept the errors.

The first two methods are exactly as before: we simply check that there are no errors, or that there is the correct one when the input itself is erroneous. On lines 30-32 things start to get interesting: the issue is that by testing the rules one by one we don't give the parser the chance to switch automatically to the correct mode. So it always remains in the DEFAULT_MODE, which in our case makes everything look like TEXT. This obviously makes the correct parsing of an attribute impossible.

The same lines also show how you can check the current mode you are in, and the exact type of the tokens found by the parser, which we use to confirm that indeed everything is wrong in this case.

While we could use a string of text to trigger the correct mode each time, that would intertwine testing with several pieces of code, which is a no-no. So the solution is seen on line 39: we trigger the correct mode manually. Once you have done that, you can see that our attribute is recognized correctly.

28. Dealing with Expressions

So far we have written simple parser rules; now we are going to see one of the most challenging parts in analyzing a real (programming) language: expressions. While rules for statements are usually larger, they are quite simple to deal with: you just need to write a rule that encapsulates the structure with all the different optional parts. For instance, a for statement can include all other kinds of statements, but we can simply include them with something like statement*. An expression, instead, can be combined in many different ways.

An expression usually contains other expressions. For example, the typical binary expression is composed of an expression on the left, an operator in the middle and another expression on the right. This can lead to ambiguities. Think, for example, of the expression 5 + 3 * 2: for ANTLR this expression is ambiguous because there are two ways to parse it. It could parse it either as 5 + (3 * 2) or as (5 + 3) * 2.

Until this moment we have avoided the problem simply because markup constructs surround the object to which they are applied. So there is no ambiguity in choosing which one to apply first: it's the most external one. Imagine if this expression was written as: 

That would make it obvious to ANTLR how to parse it.

These types of rules are called left-recursive rules. You might say: just parse whatever comes first. The problem with that is semantic: the addition comes first, but we know that multiplication has precedence over addition. Traditionally, the way to solve this problem was to create a complex cascade of specific expressions like this:

This way ANTLR would have known to search first for a number, then for multiplications and finally for additions. This is cumbersome and also counterintuitive, because the last expression is the first to actually be recognized. Luckily ANTLR4 can create a similar structure automatically, so we can use a much more natural syntax.

In practice ANTLR considers the order in which we define the alternatives to decide the precedence. By writing the rule in this way we are telling ANTLR that multiplication takes precedence over addition.
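The natural left-recursive form could look like the following sketch (the rule and token names are assumptions, not the exact grammar from the project):

```antlr
// Alternatives listed first have higher precedence:
// multiplication is tried before addition.
expression : expression '*' expression
           | expression '+' expression
           | NUMBER
           ;
```

ANTLR4 rewrites this left-recursive rule internally into the equivalent cascade for us.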

29. Parsing Spreadsheets

Now we are prepared to create our last application, in C#. We are going to build the parser of an Excel-like application. In practice, we want to manage the expressions you write in the cells of a spreadsheet.

With all the knowledge you have acquired so far everything should be clear, except for possibly three things:

  1. why the parentheses are there,
  2. what’s the stuff on the right,
  3. that thing on line 6.

The parentheses rule comes first because its only role is to give the user a way to override operator precedence, if they need to do so. This graphical representation of the AST should make it clear.

Parentheses to change the operator precedence

The things on the right are labels: they are used to make ANTLR generate specific functions for the visitor or listener. So there will be a VisitFunctionExp, a VisitPowerExp, etc. This makes it possible to avoid using a giant visitor for the expression rule.

The expression relative to exponentiation is different because there are two possible ways to group two sequential expressions of the same type. The first one is to execute the one on the left first and then the one on the right; the second one is the inverse. This is called associativity. Usually the one that you want to use is left-associativity, which is the default option. Nonetheless, exponentiation is right-associative, so we have to signal this to ANTLR.
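A sketch of how such a rule could look in ANTLR4 (the labels and names here are illustrative assumptions, not the exact project grammar):

```antlr
expression : <assoc=right> expression '^' expression  # powerExp
           | NAME '(' expression ')'                  # functionExp
           | '(' expression ')'                       # parenthesisExp
           | NUMBER                                   # numericAtom
           ;
```

The `<assoc=right>` option marks exponentiation as right-associative, while the `#` labels make ANTLR generate per-alternative visitor methods such as VisitPowerExp.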

Another way to look at this is: if there are two expressions of the same type, which one has precedence, the left one or the right one? Again, an image is worth a thousand words.

Associativity of an expression

We also have support for functions, alphanumeric variables that represent cells, and real numbers.

30. The Spreadsheet Project in C#

You just need to follow the C# Setup: install a NuGet package for the runtime and an ANTLR4 extension for Visual Studio. The extension will automatically generate everything whenever you build your project: parser, listener and/or visitor.

Update. Notice that there are two small differences between the code for a project using the extension and one using the Java tool. These are noted in the README for the C# project at the repository.

After you have done that, you can also add grammar files just by using the usual menu Add -> New Item. Do exactly that to create a grammar called Spreadsheet.g4 and put in it the grammar we have just created. Now let’s see the main Program.cs.

There is little to say, apart from the fact that, of course, you have to pay attention to yet another slight variation in the naming of things: the casing. For instance, AntlrInputStream, in the C# program, was ANTLRInputStream in the Java program.

You can also notice that, this time, we output the result of our visitor on the screen, instead of writing it to a file.

31. Excel is Doomed

We are going to take a look at our visitor for the Spreadsheet project.

VisitNumeric and VisitIdAtom return the actual numbers that are represented either by the literal number or the variable. In a real scenario DataRepository would contain methods to access the data in the proper cell, but in our example it is just a Dictionary with some keys and numbers. The other methods all work in the same way: they visit/call the contained expression(s). The only difference is what they do with the results.

Some perform an operation on the result; the binary operations combine two results in the proper way; and finally VisitParenthesisExp just reports the result higher up the chain. Math is simple, when it's done by a computer.

32. Testing Everything

Up until now we have only tested the parser rules, that is to say we have tested only whether we have created the correct rule to parse our input. Now we are also going to test the visitor functions. This is the ideal chance because our visitor returns values that we can check individually. On other occasions, for instance if your visitor prints something to the screen, you may want to rewrite the visitor to write to a stream. Then, at testing time, you can easily capture the output.

We are not going to show SpreadsheetErrorListener.cs because it’s the same as the previous one we have already seen; if you need it you can see it on the repository.

To perform unit testing on Visual Studio you need to create a specific project inside the solution. You can choose different formats, we opt for the xUnit version. To run them there is an aptly named section “TEST” on the menu bar.

The first test function is similar to the ones we have already seen; it checks that the correct tokens are selected. On lines 11 and 13 you may be surprised to see that weird token type; this happens because we didn't explicitly create one for the '^' symbol, so one got automatically created for us. If you need to, you can see all the tokens by looking at the *.tokens file generated by ANTLR.

On line 25 we visit our test node and get the result, which we check on line 27. It's all very simple because our visitor is simple; while unit testing should always be easy and made up of small parts, it really can't be easier than this.

The only thing to pay attention to is the format of the number. It's not a problem here, but look at line 59, where we test the result of a whole expression. There we need to make sure that the correct format is selected, because different countries use different symbols as the decimal mark.

There are some things that depend on the cultural context

If your computer was already set to the American English culture this wouldn't be necessary, but to guarantee the correct testing results for everybody we have to specify it. Keep that in mind if you are testing things that are culture-dependent, such as grouping of digits, temperatures, etc.

On lines 44-46 you see that when we check for the wrong function the parser still works. That's because "logga" is indeed syntactically valid as a function name, but it's not semantically correct. The function "logga" doesn't exist, so our program doesn't know what to do with it. So when we visit it we get 0 as a result. As you recall this was our choice: we initialize the result to 0 and we don't have a default case in VisitFunctionExp, so if there is no matching function the result remains 0. A possible alternative could be to throw an exception.

Final Remarks

In this section we see tips and tricks that never came up in our example, but can be useful in your programs. We suggest more resources you may find useful if you want to know more about ANTLR, both the practice and the theory, or you need to deal with the most complex problems.

33. Tips and Tricks

Let’s see a few tricks that could be useful from time to time. These were never needed in our examples, but they have been quite useful in other scenarios.

Catchall Rule

The first one is the ANY lexer rule. This is simply a rule in the following format.
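A minimal sketch of such a rule (the name ANY is conventional, not mandatory):

```antlr
ANY : . ;
```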

This is a catchall rule that should be put at the end of your grammar. It matches any character that didn't find its place during the parsing. So creating this rule can help you during development, when your grammar still has many holes that could cause distracting error messages. It's even useful in production, where it acts as a canary in a coal mine: if it shows up in your program, you know that something is wrong.


Channels

There is also something that we haven't talked about: channels. Their typical use case is handling comments. You don't really want to check for comments inside every one of your statements or expressions, so you usually throw them away with -> skip. But there are some cases where you may want to preserve them, for instance if you are translating a program into another language. When this happens you use channels. There is already one called HIDDEN that you can use, but you can declare more of them at the top of your lexer grammar.
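A sketch of a lexer grammar using a custom channel (the grammar, channel and token names here are assumptions):

```antlr
lexer grammar CommentsLexer;

// Declare a custom channel in addition to the predefined HIDDEN one.
channels { COMMENTS }

// Comments are preserved on their own channel instead of being skipped.
LINE_COMMENT : '//' ~[\r\n]* -> channel(COMMENTS) ;
WS           : [ \t\r\n]+    -> skip ;
```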

Rule Element Labels

There is another use of labels other than to distinguish among different cases of the same rule. They can be used to give a specific name, usually but not always of semantic value, to a common rule or parts of a rule. The format is label=rule, to be used inside another rule.
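For instance, a sketch (the names left and right are illustrative):

```antlr
expression : left=expression '+' right=expression
           | NUMBER
           ;
```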

This way left and right would become fields in the ExpressionContext nodes. And instead of using context.expression(0), you could refer to the same entity using context.left.

Problematic Tokens

In many real languages some symbols are reused in different ways, some of which may lead to ambiguities. A common problematic example is the angle brackets, used both for bitshift expressions and to delimit parameterized types.

The natural way of defining the bitshift operator token is as a single token for the double angle brackets, '>>'. But this might lead to confusing a nested parameterized definition with the bitshift operator, for instance in the second example shown above. While a simple way of solving the problem would be using semantic predicates, an excessive number of them would slow down the parsing phase. The solution is to avoid defining the bitshift operator token and instead use the angle bracket token twice in the parser rule, so that the parser itself can choose the best candidate for every occasion.
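A sketch of the idea (rule names are assumptions): no '>>' lexer token is defined, and the parser rule matches two consecutive '>' tokens instead, leaving the same tokens free to close nested parameterized types.

```antlr
// No SHIFT_RIGHT ('>>') token in the lexer; just '>' used twice.
expression : expression '>' '>' expression
           | NUMBER
           ;
```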

34. Conclusions

We have learned a lot today:

  • what a lexer and a parser are
  • how to create lexer and parser rules
  • how to use ANTLR to generate parsers in Java, C#, Python and JavaScript
  • the fundamental kinds of problems you will encounter parsing and how to solve them
  • how to understand errors
  • how to test your parsers

That's all you need to know to use ANTLR on your own. And I mean that literally: you may want to know more, but now you have a solid basis to explore on your own.

Where to look if you need more information about ANTLR:

Also, the book is the only place where you can find answers to questions like these:

ANTLR v4 is the result of a minor detour (twenty-five years) I took in graduate
school. I guess I’m going to have to change my motto slightly.

Why program by hand in five days what you can spend twenty-five years of your
life automating?

We worked quite hard to build the largest tutorial on ANTLR: the mega-tutorial! A post over 13,000 words long, or more than 30 pages, to try answering all your questions about ANTLR. Missing something? Contact us and let us know, we are here to help.

The complete guide to (external) Domain Specific Languages

This guide will show you:

  • the what: after a definition we will look into 19 examples of DSLs
  • the why: what are the concrete benefits you can achieve using DSLs
  • the how: we will discuss the different ways to build a DSL and what the success factors are

After that you will get a list of resources to learn even more: books, websites, papers, screencasts.
This is the most complete resource on Domain Specific Languages out there. I hope you will enjoy it!

Just one thing: here we try to be practical and understandable. If you are looking for formal definitions and theoretical discussions this is not the right place.


What are Domain Specific Languages?

Domain Specific Languages are languages created to support a particular set of tasks, as they are performed in a specific domain.

You are probably familiar with the typical programming languages (a.k.a. General Purpose Languages or GPLs). They are tools good enough to create all sorts of programs, but not really specific to anything. They are like hammers: good enough for many tasks, if you have the patience and ability to adapt them, but in most cases you would be better off using a more specific tool. You can open a beer with a hammer, it is just way more difficult and risky, and leads to poorer results than using a specific tool like a bottle opener.

Languages are tools to solve problems and Domain Specific Languages are specific tools, good for solving a limited set of problems.

Ok, but what does it mean in practice? What do DSLs look like? Let's see plenty of examples.

19 Examples of Domain Specific Languages

Domain Specific Languages can serve all sorts of purposes. They can be used in different contexts and by different kinds of users. Some DSLs are intended to be used by programmers, and therefore are more technical, while others are intended to be used by someone who is not a programmer, and therefore use less geeky concepts and syntax.

Domain Specific Languages can be extremely specific, created only to be used within a company. I have built several DSLs of this kind myself, but I am not allowed to share them. I can instead list several examples of public DSLs which are used by millions of people.

1. DOT – A DSL to define graphs

DOT is a language that can describe graphs, either directed or undirected.
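For instance, a small directed graph could be described like this (the node names are just an illustration):

```dot
digraph example {
    A -> B;
    A -> C;
    B -> C;
}
```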

From this description images representing these graphs can be generated. To do that you use a program named Graphviz, which works with the DOT language. From the previous example you would get this:

An image generated using the DOT DSL

The language also permits defining the shape of the nodes, their colors and many other characteristics. But the basics are pretty simple and almost everyone can learn how to use it in a matter of minutes.

2. PlantUML – A DSL to draw UML diagrams

PlantUML can be used to define UML diagrams of different kinds. For example, we can define a sequence diagram.
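A sketch of a minimal sequence diagram (the participants and messages are illustrative):

```plantuml
@startuml
Alice -> Bob: Authentication Request
Bob --> Alice: Authentication Response
@enduml
```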

From this definition we can get a picture.

Image generated with the PlantUML DSL

With a similar syntax, different kinds of diagrams can be defined, like class diagrams or use case diagrams. Using a textual DSL to define UML diagrams has several advantages: it is easier to version, and it can be modified by everyone without special tools. A DSL like this one could be used during a meeting to support a discussion: as the participants discuss, different diagrams can be quickly defined and the corresponding images generated with a click.

3. Sed – A DSL to define text transformation

On UNIX-like operating systems (Linux, Mac, BSD, etc.) there is a set of command line tools, each accepting instructions in its own format. This format can be considered a DSL that permits specifying the tasks to be executed. For example, sed executes the text transformations indicated using its own DSL.

Do you want to replace the word “Jack” with the word “John”?

Or do you want to delete all the lines of a file from line 10 until the word “stophere” is found?
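Both tasks can be sketched like this; the file names and their content are assumptions made up for the illustration:

```shell
# Sample files (names and content are assumptions)
printf 'Dear Jack,\nRegards, Jack\n' > letter.txt
printf '1\n2\n3\n4\n5\n6\n7\n8\n9\n10\nstophere\nafter\n' > log.txt

# Replace the word "Jack" with the word "John":
sed 's/Jack/John/g' letter.txt

# Delete all lines from line 10 until the word "stophere" is found:
sed '10,/stophere/d' log.txt
```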

The technical level necessary to become proficient with this kind of DSL is high. However, many advanced computer users could learn the basics to execute common operations on files: small things that they may need to do every day and that they are currently doing manually. For example, you could have e-mail templates containing placeholders like "{FIRST_NAME}" and replace them with the proper text with one of these commands.

4. Gawk – A DSL to print and process text

Like sed, gawk is another UNIX utility accepting commands in its own language. For example you could print all the lines of a given file which are longer than 80 characters:

Or count the lines in a file:
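Both commands can be sketched as follows, using awk (on most Linux systems awk is gawk); the sample file is made up for the illustration:

```shell
# Sample file (name and content are assumptions)
printf 'a short line\n%-90s\n' 'a line padded to 90 characters' > sample.txt

# Print all the lines longer than 80 characters:
awk 'length($0) > 80' sample.txt

# Count the lines in the file:
awk 'END { print NR }' sample.txt
```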

The UNIX philosophy is to use several of these little utilities and combine them to perform the most amazing and complex tasks. For example, you could take an input file, transform it with sed and then print selected parts using gawk.

5. Gherkin – A DSL to define functional tests

Gherkin is a DSL for defining functional tests. It has a very flexible syntax that makes it look almost like free text. Basically, developers, analysts and clients can sit around a table and define some scenarios. These scenarios will then be executable as tests, to verify whether the application meets the expectations.
Here is how we could define the expectations for withdrawing from an ATM:
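A hypothetical sketch of such a scenario (the wording is an assumption, not taken from a real project):

```gherkin
Feature: Cash withdrawal

  Scenario: Successful withdrawal from an account in credit
    Given Jack has 100$ on his account
    When he asks to withdraw 20$
    Then 20$ are dispensed
    And his account has 80$ remaining
```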

I really like this DSL because the bar for using it is very low. This DSL, however, requires a developer to define some code using a GPL. How it works in practice is that a developer defines specific commands, like "{name} has {amount}$ on his account", and writes the code that executes each command in the GPL chosen for the project (Ruby, Java and others are supported). Once the developers have created these commands, specific to the application of interest, all users can use them while defining their functional tests. It is also possible to start the other way around: first you write your scenarios as you want, trying to capture the requirements, and only later developers map each command to a corresponding function in a GPL.

In other words, this DSL is great for hiding the real code behind a surface that everyone can understand and everyone can contribute to. It is much better to sit at a table and discuss with a bank representative using the example we have displayed than showing him the hundreds of lines of Java which correspond to those commands, right?

6. Website-spec – A DSL for functional web testing

Gherkin is not the only DSL used to define tests. Website-spec can be used to define functional tests specific for web applications.
Here we define how to navigate on a certain website and what we expect to find.

In this case there is no need for a developer to define the translation of commands to a GPL, because this language uses domain specific commands, like "Click", which the interpreter knows how to execute. With minimal training anyone can now describe a specific interaction with a website and the expected results. Pretty neat, eh?

7. SQL – databases

You have probably heard of SQL. It is a language used to define how to insert, modify or extract data from a relational database. Let’s get some stats from the STATS table:

For sure you do not expect the average Joe to be able to write complex queries: SQL is not a trivial language and it requires some time to be mastered. However you do not need to be trained as a developer to learn SQL. Indeed many DBAs are not developers. Maybe Joe should not be trusted with writing access to the database, but he could get read access and write simple queries to answer his own questions instead of having to ask someone and wait to get an answer. Suppose he needs to know the maximum temperature in august in Atlanta:
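As a sketch of how simple such a query can be, here it is reproduced with Python's built-in sqlite3 module; the STATS table layout, its column names and the data are all assumptions made up for this illustration:

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical STATS table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STATS (city TEXT, month TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO STATS VALUES (?, ?, ?)",
    [("Atlanta", "august", 33.0),
     ("Atlanta", "august", 35.5),
     ("Atlanta", "july", 31.0)],
)

# The kind of simple query Joe could learn to write:
row = conn.execute(
    "SELECT MAX(temperature) FROM STATS "
    "WHERE city = 'Atlanta' AND month = 'august'"
).fetchone()
print(row[0])  # 35.5
```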

Maybe Joe will never reach the level of a DBA, but he can learn a few basic queries and adapt them to his needs, making him more independent and letting his colleagues focus on their job instead of helping him out.

8. HTML – web layout

I really hope you have heard of this quite successful language to define documents. It is amazing to think that we could have defined HTML pages 20 years ago, when most people had desktop computers attached to monitors with a resolution of 640×480 pixels, and now those same pages can be rendered in the browser running on our smartphones. I guess it is a good example of what can be achieved with DSLs.

Note that HTML is really about defining documents: their structure and the information they contain. The same document is then rendered differently on a desktop computer, a tablet or a smartphone. The same document is consumed differently by people with disabilities. Specific browsers for people with impaired sight help them consume a document defined with HTML by reading the content aloud and supporting navigation to the different sections of the document.

9. CSS – style

The Cascading Style Sheet language defines the style to use to visualize a document. We can use it to define how an HTML document will appear on the screen or how it will appear when printed.

CSS is not trivial to master, but many people with basic or no knowledge of programming can use it to change the appearance of a web page. This DSL has played an important role in democratizing web design.

10. XML – data encoding

Some years ago XML seemed to be the solution to all problems in IT. Now the hype is long gone, but XML is here to stay. A solid DSL to represent data, and a quite flexible one.

While it is not the most readable or impressive language everyone is able to modify the data contained in an XML file.

11. UML – visual modeling

Not all DSLs have to be textual! Languages can also be graphical. For example, the Unified Modeling Language is a language with well defined rules. It can be used to define diagrams that are typically used to support discussions. Some people also use them to generate code or even to define an entire application (look for Model Driven Architecture, if you are interested in this kind of stuff).

UML: an example of DSL

UML is a vast language (someone said bloated?). There are many different kinds of diagrams under the UML umbrella. All of them share some commonalities.
Not everyone would agree that UML is a DSL. While it is definitely a language, some would say it is not domain specific, but generic instead. Domain specificities can be added by means of UML profiles. I consider it instead a language specific to modeling. Now, this is one case that demonstrates that there are no hard and fast rules to define what is a DSL and what is not, mostly because domain is a difficult term to define.

12. VHDL – hardware design

VHDL is a DSL used to define circuits. Once upon a time electronic engineers used to design complex systems by directly deciding which gates to use and how to wire them together. VHDL changed all of this, providing higher level concepts that those engineers can use to define their systems.

Example taken from Wikipedia.

There are tools able to process these definitions to derive actual circuit layouts, ready to be printed. Verilog is another DSL similar to VHDL.

13. ANTLR – lexer and parser definitions

ANTLR comes with its own DSL to define lexer and parser grammars. Those are instructions for recognizing the structure of a piece of text.
For example this is a small snippet of a lexer grammar:
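A sketch of what such a snippet could look like (the grammar and token names are made up):

```antlr
lexer grammar SimpleLexer;

ID     : [a-zA-Z]+ ;          // identifiers
NUMBER : [0-9]+ ;             // integer literals
WS     : [ \t\r\n]+ -> skip ; // throw away whitespace
```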

JavaCC, Lex, Yacc and Bison are similar tools, and each comes with its own slightly different DSL, inspired by the Backus-Naur form.

14. Make – build system

Make is a language to describe how to build something and the dependencies between different steps. For example, you can define how to generate an executable, specifying that to do so you will first need three object files. Then you can define, for each of those object files, how to obtain it from a corresponding source file.
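A hedged sketch of such a Makefile (the file names and the use of gcc are assumptions; note that recipe lines must be indented with tabs):

```makefile
CC = gcc
OBJECTS = main.o parser.o utils.o

myExecutable: $(OBJECTS)
	$(CC) -o myExecutable $(OBJECTS)

%.o: %.c
	$(CC) -c $< -o $@
```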

In this example we specify that to create the program myExecutable we will need the object files, and once we have them, we will use gcc to link them together.
We can also define some constants at the top of the file, so it is easy to change the Makefile later, if we need to.

15. LaTeX – document layout

LaTeX is used a lot, in academia and in the publishing industry, to produce gorgeous looking papers and books.

Once you have described a document in this format you typically generate a PDF.

It is quite nice because it can handle references to figures or tables, it can number them automatically, and it lets you control the layout of tables in very complex ways. If you are proficient with LaTeX you can get pretty nice results.

16. OCL – model constraints

OCL stands for Object Constraint Language and it can be used to define additional constraints on objects. Typically it is used together with UML.
For example, if you have a class Appointment with two properties, start and end, you may want to specify that the end of the appointment follows its start:
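A sketch of such an invariant (using the property names described above):

```ocl
context Appointment
inv: self.end > self.start
```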

You can also define preconditions and postconditions for your operations or invariants that apply to classes.
If you are interested in this sort of stuff you may also want to look into QVT, a set of languages to define model transformations.

17. XPath – XML nodes selection

XPath can be used to select nodes within XML documents. For example, suppose you have a document representing a list of restaurants and you want to get the last restaurant:
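As a sketch, here is how the XPath subset supported by Python's built-in ElementTree could perform that selection; the document content is made up for this illustration:

```python
import xml.etree.ElementTree as ET

# A made-up list of restaurants.
doc = ET.fromstring(
    "<restaurants>"
    "<restaurant>Trattoria da Gino</restaurant>"
    "<restaurant>Blue Moon Diner</restaurant>"
    "</restaurants>"
)

# The XPath expression restaurant[last()] selects the last restaurant.
last = doc.find("restaurant[last()]")
print(last.text)  # Blue Moon Diner
```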

XPath expressions are used in XSLT to define which elements to transform, or they can be used in combination with many libraries for all sorts of languages to define which elements to extract from a document. If you wonder what XSLT is, it is a language to define transformations of XML documents.

18. BPEL – Business processes

BPEL is a language to define the collaboration between web services to implement business processes. It used to be more popular when the world was going through its Service Oriented Architecture (SOA) phase.

The goal of this language is to permit software architects, or even analysts, to combine different web services, or other components, to obtain complex systems.

Example from Eclipse BPEL

There are different implementations of the language, each with its own extensions and mostly incompatible with one another.

19. Actulus Modeling Language – A DSL to calculate life insurance and pensions

This example is taken from the paper “An Actuarial Programming Language for Life Insurance and Pensions” by David Christiansen et al.
This DSL is listed in the Financial Domain-Specific Language Listing where you can find many more similar examples.

So what can we use DSLs for?

After looking at these examples we can derive that DSLs can be used for a variety of goals:

  • define commands to be executed: like sed or gawk
  • describe documents or some of their specific aspects: like HTML, LaTeX or CSS
  • define rules or processes: like BPMN or Actulus

These are just some typical usages, but DSLs can be used for so many other reasons.

It is also important to notice that DSLs focus on one specific aspect of a system, and often it makes sense to combine several of them to describe the different facets of a product. For example, HTML and CSS are used together to describe the content and the style of a document; XML and XPath are used to define data and how to traverse that data; OCL is used to define constraints on UML models.

You could think of designing not one DSL but a family of interrelated DSLs.

Ok, at this point you should have some understanding of what a DSL can look like and what it can be used for. Now let's see why you should use one and then how to build one.

DSLs vs GPLs: 5 Advantages of using Domain Specific Languages

You could be asking yourself the following question:

Why use a specific, limited language instead of a generic, powerful one?

The short answer is that Domain Specific Languages are limited in the things they can do, but because of their specialization they can do much more in their own limited domain.
The 5 Advantages of a DSL over a GPL: Easier to analyze, Safer, More meaningful errors, Easier to port and Easier to learn
Let’s be concrete and see five separate advantages:

  1. We can analyze them much better: while it is impossible in practice to guarantee that a program written in C or in Java will respect certain characteristics (like not ending up in an infinite loop), we can perform all sorts of analyses when we use DSLs. Precisely because they are limited in what they can do, they are easier to analyze.
  2. They are safer. Fewer things can possibly go wrong when using a DSL. When was the last time you had a Null Pointer Exception while working with HTML or SQL? Exactly, never. This is very important if we are doing something critical like dealing with someone's health or money.
  3. When there are errors, those errors are specific to the domain, so they are easier to understand. They are not about null pointers; they are about things that a domain expert can understand.
  4. Interpretation is also easier, so bringing them to a new platform is easy. The same applies to simulators. The same HTML documents we could open on a PDA in 2000 can now be opened on an iPad Pro.
  5. We can teach them more easily: they are limited in scope, so less time and less training are needed to master them, simply because there is less stuff to study.

Why Adopting Domain Specific Languages?

Ok, we have seen what Domain Specific Languages are and how they differ from GPL, now we should understand why we should consider adopting them. What are the real benefits?
Domain Specific Languages are great because:

  1. They let you communicate with domain experts. Do you write medical applications? Doctors do not understand when you talk about arrays or for-loops, but if you use a DSL that is about patients, temperature measures and blood pressure, they could understand it better than you do.
  2. They let you focus on the important concepts. They hide the implementation or the technical details and expose just the information that really matters.

They are great tools to support reasoning on specific domains, and all the other advantages derive from that. Let's look at them in detail.

Communication with domain experts

DSLs are about communicating with domain experts
In many contexts you need to build software together with domain experts who are not themselves developers.
For example:

  • You could build medical applications and need to communicate with doctors to understand the treatment a companion software should suggest
  • You could build marketing automation software. You would need the marketing people to explain to you how to identify clients matching a certain profile, to offer them a particular deal
  • You could build software for the automotive industry. You would need to communicate with the engineers to understand how to control the brakes
  • You could build software for accountants. You need to represent all the specific tax rules to apply in a given context and you would need an accountant to explain them to you

Now, the problem is that these domain experts do not have a background in software development, and the ways developers and domain experts communicate can be very different, because they speak different languages.

Developers talk about software, while domain experts talk about their domain.

By building a DSL we build a language to communicate between developers and domain experts. This is not too dissimilar to what is described in Domain Driven Design. The difference here is that we want to create a language understood by developers, domain experts and also by the software that will be able to execute the instructions specified in the DSL.

Now, the holy grail would be to create a language, give it to domain experts and have them go away and write their queries or logic alone. In practice DSLs usually do not achieve that, but they prove very useful anyway: a typical interaction consists of a domain expert describing what they want to a developer, who can immediately write down that description using the DSL. The domain expert can at this point read it and critique it.

Typically non-developers do not have the analytical skills to formalize a problem, but they can still read it and understand it, if it is written in a DSL using a lingo familiar to them. Also, these tools could use simulators or run queries on the fly, so that the domain expert can look not only at the code itself, but also at the result.

These kinds of interactions can in practice have a very short turnaround: code can be written during a meeting or within days, while typically when using a GPL the turnaround is measured at the very least in weeks, if not months or years.

By using DSLs you can very frequently:

  • have domain experts read or write high level tests, like requirements which are executable
  • get feedback at a very fast pace when doing co-development with domain experts

So the answer to the question:

Can domain experts (not programmers) write DSLs alone?

Is, of course, “it depends”. But DSLs definitely permit having domain experts involved in the development process, dramatically reducing feedback cycles and the risk of misalignment. You know how it goes: developers talk a couple of times with the domain experts, then they walk away and come back to show their solution. And the experts stare at the solution and declare it to be absolutely, completely, irremediably wrong. You can avoid that by using a DSL.

So it is important to understand these advantages and, at the same time, be realistic. You see, some decades ago there were enthusiasts suggesting that all sorts of people could write queries in SQL autonomously:

Forty years later it appears pretty clear that housewives are not going to write SQL.
The point is that many DSLs require formalizing processes in a way that demands significant analytical skills. Those skills are easily found in developers, but are not so common in people with a non-scientific background.

Focus and productivity

The fact that the DSLs abstract some technical details, to focus on the knowledge to capture, has important consequences.

On one hand, it makes the code written using DSLs an investment that maintains value over time. As technology changes you can change the interpreter processing the DSL code, but the DSL code itself can stay the same, and the logic can be ported to new technologies. An HTML page written 20 years ago can still be opened using devices that no one was able to imagine 20 years ago, while the browsers in the meantime have been completely rewritten multiple times.

I want to share a story about a company I have worked with. This company has created its own DSL to define logic for accounting and tax calculations. They started building this DSL 30 years ago and at that time they used to generate console applications. Yes, applications that run in consoles of 80×25 cells. I worked with them re-engineering the compiler, and the same DSL code is now used to generate reactive web applications. How did this happen? Because the DSL captured only the logic, the really valuable part of the programs, an asset extremely important for the company. The technical details were abstracted away in the compiler, so we just had to change the DSL compiler to preserve the value of the logic and make it usable in a more modern context.

Domain logic is what has value while technology changes over time
This teaches us that:

Domain logic is what has value and should be preserved, while technology changes over time

By using a DSL we can decouple domain logic and technology and make them evolve separately.
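
As a toy illustration of this decoupling, consider one technology-neutral model interpreted by two different backends. Everything here is hypothetical, sketched in Python rather than any real DSL toolkit:

```python
# The domain logic lives in a technology-neutral model; each "backend"
# is just a different interpreter of the same model. All names are invented.

# A rule from an imaginary tax DSL, already parsed into a model
model = {"rule": "vat", "rate": 0.22, "applies_to": "books"}

def to_console_app(rule):
    # Old backend: render the rule for an 80x25 console
    return f"VAT for {rule['applies_to']}: {rule['rate']:.0%}"

def to_web_app(rule):
    # New backend: render the very same rule as an HTML fragment
    return f"<p>VAT for {rule['applies_to']}: <b>{rule['rate']:.0%}</b></p>"

print(to_console_app(model))  # VAT for books: 22%
print(to_web_app(model))      # <p>VAT for books: <b>22%</b></p>
```

Swapping the backend does not touch the model: that is the sense in which the logic outlives the technology.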

Another advantage of hiding technical details is productivity. Think about the time spent deallocating memory or figuring out which implementation of a list would perform best for the case at hand. That time has a poor ROI. With a DSL you instead focus on the relevant parts of the problem and get it solved.

The typical (wrong) reasons against DSLs

There are a lot of developers out there thinking:

If my language is Turing complete I can do everything with it

Yes and no. You can tackle any problem, but with a GPL you cannot necessarily:

  • write the most concise and clear solution to the problem
  • write the solution quickly
  • write a solution that is understandable
  • provide errors which are understandable
  • show the solution and discuss it with domain experts
  • provide tool support which is meaningful for the problem at hand
  • reuse the solution: if you write it in C you cannot later use that solution in other contexts, while with a DSL you can build the solution once and then write several code generators or interpreters
  • get the best performance: a DSL can in reality be faster, because the generator can be specialized for a certain architecture

But people will have to learn another language

First of all, if a language is tailored for a specific domain, people who know that domain should find it much easier to learn the language, because it is about concepts they are familiar with. It is true that learning has a cost, but this cost can be reduced with good tool support: editors that provide auto-completion, proper error messages and quick fixes can reduce the learning time. The possibility of easily obtaining feedback, for example by simulating the results of the code just written, also helps. There is also the possibility of creating interactive tutorials.

But I would be locked in into the DSL!

I have some shocking news: you are already locked into whatever programming language you are using to express the logic of your systems. If those languages stop evolving you will need to sort your way out. How many companies are trapped by their COBOL, Visual Basic or Scheme codebases? The difference is that, if you are locked into a language you build and control, you can decide whether the language keeps evolving and whether the compiler can target a new environment (“let’s generate a web application instead of a console application!”). If you are locked into someone else’s language there is not much you can do about it.

A DSL will not be flexible enough

Designing a DSL requires clearly defining its scope. There will be things left out, things that you could occasionally want to do but that the DSL will not support. Why is that? Because a DSL has to be specific and limited: to support a set of tasks really well, others have to be left out. Striking the right balance is one of the greatest challenges you will face when designing a DSL, but if you get it right you should be able to cover all reasonable usages. It is often possible to design a DSL to be extensible, supporting functionalities written in a GPL in case they are really, really needed. But this should be seen more as a way to reassure users, until they realize they would very rarely need it.

Building a DSL takes a huge effort

This is just not true, if you use the right approach. In the following section we are going to see different ways to build languages and all the necessary supporting tools with a reasonable effort.

You can find a list of other perceived obstacles to DSL adoption here: Stumbling Blocks For Domain Specific Languages. Or you may want to read this paper I coauthored about the benefits and problems of adopting modeling in general (it mostly applies to DSLs as well).

How to create Domain Specific Languages

Wonderful, you have seen why Domain Specific Languages are so cool and what benefits they can bring you.
Now there is only one question:

How do we build DSLs?

Let’s see how by looking at:

  • what tools you can use to build DSLs
  • what are the most important success factors
  • which skills do you need

What tools can we use to build Domain Specific Languages?

There are different ways to build a DSL. The goal here is to build a language, with tool support, while keeping the effort reasonable. We are not building the next Java or C#, so we are not going to pour tens of man-years into building an extra-complex compiler or an IDE with tons of features. We are going to build a useful language, with good tool support, with an investment that can be sustained by a small company.

So I am going to show you the menu and you can make your choice. If you need help you can look at the comparison I prepared at the end of this list.

The approaches are divided in three groups, depending on the kind of DSL you want to build:

  • textual languages
  • graphical languages
  • projectional editors

You probably know what textual and graphical languages are, but you may not have encountered projectional editors before. They are pretty interesting, so you should probably take a look.

Textual languages

These are the most classical languages. Most practitioners will not even conceive of other kinds of languages. Admittedly we are all used to working with textual languages. They are easier to support and can be used in all sorts of contexts. However, to use them productively I think that a specific editor is mandatory. Let’s see how to build textual languages and supporting tools.

A pragmatic do-it-yourself approach

Roll your own solution: you could reuse a parser generator like ANTLR (or Lex & Yacc if you are an old-style guy) and write the rest of the processing yourself. This is doable, but it requires knowing what you are doing. If you don’t know where to start you can take a look at my book on building languages.

I am sure you want to read it, so I do not want to spoil it too much, but the path is more or less this:

  1. You define the lexer and parser grammars using ANTLR. You do not know ANTLR? No problem, here is a nice tutorial on ANTLR. On this blog there are many other articles about ANTLR.
  2. You transform the parse tree produced by ANTLR into a format easier to work with. This gives you the model of your code.
  3. You resolve references, build validation, and implement a typesystem as a set of operations on the model of your code. It sounds complex, but if you know what you are doing you can get it done in a few hundred lines of code.
  4. Finally, you either interpret or compile the model of your code.
  5. You build a simple editor for your language. For how to do that, look for tutorials on this blog or in the book. I tend to use a hackable editor I built myself, named Kanvas.
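
To make the steps concrete, here is a minimal sketch of the whole pipeline in Python. In a real project ANTLR would generate the parser from a grammar; here a tiny hand-written parser stands in, and the mini-language (`let`/`show`) is invented for the example:

```python
# Minimal sketch of the pipeline: parse -> model -> validate -> interpret.
# The mini-DSL and all names are invented for illustration.

def parse(source):
    """Steps 1-2: turn source text into a model (a list of dicts)."""
    model = []
    for line in source.strip().splitlines():
        op, *args = line.split()
        model.append({"op": op, "args": args})
    return model

def validate(model):
    """Step 3: validation, including a trivial 'resolve references' pass."""
    known = set()
    for node in model:
        if node["op"] == "let":
            known.add(node["args"][0])
        elif node["op"] == "show" and node["args"][0] not in known:
            raise ValueError(f"undefined variable: {node['args'][0]}")

def interpret(model):
    """Step 4: walk the model and execute it."""
    env, out = {}, []
    for node in model:
        if node["op"] == "let":          # let x 5
            name, value = node["args"]
            env[name] = int(value)
        elif node["op"] == "show":       # show x
            out.append(env[node["args"][0]])
    return out

code = "let x 5\nshow x"
model = parse(code)
validate(model)
print(interpret(model))  # [5]
```

Each stage works on the same model, which is what makes the pipeline easy to understand and to evolve.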

What I like about this approach is that you are in control of what is happening. You can change the whole system and evolve it, and it is simple enough that you can really understand it. Sure, with this approach you are not going to get a super complex IDE with tens of refactoring operations. Or at least you are not going to get those things for free.


Xtext

Xtext is a solid solution to build textual languages. In practice you define your grammar in a way similar to what you would do with ANTLR, but instead of getting just a parser you get a nice editor. This editor is by default an Eclipse plugin: you will be able to edit the files written in your DSL inside Eclipse. If you know how the Eclipse platform works you can create an RCP application, i.e., a stripped-down version of Eclipse basically supporting only your language and removing a bunch of stuff that would not be useful to your users.

So Xtext gives you an editor and a parser. This parser produces a model of your code using the Eclipse Modeling Framework (EMF), which basically means that you have to study that technology too. I remember the long days reading the EMF book as one of the most boring, mind-numbing experiences I have ever had. I also remember asking questions on the Eclipse forums and not getting any answer. I have opened bug reports and received answers three years later (I am not joking). So it was disheartening at first, but over time the community seems to have improved a lot. Right now the material available on the Xtext website is incomparably better than it used to be, and the book from Lorenzo Bettini helped a lot (more on it later).

The editors generated by Xtext can be deeply customized, if you know what you are doing. You can get away with minor changes with a reasonable effort, but if you want to do advanced stuff you need to learn the Eclipse internals and this is not a walk in the park.

Recently Xtext escaped the “Eclipse trap” by adding the possibility of generating editors also for IntelliJ IDEA and… the web! When I first found this out I was extremely excited. I, like many other developers, switched to IntelliJ some years ago and I was missing a way to easily build editors for IntelliJ IDEA. So this was great news, even if Xtext was created to work with Eclipse, so the support for IntelliJ IDEA is not as mature and battle-tested as the one for Eclipse. I have not yet tried the web editor, but from what I understand it generates a server-side application which is basically a headless Eclipse, and on the client side it generates three different editors based on three technologies (each with a different level of completeness). The one fully supported is Orion, an Eclipse project, while the other two are the well-known CodeMirror and ACE.

You may want to check out this list of projects implemented with Xtext to get an idea of what it is possible to achieve using Xtext.

Textual languages: other tools

If I had to build a textual language, in most cases I would go for one of the two approaches described earlier: either my do-it-yourself approach or Xtext. That said, there are alternatives worth keeping an eye on.

textX is a Python framework inspired by Xtext. You can define the grammar of your language with a syntax very, very close to the one used by Xtext. textX does not use EMF or generate code; it instead uses the metaprogramming power of Python to define classes in memory. While it seems nice and easy to use, textX does not generate editor support like Xtext does, and that is a major difference.
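
To give a feeling of the syntax, a grammar along the lines of textX’s introductory “hello world” example looks like this; it matches input such as `hello World, Sun`:

```
HelloWorldModel:
  'hello' to_greet+=Who[',']
;

Who:
  name=ID
;
```

From this grammar textX builds Python classes in memory and parses your files into instances of them.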

If you want to get a better feeling of how textX works take a look at this video.

There are other tools, like Spoofax. I have not used it myself so I cannot vouch for it. It is more an academic project than an industrial-grade language workbench, so I would suggest a bit of caution.
Spoofax can be used inside Eclipse. It is based on a set of DSLs used to create other DSLs.

If you want to look into Spoofax you may want to look at this free short book from Eelco Visser named Declare Your Language.

Graphical languages

Graphical languages seem approachable, and frequently domain experts feel more at ease with them than with textual languages and their geeky syntaxes. However, graphical languages require building specific editors and they are less flexible than textual languages. Also, they are less frequently used than textual languages, and the tools to build them tend to be less mature and more clunky. Here I present a list of a few options. If you want a more complete list you can look into this survey on graphical modeling editors.

GMF, the painful solution

There is one way to build graphical editors for your language that over time acquired quite a (not-exactly-positive) reputation. Its name is GMF: Graphical Modeling Framework.
You can use it to build editors which can be used inside Eclipse. Similarly to Xtext it is based on EMF: you use EMF to define the structure of your data, and then GMF lets you specify how the different elements are represented, how their connections are displayed, and that sort of stuff. Typically you then edit the details of each element in a separate panel, a sort of form.
You can see an example of a GMF editor in the picture below.

Now, the documentation is basically nonexistent, and making it work is a challenge that requires a great amount of patience and determination.

This framework has potential, and it is powerful and flexible, but working with it is far from an enjoyable experience.

Sirius, hiding GMF

Eclipse Sirius
There are tools built on top of GMF to make the experience less terrible for the language designer. A simple tool is Eclipse Eugenia, while a more complex one is Eclipse Sirius. Sirius reuses some pieces of GMF (GMF Notation, GMF Runtime) but it moves away from GMF’s code-generation approach and uses model introspection instead. Let me stress that this is what I have read about Sirius; I have not used it myself.

I have used Eclipse Eugenia and it helped jump-start my editor, but it is a limited tool, and if you want to customize your editor you are back to GMF.

I have not used Eclipse Sirius myself, but it seems to be decently supported by Obeo and to be used at Thales, so I would expect it to have at least reasonable maturity and usability.

MetaEdit+, the commercial solution

MetaEdit+ is a language workbench for defining graphical languages. Contrary to all the other tools we discussed, it is a commercial tool. Now, generally I prefer to base my languages on open-source solutions because I find them more reliable: I know that in the worst case I can always jump in and maintain the platform myself, if I really need to. With a commercial solution instead we have to consider what happens if the provider goes out of business. You can probably keep using the tool you bought for a while, until it is no longer compatible with the operating system you are using. If you are about to make a large investment you could also consider setting up a source code escrow, to get access to the code in the unfortunate circumstance that the provider shuts down. That said, MetaCase (the company behind MetaEdit+) is a solid company which has been in business for quite a few years.

I have attended presentations from Juha-Pekka Tolvanen on two occasions and I was positively impressed both times. They have a mature solution and they use it to build a bunch of interesting DSLs for their clients. I very much like checking their regular tweets on the DSL of the week. Here are a few samples:

So if you ever need a graphical language I would advise considering this solution. The alternative, in my opinion, is to use a full-blown projectional editor, which permits creating graphical languages too, but not only those. Curious? Keep reading.

Projectional editors

Projectional editors are extremely powerful and exciting, but they are unfamiliar to many users. I can give you the theory and throw a definition at you, but if you really want to understand them watch the video below.

A projectional editor is an editor that shows a projection of the content stored on file. The user interacts with that projection, and the editor translates those interactions into changes to the persisted model. When you use a text editor you see characters, you add or delete characters, and characters are actually saved on disk. In a projectional editor you could edit tables, diagrams, and even what looks like text, but those changes would be persisted in some format different from what you see on screen. Maybe in some XML format, maybe in a binary format. The point is that you can work with those files only inside their special editor. If you think about it, this is the case also for all graphical languages: you see nice pictures, you drag them around, you connect lines, and in the end the editor saves some obscure format, not the nice pictures you see on the screen. The point of projectional editors is that they are much more flexible than your typical graphical language: you can combine different notations and support all sorts of representations you need for your case.

Confused? That is expected. Watch the video below, watch many more videos, and things will become clear at some point.

You could also take a look at this explanation of projectional editing written by Martin Fowler.
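
To make the idea concrete, here is a toy sketch in Python (all names invented): the persisted artifact is a model, the user sees a projection rendered from it, and edits change the model rather than the projected text:

```python
import json

# What is actually stored on disk: a model, not characters
model = {"type": "add", "left": 2, "right": 3}

def project(node):
    """Render the model as the text-like projection the user sees."""
    return f"{node['left']} + {node['right']}"

def edit(node, field, value):
    """A user edit goes straight to the model, not to the projected text."""
    node[field] = value

print(project(model))     # 2 + 3
edit(model, "right", 7)
print(project(model))     # 2 + 7
print(json.dumps(model))  # the obscure format that actually gets persisted
```

The same model could just as easily be projected as a table or a diagram, which is exactly the flexibility projectional editors exploit.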

Jetbrains MPS

Jetbrains MPS is a tool that I have been using for some years, working with it daily for most of the last 12 months. It is an extremely powerful tool and the most mature projectional editor available out there. That is no accident: Jetbrains has invested significantly in developing it over more than a decade.

Do you want to see what it looks like? Watch the video.

I find MPS very useful to build families of interoperable languages with advanced tooling. Imagine several DSLs to describe the logic of your problems, to define tests, to define documentation. Imagine all sorts of simulators, debuggers, tools to analyze code coverage. Everything built on one platform.

Now, this means you need to be ready to embrace Jetbrains MPS and to invest a significant amount of time to properly learn it (or hire someone like me). But if you are ready to make the investment, it can simply revolutionize your processes.

Intentional Platform

Intentional Platform
This one is the mysterious and intriguing one.

Charles Simonyi is the man who designed Excel, one of the first space tourists and one of the richest men on earth. In 1995 he wrote the revolutionary paper The Death of Computer Languages, The Birth of Intentional Programming. In this paper he presents a new, more intuitive way of programming, fundamentally based on projectional editors. He started working on these ideas at Microsoft, but in 2002 he left Microsoft to cofound Intentional Software. Since then, the publicly available information on their tool, the Intentional Platform, has been extremely scarce. They have published a few papers and given a few presentations.

I have heard from people that have used it that it has a lot of potential, but it is not there yet. For sure I would love to get my hands on it. The closest you can get is to read this somewhat old review of the Intentional Platform from Martin Fowler. Maybe one day even we mortals will have the possibility to know more about this legendary tool.

As far as I know they work with selected companies but for now their tool is not publicly available.

Whole Platform

Whole Platform
This is a good language workbench, too frequently overlooked. While it has been used in production for many years in a large company in Italy, it is lacking a little bit on the documentation and marketing side. Yes, I know: it is sad that pure engineering awesomeness is not enough to gain popularity. Anyway, if you want to know more about it you can read my post on getting started with the Whole Platform.

There are a few concepts in it that I find quite interesting. I am not an expert on the Whole Platform: I have just played with it and talked with its authors.

One aspect that I find really interesting, and different with respect to the other language workbenches, is that the Whole Platform supports quite well working with existing formats and evolving from existing processes to more advanced language-engineering approaches. For example, it is quite easy to define grammars for existing formats in order to parse them.

The following is an example of a grammar to parse JSON but the same approach has been used to parse very complex formats used in the banking domain.

Whole Platform – Example of grammar

Another idea I love about the Whole Platform is the Pattern Language: the possibility of taking a model (in this sense a piece of Java code is a model) and defining variability points that can be filled with values from another model.

Whole Platform – Pattern Language
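
As a rough analogy (not the Whole Platform’s actual API), variability points behave like holes in a template model, filled with values taken from another model. In Python:

```python
from string import Template

# A "model" with variability points, here a Java-ish snippet with holes
pattern = Template("class $name { $type $field; }")

# Another model supplying the values for the holes
values = {"name": "Invoice", "type": "double", "field": "total"}

print(pattern.substitute(values))  # class Invoice { double total; }
```

The Pattern Language works on real models rather than flat strings, but the underlying idea of filling holes from another model is the same.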

Also, the Whole Platform is quite rich, and it supports graphical languages too.

Whole Platform – Graphical Language

Riccardo Solmi and Enrico Persiani are the minds behind the Whole Platform, and you should probably talk to them if you are interested in using this language workbench.
The images of the Whole Platform I have used are released under the CC Attribution 2.0 license.

Comparing the different approaches

Approach              | Type         | When to use it
Do-it-yourself        | Textual      | You need to be in absolute control of the platforms supported and you do not want any vendor lock-in
Xtext                 | Textual      | You want a textual language with good editors and you want it fast
textX                 | Textual      | You love the flexibility of dynamic languages, while editor support is not important to you
Spoofax               | Textual      | You like to work with a sound theoretical approach and you are not afraid of a few bumps in the road
GMF                   | Graphical    | You need extreme flexibility to build your very own graphical editor
Eclipse Sirius        | Graphical    | You are ready to trade some flexibility to get things done quicker, saving some mental sanity
MetaEdit+             | Graphical    | You are fine using commercial software for a strategic component and you want results quickly
Jetbrains MPS         | Projectional | You want to build families of languages with powerful and complex tooling like simulators, debuggers, testing support and more
Intentional Workbench | Projectional | You have some connection that gives you access to the most mysterious and hyped language workbench
Whole Platform        | Projectional | You want to support existing formats and transition smoothly to a new approach

One thing I would like to stress is that projectional editing is a superset of graphical editing, so you can define graphical languages using Jetbrains MPS. Given that there is no clear, great alternative for building only graphical DSLs, I would use Jetbrains MPS in that case. Alternatively, I would consider building the tooling myself, targeting the web. Another interesting option could be looking into something like FXDiagram.

Still not enough?

If you want to keep an eye on the new language workbenches that come up, I suggest you follow the “Language Workbench Challenge”, a workshop that is typically co-located with the Software Language Engineering conference. I co-authored a paper and attended the last edition, and it was awesome.

What do I need to make my DSL succeed?

There are just two things that will seem obvious but are not:

  1. You need your users to use it
  2. You need your users to get benefits from using it

To achieve this you will need to build the right tool support and develop the right skills. We are going to discuss all of this in this section.

Get users to use it

The first point means that you need to win the support of users. During my PhD I conducted a survey on the reasons why DSLs are not adopted. There are several causes, but one important factor is resistance from users, especially when they are developers.

If you target a DSL at developers, they could resist because they feel they are not in control as they are when using a General Purpose Language (GPL). They could also fear that a DSL lowers the bar, being simpler to use than, let’s say, Java. In addition, like all innovations, a new DSL is threatening to seasoned developers because it reduces the importance of some of their skills, like their vast experience in dealing with the quirks of the GPL currently used at your company.

If your DSL is intended for non-developers, it is generally easier to win their support. Why? Because you are giving them a superpower: the ability to do something on their own. They could use a DSL to automate a procedure that previously was done manually. Maybe before the DSL was available, the only way for them to get something done was to bother some developer to write custom code. Now they get a DSL, which means more power and independence. Still, they could resist adopting it if they perceive it as too difficult or feel it does not match their way of thinking.

To me the key as a DSL designer, in this case, is being humble and listening. Listen to the developers, work on capturing their experience and embedding it in the design of the DSL or the tooling around it. Involve them in the design of the DSL. When talking with your users, technical or not, communicate that the DSL will be a tool for them, designed to support them and derived from their understanding of the domain at hand. When designing DSLs the cowboy approach does not work: you succeed as a team or you do not succeed at all.

Give benefits to users

DSL Benefits

If you win the support of users and people start using the DSL, you succeed only if they get a significant advantage from using it.

We have discussed the importance of a DSL as a communication tool, as a medium to support co-design. This is vital.

In addition to this, you can significantly increase the productivity of your users by building first-class tool support.

A few examples:

  • a great editor with syntax highlighting and auto-completion, so that learning the language and using it feels like a breeze
  • great error messages: a DSL is a high-level language and errors can be very meaningful for users
  • simulators: nothing helps users like the possibility of interacting with a simulator and seeing the results of what they are writing
  • static analysis: in some contexts, the possibility to analyze the code and reassure against possible mistakes is a big win

These are a few ideas, but more can be adopted depending on the specific case: specific tool support for specific languages.

Tool support: why we do not care about internal Domain Specific Languages

There is one factor that is frequently overlooked: tool support.

Many practitioners think that the only relevant thing is the syntax of your language, or what it can express, with everything else being a detail.

This is just fundamentally wrong, because tool support can multiply the productivity of using any language, especially a DSL. It is a crucial aspect to consider, because the language should be designed with tool support in mind.

When building Domain Specific Languages, if you want to get serious, you have to build good tool support. Tool support is essential in delivering value.

I recorded this short video on this very subject.

Building a language: tool support from Federico Tomassetti on Vimeo.

Tool support is the reason why internal domain specific languages (i.e., fake DSLs) are irrelevant: they do not have any significant tool support.

When using some host languages you can bend them enough to get some sort of feeling of having your own little language, but that is it. Some languages, like Ruby, are flexible enough to give a sort of decent result. With other languages you get very poor results. I feel pity for the people trying to build “internal DSLs” in languages like Scala or Java. The worst of all are the Lispers. I understand their philosophy: “if you want to solve a problem in Lisp, first you create your Lisp dialect and then solve the problem using it”. I understand it and I think it is a great technique. Just let’s not pretend this is a real DSL. It is not something you can share and work on closely with domain experts. It can be your trick to be more productive, but that is it.

You see those ridiculously long chains of method calls, and you hear someone presenting them as a Domain Specific Language. That makes me feel a mix of two emotions:

  • pity for them and their users
  • rage for the confusion it creates. Real DSLs are very different and they can bring real benefits. Stop mixing them up with this… thing

Just build a real DSL, that is, an external DSL!

What skills are needed to write a DSL?

Typically you need high abstraction skills, the same you typically need for metaprogramming. If writing a library is 3 times harder than writing a program, writing a framework or a DSL is typically 3 times harder than writing a library.

You need to be humble: you may be a developer, but typically you are creating this DSL for other professionals who are going to use it for their job. Listen to what they do, understand how they work. What could seem wrong or naive to you could have reasons you do not yet understand. Acting as the almighty expert is the single best way to create a failed DSL, not useful in practice and therefore not used.

Aside from this, practice and learn. Keep doing it for a few years and that should do the trick.


Now that you have seen what DSLs can bring you, and you have an idea of how to build them, you should be happy.

But you are not. You want more, you want to understand DSLs better, you want to learn everything about them.

Well, I do not know about everything, but definitely I can give you some pointers. I can tell you where to find the good stuff.

Let’s see what we can find:


DSL Engineering

I would start by suggesting DSL Engineering by Markus Völter. The PDF version of the book is donation-ware, so just read it and donate. Alternatively, you can find the printed version on Amazon.

The book starts with an introduction: it is very useful to set your terminology straight. After that comes the DSL design part, focusing on different aspects separately. If you do not have direct access to an expert to teach you how to design DSLs, reading this part of the book is the best alternative I can recommend (together with as much practice as you can get, of course).

Then comes the part about implementation: remember that Markus has a PhD, but he is first of all someone who gets things done, so this part is very well written, with examples based on Xtext, Spoofax and MPS.

Part IV is about scenarios in which DSLs are useful. Given this is based on his large experience in this field there are a lot of interesting comments.

I had the occasion to work with Markus. I used to admire him a lot before meeting him, and now I admire him even more. He is simply the best in this field, so if you can learn something from him, do it. Read his books, watch his presentations, follow his projects. It will be a good investment of your time. Lately he has been working on JetBrains MPS, so you should follow what is going on with mbeddr and IETS3. Mbeddr is both a set of plugins for MPS and an implementation of the C language in MPS, with special domain-specific extensions to support the development of embedded software. IETS3 is instead an expression language built in MPS.

Martin Fowler is a very famous thought leader and bestseller author. I really admire his clarity. He is the author of Domain Specific Languages, a book about both internal and external DSLs.

I find the mental models presented in the book quite useful and elegant. In practice, however, I find internal DSLs irrelevant, so I am interested in only some portions of this book.

There are 15 chapters dedicated specifically to external domain specific languages. While those chapters are organized around implementation techniques there are comments and remarks from which you can learn some design principles.

I think the sections on alternative computational models and on code generation are very valuable. You will have a hard time finding an exploration of these topics at this level of detail anywhere else.

The book is 7 years old and the techniques may have evolved since the book was written, but the vast majority of the considerations presented in the book are still valid. And of course they are thoughtful and well explained, as you would expect from Martin Fowler.
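To give a taste of the model-to-code generation discussed above, here is a minimal hedged sketch in Python (the model shape and the `generate_class` helper are invented for illustration, not taken from Fowler's examples): a tiny in-memory semantic model of an entity, such as an external DSL's parser might produce, is turned into source code through a simple template walk.

```python
# A tiny semantic model: an entity with typed fields, as an external
# DSL's parser might produce it.
model = {"name": "Customer", "fields": [("name", "str"), ("age", "int")]}

def generate_class(model):
    """Template-based code generation: walk the model, emit source text."""
    lines = [f"class {model['name']}:"]
    params = ", ".join(f"{f}: {t}" for f, t in model["fields"])
    lines.append(f"    def __init__(self, {params}):")
    for field, _ in model["fields"]:
        lines.append(f"        self.{field} = {field}")
    return "\n".join(lines)

print(generate_class(model))
```

Real generators separate templates from the walking logic (think template engines), but the principle is the same: the model carries the meaning, the generator only decides how that meaning looks in the target language.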

If you are interested in textual languages, and in particular in ANTLR, you should definitely look into Language Implementation Patterns by Terence Parr. I like the author, I like the publisher (the Pragmatic Bookshelf) and, unsurprisingly, I love the book.

The book starts by discussing different parsing algorithms. If you like to learn how stuff works, you should take a look at these chapters. Then there are chapters about working with the Abstract Syntax Tree: extracting information from it and transforming it. This is the kind of stuff you need to learn if you want to become a Language Engineer. There are also chapters on resolving references, building symbol tables and implementing a typesystem. These are the foundations for learning how to process the information expressed in your DSL.

Finally, Terence explains how to use the information you have processed by building an interpreter or a code generator. At this point your journey ends, having seen how to build a useful language from start to finish. This book will give you a solid basis for learning how to implement DSLs. The only thing missing is a discussion of how to design DSLs, but that is not the goal of this book.
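To give a flavor of the pipeline such a book walks through (AST, symbol table, interpreter), here is a minimal hedged sketch in Python. The AST shape and node names are invented for illustration, and the parsing step is skipped: a real front end (for instance one generated by ANTLR) would build these nodes from a parse tree.

```python
# Hand-built AST for: let x = 2 + 3; print x * 4
# Each node is a (kind, *children) tuple; a real front end would build
# these from a parse tree produced by a parser generator such as ANTLR.
program = [
    ("let", "x", ("add", ("num", 2), ("num", 3))),
    ("print", ("mul", ("var", "x"), ("num", 4))),
]

def eval_expr(node, symbols):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return symbols[node[1]]          # symbol table lookup
    left, right = (eval_expr(c, symbols) for c in node[1:])
    return left + right if kind == "add" else left * right

def interpret(program):
    symbols = {}                          # a single global scope
    for stmt in program:
        if stmt[0] == "let":
            symbols[stmt[1]] = eval_expr(stmt[2], symbols)
        elif stmt[0] == "print":
            print(eval_expr(stmt[1], symbols))
    return symbols

interpret(program)  # prints 20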


The MPS Language Workbench - Volume I & Volume II
If you are looking into MPS, there are not many resources around. It could make sense to buy the two books by Fabien Campagne on the MPS Language Workbench. They explain in detail all the many features of MPS (admittedly, some are a bit obscure). If I had to find one thing missing, it would be more advice on language design. These books are very good references for learning how MPS works, but there is not much guidance on how to combine its features to get your results. One reason for that is that MPS is an extremely powerful tool, which can be used in very different ways, so it is not easy to give general directions.

Volume I explains separately the different aspects of a language: how to define the structure (the metamodel), the editors, the behavior, the constraints, the typesystem rules and so on. Most of the chapters are in reference-manual style (e.g. the chapters The Structure Aspect and Structure In Practice). Everything you need to learn to get started and build real languages with MPS is explained in Volume I.

Volume II is mostly about the advanced stuff that you can safely ignore at the beginning. I suggest looking into it only when you feel comfortable with all the topics explained in Volume I; if you have never used MPS before, that will take some time. Volume II explains how to use the build framework to define complex building configurations, and it gives you an overview of all the different kinds of testing you may want to use for your languages. It also shows you how to define custom aspects, or custom persistence, for your language.

I have bought the Google Play version but they are available also in print form.
Implementing Domain Specific Languages with Xtext and Xtend - Second Edition

On Xtext there is a quite practical and enjoyable book by Lorenzo Bettini: Implementing Domain-Specific Languages with Xtext and Xtend. I wrote a review of the second edition of this book: if you want to read my long-form opinion of it, you can visit the link.

If you want to learn how to write textual languages with good tool support, you could start by following a couple of tutorials on Xtext and then jump to this book. It will explain everything you need to know to use Xtext to build rather complex editors for your language. The only caveat is that Xtext is part of a complex ecosystem, so if you really want to become an Xtext expert you need to learn EMF and Xtend too. The book does a good job of teaching you what you need to get started on these subjects, but you may have to complete your education with other resources when you want to progress.

What I like about this book is that it is not a reference manual: it contains indications and opinions on topics like scoping or building typesystem rules (the author has significant experience on this specific topic). Also, the author is interested in best practices, so you will read his take on testing and continuous integration. This is the kind of stuff you should not ignore if you are serious about language engineering.

If you are interested in DSLs in general, you can take a look at DSLs in Action. The book is interesting, but I have two comments:

  1. It focuses way too much on internal DSLs, which are, as we all know, not the real thing
  2. They have misspelled my name. -1 point for that

Specifically on external DSLs there is not much: the author briefly discusses Xtext and then spends a chapter on using Scala parser combinators to build external DSLs. That would not have been my first choice. So if you are interested in learning how to implement an external DSL, do not pick this book. However, if you want a gentle introduction to the topic of DSLs, if you are a degenerate who prefers internal DSLs over external DSLs, or if you want to read every available resource on DSLs, this book would be a good choice.

Domain Driven Design

Domain Driven Design is a relevant and important book to read, because you need the skills to understand a domain in order to represent it in your language, and to design your language so that it can capture it. Now, I should probably just praise this book and stress how much I enjoyed it. Unfortunately, I tend to err on the side of honesty, so I warn you: this is one of the most boring books I have ever read. It is important, it is useful, it is great, but it is just so plain and long. It stayed on my nightstand for months.

What you should get out of this book is the importance of capturing the domain in all of your software artifacts. The book stresses the importance of building a common language to be shared among the stakeholders. This is completely and absolutely relevant if you want to build Domain Specific Languages. What is not part of this book is how to map a domain model to a language. For that part you should refer to the other books, the ones specific to DSL design. This book is a good complement to any of them.

Websites and papers


I design and implement Domain Specific Languages for a living, but instead of telling you how great I am, I will list other companies that you could work with.

The first, obvious name is Itemis. They are a German company with small offices in France and Switzerland. They employ some of the best in the field: I have met and worked with Markus Völter and Bernd Kolb, and I am seriously impressed by the level at which they work. They have worked with so many companies and done so many projects that the full list would be scarily long. I will just say that in recent years they have done amazing work using JetBrains MPS. They created the mbeddr project and from it derived a set of utilities, named the mbeddr platform, which contributed enormously to the growth of MPS. They have contributed early and significantly to the Language Workbench community, so if you need help with a DSL you should seriously consider working with them.

TypeFox is another German company. Several Xtext core committers, including the project lead, are involved in the company, so as you can imagine they have some serious competence in Xtext and the EMF platform in general. If you need Xtext training or consulting, I would consider them the top choice. They are also a Solution Member of Eclipse, and I would expect them to be able to build complex Eclipse-based solutions. If you want to hear more, you can read this interview with Jan Köhnlein, one of the founders of the company, which we did a few months after the company was created.

JetBrains is obviously the company behind JetBrains MPS. Recently they started offering training on MPS; I asked about this during my interview with Vaclav Pech. They offer training, either basic or advanced, on their premises in Prague or on-site.

At the moment I do not think they offer MPS consulting (for that you can talk to me or to Itemis).


There are many reasons why you should seriously consider Domain Specific Languages. I have seen companies benefit enormously from DSLs. Many of the people I have worked with used DSLs as a key differentiator that helped them increase productivity by 10-20 times, reduce time-to-market and feedback cycles, increase the longevity of their business logic, and much more.

Aside from the practical benefits I find the topic ext