Tutorials and issues on all aspects of creating software to analyse code

Create a simple parser in C# with Sprache

You can find the code for this article on GitHub.

Everybody loves ANTLR, but sometimes it may be overkill. On the other hand, a regular expression may not be powerful enough, or it may become too complicated to maintain. What can a developer do in such cases? Use Sprache. As its creators say:

Sprache is a simple, lightweight library for constructing parsers directly in C# code.

It doesn’t compete with “industrial strength” language workbenches – it fits somewhere in between regular expressions and a full-featured toolset like ANTLR.

It is a simple but effective tool, whose main limitation is being character-based. In other words, it works on characters and not on tokens. The advantage is that you can work directly with code and you don’t have to use external tools to generate the parser.

The guessing game

You can see the project website if you want to see specific real-world uses; let’s just say that it’s even credited by ReSharper and that it was created more than six years ago, so it’s stable and quite good. It’s ideal for dealing with error messages produced by other tools, reading logs, parsing queries like the ones you would use for a simple search library, or reading simple formats like JSON. In this article we will create a parser for a simple guessing game. We will use .NET Core and xUnit for the unit tests, so it will also work on Linux and Mac.

The objective of the game is to guess a number, and to do that you can ask if the number is greater than a number, less than a number or between two numbers. When you are ready to guess you simply ask if it’s equal to a certain number.

Setup the project

We will use VSCode instead of Visual Studio, but in the GitHub project you will find two projects, one for each. This is because there are still some compatibility quirks related to project.json and the different versions of the .NET Core tools used by Visual Studio and the standalone command line. To clarify, the project.json generated by the standalone .NET Core command line will also work with Visual Studio, but not vice versa (this might have changed by the time you read this). Also, with two projects you can easily see how Visual Studio integrates xUnit tests. The C# code itself is the same.

Create the file global.json in the directory of your project, in our case SpracheGame, then create another SpracheGame folder inside src and a SpracheGame.Tests folder inside test. Inside the nested SpracheGame folder you can create a new .NET Core program with the usual:

While you are inside the SpracheGame.Tests folder you can create an xUnit test project with:

You can see the final structure here.

SpracheGame folder structure

Change both project.json files, adding Sprache as a dependency to the main project:

…and add the main project as a dependency for the xUnit test project.

If you are using Visual Studio you may need to add a runtimes section to both of your project.json files:

See the .NET documentation for .NET Core Runtime IDentifier (RID) catalog if you need to know other platform IDs.

Create GameParser

Let’s start by creating a class called GameParser and by recognizing numbers and commands.

On line 3 there is the code to parse a number: we start with Sprache.Parse followed by a digit, of which there must be at least one, then we convert from IEnumerable<char> to string with Text(), and finally we discard whitespace with Token(). So first we choose the type of character we need, in this case Digit, then we set a quantity modifier and transform the result into something more manageable. Notice that we return Parser<string> and not an int.

On lines 5-6 we tell the parser to find a ‘<’ character followed by a ‘>’, using Then(). We return an enum instead of a simple string. We can easily check for different options with Or(), but it’s important to remember that, just as with ANTLR, the order matters. We have to put the more specific case first, otherwise the parser would match the generic one instead of reaching the correct case.

Now we have to combine these two simple parsers into one Play, and thanks to the LINQ-like syntax the task is very simple. Most commands require only a number, but one requires two, because we have to check whether the number to guess is between two given numbers. It also has a different structure: first there is a number, then the symbol, and finally the second number. This is a more natural syntax for the user than a ‘<>’ symbol followed by two numbers. As you can see, the code is quite simple: we gather the elements with from .. in .. and then we create a new object with select.

It’s time for Play

The only interesting part of the Play class is on lines 27-51, the Evaluate function, where the “magic” happens, and I use the term magic extremely loosely. The number to guess is provided to the function, which then checks it against the command and the numbers of the specific play that we are evaluating.
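To make the two steps concrete without Sprache, here is a rough sketch in plain Java of the same idea: a character-based parser for a play, followed by its evaluation. The names Command, Play and PlayParser are stand-ins chosen for this illustration, the symbols ‘<’, ‘>’, ‘=’ and ‘<>’ are assumed from the description of the game, and treating "between" as exclusive on both ends is an assumption too.

```java
// Hypothetical stand-ins for the article's Command enum and Play class.
enum Command { BETWEEN, LESS, GREATER, EQUAL }

final class Play {
    final Command command;
    final int first;
    final int second; // only meaningful for BETWEEN

    Play(Command command, int first, int second) {
        this.command = command;
        this.first = first;
        this.second = second;
    }

    // Checks the number to guess against this play.
    // BETWEEN is treated as exclusive on both ends (an assumption).
    boolean evaluate(int secret) {
        switch (command) {
            case LESS: return secret < first;
            case GREATER: return secret > first;
            case EQUAL: return secret == first;
            case BETWEEN: return first < secret && secret < second;
            default: throw new IllegalStateException("unknown command " + command);
        }
    }
}

final class PlayParser {
    private final String input;
    private int pos = 0;

    private PlayParser(String input) { this.input = input; }

    static Play parse(String text) {
        PlayParser p = new PlayParser(text);
        Play play = p.play();
        p.skipSpaces();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return play;
    }

    // Either "number <> number" or "command number".
    private Play play() {
        skipSpaces();
        if (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            int low = number();
            if (command() != Command.BETWEEN) {
                throw new IllegalArgumentException("expected <> between two numbers");
            }
            return new Play(Command.BETWEEN, low, number());
        }
        Command c = command();
        if (c == Command.BETWEEN) {
            throw new IllegalArgumentException("<> needs a number on each side");
        }
        return new Play(c, number(), 0);
    }

    // Order matters: "<>" must be tried before "<", or it would never be reached.
    private Command command() {
        skipSpaces();
        if (input.startsWith("<>", pos)) { pos += 2; return Command.BETWEEN; }
        if (input.startsWith("<", pos)) { pos += 1; return Command.LESS; }
        if (input.startsWith(">", pos)) { pos += 1; return Command.GREATER; }
        if (input.startsWith("=", pos)) { pos += 1; return Command.EQUAL; }
        throw new IllegalArgumentException("expected a command at position " + pos);
    }

    // At least one digit, like Digit.AtLeastOnce() in the Sprache version.
    private int number() {
        skipSpaces();
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    private void skipSpaces() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
    }
}
```

With this sketch, PlayParser.parse("3 <> 8").evaluate(5) is true, while PlayParser.parse("<> 5") is rejected with an exception.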

Unit Tests are easy

There are basically no disadvantages in using xUnit for our unit tests: it’s compatible with many platforms, it’s integrated with the Visual Studio Test Explorer and it also has a special feature: theories. A theory is a special kind of test that allows you to supply multiple inputs to one test. Lines 3-6 show exactly how you can do it. In our case we are testing that our parser can parse numbers with many digits.

The following test is a typical one: we are checking that the symbol ‘>’ is correctly parsed as a Command.Greater. On line 27 we make sure that an exception is raised if we encounter an incorrect Play. Sprache also lets you use TryParse instead of Parse if you don’t want to throw an exception. As you can see, the simplicity of the tool makes it very easy to test.

Let’s put everything together

The main function doesn’t contain anything shocking: on lines 27-28 we parse the input and execute the proper command, then, on line 31, we check whether we guessed the correct number and, if so, we prepare to exit the loop. Notice that we provide a way to exit the game even without guessing the number correctly, but we check for ‘q’ before trying to parse, because it would be an illegal command for GameParser.


This blog talks a lot about Language Engineering, which is a fascinating topic, but one that is not always relevant to the everyday life of the average developer. Sprache, instead, is a tool that any developer could find a use for. When a regular expression wasn’t good enough, you have probably just redesigned your application, making your life more complicated. Now you don’t need to: when you meet the mortal enemy of regular expressions, that is to say nested expressions, you can just use Sprache, right in your code.

Building and testing a parser with ANTLR and Kotlin

This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it in a book titled
How to create pragmatic, lightweight languages


Code is available on GitHub. The code described in this post is associated with the tag 02_parser

A change to the lexer

With respect to the lexer we saw in the first article, we need to make one minor change: we want to keep recognizing whitespace but we do not want to consider it in our parser rules. So we will instruct the lexer to throw all the whitespace tokens away and not pass them to the parser. To do that we need to change one line in SandyLexer.g4:

Thanks to Joseph for reporting this one!

The parser

The parser is simply defined as an ANTLR grammar. We have previously built a separate lexer. Here we reuse those terminals (NEWLINE, VAR, ID, etc.) to build rules such as statement, assignment, etc.

Here is our new ANTLR grammar:

  • we reuse the existing lexer (tokenVocab=SandyLexer)
  • we start by defining the rule representing the whole file: sandyFile. It is defined as a list of at least one line
  • each line is composed of a statement terminated either by a newline or the end of the file
  • a statement can be a varDeclaration or an assignment
  • an expression can be defined in many different ways. The order is important because it determines the operator precedence. So the multiplication comes before the sum
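To see what "the order determines the precedence" means in practice, here is a hand-written analogue in plain Java, not the code ANTLR generates. In a recursive-descent evaluator the same priority is expressed by nesting: the method for sums calls the method for products, so a multiplication is always grouped before the surrounding sum. The class name PrecedenceDemo is made up for this sketch, and it handles only integers, while Sandy also has decimal literals.

```java
// A minimal sketch of operator precedence by nesting (illustrative names only).
final class PrecedenceDemo {
    private final String src;
    private int pos = 0;

    private PrecedenceDemo(String src) {
        this.src = src.replace(" ", ""); // ignore spaces for simplicity
    }

    static int eval(String text) {
        PrecedenceDemo p = new PrecedenceDemo(text);
        int value = p.sum();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return value;
    }

    // Lowest precedence: + and -, left-associative.
    private int sum() {
        int value = product();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            int right = product();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // Higher precedence: * and /, left-associative.
    private int product() {
        int value = atom();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            char op = src.charAt(pos++);
            int right = atom();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // Integer literal or parenthesized expression.
    private int atom() {
        if (pos < src.length() && peek() == '(') {
            pos++;
            int value = sum();
            if (pos >= src.length() || peek() != ')') {
                throw new IllegalArgumentException("expected ) at position " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(peek())) pos++;
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() { return src.charAt(pos); }
}
```

Here PrecedenceDemo.eval("1 + 2 * 3") gives 7 and PrecedenceDemo.eval("(1 + 2) * 3") gives 9. With ANTLR 4 we get the same grouping just by listing the multiplication alternative before the sum.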

To build it we simply run ./gradlew generateGrammarSource. Please refer to the build.gradle file in the repository or take a look at the previous post of the series.


Ok, we defined our parser, now we need to test it. In general, I think we need to test a parser in three ways:

  • Verify that all the code we need to parse is parsed without errors
  • Ensure that code containing errors is not parsed
  • Verify that the shape of the resulting AST is the one we expect

In practice the first point is the one on which I tend to insist the most. If you are building a parser for an existing language the best way to test your parser is to try parsing as much code as you can, verifying that all the errors found correspond to actual errors in the original code, and not errors in the parser. Typically I iterate over this step multiple times to complete my grammars.

The second and third points are refinements on which I work once I am sure my grammar can recognize everything.

In this simple case, we will write simple test cases to cover the first and the third point: we will verify that some examples are parsed and we will verify that the AST produced is the one we want.

It is a bit cumbersome to verify that the AST produced is the one you want. There are different ways to do that but in this case I chose to generate a string representation of the AST and verify it is the same as the one expected. It is an indirect way of testing the AST is the one I want but it is much easier for simple cases like this one.
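The idea can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: Node is a made-up minimal tree class, not an ANTLR type, and the labels used in the example are invented. Each node is printed on its own line, with children indented two spaces more than their parent, so that a whole tree can be compared with an expected string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal tree node; a real ANTLR parse tree has a richer API.
final class Node {
    final String label;
    final List<Node> children;

    Node(String label, Node... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }
}

final class TreePrinter {
    // One line per node; children are indented two spaces more than their parent.
    static String toMultilineString(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, 0, sb);
        return sb.toString();
    }

    private static void append(Node node, int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(node.label).append('\n');
        for (Node child : node.children) append(child, depth + 1, sb);
    }
}
```

A test then builds or parses a tree and compares the output of toMultilineString with a literal string. The comparison is fragile with respect to formatting, but it makes the expected shape of the tree readable directly in the test.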

This is how we produce a string representation of the AST:

And these are some test cases:

Simple, isn’t it?


We have seen how to build a simple lexer and a simple parser. Many tutorials stop there. We are instead going to move on and build more tools from our lexer and parser. We laid the foundations, we now have to move to the rest of the infrastructure. Things will start to get interesting.

In the next post we will see how to build an editor with syntax highlighting for our language.

Getting started with ANTLR: building a simple expression language

This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very little material on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will let us define variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the usage of parentheses

Example of a valid file:

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • Gradle as our build system
  • Kotlin to write the code. It will be very basic Kotlin, given that I just started learning it.

Setup the project

Our build.gradle file will look like this:

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.

Testing the lexer

Testing is always important, but while building languages it is absolutely critical: if the tools supporting your language are not correct, this could affect every program you will build with it. So let’s start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we expect.
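The shape of such a test can be sketched in plain Java with a tiny hand-written tokenizer standing in for the generated lexer. Everything here is an assumption made for illustration: the class TinyLexer, the rule table and most token names are invented (only VAR, ID and NEWLINE are names that appear in the series), and whitespace is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A stand-in for a generated lexer, used only to show the testing approach.
final class TinyLexer {
    // Rules are tried in order, so VAR is listed before the more general ID.
    private static final String[][] RULES = {
        {"NEWLINE", "\\r?\\n"},
        {"WS", "[ \\t]+"},
        {"VAR", "var\\b"},
        {"INTLIT", "[0-9]+"},
        {"ID", "[A-Za-z_][A-Za-z0-9_]*"},
        {"ASSIGN", "="},
        {"PLUS", "\\+"},
    };

    // Returns the names of the token types found in the code, followed by EOF.
    static List<String> tokenTypes(String code) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < code.length()) {
            boolean matched = false;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(code);
                m.region(pos, code.length());
                if (m.lookingAt()) {
                    if (!rule[0].equals("WS")) types.add(rule[0]); // whitespace is skipped
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("unexpected character at position " + pos);
            }
        }
        types.add("EOF");
        return types;
    }
}
```

A test then reduces to one comparison per example: for the input "var a = 1" we expect the sequence VAR, ID, ASSIGN, INTLIT, EOF.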

Conclusions and next steps

We started with the first small step: we set up the project and built the lexer.

There is a long way in front of us before making the language usable in practice but we started. We will next work on the parser with the same approach: building something simple that we can test and compile through the command line.

On the need of a generic library around ANTLR: using reflection to build a metamodel

I am a Language Engineer: I use several tools to define and process languages.

Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

  • ANTLR is a very good building block but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST and I do not see an ecosystem of libraries around ANTLR
  • ANTLR does not produce a metamodel of the grammar: without one it becomes very difficult to build generic tools around ANTLR

Let me explain that:

  • For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
  • For the others: read the next paragraph

Why we need a metamodel

Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

Well, given a ParserRuleContext I can take the rule index and find the name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

Let’s look at the class Single_inputContext:

I should obtain something like this:

Good. It is very easy for me to look at the class and recognize these elements, however how can I do that automatically?

Reflection, obviously, you will think.
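As a rough sketch of what that could look like in plain Java: the classes below are hand-written stand-ins that only imitate the shape of ANTLR-generated context classes, and the rule "public, no parameters, return type ending in Context" is a heuristic assumed for this example. It is not something ANTLR guarantees, and ContextInspector is a name made up here.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hand-written stand-ins imitating the shape of generated context classes.
class Simple_stmtContext {}
class Compound_stmtContext {}
class Single_inputContext {
    public Simple_stmtContext simple_stmt() { return null; }
    public Compound_stmtContext compound_stmt() { return null; }
    public int getRuleIndex() { return 0; }
}

final class ContextInspector {
    // Maps accessor name to child type for every public no-argument method
    // whose return type looks like a context class (a naming heuristic).
    static Map<String, String> childAccessors(Class<?> contextClass) {
        Map<String, String> result = new TreeMap<>();
        for (Method m : contextClass.getDeclaredMethods()) {
            boolean looksLikeChild = Modifier.isPublic(m.getModifiers())
                    && m.getParameterCount() == 0
                    && m.getReturnType().getSimpleName().endsWith("Context");
            if (looksLikeChild) {
                result.put(m.getName(), m.getReturnType().getSimpleName());
            }
        }
        return result;
    }
}
```

Calling ContextInspector.childAccessors(Single_inputContext.class) finds simple_stmt and compound_stmt and ignores getRuleIndex. Note that the return type alone tells us nothing reliable about multiplicity or optionality, and that is exactly the information a metamodel would make explicit.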

Yes. That would work. However, what happens when we have multiple elements? Take this class:

To define metamodels I would not try to come up with anything fancy. I would use the classical schema which is at the base of EMF and is similar to what is available in MPS.

I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those entities as the root Entity.

Each Entity would have:

  • a name
  • an optional parent Entity (from which it inherits properties and relations)
  • a list of properties
  • a list of relations

Each Property would have:

  • a name
  • a type chosen among the primitive types. In practice I expect to use just String and Integers. Possibly enums in the future
  • a multiplicity (1 or many)

Each Relation would have:

  • a name
  • the kind: containment or reference. Now, the AST knows only about containments, however later we could implement symbol resolution and model transformations and at that stage we will need references
  • a target type: another Entity
  • a multiplicity (1 or many)
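The schema described above could be written down in Java roughly as follows. This is only a first sketch with assumed names: the enums, the field names and the helper allProperties are choices made for this illustration, not an existing library.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are illustrative; they mirror the lists above.
enum Multiplicity { ONE, MANY }
enum PrimitiveType { STRING, INTEGER }
enum RelationKind { CONTAINMENT, REFERENCE }

final class Property {
    final String name;
    final PrimitiveType type;
    final Multiplicity multiplicity;

    Property(String name, PrimitiveType type, Multiplicity multiplicity) {
        this.name = name;
        this.type = type;
        this.multiplicity = multiplicity;
    }
}

final class Relation {
    final String name;
    final RelationKind kind;
    final Entity target;
    final Multiplicity multiplicity;

    Relation(String name, RelationKind kind, Entity target, Multiplicity multiplicity) {
        this.name = name;
        this.kind = kind;
        this.target = target;
        this.multiplicity = multiplicity;
    }
}

final class Entity {
    final String name;
    final Entity parent; // null when the entity has no parent
    final List<Property> properties = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    Entity(String name, Entity parent) {
        this.name = name;
        this.parent = parent;
    }

    // Own properties plus the ones inherited through the parent chain.
    List<Property> allProperties() {
        List<Property> all = new ArrayList<>();
        if (parent != null) all.addAll(parent.allProperties());
        all.addAll(properties);
        return all;
    }
}

final class Metamodel {
    final String name;
    final List<Entity> entities = new ArrayList<>();
    Entity root; // optional root entity

    Metamodel(String name) { this.name = name; }
}
```

A generic tool, for example an XML or JSON serializer, could then walk any AST by asking the corresponding Entity for its properties and relations instead of inspecting generated classes.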

Next steps

I would start building a metamodel and later building generic tools taking advantage of the metamodel.

There are other things that I typically need:

  • transformations: the AST which I generally get from ANTLR is determined by how I am forced to express the grammar to obtain something parsable. Sometimes I also have to do some refactoring to improve performance. I want to transform the AST after parsing to obtain something closer to the logical structure of the language.
  • unmarshalling: from the AST I want to produce the text back
  • symbol resolution: this could be absolutely not trivial, as I have found out building a symbol solver for Java

Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally, Xtext comes with a lot of dependencies.

Do not get me wrong: I think Xtext is an amazing solution in a lot of contexts. However there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

By the way years ago I built something similar for .NET and I called it NetModelingFramework.

ANTLR and Jetbrains MPS: Parsing files and display the AST using the tree notation

Itemis did it again: they just released a new very cool plugin for Jetbrains MPS. This one lets you define new tree editors.

They look like this:


In this post we are going to see:

  • how to use ANTLR parsers inside MPS
  • how to represent the parsed AST using the tree notation

In particular we are going to use the ANTLR grammar which parses… ANTLR grammars. How meta is that? The very same approach could be used for every ANTLR grammar, of course.

As always, the code is available on GitHub.


First of all you need to install Jetbrains MPS. Grab your free copy here.

To use the tree notations you should install the mbeddr platform. Just go here, download a zip and unzip it among the plugins of your MPS installation.

All set, time to do some programming.

Packaging ANTLR to be used inside MPS

In a previous post we discussed how to use an existing ANTLR grammar in Java projects using Gradle. We will apply that technique also here.

We start by downloading the grammar from here: https://github.com/antlr/grammars-v4/tree/master/antlr4

We make only minor changes, including LexBasic directly in ANTLRv4Lexer. Note that we also need the LexerAdaptor.

To simplify usage we create a Facade:

Now we need a build file:

You may want to run:

  • gradle idea to create a Jetbrains IDEA project
  • gradle fatJar to create a Jar which will contain our compiled code and all the dependencies

Good. Now, to use this parser in MPS we start by creating a project. In the wizard we also select the runtime and sandbox options. Once we have done that we should copy our fat jar under the models directory of the runtime solution. In my case I run this command from the directory of the Java project:

Now we need to make MPS aware of that Jar. Let’s select the sandbox solution and first add the jar to the models:


Then we add it also to the libraries:


Now the content of the JAR should appear among the stubs of the runtime solution.



Creating MPS nodes from AST nodes

Now we are going to build a new concept named AntlrImporter. We will use it to select and import ANTLR grammars into MPS:


The Concept structure will be pretty simple:


We need also concepts for the AST nodes we are going to import. First of all, we will define the abstract concept AstNode. Then we will define two subconcepts for the terminal and non-terminal AST nodes.


Now let’s take a look at the editor for the AntlrImporter.


The first swing component is a button which opens a file chooser. In this way, we can easily select a file and set the property path. Or we can edit it manually if we prefer.



Once we have selected a File we can import it by clicking on the second button


The import logic is in importModel, a method in the behavior of AntlrImporter.


Good. That is it. With that we can parse any ANTLR grammar and get it into MPS. Now we have just to use a nice representation. We are going for the tree notation.

Using the tree notation

The tree notation is surprisingly easy to use.

Let’s start by adding com.mbeddr.mpsutil.treenotation.styles.editor to the dependencies of the editor aspect of our language.


We will need also the com.mbeddr.mpsutil.treenotation to be among the used languages.


The editor for NonTerminalNode consists of a single tree cell. The top part of the tree cell represents this node. We will use the ruleName to represent it. In the bottom part instead we should pick the relation that contains the children to be displayed in the tree.


We can put the cursor on the tree drawing between the top and the bottom part (the “/|\” symbol) and open the inspector. There we can use style attributes to customize the appearance of the tree


We just decide to show the tree from left to right instead of top down. Then we decide to add more space between the parent and the children when there are too many children. In this way the lines do not overlap too much.

This is how it looks without the property


This is how it looks with the property set


There are other properties that can be used to control the color and the thickness of the lines, for example. Or you could add shapes at the extremes of the lines. For now we do not need these features, but it is nice to know they are there.

The editor for TerminalNode is very simple



Over the years MPS became more stable and easier to use. It has reached the point at which you can be very productive using it. Projectional editing is an idea that has been around for a while and there are other implementations available like the Whole Platform. However MPS has reached a very high level of maturity.

What I think we still miss are:

  • processes and best practices: how should we manage dependencies with other MPS projects? How should we integrate with Java libraries?
  • examples: there are surprisingly few applications which are publicly available. After all, many users develop DSLs for their specific usages and do not intend to share them. However, this means we have few opportunities to learn from each other
  • extensions: the Mbeddr team is doing an amazing job providing a lot of goodies as part of the Mbeddr platform. However, they seem to be the only ones producing reusable components and sharing them

I think it is now time to understand together what we can achieve with projectional editing. In my opinion these are going to be very interesting times.

If I could express one wish, it would be to hear more about how others are using MPS. If you are out there, please knock. And leave a comment 🙂

Turin Programming Language for the JVM: building advanced lexers with ANTLR

As I wrote in my last post, I recently started working on a new programming language named Turin. A working compiler for an initial version of the language is available on GitHub. I am currently improving the language and working on a Maven plugin and an IntelliJ plugin. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax to a representation easier to process. For example we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuilder.append
  3. Perform semantic checks. For example we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequence of tokens, while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why use ANTLR?

ANTLR is a very mature tool for writing lexers and parsers. It can generate code for several languages and has decent performance. It is well maintained and I was sure it had all the features I could possibly need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes it possible to write simple grammars because it handles left-recursive definitions for you. So you do not have to write many intermediate node types to specify precedence rules for your expressions. More on this when we look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have used ANTLR while building a framework for model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason to look into alternatives.

The current lexer grammar

This is the current version of the lexer grammar.

A few choices I have made:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This reduces ambiguity in the grammar because values and types can be easily distinguished. In Java, instead, when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parentheses) or a cast to the type foo. We need to look at what follows to understand it. In my opinion this is rather stupid because in practice everyone uses capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage of it
  • newlines are relevant in Turin, so we have tokens for them: we basically want statements to be terminated by newlines, but we accept optional newlines after commas
  • whitespace (except newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but retrieve them when needed. For example we need them for syntax highlighting and in general for the IntelliJ plugin, because it requires tokens to be defined for every single character in the source file, without gaps
  • the trickiest part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of the tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
  • I had to collapse the relational operators into one single token type, so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation needs to be executed. Nothing too awful, but you have to know how to fix these kinds of issues.
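The mode switching for string interpolation can be illustrated with a small hand-written lexer in plain Java that keeps a stack of modes. This is a sketch of the mechanism only, not the Turin lexer: the class ModeLexer and the token names STRING_START, STRING_CONTENT, INTERPOLATION_START, INTERPOLATION_END, STRING_STOP and POINT are assumptions, and only a tiny subset of expression tokens is recognized.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative mode-based lexer; token names are assumptions, not Turin's.
final class ModeLexer {
    enum Mode { DEFAULT, IN_STRING, IN_INTERPOLATION }

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int pos = 0;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            Mode mode = modes.peek();
            if (mode == Mode.IN_STRING) {
                if (c == '"') {
                    tokens.add("STRING_STOP");
                    modes.pop();
                    pos++;
                } else if (src.startsWith("#{", pos)) {
                    tokens.add("INTERPOLATION_START");
                    modes.push(Mode.IN_INTERPOLATION);
                    pos += 2;
                } else {
                    // Plain text up to the closing quote or the next "#{".
                    int start = pos;
                    while (pos < src.length() && src.charAt(pos) != '"'
                            && !src.startsWith("#{", pos)) {
                        pos++;
                    }
                    tokens.add("STRING_CONTENT(" + src.substring(start, pos) + ")");
                }
            } else {
                // DEFAULT and IN_INTERPOLATION share these rules here; in an
                // ANTLR lexer grammar they have to be repeated for each mode.
                if (c == '"') {
                    tokens.add("STRING_START");
                    modes.push(Mode.IN_STRING);
                    pos++;
                } else if (c == '}' && mode == Mode.IN_INTERPOLATION) {
                    tokens.add("INTERPOLATION_END");
                    modes.pop();
                    pos++;
                } else if (c == '.') {
                    tokens.add("POINT");
                    pos++;
                } else if (Character.isWhitespace(c)) {
                    pos++;
                } else if (Character.isLowerCase(c)) {
                    int start = pos;
                    while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
                        pos++;
                    }
                    tokens.add("VALUE_ID(" + src.substring(start, pos) + ")");
                } else {
                    throw new IllegalArgumentException(
                            "unexpected character '" + c + "' at position " + pos);
                }
            }
        }
        return tokens;
    }
}
```

Using a stack rather than a single current mode is what lets a string literal appear inside an interpolation, which is itself inside a string.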

Testing the lexer

I wrote a bunch of tests specific for the lexer. In particular I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

As you can see I just run the lexer on a string and verify it produces the correct list of tokens. Easy and straight to the point.


My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators into a single token type is not nice. Having to repeat several token definitions for different lexer modes is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not as elegant as desired, but there is one. So I can use it and move on to more interesting parts of the compiler.

Develop DSLs for Eclipse and IntelliJ using Xtext

In this post we are going to see how to develop a simple language. We will aim to get:

  • a parser for the language
  • an editor for IntelliJ. The editor should have syntax highlighting, validation and auto-completion

We would also get for free an editor for Eclipse and a web editor, but please contain your excitement, we are not going to look into that in this post.

In the last year I have focused on learning new stuff (mostly web and ops stuff) but one of the things I still like the most is to develop DSLs (Domain Specific Languages). The first related technology I played with was Xtext: Xtext is a fantastic tool that lets you define the grammar of your language and generate amazing editors for it. Until now it has been developed only for the Eclipse platform: it means that new languages could be developed using Eclipse and the resulting editors could then be installed in Eclipse.

Lately I have been using Eclipse far less, and so my interest in Xtext faded until now, when finally the new release of Xtext (still in beta) targets IntelliJ. So while we will develop our language using Eclipse, we will then generate plugins to use our language in both Eclipse and IntelliJ.

The techniques we are going to see can be used to develop any sort of language, but we are going to apply them to a specific case: AST transformations. This post is intended for Xtext newbies and I am not going into much detail for now; I am just sharing my first impression of the IntelliJ target. Consider that this functionality is currently in beta, so we can expect some rough edges.

The problem we are trying to solve: adapt ANTLR parsers to get awesome ASTs

I like playing with parsers and ANTLR is a great parser generator. There are beautiful grammars out there for full-blown languages like Java. Now, the problem is that the grammars of languages like Java are quite complex and the generated parsers produce ASTs that are not easy to use. The main problem is due to how precedence rules are handled. Consider the grammar for Java 8 produced by Terence Parr and Sam Harwell. Let’s look at how some expressions are defined:

This is just a fragment of the large portion of code used to define expressions. Now consider you have a simple preIncrementExpression (something like: ++a). In the AST we will have a node of type preIncrementExpression that will be contained in a unaryExpression. The unaryExpression will be contained in a multiplicativeExpression, which will be contained in an additiveExpression and so on and so forth. This organization is necessary to handle operator precedence between the different kinds of operations, so that 1 + 2 * 3 is parsed as a sum of 1 and 2 * 3 instead of a multiplication of 1 + 2 and 3. The problem is that from the logical point of view multiplications and additions are expressions at the same level: it does not make sense to have Matryoshka AST nodes.

Consider this code:

The AST produced by this grammar is:

While we would like something like:

Ideally we want to specify grammars that produce the Matryoshka-style ASTs but work with flatter ASTs when analyzing the code, so we are going to build adapters between the ASTs as produced by ANTLR and the “logical” ASTs.
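To give a feeling for what such an adapter does, here is a deliberately crude sketch in plain Java. It is not the approach built in the rest of this post, which uses explicit mappings: it just applies the heuristic "a node with exactly one child is a wrapper and can be skipped". AstNode and Flattener are names made up for this illustration, and a real adapter would need to keep some single-child nodes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generic tree node, standing in for the nodes produced by a parser.
final class AstNode {
    final String type;
    final List<AstNode> children;

    AstNode(String type, AstNode... children) {
        this.type = type;
        this.children = Arrays.asList(children);
    }

    @Override
    public String toString() {
        return children.isEmpty() ? type : type + children;
    }
}

final class Flattener {
    // Skips every chain of single-child wrapper nodes, keeping the innermost
    // node of the chain, and does the same recursively for its children.
    static AstNode flatten(AstNode node) {
        AstNode current = node;
        while (current.children.size() == 1) {
            current = current.children.get(0);
        }
        AstNode[] kids = new AstNode[current.children.size()];
        for (int i = 0; i < kids.length; i++) {
            kids[i] = flatten(current.children.get(i));
        }
        return new AstNode(current.type, kids);
    }
}
```

Applied to a preIncrementExpression wrapped in unaryExpression, multiplicativeExpression and additiveExpression nodes, flatten returns the preIncrementExpression node directly, with its two children.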

How do we plan to do that? We will start by developing a language defining the shape of nodes as we want them to appear in the logical ASTs and we will also define how to map the Antlr nodes (the Matryoshka-style nodes) into these logical nodes.

This is just the problem we are trying to solve: Xtext can be used to develop any sort of language. It is just that, being a parser maniac, I like to use DSLs to solve parser-related problems, which is very meta.

Getting started: installing Eclipse Luna DSL and creating the project

We are going to download a version of Eclipse containing the beta of Xtext 2.9.

In your brand new Eclipse you can create a new type of project: Xtext Projects.

Screenshot from 2015-06-01 09:44:03

We just have to define the name of the project and pick an extension to be associated with our new language.

Screenshot from 2015-06-01 09:45:14

And then we select the platforms that we are interested in (yes, there is also the web platform… we will look into that in the future).

Screenshot from 2015-06-01 09:47:27

The project created contains a sample grammar. We could use it as is: we would just have to generate a few files by running the MWE2 file.


After running this command we could just use our new plugin in IntelliJ or in Eclipse. But instead we are going to change the grammar first, to transform the given example into our glorious DSL.

An example of our DSL

Our language will look like this in IntelliJ IDEA (cool, eh?).

Screenshot from 2015-06-02 19:42:14

Of course this is just a start, but we are starting to define some basic node types for a Java parser:

  • an enum representing the possible modifiers (warning: this is not a complete list)
  • the CompilationUnit which contains an optional PackageDeclaration and possibly many TypeDeclarations
  • TypeDeclaration is an abstract node and there are three concrete types extending it: EnumDeclaration, ClassDeclaration and InterfaceDeclaration (we are missing the annotation declaration)

We will need to add tens of expressions and statements but you should get an idea of the language we are trying to build.

Note also that we have a reference to an ANTLR grammar (in the first line) but we are not yet specifying how our defined node types map to the ANTLR node types.

Now the question is: how do we build it?

Define the grammar

We can define the grammar of our language with a simple EBNF notation (with a few extensions). Look for a file with the xtext extension in your project and change it like this:

The first rule we define corresponds to the root of the AST (Model in our case). Our Model starts with a reference to an ANTLR file, followed by a list of Declarations. The idea is to specify declarations of our “logical” node types and how the “antlr” node types should be mapped to them. So we will define transformations that will have references to elements defined in the ANTLR grammar that we will specify in the AntlrGrammarRef rule.

A declaration can be either an Enum or a NodeType. The NodeType has a name, can be abstract and can extend another NodeType. Note that the supertype is a reference to a NodeType. This means that the resulting editor will automatically be able to give us auto-completion (listing all the NodeTypes defined in the file) and validation (verifying that we are referring to an existing NodeType).
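Xtext derives this behaviour from the grammar, so we do not have to write it ourselves. Purely to illustrate what "a reference to a NodeType" buys us, here is a rough sketch in Java of the two checks. The classes NodeTypeDecl and SuperTypeCheckSketch are hypothetical and have nothing to do with the code Xtext actually generates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory view of a NodeType declaration.
final class NodeTypeDecl {
    final String name;
    final boolean isAbstract;
    final String superTypeName; // null when the type extends nothing

    NodeTypeDecl(String name, boolean isAbstract, String superTypeName) {
        this.name = name;
        this.isAbstract = isAbstract;
        this.superTypeName = superTypeName;
    }
}

final class SuperTypeCheckSketch {
    // Auto-completion: every declared NodeType name is a valid candidate.
    static Set<String> completionCandidates(List<NodeTypeDecl> decls) {
        Set<String> names = new LinkedHashSet<>();
        for (NodeTypeDecl d : decls) {
            names.add(d.name);
        }
        return names;
    }

    // Validation: report supertypes that do not match any declaration.
    static List<String> unresolvedSuperTypes(List<NodeTypeDecl> decls) {
        Set<String> known = completionCandidates(decls);
        List<String> errors = new ArrayList<>();
        for (NodeTypeDecl d : decls) {
            if (d.superTypeName != null && !known.contains(d.superTypeName)) {
                errors.add(d.name + " extends unknown node type " + d.superTypeName);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<NodeTypeDecl> decls = Arrays.asList(
            new NodeTypeDecl("TypeDeclaration", true, null),
            new NodeTypeDecl("ClassDeclaration", false, "TypeDeclaration"),
            new NodeTypeDecl("EnumDeclaration", false, "TypeDeclaratio")); // typo on purpose
        System.out.println(unresolvedSuperTypes(decls));
    }
}
```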

In our NodeTypes we can define as many fields as we want (NodeTypeField). Each field starts with a name, followed by an operator:

  • *= means we can have 0..n values in this field
  • ?= means that the field is optional (0..1 values)
  • means that exactly one value is always present

A NodeTypeField also has a value type, which can be an enum defined inline (UnnamedEnumDeclaration), a relation (meaning that this node contains other nodes) or an attribute (meaning that this node has some basic attribute like a string or a boolean).
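The three multiplicities listed above boil down to a lower and an upper bound on the number of values a field can hold. A small sketch in Java makes this concrete. The enum Multiplicity and its constant names are made up for this illustration and are not part of the DSL.

```java
// Hypothetical model of the three field multiplicities of the DSL.
enum Multiplicity {
    MANY(0, Integer.MAX_VALUE), // written *= in the DSL: 0..n values
    OPTIONAL(0, 1),             // written ?= in the DSL: 0..1 values
    ONE(1, 1);                  // exactly one value is always present

    private final int min;
    private final int max;

    Multiplicity(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // True when a field with this multiplicity may hold `count` values.
    boolean accepts(int count) {
        return count >= min && count <= max;
    }
}

final class MultiplicitySketch {
    public static void main(String[] args) {
        System.out.println(Multiplicity.MANY.accepts(0));     // prints true
        System.out.println(Multiplicity.OPTIONAL.accepts(2)); // prints false
        System.out.println(Multiplicity.ONE.accepts(0));      // prints false
    }
}
```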

Pretty simple, eh?

So we basically re-run the MWE2 files and we are ready to go.

See the plugin in action

To see our plugin installed in IntelliJ IDEA we just have to run gradle runIdea from the directory containing the IDEA plugin (me.tomassetti.asttransf.idea in our case). Just note that you need a recent version of Gradle and you need to define JAVA_HOME. This command will download IntelliJ IDEA, install the plugin we developed and start it. In the opened IDE you can create a new project and define a new file. Just use the extension we specified when we created the project (.anttr in our case) and IDEA should use our newly defined editor.

Validation currently works, but the editor seems to react quite slowly, and auto-completion is broken for me. Consider that this is just a beta, so I expect these issues to disappear before Xtext 2.9 is released.

Next steps

We are just getting started but it is amazing how we can have a DSL with its editor for IDEA working in a matter of minutes.

I plan to work in a few different directions:

  • We need to see how to package and distribute the plugin: we can try it using gradle runIdea, but we want to produce a binary that people can install without having to build the editor from source
  • Use arbitrary dependencies from Maven: this is going to be rather complicated because Maven and the Eclipse plugins (OSGi bundles) define their dependencies in their own way, so jars typically have to be packaged into bundles to be used in Eclipse plugins. However there are alternatives like Tycho and the p2-maven-plugin. Spoiler: I do not expect this one to be fast and easy…
  • We are not yet able to refer to elements defined in the ANTLR grammar. This means that we should be able to parse the ANTLR grammar and programmatically create EMF models, so that we can refer to them in our DSL. It requires knowing EMF (and that takes some time…). I am going to play with that in the future and this will probably require a loooong tutorial.


While I do not like Eclipse anymore (I am now used to IDEA and it seems so much better to me: faster and lighter), the Eclipse Modeling Framework remains a very interesting piece of software and being able to use it with IDEA is great.

It had been a while since I last played with EMF and Xtext, and I have to say that I have seen some improvements. I had the feeling that Eclipse was not very command-line friendly and that it was in general difficult to integrate with CI systems. I am seeing an effort to fix these problems (see Tycho or the gradle job we have used to start IDEA with the editor we developed) and it seems very positive to me.

Mixing technologies, combining the best aspects of different worlds in a pragmatic way is my philosophy, so I hope to find the time to play more with this stuff.