Tutorials and issues on all aspects of creating software to analyse code

Parsing in Java: Tools and Libraries

Parsing in Java: Tools and Libraries

If you need to parse a language, or document, from Java there are fundamentally three ways to solve the problem:

  • use an existing library supporting that specific language: for example a library to parse XML
  • building your own custom parser by hand
  • a tool or library to generate a parser: for example ANTLR, that you can use to build parsers for any language

Use an existing library

The first option is the best for well known and supported languages, like XML or HTML. A good library usually include also API to programmatically build and modify documents in that language. This is typically more of what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases you are out of luck.

Building your own custom parser by hand

You could need to go for the second option if you have particular needs. Both in the sense that the language you need to parse cannot be parsed with traditional parser generators, or you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.

A tool or library to generate a parser

In all other cases the third option should be the default one, because is the one that is most flexible and has the shorter development time. That is why on this article we concentrate on the tools and libraries that correspond to this option.

Note: text in blockquote describing a program comes from the respective documentation

Parsing in Java


Get the guide to parsing in Java delivered to your email and read it when you want on the device you want

Powered by ConvertKit

Tools To Create Parsers

We are going to see:

  • tools that can generate parsers usable from Java (and possibly from other languages)
  • Java libraries to build parsers

Tools that can be used to generate the code for a parser are called parser generators or compiler compiler. Libraries that create parsers are known as parser combinators.

Parser generators (or parser combinators) are not trivial: you need some time to learn how to use them and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best known of them, with a short introduction for each of them. We are also concentrating on one target language: Java. This also means that (usually) the parser itself will be written in Java.

To list all possible tools and libraries parser for all languages would be kind of interesting, but not that useful. That is because there will be simple too many options and we would all get lost in them. By concentrating on one programming language we can provide an apples-to-apples comparison and help you choose one option for your project.

Useful Things To Know About Parsers

To make sure that these list is accessible to all programmers we have prepared a short explanation for terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.

Structure Of A Parser

A parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. Not all parsers adopt this two-steps schema: some parsers do not depend on a lexer. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse a mathematical operation.

The lexer scans the text and find ‘4’, ‘3’, ‘7’ and then the space ‘ ‘. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a ‘+’ symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM.The input stream is transformed in Tokens and then in an AST by the parser

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers or parser are called rules or productions. A lexer rule will specify that a sequence of digits correspond to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.

Scannerless parsers are different because they process directly the original text, instead of processing a list of tokens produced by a lexer.

It is now typical to find suites that can generate both a lexer and parser. In the past it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was for example the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.

Parse Tree And Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and Abstract SyntaxTree (AST).

Conceptually they are very similar:

  • they are both trees: there is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
  • the difference is the level of abstraction: the parse tree contains all the tokens which appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree where the information that could be derived or is not important to understand the piece of code is removed

In the AST some information is lost, for instance comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rules correspond to the type of a node. They are usually transformed in AST by the user, with some help from the parser generator.

A graphical representation of an AST looks like this.

An abstract syntax tree for the Euclidean algorithm


Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process by the following steps. By following steps we mean all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc..


A grammar is a formal description of a language that can be used to recognize its structure.

In simple terms is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must starts with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis and a statement.

A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left and the right parenthesis were token types, while expression and statement were references to other rules.

The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extented variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:

The <simbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contains other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.

Left-recursive Rules

In the context of parsers an important feature is the support for  left-recursive rules. This means that a rule could start with a reference to itself. This reference could be also indirect.

Consider for example arithmetic operations.  An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.

This description also match multiple additions like 5 + 4 +  3. That is because it can be interpreted as expression (5) (‘+’) expression(4+3). And then 4 + 3 itself can be divided in its two components.

The problem is that this kind of rules may not be used with some parser generators. The alternative is a long chain of expressions that takes care also of the precedence of operators.

Some parser generators support direct left-recursive rules, but not indirect one.

Types Of Languages And Grammars

We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let’s look at some practical aspects instead.

A regular language can be defined by a series of regular expressions, while a context-free one need something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.

Usually to a kind of language correspond the same kind of grammar. That is to say there are regular grammars and context-free grammars that corresponds respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as Context-free grammars, but according to their authors they describe more naturally programming languages.

The Differences Between PEG and CFG

The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead with PEG the first applicable choice will be chosen, and this automatically solve some ambiguities.

Another differnce is that PEG use scannerless parsers: they do not need a separate lexer, or lexical analysis phase.

Traditionally both PEG and some CFG have been unable to deal with left-recursive rules, but some tools have found workarounds for this. Either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non recursive way. Either of these ways has downsides: either by making the generated parser less intelligible or by worsen its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.

Parser Generators

The  basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your Java code.

The parser might produce the AST, that you may have to traverse yourself or you can traverse with additional ready-to-use classes, such Listeners or Visitors. Some tools instead offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically lexers.

We are not going to talk about it, because it is very basic, but Java includes a library to parse data with numbers and simple patterns: java.util.Scanner. It could be defined as a smart library to read streams of data. It might be worth to check it out if you are in need of quickly parse some data.


JFlex is a lexical analyzer (lexer) generator based upon deterministic finite automata (DFA). A JFlex lexer matches the input according to the defined grammar (called spec) and executes the corresponding action (embedded in the grammar).

It can be used as a standalone tool, but being a lexer generator is designed to work with parser generators: typically it is used with CUP or BYacc/J. It can also work with ANTLR.

The typical grammar (spec) is divided three parts, separated by ‘%%’:

  1. usercode, that will be included in the generated class,
  2. options/macros,
  3. and finally the lexer rules.

Context Free

Let’s see the tools that generate Context Free parsers.


ANTLR is probably the most used parser generator for Java. ANTLR is based on an new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing:  The  Power  of  Dynamic  Analysis (PDF).

It can output parsers in many languages. But the real added value of a vast community it is the large amount of grammars available. The version 4 supports direct left-recursive rules.

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided in two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively lexer and parser grammars can be defined in separate files.

If you are interested in ANTLR you can look into this giant ANTLR tutorial we have written.


APG is a recursive-descent parser using a variation of Augmented BNF, that they call Superset Augmented BNF. ABNF is a particular variant of BNF designed to better support bidirectional communications protocol. APG also support additional operators, like syntactic predicates and custom user defined matching functions.

It can generate parsers in C/C++, Java e JavaScript. Support for the last language seems superior and more up to date: it has a few more features and seems more updated. In fact the documentation says it is designed to have the look and feel of JavaScript RegExp.

Because it is based on ABNF, it is especially well suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.

An APG grammar is very clean and easy to understand.


BYACC is Yacc that generates Java code. That is the whole idea and it defines its advantages and disadvantages. It is well known, it allows easier conversion of a Yacc and C program to a Java program. Although you obviously still need to convert all the C code embedded in semantic actions into Java code. Another advantage it is that you do not need a separate runtime, the generated parser it is all you need.

On the other hand it is old and the parsing world has made many improvements. If you are an experienced Yacc developer with a code base to upgrade it is a good choice, otherwise there are many more modern alternatives you should consider.

The typical grammar is divided in three sections, separated by ‘%%’: DECLARATIONS, ACTIONS and CODE. The second one contains the grammar rules and the third one the custom user code.


Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, that are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.

The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.

Technically all the grammars must be LL(1), that is to say the parser must be able to choose the correct rule only looking one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.

A Coco/R grammar looks like this.

Coco/R has a good documentation, with several examples grammars. It supports several languages including Java, C# and C++.


CookCC is a LALR (1) parser generator written in Java. Grammars can be specified in three different ways:

  • in Yacc format: it can reads grammar defined for Yacc
  • in its own XML format
  • in Java code, by using specific annotations

A unique feature is that it can also output a Yacc grammar. This can be useful if you need to interact with a tool that support a Yacc grammar. Like some old C program with which you must maintain compatibility.

It requires Java 7 to generate the parser, but it can run on earlier versions.

A typical parser defined with annotations will look like this.

For the standard of parser generators, using Java annotations it is a peculiar choice. Compared to an alternative like ANTLR there is certainly a less clear division between the grammar and the actions. This could make the parser harder to maintain for complex languages. Also porting to another language could require a complete rewrite.

On the other hand this approach permit to mix grammar rules with the actions to perform when you match them. Furthermore it has the advantage of being integrated in the IDE of your choice, since it is just Java code.


CUP is the acronym of Construction of Useful Parsers and it is LALR parser generator for Java. It just generates the proper parser part, but it is well suited to work with JFlex. Although obviously you can also build a lexer by hand to work with CUP. The grammar has a syntax similar to Yacc and it allows to embed code for each rule.

It can automatically generate a parse tree, but not an AST.

It also has an Eclipse plugin to aid you in the creation of a grammar, so effectively it has its own IDE.

The typical grammar is similar to YACC.


Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well- commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages and a clean separation between the grammar and the source code.

The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.

A typical grammar of Grammatica is divided in three sections: header, tokens and productions. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.


Jacc is similar to BYACC/J, except that is written in Java and thus it can run wherever your program can run. As a rule of thumb it is developed as a more modern version of Yacc. The author describes small improvements in areas like error messages, modularity and debug support.

If you know Yacc and you do not have any code base to upgrade, it might be a great choice.


JavaCC is the other widely used parser generator for Java. The grammar file contain actions and all the custom code needed by your parser.

Compared to ANTLR the grammar file is much less clean and include a lot of Java source code.

Thanks to its long history it is used in important projects, like JavaParser. This has left some quirks in the documentation and usage. For instance, technically JavaCC itself does not build an AST, but it comes with a tool that does it, JTree, so for practical purposes it does.

There is a grammar repository, but it does not have many grammars in it. It requires Java 5 or later.


ModelCC is a model-based parser generator that decouples language specification from language processing [..]. ModelCC receives a conceptual model as input, along with constraints that annotate it.

In practical terms you define a model of your language, that works as a grammar, in Java, using annotations. Then you feed to ModelCC the model you have created to obtain a parser.

With ModelCC you define your language in a way that is independent from the parsing algorithm used. Instead, it should the best conceptual representation of the language. Although, under the hood, it uses a traditional parsing algorithm. So the grammar per se use a form that is independent from any parsing algorithm, but ModelCC does not use magic and produces a normal parser.

There is a clear description of the intentions of the authors of the tools, but a limited documentation. Nonetheless there are examples available, including the following model for a calculator partially shown here.


SableCC is a parser generator created for a thesis and with the aim to be easy to use and to offer a clean separation between grammar and Java code. Version 3 should also offer an included a ready-to-use way to walk the AST using a visitor. But that is all in theory because there is virtually no documentation and we have no idea how to use any of these things.

Also, a version 4 was started in 2015 and apparently lies abandoned.


Urchin(CC) is a parser generator that allows you to define a grammar, called Urchin parser definition. Then you generate a Java parser from it. Urchin also generate a visitor from the UPD.

There is an exhaustive tutorial that is also used to explain how Urchin works and its limitations, but the manual is limited.

A UPD is divided in three sections: terminals, token and rules.


After the CFG parsers is time to see the PEG parsers available in Java.


Canopy is a parser compiler targeting Java, JavaScript, Python and Ruby. It takes a file describing a parsing expression grammar and compiles it into a parser module in the target language. The generated parsers have no runtime dependency on Canopy itself.

It also provides easy access to the parse tree nodes.

A Canopy grammar has the neat feature of using actions annotation to use custom code in the parser. In practical terms. you just write the name of a function next to a rule and then you implement the function in your source code.

The Java file containing the action code.


Laja is a two-phase scannerless, top-down, backtracking parser generator with support for runtime grammar rules.

Laja is a code generator and a parser generator and it is mainly designed to create external DSLs. This means that it have some peculiar features. With Laja you must specify not just the structure of the data, but also how the data should be mapped into Java structures. This structures are usually objects in a hierarchy or flat organization. In short, it makes very easy to parse data files, but it is less suitable for a generic programming language.

Laja options, like output directory or input file, are set in a configuration file.

A Laja grammar is divided in a rules section and the data mapping section. It looks like this.


Mouse is a tool to transcribe PEG into an executable parser written in Java.

It does not use packrat and thus it uses less memory than the typical PEG parser (the manual explicitly compares Mouse to Rats!).

It does not have a grammar repository, but there are grammars for Java 6-8 and C.

A Mouse grammar is quite clean. To include custom code, a feature called semantic predicates, you do something similar to what you do in Canopy. You include a name in the grammar and then later, in a Java file, you actually write the custom code.


Rats! is a parser generator part of xtc (eXTensible Compiler). It is based on PEG, but it uses “additional expressions and operators necessary for generating actual parsers”. It supports left-recursive productions. It can automatically generate an AST.

It requires Java 6 or later.

The grammar can be quite clean, but you can embed custom code after each production.

Parser Combinators

They allow you to create a parser simply with Java code, by combining different pattern matching functions, that are equivalent to grammar rules. They are generally considered suited for simpler parsing needs. Given they are just Java libraries you can easily introduce them into your project: you do not need any specific generation step and you can write all of your code in your favorite Java editor. Their main advantage is the possibility of being integrated in your traditional workflow and IDE.

In practice this means that they are very useful for all the little parsing problems you find. If the typical developer encounter a problem, that is too complex for a simple regular expression, these libraries are usually the solution. In short, if you need to build a parser, but you don’t actually want to, a parser combinator may be your best option.


Jparsec is the port of the parsec library of Haskell.

Parser combinators are usually used in one phase, that is to say they are without lexer. This is simply because it can quickly become too complex to manage all the combinators chains directly in the code. Having said that, jparsec has a special class to support lexical analysis.

It does not support left-recursive rules, but it provides a special class for the most common use case: managing the precedence of operators.

A typical parser written with jparsec is similar to this one.


Parboiled provides a recursive descent PEG parser implementation that operates on PEG rules you specify.

The objective of parboiled is to provide an easy to use and understand way to create small DSLs in Java. It put itself in the space between a simple bunch of regular expressions and an industrial-strength parser generator like ANTLR. A parboiled grammar can include actions with custom code, included directly into the grammar code or through an interface.

It does not build an AST for you, but it provides a parse tree and some classes to make it easier to build it. That is because its authors maintain that the AST is heavily dependent on your exact project needs, so they prefer to offer an “open and flexible approach”. It sound quite appropriate to the project objective and one of our readers find the approach better than a straight AST.

The documentation is very good, it explain features, shows example, compare the ideas behind parboiled with the other options. There are some example grammars in the repository, including one for Java.

It is used by several projects, including important ones like neo4j.


PetitParser combines ideas from scannerless parsing, parser combinators, parsing expression grammars and packrat parsers to model grammars and parsers as objects that can be reconfigured dynamically.

PetitParser is a cross between a parser combinator and a traditional parser generator. All the information is written in the source code, but the source code is divided in two files. In one file you define the grammar, while in the other one you define the actions corresponding to the various elements. The idea is that it should allow you to dynamically redefine grammars. While it is smartly engineered, it is debatable if it is also smartly designed. You can see that the example JSON grammar it is more lengthy than one expects it to be.

An excerpt from the example grammar file for JSON.

An excerpt from the example parser definiton file (that defines the actions for the rules) for JSON .

There is a version written in Java, but there are also versions in Smalltalk, Dart, PHP, and TypeScript.

The documentation is lacking, but there are example grammars available.

Java Libraries That Parse Java: JavaParser

There is one special case that requires some more comments: the case in which you want to parse Java code in Java. In this case we have to suggest to use a library named JavaParser. Incidentally we heavily contribute to JavaParser, but this is not the only reason why we suggest it. The fact is that JavaParser is a project with tens of contributors and thousands of users, so it is pretty robust.

A quick list of features:

  • it supports all versions of Java from 1 to 9
  • it supports lexical preservation and pretty printing: it means you can parse Java code, modify it and printing it back either with the original formatting or pretty printed
  • it can be used with JavaSymbolSolver, which gives you symbol resolution. I.e., it understands which methods are invoked, to which declarations references are linked to, it calculates the type of expressions, etc.

Convinced? You want still to write your own Java parser for Java?


Parsing in Java is a broad topic and the world of parsers is a bit different from the usual world of programmers. You will find the best tools coming directly from academia, which is typically not the case with software. Some tools and libraries have been started for a thesis or a research project. The upside is that tools tend to be easily and freely available. The downside is that some authors prefer to have a good explanation of the theory behind what their tools do, rather than a good documentation on how to use them. Also, some tools end up being abandoned as the original authors finish their master or their PhD.

We tend to use parser generators quite a lot: ANTLR is our favorite one and we use JavaCC extensively in our work on JavaParser.  We do not use parser combinators very much. It is not because they are bad, they have their uses and in fact we wrote an article about one in C#. But for the problems we deal with, they typically lead to less mantainable code. However they could be easier to start with so you may want to consider those. Especially if until now you have hacked something terrible using regular expressions and an half baked parser written by hand.

We cannot really say to you definitely what software you should use. What it is best for a user might not be the best for somebody else. And we all known that the most technically correct solution might not be ideal in real life with all its constraints. But we have searched and tried many similar tools in our work and something like this article would have helped us save some time. So we wanted to share what we have learned on the best options for parsing in Java.

Create a simple parser in C# with Sprache

Create a simple parser in C# with Sprache

You can find the code for this article on github

Everybody loves ANTLR, but sometimes it may be overkill. On the other hand, a regular expression just doesn’t cut it or it may be too complicated to maintain. What a developer can do in such cases ? He uses Sprache. As its creators say:

Sprache is a simple, lightweight library for constructing parsers directly in C# code.

It doesn’t compete with “industrial strength” language workbenches – it fits somewhere in between regular expressions and a full-featured toolset like ANTLR.

It is a simple but effective tool, whose main limitation is being character-based. In other words, it works on characters and not on tokens. The advantage is that you can work directly with code and you don’t have to use external tools to generate the parser.

The guessing game

You can see the project website if you want to see specific real uses, let’s just say that its even credited by ReSharper and it was created more than six years ago, so it’s stable and quite good. It’s ideal to manage things like error messages created by other tools that you have to deal with, to read logs, to parse queries like the ones you would uses for a simple search library or to read simple formats like Json. In this article we will create a parser for a simple guessing game, we will use .NET Core and xUnit for the unit tests, so it will work also on Linux and Mac.

The objective of the game is to guess a number, and to do that you can ask if the number is greater than a number, less than a number or between two numbers. When you are ready to guess you simply ask if it’s equal to a certain number.

Setup the project

We will use VSCode, instead of Visual Studio, but in the github project you would find two projects, one for each: this because there are still some compatibility quirks relative to project.json and the different .NET Core tools versions used by Visual Studio or the standalone command line version. To clarify, the project.json generated by the .NET Core standalone command line will work also with Visual Studio, but not viceversa (this might be changed when you will read this). Also, with two projects you can easily see how Visual Studio integrates xUnit tests. The C# code itself is the same.

Create the file global.json in the directory of your project, in our case SpracheGame, then create another SpracheGame folder inside src and a SpracheGame.Tests folder inside test. Inside the nested SpracheGame folder you can create a new .NET core program with the usual:

While you are nside the SpracheGame.Tests folder you can create a xUnit test project with:

You can see the final structure here.

SpracheGame folder structure

Change both project.json, adding sprache as a dependency to the main project:

…and add the main project as a dependency for the xUnit test project.

If you are using Visual Studio you may need to add a runtimes section to both of your project.json:

See the .NET documentation for .NET Core Runtime IDentifier (RID) catalog if you need to know other platform IDs.

Create GameParser

Let’s start by creating a class called GameParser and by recognizing numbers and commands.

On line 3 there is the code to parse a number: we start with Sprache.Parse followed by a digit, of which there must be at least one, then we convert from IEnumerable<char> to string, with Text(), and finally we discard whitespace with Token(). So first we choose the type of character we need, in this case Digit, then we set a quantity modifier and trasform the result in something more manageable. Notice that we return Parser<string> and not an int.

On the lines 5-6 we order to the parser to find a character ‘<‘  followed by  one ‘>’, using  Then(). We return an enum instead of a simple string. We can easily check for the presence of different options with the Or(), but it’s important to remember that, just as for ANTLR, the order matters. We have to put the more specific case first, otherwise it would match the generic one instead of reaching the correct case.

Now we have to combine this two simple parser in one Play, and thanks to the LINQ-like syntax the task is very simple. Most commands require only a number, but there is one that requires two, because we have to check if the number to guess is between two given numbers. It also has a different structure, first there is a number, then the symbol, and finally the second number. This is a more natural syntax for the user than using a ‘<>’ symbol followed by two numbers. As you can see, the code is quite simple, we gather the elements with from .. in .. and then we create a new object with select.

It’s time for Play

The only interesting things in the Play class are on the lines 27-51, the Evaluate function, where the “magic” happens, and I use the term magic extremely loosely. The number to guess is provided to the function, then it’s properly checked with the command and the numbers of the specific play that we are evaluating.

Unit Tests are easy

There are basically no disadvantages in using xUnit for our unit tests: it’s compatible with many platforms, it’s still integrated with the Visual Studio Test Explorer and it also have a special feature: theory. Theory is a special kind of test that allow you to supply multiple inputs with one test. Lines 3-6 shows exactly how you can do it. In our case we are testing that our parser can parse numbers with many digits.

The following test is a typical one, we are checking that the symbol ‘>’ is correctly parsed as a Command.Greater. On Line 27 we are making sure that an Exception is raised if we encounter an incorrect Play. Sprache allows also to use TryParse, instead of Parse, if you don’t want to throw an exception. As you can see the simplicity of tool make very easy to test it.

Let’s put everything together

The main function doesn’t contain anything shocking, on the lines 27-28 we parse the input and execute the proper command, then, on 31, we check whether we guessed the correct number and if so we prepare to exit the cycle. Notice that we provide a way to exit the game even without guessing the number correctly, but we check for ‘q’ before trying to parse, because it would be an illegal command for GameParser.


This blog talks much about Language Engineering, which is a fascinating topic, but it is not always used in the everyday life of the average developer. Sprache, instead, is one tool that any developer could find a use for. When a RegEx wasn’t good enough you probably have simply redesigned your application, making your life more complicated. Now you don’t need to, when you meet the mortal enemy of regular expressions, that is to say nested expression, you can just use Sprache, right in your code.

Building and testing a parser with ANTLR and Kotlin

This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified into this book titled
How to create pragmatic, lightweight languages


Code is available on GitHub. The code described in this post is associated to the tag 02_parser

A change to the lexer

With respect to the lexer we have seen in the first article we need to do one minor change: we want to keep recognizing the whitespace but we do not want to consider it in our parser rules. So we will instruct the lexer to throw all the whitespace tokens away and not passing them to the parser. To do that we need to change one line in SandyLexer.g4:

Thanks to Joseph for reporting this one!

The parser

The parser is simply defined as an ANTLR grammar. We have previously built a separate lexer. Here we reuse those terminals (NEWLINE, VAR, ID, etc.) to build rules such as statement, assignment, etc.

Here it is our new ANTLR grammar:

  • we reuse the existing lexer (tokenVocab=SandyLexer)
  • we start by defining the rule reppresenting the whole file: sandyFile. It is defined as a list of at list one line
  • each line is composed by a statement terminated either by a newline or the end of the file
  • a statement can be a varDeclaration or an assignment
  • an expression can be defined in many different ways. The order is important because it determines the operator precedence. So the multiplication comes before the sum

To build it we simply run ./gradlew generateGrammarSource. Please refer to the build.gradle file in the repository or take a look at the previous post of the series.


Ok, we defined our parser, now we need to test it. In general, I think we need to test a parser in three ways:

  • Verify that all the code we need to parse is parsed without errors
  • Ensure that code containing errors is not parsed
  • Verify that the the shape of the resulting AST is the one we expect

In practice the first point is the one on which I tend to insist the most. If you are building a parser for an existing language the best way to test your parser is to try parsing as much code as you can, verifying that all the errors found correspond to actual errors in the original code, and not errors in the parser. Typically I iterate over this step multiple times to complete my grammars.

The second and third points are refinements on which I work once I am sure my grammar can recognize everything.

In this simple case, we will write simple test cases to cover the first and the third point: we will verify that some examples are parsed and we will verify that the AST produced is the one we want.

It is a bit cumbersome to verify that the AST produced is the one you want. There are different ways to do that but in this case I chose to generate a string representation of the AST and verify it is the same as the one expected. It is an indirect way of testing the AST is the one I want but it is much easier for simple cases like this one.

This is how we produce a string representation of the AST:

And these are some test cases:

Simple, isn’t it?


We have seen how to build a simple lexer and a simple parser. Many tutorials stop there. We are instead going to move on and build more tools from our lexer and parser. We laid the foundations, we now have to move to the rest of the infrastructure. Things will start to get interesting.

In the next post we will see how to build an editor with syntax highlighting for our language.

Getting started with ANTLR: building a simple expression language

This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified into this book titled
How to create pragmatic, lightweight languages

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very few material on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will permit to define variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the usage of parenthesis

Examples of a valid file:

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • use Gradle as our build system
  • write the code in Kotlin. It will be very basic Kotlin, given I just started learning it.

Setup the project

Our build.gradle file will look like this

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.

Testing the lexer

Testing is always important but while building languages it is absolutely critical: if the tools supporting your language are not correct this could affect all possible programs you will build for them. So let’s start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we aspect.

Conclusions and next steps

We started with the first small step: we setup the project and built the lexer.

There is a long way in front of us before making the language usable in practice but we started. We will next work on the parser with the same approach: building something simple that we can test and compile through the command line.

On the need of a generic library around ANTLR: using reflection to build a metamodel

I am a Language Engineer: I use several tools to define and process languages.

Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

  • ANTLR is a very good building block but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST and I do not see an ecosystem of libraries around ANTLR
  • ANTLR does not produce a metamodel of the grammar: without it becomes very difficult to build generic tools around ANTLR

Let me explain that:

  • For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
  • For the others: read next paragraph

Why we need a metamodel

Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

Well, given a ParseRuleContext I can take the rule index and find the name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

Let’s look at the class Single_inputContext:

I should obtain something like this:

Good. It is very easy for me to look at the class and recognize these elements, however how can I do that automatically?

Reflection, obviously, you will think.

Yes. That would work. However what if when we have multiple elements? Take this class:

To define metamodels I would not try to come up anything fancy. I would use the classical schema which is at the base of EMF and it is similar to what it is available in MPS.

I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those entity as the root Entity.

Each Entity would have:

  • a name
  • an optional parent Entity (from which it inherits properties and relations)
  • a list of properties
  • a list of relations

Each Property would have:

  • a name
  • a type chosen among the primitive type. In practice I expect to use just String and Integers. Possibly enums in the future
  • a multiplicity (1 or many)

Each Relation would have:

  • a name
  • the kind: containment or reference. Now, the AST knows only about containments, however later we could implement symbol resolution and model transformations and at that stage we will need references
  • a target type: another Entity
  • a multiplicity (1 or many)

Next steps

I would start building a metamodel and later building generic tools taking advantage of the metamodel.

There are other things that typically need:

  • transformations: the AST which I generally get from ANTLR is determined by how I am force to express the grammar to obtain something parsable. Sometimes I have also to do some refactoring to improve performance. I want to transform the AST after parsing to obtain closer to the logical structure of the language.
  • unmarshalling: from the AST I want to produce the test back
  • symbol resolution: this could be absolutely not trivial, as I have found out building a symbol solver for Java

Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally Xtext is coming with a lot of dependencies.

Do not get my wrong: I think Xtext is an amazing solution in a lot of contexts. However there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

By the way years ago I built something similar for .NET and I called it NetModelingFramework.

ANTLR and Jetbrains MPS: Parsing files and display the AST using the tree notation

Itemis did it again: they just released a new very cool plugin for Jetbrains MPS. This one permits to define new tree editors.

They look like this:


In this post we are going to see:

  • how to use ANTLR parsers inside MPS
  • how to represent the parsed AST using the tree notation

In particular we are going to use the ANTLR grammar which parses… ANTLR grammars. How meta is that? The very same approach could be used for every ANTLR grammar, of course.

Also always code is is available on GitHub.


First of all you need to install Jetbrains MPS. Grab your free copy here.

To use the tree notations you should install the mbeddr platform. Just go here, download a zip and unzip it among the plugins of your MPS installation.

All set, time to do some programming.

Packaging ANTLR to be used inside MPS

In a previous post we discussed how to use an existing ANTLR grammar in Java projects using Gradle. We will apply that technique also here.

We start by download the grammar from here: https://github.com/antlr/grammars-v4/tree/master/antlr4

We just do some minor changes by including directly LexBasic into ANTLRv4Lexer. Note that we need also the LexerAdaptor.

For simplifying the usage we create a Facade:

Now we need a build file:

You may want to run:

  • gradle idea to create a Jetbrains IDEA project
  • gradle fatJar to create a Jar which will contain our compiled code and all the dependencies

Good. Now to use this parser into MPS we start by creating a project. In the wizard we select also the runtime and sandbox options. Once we have done that we should copy our fat jar under the models directory of the runtime solution. In my case I run from the directory of the Java project this command:

Now we need to make MPS aware of that Jar. Lets’s select the sandbox solution and first add the jar to the models:


Then we add it also to the libraries:


Now the content of the JAR should appear among the stubs of the runtime solution.



Creating MPS nodes from AST nodes

Now we are going to build a new concept named AntlrImporter. We will use it to select and import ANTLR grammars into MPS:


The Concept structure will be pretty simple:


We need also concepts for the AST nodes we are going to import. First of all, we will define the abstract concept AstNode. Then we will define two subconcepts for the terminal and non-terminal AST nodes.


Now let’s take a look at the editor for the AntlrImporter.


The first swing component is a button which opens a file chooser. In this way, we can easily select a file and set the property path. Or we can edit it manually if we prefer.



Once we have selected a File we can import it by clicking on the second button


The import logic is in importModel, a method in the behavior of AntlrImporter.


Good. That is it. With that we can parse any ANTLR grammar and get it into MPS. Now we have just to use a nice representation. We are going for the tree notation.

Using the tree notation

The tree notation is surprising easily to use.

Let’s start by adding com.mbeddr.mpsutil.treenotation.styles.editor to the dependencies of the editor aspect of our language.


We will need also the com.mbeddr.mpsutil.treenotation to be among the used languages.


The editor for NonTerminalNode consists of a single tree cell. The top part of the tree cell represents this node. We will use the ruleName to represent it. In the bottom part instead we should pick the relation contains the children to be displayed in the tree


We can put the cursor on the tree drawing between the top and the bottom part (the “/|\” symbol) and open the inspector. There we can use style attributes to customize the appearance of the tree


We just decide to show the tree from left-to-right instead that top down. Then we decide to add more spaces between the parent and the children when there are too many children. In this way the lines to not overlap too much.

This is how it looks without the property


This is how it looks with the property set


There are other properties that can be used to control the color and the thickness of the lines, for example. Or you could add shapes at the extremes of the lines. For now we do not need these features, but it is nice to know they are there.

The editor for TerminalNode is very simple



Over the years MPS became more stable and easier to use. It has reached the point at which you can be very productive using it. Projectional editing is an idea that has been around for a while and there are other implementations available like the Whole Platform. However MPS has reached a very high level of maturity.

What I think we still miss are:

  • processes and best practices: how should we manage dependencies with other MPS projects? How should we integrate with Java libraries?
  • examples: there are surprisingly few applications which are publicly available. After all, many users develop DSLs for their specific usages and do not intend to share them. However, this means we have few opportunities to learn from each other
  • extensions: the Mbeddr team is doing an amazing job providing a lot of goodies as part of the Mbeddr platform. However, they seem the only ones producing reusable components and sharing them

I think this is now time to understand together what we can achieve with projectional editing. In my opinion these are going to be very interesting times.

If I have to express one wish is that I would like to hear more about how others are using MPS. If you are out there, please knock. And leave a comment 🙂

Turin Programming Language for the JVM: building advanced lexers with ANTLR

As I wrote in my last post, I recently started working on a new programming language named Turin. A working compiler for an initial version of the languag is available on GitHub. I am currently improving the language and working on a Maven and an IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax to a representation easier to process. For example we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
  3. Perform semantic checks. For example we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequences of tokens, while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why using ANTLR?

ANTLR is a very mature tool for writing lexer and parsers. It can generate code for several languages and has decent performance. It is well mantained and I was sure it had all the features I could possible need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes possible to write simple grammars because it solves left recursive definition for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we will look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have ANTLR while building a framework for Model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason for looking into alternatives.

The current lexer grammar

This is the current version of the lexer grammar.

A few choices I have done:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This permits to have less ambiguity in the grammar because values and types can be easily distinguished. In Java instead when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parenthesis) or a cast to the type foo. We need to look at what follows to understand it. In my opinion this is rather stupid because in practice everyone is using capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage from it
  • newlines are relevant in Turin, so we have tokens for them we basically want to have statements terminated by newlines but we accept optional newlines after commas
  • whitespaces (but newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but we can retrieve them when needed. For example we need them for syntax highlighting and in general for the IntelliJ plugin because it requires to define tokens for each single character in the source file, without gaps
  • the most tricky part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
  • I had to collapse the relational operators in one single token type, so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation need to be executed. Nothing too awful but you have to know how to fix these kinds of issues.

Testing the lexer

I wrote a bunch of tests specific for the lexer. In particular I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

As you can see I just test the token on a string and verify it produces  the correct list of tokens. Easy and straight to the point.


My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators in a single token type is not nice. Having to repeat several token definitions for different lexer modes is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not elegant as desired but there is one. So I can use it and move on on more interesting parts of the compiler.

Develop DSLs for Eclipse and IntelliJ using Xtext

In this post we are going to see how to develop a simple language. We will aim to get:

  • a parser for the language
  • an editor for IntelliJ. The editor should have syntax highlighting, validation and auto-completion

We would also get for free an editor for Eclipse and web editor, but please contain your excitement, we are not going to look into that in this post.

In the last year I have focused on learning new stuff (mostly web and ops stuff) but one of the things I still like the most is to develop DSLs (Domain Specific Languages). The first related technology I played with was Xtext: Xtext is a fantastic tool that let you define the grammar of your language and generate amazing editors for such language. Until now it has been developed only for the Eclipse platform: it means that new languages could be developed using Eclipse and the resulting editors could then be installed in Eclipse.

Lately I have been using far less Eclipse and so I my interest in Xtext faded until now, when finally the new release of Xtext (still in beta) is targeting IntelliJ. So while we will develop our language using Eclipse, we will then generate plugins to use our language both in IntelliJ.

The techniques we are going to see can be used to develop any sort of language, but we are going to apply them to a specific case: AST transformations. This post is intended for Xtext newbies and I am not going in many details for now, I am just sharing my first impression of the IntelliJ target. Consider that this functionality is currently a beta, so we could expect some rough edges.

The problem we are trying to solve: adapt ANTLR parsers to get awesome ASTs

I like playing with parsers and ANTLR is a great parser generator. There are beatiful grammars out there for full blown languages like Java. Now, the problem is that the grammars of languages like Java are quite complex and the generated parsers produce ASTs that are not easy to use. The main problem is due to how precedence rules are handled. Consider the grammar for Java 8 produced by Terence Parr and Sam Harwell. Let’s look at how some expressions are defined:

This is just a fragment of the large portion of code used to define expressions. Now consider you have a simple preIncrementExpression (something like: ++a). In the AST we will have node of type preIncrementExpression that will be contained in an unaryExpression. The unaryExpression will be contained in a multiplicativeExpression, which will be contained in an additiveExpression and so on and so forth. This organization is necessary to handle operator precedence between the different kind of operations, so that 1 + 2 * 3  is parsed as a sum of 1 and 2 * 3 instead of a multiplication of 1 + 2  and 3. The problem is that from the logical point of view multiplications and additions are expressions at the same level: it does not make sense to have Matryoshka AST nodes.

Consider this code:

The AST produced by this grammar is:

While we would like something like:

Ideally we want to specify grammars that produce the Matryoshka-style of ASTs but using a more flat ASTs when doing analysis on the code, so we are going to build adapters from the ASTs as produced by Antlr and the “logical” ASTs.

How do we plan to do that? We will start by developing a language defining the shape of nodes as we want them to appear in the logical ASTs and we will also define how to map the Antlr nodes (the Matryoshka-style nodes) into these logical nodes.

This is just the problem we are trying to solve: Xtext can be used to develop any sort of language, is just that being a parser maniac I like to use DSLs to solve parser related problems. Which is very meta.

Getting started: installing Eclipse Luna DSL and create the project

We are going to download a version of Eclipse containing the beta of Xtext 2.9.

In your brand new Eclipse you can create a new type of projects: Xtext Projects.

Screenshot from 2015-06-01 09:44:03

We just have to define the name of the project and pick an extension to be associated with our new language

Screenshot from 2015-06-01 09:45:14

And then we select the platforms that we are interested into (yes, there is also the web platform… we will look into that in the future)

Screenshot from 2015-06-01 09:47:27

The project created contains a sample grammar. We could use it as is, we would have just to generate a few files running the MWE2 file.


After running this command we could just use our new plugin in IntelliJ or in Eclipse. But we are going instead to first change the grammar, to transform the given example in our glorious DSL.

An example of our DSL

Our language will look like this in IntelliJ IDEA (cool, eh?).

Screenshot from 2015-06-02 19:42:14

Of course this is just a start but we are start defining some basic node types for a Java parser:

  • an enum representing the possible modifiers (warning: this is not a complete list)
  • the CompilationUnit which contains an optional PackageDeclaration and possibly many TypeDeclarations
  • TypeDeclaration is an abstract node and there are three concrete types extending it: EnumDeclaration, ClassDeclaration and InterfaceDeclaration (we are missing the annotation declaration)

We will need to add tens of expressions and statements but you should get an idea of the language we are trying to build.

Note also that we have a reference to an Antlr grammar (in the first line) but we are not yet specifying how our defined node types maps to the Antlr node types.

Now the question is: how do we build it?

Define the grammar

We can define the grammar of our language with a simple EBNF notation (with a few extensions). Look for a file with the xtext extension in your project and change it like this:

The first rule we define corresponds to the root of the AST (Model in our case). Our Model starts with a reference to an Antlr file and a list of Declarations. The idea is to specify declarations of our “logical” node types and how the “antlr” node types should be mapped to them. So we will define transformations that will have references to element defined… in the antlr grammar that will we specify in the AntlrGrammarRef rule.

We could define either Enum or NodeType. The NodeType has a name, can be abstract and can extends another NodeType. Note that the supertype is a reference to a NodeType. It means that the resulting editor will automatically be able to gives us auto-completion (listing all the NodeTypes defined in the file) and validation, verifying we are referring to an existing NodeType.

In our NodeTypes we can defined as many fields as we want (NodeTypeField). Each field starts with a name, followed by an operator:

  • *= means we can have 0..n values in this field
  • ?= means that the field is optional (0..1) value
  • means that exactly one value is always present

The NodeTypeField have also a value type which can be an enum defined inline (UnnamedEnumDeclaration), a relation (it means this node contains other nodes) or an attribute (it means this node has some basic attributes like a string or a boolean).

Pretty simple, eh?

So we basically re-run the MWE2 files and we are ready to go.

See the plugin in action

To see our plugin installed in IntelliJ IDEA we have just to run gradle runIdea from the directory containing the idea plugin (me.tomassetti.asttransf.idea in our case). Just note that you need a recent version of gradle and you need to define JAVA_HOME. This command will download IntelliJ IDEA, install the plugin we developed and start it. In the opened IDE you can create a new project and define a new file. Just use the extension we specified when we created the project (.anttr in our case) and IDEA should use our newly defined editor.

Currently validation is working but the editor seems to react quite slowly. Auto-completion is instead broken for me. Consider that this is just a beta, so I expect these issues to disappear before Xtext 2.9 is released.

Next steps

We are just getting started but it is amazing how we can have a DSL with its editor for IDEA working in a matter of minutes.

I plan to work in a few different direction:

  • We need to see how to package and distribute the plugin: we can try it using gradle runIdea but we want to just produce a binary for people to install it without having to process the sources of the editor
  • Use arbitrary dependencies from Maven: this is going to be rather complicate because Maven and the Eclipse plugin (OSGi bundles) define their dependencies in their own way, so jars have to be typically be packaged into bundles to being used in Eclipse plugins. However there are alternatives like Tycho and the p2-maven-plugin. Spoiler: I do not expect this one too be fast and easy…
  • We are not yet able to refer to elements defined in the Antlr grammar. Now, it means that we should be able to parse the Antlr grammar and create programmatically EMF models, so that we can refer it in our DSL. It require to know EMF (and it gets some time…). I am going to play with that in the future and this will probably require a loooong tutorial.


While I do not like Eclipse anymore (now I am used to IDEA and it seems to me so much better: faster and lighter) the Eclipse Modeling Framework keeps being a very interesting piece of software and be able to use it with IDEA is great.

It was a while that I was not playing with EMF and Xtext and I have to say that I have seen some improvements. I had the feeling that Eclipse was not very command-line friendly and it was in general difficult to integrate it with CI systems. I am seeing an effort being done for fixing these problems (see Tycho or the gradle job we have used to start IDEA with the editor we developed) and it seems very positive to me.

Mixing technologies, combining the best aspects of different worlds in a pragmatic way is my philosophy, so I hope to find the time to play more with this stuff.