
Kanvas: generating a simple IDE from your ANTLR grammar

What is an editor?

For me, an editor is the main tool I use for work. As a Language Engineer I create new languages, I use existing ones, and I need different tools to work with them. I would like to be able to hack all of them together in a customized IDE that I can grow to fit my needs. This is why I am working on Kanvas, the hackable editor. Which is on GitHub, of course.

In many cases I need a simple text editor for my DSLs, and I tend to build them using ANTLR. I will need other things, like tabular or graphical projections, simulators and more, but I need to start somewhere, right? Also, I think that right now there is no easy way to get a standalone editor for a DSL, with minimal dependencies and a simple structure. There is no light option on the menu. Time to add one.

Getting an editor from your grammar quickly

Once you define the grammar of your language there is a lot of information you can extract from it. I think you should be able to get as much value as possible from it for free, with the possibility of customizing it further if needed. This is similar to the idea behind Xtext (minus the 400 pages you need to read to understand EMF).

How quickly can you get an editor for your ANTLR grammar? You create a new project for your editor, add Kanvas as a dependency, and register which languages you intend to support:
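    // a sketch: languageSupportRegistry is assumed to be the Kanvas entry point
    // for this, and "sm" to be the file extension of our example language
    fun main(args: Array<String>) {
        languageSupportRegistry.register("sm", smLanguageSupport)
        // ... then launch the editor as usual
    }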

and add these lines to support your language:
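    // a sketch: BaseLanguageSupport, AntlrLexerFactory and ParserData are the types
    // Kanvas expects; take the exact signatures as assumptions
    import org.antlr.v4.runtime.ANTLRInputStream
    import org.antlr.v4.runtime.Lexer

    object smLanguageSupport : BaseLanguageSupport() {
        // how to instantiate our ANTLR lexer for a piece of code
        override val antlrLexerFactory: AntlrLexerFactory
            get() = object : AntlrLexerFactory {
                override fun create(code: String): Lexer = SMLexer(ANTLRInputStream(code))
            }
        // what Kanvas needs from the parser, e.g. to drive autocompletion
        override val parserData: ParserData?
            get() = ParserData(SMParser.ruleNames, SMParser.VOCABULARY, SMParser._ATN)
    }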

This quickly. Less than 10 lines of code. We just need to specify the Lexer and Parser classes (SMLexer and SMParser in this example).

If you are wondering what language that is: it is Kotlin, a concise statically typed language for the JVM, easily interoperable with Java.

Let’s improve it a little bit: syntax highlighting

So I have a simple language, I get an editor basically for free, and I start using it. The first thing I want to do is define the style for the different kinds of tokens. We are doing something simple, just setting the colors:
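    // a sketch: map each token type to a foreground color;
    // the SM token names and the colors are assumptions
    import java.awt.Color
    import org.fife.ui.rsyntaxtextarea.Style
    import org.fife.ui.rsyntaxtextarea.SyntaxScheme

    val smSyntaxScheme = object : SyntaxScheme(true) {
        override fun getStyle(index: Int): Style {
            val style = Style()
            style.foreground = when (index) {
                // keywords
                SMLexer.SM, SMLexer.INPUT, SMLexer.VAR, SMLexer.EVENT -> Color(42, 132, 172)
                // literals
                SMLexer.STRINGLIT, SMLexer.INTLIT, SMLexer.DECLIT -> Color(42, 172, 105)
                else -> Color.BLACK
            }
            return style
        }
    }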

We are not setting certain tokens to be bold or italic because we want to keep things simple. By the way, if you are interested in how syntax highlighting works in Kanvas, I described it in this post.

And then comes autocompletion

Now, we get some limited autocompletion for free. We basically get autocompletion based on the structure of the language, so our algorithm can tell us which keywords can be inserted at the current position, or that an identifier can be accepted at a certain position. What the algorithm cannot determine for free is which identifiers it should suggest. Let’s implement a very simple logic: when we can insert an identifier, we look at the preceding tokens and use them to determine which suggestion to make. For example, when defining an input we could suggest “anInput”, while when defining a variable we could suggest “aVar”:
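    // a sketch: given the tokens preceding the insertion point, propose a placeholder
    // name; the exact hook Kanvas offers for this may differ
    import org.antlr.v4.runtime.Token

    fun suggestIdentifier(precedingTokens: List<Token>): String =
        when (precedingTokens.lastOrNull()?.type) {
            SMLexer.INPUT -> "anInput"
            SMLexer.VAR -> "aVar"
            else -> "anIdentifier"
        }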

Is this enough? I do not know, but what I do know is that this is a system small enough to be understandable and simple enough to be easily extended and customized. So I plan to use it for this small language, and improve the autocompletion as needed, specifically for this language. Organically and iteratively growing tool support is the name of the game.

Design goals: something similar to Sublime Text but open-source

We all love Sublime Text. I would like to have something inspired by it, but open-source. Why open-source? So that I can customize it as much as I want.

This is how it looks for now:

Yes, it is not yet as beautiful as Sublime Text. But this means there is room for improvement.

To Language Workbench or not to Language Workbench?

I work routinely with Language Workbenches such as JetBrains MPS and Xtext. They are great because they make it possible to obtain very good tool support very quickly. In many situations they are your best option. However, as with every engineering choice, there are different aspects to consider. JetBrains MPS and Xtext are very large and complex pieces of software, the kind of software that weighs hundreds of MBs. Learning the internals of these platforms requires a lot of work and a large effort. You can get a huge benefit by simply using those platforms. However, they are not the best solution in all situations, because sometimes you need to integrate your language with existing systems, and thus you have to bend those Language Workbenches in ways they were not designed for. Maybe you want to embed your editor or tools in your existing platform, maybe you want a simple editor to use on a tablet, maybe you want tools to use from the command line. Maybe you want to hack a system together to fit your particular needs in some peculiar way. In those cases using a Language Workbench is not the right choice. You need something simple, something hackable. This is the approach I am experimenting with. To do that I am working on a few open-source projects and writing a book.

Conclusions

Will this fly? I do not know. I am having fun spending what little time I find on this project. And I feel it can be a good approach for getting simple standalone editors for DSLs built with ANTLR. I would also like to use it as my sort of Kotlin-powered vim, a vim for the new millennium. With super-projectional-powers. Let’s see how this grows.

And yes, I know that Atom describes itself as the hackable editor. But it is not hackable enough from my point of view.

Building a compiler for your own language: from the parse tree to the Abstract Syntax Tree

From the parse tree to the abstract syntax tree

In this post we are going to see how to process and transform the information obtained from the parser. The ANTLR parser recognizes the elements present in the source code and builds a parse tree. From the parse tree we will obtain the Abstract Syntax Tree, which we will use to perform validation and produce compiled code.

Note that the terminology can vary: many would call the tree obtained from ANTLR an Abstract Syntax Tree. I prefer to mark the difference between these two steps. To me the parse tree is the information as it is meaningful to the parser; the abstract syntax tree is that information reorganized to better support the next steps.

This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Building an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it in the book How to create pragmatic, lightweight languages.

Code

Code is available on GitHub under the tag 05_ast.

Spice up our language

In this series of posts we have been working on a very simple language for expressions. It is time to make our language slightly more complex by introducing:

  • casts, for example: 10 as Decimal or (1 * 2.45) as Int
  • a print statement, for example: print(3 + 11)

To do so we need to revise our lexer and parser grammars. The syntax highlighting and autocompletion we built in previous posts will just keep working.

The new lexer grammar:
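    // a reconstruction sketched from the series so far; details may differ
    // from the grammar in the repository (tag 05_ast)
    lexer grammar SandyLexer;

    // Whitespace
    NEWLINE            : '\r\n' | '\r' | '\n' ;
    WS                 : [\t ]+ ;

    // Keywords
    VAR                : 'var' ;
    PRINT              : 'print' ;
    AS                 : 'as' ;
    INT                : 'Int' ;
    DECIMAL            : 'Decimal' ;

    // Literals
    INTLIT             : '0'|[1-9][0-9]* ;
    DECLIT             : ('0'|[1-9][0-9]*) '.' [0-9]+ ;

    // Operators
    PLUS               : '+' ;
    MINUS              : '-' ;
    ASTERISK           : '*' ;
    DIVISION           : '/' ;
    ASSIGN             : '=' ;
    LPAREN             : '(' ;
    RPAREN             : ')' ;

    // Identifiers
    ID                 : [_]*[a-z][A-Za-z0-9_]* ;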

And the new parser grammar:
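    // a reconstruction; note the # labels naming the alternatives, and the new
    // typeConversion (cast) and print rules
    parser grammar SandyParser;

    options { tokenVocab=SandyLexer; }

    sandyFile : lines=line+ ;

    line : statement (NEWLINE | EOF) ;

    statement : varDeclaration # varDeclarationStatement
              | assignment     # assignmentStatement
              | print          # printStatement ;

    print : PRINT LPAREN expression RPAREN ;

    varDeclaration : VAR assignment ;

    assignment : ID ASSIGN expression ;

    expression : left=expression operator=(DIVISION|ASTERISK) right=expression # binaryOperation
               | left=expression operator=(PLUS|MINUS) right=expression        # binaryOperation
               | value=expression AS targetType=type                           # typeConversion
               | LPAREN expression RPAREN                                      # parenExpression
               | ID                                                            # varReference
               | MINUS expression                                              # minusExpression
               | INTLIT                                                        # intLiteral
               | DECLIT                                                        # decimalLiteral ;

    type : INT     # integer
         | DECIMAL # decimal ;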

The Abstract Syntax Tree metamodel

The Abstract Syntax Tree metamodel is simply the structure of the data we want to use for our Abstract Syntax Tree (AST). In this case we define it by writing the classes we will use for our AST.

The AST metamodel will look reasonably similar to the parse tree metamodel, i.e., the set of classes generated by ANTLR to contain the nodes.

There will be a few key differences:

  • it will have a simpler and nicer API than the classes generated by ANTLR (i.e., the classes composing the parse tree). In the next sections we will see how this API makes it possible to perform transformations on the AST
  • we will remove elements which are meaningful only while parsing but are logically useless: for example, the parenthesis expression or the line node
  • some nodes for which we have separate instances in the parse tree can correspond to a single instance in the AST. This is the case of the type references Int and Decimal which in the AST are defined using singleton objects
  • we can define common interfaces for related node types like BinaryExpression
  • to define how to parse a variable declaration we reuse the assignment rule. In the AST the two concepts are completely separated
  • certain operations have the same node type in the parse tree but are separated in the AST. This is the case of the different types of binary expressions

Let’s see how we can define our AST metamodel using Kotlin.

We start by defining Node. A Node represents every possible node of an AST, and it is general: it could be reused for other languages as well. All the rest is instead specific to the language (Sandy in our case). In our specific language we need three important interfaces:

  • Statement
  • Expression
  • Type

Each of these interfaces extends Node.
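In code, the foundation could look like this:

    interface Node

    interface Statement : Node
    interface Expression : Node
    interface Type : Node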

We then declare the two types we use in our language. They are defined as singleton objects: this means that we have just one instance of each of these classes.
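Using Kotlin’s object declarations (the names IntType and DecimalType are my choice):

    object IntType : Type
    object DecimalType : Type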

We then have the BinaryExpression interface, which extends Expression. Four classes implement it, one for each of the basic arithmetic operations.
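A sketch of these classes (the property names are chosen to match the mapping code shown later):

    interface BinaryExpression : Expression {
        val left: Expression
        val right: Expression
    }

    data class SumExpression(override val left: Expression, override val right: Expression) : BinaryExpression
    data class SubtractionExpression(override val left: Expression, override val right: Expression) : BinaryExpression
    data class MultiplicationExpression(override val left: Expression, override val right: Expression) : BinaryExpression
    data class DivisionExpression(override val left: Expression, override val right: Expression) : BinaryExpression

    // the cast we introduced in this post is also an expression with another node as child
    data class TypeConversion(val value: Expression, val targetType: Type) : Expression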

Most of the expressions have other nodes as children. A few instead have simple values. They are VarReference (which has a property varName of type String), and IntLit and DecLit (both have a property value of type String).
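Sketched as data classes:

    data class VarReference(val varName: String) : Expression
    data class IntLit(val value: String) : Expression
    data class DecLit(val value: String) : Expression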

Finally we have the three classes implementing Statement.
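Here they are, together with the root of the AST, which simply holds the list of statements:

    data class SandyFile(val statements: List<Statement>) : Node

    data class VarDeclaration(val varName: String, val value: Expression) : Statement
    data class Assignment(val varName: String, val value: Expression) : Statement
    data class Print(val value: Expression) : Statement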

Note that we are using data classes, so we get the hashCode, equals and toString methods for free. Kotlin also generates constructors and getters for us. Try to imagine how much code that would be in Java.

Mapping the parse tree to the abstract syntax tree

Let’s see how we can get the parse tree, produced by ANTLR, and map it into our AST classes.
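A sketch of the mapping code; the context class names are the ones ANTLR derives from the parser rules and from the # labels shown in the grammar above:

    // assumes the ANTLR-generated context classes are imported, e.g. SandyParser.*

    fun SandyFileContext.toAst(): SandyFile = SandyFile(this.line().map { it.statement().toAst() })

    fun StatementContext.toAst(): Statement = when (this) {
        is VarDeclarationStatementContext -> VarDeclaration(
                varDeclaration().assignment().ID().text,
                varDeclaration().assignment().expression().toAst())
        is AssignmentStatementContext -> Assignment(assignment().ID().text, assignment().expression().toAst())
        is PrintStatementContext -> Print(print().expression().toAst())
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }

    // the unary minus alternative is left out of this sketch
    fun ExpressionContext.toAst(): Expression = when (this) {
        is BinaryOperationContext -> toAst()
        is TypeConversionContext -> TypeConversion(value.toAst(), targetType.toAst())
        is ParenExpressionContext -> expression().toAst()
        is VarReferenceContext -> VarReference(text)
        is IntLiteralContext -> IntLit(text)
        is DecimalLiteralContext -> DecLit(text)
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }

    fun BinaryOperationContext.toAst(): Expression = when (operator.text) {
        "+" -> SumExpression(left.toAst(), right.toAst())
        "-" -> SubtractionExpression(left.toAst(), right.toAst())
        "*" -> MultiplicationExpression(left.toAst(), right.toAst())
        "/" -> DivisionExpression(left.toAst(), right.toAst())
        else -> throw UnsupportedOperationException("Operator ${operator.text}")
    }

    fun TypeContext.toAst(): Type = when (this) {
        is IntegerContext -> IntType
        is DecimalContext -> DecimalType
        else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
    }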

To implement this we have taken advantage of three very useful features of Kotlin:

  • extension methods: we added the method toAst to several existing classes
  • the when construct, which is a more powerful version of switch
  • smart casts: after we check that an object has a certain class, the compiler implicitly casts it to that type so that we can use the specific methods of that class

We could come up with a mechanism to derive this mapping automatically for most of the rules, and just customize it where the parse tree and the AST differ. To avoid using too much reflection black magic we are not going to do that for now. If I were using Java I would just go down the reflection road to avoid having to write a lot of redundant and boring code by hand. However, using Kotlin this code is compact and clear.

Testing the mapping

Of course we need to test this stuff. Let’s see if the AST we get for a certain piece of code is the one we expect.
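For example, a test along these lines (SandyParserFacade is an assumed helper that invokes the ANTLR parser and returns the root of the parse tree):

    import org.junit.Test
    import kotlin.test.assertEquals

    @Test fun mapSimpleFileWithVarDeclarationAndAssignment() {
        val code = """var a = 1 + 2
                     |a = 7 * (2 / 3)""".trimMargin()
        val ast = SandyParserFacade.parse(code).root!!.toAst()
        val expectedAst = SandyFile(listOf(
                VarDeclaration("a", SumExpression(IntLit("1"), IntLit("2"))),
                Assignment("a", MultiplicationExpression(
                        IntLit("7"),
                        DivisionExpression(IntLit("2"), IntLit("3"))))))
        assertEquals(expectedAst, ast)
    }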

This is all very nice: we have a clean model of the information present in the code. The metamodel and the mapping code look very simple and clear. However, we need to add a little detail: the position of the nodes in the source code. This will be needed when showing errors to the user. We want to have the possibility of specifying the positions of our AST nodes, but we do not want to be forced to do so. In this way, depending on the operations we need to perform, we can consider or ignore the positions. Consider the tests we have written so far: wouldn’t it be cumbersome and annoying to have to specify fake positions for all the nodes? I think so.

This is the new Node definition and a few supporting classes (sketched, consistent with the definitions above):
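    // position is optional: tests can omit it, error reporting can use it
    interface Node {
        val position: Position?
    }

    data class Point(val line: Int, val column: Int)
    data class Position(val start: Point, val end: Point)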

We also need to add position as an optional parameter to all the classes, with the default value null. For example, this is how SandyFile looks now:
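    data class SandyFile(val statements: List<Statement>,
                         override val position: Position? = null) : Node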

The parse tree contains the information organized in the way most convenient for the parser. That is typically not the most convenient way for the steps which follow. Think about the variable declaration rule being implemented by reusing the assignment rule: sure, this makes the grammar shorter and it makes sense for the parse tree. However, from the logical point of view the two elements are separate, and in the AST they indeed are.

Most of the rest of our tools will operate on the AST so it is better to spend some time working on an AST that makes sense.

How to create an editor with syntax highlighting for your language using ANTLR and Kotlin

What we are going to build

In this post we are going to see how to build a standalone editor with syntax highlighting for our language. The syntax highlighting feature will be based on the ANTLR lexer we built in the first post. The code will be in Kotlin; however, it should be easily convertible to Java. The editor will be named Kanvas.

Implementing syntax highlighting in the editor for your DSL

This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Building an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it in the book How to create pragmatic, lightweight languages.

Code

Code is available on GitHub. The code described in this post is associated with the tag syntax_highlighting.

Setup

We are going to use Gradle as our build system.
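A sketch of the build.gradle file (version numbers are illustrative, and the layout of the generated sources is an assumption):

    buildscript {
        ext.kotlin_version = '1.0.5'
        repositories { mavenCentral() }
        dependencies { classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version" }
    }

    apply plugin: 'kotlin'
    apply plugin: 'antlr'
    apply plugin: 'idea'

    repositories { mavenCentral() }

    dependencies {
        antlr "org.antlr:antlr4:4.5.1"
        compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
        // the editor component
        compile "com.fifesoft:rsyntaxtextarea:2.6.0"
    }

    // generate the lexer outside of build/, then register it as a source directory,
    // both for Gradle and for the generated IDEA project
    generateGrammarSource {
        outputDirectory = file('generated-src/antlr/main')
    }
    sourceSets.main.java.srcDirs += 'generated-src/antlr/main'
    idea { module { sourceDirs += file('generated-src/antlr/main') } }

    // ./gradlew clean also deletes the generated code
    clean { delete 'generated-src' }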

This should be pretty straightforward. A few comments:

  • our editor will be based on the RSyntaxTextArea component so we added the corresponding dependency
  • we are specifying the version of Kotlin we are using. You do not need to install Kotlin on your system: Gradle will download the compiler and use it based on this configuration
  • we are using ANTLR to generate the lexer from our lexer grammar
  • we are using the idea plugin: by running ./gradlew idea we can generate an IDEA project. Note that we add the generated code to the list of source directories
  • when we run ./gradlew clean we want also to delete the generated code

Changes to the ANTLR lexer

When using a lexer to feed a parser we can safely ignore some tokens. For example, we may want to ignore spaces and comments. When using a lexer for syntax highlighting, instead, we want to capture all possible tokens. So we need to slightly adapt the lexer grammar we defined in the first post by:

  • defining a second channel to which we will send the tokens which are useless for the parser but needed for syntax highlighting
  • adding a new rule named UNMATCHED to capture all characters which do not belong to correct tokens. This is necessary both because invalid tokens need to be displayed in the editor and because, while typing, the user will temporarily introduce invalid tokens all the time

The parser will consider only tokens in the default channel, while we will use the default channel and the EXTRA channel in our syntax highlighter.

This is the resulting grammar.
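The relevant part is sketched below; the rules we did not touch stay as in the first post:

    lexer grammar SandyLexer;

    channels { EXTRA }

    // whitespace is no longer thrown away: it goes to the EXTRA channel,
    // where the syntax highlighter can still see it
    WS : [\t ]+ -> channel(EXTRA) ;

    // ... the keyword, literal, operator and identifier rules are unchanged ...

    // catch-all rule: any character which does not belong to a correct token
    UNMATCHED : . ;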

Our editor

Our editor will support syntax highlighting for multiple languages. For now, I have implemented support for Python and for Sandy, the simple language we are working on in this series of posts. To implement syntax highlighting for Python I started from an ANTLR grammar for Python: I removed the parser rules, leaving only the lexer rules, and then I made very minimal changes to get the final lexer present in the repository.

To create the GUI we need some pretty boring Swing code. We basically need to create a frame with a menu. The frame will contain a tabbed pane. We will have one tab per file, and in each tab just one editor, implemented using an RSyntaxTextArea.
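A bare-bones sketch of that code:

    import javax.swing.JFrame
    import javax.swing.JMenu
    import javax.swing.JMenuBar
    import javax.swing.JTabbedPane
    import org.fife.ui.rsyntaxtextarea.RSyntaxTextArea
    import org.fife.ui.rtextarea.RTextScrollPane

    fun createAndShowKanvasGUI() {
        val frame = JFrame("Kanvas")
        frame.defaultCloseOperation = JFrame.EXIT_ON_CLOSE
        // a menu bar with a File menu (Open, Save, etc. would be attached here)
        frame.jMenuBar = JMenuBar().apply { add(JMenu("File")) }
        // one tab per file; each tab contains a single RSyntaxTextArea
        val tabs = JTabbedPane()
        tabs.addTab("<untitled>", RTextScrollPane(RSyntaxTextArea(20, 60)))
        frame.contentPane.add(tabs)
        frame.pack()
        frame.isVisible = true
    }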

Note that in the main function we register support for our two languages. In the future we could envision a pluggable system to support more languages.

The specific support for Sandy is defined in the object sandyLanguageSupport (for Java developers: an object is just a singleton instance of a class). The support needs a SyntaxScheme and an AntlrLexerFactory.
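A sketch of the object; LanguageSupport is the interface tying the two together, and the exact signatures are assumptions:

    import org.antlr.v4.runtime.ANTLRInputStream
    import org.antlr.v4.runtime.Lexer

    object sandyLanguageSupport : LanguageSupport {
        override val syntaxScheme: SyntaxScheme
            get() = sandySyntaxScheme // defined below
        override val antlrLexerFactory: AntlrLexerFactory
            get() = object : AntlrLexerFactory {
                override fun create(code: String): Lexer = SandyLexer(ANTLRInputStream(code))
            }
    }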

The SyntaxScheme just returns the style for each given token type. For example (the colors here are illustrative):
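    import java.awt.Color
    import org.fife.ui.rsyntaxtextarea.Style
    import org.fife.ui.rsyntaxtextarea.SyntaxScheme

    val sandySyntaxScheme = object : SyntaxScheme(true) {
        override fun getStyle(index: Int): Style {
            val style = Style()
            style.foreground = when (index) {
                SandyLexer.VAR -> Color(0x00, 0x00, 0xB2)                       // keyword
                SandyLexer.INTLIT, SandyLexer.DECLIT -> Color(0x00, 0x73, 0x00) // literals
                SandyLexer.UNMATCHED -> Color.RED                               // invalid characters
                else -> Color.BLACK
            }
            return style
        }
    }

Quite easy, eh?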

Now, let’s take a look at the AntlrLexerFactory. This factory just instantiates an ANTLR Lexer for a certain language. It can be used together with an AntlrTokenMaker. The AntlrTokenMaker is the adapter between the ANTLR Lexer and the TokenMaker used by the RSyntaxTextArea to process the text. Basically, a TokenMakerBase is invoked for each line of the file separately, as the line is changed. So we ask our ANTLR Lexer to process just that line, take the resulting tokens, and instantiate the TokenImpl instances which are expected by the RSyntaxTextArea.
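A sketch of the adapter, using the TokenMakerBase API from RSyntaxTextArea:

    import javax.swing.text.Segment
    import org.fife.ui.rsyntaxtextarea.Token
    import org.fife.ui.rsyntaxtextarea.TokenMakerBase

    class AntlrTokenMaker(private val lexerFactory: AntlrLexerFactory) : TokenMakerBase() {
        override fun getTokenList(text: Segment, initialTokenType: Int, startOffset: Int): Token {
            resetTokenList()
            // lex just this line with a fresh ANTLR lexer
            val lexer = lexerFactory.create(text.toString())
            var t = lexer.nextToken()
            while (t.type != org.antlr.v4.runtime.Token.EOF) {
                // translate each ANTLR token into the token representation
                // expected by the RSyntaxTextArea
                addToken(text, text.offset + t.startIndex, text.offset + t.stopIndex,
                        t.type, startOffset + t.startIndex)
                t = lexer.nextToken()
            }
            addNullToken() // marks the end of the line
            return firstToken
        }
    }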

This would be a problem for tokens spanning multiple lines. There are workarounds for that, but for now let’s keep things simple and not consider it, given that Sandy has no tokens spanning multiple lines.

Conclusions

Of course this editor is not feature complete. However, I think it is interesting to see how an editor with syntax highlighting can be built with a couple hundred lines of quite simple Kotlin code.

In future posts we will see how to enrich this editor by building features like autocompletion.