Quickly create DSLs with Langium

Langium is a language engineering tool designed to help you create DSLs and low-code platforms quickly. It is lightweight, based on Visual Studio Code, and lets you create a language and an editor in one step.

In this article we are going to take a look at this new tool and create an example language with Langium. This article is a quick review and tutorial of the tool from the perspective of people who already have experience with parsing. It is not a good introduction to parsing from scratch.

You can find the companion repository for this article on GitHub.

How Langium and Xtext are Similar

Langium is yet another interesting creation of TypeFox, a consulting and research company that created Eclipse Theia, Gitpod and Xtext. This new tool is similar to the last one mentioned: they are both tools designed to help create DSLs. They are also both built upon open-source libraries and tools: Xtext is built upon Eclipse and ANTLR, while Langium is built upon Visual Studio Code and Chevrotain.

An interesting side note is that Chevrotain itself is not a parser generator: it describes itself as a parser building toolkit. We reviewed it in our article about parsing in JavaScript. This means that the Langium developers created the generator step themselves. Basically, it is all just TypeScript code, so you could potentially alter the process to suit your needs.

Xtext and Langium also have a similar strategy: everything depends upon the grammar declaration of the language you are creating. For example, you can use the grammar to define some validation rules, like allowing a property dogs_name to accept only the names of dogs that have already been defined in the current file.

How Langium is Different

The differences between the two are due to the new development environment and new objectives.

Xtext is a tool built to design both traditional programming languages and DSLs. Langium, instead, helps you create DSLs or low-code languages. In other words, Xtext can aid you in creating ecosystems of languages of different complexity, while, at the moment, Langium helps you create individual, simple languages more quickly. Xtext is part of the EMF galaxy of tools, so it is compatible with other tools that use this technology. This is a powerful and widespread technology that powers other language engineering tools, but it is quite complex to work with.

Visual Studio Code has become the new standard for development environments. In fact, it is also the basis for Theia. There are certainly many other good options, but Visual Studio Code is good enough for most use cases.

So it is the safe choice for development tools. Since Visual Studio Code uses TypeScript, the whole Langium project also uses TypeScript. This avoids having a codebase in multiple languages or the need to integrate different tools or runtimes. This is also why Langium does not rely on an external parser generator: it just creates TypeScript code.

Is Langium the New Xtext?

An interesting result of these differences is that you can more easily integrate different languages created with Xtext or other technologies. That is because Xtext follows a classical structure of language execution (i.e., parsing creates an AST, which is based on the Eclipse Modelling Framework, etc.). So, if you are a language engineer, you know where to intervene to integrate different languages or tools that use EMF.

On the other hand, with Langium this is harder to do, because it all relies on TypeScript. For instance, the AST is based upon TypeScript interfaces. So, everything is just TypeScript code that is specific to Langium.

This is not a flaw per se, because it makes it easier and quicker to create a DSL with Langium, which better serves the objective of Langium. So, Langium does not replace Xtext: it serves a different purpose.

Or, to put it another way, the competition for Xtext is a series of custom-designed tools that cover part or all of the process of creating language tools, and that are either more efficient or better integrated with the rest of the project. For instance, instead of using Xtext in your workflow you could build just a parser used in a parsing service or opt to create a custom transpiler from scratch.

What Langium Can Replace

The alternatives to Langium are custom environments/applications (e.g., a custom desktop app) used by non-developers and developers alike. Some people might scoff at the idea, because Langium is based on VS Code, which is an IDE, that is to say a tool for developers.

However, VS Code is not complicated to use. In fact, it does not look more complex than a standard text editor. So, it can be used by non-developers. On the other hand, Xtext looks very complicated to the untrained eye. To be fair, Xtext allows you to create custom distributions of the IDE with just the elements of the UI that you need, but that requires some work.

Now that you understand the idea behind Langium, let’s see it at work.

Getting Started With Langium

The first pleasant surprise is how easy it is to start working with Langium. You just need to install Yeoman and the corresponding Yeoman generator:

npm install -g yo generator-langium

And then you can launch it:

yo langium

This will start the generator, which creates the skeleton of your language after you answer a few questions. This process will also create a langium-quickstart.md file with information about the structure of a Langium project.

Unfortunately that is also the extent of the documentation that is available for Langium. This project is still in the early stages of development, so sparse documentation is to be expected. At this point Langium is just a few months old.

The generated project also comes with a few example files, like one for a language definition and one for validation. There are also language examples in the GitHub repository of the project. Looking at the arithmetic example, you can see that, unlike ANTLR, Langium does not make it easy to handle expressions. You would have to create a cascade of ever-more specific expression rules rather than one single expression rule.
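To give an idea of what such a cascade looks like, here is a minimal hand-written sketch in TypeScript. It is not Langium output and it does not use the Langium or Chevrotain API; all the names (evaluate, addition, multiplication, primary) are ours, chosen for illustration. Each level of precedence gets its own function, which is the same shape that a cascade of grammar rules would have.

```typescript
// Hand-written sketch, not Langium output: each precedence level gets its own
// function, mirroring a grammar cascade such as Addition -> Multiplication -> Primary.
function evaluate(source: string): number {
    // Very small tokenizer: integers, operators and parentheses (whitespace is skipped).
    const tokens: string[] = source.match(/\d+|[-+*\/()]/g) ?? [];
    let position = 0;

    // Lowest precedence: handles + and -, delegating operands to the next level.
    function addition(): number {
        let value = multiplication();
        while (tokens[position] === '+' || tokens[position] === '-') {
            const operator = tokens[position++];
            const right = multiplication();
            value = operator === '+' ? value + right : value - right;
        }
        return value;
    }

    // Higher precedence: handles * and /, delegating operands to primary expressions.
    function multiplication(): number {
        let value = primary();
        while (tokens[position] === '*' || tokens[position] === '/') {
            const operator = tokens[position++];
            const right = primary();
            value = operator === '*' ? value * right : value / right;
        }
        return value;
    }

    // Most specific level: an integer or a parenthesized expression.
    function primary(): number {
        const token = tokens[position++];
        if (token === undefined) {
            throw new Error('Unexpected end of input');
        }
        if (token === '(') {
            const value = addition();
            if (tokens[position++] !== ')') {
                throw new Error('Expected closing parenthesis');
            }
            return value;
        }
        if (!/^\d+$/.test(token)) {
            throw new Error(`Unexpected token: ${token}`);
        }
        return Number(token);
    }

    const result = addition();
    if (position < tokens.length) {
        throw new Error(`Unexpected token: ${tokens[position]}`);
    }
    return result;
}
```

With this structure, evaluate('1 + 2 * 3') gives 7 rather than 9, because the multiplication level is reached from inside the addition level. In a grammar you would express the same idea with one rule per level, which is more verbose than a single expression rule with declared precedence.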

The issue of limited documentation is somewhat mitigated by the fact that Langium is built on open-source and readily available components. So, you can get started by reading other documentation. For instance, the npm package documentation says:

The grammar declaration language of Langium is very similar to Xtext. Please follow the Xtext documentation to learn how to use this language.

So, if you need a reference for the grammar language, you can look for it by searching for the Xtext grammar language. Notice, though, that while the rules are largely the same, there is not a perfect correspondence. For instance, the peculiar until token of Xtext is not supported in Langium.

The Langium Workflow

Anybody who has already created a language server for Visual Studio Code will recognize the same basic components. For instance, the language-configuration.json file is the standard VS Code language configuration file. It contains the definitions that tell the editor how to handle elements like comments or brackets. To know more about such elements, you could follow our tutorials on creating language servers for Visual Studio Code, such as Integrating Code Completion in Visual Studio Code – With the Language Server Protocol.

Basically, to take advantage of Langium at this stage you should already have some experience with language engineering. Otherwise you must be ready to dig up some information from other sources or resign yourself to trial and error. The alternative is to keep reading this article.

Another thing that you may want to do is install the Langium extension for VS Code. This extension adds language support for Langium files themselves, such as syntax highlighting, autocompletion, etc.

Technically, with Langium you are just developing a VS Code extension, so the workflow should be familiar to any VS Code extension developer.

You first run this command to observe and automatically compile your TypeScript code:

npm run watch

You use this command specifically to run Langium:

npm run langium:generate

This will make Langium do its magic and generate the code from your grammar definition.

Then you use F5 to launch the extension.

The Structure of a Langium Project

The langium-quickstart.md file explains the structure of a Langium project.

You can safely ignore the files package.json, extension.ts and main.ts for most of your Langium projects, since they essentially contain the code to integrate Langium into VS Code.

You might want to take a look at the file language-configuration.json. This is the language configuration that tells VS Code how to handle elements such as brackets, blocks of code or comments in your language.

The file <language-name>-module.ts is used to set up the Langium project. You can add modules or language services here that will perform some operations on a language file. For instance, a module to serialize or validate a file.

By default, the Langium generator creates a validator module that checks that a file is valid according to the rules of your language. For example, your language might require variable names to start with a capital letter. This default module is in the file <language-name>-validator.ts. The validation is done with standard TypeScript code.

The main file you will work with is the grammar file: <language-name>.langium. This contains the definition of your language. For the most part this defines the parser. However, you can also use it to define the type of the corresponding node in the AST. For instance, the generated example file contains this rule:

terminal INT returns number: /[0-9]+/;

This makes sure that the type of an INT element is a number, rather than the default string.
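To make the effect of the returns clause concrete, here is a small sketch of the kind of conversion involved. It is only an illustration in plain TypeScript: the names TerminalType and convertTerminal are hypothetical and are not part of the Langium API.

```typescript
// Illustrative sketch only: the names below are hypothetical, not Langium's API.
type TerminalType = 'string' | 'number';

// Converts the raw text matched by a terminal into the declared type.
function convertTerminal(text: string, type: TerminalType): string | number {
    return type === 'number' ? Number(text) : text;
}

// Corresponds to a terminal declared with "returns number", like INT.
const asDeclared = convertTerminal('42', 'number');
// Corresponds to a terminal without a returns clause: the value stays a string.
const asDefault = convertTerminal('42', 'string');

console.log(typeof asDeclared, typeof asDefault);
```

The practical consequence is that a property assigned from INT can be used directly in arithmetic or numeric comparisons, without parsing the text yourself.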

Creating Our Language: Lexer

In our example, we generated a Langium project with the name LangiumGame. If we open the file langium-game.langium we can see the rules for the default example language. This is a language for creating greetings to people. This is not very useful, so we are doing something different for our example. We are going to create a language that defines simple games, which is clearly much more useful and productive. Okay, it is equally useless, but at least it is uncommon.

Our Langium file starts with the name of the grammar and the tokens (terminals) that will not appear in the AST. These tokens are indicated using the keyword hidden.

grammar LangiumGame
hidden(WS, COMMENT)

We choose to have just one type of comment. We hide both comments and whitespace from the parser.

terminal COMMENT: /§[^\n\r]*[\n\r]*/;

Our comment starts with the character § (section sign) and ends with a newline. That character might not be on your keyboard, so now you know how I feel when I see the ~ (tilde) character.

In Langium, the body of a lexer rule (i.e., a terminal or token) is a regular expression delimited by slash characters. Aside from that, the definition of rules is intuitive and follows the typical regular expression format.

The terminal COMMENT, and all the terminals, are put at the end of the grammar file, after all the parser rules. So, in the same file you have the grammar name and list of hidden tokens at the beginning, then the parser rules and finally the lexer rules.

terminal NEWLINE: /[\n\r]+/;
terminal ID: /[_a-zA-Z][\w_]*/;
terminal WS: /\s+/;
terminal INT returns number: /[0-9]+/;
terminal TEXT: /"[^"]*"|'[^']*'/;
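If you want to check what these terminals accept before running the generator, you can try the regular expressions on their own. The following is a minimal sketch in plain TypeScript: the patterns are copied from the terminals above, while the terminals table and the matchesFully helper are hypothetical names of ours and do not use Langium or Chevrotain.

```typescript
// The regular expressions are copied from the terminals of the grammar; the
// surrounding code is a plain TypeScript sketch, not Langium or Chevrotain code.
const terminals: Record<string, RegExp> = {
    COMMENT: /§[^\n\r]*[\n\r]*/,
    NEWLINE: /[\n\r]+/,
    ID: /[_a-zA-Z][\w_]*/,
    INT: /[0-9]+/,
    TEXT: /"[^"]*"|'[^']*'/,
};

// Returns true when the pattern of the given terminal matches the whole input.
function matchesFully(terminal: string, input: string): boolean {
    const pattern = terminals[terminal];
    if (pattern === undefined) {
        throw new Error(`Unknown terminal: ${terminal}`);
    }
    // Anchor the original pattern so that partial matches are rejected.
    const anchored = new RegExp(`^(?:${pattern.source})$`);
    return anchored.test(input);
}

console.log(matchesFully('ID', 'first_riddle'));   // a valid identifier
console.log(matchesFully('ID', '1st'));            // identifiers cannot start with a digit
console.log(matchesFully('TEXT', '"no closing quote')); // text must be closed
```

Note that a real lexer does more than this: it also has to decide which terminal wins when several of them match at the same position, which this sketch does not attempt to reproduce.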

Creating Our Language: Parser

The parser rules are very easy to understand and to define. That is because, just like for terminals, they follow the typical rules of the EBNF format. However, there are a few things worth mentioning.

Game:
    'Game' name=ID description=TEXT NEWLINE (rules+=Riddle)+ (suggestions+=Suggestion)* ;

Riddle:    
    'Riddle' name=ID question=TEXT 'Answer' answer=TEXT NEWLINE ;    

Suggestion:
    'Suggestion' TEXT 'for' ID 'at' 'time' minutes=INT ':' seconds=INT NEWLINE ;

The rule Game captures a game definition with a name, a description and a series of rules (our riddles) and suggestions. The first peculiarity is that you need to give a label to everything that you want to easily access later in the AST.

For example, the ID of the game will be accessible with the property name in the AST. The terminal NEWLINE or the string 'Game' will not be available in the AST. You could get the whole text matched by the rule, using the property $cstNode?.text, but there does not seem to be a simple way to access terminals directly.

This happens because these labels are used in the TypeScript interface that corresponds to a node for that rule. For instance, this is the interface for the rule Game.

export interface Game extends AstNode {
    description: string
    name: string
    rules: Array<Riddle>
    suggestions: Array<Suggestion>
}

The second important thing is that the first rule that you define is the main rule. This rule must capture the whole content of the file. In our example, the Langium-generated parser will try to parse the file with the rule Game.

Our little game is essentially a series of questions with one correct answer. The format also supports delivering suggestions after a specific time to a player that is stuck.

The format works well with trivia games, but would also support more complex logical games, given that it allows for suggestions. It would not work that well for escape rooms, and the like, given that it does not support defining a setting or discovering objects.
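To make the link between a game file and its AST more tangible, here is a sketch of the kind of object we would expect for a tiny game. The interfaces are simplified stand-ins for the generated ones (they omit AstNode), the sample game is made up by us, and we assume for readability that the quotes are stripped from TEXT values, which may not be what Langium actually stores.

```typescript
// Simplified stand-ins for the generated interfaces (AstNode is omitted on purpose).
interface Riddle { name: string; question: string; answer: string; }
interface Suggestion { minutes: number; seconds: number; }
interface Game {
    name: string;
    description: string;
    rules: Riddle[];
    suggestions: Suggestion[];
}

// A made-up game file that should fit the grammar shown above:
//
//   Game Trivia "A short trivia game"
//   Riddle capital "What is the capital of Italy?" Answer "Rome"
//   Suggestion "It is on the Tiber" for capital at time 1:30
//
// And the shape of the corresponding AST (assumption: quotes stripped from TEXT).
const game: Game = {
    name: 'Trivia',
    description: 'A short trivia game',
    rules: [
        { name: 'capital', question: 'What is the capital of Italy?', answer: 'Rome' },
    ],
    // Only labeled elements end up as properties: the unlabeled TEXT and ID
    // of the Suggestion rule are not available here.
    suggestions: [{ minutes: 1, seconds: 30 }],
};

// Example of using the AST: when, in seconds, each suggestion should be shown.
const delays = game.suggestions.map(s => s.minutes * 60 + s.seconds);
console.log(delays);
```

The operator += in the grammar is what turns rules and suggestions into arrays, while = assigns a single value. This also shows why labels matter: at this stage the text of a suggestion is parsed, but it cannot be reached as a property.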

A Small Bug

In theory, this simple grammar would work fine. However, it does not. The issue is a bug triggered when your grammar does not use cross-references. This bug has already been fixed in the development version of Langium, but the fix has not been published yet.

In fact, if you generated the grammar at this point you would see that the generated/ast.ts file would contain this line:

export type LangiumGameAstReference = ;

This is not valid TypeScript, so it generates an error at compile time. You could fix this bug by manually changing this line to:

export type LangiumGameAstReference = never;

Or by adding a dummy rule with a cross-reference like the following one.

Description:
    'Description' desc=[Riddle];

Now, you might ask, what is a cross-reference? This is a neat feature of Langium (and Xtext) grammars. Syntactically, it is indicated by enclosing the name of the referenced rule in square brackets. In the previous example it is used in desc=[Riddle].

This is a constraint: the property accepts only the names of already defined elements of the specified type. To better understand what this means, let’s change the rule Suggestion in our grammar:

Suggestion:
    'Suggestion' text=TEXT 'for' riddle=[Riddle] 'at' 'time' minutes=INT ':' seconds=INT NEWLINE ;

Now, the property riddle in a Suggestion can only reference a Riddle that has been defined in the file.

If you now try to reference a non-existent Riddle, you get an error.

For a cross-reference to work, the referenced rule should have a property called name. If you recall, this is what our rule Riddle looked like.

Riddle:
    'Riddle' name=ID question=TEXT 'Answer' answer=TEXT NEWLINE ;

This is a feature that technically belongs to the validation phase, rather than the parsing phase. However, defining it directly in the grammar makes it easy to use and to understand. This is what we meant when we said that Langium is heavily dependent on the grammar declaration.

Cross-references are a cool example of how well Langium integrates with Visual Studio Code. If you define a cross-reference you automatically get autocompletion for the specific item. In our example, you will get suggested Riddle names when writing a Suggestion.
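Conceptually, the check behind a cross-reference is a lookup by name. The following sketch shows the idea in plain TypeScript. It is not how Langium implements it internally: the interfaces are simplified, and the property riddleName and the function findUnresolvedReferences are hypothetical names used only for this illustration.

```typescript
// Hypothetical, simplified node shapes; the real ones are generated by Langium.
interface Riddle { name: string; }
interface Suggestion { text: string; riddleName: string; }

// Returns one error message for each suggestion whose target riddle does not exist.
function findUnresolvedReferences(riddles: Riddle[], suggestions: Suggestion[]): string[] {
    // The name property of each riddle is what a reference is matched against.
    const knownNames = new Set(riddles.map(riddle => riddle.name));
    return suggestions
        .filter(suggestion => !knownNames.has(suggestion.riddleName))
        .map(suggestion => `Could not resolve reference to Riddle named '${suggestion.riddleName}'.`);
}

const riddles: Riddle[] = [{ name: 'capital' }];
const suggestions: Suggestion[] = [
    { text: 'It is on the Tiber', riddleName: 'capital' },
    { text: 'It flows north', riddleName: 'river' },
];

// Only the suggestion that points to the undefined riddle 'river' is reported.
console.log(findUnresolvedReferences(riddles, suggestions));
```

This also makes it clear why the referenced rule needs a name: without it there would be nothing to look up.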

Validating Our Games

Speaking of the validation phase, Langium comes with a default validation module. In our example the validation is contained in a file called langium-game-validator.ts.

First, you need to associate the validation methods with the specific type.

/**
 * Registry for validation checks.
 */
export class LangiumGameValidationRegistry extends ValidationRegistry {
    constructor(services: LangiumGameServices) {
        super(services);
        const validator = services.validation.LangiumGameValidator;
        const checks: LangiumGameChecks = {
            Game: validator.checkDescriptionIsLongEnough
        };
        this.register(checks, validator);
    }
}

In this case, we make sure that all Game objects are checked by the method checkDescriptionIsLongEnough.

Then we define the aforementioned method.

/**
 * Implementation of custom validations.
 */
export class LangiumGameValidator {

    checkDescriptionIsLongEnough(game: Game, accept: ValidationAcceptor): void {
        if (game.description.length < 50) {                        
            accept('warning', 'The description of the game should be longer.', { node: game, property: 'description' }); 
        }
    }
}

We want all game descriptions to be at least 50 characters long. Since this is just a suggestion, rather than a requirement, you get only a warning in case of failure. You can also see that we can explicitly indicate which property the check refers to.
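If you want to try the logic of the check outside of VS Code, you can call it with an acceptor that simply records what it receives. The sketch below is self-contained, so the types Game, Severity and ValidationAcceptor are simplified stand-ins that we define ourselves (an assumption, not the actual Langium definitions), and collectDiagnostics is a hypothetical helper.

```typescript
// Simplified stand-ins for the Langium types used above (assumptions for this sketch).
interface Game { name: string; description: string; }
type Severity = 'error' | 'warning' | 'info' | 'hint';
type ValidationAcceptor = (
    severity: Severity,
    message: string,
    info: { node: Game; property: keyof Game }
) => void;

// Same check as in the article, written against the simplified types.
class LangiumGameValidator {
    checkDescriptionIsLongEnough(game: Game, accept: ValidationAcceptor): void {
        if (game.description.length < 50) {
            accept('warning', 'The description of the game should be longer.', { node: game, property: 'description' });
        }
    }
}

// Runs the check and records every call made to the acceptor.
function collectDiagnostics(game: Game): string[] {
    const diagnostics: string[] = [];
    const accept: ValidationAcceptor = (severity, message, info) => {
        diagnostics.push(`${severity} on ${info.property}: ${message}`);
    };
    new LangiumGameValidator().checkDescriptionIsLongEnough(game, accept);
    return diagnostics;
}

// A description shorter than 50 characters produces one warning.
console.log(collectDiagnostics({ name: 'Trivia', description: 'Too short' }));
```

The design is worth noticing: the check does not return a result, it reports problems through the acceptor. This lets one method report several problems, each with its own severity and location.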

Summary

We have just seen a simple introduction to Langium. It is a new language engineering tool for the VS Code world we live in. The fact that it is built on top of VS Code makes it a great choice for designing and delivering simple DSLs or formats.

Despite still having a few rough edges and bugs, it is already a tool to keep an eye on. It is amazing how useful it is after reaching just version 0.1.

It is particularly useful if you are already familiar with Xtext or VS Code extension development, but it can be a good choice even for people who just have some experience in parsing.

If you have no experience in any of these things, it is probably too early to use it. In the long term it could be a great choice for beginners, since it does not need to include a parser generator. It just relies on TypeScript code. This simplifies the workflow and distribution of the parser.

We are not there yet though, given that it lacks independent documentation. It also lacks a beginner-friendly feature that ANTLR has: support for easily handling expressions with one unified rule. This is not a big problem for formats or declarative DSLs, but it makes starting out a bit more complicated for people unfamiliar with parsing patterns.

As usual, you can find the companion repository for this article on GitHub.