This is a comprehensive guide to Software Language Engineering. Software Language Engineering is the discipline focused on the science, strategies, patterns, and tools behind the creation and processing of languages. It can teach you how to create better languages and developer tools. It uses science but also insights about people coding, all to create better tools for better work.

If you want to learn more about it, you are in the right place: this article will not teach you everything, but it can be the beginning of an interesting journey.

This is going to be an introductory article on the subject. We are going to explain what language engineering is, present the main topics, and list a few representative articles on common issues and discussions related to the topic.

How to Begin Working on Language Engineering

Language Engineering is a fascinating field because every programmer is affected by the language they use. It is natural to get curious about the tools you use every day and see if you can improve something. In fact, this is how many of us started working on such issues. For example, you might start developing an extension for an editor or adding configuration to support syntax highlighting, and then you move on from there.

It is easy to start working in the field: you do not need to begin with a Ph.D. thesis on Language Engineering, on a topic like Polyglot Software Development and the benefits and drawbacks of using multiple languages for a specific project. I mean, you can, but it is not required.

My point is that the field is vast and you can find something interesting to learn at any level of expertise or formal training. This is a field in which academic experts do create actual products used for work, like ANTLR, which was initially created by a computer science professor. Your formal training can be beneficial if you have it; however, the lack of it does not pose a barrier.

Examples of Language Engineering Issues

The field itself spans from very technical topics to strategic issues.

An example of a technical topic is developing patterns for transpiling goto in languages without such a feature. You might say that sounds complicated: I do not even know what a transpiler is! Well, it is not that complicated once you understand the basic ideas.

Just like an interpreter or a compiler, a transpiler transforms the code to make it executable. The difference is that it does not transform code into a machine-readable format. It transforms the code into another language. So, a transpiler might transform TypeScript into JavaScript. That is why it is also known as a source-to-source compiler.

Now you can see that developing patterns for transpiling goto in languages without such a feature means understanding how to transform a piece of code that uses goto statements into one that does not. Or, more plainly, how to get rid of gotos. We have just started and you have already learnt something!
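To make the idea concrete, here is a minimal sketch of one possible pattern: a toy program with labels and gotos is rewritten into Python, which has no goto, by using a loop that dispatches on a program-counter variable. The toy instruction format and the names PROGRAM, transpile, run, and pc are assumptions made up for this illustration; a real transpiler would work on a proper syntax tree and would likely produce more readable output.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instruction format, invented for this sketch:
#   ("stmt", code), ("label", name), ("goto", name), ("if_goto", condition, name)
Instruction = Tuple[str, ...]

PROGRAM: List[Instruction] = [
    ("stmt", "i = 0"),
    ("stmt", "total = 0"),
    ("label", "loop"),
    ("if_goto", "i >= 5", "end"),
    ("stmt", "total += i"),
    ("stmt", "i += 1"),
    ("goto", "loop"),
    ("label", "end"),
    ("stmt", "result = total"),
]


def transpile(program: List[Instruction]) -> str:
    """Rewrite a toy goto program as Python source that uses a dispatch loop."""
    # Split the program into blocks: every label starts a new block.
    blocks: List[Tuple[str, List[Instruction]]] = [("<entry>", [])]
    for instr in program:
        if instr[0] == "label":
            blocks.append((instr[1], []))
        else:
            blocks[-1][1].append(instr)
    index: Dict[str, int] = {name: i for i, (name, _) in enumerate(blocks)}

    lines = ["pc = 0", "while pc >= 0:"]
    for i, (_, body) in enumerate(blocks):
        keyword = "if" if i == 0 else "elif"
        lines.append(f"    {keyword} pc == {i}:")
        # Default: fall through to the next block, or stop after the last one.
        fallthrough = i + 1 if i + 1 < len(blocks) else -1
        lines.append(f"        pc = {fallthrough}")
        for instr in body:
            if instr[0] == "stmt":
                lines.append(f"        {instr[1]}")
            elif instr[0] == "goto":
                # A jump becomes: select the target block, restart the loop.
                lines.append(f"        pc = {index[instr[1]]}")
                lines.append("        continue")
            elif instr[0] == "if_goto":
                lines.append(f"        if {instr[1]}:")
                lines.append(f"            pc = {index[instr[2]]}")
                lines.append("            continue")
            else:
                raise ValueError(f"unknown instruction: {instr[0]}")
    return "\n".join(lines)


def run(program: List[Instruction]) -> Dict[str, object]:
    """Execute the generated Python and return the variables it defined."""
    namespace: Dict[str, object] = {}
    exec(transpile(program), namespace)
    return namespace


print(transpile(PROGRAM))
print(run(PROGRAM)["result"])  # 10
```

The generated code is not pretty, but it contains no goto and behaves like the original program. Much of the work on goto-elimination patterns is about producing output that is more structured than this kind of dispatch loop.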

An example of a more high-level issue is understanding why the developers of C#, and the LINQ technology, decided to use a syntax that starts with the source instead of the selected elements.

LINQ, or Language Integrated Query, is a .NET feature developed to provide query capabilities to languages such as C#. You can use it to get data from various sources like databases or lists. It is inspired by SQL, but if you have ever used it, you will have noticed that the LINQ queries start with the source (i.e., FROM) rather than the elements selected (i.e., SELECT). This is the opposite of what happens in SQL and feels somewhat unnatural.

They picked this syntax because it makes autocomplete possible. In fact, if you start with the source, the editor already knows the type of the elements and can offer suggestions while you write the clauses that follow, such as where and select. So, they made this choice because IDEs are now a crucial tool for developers and a language can benefit when it is designed with tools in mind.
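The same trade-off can be sketched outside of C#. The snippet below is only an illustration: Query is a hypothetical source-first query builder written for this article, not the LINQ API, and Person and people are made-up sample data. It contrasts a source-first chain with a Python list comprehension, which, like SQL, mentions the selected element before the source.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Query(Generic[T]):
    """Hypothetical source-first query builder (not the real LINQ API)."""

    def __init__(self, source: Iterable[T]) -> None:
        self._source = source

    def where(self, predicate: Callable[[T], bool]) -> "Query[T]":
        # Keep only the items that satisfy the predicate.
        return Query(item for item in self._source if predicate(item))

    def select(self, projection: Callable[[T], R]) -> "Query[R]":
        # Map every item to the value we are interested in.
        return Query(projection(item) for item in self._source)

    def to_list(self) -> List[T]:
        return list(self._source)


@dataclass
class Person:
    name: str
    age: int


people = [Person("Ada", 36), Person("Tim", 12), Person("Grace", 45)]

# Source first: `people` is named before the lambdas are written, so a
# type-aware editor has what it needs to suggest `age` and `name` after `p.`.
adults = (
    Query(people)
    .where(lambda p: p.age >= 18)
    .select(lambda p: p.name)
    .to_list()
)

# Selection first: `p.name` is typed before `people` has been mentioned,
# so at that point a tool cannot yet know what `p` is.
adults_comprehension = [p.name for p in people if p.age >= 18]

print(adults)                # ['Ada', 'Grace']
print(adults_comprehension)  # ['Ada', 'Grace']
```

Both forms compute the same result. The difference lies only in the order in which the information becomes available to the tools while you type.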

Topics

As with any discipline as large as Language Engineering, there are many ways to organize it and topics to include. Here we provide our own view, but we are very welcoming of suggestions and improvements: feel free to write to us about any that you might have.

We are going to present the main topics and list a few representative articles on common issues and discussions related to the topic. The list of articles will be by no means exhaustive, just a place to start your exploration of the common ideas you will encounter.

Tools

A brief list of the main tools we use in Language Engineering. A little note beforehand: we found that people who are learning a new subject want a clear answer to the question: what to use? They lack the knowledge to discriminate between many options. Instead, people who are trying to deepen their knowledge want to know all that is available.

Since this is a general article, we believe both kinds of people will read the article. So, we are providing a first option that is good enough for everybody and then alternatives that are best for specific cases or personal tastes.

ANTLR

ANTLR is a parser generator and the main tool we use in our everyday job. A parser generator increases our productivity and allows us and our clients to build and maintain a parser. ANTLR supports many different languages, from Java to JavaScript, and it is therefore our primary choice.

ANTLR is widely used, so there are many libraries and tools built for and upon it. Our favorite tool is the VS Code extension for ANTLR4 grammars. VS Code is also widely popular and supports many languages, so this leads to a reliable and productive setup. We also like the official ANTLR plugin for IntelliJ, but of course, this works only for Java and Kotlin projects given the focus of the IntelliJ IDE. You can find a list of plugins for several editors on the ANTLR website.

We are so confident about using ANTLR that we also built a Kotlin target for ANTLR.

We created and keep supporting a set of Starlasu libraries compatible with ANTLR. They are designed to create ASTs and are ideal for transforming an ANTLR parse tree into an AST tailored for your application.
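To give an idea of what transforming a parse tree into an AST means, here is a small, library-independent sketch. It does not use the Starlasu or ANTLR APIs: ParseNode, Number, BinaryOp, and to_ast are hypothetical names invented for this example, and the parse tree is built by hand instead of coming from a parser.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ParseNode:
    """Hypothetical parse-tree node: a rule name plus children (nodes or token text)."""
    rule: str
    children: List[Union["ParseNode", str]]


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Number, BinaryOp]


def to_ast(node: ParseNode) -> Expr:
    """Keep the meaning of the expression and drop purely syntactic details."""
    if node.rule == "number":
        return Number(int(node.children[0]))
    if node.rule == "parens":
        # Children are "(", the inner expression, and ")": only the middle matters.
        return to_ast(node.children[1])
    if node.rule == "binary":
        left, op, right = node.children
        return BinaryOp(op, to_ast(left), to_ast(right))
    raise ValueError(f"unknown rule: {node.rule}")


# Hand-built parse tree for the expression "(1 + 2) * 3".
tree = ParseNode("binary", [
    ParseNode("parens", [
        "(",
        ParseNode("binary", [ParseNode("number", ["1"]), "+", ParseNode("number", ["2"])]),
        ")",
    ]),
    "*",
    ParseNode("number", ["3"]),
])

expression_ast = to_ast(tree)
print(expression_ast)
```

The parse tree mirrors the grammar, parentheses included, while the AST keeps only what the application needs: here, numbers and binary operations.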

Alternative Parsing Tools

There are many parsing tools available out there, from parser generators, like ANTLR, to parsing libraries, like Chevrotain. We know this well because we have researched the subject extensively, gathering lists of parsing tools and libraries for several programming languages.

There is really an embarrassment of riches of parsing support available in every language. 

So, if you are interested mainly in one language or one type of project, you might want something else that best fits your needs. We mainly work with ANTLR because it offers great flexibility and productivity.

JetBrains MPS

If you need to create a DSL that will be used on a desktop PC, JetBrains MPS is the first tool you should look at. It is a Language Workbench, a tool designed to create languages. JetBrains MPS is the most popular language workbench available, for a few good reasons:

  • It is supported by JetBrains, the well-respected developers of programming tools.
  • It lets you prototype a DSL very quickly.
  • It supports multiple notations: textual, tabular, graphical, and more.
  • It has all the necessary features to build advanced languages and editors (type system, constraints, etc.), so we can build editors that support users effectively.
  • It is a rich environment that lets you build great integrated tooling: interpreters, simulators, debuggers, documentation generators, etc. In our experience, the tools built around DSLs make a real difference in productivity.
  • It allows you to evolve languages without breaking existing code. This is very important because DSLs are always designed and evolved in an incremental way.

The main drawback of JetBrains MPS is that it is a desktop application designed for developers. It has a UI that looks complex and daunting to non-developers, and it is not easily integrated into command-line workflows or the web. We are actually working with the JetBrains MPS community to support using MPS on the web, but the work is still in progress.

The other significant drawback is that it is a standalone system. Your users will need to run JetBrains MPS in some way or another to use any language you are going to define. It is open source, but it is quite complex and designed to work as-is. It is not easy to integrate with other code.

Alternative Language Workbenches

An alternative Language Workbench is Spoofax; you can read more about it in our Spoofax tutorial. It is widely used and adopted in industrial applications. Its main drawback might be that it is based upon Eclipse, which is a platform with a lot of admirers, but is less polished than an IDE from JetBrains or Visual Studio.

Xtext is also a popular Language Workbench based on Eclipse and EMF. The worst way to describe it is as a set of plugins for Eclipse that transform it into an IDE to create languages. The best way is the description from their website:

Xtext is a framework for development of programming languages and domain-specific languages. With Xtext you define your language using a powerful grammar language. As a result, you get a full infrastructure, including parser, linker, typechecker, compiler as well as editing support for Eclipse, any editor that supports the Language Server Protocol and your favorite web browser 

Essentially, Xtext is a complete tool for creating programming languages, based on open-source software. The neat advantage of Xtext is that it is an open system: you can create a language with Xtext and then integrate it with the rest of your codebase as you wish.

A tool similar in spirit to Xtext is Langium. They are both built upon open-source libraries and tools: Xtext is built upon Eclipse and ANTLR, while Langium is built upon Visual Studio Code and Chevrotain. The difference is that Langium helps you create individual, simple languages more quickly, and that it is written entirely in TypeScript so that it integrates easily with Visual Studio Code. This makes it a great choice for web or cross-platform projects.

A different option is MetaEdit+. It is a commercial language workbench, so it is designed for companies. The interesting thing is that it was created to give customers who already know their domain the tooling to formalize it into a language. So, it is a tool that makes it easier for domain experts to perform the work of a language engineer. In particular, it caters to companies working on well-defined products.

It is a nice example of what language engineering can do for people who are not language engineers. We have interviewed Juha-Pekka Tolvanen, one of the people behind MetaEdit+, so you may watch that interview to learn more about MetaEdit+.

If you are interested in starting slowly with language workbenches, you might want to begin with textX. It is essentially a suite of Python modules that come together to form a simple but functional language workbench for creating languages from Python. It can only be used from Python, but it is easy to get started with. You can read our tutorial: Quick Domain-Specific Languages in Python with textX.

Summary

We have seen how to get into the fascinating field of Language Engineering: how to start, a map of the main topics, and the tools you can use. We hope we have succeeded in transforming an abstract subject into a series of understandable topics that can help you navigate the field.