A Comprehensive Guide to Software Language Engineering

This is a comprehensive guide to Software Language Engineering. Software Language Engineering is the discipline focused on the science, strategies, patterns, and tools behind the creation and processing of languages. It can teach you how to create better languages and developer tools. It uses science but also insights about people coding, all to create better tools for better work.

If you want to learn more about it, you are in the right place: this article will not teach you everything, but it can be the beginning of an interesting journey.

This is going to be an introductory article on the subject. We are going to explain what language engineering is, present the main topics, and list a few representative articles on common issues and discussions related to the topic.

How to Begin Working on Language Engineering

Language Engineering is a fascinating field because every programmer is affected by the language they use. It is natural to get curious about the tools you use every day and see if you can improve something. In fact, this is how many of us started working on such issues. For example, you might start developing an extension for an editor or adding configuration to support syntax highlighting, and then you move on from there.

It is easy to start working in the field: you do not need to begin with a Ph.D. thesis on Language Engineering, on a topic like Polyglot Software Development and the benefits and drawbacks of using multiple languages for a specific project. I mean, you can, but it is not required.

My point is that the field is vast and you can find something interesting to learn at any level of expertise or formal training. This is a field in which academic experts do create actual products used for work, like ANTLR, which was initially created by a computer science professor. Your formal training can be beneficial if you have it; however, the lack of it does not pose a barrier.

Examples of Language Engineering Issues

The field itself spans from very technical topics to strategic issues.

An example of a technical topic could be developing patterns for transpiling goto into languages without such a feature. You might say that sounds complicated: I do not even know what a transpiler is! Well, it is not that complicated, once you understand the basic ideas.

Just like an interpreter or a compiler, a transpiler transforms the code to make it executable. The difference is that it does not transform code into a machine-readable format. It transforms the code into another language. So, a transpiler might transform TypeScript into JavaScript. That is why it is also known as a source-to-source compiler.
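To make the idea concrete, here is a minimal sketch of a transpiler written in Python. It is a toy built on assumptions of ours: the source language is a made-up one that only has `let name = expression` lines, and the function name `transpile_let_to_js` is hypothetical. What it shows is the defining trait of a transpiler: the output is source code in another language (here JavaScript), not machine code.

```python
import re

# Hypothetical toy source language: one statement per line, of the form
#   let <name> = <expression>
# Assumption to keep the sketch short: the expression is copied through
# unchanged. A real transpiler would parse and translate it as well.
LET_STATEMENT = re.compile(r"^let\s+([A-Za-z_]\w*)\s*=\s*(.+)$")


def transpile_let_to_js(source: str) -> str:
    """Translate the toy language into JavaScript source code."""
    output_lines = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        match = LET_STATEMENT.match(stripped)
        if match is None:
            raise SyntaxError(f"line {line_number}: cannot parse {stripped!r}")
        name, expression = match.groups()
        output_lines.append(f"const {name} = {expression.strip()};")
    return "\n".join(output_lines)


print(transpile_let_to_js("let width = 3\nlet area = width * 4"))
```

Real transpilers replace the regular expression with a proper parser and work on a tree representation of the code, but the overall shape (source code in, source code out) stays the same.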

Now you can see that developing patterns for transpiling goto into languages without such a feature means understanding how you can transform a piece of code that uses goto statements into one that does not. Or more plainly, how to get rid of gotos. We have just started and you have already learnt something!
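As an illustration, this is a sketch of one possible pattern for getting rid of gotos, shown in Python (which has no goto). The original program in the comment is written in a hypothetical language, and the function name `sum_below` is ours. Each label becomes a value of a `label` variable, and every goto becomes an assignment to that variable inside a dispatch loop. It is only one of several approaches: a real transpiler would usually try to recover structured loops first and fall back to something like this for the hard cases.

```python
# Original program, in a hypothetical language that has goto:
#
#   start:  i = 0
#           total = 0
#   check:  if i >= n goto done
#           total = total + i
#           i = i + 1
#           goto check
#   done:   return total


def sum_below(n: int) -> int:
    """Goto-free translation of the program above, using a dispatch loop."""
    label = "start"
    i = 0
    total = 0
    while True:
        if label == "start":
            i = 0
            total = 0
            label = "check"      # execution falls through to the next label
        elif label == "check":
            if i >= n:
                label = "done"   # was: goto done
            else:
                total = total + i
                i = i + 1
                label = "check"  # was: goto check
        elif label == "done":
            return total
```

Calling `sum_below(5)` adds 0, 1, 2, 3, and 4, exactly as the original program would.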

An example of a more high-level issue is understanding why the developers of C#, and the LINQ technology, decided to use a syntax that starts with the source instead of the selected elements.

LINQ, or Language Integrated Query, is a .NET feature developed to provide query capabilities to languages such as C#. You can use it to get data from various sources like databases or lists. It is inspired by SQL, but if you have ever used it, you will have noticed that the LINQ queries start with the source (i.e., FROM) rather than the elements selected (i.e., SELECT). This is the opposite of what happens in SQL and feels somewhat unnatural.

The reason they picked this syntax is that it makes autocomplete possible. If you start with the source, the editor already knows the type of the elements, so it can offer suggestions while you write the rest of the query, such as the filtering and selection clauses. So, they made this choice because IDEs are now a crucial tool for developers and a language can benefit when it is designed with tools in mind.
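The same effect can be sketched outside of C#. The following Python example is an illustration of ours, not LINQ itself: the `Query` class and its `where` and `select` methods are hypothetical names that loosely mirror the LINQ clause order. With the source first, a type-aware editor knows the element type before you write the later clauses. The list comprehension at the end follows the SQL order instead, with the selected expression written before the source is named.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


@dataclass
class Person:
    name: str
    age: int


class Query(Generic[T]):
    """Hypothetical source-first query builder (not a real library)."""

    def __init__(self, source: Iterable[T]) -> None:
        self._items = list(source)

    def where(self, predicate: Callable[[T], bool]) -> "Query[T]":
        # Keep only the items that satisfy the predicate
        return Query(item for item in self._items if predicate(item))

    def select(self, projection: Callable[[T], R]) -> List[R]:
        # Map every remaining item to the selected value
        return [projection(item) for item in self._items]


people = [Person("Ada", 36), Person("Linus", 28), Person("Grace", 45)]

# Source first: when you type `p.` the element type (Person) is already known,
# so a type-aware editor can suggest `name` and `age`.
names = Query(people).where(lambda p: p.age > 30).select(lambda p: p.name)

# Selection first, as in SQL: `p.name` is written before `p` is introduced.
same_names = [p.name for p in people if p.age > 30]
```

Both queries return the same names; the difference is only in how much the editor can know while you are still typing.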

Topics

As with any discipline as large as Language Engineering, there are many ways to organize it and topics to include. Here we provide our own view, but we are very welcoming of suggestions and improvements: feel free to write to us about any that you might have.

We are going to present the main topics and list a few representative articles on common issues and discussions related to the topic. The list of articles will be by no means exhaustive, just a place to start your exploration of the common ideas you will encounter.

Language design. How to design languages to be effective. This topic includes everything from strategies for creating languages that are easy to learn to discussions about tools used in designing languages. A few notable articles:
- How would I go about creating a programming language?, a handy overview to guide you in creating a programming language
- 68 Resources To Help You To Create Programming Languages, resources presenting everything from designing to actually building your language
- Teaching Programming With Hedy, a presentation of Hedy, a new gradual language and approach to make programming languages easier to learn
- Racket a Language for Creating New Languages, a presentation of Racket, a language and system to create new languages
- Building a language: tool support, a video about the issue of tool support for languages
Model driven development. Creating software by working on domain models.
- Interview with Matteo Mortari on process automation, interview with a Software Engineer at Red Hat, working on Drools, the rule engine, and DMN, Decision Model Notation
- Telosys: a Code Generation Tool by Laurent Guerin, interview with the author of Telosys, a code generation tool
Domain specific languages. Domain Specific Languages (DSL) are programming languages tailored for a specific purpose and audience. They require thoughtful design and specific tools to be used at their best
- The complete guide to (external) Domain Specific Languages, a great overview of DSLs, what they are, and what they are good for
- Designing a DSL for accounting: use a DSL to describe taxes, pension contributions, and general financial calculations, a tutorial on designing a DSL, in this example one for accounting
- Are You Abusing Excel? You Need Something Different, entire companies rely on Excel, a fantastic software when it is used correctly, but that sometimes should be replaced with a DSL
- When you need low-code or no-code and when you need DSLs, a pragmatic look at how to choose between low-code and DSLs
- Experiences of Practical DSLs usages: a talk with Glen Braun, an interview about a real experience in adopting a DSL
Application modernization. Even the best-designed software gets old and becomes a liability. You need to understand how to modernize the code, keep the value the old software provides, and get rid of the old approach.
- Interview with Graham Cunningham on legacy modernization, an interview with an expert on legacy modernization and a great way to start understanding the topic
- Why you should not use (f)lex, yacc and bison, a discussion about why you should not use famous but outdated parsing tools; complete with their history
- Comparing the cost of migrating to rewriting, an article about some common approaches to legacy modernization
Parsing. Parsing is about extracting information from some text written in a meaningful format.
- The ANTLR Mega Tutorial, a comprehensive tutorial about ANTLR, the most used parser generator
- EBNF: How to Describe the Grammar of a Language, parsers are defined using a grammar; this article explains EBNF, the most used format to describe a grammar
- A Guide to Parsing: Algorithms and Terminology, a comprehensive guide about the basic theory of parsing, from the terminology to an overview of the common algorithms
- Parsing HTML: A Guide to Select the Right Library, an overview of the most commonly used ways to parse HTML
- Parsing SQL, a list of the libraries, tools, and approaches to parsing SQL
- Building advanced parsers using Kolasu, our approach to building professional parsers
- Challenges in Parsing Legacy Languages: The Case of SAS Macros, a discussion of a real-life issue in parsing legacy languages and how to deal with such problems
Compilers, Interpreters and Transpilers. Once you have parsed some code, you have essentially a series of instructions. You then need to execute them in some way: this is when you need to build a compiler, interpreter, or transpiler.
- A tutorial on how to write a compiler using LLVM, LLVM is a technology that greatly simplifies creating professional compilers, this is a tutorial to get you started on using it
- Language2Language Transformers: machine learning to build transpilers, a novel way to build a transpiler using machine learning, with lots of examples
- How to write a transpiler, an introduction to transpilers and a tutorial on how to build one
Code processing. Extracting data from code, analyzing it and programmatically transforming it. In other words, we talk about static analysis, automated refactoring, and code generation.
- How and Why to Analyze, Generate and Transform Java Code Using Spoon, a tutorial and overview of Spoon, a tool to analyze, generate, and transform Java code.
- Convert PL/SQL code to Java, some kinds of code processing are in such demand that there are ready-to-use tools for the job. This is a discussion of two tools to convert PL/SQL code to Java
- Generate diagrams from C# source code using Roslyn, if you can understand code you can transform it into a lot of things, in this article we look at the example of generating diagrams from C# code
- Getting started with JavaParser: analyzing Java Code programmatically, a tutorial on JavaParser, a parser for Java, and how to perform automatic operations on the code
Editors. Programming languages are professional tools that require specific tools to be most productive. There are many aspects to understanding how to develop an editor, from design to pragmatic considerations.
- Code Completion with ANTLR4-c3, code completion, or autocomplete is a great productivity boost for developers. In this article, we see how to implement it based on an ANTLR parser
- Go To Definition in the Language Server Protocol, the Language Server Protocol is a protocol that revolutionized the way editors work, by providing a standard way to make language tools communicate with editors. In this article, we discuss how to implement the Go To definition for a language
- Writing a browser based editor using Monaco and ANTLR, Monaco is the web editor component of Visual Studio Code, the most widely used code editor. In this article, we discuss how to use it to create your own custom web editor
Language Workbenches. Language workbenches are IDEs designed specifically to build languages. They support you in all steps, from language definition to creating supporting tools like compilers and editors.
- A tutorial on Spoofax, a Language Workbench, a tutorial on a platform environment that permits quickly creating DSLs
- MPSServer: enabling integration with MPS, MPSServer is a tool to make MPS accessible remotely. It can be used to build web editors but also create diagrams, or integrate a build mechanism
- Saving JetBrains MPS models in a database using Modelix, JetBrains MPS is an integrated environment, which makes it hard to integrate with external software. In this article, we discuss how to deal with the specific issue of saving MPS models in a database, which would allow sharing data between users
Community. As with any other professional field, participating in the community is the best way to keep you updated and learn more.
- Strumenta Community. The best and probably only community about language engineering. So, you really have no choice but to join us. We welcome everybody from novices to experts.
- LangDev. An informal and annual meeting of language engineering enthusiasts from both industry and academia. We come together to discuss the state-of-the-art and state-of-the-practice of language engineering.
- SplashCon. The ACM SIGPLAN conference on Systems, Programming, Languages, and Applications: Software for Humanity embraces all aspects of software construction and delivery, to make it the premier conference on the applications of programming languages – at the intersection of programming languages and software engineering.
- SIGPLAN, The ACM Special Interest Group on Programming Languages, organizes a lot of conferences in addition to SplashCon.
- Interviews. We have a nice list of interviews with people working in the field that you might be interested in. They have both videos and transcripts, so you can watch or read them.

Tools

A brief list of the main tools we use in Language Engineering. A little note beforehand: we found that people who are learning a new subject want a clear answer to the question: what to use? They lack the knowledge to discriminate between many options. Instead, people who are trying to deepen their knowledge want to know all that is available.

Since this is a general article, we believe both kinds of people will read the article. So, we are providing a first option that is good enough for everybody and then alternatives that are best for specific cases or personal tastes.

ANTLR

ANTLR is a parser generator and the main tool we use in our everyday job. A parser generator increases our productivity and allows us and our clients to build and maintain a parser. ANTLR supports many different languages, from Java to JavaScript and it is therefore our primary choice.

ANTLR is widely used, so there are many libraries and tools built for and upon it. Our favorite tool is the VS Code extension for ANTLR4 grammars. VS Code is also widely popular and supports many languages, so this leads to a reliable and productive setup. We also like the official ANTLR plugin for IntelliJ, but of course, this works only for Java and Kotlin projects given the focus of the IntelliJ IDE. You can find a list of plugins for several editors on the ANTLR website.

We are so confident about using ANTLR that we also built a Kotlin target for ANTLR.

We created and keep supporting a set of Starlasu libraries compatible with ANTLR. They are designed to create ASTs and are ideal for transforming an ANTLR parse tree into an AST tailored for your application.
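To show what moving from a parse tree to an AST means, here is a minimal sketch in Python. It uses neither the ANTLR runtime nor the Starlasu API: `ParseNode`, `Number`, `BinaryOp`, and `to_ast` are hypothetical names we made up for the illustration. The parse tree mirrors the grammar and keeps syntactic details such as parentheses, while the AST keeps only what the application cares about.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ParseNode:
    """Hypothetical parse tree node: a rule name plus children (nodes or token text)."""
    rule: str
    children: List[Union["ParseNode", str]]


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    operator: str
    left: "Expr"
    right: "Expr"


Expr = Union[Number, BinaryOp]


def to_ast(node: ParseNode) -> Expr:
    """Drop syntactic noise (parentheses, token text) and keep only the meaning."""
    if node.rule == "number":
        return Number(int(node.children[0]))
    if node.rule == "parens":
        # children are: "(", expression, ")"
        return to_ast(node.children[1])
    if node.rule == "binary":
        left, operator, right = node.children
        return BinaryOp(operator, to_ast(left), to_ast(right))
    raise ValueError(f"unknown rule: {node.rule}")


# Parse tree for the expression (1 + 2) * 3, built by hand for the example
tree = ParseNode("binary", [
    ParseNode("parens", [
        "(",
        ParseNode("binary", [ParseNode("number", ["1"]), "+", ParseNode("number", ["2"])]),
        ")",
    ]),
    "*",
    ParseNode("number", ["3"]),
])

expression_ast = to_ast(tree)
```

The resulting AST has no trace of the parentheses: the grouping is already expressed by the nesting of the `BinaryOp` nodes.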

Alternative Parsing Tools

There are many parsing tools available out there, from parser generators, like ANTLR, to parsing libraries, like Chevrotain. We know it well because we have researched the subject extensively, gathering lists of parsing tools and libraries for several languages.

There is really an embarrassment of riches of parsing support available in every language.

So, if you are interested mainly in one language or one type of project you might want something else that best fits your needs. We mainly work on ANTLR because it has great flexibility and productivity.

JetBrains MPS

If you need to create a DSL that will be used on a desktop PC, JetBrains MPS is the first tool you should look at. It is a Language Workbench, a tool designed to create languages. JetBrains MPS is the most popular language workbench available, for a few good reasons:

- It is supported by JetBrains, the well-respected developers of programming tools
- It lets you prototype a DSL very quickly
- It supports multiple notations: textual, tabular, graphical, and more
- It has all the necessary features to build advanced languages and editors: type system, constraints, etc. In this way, we can build editors that support users effectively
- It is a rich environment that lets you build great integrated tooling: interpreters, simulators, debuggers, documentation generators, etc. In our experience, the tools built around DSLs make a real difference in productivity
- It allows you to evolve languages without breaking existing code. This is very important because DSLs are always designed and evolved in an incremental way

The main drawback of JetBrains MPS is that it is a desktop application designed for developers. It has a UI that looks complex and daunting to non-developers and it is not easily integrated into command-line workflows or the web. We are actually working with the JetBrains MPS community to support a web use of MPS, but the work is still in progress.

The other significant drawback is that it is a standalone system. Your users will need to run JetBrains MPS in some way or another to use any language you are going to define. It is open source, but it is quite complex and designed to work as-is. It is not easy to integrate with other code.

Alternative Language Workbenches

An alternative Language Workbench is Spoofax; you can read more about it in our Spoofax tutorial. It is widely used and adopted in industrial applications. Its main drawback might be that it is based upon Eclipse, a platform with a lot of admirers but less polished than an IDE from JetBrains or Visual Studio.

Xtext is also a popular Language Workbench based on Eclipse and EMF. The worst way to describe it is as a set of plugins for Eclipse that transform it into an IDE to create languages. The best way is from their website:

Xtext is a framework for development of programming languages and domain-specific languages. With Xtext you define your language using a powerful grammar language. As a result, you get a full infrastructure, including parser, linker, typechecker, compiler as well as editing support for Eclipse, any editor that supports the Language Server Protocol and your favorite web browser

Essentially, Xtext is a complete tool for creating programming languages, based on open-source software. The neat advantage of Xtext is that it is an open system: you can create a language with Xtext and then integrate it with the rest of your codebase, as you wish.

A tool similar in spirit to Xtext is Langium. They are both built upon open-source libraries and tools: Xtext is built upon Eclipse and ANTLR, while Langium is built upon Visual Studio Code and Chevrotain. The difference is that Langium helps you create individual, simple languages more quickly, and it is written entirely in TypeScript so that it integrates easily with Visual Studio Code. This makes it a great choice for web or cross-platform projects.

A different option is MetaEdit+. It is a commercial language workbench, so it is designed for companies. The interesting thing is that it was created to give customers who already know the domain the tooling to formalize it into a language. So, it is a tool to make it easier for domain experts to perform the work of a language engineer. In particular, it caters to companies working on well-defined products.

It is a nice example of what language engineering can do for people that are not language engineers. We have interviewed Juha-Pekka Tolvanen, one of the people behind MetaEdit+, so you may watch that interview to learn more about MetaEdit+.

If you are interested in starting slow with language workbenches, you might want to start with textX. It is essentially a suite of Python modules that come together to create a simple but functional language workbench to create languages from Python. It can only be used from Python, but it is easy to start. You can read our tutorial: Quick Domain-Specific Languages in Python with textX.

Summary

We have seen how to get into the fascinating field of Language Engineering: how to start, a map of the main topics, and the tools you can use. We hope we have succeeded in transforming an abstract subject into a series of understandable topics that can help you navigate the field.