A Guide to Code Generation

Code Generation

There are many uses for Code Generation:

  • We can generate repetitive code from schemas or source of information we have. For example, we can generate Data Access Objects from database schema files
  • We can generate code from wizards
  • We can generate skeletons of applications from simple models. We wrote about an example here: Code Generation with Roslyn: a Skeleton Class from UML
  • We can generate entire applications from high level DSLs
  • We can generate code from information we obtained processing existing documents
  • We can generate code from information we obtained by reverse-engineering code written using other languages or frameworks
  • Some IDEs have the functionality to generate boilerplate code, like the equals or hashCode methods in Java

So Code Generation can be used for small portions of code or entire applications. It can be used with widely different programming languages and using different techniques.

In this article we explore all the possibilities provided by code generation to give you a complete overview on the subject.

Why to Use Code Generation

The reason to use code generation are fundamentally four: productivity, simplification, portability, and consistency.

Productivity

With code generation you write the generator once and it can be reused as many times as you need. Providing the specific inputs to the generator and invoke it is significantly faster than writing the code manually, therefore code generation permits to save time.

Simplification

With code generation you generate your code from some  abstract description. It means that your source of truth becomes that description, not the code. That description is typically easier to analyze and check compared with the whole generated code.

Portability

Once you have a process to generate code for a certain language or framework you can simply change the generator and target a different language or framework. You can also target multiple platforms at once. For example, with a parser generator you can get a parser in C#, Java and C++. Another example: you might write a UML diagram and use code generation to create both a skeleton class in C# and the SQL code to create a database for MySQL. So the same abstract description can be used to generate different kinds of artifacts.

Consistency

With code generation you get always the code you expect. The generated code is designed according to the same principles, the naming rule match, etc. The code always works the way you expect, of course except in the case of bugs in the generator. The quality of the code is consistent. With code written manually instead you can have different developers use different styles and occasionally introduce errors even in the most repetitive code.

Why Not to Use Code Generation

As all tools code generation is not perfect, it has mainly two issues: maintenance and complexity.

Maintenance

When you use a code generator tool your code becomes dependent on it. A code generator tool must be maintained. If you created it you have to keep updating it, if you are just using an existing one you have to hope that somebody keep maintaining it or you have to take over yourself. So the advantages of code generation are not free. This is especially risky if you do not have or cannot find the right competencies to work on the generator.

Complexity

Code generated automatically tend to be more complex than code written by hand. Sometimes it has to do with glue code, needed to link different parts together, or the fact that the generator supports more use cases than the one you need. In this second case the generated code can do more than what you want, but this is not necessarily an advantage. Generated code is also surely less optimized than the one you can write by hand. Sometimes the difference is small and not significant, but if your application need to squeeze every bit of performancem code generation might not be optimal for you.

How Can We Use Code Generation?

Depending on the context code generation can just be a useful productivity boost or a vital component of your development process. An example of an useful use is in many modern IDEs: they allows to create a skeleton class to implement interfaces or similar things, with the click of a button. You could definitely write that code yourself, you would just waste some time performing a manial task.

There are many possible ways to design a code generation pipeline. Basically we need to define two elements:

  1. Input: where the information to be used in code generation comes from
  2. Output: how the generated code will be obtained

Optionally you may have transformation steps in between the input and the output. They could be useful to simplify the output layer and to make the input and the output more independent.

Possible inputs:

  • A DSL: for example, we can use ANTLR to describe the grammar of a language. From that we can generate a parser
  • code in other formats: database schemas. From a database schema we can generate DAOs
  • wizards: they permit to ask information to the user
  • reverse engineering: the information can be obtained by processing complex code artifacts
  • data sources like a DB, a CSV file or a spreadsheet

Possible outputs:

  • template engine; most web programmers knows template engines, that are used to fill the HTML UI with data
  • code building APIs: e.g., Javaparser can be used to create Java files programmatically

Let’s now examine some pipelines:

  • parser generation; readers of this website will be surely familiar with ANTLR and other such tools to automatically generate parsers from a formal grammar. In this case the input is a DSL and the output is produced using a template engine
  • model driven design; plugins for IDEs, or standalone IDEs, that allows to describe a model of an application, sometimes with a graphical interface, from which to generate an entire application or just its skeleton
  • database-related code; this use can be considered the child of model driven design and templates engine. Usually the programmer defines a database schema from which entire CRUD apps or just the code to handle the database can be generated. There are also tools that perform the reverse process: from an existing database they create a database schema or code to handle it
  • ad hoc applications; this category includes everything: from tools designed to handle one thing to ad-hoc systems, used in an enterprise setting, that can generate entire applications from a formal custom description. These applications are generally part of a specific workflow. For example, the customer uses a graphical interface to describe an app and one ad-hoc system generates the database schema to support this app, another one generates a CRUD interface, etc.
  • IDE generated code: many statically typed languages requires a lot of boilerplate code to be written and an IDE can typically generate some of it: classes with stubs for the methods to implement, standard equals, hashCode and toString methods, getters and setters for all existing properties

Code Generation Tools

After this short introduction to code generation we can see a few code generation tools.

Template Engine

The template engine group is probably the most famous and used. A template engine is basically a mini-compiler that can understand a simple template language. A template file contains special notations that can be interpreted by the template engine. The simplest thing it can do it replacing this special notation with the proper data, given at runtime. Most template engine also supports simple flow control commands (e.g., for-loops, if-else-statements) that allows to describe simple structures.

There are many examples, let’s see two of them that represent how most of them behaves.

Jinja2

Jinja2 is a template engine written in pure Python. It provides a Django inspired non-XML syntax but supports inline expressions and an optional sandboxed environment.

Jinja2 is a widely-use template engine for Python. It can do what all template engines do: create unique documents based on the provided data.  It supports modular templates, control flow, variables, etc. However, it has also powerful security measures: an HTML escaping system to and a sandboxing environment that can control access to risky attributes.

That how a Jinja2 template looks like.

Jinja2 has special support for producing HTML pages, and that is what is most used for. However, it could be used to create other kind of files.

Pug

Pug is a high-performance template engine heavily influenced by Haml and implemented with JavaScript for Node.js and browsers

In many regards Pug is like many other template engines: it supports modular templates, control flow, etc. The difference is that Pug looks like DSL and it is only meant to work with HTML. The result is that a Pug template looks very clean and simple.

Parser Generation

A parser generation is a tool to automatically and quickly create parser for a language. They are very successful and productive because the problem of parsing a language has been extensively studied. So there are solutions that can guarantees to parse most languages that people need to parse.

ANTLR

ANTLR is probably the most used parser generator. That means that there are many examples out there. But the real added value of a vast community it is the large amount of grammars available.

ANTLR takes as input a grammar: a formal description of the language. The output of a parser is a parse tree: a structure that contains the source code transformed in a way that is easy to use for the rest of the program. ANTLR also provides two ways to walk the parse tree: visitors and listeners. The first one is suited when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The format of the grammar is clean and language independent.

If you are interested to learn how to use ANTLR, you can look into this giant ANTLR tutorial we have written. If you are ready to become a professional ANTLR developer, you can buy our video course to Build professional parsers and languages using ANTLR.

Model Driven Design

These are usually plugins for IDEs, or standalone IDEs, that allows to describe a model of an application, with a graphical interface, from which to generate the skeleton of an application. This can happen because the foundation of model driven design is an abstract model, either defined with UML diagrams or DSLs. And once the main features of a program can be described in relation with the model is then possible to automatically generate a representation of that program. This representation of the model in code will have the structure designed by the developer, but usually the behavior will have to be implemented directly by the developer itself.

For example, with model driven design you can automatically generate a class, with a list of fields and the proper methods, but you cannot automatically implement the behavior of a method.

Acceleo

Acceleo 3 is a code generator implementing of the OMG’s Model-to-text specification. It supports the developer with most of the features that can be expected from a top quality code generator IDE: simple syntax, efficient code generation, advanced tooling, and features on par with the JDT. Acceleo help the developer to handle the lifecycle of its code generators. Thanks to a prototype based approach, you can quickly and easily create your first generator from the source code of an existing prototype, then with all the features of the Acceleo tooling like the refactoring tools you will easily improve your generator to realize a full fledged code generator.

This is a quote from the Acceleo website that describes quite well what Acceleo does: implementing the principles of model-drive design. What it lacks is a description of the experience of working with Acceleo. It is basically an Eclipse plugin that gives you tool to create Java code starting from a EMF Model, according to a template you specify. The EMF model can be defined in different ways: with an UML diagram or a custom DSL.

This image shows the workflow of an Acceleo project.

Acceleo workflow

 

Umple

Umple is a modeling tool and programming language family to enable what the authors call Model-Oriented Programming. It adds abstractions such as Associations, Attributes and State Machines derived from UML to object-oriented programming languages such as Java, C++, PHP and Ruby. Umple can also be used to create UML class and state diagrams textually.

Umple is an example of a tool that combines UML modes with traditional programming language in a structured way. It was born to simplify the process of model-driven development, which traditionally requires specific and complex tools. It is essentially a programming language that supports features of UML (class and state) diagrams to define models. Then the Umple code is transformed by its compiler in a traditional language like Java or PHP.

Umple can be used in more than one way:

  • it can be used to describe UML diagrams in a textual manner
  • it can be used, in combination with a traditional language, as a preprocessor or an extension for that target language. The Umple compiler transform the Umple code in the target language and leaves the existing target language unchanged
  • given its large support for UML state machine it can work as a state machine generator, from a Umple description of a state machine an implementation can be generated in many target languages

The following Umple code is equivalent to the next UML diagram.

 

 

Umple Example UML

Instead the following Umple code describes a state machine.

Database-related Code

It all revolves around a database schema, from which code is generated or from a database, from which a schema is generated. These kind of generators are possible because of two reasons:

  • relational databases support a standard language to interact with them (SQL)
  • in programming languages there are widespread patterns and libraries to interact with databases

These two reasons guarantee that is generally possible to create standard glue code between a programming language and a database containing the data the program needs. In practice the database schema works as a simple model that can be used to generated code.

Many frameworks or IDEs contain basic tools to generate a database schema from class or, vice versa, generate a class to interact with a table in a database. In this section we see an example of a tool that can do something more.

Celerio

Celerio is a code generator tool for data-oriented application.

Celerio is a Java tool that include a database extractor, to get a database schema from an existing database. Then it couples this generated schema with configuration files and then launch a template engine in order to create a whole application. The extracted database schema is in XML format.

For instance, there is an example application that generates an Angular + Spring application from a database schema.

Celerio Overview

Domain Specific Language

DSLs are a great way to capture the business logic in a formalized way. After that this needs to be executed somehow. While sometimes interpreters and compilers are built to execute DSLs very frequently code generators are used. In this way the DSL can be translated to a language for which a compiler already exist, like Java or C#.

Nowadays DSLs can be built using language workbenches, which are specific IDEs created to design and implement DSLs. Language Workbenches are useful because they make possible also to define editors and other supporting tools for the DSL at a reduced cost. This is very important because DSLs can be used by non-developers, which needs custom editors to take advantage of the features of the language or simply are not able to use a common textual editor. Among other features language workbenches typically are integrated with code generation facilities. Let’s see a couple of examples.

JetBrains MPS

JetBrains MPS is a language workbench based on a projectional editor. You can use it to create one or more DSLs. It can also be used to extend an existing language. For example, mbeddr is an extension for C to improve embedded programming and it is based on JetBrains MPS.

The term projectional editor means that MPS preserves the underlying structure of your data and show it to you in an easy editable form. The concept might be a bit hard to wrap your heard around. Think about the traditional programming: you write the source code, then the compiler transform the source code in its logical representation, the parse tree, that it uses to perform several things, like optimization or transforming in the machine code to be executed. With a projectional editor you are directly working with the logical representation: the parse tree. However, you can only modify it in the way that the editor (MPS) allows you.

The main consequence is that when creating a DSL with JetBrains MPS you need the whole IDE, with all its power and heaviness. You can get syntax highlighting, code completion, project management, etc.

The drawback of this approach is that you can modify the code only in certain ways. So, for example, normally you cannot do things like  copy and paste some code into your program, what you see in your editor is just a representation of the code, not the real code (i.e., the parse tree).

However, the advantage of this approach is that you can create a DSL that uses any form of input to modify the code, so you can create a graphical editor, a tabular input or even normal text. This advantage makes it particularly useful to create DSL, that can be used also by non-programmers.

This video shows an overview of JetBrains MPS.

Generating Code with JetBrains MPS

The workflow of working with JetBrains MPS is done mostly inside this software itself: you manipulate all aspects of your language in there. Once have done all of this, you may want to execute your code and the most common way to do is to translate it to a language for which you have a compiler. Here comes in play code generation.

This requires to do two things:

  1. Translating high level concepts to lower level concepts (essentially a model to model transformation)
  2. Have a text generator that takes the lower level concepts and map them directly to text

Obviously here we cannot go in many details but we can show you an example of a template used for model to model transformations: here we want to map a concept named Document to the corresponding Java code. You will see that we have basically a Java class with a main method. Inside that class some elements are fixed while others depends on the values of Document (i.e. the name of the class and the content of the container)

A template for code generation in MPS

We have said that this is model to model transformation because this template will generate a Java model. However MPS comes also with a text generator that can translate a Java model into code so at the end you will get a Java file like this one:

This is the easiest way to generate code using the facilities provided by the MPS platform. However, nothing forbids you to generate the code yourself. Since you can use this language also to define the behavior of the components, you could use it to interact with the filesystem, directly and write the code yourself. For example, you could write C#, Python or any other language into a file manually.

Basically, you could create a new file an add content to it, like this print("Python code here"). Of course it would be cumbersome, since you would basically write long strings of text in a file. So, this approach is useful only for writing configuration files or data formats, not code in a programming language.

Xtext

Xtext is a language workbench, built on top of Eclipse and the Eclipse Modeling Framework. It can be used to design textual DSLs and obtain editors for them.  Functionally Xtext is a combination of different tools (e.g., ANTLR to parse, Eclipse for the UI, etc.) assembled to generate DSLs.

For instance you are going to define a grammar for your DSL using the ANTLR format, like in this example.

Xtext itself will generate a parser that will recognize the code of your language and give to you an EMF model. At that point you could use different system to perform model-to-model transformations or code generation on the model.

Model-to-model transformation means that you can transform your model created with Xtext in another model. This could be an intermediate step on your way to generate code or the end product of your project. That is because Xtext takes advantage of already available tools, so you could use your EMF model with other tools from the EMF platform. This is one the great advantages of using a platform like Xtext: it can be used as a piece of a larger toolchain.

Code Generation with Xtext

There is also flexibility in generating the code for your language. Compare to the monolith that is JetBrains MPS it has both advantages and disadvantage. One important advantage is the fact that the generation process is simpler.

Essentially Xtext gives you a language (Xtend) that is a more powerful dialect of Java, with things like support for macro. However it does not provide special powers to generate code.

This is an example of a file that govern code generation.

As you can see, while you are note exactly writing long string of code, you are quite close. The support for macros reduces repetitions, but it is not a smooth experience.

This code in Xtend language is then transformed in Java code, where is very evident how manual the process is.

It looks very much as if you had build the generator by hand.

In any case, this also shows an advantage of the ad-hoc nature of Xtext. This is just Java code that you can execute and work with however you want it.

JetBrains MPS and Xtext are very different: they have their strong points and weakness. One is all integrated, which make easier to create standalone DSLs, the other is a combination of tools, which make easier to integrate your DSLs with an existing codebase or infrastructure.

This presentation explains how to develop DSLs with Xtext.

DSL Development with Xtext – Sven Efftinge from JavaZone on Vimeo.

Ad-Hoc Applications

This category includes everything: from tools designed to handle one thing to ad-hoc systems, used in an enterprise setting, that can generate entire applications from a formal custom description. These applications are generally part of a specific workflow. For example, the customer uses a graphical interface to describe an app and one ad-hoc system generates the database schema to support this app, another one generates a CRUD interface, etc.

This is not a proper defined category, but a catchall one that includes everything that does not belong to a specific groups. This means that there is no standard structure for this groups of programs. It is also a testament of the versatility of code generation: if you can create a model or description of the problem then you can solve it with code generation. Of course, you still have to understand if it makes sense to solve the general problem and create a code generation tool, or simply solve your problem directly.

In this section we talk about two tools: CMake, a development tool, and Yeoman, a scaffolding tool. The first one essentially produces configuration files: software that feeds other software. The second one simplify the life of developers, by providing a way to create ready-to-use project optimized for a particular software platform, library or need.

CMake

CMake is an open-source, cross-platform family of tools designed to build, test and package software.

CMake includes three development tools create to help developing with C and C++. The main tool is designed to generate build files (i.e., makefile(s) and project files) for different platforms and toolchains. For instance, tt can generate makefiles for Linux and Visual Studio project files.

CMake is not a compiler. The user defines the structure of the project in the CMake format, then the tool generate the normal build files that are used in the traditional building process.

A CMake file looks like a series of commands/macro that set options/flags for the compiler, link libraries, execute custom commands, etc.

Yeoman

Yeoman is a generic scaffolding system allowing the creation of any kind of app.

Nowadays being a good programmer means much more than just knowing how to code. You need to know the best practices for every tool you use and remember to implement each time. And that is hard enough when writing the code itself, but it is even more work when it  configuration files and the right project structure. That is where a tool like Yeoman comes in: it is a scaffolding tool that generates with one command a new project which implements all the best practices instantly.

The core of Yeoman is a generator ecosystem, on top of which developers build their own templates. The tool is so popular that there are already thousands of template available.

Yeoman is a JavaScript app, so writing a generator requires simply to write JavaScript code and using the provided API. The workflow is quite simple, too: you ask the user information about the project (e.g., name), gather configuration information and then generate the project.

The following code shows part of an example generator to create a Yeoman template.

Summary

In this article we have seen a glimpse of the vast world of code generation. We have seen the main categories of generators and a few examples of specific generators. Some of these categories would be familiar to most developers, some will new to most. All of them can be used to increase productivity, provided that you use them right.

There is much more to say about each specific categories, in fact we wrote a few articles on parsing generators for different languages: Java, C#, Python and JavaScript.  However we wanted to create an introductory article to introduce to this world without overwhelming the reader with information. And you, how do you plan to take advantage of code generation?

Download the guide with 68 resources on Creating Programming Languages

68resources

Receive the guide to your inbox to read it on all your devices when you have time

Powered by ConvertKit