Strategies, tips and tutorials on designing programming languages

Building a language: tool support

When building a language there is one aspect which is absolutely crucial: tool support.
Tool support will determine whether your language is usable at all; it will influence how users receive it and how productive they can be with it.
In this video I explain why, and how you can build great tool support with limited effort.

Tool support is what makes a real language. This is also one of the reasons why I am against the whole concept of internal DSLs: they are nice but they are not really languages. They cannot be compared in any way to external DSLs (a.k.a. real DSLs).

And it is nice that experts agree on this:

Building a compiler for your own language: validation


So you have parsed your code and built a clean AST for it. Now it is time to check whether what the user has expressed makes sense at all. We should perform validation, identifying semantic errors to report together with the lexical and syntactic errors provided by the parser.


This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it into this book:
How to create pragmatic, lightweight languages

Code

Code is available on GitHub under the tag 07_validation

Implement semantic checks

In the previous post we saw how to implement the function process, which executes an action on all the nodes of our AST. A typical case is that we want to execute a certain operation only on certain kinds of nodes, while still using process to navigate the tree. We can do that by creating a function named specificProcess.
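
A possible shape for specificProcess is sketched below in Kotlin; it assumes the Node interface and the process function from the previous post, and the exact implementation is in the repository.

```kotlin
// Reuse process to visit every node, but run the operation
// only on the nodes that are instances of the requested class.
fun <T : Node> Node.specificProcess(klass: Class<T>, operation: (T) -> Unit) {
    process { node ->
        if (klass.isInstance(node)) {
            operation(klass.cast(node))
        }
    }
}
```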

Let’s see how to use specificProcess (a sketch of the resulting validate function follows the list) to:

  • find all the VariableDeclarations and check that they do not re-declare a variable which was already declared
  • find all the VarReferences and verify that they do not refer to a variable that has not been declared, or that has been declared after the VarReference
  • perform the same check done on VarReferences also for Assignments
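
A sketch of such a validate function is shown below. The names (SandyFile, VariableDeclaration, VarReference, Assignment, Error) follow the series, but details such as the position comparison are simplified; the complete implementation is in the repository.

```kotlin
fun SandyFile.validate(): List<Error> {
    val errors = mutableListOf<Error>()

    // 1) no variable can be declared twice
    val declaredVars = mutableMapOf<String, VariableDeclaration>()
    this.specificProcess(VariableDeclaration::class.java) {
        if (declaredVars.containsKey(it.varName)) {
            errors.add(Error("A variable named '${it.varName}' has been already declared", it.position))
        } else {
            declaredVars[it.varName] = it
        }
    }

    // 2) a variable can be referred to only after it has been declared
    //    (a complete check also compares positions, omitted here)
    this.specificProcess(VarReference::class.java) {
        if (!declaredVars.containsKey(it.varName)) {
            errors.add(Error("There is no variable named '${it.varName}'", it.position))
        }
    }

    // 3) the same check applies to assignments
    this.specificProcess(Assignment::class.java) {
        if (!declaredVars.containsKey(it.varName)) {
            errors.add(Error("There is no variable named '${it.varName}'", it.position))
        }
    }

    return errors
}
```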

Ok, so invoking validate on the root of the AST will return all possible semantic errors.

Getting all errors: lexical, syntactic, and semantic

We first need to invoke the ANTLR parser and get:

  • the parse tree
  • the list of lexical and syntactic errors

Then we map the parse tree to the AST and perform the semantic validation. Finally we return the AST and all the errors combined.

In the rest of the system we will simply call the SandyParserFacade without the need to invoke the ANTLR parser directly.
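
A sketch of this facade follows. SandyLexer and SandyParser are the classes generated by ANTLR (assuming the top rule is called sandyFile), toAst and validate come from the previous posts, while ParsingResult, Error and Point are simple supporting classes whose names here are illustrative.

```kotlin
import org.antlr.v4.runtime.ANTLRInputStream
import org.antlr.v4.runtime.BaseErrorListener
import org.antlr.v4.runtime.CommonTokenStream
import org.antlr.v4.runtime.RecognitionException
import org.antlr.v4.runtime.Recognizer

data class ParsingResult(val root: SandyFile?, val errors: List<Error>)

object SandyParserFacade {

    fun parse(code: String): ParsingResult {
        // Collect lexical and syntactic errors through an ANTLR error listener
        val lexicalAndSyntacticErrors = mutableListOf<Error>()
        val errorListener = object : BaseErrorListener() {
            override fun syntaxError(recognizer: Recognizer<*, *>?, offendingSymbol: Any?,
                                     line: Int, charPositionInLine: Int,
                                     msg: String, e: RecognitionException?) {
                lexicalAndSyntacticErrors.add(Error(msg, Point(line, charPositionInLine)))
            }
        }

        // 1) invoke the ANTLR lexer and parser to get the parse tree
        val lexer = SandyLexer(ANTLRInputStream(code))
        lexer.removeErrorListeners()
        lexer.addErrorListener(errorListener)
        val parser = SandyParser(CommonTokenStream(lexer))
        parser.removeErrorListeners()
        parser.addErrorListener(errorListener)
        val parseTreeRoot = parser.sandyFile()

        // 2) map the parse tree to the AST and run the semantic validation
        val astRoot = parseTreeRoot.toAst()
        val semanticErrors = astRoot.validate()

        // 3) return the AST and all the errors combined
        return ParsingResult(astRoot, lexicalAndSyntacticErrors + semanticErrors)
    }
}
```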

Test validation

Will it fly? Let’s verify that.
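
For example, with a few tests along these lines (the real test names and error messages are in the repository under the tag 07_validation; this is just a sketch):

```kotlin
import org.junit.Test
import kotlin.test.assertEquals

class ValidationTest {

    @Test fun duplicateVariableIsReported() {
        val errors = SandyParserFacade.parse("var a = 1\nvar a = 2").errors
        assertEquals(1, errors.size)
    }

    @Test fun unknownVariableIsReported() {
        val errors = SandyParserFacade.parse("var a = b + 2").errors
        assertEquals(1, errors.size)
    }

    @Test fun validCodeHasNoErrors() {
        val errors = SandyParserFacade.parse("var a = 1 + 2").errors
        assertEquals(0, errors.size)
    }
}
```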

Conclusions

This is all well and good: with a simple call we can get a list of all the errors, each with a description and a position. This is enough for our compiler, but now we need to show these errors in the editor. We will do that in one of the future posts.


Getting started with ANTLR: building a simple expression language


This post is part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified it into this book:
How to create pragmatic, lightweight languages

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language, but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very little material on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will permit defining variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the use of parentheses

Examples of a valid file:
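
For instance, something along these lines (illustrative; the exact syntax is fixed by the grammar we are about to define):

```
var a = 10 / 3
var b = (5 + 3) * 2
a = a * b
var c = a + 1.5
```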

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • Gradle as our build system
  • Kotlin to write the code. It will be very basic Kotlin, given that I have just started learning it.

Setup the project

Our build.gradle file will look like this:
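
Roughly like this, that is; the plugin and dependency versions below are indicative, and the exact file is in the repository under the tag 01_lexer.

```groovy
buildscript {
    ext.kotlin_version = '1.0.3'
    repositories { mavenCentral() }
    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
    }
}

apply plugin: 'kotlin'
apply plugin: 'antlr'
apply plugin: 'idea'

repositories { mavenCentral() }

dependencies {
    antlr "org.antlr:antlr4:4.5.1"   // generates the lexer (and later the parser)
    compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
    testCompile "junit:junit:4.12"
    testCompile "org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version"
}
```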

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:
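
A lexer grammar along these lines should do; this is a sketch covering the tokens we listed above, and the exact grammar is in the repository.

```antlr
lexer grammar SandyLexer;

// Whitespace
NEWLINE  : '\r\n' | '\r' | '\n' ;
WS       : [\t ]+ -> skip ;

// Keywords
VAR      : 'var' ;

// Literals
INTLIT   : '0' | [1-9][0-9]* ;
DECLIT   : ('0' | [1-9][0-9]*) '.' [0-9]+ ;

// Operators
PLUS     : '+' ;
MINUS    : '-' ;
ASTERISK : '*' ;
DIVISION : '/' ;
ASSIGN   : '=' ;
LPAREN   : '(' ;
RPAREN   : ')' ;

// Identifiers
ID       : [_]*[a-z][A-Za-z0-9_]* ;
```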

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.

Testing the lexer

Testing is always important, but while building languages it is absolutely critical: if the tools supporting your language are not correct, this could affect all the programs you will build with them. So let’s start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we expect.
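
A sketch of such a test: we run the generated SandyLexer on a string and compare the symbolic names of the tokens it produces with the ones we expect (the real tests are in the repository under the tag 01_lexer).

```kotlin
import org.antlr.v4.runtime.ANTLRInputStream
import org.antlr.v4.runtime.Token
import org.junit.Test
import kotlin.test.assertEquals

class SandyLexerTest {

    private fun tokens(code: String): List<String> {
        val lexer = SandyLexer(ANTLRInputStream(code))
        val tokenNames = mutableListOf<String>()
        do {
            val token = lexer.nextToken()
            if (token.type == Token.EOF) {
                tokenNames.add("EOF")
            } else {
                // WS tokens are skipped by the lexer itself, so they never show up here
                tokenNames.add(SandyLexer.VOCABULARY.getSymbolicName(token.type))
            }
        } while (token.type != Token.EOF)
        return tokenNames
    }

    @Test fun parseVarDeclarationAssignedAnIntegerLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                tokens("var a = 1"))
    }

    @Test fun parseMathematicalExpression() {
        assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "EOF"),
                tokens("1 + a * 3 / 4"))
    }
}
```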

Conclusions and next steps

We started with the first small step: we set up the project and built the lexer.

There is a long way ahead of us before the language is usable in practice, but we have started. We will next work on the parser with the same approach: building something simple that we can test and compile through the command line.

What if sketching on a whiteboard was a form of programming?

There are different possible notations for languages: textual and graphical are the two main families.

We can define graphical notations for formal languages. For example we can define a graphical notation to define state machines.

A Domain Specific Language for state machines can be defined using JetBrains MPS. However the user would have to edit the state machine inside MPS.

I wonder if we could instead specify a state machine just by sketching something on a whiteboard. Consider this example:


We have a start state (the white circle on top) which leads us to the state A without needing any event. Then from state A we can go to state B when an event of type b occurs, or we can go to state C if an event of type a occurs. From C we need an event of type c to move to D.

Note that I did not represent the direction of the transitions; we could assume that they always lead from the higher state to the lower one.

I think that at this point we should be able to just take a picture of this sketch and obtain a program which would permit us to run this state machine.

To do that we should recognize the different shapes and the text present, and assign everything to a corresponding concept in our DSL.

The first step would be to recognize rectangles. I guess there are already libraries to do that. I tried OpenCV but I really do not like the way C++ dependencies are handled. I started playing with BoofCV instead, which is written in Java. I first used functions to recognize segments and I got something like this:


Then I wrote some functions to merge segments which have endpoints very close and similar slope, to reduce the number of segments. That did not work particularly well.


I realized I should probably use some pre-processing: first apply a Gaussian blur and then calculate the derivatives over the X and Y axes. I got something like this. The colors depend on the values present in the X and Y derivative images. I should use those values to recognize horizontal and vertical lines with common endpoints, and so get rectangles.


This is not as easy as I hoped but I would love to spend more time on it in the future.

Conclusions

Still, people will need some ability to formalize their thoughts but we can definitely remove some of the incidental barriers due to languages, notations and tools.

Raising the level of abstraction: what if we tried to do that bottom up?

It is clear that software development has been moving in the direction of raising the level of abstraction since it was born.

The reason is simple: we can juggle only so many concepts in our mind. If we are busy thinking about what we stored in the EAX register it is difficult to think about the bigger picture. As we raise the level of abstraction we start using bigger bricks and we can see farther. Once we think in terms of objects the OOP design patterns can emerge, because they build on combinations of a few concepts at our current level of abstraction.

How are we moving to higher levels of abstractions? By hiding the details. We hide how we use registers, how we allocate memory, how pointers are handled, how polymorphism is implemented.

There are several ways we can achieve that: normally we build new languages which we then compile or translate to some lower-level abstraction. For example, many C++ compilers at the beginning just spat out C, and compilers for Eiffel used the same strategy. One thing that instead has not been tried is to evolve languages by adding new concepts as they emerge. In this case I am thinking about a gradual process which would target micro-patterns. Things like getters or setters in Java. Or things like this one:

What are we doing here? We are just returning self.data[code], and if it was not initialized we first set it to a default value ({}). Ok, lazy initialization: this is a common concept, comprehensible per se. What if we could add this concept and see if it is then used to build up new patterns?
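
Purely as an illustration (in Kotlin rather than in the Python-like snippet discussed above), here is the micro-pattern written out by hand, and the same idea once a library or language captures it as a single construct:

```kotlin
class Catalog {
    private val data = mutableMapOf<String, MutableMap<String, Int>>()

    // The micro-pattern written by hand: return data[code],
    // initializing it to a default value on first access.
    fun entriesFor(code: String): MutableMap<String, Int> {
        var entries = data[code]
        if (entries == null) {
            entries = mutableMapOf()
            data[code] = entries
        }
        return entries
    }

    // The same concept, once "lazy initialization" is captured as one construct.
    fun entriesForConcise(code: String): MutableMap<String, Int> =
        data.getOrPut(code) { mutableMapOf() }
}
```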

This would be trivial to do using a tool like Jetbrains MPS.

I plan to play a little bit with this idea next week and see where it leads me.

Getting started with Jetbrains MPS: how to define a simple language (screencast)

I am blessed (or cursed) with having many different interests in software development. However my deepest love is with language engineering. There are different tools and techniques but Jetbrains MPS is particularly fascinating.

To help people get started, or just get a feeling for what it is like to develop with Jetbrains MPS, I recorded this short screencast.


If you are not familiar with this tool, here are the basic ideas.

Jetbrains MPS is a projectional editor: that means that it internally stores the information contained in the different nodes, not a textual representation of it.

If you create a model, then change how certain nodes should look, and then reopen the model, it will automatically use the new “look & feel” that you just defined. This is simply because what is stored is the pure information, while the appearance is recreated each time you open a model. In other words, when you open a model it is projected in a form that you defined and that should be easy for a human being to understand.

This is very different from what happens with textual languages: suppose you create your language and then decide to replace braces with brackets. You now have to go manually over all your existing source files and modify them, because in that case what is saved is not the pure information (the abstract syntax tree) but its representation: the textual form.

In Jetbrains MPS you work on the different concerns of your language separately: you first define the single concepts and the information they contain, then you work on the other aspects. There are several of them, like the typesystem, generation, dataflow, etc. In this short screencast we just look at the concept definitions and the editor aspect.

One of the advantages of a projectional editor is that you can change the appearance of your nodes without invalidating your existing models. That is great to incrementally improve the appearance of the single nodes and the way they can be edited. In the screencast we start with the default representation of the concept we defined, as provided by default by Jetbrains MPS. Then we refine the editors in a couple of steps until we have a representation which is good enough.

At the end of the screencast we have a language we can use to define very simple todo lists. The possibilities to improve it are endless:

  • we could add more concepts: like priority, deadlines and tags for each single task
  • we could add the possibility to mark a task as done and add a nice shortcut to do so
  • we could integrate the language with some other tool: what if we could synchronize our models with Trello or with workflowy?
  • we could support subtasks

Question: have you ever thought of creating your own language, either a general purpose language or a domain specific language? If so did you actually get started?


Dynamic, static, optional, structural typing and engineering challenges

How an engineer should NOT look at technical matters

Dynamic versus static typing is one of those confrontations that seem to attract zealots. It really saddens me to see how people tend to vehemently defend their side simply because they are not able to understand the other position. In my opinion, if you do not get why both static and dynamic typing are great in certain situations, you are an incomplete software developer. You have chosen to look at just a portion of the picture.

An engineer should be open-minded: it is ok to have your opinions but when you keep bashing ideas that made millions of developers very productive, chances are that you are missing something. Try being humble, you could end up learning something.

Ok, so what?

I have had very good experiences using Java and Haskell (from the static side) and Python and Ruby (from the dynamic field). There is no way you can obtain the succinctness of Ruby code, or the amazing possibilities given by its metaprogramming features, with statically typed languages. On the other hand, I really end up missing static typing when I have to refactor large chunks of code. Tests help (when they are there) but they are often not enough.

Now, there are approaches that try to get the best of both worlds. One solution is to combine languages for different parts of the system: perhaps you could write the infrastructure of your platform using Java and the business logic using JRuby. Why not?

However, that means that you still have some challenges at the language boundaries. You need to master several tools, learn the best practices for the different languages, and so on. Therefore it would be interesting to get the advantages of both worlds in one single language.

A compromise?

A very partial solution is to use static typing with type inference: it does not give you the same power you get with dynamically typed languages, but you get some of the succinctness. Not a giant leap, but it helps close the gap. Types are still calculated at compile time, which means you may have to streamline your solution so that the compiler can understand it too. However, you do not have to write out all the types yourself. Also, when refactoring you have to change types in fewer places.

Type inference sounds like a great idea but it has its drawbacks: try writing some Haskell code, and pretty soon you will start being puzzled about the type the compiler derives for a value. In practice I need to add some annotations to make some of the types explicit, otherwise I get lost pretty quickly.

Another approach to combine static and dynamic typing is to start from a dynamic language and try to make it more static-like. There are two options for that: optional typing and structural typing.

Basically, optional typing means that you can add some type annotations in certain parts of your dynamic language, so that a part of it can be statically checked while the rest keeps being dynamic. This can be helpful both to catch errors and to improve performance, because it enables the compiler to optimize some calls. Look at Typed Clojure and PEP 484: Type hints for Python.

Structural typing means having duck types: a type is the list of the operations it supports. Basically, you can look at the code of a method and see which operations it performs on its parameters. Does it invoke the method foo on the parameter bar? Cool, that means that when this method is invoked we should ensure that bar has some type that has the method foo. The name of the class does not matter; what matters is its structure (i.e., the fields and methods it has). In principle a good idea, but imho it can get confusing quite quickly.

It is supported in Scala, where you can write this:
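
Something along these lines, for instance (a minimal illustrative sketch, not taken from any particular codebase):

```scala
import scala.language.reflectiveCalls

object StructuralTypingExample extends App {
  // Any object exposing a quack() method is accepted, regardless of its class
  def makeItQuack(duck: { def quack(): String }): String = duck.quack()

  class RobotDuck { def quack(): String = "beep" }
  class Person    { def quack(): String = "quack?" }

  // Both calls compile: what matters is the structure, not the class name
  println(makeItQuack(new RobotDuck))
  println(makeItQuack(new Person))
}
```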

However, it seems verbose, doesn’t it? With structural type inference perhaps we could avoid the verbose type declaration and still have some safety at compilation time. At this time I am not aware of any language that gets this exactly right. Any suggestions?

Can’t I just have Ruby with a super smart compiler catching my stupid errors?

No, so far you can’t.

People have tried to do that: think about the Crystal language, which aims to be a statically typed version of Ruby, or Mirah, which tried to be a statically typed version of JRuby. It is tempting to think that you can take a nice language like Ruby, just add a little bit of static typing, and get the best of both worlds. The problem is that it does not work that way, as the authors of Crystal figured out. Eventually that particular feature ends up having an impact on other aspects of the language and forces you to rethink it globally.

I think this is often the case in engineering projects: a customer thinks you can just squeeze another feature in, and does not get its impact on other aspects such as security or architectural integrity. We as engineers have to juggle many balls and take decisions considering all of them. Adding another ball could force us to rethink our strategy or cause all the balls to fall to the ground.

Alternatives to global variables and passing the same value over a long chain of calls

The good old global variables are the wrong solution to a real problem: you want some information to be available in a large context. In object-oriented languages like Java we often do not have something we call a global variable. We instead lie to ourselves, using public static variables or singletons and denying that they are exactly the same thing. No, we are too object-oriented and too good as developers to be using global variables.

Passing a value along every single method call: much better?

Ok, so you repent and decide to give up global variables. What do you do? You pass a value explicitly. Cool. And then you pass it to another method, and to another, and the same value is propagated over and over until it is available in many places of your application. Much better, eh? You have polluted the signatures of far too many methods, and should you ever decide to change the type of the value passed, that would require a lot of work.


You have a bunch of methods just passing around the information. Many of them do absolutely nothing with that value, but still need to receive it to be able to pass it along the chain.

Consider a concrete example: I was working on a tool for analyzing Java code, and some objects required an instance of a TypeResolver: an object that, given a name (such as java.lang.String or my.shiny.Class), would provide a representation of the corresponding type. Good.

Let’s call the classes that use a TypeResolver directly typeResolverUsers. Now, whoever was using the typeResolverUsers (let’s call them the usersOfTypeResolverUsers) needed a TypeResolver to pass to the typeResolverUsers they were invoking. So each of the usersOfTypeResolverUsers asked for a TypeResolver at construction time and stored it, to later pass it to the classes which actually needed it. So now whoever was instantiating one of the usersOfTypeResolverUsers needed to have a TypeResolver reference. And so on, until far too many classes were aware of the existence of the TypeResolver class.

That is not nice to refactor and it pollutes your methods. I solved this particular case using a Registry. It was not flawless and not so straightforward so I was wondering: is there a better solution?

Another example: Position and WorldSize

Let’s think about another example: we want to represent a Position on a world map (think about my beloved world generator: WorldEngine). Now, a Position is of course defined by two values: x and y. However, given that it is a world map, we want to be able to wrap left-right. So to perform certain operations we need access to the world size: if we know that the world is 500 cells wide, when we move right from a Position with x=499 we end up with x=0.
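
To make this concrete, here is a small Kotlin sketch of the wrapping operation (names are illustrative); note how the WorldSize has to reach every operation that needs to wrap:

```kotlin
data class WorldSize(val width: Int, val height: Int)
data class Position(val x: Int, val y: Int)

// Moving right wraps around the left-right edge of the map:
// with a world 500 cells wide, (499, y) moves to (0, y).
fun Position.moveRight(worldSize: WorldSize): Position =
    Position((x + 1) % worldSize.width, y)
```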

In this scenario we could:

  • use a global variable: it means that in our program there is one possible world size, stored in some place accessible from the whole application. That works if this assumption holds. If instead we are working on several worlds at the same time, we need an alternative.
  • store the world size in the Position instance. That means that whenever we want to create a Position we need a reference to a world size. Given that the application deals with Position instances all the time, many methods (perhaps most methods) will get a WorldSize instance and pass it down the chain.

I am not happy with either of these alternatives. I would like to have something that is almost like a global variable, but with a limited and controlled scope. Or, if you want, I would like to implicitly pass some contextual information without polluting the signatures of all my methods.

I think that a Position is really defined only by x and y. It needs to know the WorldSize to perform certain tasks, but this is contextual information, not something that is controlled or owned by the Position.

Contextual information: what the heck is that?

For this reason I am working on the concept of contextual information. For now I have implemented it like this:

  • we can declare that certain values are context dependent: we specify a name and a type for each of them
  • we can assign a value to a certain context-dependent variable according to a dynamic scope: it means that we make the value available not only in a certain block of statements but also in all the functions and methods invoked from that block, recursively
  • the value is accessible only in a given thread (yes, ThreadLocal to the rescue; a rough sketch of the idea follows below)
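
This is not how Turin implements it, but a rough Kotlin sketch built on ThreadLocal may help clarify the idea (all names here are hypothetical):

```kotlin
import java.util.Optional

data class WorldSize(val width: Int, val height: Int)

// A context-dependent value: it has a name and a type, its value is set for a dynamic
// scope (a block plus everything invoked from it) and it is visible only in the current thread.
class ContextValue<T>(val name: String) {
    private val holder = ThreadLocal<T?>()

    // We cannot statically guarantee that a value was set, so reading returns an Optional
    fun get(): Optional<T> = Optional.ofNullable(holder.get())

    // Make the value available while running the block, then restore the previous value
    fun <R> with(value: T, block: () -> R): R {
        val previous = holder.get()
        holder.set(value)
        try {
            return block()
        } finally {
            holder.set(previous)
        }
    }
}

val worldSize = ContextValue<WorldSize>("worldSize")

fun example() {
    worldSize.with(WorldSize(500, 200)) {
        // any code invoked from here, however deep, can call worldSize.get()
        // without WorldSize appearing in any method signature
    }
}
```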

Examples

This is the current syntax as implemented in my programming language, Turin:

Alternative solutions

There are many ways to make values available to your application without explicitly using global variables. One way is to use a Singleton, another is to declare static variables.

Another solution is the RegistryPattern:

A registry is a global association from keys to objects, allowing the objects to be reached from anywhere. It involves two methods: one that takes a key and an object and adds the object to the registry, and one that takes a key and returns the object for that key.
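
A minimal Kotlin sketch of such a registry (illustrative, and returning an Optional as discussed below):

```kotlin
import java.util.Optional

// A global association from keys to objects: one method to register an object
// under a key, one to retrieve the object registered for a key.
object Registry {
    private val entries = mutableMapOf<String, Any>()

    fun register(key: String, value: Any) {
        entries[key] = value
    }

    fun lookup(key: String): Optional<Any> = Optional.ofNullable(entries[key])
}
```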

Pros & Cons

With respect to a global variable, we are able to limit the scope of a value to a context which is not static but dynamic; in any case, totally unrelated pieces of code cannot mess with the same global variables we are using.

The main issue with this solution is that we cannot statically determine whether a value is available in a given context. However, we return an Optional, leaving to the user the responsibility of verifying whether the value is present.

With respect to the ContextObject pattern we offer type checking: we declare the type of a context value, so we can verify that only compatible values are ever assigned to it.


Representing relationships as first-class citizens in an Object-oriented programming language

As beginners we used to write very large functions and then giant God classes. As we improve our skills, our classes become smaller but more numerous: we get a swarm of small, focused classes that collaborate to create the final system. In other words, we shift the complexity from the relations between the components of a class to the relations between classes. Indeed, most Design Patterns just describe how different classes should be related.


We have different kinds of relationships between classes: aggregation and composition, unidirectional and bidirectional associations. We have constraints on associations and we could have attributes associated to these relations. However we do not have language support for relations.

This is very different in meta-modelling languages or frameworks, such as EMF, which instead permit describing relations. When building complex models, for example Abstract Syntax Trees, a good representation of relations helps a lot.

So I am considering adding support for relations in Turin (my very own programming language).

Example: an AST

Consider an abstract syntax tree. The basic relation is a composition relation, so each node can contain zero or more child nodes, while every node has a single parent node (or none, for the root of the tree).

Suppose we can represent this relation like this:

Ideally we would like to use it like this:

So we would expect a fully bidirectional relation: when we update one endpoint, the other one is automatically aware of the change:

  • if we set the parent of A to be B, the children of B will list A
  • if we add A to the children of B, the parent of A will be B

And each node can have one parent so:

No book-keeping is needed: all the parties involved always have the most up-to-date picture. Normally in Java you would have a parent field and a list of children in each single node, and changing the parent of a node would mean:

  1. alerting the old parent so that it can remove the node from its children
  2. alerting the new parent so that it can add the node to its children
  3. updating the parent field in the node

Boring and error prone. In Turin instead it just works out of the box. Multiple endpoints (such as children) are seen as auto-updating lists, while single endpoints (such as parent) are instead seen as a sort of Optional (with the possibility of setting the value, not just verifying whether it is present and reading it).
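
For comparison, this is roughly what the manual book-keeping looks like without language support for relations (a Kotlin sketch with hypothetical names):

```kotlin
class TreeNode {
    var parent: TreeNode? = null
        private set
    private val _children = mutableListOf<TreeNode>()
    val children: List<TreeNode> get() = _children

    fun changeParent(newParent: TreeNode?) {
        parent?._children?.remove(this)   // 1. alert the old parent
        newParent?._children?.add(this)   // 2. alert the new parent
        parent = newParent                // 3. update the parent field
    }
}
```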

Relationships and subsets

That is great, however we would like to have something more. We would like to have different specializations of those relationships. For example:

  • a Method (which is a Node) can have multiple FormalArguments and one return type (TypeUsage)
  • a FormalArgument (which is a Node) has one type (TypeUsage)

Now, we would like to add a FormalArgument instance as the param of a method and see it both as part of the “params” and of the “children” of that Node.

In other words “params” would represent a subset of the “children” of Method. Also “returnType” would be a subset (with exactly one element) of the “children” of method.

Note that I want to be able to:

  • add a node into a subset (e.g., params). I should see that node both among the params and the children
  • add a node directly among the children: I should see the node as part of the children but not as part of any subset

Status

This is the syntax I am working on and most of the implementation for relationships is ready. Time to battle test it!

I am still working on relationships subsets but many tests are already passing and I should be ready to commit it soon.

For now I plan to support only relations with two endpoints, which can be either single or multiple. In the future relations could hold values or have additional constraints, like we have for fields. I am still exploring the idea and experimenting with how it works in practice; after all, there are not many programming languages supporting relationships out there 🙂

A few references

No, I am not the first one to think about relations as first-class citizens in object-oriented programming languages:

A few useful references to understand more on the theoretical level:

I have to study these papers better myself, and then be sure to deliver a good compromise that makes it easier to think about the classes of problems which involve relationships and associations. After all, this is the goal of Turin: to be a tool which supports thinking about problems and solutions.

Turin Programming Language for the JVM: building advanced lexers with ANTLR

As I wrote in my last post, I recently started working on a new programming language named Turin. A working compiler for an initial version of the language is available on GitHub. I am currently improving the language and working on Maven and IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax to a representation easier to process. For example we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
  3. Perform semantic checks. For example we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequence of tokens, while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.), creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why use ANTLR?

ANTLR is a very mature tool for writing lexers and parsers. It can generate code for several languages and has decent performance. It is well maintained, and I was sure it had all the features I could possibly need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes it possible to write simple grammars because it solves left recursive definitions for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have used ANTLR while building a framework for model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR, and I have no reason to look into alternatives.

The current lexer grammar

This is the current version of the lexer grammar.

A few choices I have made:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This permits having less ambiguity in the grammar, because values and types can be easily distinguished. In Java, instead, when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parentheses) or a cast to the type foo. We need to look at what follows to understand it. In my opinion this is rather stupid, because in practice everyone uses capitalized identifiers for types only, but since this is not enforced by the language the compiler cannot take advantage of it
  • newlines are relevant in Turin, so we have tokens for them: we basically want statements to be terminated by newlines, but we accept optional newlines after commas
  • whitespace (except newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but retrieve them when needed. For example, we need them for syntax highlighting and in general for the IntelliJ plugin, because it requires defining tokens for each single character in the source file, without gaps
  • the trickiest part is parsing string interpolations à la Ruby, such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING, if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of the tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar). A stripped-down sketch of this mechanism follows the list.
  • I had to collapse the relational operators into one single token type, so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation needs to be executed. Nothing too awful, but you have to know how to fix these kinds of issues.
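
This is not the actual Turin grammar, just a stripped-down sketch of how such mode switching can look in an ANTLR lexer grammar:

```antlr
lexer grammar InterpolationLexerSketch;

STRING_START : '"' -> pushMode(IN_STRING) ;
// ... all the other tokens of the default mode go here ...

mode IN_STRING;
INTERPOLATION_START : '#{' -> pushMode(IN_INTERPOLATION) ;
STRING_CONTENT      : ~["#]+ ;
STRING_END          : '"' -> popMode ;

mode IN_INTERPOLATION;
INTERPOLATION_END : '}' -> popMode ;
VALUE_ID          : [a-z][A-Za-z0-9_]* ;
POINT             : '.' ;
// ... most of the expression tokens have to be repeated here ...
```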

Testing the lexer

I wrote a bunch of tests specifically for the lexer. In particular, I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

As you can see, I just run the lexer on a string and verify that it produces the correct list of tokens. Easy and straight to the point.

Conclusions

My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators into a single token type is not nice. Having to repeat several token definitions for different lexer modes is bad. However, ANTLR proved to be a tool usable in practice: it does all that it needs to do, and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not as elegant as desired, but there is one. So I can use it and move on to more interesting parts of the compiler.