The creation of the Turin Programmin Language: examples, problems, solutions and strategies.

Alternatives to global variables and passing the same value over a long chain of calls

The old, good global variables are the wrong solution to a real problem: you want some information to be available in a large context. In Object-Oriented languages like Java we often do not have something we call a global variable. We instead lie to ourself using public static variables or singletons and denying they are exactly the same thing. No, we are too much object-oriented and too good developers to be using global variables.

Passing a value to all single method call: much better?

Ok, so you repent and decide to give up global variables. What do you do? You pass a value explicitely. Cool. And then you pass it to another method and to another and the same value is propogate over and over until it is available in many places of your application. Much better, eh? You have polluted the signatures of far too many methods and should you ever decide to change the type of the value passed, that would require a lot of work.

fire-bucket-brigade

You have a bunch of methods just passing around the information. Many of them do absolutely nothing with that value, but still need to receive it to be able to pass it along the chain.

Consider a concrete example: I was working on tool for analyzing Java code and some objects required an instance of a TypeResolver: an object that given a name (such as java.lang.String or my.shiny.Class) would provide for them representations of types. Good.

Let’s call the classes using directly a TypeResolver typeResolverUsers. Now, whoever was using the typeResolverUsers (let’s call them the usersOfTypeResolverUsers) needed a TypeResolver to pass it to the typeResolverUsers it was invoking. So each usersOfTypeResolverUsers asked for a TypeResolver at construction time and store it, to later pass it to classes which actually needed it. So now whoever was instantiating one of the usersOfTypeResolverUsers needed to have a TypeResolver  reference. And so on, until far too many classes were aware of the existence of the TypeResolver class.

That is not nice to refactor and it pollutes your methods. I solved this particular case using a Registry. It was not flawless and not so straightforward so I was wondering: is there a better solution?

Another example: Position and WorldSize

Let’s think about another example: we want to represent a Position on a world map (think about my beloved world generator: WorldEngine). Now the Position is of course defined by two values: x and y. However given it is a world map we want to be able to wrap left-right. So to perform certain operations we need to access the world size: if we know that the world is 500 cells wide, when we move from a Position with x=499 to the right we end up having x=0.

In this scenario we could:

  • use a global variable: it means that in our program there is one possible world size. It is stored in some place accessible from the whole application. That works if this assumption holds. If we instead are working on several worlds at the same time we need an alternative.
  • we could store the world size in the Position instance. That means that in every occasion in which we want to create a Position we need a reference to a world size. Given the application deals all the time with Position instances that means that many methods (perhaps most methods) will get a WorldSize instance and pass it down the chain.

I am not happy with neither of these alternatives. I would like to have something that is almost like a global variables but somehow having a limited and controlled scope. Or if you want, I would like to implicitly pass some contextual information without polluting the signature of all methods.

I think that a Position is really defined only by x and y. It needs to know the WorldSize to perform certain tasks, but this is a contextual information, not something that is controlled or owned by Position.

Contextual information: what the heck is that?

For this reason I am working on the concept of contextual information. For now I have implemented it like this:

  • we can declare that certain values could be context dependant: we would specify a name and a type for each of them
  • we could assign values for a certain context-dependant variable according to a dynamic scopeit means that we would make the value available not only in a certain block of statements but also to all the functions and methods invoked in that block, recursively
  • the value would be accessible only in a given thread (yes, ThreadLocal to the rescue)

Examples

This is the current syntax as implemented in my programming language, Turin:

Alternative solutions

There are many ways to make values available to your application without using explicitly global variables. One way it to use a Singleton, another is to declare static variables.

Another solution is the RegistryPattern:

A registry is a global association from keys to objects, allowing the objects to be reached from anywhere. It involves two methods: one that takes a key and an object and add objects to the registry and one that takes a key and returns the object for the key

Pros & Cons

With respect to a global variable we are able to limit the scope of a value, to a context which is not static but dynamic, anyway totally unrelated pieces of code could not mess up with the same global variables we are using.

The main issue with that solution is that we cannot statically determine if a value is available or not in a given context. However we return an Optional leaving the responsibility to the user to verify if the value is present or not.

With respect to the ContextObject pattern we offer type checking: we declared the type of a context value so we can verify that only compatible values are ever assigned to it.

 

Representing relationships as first-class citizens in an Object-oriented programming language

As beginners we used to write very large functions and then giant God classes. As we improve our skills, our classes become smaller but more numerous: we get a swarm of small, focuses classes that collaborate to create the final system. In other words we shift the complexity from the relations between the components of a class to the relations between classes. Indeed most Design Patterns just describe how different classes should be related.

We shift the complexity from the relations between the

components of a class to the relations between classes

We have different kinds of relationships between classes: aggregation and composition, unidirectional and bidirectional associations. We have constraints on associations and we could have attributes associated to these relations. However we do not have language support for relations.

This is very different in meta-modelling languages or frameworks, such EMF, which instead permit to describe relations. When building complex models, for example Abstract Syntax Trees, a good representation of relations helps a lot.

So I am considering adding support for relations in Turin (my very own programming language).

Example: an AST

Consider an abstract syntax tree. The basic relation is a composition relation, so each node can contain zero or more children nodes, while all nodes have one single parent node (or none, for the root of the tree).

Suppose we can represent this relation like this:

Ideally we would like to use it like this:

So we would expect a fully bidirectional relation: when we update one endpoint, the other one is automatically aware of the change:

  • if we set the parent of A to be B, the children of B will list A
  • if we add A to the children of B, the parent of A will be B

And each node can have one parent so:

No book-keeping is needed: all the parties involved have always the most updated picture. Normally in Java you would have a field parent and a list of children of each single node and changing the parent of a node would mean:

  1. alerting the old parent so that it can remove the node from its children
  2. alerting new new parent so that it can add the node to its children
  3. update the parent field in the node

Boring and error prone. In Turin instead it just works out of the box. Multiple endpoints (such as children) are seen as auto-updating lists while single endpoints (such as parent) are instead seen as a sort of Optional (with the possibility to set the value, not just verifying if it is present and read it).

Relationships and subsets

That is great, however we would like to have something more. We would like to have different specializations of those relationships. For example:

  • a Method (which is a Node) can have multiple FormalArguments and one return type (TypeUsage)
  • a FormalArgument (which is a Node) has one type (TypeUsage)

Now, we would like to add a FormalArgument instance as the param of a method and see it both as part of the “params” and of the “children” of that Node.

In other words “params” would represent a subset of the “children” of Method. Also “returnType” would be a subset (with exactly one element) of the “children” of method.

Note that I want to be able to:

  • add a node into a subset (e.g., params). I should see that node both among the params and the children
  • add a node directly among the children: I should see the node as part of the children but not as part of any subset

Status

This is the syntax I am working on and most of the implementation for relationships is ready. Time to battle test it!

I am still working on relationships subsets but many tests are already passing and I should be ready to commit it soon.

For now I plan to support only relations with two endpoints which can be either single or multiple. In the future relations could hold values or have additional constraints, like we have for fields. I am still exploring the idea and experimenting how it works in practice, after all there are not many programming languages supporting relationships out there 🙂

A few references

No, I am not the first one to think about relations as first-class citizens in object-oriented programming languages:

A few useful references to understand more on the theoretical level:

I have to study better myself these papers, and then be sure to deliver a good compromise to make easier to think about some classes of problems which involves relationships and associations. After all, this is the goal of Turin: be a tool which support thinking about problems and solutions.

Turin Programming Language for the JVM: building advanced lexers with ANTLR

As I wrote in my last post, I recently started working on a new programming language named Turin. A working compiler for an initial version of the languag is available on GitHub. I am currently improving the language and working on a Maven and an IntelliJ plugins. Here and in the next posts I will go over the different components of the compiler and related tools.

Structure of the compiler

The compiler needs to do several things:

  1. Get the source code and generate an abstract syntax tree (AST)
  2. Translate the AST through different stages to simplify processing. We basically want to move from a representation very close to the syntax to a representation easier to process. For example we could “desugarize” the language, representing several (apparently) different constructs as variants of the same construct. An example? The Java compiler translates string concatenations into calls to StringBuffer.append
  3. Perform semantic checks. For example we want to check if all the expressions are using acceptable types (we do not want to sum characters, right?)
  4. Generate bytecode

The first step requires building two components: a lexer and a parser. The lexer operates on the text and produces a sequences of tokens, while the parser composes tokens into constructs (a type declaration, a statement, an expression, etc.) creating the AST. For writing the lexer and the parser I have used ANTLR.

In the rest of this post we look into the lexer. The parser and the other components of the compiler will be treated in future posts.

Why using ANTLR?

ANTLR is a very mature tool for writing lexer and parsers. It can generate code for several languages and has decent performance. It is well mantained and I was sure it had all the features I could possible need to handle all the corner cases I could meet. In addition to that, ANTLR 4 makes possible to write simple grammars because it solves left recursive definition for you. So you do not have to write many intermediate node types for specifying precedence rules for your expressions. More on this when we will look into the parser.

ANTLR is used by Xtext (which I have used a lot) and I have ANTLR while building a framework for Model-driven development for the .NET platform (a sort of EMF for .NET). So I know and trust ANTLR and I have no reason for looking into alternatives.

The current lexer grammar

This is the current version of the lexer grammar.

A few choices I have done:

  • there are two different types of ID: VALUE_ID and TYPE_ID. This permits to have less ambiguity in the grammar because values and types can be easily distinguished. In Java instead when (foo) is encountered we do not know if it is an expression (a reference to the value represented by foo between parenthesis) or a cast to the type foo. We need to look at what follows to understand it. In my opinion this is rather stupid because in practice everyone is using capitalized identifiers for types only, but because this is not enforced by the language the compiler cannot take advantage from it
  • newlines are relevant in Turin, so we have tokens for them we basically want to have statements terminated by newlines but we accept optional newlines after commas
  • whitespaces (but newlines) and comments are captured in their own channels, so that we can ignore them in the parser grammar but we can retrieve them when needed. For example we need them for syntax highlighting and in general for the IntelliJ plugin because it requires to define tokens for each single character in the source file, without gaps
  • the most tricky part is parsing string interpolations à la Ruby such as “my name is #{user.name}”. We use modes: when we encounter a string start (“) we switch to lexer mode IN_STRING. While in mode IN_STRING if we encounter the start of an interpolated value (#{) we move to lexer mode IN_INTERPOLATION. While in mode IN_INTERPOLATION we need to accept most of tokens used in expressions (and that sadly means a lot of duplication in our lexer grammar).
  • I had to collapse the relational operators in one single token type, so that the number of states of the generated lexer is not too big. It means that I will have to look into the text of RELOP tokens to figure out which operation need to be executed. Nothing too awful but you have to know how to fix these kinds of issues.

Testing the lexer

I wrote a bunch of tests specific for the lexer. In particular I tested the most involved part: the one regarding string interpolation.

An example of a few tests:

As you can see I just test the token on a string and verify it produces  the correct list of tokens. Easy and straight to the point.

Conclusions

My experience with ANTLR for this language has not been perfect: there are issues and limitations. Having to collapse several operators in a single token type is not nice. Having to repeat several token definitions for different lexer modes is bad. However ANTLR proved to be a tool usable in practice: it does all that it needs to do and for each problem there is an acceptable solution. The solution is maybe not ideal, maybe not elegant as desired but there is one. So I can use it and move on on more interesting parts of the compiler.

Create a programming language for the JVM: getting started to welcome the Turin programming language

Because you know, in the end everyone wants to create his own programming language.

I have been interested in parsers and languages for a while. I worked with Xtext, Jetbrains MPS, created a few DSLs etc. etc. I also wrote my PhD thesis on this topic but so far I did not get started creating a complete programming language from scratch.

Why creating a new language?

Well, there are many wrong answers to this question. Mine is divided in two parts: first, I think it can be a great experience to learn a few things more, and second, a programming language is for me a tool to look at reality, describe it and reason about it to understand it better. Laugh if you wish, but for me it is mainly a tool for thinking. Often it is a clunky tool because I get distracted by technical workarounds or details which are hardly relevent or important for what I am trying to do. For example I am getting tired of clicking here and there to generate equals and hashCode methods (yes, I know about Lombok and some other tricks).

Ok kid… but to make a language usable you need a lot of stuff

I think now the barriers to create a programming language and make it usable by sane persons are significantly lower than they used to be. I think that a language needs a great syntax & semantics, sure, but also a large amount of libraries and decent tool support.

To get libraries just make it run on the JVM. Really. If it runs on the JVM you can reuse a gazzillion of libraries. Your programming language get batteries included. Consider also deployment: I am fascinated by many languages but every time I try to deploy something outside the JVM I end up regretting it. Ask my co-maintainer of WorldEngine (a python program) how much fun is to support libraries on Linux, Mac, Windows, across Python versions. A lot of fun.

Of course you need also tool support: it means to me mainly a nice editor and good integration with the platform of reference. I have an advantage here because I have experience developing IDE plugins and I just ended writing a type solver for Java. In addition to that I contribute to JavaParser. So I know a thing or two about tools that I could use down the road for the integration with Java.

In practice I plan to:

  1. Create a plugin for IntelliJ which is aware of references to file in my language and to Java files. The nice thing is that basically I wrote the libraries for doing that already.
  2. Create a Maven plugin because Maven is awful, but life without Maven is even more awful
  3. Write a parser with ANTLR (that’s the easy part)
  4. Write a bytecode generator with ASM (what a great library!)

Ok tell me more about this language

I just started working on it but… I have a basic compiler working and supporting a few constructs.

The language is named Turin like my hometown and it is available on GitHub: https://github.com/ftomassetti/turin-programming-language

It is a static language with type inference and this is an example of a piece of valid code that can be already compiled and run:

So I have a plan, I have something working already and I am having a lot of fun.
Could be idiotic, perhaps, to write yet another language but damn… it is so much fun!