Strategies, tips and tutorials on designing programming languages

68 Resources To Help You To Create Programming Languages

Here it is a new guide, to collect and organize all the knowledge that you need to create your programming language from scratch.

Creating a programming language is one of the most fascinating challenge you can dream of as a developer.

The problem is that there are a lot of moving parts, a lot of things to do right and it is difficult to find a well detailed map, to show you the way. Sure, you can find a tutorial on writing half a parser there, an half-baked list of advices on language design, an example of a naive interpreter. To find those things you will need to spend hours navigating forums and following links.

We thought it was the case to save you some time by collecting relevant resources, evaluate them and organize them. So you can spend time using good resources, not looking for them.

We organized the resources around the three stages in the creation of a programming language: design, parsing, and execution.

Download the guide with 68 resources on Creating Programming Languages

68resources

Receive the guide to your inbox to read it on all your devices when you have time

Powered by ConvertKit

Section 1:

Designing the Language

When creating a programming language you need to take ideas and transform them in decisions. This is what you do during the design phase.

Before You Start…

Some good resources to beef up your culture on language design.

Articles

Books

  • Design Concepts in Programming Languages, if you want to make deliberate choices in the creation of your programming language, this is the book you need. Otherwise, if you don’t already have the necessary theoretical background, you risk doing things the way everybody else does them. It’s also useful to develop a general framework to understand how the different programming languages behave and why.
  • Practical Foundations for Programming Languages, this is, for the most part, a book about studying and classifying programming languages. But by understanding the different options available it can also be used to guide the implementation of your programming language.
  • Programming Language Pragmatics, 4th Edition, this is the most comprehensive book to understand contemporary programming languages. It discusses different aspects, of everything from C# to OCaml, and even the different kinds of programming languages such as functional and logical ones. It also covers the several steps and parts of the implementation, such as an intermediate language, linking, virtual machines, etc.
  • Structure and Interpretation of Computer Programs, Second Edition, an introduction to computer science for people that already have a degree in it. A book widely praised by programmers, including Paul Graham directly on the Amazon Page, that helps you developing a new way to think about programming language. It’s quite abstract and examples are proposed in Scheme. It also covers many different aspect of programming languages including advanced topics like garbage collection.

Type Systems

Long discussions and infinite disputes are fought around type systems. Whatever choices you end up making it make sense to know the different positions.

Articles

  • These are two good introductory articles on the subject of type systems. The first discuss the dichotomy Static/Dynamic and the second one dive into Introspection.
  • What To Know Before Debating Type Systems, if you already know the basics of type systems this article is for you. It will permit you to understand them better by going into definitions and details.
  • Type Systems (PDF), a paper on the formalization of type systems that also introduces more precise definitions of the different type systems.

Books

  • Types and Programming Languages, a comprehensive book on understanding type systems. It will impact your ability to design programming languages and compilers. It has a strong theoretical support, but it also explains the practical importance of individual concepts.
  • Functional programming and type systems, an interesting university course on type systems for functional programming. It is used in a well known French university. There are also notes and presentation material available. It is as advanced as you would expect.
  • Type Systems for Programming Language, is a simpler course on type system for (functional) programming languages.

Section 2:

Parsing

Parsing transforms the concrete syntax in a form that is more easily manageable by computers. This usually means transforming text written by humans in a more useful representation of the source code, an Abstract Syntax Tree.

An abstract syntax tree for the Euclidean algorithm

There are usually two components in parsing: a lexical analyzer and the proper parser. Lexers, which are also known as tokenizers or scanners, transform the individual characters in tokens, the atom of meaning. Parsers instead organize the tokens in the proper Abstract Syntax Tree for the program. But since they are usually meant to work together you may use a single tool that does both the tasks.

Tools

  • Flex, as a lexer generator and (Berkeley) Yacc or Bison, for the generation of the proper parser, are the venerable choices to generate a complete parser. They are a few decades old and they are still maintained as open source software. They are written in and thought for C/C++. They still works, but they have limitations in features and support for other languages.
  • ANTLR, is both a lexer and a parser generator. It’s also more actively developed as open source software. It is written in Java, but it can generate both the lexer and the parser in languages as varied as C#, C++, Python, Javascript, Go, etc.
  • Your own lexer and parser. If you need the best performance and you can create your own parser. You just need to have the necessary computer science knowledge.

Tutorials

  • Flex and Bison tutorial, a good introduction to the two tools with bonus tips.
  • Lex and Yacc Tutorial, at 40 pages this is the ideal starting point to learn how to put together lex and yacc in a few hours.
  • Video Tutorial on lex/yacc in two parts, in an hour of video you can learn the basics of using lex and yacc.
  • ANTLR Mega Tutorial, the renown and beloved tutorial that explains everything you need to know about ANTLR, with bonus tips and tricks and even resources to know more.

Books

  • lex & yacc, despite being a book written in 1992 it’s still the most recommended book on the subject. Some people say because the lack of competition, others because it is good enough.
  • flex & bison: Text Processing Tools, the best book on the subject written in this millennium.
  • The Definitive ANTLR 4 Reference, written by the main author of the tool this is really the definitive book on ANTLR 4. It explains all of its secrets and it’s also a good introduction about how the whole parsing thing works.
  • Parsing Techniques, 2nd edition, a comprehensive, advanced and costly book to know more than you possibly need about parsing.

Section 3:

Execution

To implement your programming language, that is to say to actually making something happens, you can build one of two things: a compiler or an interpreter. You could also build both of them if you want. Here you can find a good overview if you need it: Compiled and Interpreted Languages.

The resources here are dedicated to explaining how compilers and/or interpreters are built, but for practical reasons often they also explain the basics of creating lexers and parsers.

Compilers

A compiler transforms the original code into something else, usually machine code, but it could also be simply any lower level language, such as C. In the latter case some people prefer to use the term transpiler.

Tools

  • LLVM, a collection of modular and reusable compiler and toolchain technologies used to create compilers.
  • CLR, is the virtual machine part of the .NET technologies, that permits to execute different languages transformed in a common intermediate language.
  • JVM, the Java Virtual Machine that powers the Java execution.
  • Babel, a JavaScript compiler. It has a very modern structure, based upon plugins, which means that it can be easily adapted, but it doesn’t do anything by default, really, that is what the documentation says. Its creators present it as a tool to support new featuers of JavaScript in old environment, but it can do much more. For instance, you can use to create JavaScript-based languages, such as LightScript.

Articles & Tutorials

  • Building Domain Specific Languages on the CLR, an article on how to build internal DSL on the CLR. It’s slightly outdated, since it’s from 2008, but it’s still a good presentation on the subject.
  • The digital issue of MSDN Magazine for February 2008 (CHM format), contains an article on how to Create a Language Compiler for the .NET Framework. It’s still a competent overview of the whole process.
  • Create a working compiler with the LLVM framework, Part 1 and Part 2, a two-part series of articles on creating a custom compiler by IBM, from 2012 and thus slightly outdated.
  • A few series of tutorials froms the LLVM Documentation, this is three great linked series of tutorial on how to implement a language, called Kaleidoscope, with LLVM. The only problem is that some parts are not always up-to-date.
  • My First LLVM Compiler, a short and gentle introduction to the topic of building a compiler with LLVM.
  • Creating an LLVM Backend for the Cpu0 Architecture, a whopping 600-pages tutorial to learn how to create a LLVM backend, also available in PDF or ePub. The content is great, but the English is lacking. On the positive side, if you are a student, they feel your pain of transforming theoretical knowledge into practical applications, and the book was made for you.
  • A Nanopass Framework for Compiler Education, a paper that present a framework to teach the creation of a compiler in a simpler way, transforming the traditional monolithic approach in a long series of simple transformations. It’s an interesting read if you already have some theoretical background in computer science.
  • An Incremental Approach to Compiler Construction (PDF), a paper that it’s also a tutorial that develops a basic Scheme compiler with an easier to learn approach.
  • The Super Tiny Compiler!: “This is an ultra-simplified example of all the major pieces of a modern compiler written in easy to read JavaScript” (and using Babel). There is a short introduction, but this is more of a walkthrough of the code than a proper tutorial. This may be a good or a bad thing, depending on your preference.

Books

  • Compilers: Principles, Techniques, and Tools, 2nd Edition, this is the widely known Dragon book (because of the cover) in the 2nd edition (purple dragon). There is a paperback edition, which probably costs less but it has no dragon on it, so you cannot buy that. It is a theoretical book, so don’t expect the techniques to actually include a lot of reusable code.
  • Modern Compiler Implementation in ML, this is known as a the Tiger book and a competitor of the Dragon book. It is a book that teaches the structure and the elements of a compiler in detail. It’s a theoretical book, although it explains the concept with code. There are other versions of the same book written using Java and C, but it’s widely agreed that the ML one is the best.
  • Engineering a Compiler, 2nd edition, it is another compiler book with a theoretical approach, but that it covers a more modern approach and it is more readable. It’s also more dedicated to the optimization of the compiler. So if you need a theoretical foundation and an engineering approach this is the best book to get.

Interpreters

An interpreter directly executes the language without transforming it in another form.

Articles & Tutorials

Books

  • Writing An Interpreter In Go, despite the title it actually shows everything from parsing to creating an interpreter. It’s contemporary book both in the sense that is recent (a few months old), and it is a short one with a learn-by-doing attitude full of code, testing and without 3-rd party libraries. We have interviewed the author, Thorsten Ball.
  • Crafting Interpreters, a work-in-progress and free book that already has good reviews. It is focused on making interpreters that works well, and in fact it will builds two of them. It plan to have just the right amount of theory to be able to fit in at a party of programming language creators.

Section 4:

General

This are resources that cover a wide range of the process of creating a programming language. They may be comprehensive or just give the general overview.

Tools

In this section we include tools that cover the whole spectrum of building a programming language and that are usually used as standalone tools.

  • Xtext, is a framework part of several related technologies to develop programming languages and especially Domain Specific Languages. It allows you to build everything from the parser, to the editor, to validation rules. You can use it to build great IDE support for your language. It simplifies the whole language building process by reusing and linking existing technologies under the hood, such as the ANTLR parser generator.
  • JetBrains MPS, is a projectional language workbench. Projectional means that the Abstract Syntax Tree is saved on disk and a projection is presented to the user. The projection could be text-like, or be a table or diagram or anything else you can imagine. One side effect of this is that you will not need to do any parsing, because it is not necessary. The term Language Workbench indicates that Jetbrains MPS is a whole system of technologies created to help you create your own programming language: everything from the language itself to IDE and supporting tools designed for your language. You can use it to build every kind of language, but the possibility and need to create everything makes it ideal to create Domain Specific Languages that are used for specific purposes, by specific audiences.
  • Racket, is described by its authors as “a general-purpose programming language as well as the world’s first ecosystem for developing and deploying new languages”. It’s a pedagogical tool developed with practical ambitions that has even a manifesto. It is a language made to create other languages that has everything: from libraries to develop GUI applications to an IDE and the tools to develop logic languages. It’s part of the Lisp family of languages, and this tells everything you need to know: it’s all or nothing and always the Lisp-way.

Articles

Tutorials

Books

  • How to create pragmatic, lightweight languages, the focus here is on making a language that works in practice. It explains how to generate bytecode, target the LLVM, build an editor for your language. Once you read the book you should know everything you need to make a usable, productive language. Incidentally, we have written this book.
  • How To Create Your Own Freaking Awesome Programming Language, it’s a 100-page PDF and a screencast that teach how to create a programming language using Ruby or the JVM. If you like the quick-and-dirty approach this book will get you started in little time.
  • Writing Compilers and Interpreters: A Software Engineering Approach, 3rd edition, it’s a pragmatic book that still teaches the proper approach to compilers/interpreters. Only that instead of an academic focus, it has an engineering one. This means that it’s full of Java code and there is also UML sprinkled here and there. Both the techniques and the code are slightly outdated, but this is still the best book if you are a software engineer and you need to actually do something that works correctly right now, that is to say in a few months after the proper review process has completed.
  • Language Implementation Patterns, this is a book from the author of ANTLR, which is also a computer science professor. So it’s a book with a mix of theory and practice, that guides you from start to finish, from parsing to compilers and interpreters. As the name implies, it focuses on explaining the known working patterns that are used in building this kind of software, more than directly explaining all the theory followed by a practical application. It’s the book to get if you need something that really works right now. It’s even recommended by Guido van Rossum, the designer of Python.
  • Build Your Own Lisp, it’s a very peculiar book meant to teach you how to use the C language and how to build you own programming language, using a mini-Lisp as the main example. You can read it for free online or buy it. It’s meant you to teach about C, but you have to be already familiar with programming. There is even a picture of Mike Tyson (because… lisp): it’s all so weird, but fascinating.
  • Beautiful Racket: how to make your own programming languages with Racket, it’s a good and continually updated online book on how to use Racket to build a programming language. The book is composed of a series of tutorials and parts of explanation and reference. It’s the kind of book that is technically free, but you should pay for it if you use it.
  • Programming Languages: Application and Interpretation, an interesting book that explains how to create a programming language from scratch using Racket. The author is a teacher, but of the good and understandable kind. In fact, there is also a series of recordings of the companion lectures, that sometimes have questionable audio. There is an updated version of the book and of the recordings, but the new book has a different focus, because it want also to teach about programming language. It also doesn’t uses Racket. If you don’t know any programming at all you may want to read the new version, if you are an experienced one you may prefer the old one.
  • Implementing Domain-Specific Languages with Xtext and Xtend, 2nd edition, is a great book for people that want to learn with examples and using a test-driven approach. It covers all levels of designing a DSL, from the design of the type system, to parsing and building a compiler.
  • Implementing Programming Languages, is an introduction to building compilers and interpreters with the JVM as the main target. There are related materials (presentations, source code, etc.) in a dedicated webpage. It has a good balance of theory and practice, but it’s explicitly meant as a textbook. So don’t expect much reusable code. It’s the typical textbook also in the sense that it can be a great and productive read if you already have the necessary background (or a teacher), otherwise you risk ending up confused.
  • Implementing functional languages: a tutorial, a free book that explains how to create a simple functional programming language from the parsing to the interpreter and compiler. On the other hand: “this book gives a practical approach to understanding implementations of non-strict functional languages using lazy graph reduction”. Also, expect a lot of math.
  • DSL Engineering, a great book that explains the theory and practice of building DSLs using language workbenches, such as MPS and Xtext. This means that other than traditional design aspects, such as parsing and interpreters, it covers things like how to create an IDE or how to test your DSL. It’s especially useful to software engineers, because it also discusses software engineering and business related aspects of DSLs. That is to say it talks about why a company should build a DSL.
  • Lisp in Small Pieces, an interesting book that explain in details how to design and implement a language of the Lisp family. It describes “11 interpreters and 2 compilers” and many advanced implementation details such as the optimization of the compiler. It’s obviously most useful to people interested in creating a Lisp-related language, but it can be an interesting reading for everybody.

Section 5:

Summary

Here you have the most complete collection of high-quality resources on creating programming languages. You have just to decide what you are going to read first.

At this point we have two advices for you:

  1. Get started. It does not matter how many amazing resources we will send you, if you do not take the time to practice, trying and learning from your mistake you will never create a programming language
  2. If you are interested in building programming languages you should subscribe to our newsletter. You will receive updates on new articles, more resources, ideas, advices and ultimately become part of a community that share your interests on building languages

You should have all you need to get started. If you have questions, advices or ideas to share feel free to write at federico@tomassetti.me. We read and answer every email.

Our thanks to Krishna and the people on Hacker News for a few good suggestions.

Building a language: tool support

When building a language there is one aspect which is absolutely crucial: that is tool support.
Tool support will determine if your language is usable at all, it will influence the reception from user and the usability.
In this video I explain why and how you can build great tool support with a limited effort.

Tool support is what makes a real language. This is also one of the reasons why I am against the whole concept of internal DSLs: they are nice but they are not really languages. They cannot be compared in any way to external DSLs (a.k.a. real DSLs).

And it is nice that experts agree on this:

Building a compiler for your own language: validation

validation_for_your_own_programming_language

So you have parsed your code and built a clean AST for it. Now it is time to check if what the user has expressed make sense at all. We should perform validation, identifying semantical errors, to communicate together with lexical and syntactical errors (provided by the parser).


This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified into this book titled
How to create pragmatic, lightweight languages

Code

Code is available on GitHub under the tag 07_validation

Implement semantic checks

In the previous post we have seen how to implement the function process to execute an action on all the nodes of our AST. A typical case is that we want to execute certain operation only on certain nodes. We want still to use process to navigate the tree. We can do that by creating this function named specificProcess.

Let’s see how to use specificProcess to:

  • find all the VariableDeclarations and check they are not re-declaring a variable already declared
  • find all the VarReferences and verify they are not referring to a variable that has been not declared or has been declared after the VarReference
  • perform the same check done on VarReferences also for Assignments

Ok, so invoking validate on the root of the AST will return all possible semantic errors.

Getting all errors: lexical, syntactic, and semantic

We first need to invoke the ANTLR parser and get:

  • the parse tree
  • the list of lexical and syntactic error

Then we map the parse tree to the AST and perform the semantic validation. Finally we return the AST and all the errors combined.

In the rest of the system we will simply call the SandyParserFacade without the need to invoke the ANTLR parser directly.

Test validation

Will it fly? Let’s verify that.

Conclusions

This is all nice and well: with a simple call we can get a list of all errors we have. For each of them we have a description and the position. This is enough for our compiler but we would now need to show these errors in the editor. We will do that in out of the future posts.

 

Getting started with ANTLR: building a simple expression language


This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

  1. Building a lexer
  2. Building a parser
  3. Creating an editor with syntax highlighting
  4. Build an editor with autocompletion
  5. Mapping the parse tree to the abstract syntax tree
  6. Model to model transformations
  7. Validation
  8. Generating bytecode

After writing this series of posts I refined my method, expanded it, and clarified into this book titled
How to create pragmatic, lightweight languages

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors and more. It seems to me that there is a lot of material on building simple parsers but very few material on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will permit to define variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the usage of parenthesis

Examples of a valid file:

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • use Gradle as our build system
  • write the code in Kotlin. It will be very basic Kotlin, given I just started learning it.

Setup the project

Our build.gradle file will look like this

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.

Testing the lexer

Testing is always important but while building languages it is absolutely critical: if the tools supporting your language are not correct this could affect all possible programs you will build for them. So let’s start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we aspect.

Conclusions and next steps

We started with the first small step: we setup the project and built the lexer.

There is a long way in front of us before making the language usable in practice but we started. We will next work on the parser with the same approach: building something simple that we can test and compile through the command line.

What if sketching on a whiteboard was a form of programming?

There are different possible notations for languages: textual and graphical are the two larger families.

We can define graphical notations for formal languages. For example we can define a graphical notation to define state machines.

A Domain Specific Language for state machines can be defined using JetBrains MPS. However the user would have to edit the state machine inside MPS.

I wonder if we could instead specify a state machine just by sketching something on a whiteboard. Consider this example:

sm1

We have a start state (the white circle on top) which leads us to the state A without needing any event. Then from state A we can go to state B when an event of type b occurs. Or we can go to state C if an event of type a occurs. From C we need a state of type c to move to D.

Note that I did not represent the direction of transitions, we could assume that they always lead from the higher state to the lower.

I think that at this point we should just make a picture to this sketch and obtain a program which would permit to run this state machine.

To do that we should recognize the different shape, the text present and assign everything to a corresponding concept in our DSL.

The first step would be to recognize rectangles. I guess there are already libraries to do that. I tried OpenCV but I really do not like the way C++ dependencies are handled. I started playing with BoofCV instead, which is written in Java. I first used functions to recognize segments and I got something like this:

Screenshot from 2016-03-28 19-12-53

Then I wrote some functions to merge segments which have endpoints very close and similar slope, to reduce the number of segments. That did not work particularly well.

Screenshot from 2016-03-28 19-13-03

I realized I should probably use some pre-processing: first a Gaussian blur and then calculate the derivatives over the X and Y axis. I got something like this. The colors depends on the values present in the derivates X and Y images. I should use those values to recognize horizontal and vertical lines with common endpoints and so get rectangles.

Screenshot from 2016-03-29 18-49-46

This is not as easy as I hoped but I would love to spend more time on it in the future.

Conclusions

Still, people will need some ability to formalize their thoughts but we can definitely remove some of the incidental barriers due to languages, notations and tools.

Raising the level of abstraction: what if we tried to do that bottom up?

It is clear that software development is going in the direction of raising the level of abstraction since it was born.

The reason is simple: we can juggle with so many concepts in our mind. If we are busy thinking about what we stored in the EAX register it is difficult to think about the bigger picture. As we raise the level of abstraction we start using bigger bricks and we can see farther.  Once we think in terms of objects the OOP design patterns can emerge, because they build on combinations of a few concepts at our current level of abstraction.

How are we moving to higher levels of abstractions? By hiding the details. We hide how we use registers, how we allocate memory, how pointers are handled, how polymorphism is implemented.

There are several way we can achieve that: normally we build new languages from which we generate or compile to some lower level abstraction. For example many C++ compilers at the beginning were just spitting out some C. Also compilers for Eiffel were using the same strategy. One thing that instead was not tried was to evolve languages adding new concepts as they emerge. In this case I am thinking about a gradual process which would target micro-patterns. Things like getters or setters in Java. Or things like this one:

What we are doing here? we are just returning self.data[code]. If it was not initialized we would set it to a default value ({}). Ok, lazy initialization: this is a concept common and comprehensible per se. What if we could add this concept and see if it is then used to build up new patterns?

This would be trivial to do using a tool like Jetbrains MPS.

I plan to play a little bit with this idea in next week and see where it leads me.

Getting started with Jetbrains MPS: how to define a simple language (screencast)

I am blessed (or cursed) with having many different interests in software development. However my deepest love is with language engineering. There are different tools and techniques but Jetbrains MPS is particularly fascinating.

To help people getting started or just get the feeling of how it is to develop with Jetbrains MPS I recorded this short screencast.

 

 

If you are not familiar with this tool here there are the basic ideas.

Jetbrains MPS is a projectional editor: that means that it internally stores the information contained in the different nodes.

If you create a model, then change how certain nodes should look, and you reopen the model it will be automatically using the new “look & feel” that you just. This is simply because what is stored is the pure information, while the appearance is recreated each time you open a model. In other words when you open a model it is projected in a form that you defined and that should be easy to understand for a human being.

This is very different from what happens with text language: suppose you create your language and then decide to replace braces with brackets. You have now to manually go over all your existing source files and modify them, because in that case what is saved is not the pure information, the abstract syntax tree, but its representation: the textual form.

In Jetbrains MPS you work on the different concerns of your language separately: you first define the single concept and the information they contain, then you work on the other aspects. There are several like typesystem, generation, dataflow, etc. In this short screencast we just look at the concept definition and the editor aspect.

One of the advantages of a projectional editor is that you can change the appearance of your nodes without invalidating your existing models. That is great to incrementally improve the appearance of the single nodes and the way they can be edited. In the screencast we start with the default representation of the concept we defined, as provided by default by Jetbrains MPS. Then we refine the editors in a couple of steps until we have a representation which is good enough.

At the end of the screencast we have a language we can use to define very simple todo lists. The possibility to improve it are endless:

  • we could add more concepts: like priority, deadlines and tags for each single task
  • we could add the possibility to mark a task as done and add a nice shortcut to do so
  • we could integrate the language with some other tool: what if we could synchronize our models with Trello or with workflowy?
  • we could support subtasks

Question: have you ever thought of creating your own language, either a general purpose language or a domain specific language? If so did you actually get started?

 

Dynamic, static, optional, structural typing and engineering challenges

How an engineer should NOT look at technical matters

Dynamic versus static typing is one of those confrontations that seams to attracts zealots. It really sadden me to see how people tend to defend vehemently their side simply because they are not able to understand the other position. In my opinion if you do not get why both static and dynamic typing are great in certain situations you are an incomplete software developer. You have chosen to look at just a portion of the picture.

An engineer should be open-minded: it is ok to have your opinions but when you keep bashing ideas that made millions of developers very productive, chances are that you are missing something. Try being humble, you could end up learning something.

Ok, so what?

I had very good experiences using Java or Haskell (from the static side) or Python and Ruby (from the dynamic field). There is no way you can obtain the succintness of Ruby code, the amazing possibilities given by its metaprogramming features with statically typed languages. On the other end I really end up missing static typing when I have to refactor large chunks of code. Tests help (when they are there) but it is often not enough.

Now, there are approaches to try getting the best of whole worlds. One solution is to combine languages for different parts of the system: perhaps you could write the infrastructure of your platform using Java and the business logic using JRuby. Why not?

However that means that you have still some challenges at the language boundaries. You need to master several tools and learn the best practices for the different languages and so on. Therefore it would be interesting to get the advantages of both worlds in one single language.

A compromise?

A very partial solution is to use static typing with type inference: it does not give you the same power you get with dynamically types languages but you get some of the succintness. Not a giant leap but it helps closing the gap. Types are still calculated at compile time, and that means that you could have to streamline your solution, so that also the compiler can understand it. However you would not have to type yourself all the types. Also, when refactoring you have to change types in less places.

Type inference sounds as a great idea but it has its drawbacks: try writing some Haskell code, pretty soon you will start being puzzled about the type derived from the compiler for a value. In practice I need to add some annotations to make some of the types explicit otherwise I get lost pretty soon.

Another approach to combine static and dynamic typing is to start from a dynamic language and try to make it more static-like. There are two options for that: optional typing and structural typing.

Basically optional typing means that you can add some type annotations in certain parts of your dynamic language, so that a part of it can be statically checked, while the rest will keep being dynamic. This can be helpful both to catch errors and to improve performances because it enable the compiler to optimize some calls. Look at Typed Clojure and PEP -0484: Type hints for python.

Structural typing means to have duck-types: your type is the list of the operations that it supports. Basically you can look at the code of your method and look which operations perform on the parameters. Does it invoke the method foo on the parameter bar? Cool, that means that when this method is invoked I should ensure that bar has some type that has the method foo. It does not matter the name of the class, what it matters is its structure (i.e., the fields and methods it has). In principle a good idea but imho it can get confusing quite soon.

It is supported in Scala, where you can write this:

However it seems verbose, doesn’t it? With structural type inference perhaps we could avoid the verbose type declaration and have some safety at compilation time. At this time I am not aware of any language that get this exactly right. Any suggestion?

Can’t I just have Ruby with a super smart compiler catching my stupid errors?

No, so far you can’t.

People tried to do that: think about the Crystal language, which aims to be a statically typed version of Ruby or Mirah which tried to be a statically typed version of JRuby. It is tempting to think that you can have a nice language as Ruby, just add a little bit of static-typing and get the best of both worlds. The problem is that it does not just work in that way, like the authors of Crystal figured out. Eventually that particular feature ends up having an impact on other aspects of the language and force you to rethink it globally.

I think this is often the case in engineering project: a customer think you can just squeeze another feature in and he does not get the impact on other aspects such security or architectural integrity. We as engineers have to juggle many balls and take decisions considering all of them. Adding one other ball could force us to rethink our strategy or cause all the balls to fall on the ground.

Alternatives to global variables and passing the same value over a long chain of calls

The old, good global variables are the wrong solution to a real problem: you want some information to be available in a large context. In Object-Oriented languages like Java we often do not have something we call a global variable. We instead lie to ourself using public static variables or singletons and denying they are exactly the same thing. No, we are too much object-oriented and too good developers to be using global variables.

Passing a value to all single method call: much better?

Ok, so you repent and decide to give up global variables. What do you do? You pass a value explicitely. Cool. And then you pass it to another method and to another and the same value is propogate over and over until it is available in many places of your application. Much better, eh? You have polluted the signatures of far too many methods and should you ever decide to change the type of the value passed, that would require a lot of work.

fire-bucket-brigade

You have a bunch of methods just passing around the information. Many of them do absolutely nothing with that value, but still need to receive it to be able to pass it along the chain.

Consider a concrete example: I was working on tool for analyzing Java code and some objects required an instance of a TypeResolver: an object that given a name (such as java.lang.String or my.shiny.Class) would provide for them representations of types. Good.

Let’s call the classes using directly a TypeResolver typeResolverUsers. Now, whoever was using the typeResolverUsers (let’s call them the usersOfTypeResolverUsers) needed a TypeResolver to pass it to the typeResolverUsers it was invoking. So each usersOfTypeResolverUsers asked for a TypeResolver at construction time and store it, to later pass it to classes which actually needed it. So now whoever was instantiating one of the usersOfTypeResolverUsers needed to have a TypeResolver  reference. And so on, until far too many classes were aware of the existence of the TypeResolver class.

That is not nice to refactor and it pollutes your methods. I solved this particular case using a Registry. It was not flawless and not so straightforward so I was wondering: is there a better solution?

Another example: Position and WorldSize

Let’s think about another example: we want to represent a Position on a world map (think about my beloved world generator: WorldEngine). Now the Position is of course defined by two values: x and y. However given it is a world map we want to be able to wrap left-right. So to perform certain operations we need to access the world size: if we know that the world is 500 cells wide, when we move from a Position with x=499 to the right we end up having x=0.

In this scenario we could:

  • use a global variable: it means that in our program there is one possible world size. It is stored in some place accessible from the whole application. That works if this assumption holds. If we instead are working on several worlds at the same time we need an alternative.
  • we could store the world size in the Position instance. That means that in every occasion in which we want to create a Position we need a reference to a world size. Given the application deals all the time with Position instances that means that many methods (perhaps most methods) will get a WorldSize instance and pass it down the chain.

I am not happy with neither of these alternatives. I would like to have something that is almost like a global variables but somehow having a limited and controlled scope. Or if you want, I would like to implicitly pass some contextual information without polluting the signature of all methods.

I think that a Position is really defined only by x and y. It needs to know the WorldSize to perform certain tasks, but this is a contextual information, not something that is controlled or owned by Position.

Contextual information: what the heck is that?

For this reason I am working on the concept of contextual information. For now I have implemented it like this:

  • we can declare that certain values could be context dependant: we would specify a name and a type for each of them
  • we could assign values for a certain context-dependant variable according to a dynamic scopeit means that we would make the value available not only in a certain block of statements but also to all the functions and methods invoked in that block, recursively
  • the value would be accessible only in a given thread (yes, ThreadLocal to the rescue)

Examples

This is the current syntax as implemented in my programming language, Turin:

Alternative solutions

There are many ways to make values available to your application without using explicitly global variables. One way it to use a Singleton, another is to declare static variables.

Another solution is the RegistryPattern:

A registry is a global association from keys to objects, allowing the objects to be reached from anywhere. It involves two methods: one that takes a key and an object and add objects to the registry and one that takes a key and returns the object for the key

Pros & Cons

With respect to a global variable we are able to limit the scope of a value, to a context which is not static but dynamic, anyway totally unrelated pieces of code could not mess up with the same global variables we are using.

The main issue with that solution is that we cannot statically determine if a value is available or not in a given context. However we return an Optional leaving the responsibility to the user to verify if the value is present or not.

With respect to the ContextObject pattern we offer type checking: we declared the type of a context value so we can verify that only compatible values are ever assigned to it.

 

Representing relationships as first-class citizens in an Object-oriented programming language

As beginners we used to write very large functions and then giant God classes. As we improve our skills, our classes become smaller but more numerous: we get a swarm of small, focuses classes that collaborate to create the final system. In other words we shift the complexity from the relations between the components of a class to the relations between classes. Indeed most Design Patterns just describe how different classes should be related.

We shift the complexity from the relations between the

components of a class to the relations between classes

We have different kinds of relationships between classes: aggregation and composition, unidirectional and bidirectional associations. We have constraints on associations and we could have attributes associated to these relations. However we do not have language support for relations.

This is very different in meta-modelling languages or frameworks, such EMF, which instead permit to describe relations. When building complex models, for example Abstract Syntax Trees, a good representation of relations helps a lot.

So I am considering adding support for relations in Turin (my very own programming language).

Example: an AST

Consider an abstract syntax tree. The basic relation is a composition relation, so each node can contain zero or more children nodes, while all nodes have one single parent node (or none, for the root of the tree).

Suppose we can represent this relation like this:

Ideally we would like to use it like this:

So we would expect a fully bidirectional relation: when we update one endpoint, the other one is automatically aware of the change:

  • if we set the parent of A to be B, the children of B will list A
  • if we add A to the children of B, the parent of A will be B

And each node can have one parent so:

No book-keeping is needed: all the parties involved have always the most updated picture. Normally in Java you would have a field parent and a list of children of each single node and changing the parent of a node would mean:

  1. alerting the old parent so that it can remove the node from its children
  2. alerting new new parent so that it can add the node to its children
  3. update the parent field in the node

Boring and error prone. In Turin instead it just works out of the box. Multiple endpoints (such as children) are seen as auto-updating lists while single endpoints (such as parent) are instead seen as a sort of Optional (with the possibility to set the value, not just verifying if it is present and read it).

Relationships and subsets

That is great, however we would like to have something more. We would like to have different specializations of those relationships. For example:

  • a Method (which is a Node) can have multiple FormalArguments and one return type (TypeUsage)
  • a FormalArgument (which is a Node) has one type (TypeUsage)

Now, we would like to add a FormalArgument instance as the param of a method and see it both as part of the “params” and of the “children” of that Node.

In other words “params” would represent a subset of the “children” of Method. Also “returnType” would be a subset (with exactly one element) of the “children” of method.

Note that I want to be able to:

  • add a node into a subset (e.g., params). I should see that node both among the params and the children
  • add a node directly among the children: I should see the node as part of the children but not as part of any subset

Status

This is the syntax I am working on and most of the implementation for relationships is ready. Time to battle test it!

I am still working on relationships subsets but many tests are already passing and I should be ready to commit it soon.

For now I plan to support only relations with two endpoints which can be either single or multiple. In the future relations could hold values or have additional constraints, like we have for fields. I am still exploring the idea and experimenting how it works in practice, after all there are not many programming languages supporting relationships out there 🙂

A few references

No, I am not the first one to think about relations as first-class citizens in object-oriented programming languages:

A few useful references to understand more on the theoretical level:

I have to study better myself these papers, and then be sure to deliver a good compromise to make easier to think about some classes of problems which involves relationships and associations. After all, this is the goal of Turin: be a tool which support thinking about problems and solutions.