Information and opinions about the Whole Platform

Recognize patterns in Java code to understand and transform applications

Code is arguably the most valuable asset for many organizations.

However, the value trapped in the code is not easy to use.

Why?

In this article we are going to see how we can extract and process knowledge in code by specifying patterns. We will see:

  1. A clear example of what we mean by extracting knowledge from code
  2. An explanation of how to implement this approach in practice (with code available on GitHub)
  3. A discussion of why we want to extract knowledge from code

What do you mean? Show it to me

Let’s see what we mean in practice, with the simplest example we can come up with. And let’s see it working for real, on real code.

Of course recognizing properties is a very simple example, something we are all familiar with, but the same idea can be applied to more complex patterns.
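To make the idea of a property concrete, here is a minimal illustrative sketch. The class `Customer` and its property `email` are hypothetical names chosen for this illustration, not code from the article's repository. It shows how the single idea "a customer has an email" ends up spread over several pieces of Java code.

```java
import java.util.Objects;

// Hypothetical bean: one idea ("a Customer has an email") is scattered over
// a field, a constructor parameter, a getter, a setter, equals and hashCode.
class Customer {
    private String email;

    Customer(String email) {
        this.email = email;
    }

    public String getEmail() {
        return email;
    }

    public void setEmail(String email) {
        this.email = email;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Customer)) return false;
        return Objects.equals(email, ((Customer) other).email);
    }

    @Override
    public int hashCode() {
        return Objects.hash(email);
    }
}
```

Recognizing the pattern means collapsing all of these pieces back into one statement: `Customer` has a property `email` of type `String`.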

There are many patterns that can be used:

  • patterns typical of a language: think about the for loops to iterate over collections
  • design patterns: singleton, observer, delegate to name just a few
  • patterns related to frameworks: think about all the applications based on MVC or the DAO defined to access database tables
  • project specific patterns: for example in JavaParser we use the same structure for the tens of AST classes we defined

Patterns can concern small pieces of a single method or the organization of entire applications.

Patterns can also be applied incrementally: we can start recognizing smaller patterns, making the application simpler and then recognize more complex patterns over the simplified application.

How knowledge ends up being trapped in code

Developers use a lot of idiomatic small patterns when writing programs. Experienced developers apply them automatically while implementing their solutions. Slowly they build these large applications that contain a lot of knowledge, enshrined in the code.

The problem is that the initial idea is not stated explicitly in the code. The developer translates it into code by using some of the idiomatic patterns typical of the language, or linked to the framework in use, or specific to the organization.

Somewhere a developer is thinking: “this kind of entity has this property” and he is translating it to a field declaration, to a getter and a setter, to a new parameter in a constructor, to some additional lines in the equals and the hashCode methods. The idea of the property is not present explicitly in the code: you can see it if you are familiar with the programming language but it requires some work.

There is so much noise, so many technical details that obscure the intention.

But this is not true just for Java or for the properties: when a developer determines that an object is unique he could decide to implement a singleton, and again this means following certain steps. Or maybe he is deciding to implement a view or a Data Access Object (DAO) or any other typical component required by the framework he is using. In any case all the knowledge he had in mind is scattered in the code, difficult to retrieve.

Why is this bad?

For two reasons:

  • it is difficult to see the knowledge.
  • it is difficult to reuse that knowledge.

What does it mean that it is difficult to see the knowledge in the code?

A lot of work goes into understanding a problem and building a representation of the solution in the code. There are a lot of micro-decisions and a lot of learning involved in the process. All of this effort leads to knowledge.

Where does this knowledge go?

This knowledge is typically not written down directly; it is instead represented in code. On many occasions this knowledge is translated into code in a mechanical way, and therefore the translation can be reversed. The problem is that to see the knowledge that is present in the code you need to look directly at the code, read it carefully and mentally deduce the knowledge that is present there.

Understanding the code this way has a cost:

  • we need to understand programming and the typical patterns we use
  • it takes work to check the details
  • the process is done mentally, so the result is just in our head and cannot be processed automatically

If the knowledge were represented directly, at a higher level of abstraction, it would be easier to check and it would be accessible to more people.

What does it mean that it is difficult to reuse the knowledge in the code?

We have seen that basically the only way we have to extract knowledge from code is reading it and understanding it. The results stay just in our head, so they are not usable by a machine. If we had instead a representation of the abstract knowledge, the original intentions, we could process them for different goals.

We could for example use that knowledge to generate diagrams or reports.

Do you want to know how many views have been written? How many tables do we have in the database? Easy! Project managers could get their answers without having to ask. And they would always get up-to-date, honest answers on the state of the code.

We could also use that information to re-engineer applications, even partially. Some aspects of your application could be migrated to a different version of a library, to a different framework or even to a different language. This does not mean that a complex migration could be performed completely automatically, but it could be a start.

Implementation using the Whole Platform

OK, we have talked about the problem. Let’s now talk about a solution we can build today.

We have previously discussed how to build grammars using the Whole Platform, and we have also seen it when looking into Domain Specific Languages.

The code is available on GitHub, courtesy of Riccardo Solmi, the author of the Whole Platform.

1) Defining the higher level concepts

First of all we need to define the higher level concepts that we have in mind but that are not expressed explicitly in the code. For example, the concept of a property of a Java bean.

In the model-driven parlance we define the metamodel: i.e., the structure of those concepts.

2) Defining how to recognize such concepts in Java code

Once we have those concepts defined we need to specify how to identify them in the code. The Whole Platform uses a Domain Specific Language to specify patterns. It looks like this:

What are we saying here?

We are saying that a certain pattern should be searched for in the selected node and all its descendants. The pattern should match the given snippet: a Field Declaration with the private modifier, associating the label type to the type of the field and the label name to the name of the field.

What should happen when we recognize this pattern?

We should:

  1. Remove the corresponding getter. We will match a method with the expected name (calculated by the getterName function) and the expected type, taking no parameters and returning the field with the expected name
  2. Remove the corresponding setter. It should be a method returning void, with the expected name and the expected parameter, assigning the parameter to the field with the expected name
  3. Replace the Java field with the higher level concept representing the whole property (here it is named Field, but Property would have been a better name)
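The recognition steps above can be approximated in plain Java. This is only a sketch using reflection, not the Whole Platform pattern DSL: it checks names and signatures but, unlike the pattern described above, it does not inspect method bodies. The `getterName` helper mirrors the function mentioned in the text; `setterName`, `PropertyFinder` and the `Account` class are hypothetical names introduced for the illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class PropertyFinder {

    // "owner" -> "getOwner"
    static String getterName(String fieldName) {
        return "get" + capitalize(fieldName);
    }

    // "owner" -> "setOwner" (hypothetical counterpart of getterName)
    static String setterName(String fieldName) {
        return "set" + capitalize(fieldName);
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Names of private fields that have both a matching getter and a matching setter.
    static List<String> findProperties(Class<?> type) {
        List<String> properties = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            if (!Modifier.isPrivate(field.getModifiers())) continue;
            if (hasGetter(type, field) && hasSetter(type, field)) {
                properties.add(field.getName());
            }
        }
        return properties;
    }

    // Getter: expected name, no parameters, returns the field type.
    private static boolean hasGetter(Class<?> type, Field field) {
        try {
            Method getter = type.getDeclaredMethod(getterName(field.getName()));
            return getter.getReturnType().equals(field.getType());
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Setter: expected name, one parameter of the field type, returns void.
    private static boolean hasSetter(Class<?> type, Field field) {
        try {
            Method setter = type.getDeclaredMethod(setterName(field.getName()), field.getType());
            return setter.getReturnType().equals(void.class);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}

// Hypothetical class to analyze: "owner" is a full property, "balance" has no setter.
class Account {
    private String owner;
    private long balance;

    public String getOwner() { return owner; }
    public void setOwner(String owner) { this.owner = owner; }
    public long getBalance() { return balance; }
}
```

Calling `PropertyFinder.findProperties(Account.class)` reports only `owner`, because `balance` lacks a setter and so does not match the whole pattern.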

Now all we need to do is add this action to the contextual menu, under the group name Refactor. We can do that in Whole by defining actions.

Voila! We have now the magic power of recognizing patterns in any piece of Java code. As easy as that.

3) Doing the opposite: expanding the concepts

So far we have discussed how to recognize patterns in code and map them to higher level concepts.

However, we can also do the opposite: we can expand higher level concepts into the corresponding code. In the Whole Platform we can do this with this code:

Let’s focus exclusively on the property definition. Each instance is expanded by:

  1. Adding a field declaration
  2. Adding a getter
  3. Adding a setter
  4. Adding a parameter to the first constructor
  5. Adding an assignment in the first constructor. The assignment takes the value of the added parameter and assigns it to the corresponding field
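The expansion steps listed above can be sketched in plain Java as a small text generator. This is an illustration under simplifying assumptions, not the Whole Platform code: it produces source text as strings instead of modifying a model, and `PropertyExpander` and its methods are hypothetical names.

```java
class PropertyExpander {

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Steps 1-3: field declaration, getter and setter for one property.
    static String expandMembers(String type, String name) {
        String cap = capitalize(name);
        return String.format(
            "private %1$s %2$s;\n"
          + "public %1$s get%3$s() { return %2$s; }\n"
          + "public void set%3$s(%1$s %2$s) { this.%2$s = %2$s; }\n",
            type, name, cap);
    }

    // Step 4: the parameter to add to the first constructor.
    static String constructorParameter(String type, String name) {
        return type + " " + name;
    }

    // Step 5: the assignment to add to the body of the first constructor.
    static String constructorAssignment(String name) {
        return "this." + name + " = " + name + ";";
    }
}
```

For a property `name` of type `String`, `expandMembers("String", "name")` yields the field, `getName` and `setName`, while the other two methods yield `String name` and `this.name = name;`.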

Now, the way we recognize the pattern and the way we reverse it do not match 100% in this example. Bear with us over this little discrepancy: we just wanted to show you two different ways to look at properties in Java.

What this approach could be used for

We see this approach being useful for three goals:

  • Running queries to answer specific questions on code
  • Understanding applications
  • Transforming or re-engineering applications

Queries

We could define patterns for all sorts of things. For example, patterns could be defined to recognize views defined in our code. We could imagine running queries to identify patterns. Those queries could be used by project managers or other stakeholders involved in the project to examine the progress of the project itself.

They could also be used by developers to navigate the code and familiarize themselves with complex codebases. Do we need to identify all observers or all singletons in this legacy codebase? Just run a query!
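As a rough idea of what such a query could look like outside the Whole Platform, here is a sketch in plain Java using reflection. The heuristic (all constructors private, plus a static field or a static no-argument method of the class's own type) is an assumption made for this illustration, and `SingletonQuery`, `Registry` and `Invoice` are hypothetical names.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class SingletonQuery {

    // Heuristic: only private constructors, and a static member exposing the own type.
    static boolean looksLikeSingleton(Class<?> type) {
        Constructor<?>[] constructors = type.getDeclaredConstructors();
        if (constructors.length == 0) return false; // e.g. interfaces
        for (Constructor<?> constructor : constructors) {
            if (!Modifier.isPrivate(constructor.getModifiers())) return false;
        }
        for (Field field : type.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) && field.getType().equals(type)) {
                return true;
            }
        }
        for (Method method : type.getDeclaredMethods()) {
            if (Modifier.isStatic(method.getModifiers())
                    && method.getParameterCount() == 0
                    && method.getReturnType().equals(type)) {
                return true;
            }
        }
        return false;
    }

    // The "query": simple names of all candidate classes that look like singletons.
    static List<String> findSingletons(Class<?>... candidates) {
        List<String> names = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            if (looksLikeSingleton(candidate)) names.add(candidate.getSimpleName());
        }
        return names;
    }
}

// Hypothetical classes to query.
class Registry {
    private static final Registry INSTANCE = new Registry();
    private Registry() {}
    public static Registry getInstance() { return INSTANCE; }
}

class Invoice {
    Invoice() {}
}
```

Here `SingletonQuery.findSingletons(Registry.class, Invoice.class)` reports only `Registry`. A pattern-based tool can be more precise than this heuristic because it can also match the structure of method bodies.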

Understanding applications

The fact is that programming languages tend to be very low level and very detailed. The amount of boilerplate code varies between languages, sure, but it is always there.

Now, one of the problems is that the amount of code can hide things and make them difficult to notice. Imagine reading ten Java beans in a row. Among them there is one that implements the equals method slightly differently from what you expect: for some reason it ignores one field, or compares one field using identity instead of equality. This is a detail that has a meaning but that you would very probably miss as you look at the code.

Why is that?

This happens because after looking at a large amount of code and expecting to see certain patterns you become blind to those patterns. You stop reading them without even noticing it.

By recognizing patterns automatically (and precisely) we can represent higher level concepts and easily spot things that do not fit in common patterns, like slight differences in equals methods.

We have seen how to recognize bean properties, but we can go further and recognize whole beans. Consider this example.

This shows the relevant information only. All the redundant information is gone. This makes things obvious.

We can recognize increasingly complex patterns by proceeding incrementally. In this way exceptions pop up and do not go unnoticed.

Transforming applications

When we write code we translate some higher level ideas into code, following an approach that depends on the technology we have chosen. Over time the best technology could change even if the idea stays the same. We still want the same views, but maybe the way to define them has changed in the new version of our web framework. Maybe we still want to use singletons, but we have decided it is better to use public static methods with lazy initialization to provide the instance, instead of using a public static field.

By identifying higher level concepts we could decide to translate them differently, generating different code. This process is called re-engineering and we can perform it automatically to some extent. It seems a good idea to me, and it is another advantage of using patterns to identify higher level concepts.
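The singleton case can be illustrated with two possible translations of the same concept. Both classes are hypothetical examples written for this illustration: the concept ("there is exactly one instance of the settings") is the same, and only the generated code differs.

```java
// Translation A: the single instance is exposed through a public static field.
class EagerSettings {
    public static final EagerSettings INSTANCE = new EagerSettings();

    private EagerSettings() {}
}

// Translation B: the same concept, exposed through a public static method
// with lazy initialization (synchronized to stay safe with multiple threads).
class LazySettings {
    private static LazySettings instance;

    private LazySettings() {}

    public static synchronized LazySettings getInstance() {
        if (instance == null) {
            instance = new LazySettings();
        }
        return instance;
    }
}
```

Once the singleton concept has been recognized, moving from one form to the other is a matter of expanding the concept differently (callers that read `INSTANCE` would of course need to be updated to call `getInstance()` as well).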

Summary

Code has an incredible value for organizations because it captures a lot of knowledge in a form that is executable. As we evolve our applications and cover more corner cases, we improve our knowledge and our code. After years of development a code base often becomes invaluable for the organization owning it. However, that value is as if frozen: there is not much we can do with it. It is even difficult to understand exactly how much information there is in the code.

We think that one approach to extract knowledge from the code is proceeding bottom up: recognizing the small abstractions and composing over them, step by step, until we recognize larger structures and patterns and can easily represent the big picture hidden in our application.

Using the Whole Platform is invaluable for these experiments.

This article has been written following a visit of Riccardo Solmi, the author of the Whole Platform. I would like to thank him for building this great product, for sharing ideas and writing the code used in this article. The code used in this article is available on GitHub.

Getting started with the Whole Platform: building grammars

I played with the Whole Platform for the first time a few years ago. It was one of the first Language Workbenches I laid my eyes on and I found it fascinating. Then I was dragged into other things: whoever went through a PhD knows what I mean. Academic life always has a way to distract you.

Now I have decided to find the time to take another look. I want to learn about it and compare it to other tools I have used, most notably JetBrains MPS. So, let’s get started.

Getting started with the Whole Platform

The Whole Platform can be downloaded from here: https://sourceforge.net/projects/whole/

It is based on Eclipse, but it is distributed as a separate IDE rather than as a plugin. You just have to unzip it and start it.

Update: I chose to download it as a separate IDE but you can also install it as a set of plugins for your Eclipse installation. Just use this update site: http://whole.sourceforge.net/updates

Once you have started it you have to create a Whole Project. Inside the Whole Project we are going to create a Whole Model. This is the wizard you will meet:

[Screenshot: the New Whole Model wizard]

What surprised me at this stage is the flexibility we have. We can choose:

  1. the language we are going to use for this model
  2. the template
  3. the persistence format.

“Language” in this case means the metamodel: the kind of information we are going to produce. We could then save the same information using different persistence formats: for example the generic Whole XML format or some custom textual format created for a specific language. We can also use different templates for each Language. I think this can significantly speed up daily activities. It also gives newbies like me access to some examples for each language: I created a few grammars using different templates and by looking at those I understood how the Grammars language worked. Training is an aspect frequently underestimated when talking about DSLs, and this little feature could help.

Whole Platform and grammars

From what I read and from talking with Riccardo Solmi, the author of the Whole Platform, I understand that one of the strengths of this Language Workbench is its ability to support several persistence formats. It means you can load and save models using different formats.

This is particularly important when you want to open files written in a format already defined. Suppose for example that we are dealing with some very simple todo lists. Something like:

We want to be able to import this file as it is in the Whole Platform and edit it through its reflective editor. Like this:

In practice we are using two projections of the same data and it works like a charm.

Define the grammar

How could we import our todo list in the Whole Platform to later process it?

By defining the corresponding grammar:

[Screenshot: the todo list grammar]

A common Backus-Naur Form (BNF) grammar has one goal: recognize the information in a text file and build a structure out of it (the Abstract Syntax Tree). In the Whole Platform grammars have instead two roles:

  1. parse the text files to get the AST (same as the BNF grammars)
  2. serialize the AST back into the text format

So each element of the grammar has to provide these two functions. Let’s look into this.

  • We are saying that our start or top symbol is the TodoFile
  • Each TodoFile is composed of a list of TodoElement. This list will be assigned to the property todoList of the TodoFile
  • Each TodoElement is a sequence of an asterisk followed by a Todo, which is basically a string
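To illustrate the two roles outside the Whole Platform, here is a sketch in plain Java that both parses and serializes a todo list. It assumes a file format with one todo per line, each introduced by an asterisk (for example `* buy milk`); that sample format is an assumption made for this illustration. In Whole both directions come from the single grammar definition, while here they are two hand-written methods.

```java
import java.util.ArrayList;
import java.util.List;

// Mirrors the TodoFile concept of the grammar: a list of todos in the property todoList.
class TodoFile {
    final List<String> todoList = new ArrayList<>();

    // Role 1: parse the text into the model. Spaces and newlines act as
    // delimiters and are not stored in the model.
    static TodoFile parse(String text) {
        TodoFile file = new TodoFile();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            if (!trimmed.startsWith("*")) {
                throw new IllegalArgumentException("Expected '*' at: " + line);
            }
            file.todoList.add(trimmed.substring(1).trim());
        }
        return file;
    }

    // Role 2: serialize the model back into the text format,
    // re-creating the delimiters that were ignored while parsing.
    String serialize() {
        StringBuilder out = new StringBuilder();
        for (String todo : todoList) {
            out.append("* ").append(todo).append("\n");
        }
        return out.toString();
    }
}
```

Parsing and then serializing normalizes the delimiters: an input line written as `*   call Bob` comes back as `* call Bob`.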

We can make the grammar simpler to examine by normalizing it. We have a feature to perform this refactoring automatically for us. We get this:

[Screenshot: the normalized todo list grammar]

To be precise, we almost get this automatically: I just had to rename the three derived elements (Asterisk, Space, and NewLine). They were created with default names (Token, Token1, and Token2).

Look at terminals

Terminals are represented by that sort of division. Above the line we have a regular expression which tells us how to parse that element, while the expression below the line tells us how that node can be serialized.

It is interesting to notice that Space and NewLine are not actually associated with any character of the input while parsing: they correspond to characters which are ignored (see the Delimiter rule). However, these definitions are useful when we want to dump our model to text.

Let’s look at the other terminals we have:

[Screenshot: the Asterisk terminal]

The regular expression above the line means: parse an asterisk. You can use all the regular expressions recognized by Java: see the documentation for details. The asterisk has a special meaning in a regular expression, so we escape it by preceding it with \Q and following it with \E. The text below the line is instead a simple string which will be used in the generated text file to represent this node: it is just an asterisk.

This is instead the content of one Todo element:

[Screenshot: the Todo terminal]

We specify that the description of our todo element can contain any number of letters, digits, spaces, tabs or underscores. When we dump it we will just copy its whole content (%s).
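Since these terminals rely on Java regular expressions and format strings, their behavior can be tried directly in Java. The exact expressions used in the grammar are not reproduced here, so the patterns in this sketch are assumptions that follow the description in the text; `TerminalPatterns` is a hypothetical name.

```java
import java.util.regex.Pattern;

class TerminalPatterns {

    // \Q...\E quotes everything in between, so the asterisk is taken literally.
    static final Pattern ASTERISK = Pattern.compile("\\Q*\\E");

    // Assumed pattern for a Todo: any number of letters, digits, spaces, tabs or underscores.
    static final Pattern TODO = Pattern.compile("[a-zA-Z0-9 \\t_]*");

    // Serialization side of the Todo terminal: copy the whole content (%s).
    static String serializeTodo(String content) {
        return String.format("%s", content);
    }

    public static void main(String[] args) {
        System.out.println(ASTERISK.matcher("*").matches());
        System.out.println(TODO.matcher("buy 2 eggs").matches());
        System.out.println(TODO.matcher("buy milk!").matches());
        System.out.println(serializeTodo("buy 2 eggs"));
    }
}
```

Without the quoting, a lone `*` would not be a valid regular expression in Java, because the quantifier would have nothing to repeat.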

The nice thing is that in Whole you can define a grammar and immediately use it to process existing files, without the need to generate parsers or anything else. If you are familiar with the environment, this allows a very fast turnaround.

Conclusions

The Whole Platform feels different from the other Language Workbench I am most familiar with: JetBrains MPS. It feels more flexible, and I think that the possibility to work seamlessly with existing formats is a great feature.

There are things I miss: the auto-completion in MPS makes me faster when defining my models. To be fair, I have some years of experience using JetBrains MPS while I am new to Whole. Perhaps using its drag and drop functionality could make the editing much faster. Also, several refactorings are available from the contextual menu, the same mechanism used to create nodes. In MPS you create nodes through auto-completion and then you have intentions for refactoring (triggered by pressing ALT + Enter).

Renaming works differently: in MPS a reference to an existing element is automatically updated when the original element is renamed. This is not the default behavior in Whole, although a refactoring action can easily be implemented to achieve the same result. In this case, as in others, I had the impression that Whole is about flexibility while MPS is about sensible default behaviors.

One problem is the documentation: the tool has been used extensively in a large company in Italy, but it lacks publicly available documentation. To this day the best source of documentation is the submission to the Language Workbench Challenge for 2011: it contains a huge tutorial describing screen by screen how the solution to non-trivial tasks was implemented. You can download it from here: https://sourceforge.net/projects/whole/. I hope to help a little bit by writing this and hopefully more tutorials.

Finally, let me thank Riccardo Solmi for helping me out while I was experimenting with the Whole Platform. He is the author of this incredible platform, and while he had some valuable help from Enrico Persiani over the years, I think it is an incredible achievement to have designed and built a tool which can compete with a product from JetBrains.