Articles about extracting data from code, analyzing it, and programmatically transforming it. In other words, we talk about static analysis, automated refactoring, and code generation.

Recognize patterns in Java code to understand and transform applications

Code is arguably the most valuable asset for many organizations.

However, the value trapped in the code is not easy to use.

Why?

In this article we are going to see how we can extract and process the knowledge contained in code by specifying patterns. We will see:

  1. A clear example of what we mean by extracting knowledge from code
  2. An explanation of how to implement this approach in practice (with code available on GitHub)
  3. A discussion of why we want to extract knowledge from code

What do you mean? Show it to me

Let’s see what we mean in practice, with the simplest example we can come up with. And let’s see it working for real, on real code.
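Take the classic Java bean below. The developer is thinking “a Person has a name”, but that single idea ends up spread across a field, a constructor parameter, a getter and a setter (and often equals and hashCode too). The snippet is our own illustration, not code from the article’s repository:

    public class Person {
        // the single idea "a Person has a name" is encoded in several places
        private String name;

        public Person(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }
    }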

Of course recognizing properties is a very simple example, something we are all familiar with, but the same idea can be applied to more complex patterns.

There are many patterns that can be used:

  • patterns typical of a language: think about the for loops to iterate over collections
  • design patterns: singleton, observer, delegate to name just a few
  • patterns related to frameworks: think about all the applications based on MVC or the DAO defined to access database tables
  • project specific patterns: for example in JavaParser we use the same structure for the dozens of AST classes we defined

Patterns can concern small pieces of a single method or the organization of entire applications.

Patterns can also be applied incrementally: we can start by recognizing smaller patterns, making the application simpler, and then recognize more complex patterns over the simplified application.

How knowledge ends up being trapped in code

Developers use a lot of small idiomatic patterns when writing programs. Experienced developers apply them automatically while implementing their solutions. Slowly they build large applications that contain a lot of knowledge, enshrined in the code.

The problem is that the initial idea is not stated explicitly in the code. The developer translates it into code by using some of the idiomatic patterns typical of the language, or linked to the framework they are using, or specific to their organization.

Somewhere a developer is thinking: “this kind of entity has this property” and translating it into a field declaration, a getter and a setter, a new parameter in a constructor, and some additional lines in the equals and hashCode methods. The idea of the property is not present explicitly in the code: you can see it if you are familiar with the programming language, but it requires some work.

There is so much noise, so many technical details that obscure the intention.

But this is not true just for Java or for properties: when a developer determines that an object is unique he could decide to implement a singleton, and again this means following certain steps. Or maybe he decides to implement a view or a Data Access Object (DAO) or any other typical component required by the framework he is using. In any case all the knowledge he had in mind is scattered in the code, difficult to retrieve.

Why is this bad?

For two reasons:

  • it is difficult to see the knowledge.
  • it is difficult to reuse that knowledge.

What does it mean that it is difficult to see the knowledge in the code?

A lot of work goes into understanding a problem and building a representation of the solution in the code. There are a lot of micro-decisions and a lot of learning involved in the process. All of this effort leads to knowledge.

Where does this knowledge go?

This knowledge is typically not written down directly; it is instead represented in code. The fact is that on many occasions this knowledge is translated into code in a mechanical way. Therefore this translation can be reversed. The problem is that to see the knowledge present in the code you need to look directly at the code, read it carefully and mentally deduce the knowledge it contains.

To understand things:

  • we need to understand programming and the typical patterns we use
  • it requires work to check the details
  • the process is done mentally, so the result is just in our head and cannot be processed automatically

If the knowledge were represented directly, at a higher level of abstraction, it would be easier to check and it would be accessible to more people.

What does it mean that it is difficult to reuse the knowledge in the code?

We have seen that basically the only way we have to extract knowledge from code is to read it and understand it. The results stay in our head, so they are not usable by a machine. If we had instead a representation of the abstract knowledge, the original intentions, we could elaborate on them for different goals.

We could for example use that knowledge to generate diagrams or reports.

Do you want to know how many views have been written? How many tables we have in the database? Easy! Project managers could get their answers without having to ask. And they would always get updated, honest answers on the state of the code.

We could also use that information for re-engineering applications, even partially. Some aspects of your application could be migrated to a different version of a library, to a different framework or even to a different language. This does not mean that complex migrations could be performed completely automatically, but it could be a start.

Implementation using the Whole Platform

Ok, we have talked about the problem, let’s now talk about a solution we can build today.

We have previously discussed how to build grammars using the Whole Platform, and we have seen it also when looking into Domain Specific Languages.

The code is available on GitHub, courtesy of Riccardo Solmi, the author of the Whole Platform.

1) Defining the higher level concepts

First of all we need to define the higher level concepts that we have in mind but that are not expressed explicitly in the code. For example, the concept of a property of a Java bean.

In model-driven parlance we define the metamodel: i.e., the structure of those concepts.

2) Defining how to recognize such concepts in Java code

Once we have those concepts defined we need to specify how to identify them in the code. The Whole Platform uses a Domain Specific Language to specify patterns. It looks like this:

What are we saying here?

We are saying that a certain pattern should be searched for in the selected node and all its descendants. The pattern should match the given snippet: a Field Declaration with the private modifier, associating the label type to the type of the field and the label name to the name of the field.

What should happen when we recognize this pattern?

We should:

  1. Remove the corresponding getter. We will match a method with the expected name (calculated by the getterName function), the expected type, taking no parameters and returning the field with the expected name (see the sketch after this list)
  2. Remove the corresponding setter. It should be a method returning void, with the expected name, the expected parameter and assigning the parameter to the field with the expected name
  3. Replace the Java field with the higher level concept representing the whole property (here it is named Field but Property would have been a better name)
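The article shows this rule in the Whole Platform’s pattern DSL. Just to make the matching logic concrete, here is a rough equivalent of the getter check written against the JavaParser API instead (our own sketch; isGetterFor is a name we made up):

    // Does this method look like the getter for the given (single-variable) field?
    static boolean isGetterFor(MethodDeclaration method, FieldDeclaration field) {
        VariableDeclarator variable = field.getVariable(0);
        String name = variable.getNameAsString();
        String expectedName = "get" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        return method.getNameAsString().equals(expectedName)
                && method.getParameters().isEmpty()
                && method.getType().equals(variable.getType());
    }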

Now, what we need to do is just to add this action into the contextual menu, under the group name Refactor. We can do that in Whole by defining actions.

Voila! We now have the magic power of recognizing patterns in any piece of Java code. As easy as that.

3) Doing the opposite: expanding the concepts

So far we have discussed how to recognize patterns in code and map them to higher level concepts.

However we can also do the opposite: we can expand higher level concepts into the corresponding code. In the Whole Platform we can do this with this code:

Let’s focus exclusively on the property definition. Each instance is expanded by:

  1. Adding a field declaration
  2. Adding a getter
  3. Adding a setter
  4. Adding a parameter to the first constructor
  5. Adding an assignment in the first constructor. The assignment takes the value of the added parameter and assigns it to the corresponding field

Now, the way we recognize the pattern and the way we reverse it do not match 100% in this example; bear with us over this little discrepancy. We just wanted to show you two different ways to look at properties in Java.

What this approach could be used for

We see this approach being useful for three goals:

  • Running queries to answer specific questions on code
  • Understanding applications
  • Transforming or re-engineering applications

Queries

We could define patterns for all sorts of things. For example, patterns could be defined to recognize the views defined in our code. We could then run queries to identify those patterns. Those queries could be used by project managers or other stakeholders involved in the project to examine the progress of the project itself.

They could also be used by developers to navigate the code and familiarize themselves with complex codebases. Do we need to identify all observers or all singletons in this legacy codebase? Just run a query!

Understanding applications

The fact is that programming languages tend to be very low level and very detailed. The amount of boilerplate code varies between languages, sure, but it is always there.

Now, one of the problems is that the amount of code can hide things and make them difficult to notice. Imagine reading ten Java beans in a row. Among them there is one that implements the equals method slightly differently from what you expect: for some reason it ignores one field, or compares one field using identity instead of equality. This is a detail that has a meaning but that you would very probably miss as you look at the code.
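For instance, an equals implementation along these lines (our own illustration) is easy to gloss over, yet the == on the name field compares references instead of values:

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        // subtle: identity comparison on name instead of name.equals(other.name)
        return this.name == other.name && this.age == other.age;
    }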

Why is that?

This happens because after looking at a large amount of code and expecting to see certain patterns you become blind to those patterns. You stop reading them without even noticing it.

By recognizing patterns automatically (and precisely) we can represent higher level concepts and easily spot things that do not fit in common patterns, like slight differences in equals methods.

We have seen how to recognize bean properties, but we can go further and recognize whole beans. Consider this example.

This shows the relevant information only. All the redundant information is gone. This makes things obvious.

We can recognize increasingly complex patterns by proceeding incrementally. In this way exceptions pop up and do not remain unnoticed.

Transforming applications

When we write code we translate some higher level ideas into code, following an approach that depends on the technology we have chosen. Over time the best technology can change even if the idea stays the same. We still want the same views, but maybe the way to define them has changed in the new version of our web framework. Maybe we still want to use singletons, but we have decided it is better to provide the instance through a public static method with lazy initialization, instead of using a public static field.

By identifying higher level concepts we could decide to translate them differently, generating different code. This process is called re-engineering and we can perform it automatically to some extent. It seems a good idea to me and it is another advantage of using patterns to identify higher level concepts.

Summary

Code has an incredible value for organizations because it captures a lot of knowledge in a form that is executable. As we evolve our applications and cover more corner cases we improve our knowledge and our code. After years of development a code base often becomes invaluable for the organization owning it. However that value is frozen: there is not much we can do with it. It is even difficult to understand exactly how much information there is in the code.

We think that one approach to extract knowledge from code is to proceed bottom up: recognizing the small abstractions and composing over them, step by step, until we recognize larger structures and patterns and can easily represent the big picture hidden in our application.

Using the Whole Platform is invaluable for these experiments.

This article has been written following a visit of Riccardo Solmi, the author of the Whole Platform. I would like to thank him for building this great product, for sharing ideas and writing the code used in this article. The code used in this article is available on GitHub.

Translate Javascript to C#

Problems and Strategies to Port from Javascript to C#

Let’s say you need to automatically port some code from one language to another: how are you going to do it? Is it even possible? Maybe you have already seen a conversion between similar languages, such as Java to C#. That sounds much simpler in comparison.

In this article we are going to discuss some strategies to translate Javascript to a very different language, such as C#. We will discuss the issues involved and sketch some possible solutions. We will not get to writing code: that would be far too complicated for an introduction to the topic. Let’s avoid putting together something terribly hacky just for the sake of typing some code.

Having said that, we are going to see all the problems you may find in converting one real Javascript project: fuzzysearch, a tiny but very successful library to calculate the difference between two strings, in the context of spelling correction.

When it’s worth the effort

First of all you should ask yourself if the conversion is worth the effort. Even if you were able to successfully obtain some runnable C#, you have to consider that the style and the architecture will probably be “unnatural”. As a consequence the project could be harder to maintain than if you wrote it from scratch in C#.

This is a common problem even in carefully planned conversions, such as the one that originated Lucene.net, which started as a conversion from Java to C#. Furthermore, you will not be able to use the result without manual work for every specific project, because even the standard libraries are just different. Look at the example: while you could just capitalize haystack.length into haystack.Length, you cannot just capitalize charCodeAt; you will have to map different functions between the source and destination language.
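To make the kind of mapping concrete, here is a small sketch of ours (not code from fuzzysearch): in C# a string indexer yields a char, which widens implicitly to the same UTF-16 code unit that charCodeAt returns:

    // Javascript: haystack.length and needle.charCodeAt(i)
    int length = haystack.Length;  // a property in C#, not a field-like length
    int code = needle[i];          // char implicitly widens to its UTF-16 code unit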

On the other hand all languages have areas of specialization which may interest you, such as Natural Language Processing in Python. And if you accept the fact that you will have to do some manual work, and you are very interested in one project, creating an automatic conversion will give you a huge head start. Though if you are interested in having a generic tool you may want to concentrate on small libraries, such as the Javascript one in our example.

Parse with ANTLR

The first step is parsing, and for that you should just use ANTLR. There are already many grammars available; they may not necessarily be up-to-date, but they are much better than starting from scratch and they will give you an idea of the scale of the project. You should use visitors, instead of listeners, because they allow you to control the flow more easily. You should parse the different elements into custom classes that can manage the small problems that arise. Once you have done this, generating C# should be easier.

The small differences

There are things that you could just skip, such as the first and last lines: they most probably don’t apply to your C# project. But you must pay attention to the small differences: the var keyword has a different meaning in Javascript and C#. By coincidence it would work most of the time, and it would be quite useful to work around the lack of strict typing in Javascript. But it’s not magic: you are just hoping that the compiler will figure it out. And sometimes it’s not a one to one conversion. For instance you can’t use var in C# the way it’s used in the initialization of the for cycle.
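A quick illustration (ours) of that last point: C# does not allow multiple declarators in a single implicitly typed declaration, so the common Javascript for header does not survive a mechanical capitalization:

    // Javascript: for (var i = 0, len = haystack.length; i < len; i++) { ... }
    // C#: 'var' cannot declare two variables at once, so it must become explicit:
    for (int i = 0, len = haystack.Length; i < len; i++) { /* ... */ }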

The continue before outer should be transformed into a goto, but when continue appears alone it works just as in C#. A difference that can be fixed quite brutally is the strict equality comparison “===/!==”, which can be replaced with “==/!=” in most cases, since it’s related to problems due to the dynamic typing of Javascript. In general you can do a pre-parse check and transform the original source code to avoid some problems, or even comment out some things that cannot be easily managed.
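Since C# has no labeled continue, one mechanical option (sketched here by us) is a goto targeting a label placed at the end of the outer loop body:

    // Javascript: outer: for (...) { for (...) { if (skip) continue outer; } }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            if (skip) goto continueOuter;
        }
        continueOuter: ;  // the label marks the end of the outer loop body
    }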

I present you thy enemy: dynamic typing

The real problem is that Javascript uses dynamic typing while C# uses static typing. In Javascript anything could be anything, which leads to certain issues, such as the aforementioned strict equality operator, but it’s very easy to use. In C# you need to know the type of your variables, because there are checks to be made. And this information is simply not available in the Javascript source code. You might think that you could just use the var keyword, but you can’t. The compiler must be able to determine the real type at compile time, something that will not always be possible. For example you cannot use it in declaring function arguments.

You can use the dynamic keyword, which makes the type be determined at execution time. Still, this doesn’t fix all the problems, such as initialization. You may check the source code for literal initialization or, in theory, even execute the original Javascript and find a way to determine the correct type. But that would be quite convoluted. You might get lucky, and in small projects, such as our example, you will, but not always.

There are also problems that are easier to manage than you might imagine. For instance, assigning a function to a variable is not something that you usually do as explicitly in C# as you do in Javascript. But it’s easy using delegate types and constructs such as Func. Of course you still have to determine the correct types of the arguments, if any are present, but that doesn’t add any other difficulties per se.
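A small sketch (ours) of what this looks like in practice:

    // Javascript: var fuzzy = function (needle, haystack) { ... };
    // C#: a lambda assigned to a Func delegate variable
    Func<string, string, bool> fuzzy = (needle, haystack) =>
        haystack.Contains(needle);  // placeholder body, for illustration only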

Not everything is an object and other issues

In Javascript “string” is a string, but not an object, while in C# everything is an object, with no exceptions. This is a relevant issue, but it’s less problematic than dynamic typing. For instance, to convert our example we just have to wrap the function in a custom class, which is not really hard. One obvious problem is that there are different libraries in different languages. Some will not be available in the destination language. On the other hand some parts of the project might not be needed in the destination language, because there are already better alternatives. Of course you still have to actually change all the related code, or wrap the real library of the destination language in a custom class that mimics the original one.

Conclusion

There are indeed major difficulties, even for small projects, in transforming code from one language to another, especially when they are as different as Javascript and C#. But let’s imagine that you are interested in something very specific, such as a very successful library and its plugins. You want to port the main library and to give the developers of the plugins a simpler way to port their work. There are probably many similarities in the code, so you can do most of the work to manage the typical problems and provide guidance for the remaining ones.

Converting code between languages so different in nature is certainly not easy, but you can apply a mixed automatic/manual approach: convert a large amount of code automatically and fix the corner cases manually. If you can also translate the tests, you can later refactor the code, once it is in C#, and over time improve its quality.

Code Generation with Roslyn: a Skeleton Class from UML


Get the source code for this article on GitHub

We have already seen some examples of transformation and analysis of C# code with Roslyn. Now we are going to see how to create a more complex example of code generation with Roslyn, together with parsing done with Sprache. We are going to create a skeleton class from a PlantUML file. In short, we are doing the inverse of what we have done before. The first step is to parse, of course.
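The original post shows the input .plantuml file at this point; here is a minimal file in the same spirit (ours, not the exact one from the repository):

    @startuml
    class Person {
        - name : string
        + {static} Create(name : string) : Person
        + GetName() : string
    }
    @enduml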

As you can see, there are four entities in this example: PlantUML start and end tags and class, variable and function declarations.

Parsing all the things

We are going to parse the file line by line instead of doing it in one big swoop. This is partly because of the limitations of Sprache, but also because it’s easier to correctly parse one thing at a time instead of trying to get it all right in one go.

With CharExcept we parse all characters except for the one(s) indicated, which is a handy but imprecise way to collect all the text for an identifier. The roughness of this process is obvious, because we are forced to exclude all the characters that come after an identifier. If you look at the .plantuml file at the beginning of the article, you see that there is a space after the field names, a ‘}’ after the modifier static, a ‘:’ after the argument, to divide the identifier from its type, and finally the closing parenthesis after the type. You might say that we should simply have checked for “Letters”, which would work in this specific case, but would exclude legal C# names for identifiers.
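As a rough idea of what such a rule can look like in Sprache (our simplification, not the project’s actual parser):

    // Collect identifier text by excluding the delimiters that can follow it
    static readonly Parser<string> Identifier =
        Parse.CharExcept(" :)}").Many().Text().Token();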

The Modifier parser is quite uninteresting, except for lines 6 and 11, where we see the same problem just mentioned of identifying the correct name. The last case refers to something that doesn’t happen in this example, but could happen in other UML diagrams: override modifiers. The real deal is in lines 18 and 22, where we see the Ref parser, which is used, as the documentation says, to “Refer to another parser indirectly. This allows circular compile-time dependency between parsers”. DelimitedBy is used to select many of the same items delimited by the specified rule, and finally Optional refers to a rule that isn’t necessary to parse correctly, but might appear. Since the rule is optional, the value could be undefined and it must be accessed using the method shown on line 22. The rule Method is slightly more complicated, but it uses the same methods. In case you are wondering, methods without a return type are constructors.

Parsing line by line

We can see our parser at work in the main method, where we try to parse every line with every parser and, if successful, we add the value to a custom type that we are going to see later. We need a custom type because code generation requires having all the elements in their place; we can’t do it line by line, at least not if we want to use the formatter of Roslyn. We could just take the information and print it ourselves, which is good enough for small projects, but complicated for larger ones. Also, we would miss all the nice automatic options for formatting. On line 13 we skip an iteration if we found a method, because methods could also be parsed, improperly, as fields; to avoid that risk we jump over them.

Code Generation

If you remember the first lessons about Roslyn, it’s quite verbose, because it’s very powerful. You also have to remember that we can’t modify nodes, not even the ones we create ourselves that are not, say, parsed from a file. Once you get used to using SyntaxFactory for everything, it’s all quite obvious: you just have to find the correct methods. The using directives are simply the ones usually inserted by default by Visual Studio.
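For instance, the skeleton of the generated file can be put together along these lines (a hedged sketch using standard SyntaxFactory calls, not the exact code from the repository):

    using static Microsoft.CodeAnalysis.CSharp.SyntaxFactory;

    // An empty compilation unit with the default Visual Studio usings
    var cu = CompilationUnit()
        .AddUsings(
            UsingDirective(ParseName("System")),
            UsingDirective(ParseName("System.Collections.Generic")),
            UsingDirective(ParseName("System.Linq")),
            UsingDirective(ParseName("System.Text")));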

Generation of methods

Let’s start by saying that Declarations and DeclarationType are fields of our custom class, which is not shown, but you can look at it in the source code. Then we proceed to generate the methods of our skeleton C# class. MethodDeclaration allows us to choose the name and the return type of the method itself; mods refers to the modifiers, which obviously could be more than one, and so they are in a list. Then we create the parameters, which in our case need only a name and a type.

We choose to throw an exception, since we obviously cannot determine the body of the methods just from the UML diagram. So we create a throw statement and a new object of the type NotImplementedException. This also allows us to add a meaningful body to the method. You should add a body in any case, if you use the formatter, because otherwise it will not create a correct method: there won’t be a body or the curly braces.
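Put together, generating one such method might look like this sketch (ours; the names are illustrative):

    // public void DoSomething(string name) { throw new NotImplementedException(); }
    var method = MethodDeclaration(ParseTypeName("void"), "DoSomething")
        .AddModifiers(Token(SyntaxKind.PublicKeyword))
        .AddParameterListParameters(
            Parameter(Identifier("name")).WithType(ParseTypeName("string")))
        .WithBody(Block(
            ThrowStatement(
                ObjectCreationExpression(ParseTypeName("NotImplementedException"))
                    .WithArgumentList(ArgumentList()))));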

Generation of fields

The “field” case is easier than the “method” one, and the only really new thing is on line 12, where we use a method to parse the type from a string filled in by our parser.

The end of the Generate method is where we add the class created by the for cycle and use the Formatter. Notice that cu is the CompilationUnitSyntax that we created at the beginning of the method.
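The formatting call itself is short (a sketch; Formatter lives in Microsoft.CodeAnalysis.Formatting and needs a Workspace instance):

    var formatted = Formatter.Format(cu, new AdhocWorkspace());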

Limitations of this example

The unit tests are not shown because they don’t contain anything worth noting, although I have to say that Sprache is really easy to test, which is a great thing. If you run the program you will find that the generated code is correct, but it’s still missing something. It lacks some of the necessary using directives, because we can’t detect them starting just from the UML diagram. In a real life scenario, with many files and classes and without the original source code, you might identify the assemblies beforehand and then use reflection to find their namespace(s). Also, we obviously don’t implement many things that PlantUML has, such as the relationships between classes, so keep that in mind.

Conclusions

Code generation with Roslyn is not hard, but it requires knowing exactly what you are doing. It’s better to have an idea of the code you are generating beforehand, or you will have to take into account every possible case, which would make every little step hard to accomplish. I think it works best for specific scenarios and short pieces of code, for which it can become very useful. In such cases, you can create tools that are useful and productive for your project, or for yourself, in a very short period of time and benefit from them, as long as you don’t change tools or work habits. For instance, if you are a professor, you could create an automatic code generator to translate your pseudo-code of a short algorithm into real C#. If you think about it, this complexity is a good thing; otherwise, if anybody could generate whole programs from scratch, we programmers would lose our jobs.

You might think that using Sprache for such a project was a bad idea, but it’s actually a good tool for parsing single lines. And while there are limitations, this approach makes it much easier to get something working in little time, instead of waiting to create a complete grammar for a “real” parser. For the cases in which code generation is most useful, specific scenarios and such, this is actually the best approach, in my opinion, since it allows you to easily pick and choose which parts to use and just skip the rest.

Code Climate: a Service for Static Analysis


GitHub has been a revolution for developers. You could consider SourceForge a predecessor, in the sense that it also let people share code. But GitHub is not simply a place where you can download programs; it’s mainly a platform for developers. One thing that GitHub has brought is integrations, of which there are many. The most famous is probably Travis and, of course, there are many other services for continuous integration. There are also other integrations, such as the ones with messaging apps, which are useful, but less obvious. Today, though, we talk about a less known code-related one: Code Climate, a service for static analysis.

Static analysis as a service

CodeClimate integration in GitHub

We have mentioned static analysis when we talked about Roslyn, but this service delivers it. Although it doesn’t cover C#, which should probably be a crime. Following the lead of GitHub, it’s also free for open source, but it costs for businesses. This is a successful model that is also followed by Travis and AppVeyor, which have become synonymous with CI on Linux and Windows, and, among other things, by Walkmod, another tool that can modify your code to your liking.

You can hook the service to your pull requests, but there is also a local version that uses a Docker image of the CodeClimate CLI, which is great if you don’t want to wait for pushes to check your code. You can easily integrate the JSON output with the other tools you use. On the other hand, if you find a use for such a service you don’t really want to manage it yourself. And you really want to check everybody’s code, not just yours. After all, they created it for GitHub, where developers work together.

It seems to hate Javascript developers

Code Climate explanation

An example of a problem found by the Code Climate analysis

The service itself seems to rely on libraries developed by other people. There is nothing wrong per se in that, of course; there is tremendous value in offering a service. And leveraging the work of everybody allows them to check for everything from bugs, security risks, excessive complexity and style violations to basically everything you can imagine; they can even check Chef cookbooks and bash scripts. People that have a not-invented-here attitude can’t do that. Though the drawback is that there is a lack of uniformity in analysis and judgement between different languages. If you look at their Open Source section, it seems that every Javascript project is judged harshly: Bootstrap, jQuery, node, etc. Of course I haven’t checked everything, but the only one that seems to score well is D3. If you look at Ruby instead, at Jekyll or Rails, the picture looks much better.

It may be that Javascript developers are all terrible, but that seems unlikely; more probably there are differences in the tools that focus on Javascript compared to other kinds of languages. There might be legitimate concerns, maybe because Javascript is used in a different way. Rules exist to help you create good software, but if the best programmers ignore them, are they actually good rules? Are they rules that can be followed without killing productivity? While this is a great way to humble everybody, the risk is that many warnings will soon become ignored warnings; after all, if everything sucks, why bother? We all know the story of The Boy Who Cried Wolf. So you have to be careful and take your time to configure it in a way that works for your project.

Conclusion

If you are searching for a tool to verify the quality of your code you might want to use Code Climate, especially if you are working on open source software with many other people. Although, if you use Javascript, be aware that it might tell you that everything you do is wrong.

Extracting JavaDoc documentation from source files using JavaParser

 

A lot of people are using JavaParser for many different goals. One of these is extracting documentation. In this short post we will see how you can print all the JavaDoc comments associated with classes or interfaces.

Code is available on GitHub: https://github.com/ftomassetti/javadoc-extractor

Getting all the Javadoc comments for classes

We are reusing DirExplorer, a supporting class presented in the introduction to JavaParser. This class permits processing a directory, recursively, parsing all the Java files contained there.

We can start by iterating over all the classes and finding the associated Javadoc comments.
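The original post shows the code here; a minimal sketch of the same idea for a single parsed file, assuming a recent JavaParser 3.x, looks like this (DirExplorer simply applies it to every .java file):

    CompilationUnit cu = JavaParser.parse(file);
    cu.getNodesByType(ClassOrInterfaceDeclaration.class).forEach(decl ->
        decl.getJavadocComment().ifPresent(javadoc ->
            System.out.println(decl.getNameAsString() + ":\n" + javadoc.getContent())));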

As you can see, getting the JavaDoc comments is fairly easy. It produces this result:

Getting all the Javadoc comments and finding the documented elements

In other cases we may want to start by collecting all the Javadoc comments and then finding the element which is commented. We can also do that easily with JavaParser:
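Again, the post lists the actual code at this point; the gist of it, sketched against JavaParser 3.x, is:

    cu.getAllContainedComments().stream()
        .filter(comment -> comment instanceof JavadocComment)
        .forEach(comment -> System.out.println(
            describe(comment.getCommentedNode().orElse(null))  // describe is the post's helper
                + " -> " + comment.getContent()));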

Here most of the code is about providing a description for the commented node (method describe).

Conclusions

Manipulating the AST and finding the Javadoc comments is quite easy. However, one missing feature is the possibility to extract the information contained in the Javadoc in a structured form. For example, you may want to get only the part of the Javadoc associated with a certain parameter or with the return value. JavaParser currently does not have this feature, but I am working on it and it should be merged in the next 1-2 weeks. If you want to follow the development take a look at issue 433.

Thanks for reading and happy parsing!

Implementing Lexical Preservation for JavaParser


Many users of JavaParser are asking us to implement lexical preservation, i.e., the ability to parse a Java source file, modify the AST and get back the modified Java source code keeping the original layout.

Currently this is not supported by JavaParser: you can parse all Java 8 code with JavaParser and get an AST, but the code you obtain back is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could it be implemented? I tried experimenting with a solution which does not require changing the JavaParser APIs or the internal implementation. It is just based on using the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How would I use that?

As a user you basically need to set up a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:
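In code, the intended usage is along these lines (a sketch of the experimental API described in this post; it later landed in JavaParser in roughly this shape):

    CompilationUnit cu = JavaParser.parse(source);
    LexicalPreservingPrinter.setup(cu);  // associate the original text and attach the observer
    // ... modify the AST as usual ...
    String modifiedSource = LexicalPreservingPrinter.print(cu);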

What does the setup method do?

Two things:

  1. associate the original code to the starting elements of the AST
  2. attach an observer to react to changes

Let’s see how this works.

Connect the parsed nodes to the original code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which each node has a Range which indicates its position in the original source code. For example, if we parse this:

We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

putPlaceholders finds the parts of the text corresponding to children and creates a ChildNodeTextElement for each of them. In practice, at the end, for each node we will have a list of strings (StringNodeTextElement) and placeholders to indicate the position of children in the text (ChildNodeTextElement).

For example for the class class A { int a;} we would have a template of three elements:

  1. StringNodeTextElement(“class A {“)
  2. ChildNodeTextElement(field a)
  3. StringNodeTextElement(“}”)

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

If I want to remove the field f I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we changed the text of the class we would also have to change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and templates exactly to avoid having to propagate changes up the hierarchy. A parent does not store the portion of text corresponding to a child but uses placeholders instead. For example for the class we will store something that conceptually looks like this:

So removing a child will just mean removing an element from the list of the parent, which will then look like this:

In other words when we remove an element we remove the corresponding ChildNodeTextElement from the text associated to the parent.

At this point we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

The corresponding list of elements will be:

If we want to remove the first parameter we would get:

Which would not be valid because of the extra comma. In this case we should know that the element is part of a comma-separated list and that we are removing an element from a list with more than one element, so we need to remove one comma.

Now, these kinds of changes depend on the role of a certain node: i.e., where that node is used. For example, when a node contained in a list is removed, the method concreteListChange of our observer is called:

Now, to understand what the modified NodeList represents we use reflection, in the method findNodeListName:

If the modified NodeList is the same one we get when calling getParameters on the parent of the list, then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words we have to specify that when deleting a node from a list of parameters we have to remove the preceding comma. When removing a node from a list of methods, instead, there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something which can be parsed correctly, it would not be enough to produce an acceptable result, because we are not considering adjustments to whitespace. We could have newlines, spaces or tabs added before or after the comma in order to obtain a readable layout. When removing an element and the comma the layout would change, and adjusting whitespace accordingly would not necessarily be trivial.

Adding nodes

When adding nodes we have mostly the same problems we saw when removing nodes: we may need to add separators instead of removing them. We also have another problem: we need to figure out where to insert the element.

We can have two different cases:

  • it is a single element (like the name of a class or the return type of a method)
  • it is an element part of a list (for example a method parameter)

In the first case we could have:

We need to specify that when adding a name to a Class Definition we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty, or if we want to insert the first element of a list, we need to use some delimiter. For example, in the case of a method definition, when adding a parameter we should add it after the left parenthesis. If instead we want to add an element in a position different from the first one, we need to find the preceding element in the list, insert a delimiter (if necessary) and then place the new element.

Also in this case we would need to adapt whitespace.

Changing nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases, however, we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases we would need specific support for each property of each node. Some more work for us.

Associate comments to nodes

Some time ago I started working on comment attribution: i.e., finding the node to which a comment refers. Why is this necessary? Because when we remove a Node we should also remove the corresponding comments, and when we move a Node around the associated comments should move with it. And again, we usually put some whitespace around the comments: that too needs to be handled.

Conclusions

Lexical preservation is a very important feature: we need to implement it in JavaParser and we need to get it right. However it is far from trivial to implement and there is no clear, easy to implement solution. There are different aspects to consider and heuristics to address the problem. For this reason we will need to collect a lot of feedback and be ready for a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Observers for AST nodes in JavaParser


We are getting closer to the first Release Candidate for JavaParser 3.0. One of the last features we added was support for observing changes to all nodes of the Abstract Syntax Tree. While I wrote the code for this feature I received precious feedback from Danny van Bruggen (a.k.a. Matozoid) and Cruz Maximilien. So I use “we” to refer to the JavaParser team.

What observers on AST nodes could be used for?

I think this is a very important feature for the ecosystem of JavaParser because it makes it easier to integrate with JavaParser by reacting to the changes made on the AST. Possible changes that can be observed are setting a new name for a class or adding a new field. Different tools could react to those changes in different ways. For example:

  • an editor could update its list of symbols, which could be used for things like auto-completion
  • some frameworks could regenerate source code to reflect the changes
  • validation could be performed to verify whether a change leads to an invalid AST
  • libraries like JavaSymbolSolver could recalculate the types for expressions

These are just a few ideas that come to mind but I think that most scenarios in which JavaParser is used could benefit from the possibility to react to changes.

The AstObserver

The JavaParser 3.0 AST is based on Nodes and NodeLists. A Node, like a TypeDeclaration for instance, can have different groups of children. When these groups can contain more than one node we use NodeLists. For example a TypeDeclaration can have multiple members (fields, methods, inner classes). So each TypeDeclaration has a NodeList to contain fields, one to contain methods, etc. Other children, like the name of a TypeDeclaration, are instead directly contained in a node.

We introduced a new interface named AstObserver. An AstObserver receives changes on the Nodes and NodeLists.
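A simplified sketch of the shape of that interface (the real one has a few more callbacks):

    public interface AstObserver {
        // a property of a node changed, e.g., the name of a class
        void propertyChange(Node observedNode, ObservableProperty property,
                            Object oldValue, Object newValue);
        // an element was added to or removed from a NodeList
        void listChange(NodeList observedNode, ListChangeType type,
                        int index, Node nodeAddedOrRemoved);
    }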

What to observe

Now that we have an AstObserver we need to decide which changes it should receive. We thought of three possible scenarios:

  1. Observing just one node, for example a ClassDeclaration. The observer would receive notifications for changes on that node (e.g., if the class changes name) but not for any of its descendants. For example, if a field of the class changes name the observer would not be notified
  2. Observing a node and all the descendants it has at the moment of registration of the observer. In this case if I register an observer for the ClassDeclaration I would be notified of changes to the class and all its fields and methods. If a new field is added and later modified I would not receive notifications for those changes
  3. Observing a node and all its descendants, both the ones existing at the moment of registration of the observer and the ones added later.

So a Node now has this method:
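The original post shows the signature at this point; it is essentially this:

    public void register(AstObserver observer, ObserverRegistrationMode mode)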

To distinguish these three cases we simply use an enum (ObserverRegistrationMode). Later you will see how we implemented the PropagatingAstObserver.

Implementing support for observers

If JavaParser were based on some meta-modeling framework like EMF this would be extremely simple to do. Given that this is not the case, I needed to add a notification call in all the setters of the AST classes (there are around 90 of those).

So when a setter is invoked on a certain node it notifies all the observers. Simple. Take for example setName in TypeDeclaration<T>:
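Conceptually, every setter now follows this pattern (a simplified sketch of the real code):

    public T setName(String name) {
        // notify the observers before applying the change
        notifyPropertyChange(ObservableProperty.NAME, this.name, name);
        this.name = name;
        return (T) this;
    }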

Given that we do not have a proper metamodel, we have no definitions for properties. Therefore we added a list of properties in an enum named ObservableProperty. In this way an observer can check which property was changed and decide how to react.

Internal hierarchy of observers

For performance reasons each node has its own list of observers. When we want to observe all the descendants of a node we simply add the same observer to all the Nodes and NodeLists in that subtree.

However this is not enough, because in some cases you may want to observe also the nodes which are added to the subtree after you have placed your observers. We do that by using a PropagatingAstObserver: an AstObserver that, when it sees a new node being attached to a node it is observing, starts to observe the new node as well. Simple, eh?

Observers in action

Let’s see how this works in practice:
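The original post has a full example here; a minimal sketch of the same flow, using the API names discussed above, could look like this:

    CompilationUnit cu = JavaParser.parse("class A { }");
    cu.register(new AstObserverAdapter() {
        @Override
        public void propertyChange(Node observedNode, ObservableProperty property,
                                   Object oldValue, Object newValue) {
            System.out.println(property + " changed: " + oldValue + " -> " + newValue);
        }
    }, ObserverRegistrationMode.SELF_PROPAGATING);

    cu.getNodesByType(ClassOrInterfaceDeclaration.class).get(0).setName("B");
    // prints something like: NAME changed: A -> B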

Conclusions

I am quite excited about this new feature because I think it enables more cool stuff to be done with JavaParser. I think our work as committers is to enable other people to do things we are not foreseeing right now. We should just act as enablers and then get out of the way.

I am really curious to see what people will build. By the way, do you know any project using JavaParser that you want to make known to us? Leave a comment or open an issue on GitHub, we are looking forward to hearing from you!

Interview with Erik Dietrich on Static Analysis and a data driven approach to refactoring


Erik Dietrich is a well known Software Architect with long experience in consulting. His blog (DaedTech) is a source of thoughtful insights on software development. In particular I was very interested in his usage of static analysis and code metrics in his consulting work.

I am myself a relatively young consultant and I thought it would be very valuable for me to hear his opinion on these topics and in general about consulting in the software industry.

So I wrote to him and he was kind enough to answer some questions.

Introduction

Can you tell us about the typical projects you work on?

It’s hard to summarize typical projects over the course of my entire career.  It’s been an interesting mix.  Over the last couple of years, though, it’s coalesced into what I think of as pure consulting.  I come in to help managers or executives assess their situations and make strategic decisions.

Can you tell us about a project which results exceeded your expectations?

There have been a couple of projects of late that I felt pretty good about.  Both of them involved large clients (a Fortune 10 company and a state government) to whom I furnished some strategic recommendations.  In both cases, I checked back from time to time and discovered that they had taken my recommendations, executed them, and expanded on them to great success.  What I found particularly flattering wasn’t just that they realized success with recommendations (that’s just what you’d expect from a good consultant), but how foundational they found the seeds I had planted there to be.  They had taken things that I had written up and made them core parts of what they were doing and their philosophy.

What makes your work difficult? What are the conditions which reduce the possibility of success?

When it comes to purely consultative engagements, the hardest thing is probably gathering and basing recommendations on good data.

If I’m writing code, it’s a question of working with compilers and test suites and looking at fairly objective definitions of success.  But when it comes to strategic advice like “should we evolve this codebase or start over” or “how could we make our development organization more efficient,” you get into pretty subjective territory.  So the hardest part is finding ways to measure and improve that aren’t just opinion and hand-waving.  They’re out there, but you really have to dig for them.

Human factor

When promoting changes to a software development team how important is the human factor? Do you face much resistance from developers?

It’s the most important thing.  In a consultative capacity, it’s not as though I can elbow someone out of the way, roll up my sleeves, and deliver the code myself.  The recommendations can only come to fruition if the team buys in and executes them.

As for facing resistance, that happens at times.  But, honestly, not as much as you might think.  Almost invariably, I’m called in when things aren’t great, and people aren’t enjoying themselves.  I try to get buy-in by helping people understand how the recommendations that I’m making will improve their own day to day lives.  For example, you can reduce friction with a recommendation to say, have a unit test suite, by explaining that it will lead to fewer late night bug hunts and frustrating rework sessions.

How do you win support from people? Developers employed by the company can get defensive when someone from outside comes to suggest changes to their way of working. In your recent post What to do when Your Colleague Creates Spaghetti Code you suggest using data to support your considerations. How well does it work for you? Do you always find data to support your proposals?

Using data, in and of itself, tends to be a way more to settle debates than to get support.  Don’t get me wrong — it’s extremely important.  But you need to balance that with persuasion and the articulation of the value proposition that I mentioned in the previous answer.  Use the data to build an unassailable case for your position, and then use persuasion and value proposition to make the solution as palatable as possible for all involved.

Static analysis

Are there any typical low hanging fruits for organizations which are starting to adopt static analysis? Anything that can be achieved with limited experience and effort?

I would say the easiest, quickest thing to do is get an analysis tool and set it up to run against the team’s codebase.  Those tool vendors know their stuff, and you’ll get some good recommendations out of the box.

Do you work with different language communities (e.g., JVM, .NET, Ruby, Python)? My impression is that on .NET there are better tools for static analysis, and maybe more maturity in that community. Is that your feeling also?

With static analysis, these days I work pretty exclusively in the .NET and Java communities, though I may expand that situationally, if the need arises.  As for comparing the static analysis offerings across ecosystems, I’ve actually never considered that.  I don’t have enough data for comparison to comment 🙂

In the Java community there are a few tools for static analysis, but I think their default configuration reports way too many irrelevant warnings and this contributes to the poor reputation of static analysis in the Java world. What do you think about that?

I don’t know that this is unique to Java or to any particular tool.  Just about every static analysis tool I’ve ever seen hits you with far more than you probably want right out of the gate.  I suspect the idea is that it’s better to showcase the full feature offering and let users turn warnings off than to give the impression that they don’t do as much analysis.  

A common problem of open source software is the lack of documentation. In your post Reviewing Strangers’ Code on Github you talk about the challenge of confronting unfamiliar GitHub projects. Do you think static analysis tools could be used to get an overview of a codebase and familiarize faster with the code?

Oh, no doubt about it.  I run custom analysis on new codebases just to get a feel for them and gather basic statistics.  What’s the length of the average method?  How systematically do they run afoul of so-called best practices?  How do the dependencies cluster?  I could go on and on.  I feel like I’d be looking at a codebase blind without the ability to analyze it this way and compare it against my experience with all of the previous ones I’ve examined.

The role of developers

Your next book, Developer Hegemony, is about a new organization of labor and the role developers could occupy in it. Can you tell us something about it? And do you think this new organization could be advantageous also for other creative jobs?

In the book, I talk extensively about the problems with and vestigial nature of the corporation in the modern world of knowledge workers.  Corporations are organized like giant pyramids, in which the people at the bottom do the delivery work and are managed by layer upon layer of people above them.  To make matters worse for software people, a lot of these organizations view their labor as an overhead cost.

I see us moving toward a world where software developers move away from working for big, product companies that make things other than software.  I see us moving out from having our salaries depressed under the weight of all of those layers of corporate overhead.  A decent, if imperfect, analog, might be to doctors and lawyers.  And, yes, I think this absolutely can apply to creative fields as well.

To add on to your post How to Get that First Programming Job: what is something that the average programmer is not aware of that could improve their work?

One of the best pieces of advice I think I can offer is the following, though it’s not exactly technical.

If you’re working and you find yourself thinking/muttering, “there’s GOT to be a better way to do this,” stop!  You are almost certainly right, and there is a better way.  Go find it.  And don’t automate something yourself without checking for its existence.

One of the most common mistakes I see in relatively inexperienced programmers is to take the newfound power of automation and apply it everywhere, indiscriminately.  They solve already-solved problems without stopping to think about it (reinventing wheels) or they brute force their way through things (e.g. copy and paste programming).  So, long story short, be sensitive to solutions that may already exist or ways to solve problems that may not always involve writing code.

You are clearly an accomplished professional. Could you share one piece of advice for advancing the career of someone in the software industry?

One thing that drives me a little nuts about our industry is something I described in this post.  We line our resumes with alphabet soups of programming languages, frameworks, and protocols that we know, but with little, if any mention of what sorts of business or consumer problems we know how to solve.  It’s “I have 8 years of C++” instead of “I can help your team cut down on the kinds of memory leaks that lead to system crashes.”

It will improve your career and focus your work if you start to think less in terms of years and tools and more in terms of what value you can provide and for whom.

Conclusions

I am very thankful to Erik for sharing his opinions. I think there are many interesting takeaways. Two ideas in particular made me think:

  • The changing role of developers: maybe some will still prefer the security of corporate jobs, but there are undoubtedly many opportunities for those who are willing to go independent.
  • We as professionals have to raise the bar. One way to do that is to start taking our job more seriously by basing it on data, when that makes sense. Data cannot yet provide all the answers, but it can help and it should not be ignored.

Do you agree?

Generate diagrams from C# source code using Roslyn

Representation of the world inspired by Matrix

The code for this post is on Github

Beyond the source code

Last week we saw how to use Roslyn to rewrite source code to your liking. That’s all well and good, but it’s not the only thing you can do when you have a compiler open and ready to do your bidding. Another possibility is to leverage the knowledge the compiler has to support other tools that you use as a programmer, or that are needed by co-workers to simplify their job.

There are two great advantages to using the source code to support everything else:

  1. the source code becomes the truth, from which everything follows
  2. you can integrate the support for these tools into the continuous integration processes that you already use

You may say that point number 1 is already true in any case. But, even for open source software, how many people are going to wade through hundreds of files to understand how to use the damn thing? The reality is that if there is no documentation, it doesn’t exist for most people. Time is too valuable to lose it behind other people’s code. And this doesn’t even count the people that don’t understand code, but need to know the features of the software.

Roslyn doesn’t help just programmers

No, it’s true, Roslyn will not write documentation on its own, but it can be used to make writing it easier and even to manage other structured information. In particular today we are talking about UML diagrams. The traditional way to create them is by hand, which makes them prone to becoming obsolete, or to use programs that reverse engineer the code itself, which is costly and not easily adaptable. Roslyn, instead, allows you to easily create diagrams, at least certain kinds of diagrams such as class diagrams. Another advantage is that by understanding the source code programmatically you can hide or show information depending on what the reader needs. For instance, you can hide private properties and methods that the user of the library doesn’t need to know about.

The plan

In short, the idea is to create text files that are compatible with PlantUML for every class in our source code and then to use PlantUML to create the actual diagrams. In real life it would then be trivial to create the diagrams programmatically, thanks to the command line, and upload the images wherever you want. Generating class diagrams by leveraging the compiler is easy because the compiler needs to understand the source code, and so all the information is readily available to us. In fact, I didn’t even need to write much code, since there is already a small library that does it: https://github.com/pierre3/PlantUmlClassDiagramGenerator [1]. Hey, we are programmers, we are lazy, we are smart enough to leverage existing resources.

We just need to understand how it works. It’s less than 300 lines of code, including comments, so we can delve right in.

Generating the diagram

See, I wasn’t kidding, it’s easy. All the information is readily available from the Roslyn parser; we just need to take it. GetMembersModifierText (not shown) is simply a switch that associates every modifier keyword with its respective PlantUML symbol; SyntaxKind.PublicKeyword, for example, becomes “+”. Of course you need to learn the terminology, such as SyntaxKind or the names of the several *Syntax classes, but that isn’t really hard. The only thing slightly harder than a simple “copy value and write a string” is related to properties, which are what the developers of .NET call “syntactic sugar”, that is to say a shortcut for programmers that the compiler transforms into real methods. Since properties are not a standard feature of many languages, you have to translate them for UML.
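The mapping itself boils down to something like this sketch (ours; the library’s actual GetMembersModifierText may differ in its details):

    // Map C# modifier keywords to the corresponding PlantUML visibility symbols
    private static string GetModifierSymbol(SyntaxToken token)
    {
        switch (token.Kind())
        {
            case SyntaxKind.PublicKeyword: return "+";
            case SyntaxKind.PrivateKeyword: return "-";
            case SyntaxKind.ProtectedKeyword: return "#";
            case SyntaxKind.InternalKeyword: return "~";
            default: return "";
        }
    }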

The main method

I don’t show the whole main method because it’s your typical console app: very simple. Since ClassDiagramGenerator is nothing more than a CSharpSyntaxWalker, we just need to gather the text, parse it, and give the order to visit the tree with our walker. The only things to notice are the opening and closing PlantUML notation lines that we add to our generated files. Now you can use PlantUML to create the diagrams.
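The core of it is essentially this (our sketch; file handling and error checking omitted):

    // Parse the C# source and let the walker visit the whole tree
    var tree = CSharpSyntaxTree.ParseText(File.ReadAllText(inputPath));
    writer.WriteLine("@startuml");
    new ClassDiagramGenerator(writer).Visit(tree.GetRoot());
    writer.WriteLine("@enduml");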

Conclusion

Class diagram of ClassDiagramGenerator

Class Diagram generated by PlantUML

Using the source code as a source of intelligence about the code itself is not exactly a free lunch, but it’s almost there. You can write code and then automatically have it translated into a form that co-workers can understand, be they other programmers or something else. And you can integrate this information into the practices and tools that you already use: it’s a win-win. It’s true that in real life there is probably more setup involved, but the advantages are clear. The information is already there, and now Roslyn makes it easily accessible, so why not use it?


[1] I just added a few lines to include the relation between base and derived classes

Resolve method calls in Java code using the JavaSymbolSolver



Why did I create the java-symbol-solver?

A few years ago I started using JavaParser, and then I started contributing. After a while I realized that many operations we want to perform on Java code cannot be done just by using the Abstract Syntax Tree produced by a parser; we also need to resolve types, symbols and method calls. For this reason I created the JavaSymbolSolver. It is now being used by Coati to produce static analysis tools.

One thing that is missing is documentation: people open issues on JavaParser asking how to answer a certain question and the answer is often “for this you need to use JavaSymbolSolver”. Starting from these issues I will show a few examples.

Inspired by this issue I will show how to produce a list of all calls to a specific method.


How can we resolve method calls in Java using the java-symbol-solver?

It can be done in two steps:

  1. You use JavaParser on the source code to build your ASTs
  2. You call JavaSymbolSolver on the nodes of the ASTs representing method calls and get the answer

We are going to write a short example. At the end we will get an application that, given a source file, will produce this:

We are going to use Kotlin and Gradle. Our build file looks like this:

Building an AST is quite easy: you simply call this method:

What the hell is a Type Solver? It is the object which knows where to look for classes. When processing source code you will typically have references to code that is not yet compiled, but is just present in other source files. You could also use classes contained in JARs or classes from the Java standard libraries. You just have to tell your TypeSolver where to look for classes and it will figure it out.

In our example we will parse the source code of the JavaParser project itself (how meta?!). This project has source code in two different directories, one for proper source code and one for code generated by JavaCC (you can ignore what JavaCC is, it is not relevant here). We of course also use classes from the Java standard libraries. This is what our TypeSolver looks like:
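The post shows the actual Kotlin snippet here; a sketch using the type solvers the library provided at the time, with illustrative paths, would be:

    // Combine solvers for the JDK, the hand-written sources and the generated sources
    val typeSolver = CombinedTypeSolver(
            JreTypeSolver(),
            JavaParserTypeSolver(File("javaparser-core/src/main/java")),
            JavaParserTypeSolver(File("javaparser-core/generated-sources")))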

This is where we invoke JavaParserFacade, one of the classes provided by JavaSymbolSolver. We just take one method call at a time and pass it to the solve method of the JavaParserFacade. We get back a MethodUsage (which is basically a method declaration plus the values of the parameter types for that specific invocation). From it we get the MethodDeclaration and we print the qualified signature, i.e., the qualified name of the class followed by the signature of the method. This is how we get the final output:
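The loop itself, sketched in Kotlin against the java-symbol-solver API of the time, looks roughly like this:

    cu.getNodesByType(MethodCallExpr::class.java).forEach { call ->
        // resolve each call to the declaration it refers to
        val methodUsage = JavaParserFacade.get(typeSolver).solveMethodAsUsage(call)
        println(methodUsage.declaration.qualifiedSignature)
    }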

There is some plumbing to do, but basically JavaSymbolSolver does all the heavy work behind the scenes. Once you have a node of the AST you can throw it at the JavaParserFacade class and it will give you back all the information you may need: it will find the corresponding types, fields, methods, etc.

The problem is… we need more documentation and feedback from users. I hope some of you will start using JavaSymbolSolver and tell us how we can improve it.

Also, last week the JavaSymbolSolver was moved under the JavaParser organization. This means that in the future we will work more closely with the JavaParser project.

The code is available on GitHub: java-symbol-solver-examples