Posts

How to build a symbol solver for Java, in Clojure

Last week we have seen how to build models of Java code from source and JAR files. Now we want to be able to solve symbols: in practice we want to link the symbol references in our source code to symbol declarations either in Java source files or jar files.

Symbol references and symbol declarations

As first thing we have to consider that the same symbol could point to different declarations in different contexts. Consider this absurd piece of code:

It contains several declaration of the symbol a:

  • at line 5 it is declared as field of type int
  • at line 7 it is declared as a parameter of type long
  • at line 11 it is declared as a field of anonymous class, and it has type char
  • at line 14 it is declared as a local variable of type long

We have also two references to the symbol a:

  • at line 8, when we increment a by 1
  • at line 15, when we increment a by 2

Now to solve symbols we can figure out which symbol references refer to which symbol declarations. So that we can do things like understanding the type of a symbol, and so which operations can be performed on a symbol.

Scoping rules

The principle to solve symbol is easy (but the implementation could be… tricky): given a symbol reference we look for the closest corresponding declaration. Until we find one we keep moving further away from the reference, going up in the AST.

In our example we would match the reference on line 15 with the declaration on line 14. We would also mach the reference on line 8 with the declaration on line 7. Simple, eh?

Consider this other example:

Now we have different definitions of A, in some cases as a Class (lines 3 and 13), in others as a variables (line 20). The reference on line 21 (new A().foo();) matches the declaration on line 13, not the declaration on line 20, because we can use only type declarations to match symbol references inside a new object instantiation statement. Things starts to become not so easy…

There are other things to consider:

  • import statements: importing com.foo.A makes references to A be resolved with the declarations of the class A in package com.foo
  • classes in the same package which can be referred by the simple name, while for the others we have to use the fully qualified name
  • classes in the default package cannot be referred outside the default package
  • fields and methods inherited should be considered (for methods we have to consider overloading and overriding without confusing them)
  • when matching methods we have to consider the compatibility of the parameters (not easy at all)
  • etc. etc. etc.

While the principle is simple, there are tons of rules and exceptions and things to consider to properly resolve the symbols. The nice thing is: we can start easily and then improve the solution as we go on.

Design our symbol solver

One note: in this post I will explain the approach I am currently using to build my symbol solver for effectivejava. I am not saying this is a perfect solution and I am sure I am going to improve this solution over time. However this approach seems to cover a large number of cases and I think it is not an awful solution.

A simple reference to a name (whatever it indicates a variable, a parameter, a field, etc.) is represented by a NameExpr in the AST produced by JavaParser. We start by creating a function that takes a NameExpr node and return the corresponding declaration, if any could be found. The function solveNameExpr basically just invoke the function solveSymbol passing three parameters:

  1. the AST node itself: it represents the scope in which to solve the symbol
  2. nil: this represent extra context information, in this case it is not needed
  3. the name to be resolved: well, it should be self-explanatory

We start by declaring a protocol (which is somehow similar to a Java interface) for the concept of scope:

The basic idea is that in each scope we try to look for declarations corresponding to the symbol to be solved, if we do not find them we delegate to the parent scope. We specify a default implementation for the AST node (com.github.javaparser.ast.Node) which just delegate to the parent. For some node types we provide a specific implementation.

For BlockStmt we look among the instructions preceding the one examined for variable declarations. The current examine statement is passed as the context. If you are interested for the functions used by this one just look at the code of effectivejava: it is on GitHub 🙂

For a MethodDeclaration we look among the parameters

For ClassOrInterfaceDeclaration we look among the fields of the class or interface (classes could have static fields).

For CompilationUnit we look for other classes in the same package (both using their simple or qualified names), we consider the import statements and we look for the types declared in the file.

If we do not manage to solve a symbol in the scope of the Compilation Unit there is no parent to delegate, so we use the nil scope which represents the absence of scope. In that case only absolute names can be solved, like the canonical names of classes. We do that using the typeSolver function. The typeSolver need to know which classpath to use and it basically look up for classes among source files directories and JAR. Also in this case feel free to dig into the code of effectivejava.

Conclusions

I think Clojure is great to build solutions incrementally: we can implement the different methods of the protocols as we move forward. Build simple tests and improve them one piece at the time.

We have built this simple solution iteratively and until now it is working fine for our goals. We will keep write more test cases and we could have to refactor a thing or two but I think we are heading in the right general direction.

In the future could be useful to extract this symbol resolver in a separate library, to be used with JavaParser to perform static analysis and refactoring.

Building models of Java code from source and JAR files

Recently I spent some time working on effectivejava, which is on its way to reach 300 stars on GitHub (feel free to help reaching the target :D).

Effectivejava is a tool to run queries on your Java code. It is based on another project I contribute to, javaparser. Javaparser takes as input Java source code and produce an Abstract Syntax Tree (AST). We can perform simple analysis directly on the AST. For example we can find out which methods take more than 5 parameters (you may want to refactor them…). However more sophisticate analysis require to resolve symbols.

In this post I describe how I am working on implementing symbol resolution considering both source code and JAR files. In this first post we will build an homogenous view on both source code and JAR files, in the next post we will solve these symbols exploring these models.

Code is available on GitHub, on the branch symbolsolver of effectivejavaUpdate: it has been merged into master

Resolving symbols

For which reason do we need to resolve symbols?

Given this code:

we need to figure out what foo, method, a, b, c are. Are they references to local variables? To arguments of the current method? To fields declared in the class? To fields inherited from a super-class class? What type they have? To answer this question we need to be able to resolve symbols.

To solve symbols we can navigate the AST and apply scoping rules. For example we may look if a certain symbol corresponds to a local variable. If not we can look among the parameters of that method. If we cannot still find a correspondence we need to look among the fields declared by the class and if have still no luck we may have to luck among the fields inherited by this class.

Now, scoping rules are much more complex than the bunch of little steps I just described. It is especially complex to resolve methods, because of overloading. However one key point is that to solve symbols we need to look among imported classes, extended classes and external classes in general which may be part of the project or be imported as dependencies.

So to solve symbol we need to look for corresponding declarations:

  1. on the ASTs of the classes of the project we are examining
  2. among the classes contained in the JAR files used as dependencies

Javaparser provides to us the ASTs we need for the first point, for the second one we are going to build a model of classes in JAR files using Javassist.

Build a model of classes contained in JAR files

Our symbol solver should look among a list of entries (our classpath entries) in order, and see if a certain class can be found there. To do so, we would need to open the JAR files and look among its contents. For performance reasons we could want to build a cache of elements contained in a given JAR.

How we start? First of all we read the entries listed in the jar (getElementEntriesInJar). In this way we get a list of ClasspathElements. Then we focus only on the .class files (getClassesEntriesInJar). This method should be invoked once per jar and result should be cached. Given a list of ClasspathElement we can then search for the element corresponding to a given name (e.g., com.github.javaparser.ASTParser). For doing that we can use the method findEntry. Or we can also load that class by using http://www.csg.ci.i.u-tokyo.ac.jp/~chiba/javassist/“>Javassist: this what the method findType does, returning an instance of CtClass.

Why not just using reflection?

Someone could think that it would be easier to just add the dependencies in the classpath of effectivejava and then use the normal classloader and reflection to obtain the needed information. While it would be easier there are some drawbacks:

  1. when a class is loaded the static initializers are executed and it could be not what we want
  2. it could possibly conflict with real dependencies of effective java.
  3. Finally not all the information available in the bytecode are easily retrievable through the reflection API

Solve symbols: combining heterogenous models

Ok now, to solve symbols we will have to implement the scoping rules and navigate both the ASTs obtained from Javaparser and the CtClasses obtained from Javassist. We will see the details on a future blog post but we need to consider one other aspect first. Consider this code:

In this case we suppose to have a JAR containing the class com.github.someproject.ClassInJar which declared the field myInheritedField. When we will solve symbols we will have these mappings:

  • myDeclaredField will be resolved to an instance of com.github.javaparser.ast.body.VariableDeclarator (in Javaparser we have nodes of type FieldDeclaration which maps to constructs such as private int a, b, c;. VariableDeclarators instead point to the single fields such as a, b or c)
  • myInheritedField will be resolved to an instance of javassist.CtField

The problem is that we want to be able to treat them in an homogenous way: we should be able to treat each field using the same functions, irrespectively of their origin (a JAR file or a Java source file). To do so we are going to build common views using clojure protocols. I tend to view clojure’s protocols as the equivalent of Java’s interfaces.

While in Java we would have to build adapters, implementing the new interface (FieldDecl) and wrapping the existing classes (VariableDeclarator, CtField) in Clojure we can just say that those classes extend the protocol and we are done.

Now we are able to treat each field as fieldDecl and we can invoke on each field fieldName. We still need to figure out how to solve the type of the field. For doing that we need to look into symbol resolution and in particular into type resolution, which is our next step.

Conclusions

Building model of Java code is something that has fascinated me for a while. As part of my master thesis I wrote a DSL which interacted with existing Java code (I had also editors, written as Eclipse plugins and code generators: it was kind of cool). In the DSL was possible to specify references to Java classes, using both source code and JAR files. I was using EMF and probably I adopted JaMoPP and Javassist for that project.

Later I built CodeModels a library to analyze ASTs of several languages (Java, JavaScript, Ruby, Html, etc.).

I think that building tools to manipulate code is a very interesting form of metaprogramming, and it should be in the toolbox of each developer. I plan to spend some more time playing with effectivejava. Fun times are coming.

Feel free to share comments and suggestions!

Joy of Clojure

They picked up for the second time a sentence of mine for a book cover. This time is Joy of Clojure. I think the language and the book are really worthy to take a look at them.

http://www.manning.com/fogus/