How to build a symbol solver for Java, in Clojure

Last week we have seen how to build models of Java code from source and JAR files. Now we want to be able to solve symbols: in practice we want to link the symbol references in our source code to symbol declarations either in Java source files or jar files. This process is called symbol resolution.

Symbol References and Symbol Declarations

As first thing we have to consider that the same symbol could point to different declarations in different contexts. Consider this absurd piece of code:

package me.tomassetti;

public class Example {

    private int a;
    
    void foo(long a){
        a += 1;
        for (int i=0; i<10; i++){
            new Object(){
                char a;
                
                void bar(long l){
                    long a = 0;
                    a += 2;
                }
            }.bar(a);
            
        }
    }
    
}

It contains several declaration of the symbol a:

  • at line 5 it is declared as field of type int
  • at line 7 it is declared as a parameter of type long
  • at line 11 it is declared as a field of anonymous class, and it has type char
  • at line 14 it is declared as a local variable of type long

We have also two references to the symbol a:

  • at line 8, when we increment a by 1
  • at line 15, when we increment a by 2

Now to solve symbols we can figure out which symbol references refer to which symbol declarations. So that we can do things like understanding the type of a symbol, and so which operations can be performed on a symbol.

Scoping Rules

The principle to solve symbol is easy (but the implementation could be… tricky): given a symbol reference we look for the closest corresponding declaration. Until we find one we keep moving further away from the reference, going up in the AST.

In our example we would match the reference on line 15 with the declaration on line 14. We would also mach the reference on line 8 with the declaration on line 7. Simple, eh?

Consider this other example:

package me.tomassetti;

class A {
    public void foo(){
        System.out.println("I am the external A");
    }    
}

public class Example {
    
    A a;

    class A {
        public void foo(){
            System.out.println("I am the internal A");
        }
    }

    void foo(A a){
        final int A = 10;
        new A().foo();
    }

}

Now we have different definitions of A, in some cases as a Class (lines 3 and 13), in others as a variables (line 20). The reference on line 21 (new A().foo();) matches the declaration on line 13, not the declaration on line 20, because we can use only type declarations to match symbol references inside a new object instantiation statement. Things starts to become not so easy…

There are other things to consider:

  • import statements: importing com.foo.A makes references to A be resolved with the declarations of the class A in package com.foo
  • classes in the same package which can be referred by the simple name, while for the others we have to use the fully qualified name
  • classes in the default package cannot be referred outside the default package
  • fields and methods inherited should be considered (for methods we have to consider overloading and overriding without confusing them)
  • when matching methods we have to consider the compatibility of the parameters (not easy at all)
  • etc. etc. etc.

While the principle is simple, there are tons of rules and exceptions and things to consider to properly resolve the symbols. The nice thing is: we can start easily and then improve the solution as we go on.

Designing Our Symbol Solver

One note: in this post I will explain the approach I am currently using to build my symbol solver for effectivejava. I am not saying this is a perfect solution and I am sure I am going to improve this solution over time. However this approach seems to cover a large number of cases and I think it is not an awful solution.

A simple reference to a name (whatever it indicates a variable, a parameter, a field, etc.) is represented by a NameExpr in the AST produced by JavaParser. We start by creating a function that takes a NameExpr node and return the corresponding declaration, if any could be found. The function solveNameExpr basically just invoke the function solveSymbol passing three parameters:

  1. the AST node itself: it represents the scope in which to solve the symbol
  2. nil: this represent extra context information, in this case it is not needed
  3. the name to be resolved: well, it should be self-explanatory
(defn solveNameExpr
  "given an instance of com.github.javaparser.ast.expr.NameExpr returns the declaration it refers to,
   if it can be found, nil otherwise"
  [nameExpr]
  (let [name (.getName nameExpr)]
    (solveSymbol nameExpr nil name)))

We start by declaring a protocol (which is somehow similar to a Java interface) for the concept of scope:

(defprotocol Scope
  ; for example in a BlockStmt containing statements [a b c d e], when solving symbols in the context of c
  ; it will contains only statements preceeding it [a b]
  (solveSymbol [this context nameToSolve])
  ; solveClass solve on a subset of the elements of solveSymbol
  (solveClass [this context nameToSolve]))

The basic idea is that in each scope we try to look for declarations corresponding to the symbol to be solved, if we do not find them we delegate to the parent scope. We specify a default implementation for the AST node (com.github.javaparser.ast.Node) which just delegate to the parent. For some node types we provide a specific implementation.

(extend-protocol Scope
  com.github.javaparser.ast.Node
  (solveSymbol [this context nameToSolve]
    (solveSymbol (.getParentNode this) this nameToSolve))
  (solveClass [this context nameToSolve]
    (solveClass (.getParentNode this) this nameToSolve)))

For BlockStmt we look among the instructions preceding the one examined for variable declarations. The current examine statement is passed as the context. If you are interested for the functions used by this one just look at the code of effectivejava: it is on GitHub :)

(extend-protocol Scope
  BlockStmt
  (solveSymbol [this context nameToSolve]
    (let [elementsToConsider (if (nil? context) (.getStmts this) (preceedingChildren (.getStmts this) context))
          decls (map (partial declare-symbol? nameToSolve) elementsToConsider)]
      (or (first decls) (solveSymbol (.getParentNode this) this nameToSolve)))))

For a MethodDeclaration we look among the parameters

(defn solve-among-parameters [method nameToSolve]
  (let [parameters (.getParameters method)
        matchingParameters (filter (fn [p] (= nameToSolve (.getName (.getId p)))) parameters)]
    (first matchingParameters)))

(extend-protocol Scope
  com.github.javaparser.ast.body.MethodDeclaration
  (solveSymbol [this context nameToSolve]
    (or (solve-among-parameters this nameToSolve)
      (solveSymbol (.getParentNode this) nil nameToSolve)))
  (solveClass [this context nameToSolve]
    (solveClass (.getParentNode this) nil nameToSolve)))

For ClassOrInterfaceDeclaration we look among the fields of the class or interface (classes could have static fields).

(defn solveAmongVariableDeclarator
  [nameToSolve variableDeclarator]
  (let [id (.getId variableDeclarator)]
    (when (= nameToSolve (.getName id))
      id)))

(defn- solveAmongFieldDeclaration
  "Consider one single com.github.javaparser.ast.body.FieldDeclaration, which corresponds to possibly multiple fields"
  [fieldDeclaration nameToSolve]
  (let [variables (.getVariables fieldDeclaration)
        solvedSymbols (map (partial solveAmongVariableDeclarator nameToSolve) variables)
        solvedSymbols' (remove nil? solvedSymbols)]
    (first solvedSymbols')))

(defn- solveAmongDeclaredFields [this nameToSolve]
  (let [members (.getMembers this)
        declaredFields (filter (partial instance? com.github.javaparser.ast.body.FieldDeclaration) members)
        solvedSymbols (map (fn [c] (solveAmongFieldDeclaration c nameToSolve)) declaredFields)
        solvedSymbols' (remove nil? solvedSymbols)]
    (first solvedSymbols')))

(extend-protocol Scope
  com.github.javaparser.ast.body.ClassOrInterfaceDeclaration
  (solveSymbol [this context nameToSolve]
    (let [amongDeclaredFields (solveAmongDeclaredFields this nameToSolve)]
      (if (and (nil? amongDeclaredFields) (not (.isInterface this)) (not (empty? (.getExtends this))))
        (let [superclass (first (.getExtends this))
              superclassName (.getName superclass)
              superclassDecl (solveClass this this superclassName)]
          (if (nil? superclassDecl)
            (throw (RuntimeException. (str "Superclass not solved: " superclassName)))
            (let [inheritedFields (allFields superclassDecl)
                  solvedSymbols'' (filter (fn [f] (= nameToSolve (fieldName f))) inheritedFields)]
              (first solvedSymbols''))))
        amongDeclaredFields)))
  (solveClass [this context nameToSolve]
    (solveClass (.getParentNode this) nil nameToSolve)))

For CompilationUnit we look for other classes in the same package (both using their simple or qualified names), we consider the import statements and we look for the types declared in the file.

(defn qNameToSimpleName [qualifiedName]
  (last (clojure.string/split qualifiedName #".")))

(defn importQName [importDecl]
  (str (.getName importDecl)))

(defn isImportMatchingSimpleName? [simpleName importDecl]
  (= simpleName (qNameToSimpleName (importQName importDecl))))

(defn solveImportedClass 
  "Try to solve the classname by looking among the imported classes"
  [cu nameToSolve]
  (let [imports (.getImports cu)
        relevantImports (filter (partial isImportMatchingSimpleName? nameToSolve) imports)
        importNames (map (fn [i] (.getName (.getName i))) imports)
        correspondingClasses (map typeSolver importNames)]
    (first correspondingClasses)))

(extend-protocol Scope
  com.github.javaparser.ast.CompilationUnit
  (solveClass [this context nameToSolve]
    (let [typesInCu (topLevelTypes this)
          ; match types in cu using their simple name
          compatibleTypes (filter (fn [t] (= nameToSolve (getName t))) typesInCu)
          ; match types in cu using their qualified name
          compatibleTypes' (filter (fn [t] (= nameToSolve (getQName t))) typesInCu)]
      (or (first compatibleTypes)
          (first compatibleTypes')
          (solveImportedClass this nameToSolve)
          (solveClassInPackage (getClassPackage this) nameToSolve)
          ; we solve in nil context: it means look for absolute names
          (solveClass nil nil nameToSolve)))))

If we do not manage to solve a symbol in the scope of the Compilation Unit there is no parent to delegate, so we use the nil scope which represents the absence of scope. In that case only absolute names can be solved, like the canonical names of classes. We do that using the typeSolver function. The typeSolver need to know which classpath to use and it basically look up for classes among source files directories and JAR. Also in this case feel free to dig into the code of effectivejava.

(extend-protocol Scope
  nil
  (solveClass [this context nameToSolve] (typeSolver nameToSolve)))

Conclusions

I think Clojure is great to build solutions incrementally: we can implement the different methods of the protocols as we move forward. Build simple tests and improve them one piece at the time.

We have built this simple solution iteratively and until now it is working fine for our goals. We will keep write more test cases and we could have to refactor a thing or two but I think we are heading in the right general direction.

In the future could be useful to extract this symbol resolver in a separate library, to be used with JavaParser to perform static analysis and refactoring.