Implementing Lexical Preservation for JavaParser

Many users of JavaParser are asking to implement lexical preservation, i.e., the ability of parsing a Java source file, modifying the AST and get back the modified Java source code keeping the original layout.

Currently this is not supported by JavaParser: you can parse all Java 8 code with JavaParser, get an AST but then the code obtained is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could this be implemented? I tried experimenting with a solution which does not require to change the JavaParser APIs or the internal implementation. It is just based on using the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How would I use that?

As a user you basically need to setup a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:

val originalCode = "class A { int a; }"
// you have to parse the code using JavaParser, as usual
CompilationUnit cu = JavaParser.parse(originalCode);
// now you attach a LexicalPreservingPrinter to the AST
val lpp = setup(cu, originalCode);
// do some changes
cu.getClassByName("A").getFieldByName("a").getVariable(0).setInit("10");
// get the transformed code
val transformedCode = lpp.print(cu);

What the setup method does?

Two things:

associate the original code to the starting elements of the AST
attach an observer to react to changes

public static LexicalPreservingPrinter setup(CompilationUnit cu, String code) {
    LexicalPreservingPrinter lpp = new LexicalPreservingPrinter();
    AstObserver observer = createObserver(lpp);
    cu.registerForSubtree(observer);
    cu.onSubStree(node -> lpp.registerText(node, code));
    return lpp;
}

Let’s see how this work.

Connect the parsed nodes to the original code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which is node has a Range which indicates its position in the original source code. For example, if we parse this:

package a.b.c;

class A {
   void foo() {
   }
}

We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

    public void registerText(Node node, String documentCode) {
        if (node instanceof CompilationUnit) {
            node.setRange(node.getRange().withBegin(Position.HOME));
        }
        // we get all the text corresponding to the node
        String text = getRangeFromDocument(node.getRange(), documentCode);
        // we transform that string in a sort of template, with placeholders to indicate the position of the children
        NodeText nodeText = putPlaceholders(documentCode, node.getRange(), text, new ArrayList<>(node.getChildNodes()));
        // we save this template
        textForNodes.put(node, nodeText);
    }

putPlaceholders find the part of text corresponding to children and create ChildNodeTextElement for those. In practice at the end for each node we will have a list of strings (StringNodeTextElement) and placeholders to indicate the position of children in the text (ChildNodeTextElement)

    private NodeText putPlaceholders(String documentCode, Range range, String text, List<Node> children) {

        children.sort((o1, o2) -> o1.getRange().begin.compareTo(o2.getRange().begin));

        NodeText nodeText = new NodeText(this);

        int start = findIndex(documentCode, range.begin);
        int caret = start;
        for (Node child : children) {
            int childStartIndex = findIndex(documentCode, child.getBegin());
            int fromStart = childStartIndex - caret;
            if (fromStart > 0) {
                nodeText.addElement(new StringNodeTextElement(text.substring(caret - start, childStartIndex - start)));
                caret += fromStart;
            }
            nodeText.addElement(new ChildNodeTextElement(this, child));
            int lengthOfOriginalCode = getRangeFromDocument(child.getRange(), documentCode).length();
            caret += lengthOfOriginalCode;
        }
        // last string
        int endOfNode = findIndex(documentCode, range.end) + 1;
        if (caret < endOfNode) {
            nodeText.addElement(new StringNodeTextElement(text.substring(caret - start)));
        }

        return nodeText;
    }

For example for the class class A { int a;} we would have a template of three elements:

StringNodeTextElement(“class A {“)
ChildNodeTextElement(field a)
StringNodeTextElement(“}”)

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

class A {
   int i;
   float f;
   char c;
}

If I want to remove the field f I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we change the text of the class we should also change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and template exactly to avoid having to propagate changes up in the hierarchy. a parent does not store the portion of text corresponding to the child but is uses placeholders instead. For example for the class we will store something that conceptually looks like this:

["class A {n   ", field(0), "n    ", field(1), "n    ", field(2), "}"]

So removing a child will just mean removing an element from the list of the parent which will look like this:

["class A {n   ", field(0), "n    ", "n    ", field(2), "}"]

In other words when we remove an element we remove the corresponding ChildNodeTextElement from the text associated to the parent.

At this point we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

int foo(int p1, int p2, int p3) {
   return 0;
}

The corresponding list of element will be:

[return(0), " foo(", param(0), ", " param(1), ", " param(2), ")", body(0)]

If we want to remove the first parameter we would get:

[return(0), " foo(", ", " param(1), ", " param(2), ")", body(0)]

Which would not be valid because of the extra comma. In this case we should know that the element is part of comma-separated list, we are removing an element from a list with more than one element so we need to remove one comma.

Now, these kind of changes depends to the role of a certain node: i.e., where that node is used. For example where a node contained in a list is removed the method concreteListChange of our observer is called:

@Override
public void concreteListChange(NodeList observedNode, ListChangeType type, int index, Node nodeAddedOrRemoved) {
   ...our magic...
}

Now, to understand what the modified NodeList represents we use reflection, in the method findNodeListName:

private String findNodeListName(NodeList nodeList) {
    Node parent = nodeList.getParentNodeForChildren();
    for (Method m : parent.getClass().getMethods()) {
        if (m.getParameterCount() == 0 && m.getReturnType().getCanonicalName().equals(NodeList.class.getCanonicalName())) {
            try {
                NodeList result = (NodeList)m.invoke(parent);
                if (result == nodeList) {
                    String name = m.getName();
                    if (name.startsWith("get")) {
                        name = name.substring("get".length());
                    }
                    return name;
                }
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            } catch (InvocationTargetException e) {
                throw new RuntimeException(e);
            }
        }
    }
    throw new IllegalArgumentException();
}

If the modified NodeList is the same one we get when calling getParameters on the parent of the list then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words we have to specify that when deleting a node from a list of parameters we have to remove the preceeding comma. When removing a node from a list of methods instead there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something which can be parsed correctly it would not be enough to produce an acceptable result because we are not considering adjustments to whitespace. We could have newlines, spaces or tabs added before or after the comma in order to obtain a readable layout. When removing an element and the comma the layout would change and adjust whitespace accordingly would be not necessarily trivial.

Adding nodes

When adding nodes we have mostly the same problems we have seen when removing nodes: we could need to add separators instead of removing them. We have also another problem: we need to figure out where to insert an element.

We can have two different cases:

it is single element (like the name of a class or the return type of a method)
it is an element part of a list (for example a method parameter)

In the first case we could have:

[ "class ", /* Here we should put the ID */, "{ }" ]

We need to specify that when adding a name to a Class Definition we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty or if we want to insert the first element of a list we need to use some delimiter. For example in the case of a method definition, when adding a parameter we should add it after the left parenthesis. If instead we want to add an element in a position different from the first one we need to find the preceeding element in the list, insert a delimiter (if necessary) and then place the new element.

Also in this case we would need to adapt whitespace.

Changing nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases however we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases we would need specific support for each property of each node. Some more work for us.

Associate comments to nodes

Some time ago I started working on comment attribution: i.e., finding to which node a comment is referred. Why is this necessary? Because when we remove a Node we should remove also the corresponding comments and when we move a Node around the associated comments should be moved with it. And again we usually put some whitespace around the comments. Also that need to be handled.

Conclusions

Lexical preservation is a very important feature: we need to implement it in JavaParser and we need to get it right. However it is far from being trivial to implement and there is not a clear, easy to implement solution. There are different aspects to consider and heuristics to address the problem. For this reason we will need to collect a lot of feedback and being ready to a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Implementing Lexical Preservation for JavaParser

How would I use that?

Connect the parsed nodes to the original code

Removing nodes

Adding nodes

Changing nodes

Associate comments to nodes

Conclusions

Categories

More on Code processing

Code Completion with ANTLR4-c3

Parsing SQL