
Extracting JavaDoc documentation from source files using JavaParser

 

Many people use JavaParser for a wide variety of goals. One of these is extracting documentation. In this short post we will see how to print all the Javadoc comments associated with classes or interfaces.

Code is available on GitHub: https://github.com/ftomassetti/javadoc-extractor

Getting all the Javadoc comments for classes

We are reusing DirExplorer, a supporting class presented in the introduction to JavaParser. This class lets us process a directory recursively, parsing all the Java files it contains.

We can start by iterating over all the classes and finding the associated Javadoc comments.

As you can see, getting the Javadoc comments is fairly easy. It produces this result:

Getting all the Javadoc comments and finding the documented elements

In other cases we may want to start by collecting all the Javadoc comments and then find the element each one documents. We can also do that easily with JavaParser:

Here most of the code is about providing a description for the commented node (method describe).

Conclusions

Manipulating the AST and finding the Javadoc comments is quite easy. However, one missing feature is the ability to extract the information contained in the Javadoc in a structured form. For example, you may want to get only the part of the Javadoc associated with a certain parameter or with the return value. JavaParser currently does not have this feature, but I am working on it and it should be merged in the next 1-2 weeks. If you want to follow the development, take a look at issue 433.

Thanks for reading and happy parsing!

Implementing Lexical Preservation for JavaParser


Many users of JavaParser are asking for lexical preservation, i.e., the ability to parse a Java source file, modify the AST and get back the modified Java source code while keeping the original layout.

Currently this is not supported by JavaParser: you can parse all Java 8 code with JavaParser and get an AST, but the code you get back from it is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could it be implemented? I experimented with a solution which does not require changing the JavaParser APIs or the internal implementation. It is based only on the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How would I use that?

As a user you basically need to set up a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:

What does the setup method do?

Two things:

  1. associate the original code to the starting elements of the AST
  2. attach an observer to react to changes

Let’s see how this works.

Connect the parsed nodes to the original code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which each node has a Range indicating its position in the original source code. For example, if we parse this:

We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.
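The lookup from a range to its text can be sketched in plain Java, without JavaParser. The names RangeTextSketch, offsetOf and textInRange are hypothetical, the sample source is made up so that the class spans line 3, column 1 to line 6, column 1, and the sketch assumes 1-based positions, an inclusive end and "\n" line separators.

```java
final class RangeTextSketch {

    // Converts a 1-based (line, column) position into a 0-based offset in the source.
    static int offsetOf(String source, int line, int column) {
        int offset = 0;
        for (int currentLine = 1; currentLine < line; currentLine++) {
            int newline = source.indexOf('\n', offset);
            if (newline < 0) {
                throw new IllegalArgumentException("line " + line + " is past the end of the source");
            }
            offset = newline + 1;
        }
        return offset + column - 1;
    }

    // Returns the text between two positions, both inclusive.
    static String textInRange(String source, int beginLine, int beginColumn,
                              int endLine, int endColumn) {
        int begin = offsetOf(source, beginLine, beginColumn);
        int end = offsetOf(source, endLine, endColumn);
        return source.substring(begin, end + 1);
    }

    public static void main(String[] args) {
        // Hypothetical input: the class starts on line 3 and its closing brace is on line 6.
        String source = "package demo;\n\nclass A {\n    int a;\n\n}\n";
        System.out.println(textInRange(source, 3, 1, 6, 1));
    }
}
```

Running main prints the four lines of the class declaration, from the class keyword to the closing brace.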

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

putPlaceholders finds the parts of the text corresponding to children and creates a ChildNodeTextElement for each of them. In practice, at the end, for each node we will have a list of strings (StringNodeTextElement) and placeholders indicating the position of children in the text (ChildNodeTextElement).

For example for the class class A { int a;} we would have a template of three elements:

  1. StringNodeTextElement("class A {")
  2. ChildNodeTextElement(field a)
  3. StringNodeTextElement("}")

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

If I want to remove the field f I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we change the text of the class we should also change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and templates precisely to avoid having to propagate changes up the hierarchy: a parent does not store the portion of text corresponding to a child but uses a placeholder instead. For example, for the class we will store something that conceptually looks like this:

So removing a child will just mean removing an element from the list of the parent which will look like this:

In other words when we remove an element we remove the corresponding ChildNodeTextElement from the text associated to the parent.

At this point we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.
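The idea of templates with placeholders can be sketched in a few lines of plain Java. This is not the code of the LexicalPreservingPrinter: the names NodeTextSketch, StringElement and ChildElement are hypothetical stand-ins for the node text, StringNodeTextElement and ChildNodeTextElement described above, and whitespace handling is ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class NodeTextSketch {

    // One piece of a node's text: either literal text or a placeholder for a child.
    interface TextElement {
        String expand();
    }

    // Literal text owned by the node itself, such as "class A { ".
    static final class StringElement implements TextElement {
        private final String text;
        StringElement(String text) { this.text = text; }
        public String expand() { return text; }
    }

    // Placeholder pointing at the template of a child node.
    static final class ChildElement implements TextElement {
        private final NodeTextSketch child;
        ChildElement(NodeTextSketch child) { this.child = child; }
        public String expand() { return child.expand(); }
    }

    private final List<TextElement> elements = new ArrayList<>();

    NodeTextSketch addString(String text) { elements.add(new StringElement(text)); return this; }

    NodeTextSketch addChild(NodeTextSketch child) { elements.add(new ChildElement(child)); return this; }

    // Removes the placeholder of the given child; returns false if it was not found.
    boolean removeChild(NodeTextSketch child) {
        return elements.removeIf(e -> e instanceof ChildElement && ((ChildElement) e).child == child);
    }

    // Rebuilds the source text by expanding every element in order.
    String expand() {
        StringBuilder sb = new StringBuilder();
        for (TextElement element : elements) sb.append(element.expand());
        return sb.toString();
    }

    public static void main(String[] args) {
        NodeTextSketch field = new NodeTextSketch().addString("int a;");
        NodeTextSketch clazz = new NodeTextSketch().addString("class A { ").addChild(field).addString("}");
        System.out.println(clazz.expand()); // prints "class A { int a;}"
        clazz.removeChild(field);
        System.out.println(clazz.expand()); // prints "class A { }"
    }
}
```

Because the class only holds a placeholder for the field, removing the field touches a single list and nothing has to be recomputed for the enclosing nodes.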

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

The corresponding list of elements will be:

If we want to remove the first parameter we would get:

This would not be valid because of the extra comma. In this case we need to know that the element is part of a comma-separated list: since we are removing an element from a list with more than one element, we also need to remove one comma.
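A minimal sketch of this rule, assuming the parent's text is modelled as a plain list of strings in which parameters and commas are separate elements. The class and method names are hypothetical and whitespace is not adjusted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SeparatedListRemovalSketch {

    static boolean isSeparator(String element) {
        return element.trim().equals(",");
    }

    // Removes the element at the given index plus one adjacent comma, if there is one.
    static void removeWithSeparator(List<String> elements, int index) {
        if (index > 0 && isSeparator(elements.get(index - 1))) {
            elements.remove(index);     // the element itself
            elements.remove(index - 1); // the comma before it
        } else {
            elements.remove(index);     // the element itself
            if (index < elements.size() && isSeparator(elements.get(index))) {
                elements.remove(index); // the comma that followed it
            }
        }
    }

    static String render(List<String> elements) {
        return String.join("", elements);
    }

    public static void main(String[] args) {
        List<String> method = new ArrayList<>(Arrays.asList("void foo(", "int a", ", ", "int b", ")"));
        removeWithSeparator(method, 1); // remove the first parameter
        System.out.println(render(method)); // prints "void foo(int b)"
    }
}
```

The preceding comma is preferred when it exists. Only the first element of the list has no preceding comma, so for it the following comma is removed instead.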

These kinds of changes depend on the role of a node, i.e., where that node is used. For example, when a node contained in a list is removed, the method concreteListChange of our observer is called:

Now, to understand what the modified NodeList represents we use reflection, in the method findNodeListName:

If the modified NodeList is the same one we get when calling getParameters on the parent of the list, then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words, we have to specify that when deleting a node from a list of parameters we have to remove the preceding comma. When removing a node from a list of methods, instead, there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something which can be parsed correctly, it would not be enough to produce an acceptable result, because we are not considering adjustments to whitespace. There could be newlines, spaces or tabs before or after the comma in order to obtain a readable layout. When removing an element and its comma the layout changes, and adjusting the whitespace accordingly is not necessarily trivial.
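The reflection trick can be illustrated with the standard library alone. In this sketch MethodNode is a hypothetical stand-in for an AST node with two child lists, and findListName is an illustration of the idea rather than the actual findNodeListName: it looks for the getter that returns the very same list instance.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class ListRoleSketch {

    // Hypothetical stand-in for a method declaration node with two child lists.
    public static final class MethodNode {
        private final List<String> parameters = new ArrayList<>();
        private final List<String> thrownExceptions = new ArrayList<>();
        public List<String> getParameters() { return parameters; }
        public List<String> getThrownExceptions() { return thrownExceptions; }
    }

    // Finds the getter of parent that returns this exact list instance and
    // returns its name without the "get" prefix, or null if none matches.
    static String findListName(Object parent, List<?> list) {
        for (Method method : parent.getClass().getMethods()) {
            boolean isListGetter = method.getName().startsWith("get")
                    && method.getParameterCount() == 0
                    && List.class.isAssignableFrom(method.getReturnType());
            if (!isListGetter) continue;
            try {
                // Identity, not equals: two empty lists are equal but play different roles.
                if (method.invoke(parent) == list) {
                    return method.getName().substring("get".length());
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot call " + method.getName(), e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MethodNode node = new MethodNode();
        System.out.println(findListName(node, node.getParameters())); // prints "Parameters"
    }
}
```

Once the role is known ("Parameters" here), a table of rules can map it to the separator that has to be added or removed.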

Adding nodes

When adding nodes we have mostly the same problems we have seen when removing nodes: we may need to add separators instead of removing them. We also have another problem: we need to figure out where to insert an element.

We can have two different cases:

  • it is a single element (like the name of a class or the return type of a method)
  • it is an element that is part of a list (for example a method parameter)

In the first case we could have:

We need to specify that when adding a name to a class definition we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty, or if we want to insert the first element of a list, we need to use some delimiter as a reference. For example, in the case of a method definition, when adding a parameter we should add it after the left parenthesis. If instead we want to add an element in a position other than the first one, we need to find the preceding element in the list, insert a delimiter (if necessary) and then place the new element.

Also in this case we would need to adapt whitespace.

Changing nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases however we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases we would need specific support for each property of each node. Some more work for us.

Associate comments to nodes

Some time ago I started working on comment attribution, i.e., finding which node a comment refers to. Why is this necessary? Because when we remove a node we should also remove the corresponding comments, and when we move a node around the associated comments should move with it. And again, we usually put some whitespace around comments. That also needs to be handled.

Conclusions

Lexical preservation is a very important feature: we need to implement it in JavaParser and we need to get it right. However, it is far from trivial to implement and there is no clear, easy solution. There are different aspects to consider and heuristics are needed to address the problem. For this reason we will need to collect a lot of feedback and be ready for a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Observers for AST nodes in JavaParser


We are getting closer to the first Release Candidate for JavaParser 3.0. One of the last features we added was support for observing changes to all nodes of the Abstract Syntax Tree. While I wrote the code for this feature I received precious feedback from Danny van Bruggen (a.k.a. Matozoid) and Cruz Maximilien. So I use “we” to refer to the JavaParser team.

What could observers on AST nodes be used for?

I think this is a very important feature for the ecosystem of JavaParser because it makes it easier to integrate with JavaParser by reacting to the changes made to the AST. Examples of changes that can be observed are setting a new name for a class or adding a new field. Different tools could react to those changes in different ways. For example:

  • an editor could update its list of symbols, which could be used for things like auto-completion
  • some frameworks could regenerate source code to reflect the changes
  • validation could be performed to verify whether the new change leads to an invalid AST
  • libraries like JavaSymbolSolver could recalculate the types for expressions

These are just a few ideas that come to mind but I think that most scenarios in which JavaParser is used could benefit from the possibility to react to changes.

The AstObserver

The JavaParser 3.0 AST is based on Nodes and NodeLists. A Node, like a TypeDeclaration for instance, can have different groups of children. When these groups can contain more than one node we use NodeLists. For example, a TypeDeclaration can have multiple members (fields, methods, inner classes). So each TypeDeclaration has a NodeList to contain fields, one to contain methods, etc. Other children, like the name of a TypeDeclaration, are instead directly contained in a node.

We introduced a new interface named AstObserver. An AstObserver receives changes on the Nodes and NodeLists.

What to observe

Now we have an AstObserver and we need to decide which changes it should receive. We thought of three possible scenarios:

  1. Observing just one node, for example a ClassDeclaration. The observer would receive notifications for changes on that node (e.g., if the class changes its name) but not for any of its descendants. For example, if a field of the class changes its name the observer would not be notified
  2. Observing a node and all the descendants it has at the moment the observer is registered. In this case, if I register an observer for the ClassDeclaration I would be notified of changes to the class and all its fields and methods. If a new field is added and later modified I would not receive notifications for those changes
  3. Observing a node and all its descendants, both the ones existing at the moment the observer is registered and the ones added later.

So a Node has now this method:

To distinguish these three cases we simply use an enum (ObserverRegistrationMode). Later you can see how we implemented the PropagatingAstObserver.

Implementing support for observers

If JavaParser were based on some meta-modeling framework like EMF this would be extremely simple to do. Given this is not the case, I needed to add a notification call in all the setters of the AST classes (there are around 90 of those).

So when a setter is invoked on a certain node it notifies all the observers. Simple. Take for example setName in TypeDeclaration<T>:

Given we do not have a proper metamodel we have no definitions for properties. Therefore we added a list of properties in an enum, named ObservableProperty. In this way an Observer can check which property was changed and decide how to react.
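The pattern of a setter that notifies observers, with an enum naming the changed property, can be sketched as follows. This is not the JavaParser code: ObservableNodeSketch, Property and Observer are hypothetical stand-ins for a node, ObservableProperty and AstObserver, and notifying before the field is updated is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class ObservableNodeSketch {

    // Hypothetical subset of the properties a node can expose.
    enum Property { NAME, MODIFIERS }

    interface Observer {
        void propertyChange(ObservableNodeSketch node, Property property,
                            Object oldValue, Object newValue);
    }

    private final List<Observer> observers = new ArrayList<>();
    private String name;

    ObservableNodeSketch(String name) { this.name = name; }

    void register(Observer observer) { observers.add(observer); }

    String getName() { return name; }

    // Tells every observer which property is changing, then updates the field.
    void setName(String newName) {
        for (Observer observer : observers) {
            observer.propertyChange(this, Property.NAME, this.name, newName);
        }
        this.name = newName;
    }

    public static void main(String[] args) {
        ObservableNodeSketch node = new ObservableNodeSketch("A");
        node.register((n, property, oldValue, newValue) ->
                System.out.println(property + ": " + oldValue + " -> " + newValue));
        node.setName("B"); // prints "NAME: A -> B"
    }
}
```

An observer interested only in names can simply compare the property argument with Property.NAME and ignore everything else.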

Internal hierarchy of observers

For performance reasons each node has its own list of observers. When we want to observe all descendants of a node we simply add the same observer to all nodes and nodelists in that subtree.

However, this is not enough, because in some cases you may also want to observe all the nodes which are added to the subtree after you have placed your observers. We do that by using a PropagatingAstObserver. It is an AstObserver that, when it sees a new node being attached to a node it is observing, starts to observe the new node as well. Simple, eh?
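A minimal sketch of the propagation idea, using a hypothetical TreeNode instead of the real AST classes. The observer registers itself on a node and its current descendants, and whenever it is told that a child was attached it registers itself on that child too.

```java
import java.util.ArrayList;
import java.util.List;

final class PropagationSketch {

    interface Observer {
        void childAdded(TreeNode parent, TreeNode child);
    }

    // Hypothetical node: each node keeps its own list of observers.
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();
        final List<Observer> observers = new ArrayList<>();

        TreeNode(String label) { this.label = label; }

        void addChild(TreeNode child) {
            children.add(child);
            // Iterate over a copy: observers may register themselves elsewhere meanwhile.
            for (Observer observer : new ArrayList<>(observers)) {
                observer.childAdded(this, child);
            }
        }
    }

    // Observer that follows the subtree as it grows.
    static final class PropagatingObserver implements Observer {
        final List<String> seen = new ArrayList<>();

        // Registers this observer on a node and on all of its current descendants.
        void observe(TreeNode node) {
            if (!node.observers.contains(this)) node.observers.add(this);
            for (TreeNode child : node.children) observe(child);
        }

        @Override
        public void childAdded(TreeNode parent, TreeNode child) {
            seen.add(parent.label + " + " + child.label);
            observe(child); // start observing the newly attached node as well
        }
    }

    public static void main(String[] args) {
        TreeNode clazz = new TreeNode("class");
        PropagatingObserver observer = new PropagatingObserver();
        observer.observe(clazz);
        TreeNode method = new TreeNode("method");
        clazz.addChild(method);
        method.addChild(new TreeNode("param")); // also reported, thanks to propagation
        System.out.println(observer.seen);
    }
}
```

Without the call to observe(child) inside childAdded, the second change would go unnoticed, which corresponds to the second registration scenario described earlier.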

Observers in action

Let’s see how this works in practice:

Conclusions

I am quite excited about this new feature because I think it enables more cool stuff to be done with JavaParser. I think our work as committers is to enable other people to do things we are not foreseeing right now. We should just act as enablers and then get out of the way.

I am really curious to see what people will build. By the way, do you know any project using JavaParser that you want to make known to us? Leave a comment or open an issue on GitHub, we are looking forward to hearing from you!

Getting started with JavaParser: analyzing Java Code programmatically

One of the things I like the most is to parse code and to perform automatic operations on it. For this reason I started contributing to JavaParser and created a couple of related projects: java-symbol-solver and effectivejava.

As a contributor to JavaParser I read very similar questions over and over about extracting information from Java source code. For this reason I thought I could help by providing some simple examples, just to get started with parsing Java code.

All the source code is available on Github: analyze-java-code-examples


Common code

When using JavaParser there are a bunch of operations we typically want to do every time. Often we want to operate on a whole project, so given a directory we would explore all the Java files. This class should help with that:

For each Java file we first want to build an Abstract Syntax Tree (AST) and then navigate it. There are two main strategies to do so:
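As a rough idea of what such a helper involves, here is a sketch that walks a directory recursively and hands every .java file to a callback. It is an illustration written for this purpose, not the DirExplorer from the repository: the name DirExplorerSketch, the FileHandler signature and the relative path format are assumptions.

```java
import java.io.File;

final class DirExplorerSketch {

    interface FileHandler {
        // level is the depth below the root; path is relative to the root, e.g. "/pkg/A.java".
        void handle(int level, String path, File file);
    }

    private final FileHandler handler;

    DirExplorerSketch(FileHandler handler) { this.handler = handler; }

    void explore(File root) { explore(0, "", root); }

    private void explore(int level, String path, File file) {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children == null) return; // directory could not be read
            for (File child : children) {
                explore(level + 1, path + "/" + child.getName(), child);
            }
        } else if (path.endsWith(".java")) {
            handler.handle(level, path, file);
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        new DirExplorerSketch((level, path, file) -> System.out.println(path)).explore(root);
    }
}
```

In a real tool the handler would parse the file and process the resulting AST instead of printing the path.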

  1. use a visitor: this is the right strategy when you want to operate on specific types of AST nodes
  2. use a recursive iterator: this lets us process all sorts of nodes

Visitors can be written extending classes included in JavaParser, while this is a simple node iterator:
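The idea of such an iterator can be shown on a hypothetical minimal tree type, so that the sketch runs without JavaParser. The handler decides, node by node, whether the iterator should descend into the children. The names NodeIteratorSketch, TreeNode and NodeHandler are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NodeIteratorSketch {

    // Hypothetical minimal tree node; a real AST node exposes its children in a similar way.
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String label, TreeNode... children) {
            this.label = label;
            for (TreeNode child : children) this.children.add(child);
        }
    }

    interface NodeHandler {
        // Returns true if the iterator should also visit the children of this node.
        boolean handle(TreeNode node);
    }

    private final NodeHandler handler;

    NodeIteratorSketch(NodeHandler handler) { this.handler = handler; }

    void explore(TreeNode node) {
        if (handler.handle(node)) {
            for (TreeNode child : node.children) explore(child);
        }
    }

    public static void main(String[] args) {
        TreeNode tree = new TreeNode("method",
                new TreeNode("for", new TreeNode("call")),
                new TreeNode("return"));
        // Stop at "for": its inner nodes are not visited.
        new NodeIteratorSketch(node -> {
            System.out.println(node.label);
            return !node.label.equals("for");
        }).explore(tree);
    }
}
```

Returning false from the handler is what makes it possible to report only top-level nodes of a given kind and skip whatever they contain.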

Now let’s see how to use this code to solve some questions found on Stack Overflow.

How to extract the name of all classes in a normal String from java class?

Asked on Stack Overflow

This question can be answered by looking for the ClassOrInterfaceDeclaration nodes. Given that we want a specific kind of node we can use a visitor. Note that the VoidVisitorAdapter lets us pass an arbitrary argument. In this case we do not need it, so we specify the type Object and simply ignore it in our visit method.

We ran the example on the source code of JUnit and got this output:

 

Is there any parser for Java code that could return the line numbers that compose a statement?

Asked on Stack Overflow

In this case I need to find all sorts of statements. Now, there are several classes extending the Statement base class, so I could use a visitor, but I would need to write the same code in several visit methods, one for each subclass of Statement. In addition, I only want to get the top-level statements, not the statements inside them. For example, a for statement could contain several other statements. With our custom NodeIterator we can easily implement this logic.

And this is a portion of the output obtained running the program on the source code of JUnit.

You may notice that the statement as printed spans 5 lines, not 6 as reported (lines 12 to 17 are 6 lines). This is because we print a cleaned version of the statement, with blank lines and comments removed and the code reformatted.

Extract methods calls from Java code

Asked on Stack Overflow

To extract method calls we can again use a visitor, so this is pretty straightforward and fairly similar to the first example we have seen.

As you can see the solution is very similar to the one for listing classes.

Next steps

You can answer a lot of questions with the approaches presented here: you navigate the AST, find the nodes you are interested in and get whatever information you are looking for. There are however a couple of other things we should look at. First of all, how to transform the code: while extracting information is great, refactoring is even more useful. Then, for more advanced questions we need to resolve symbols using java-symbol-solver. For example:

  • looking at the AST we can find the name of a class, but not the list of interfaces it implements indirectly
  • when looking at a method invocation we cannot easily find the declaration of that method. In which class or interface was it declared? Which of the different overloaded variants are we invoking?

We will look into that in the future. Hopefully these examples will help you get started!

Releasing JavaParser 2.1

The other day the guys involved in JavaParser gave me the honor of releasing our new version: 2.1

The community on GitHub took over the project, which was previously hosted on Google Code and abandoned at some point. Nicholas Smith, among other things, rewrote all the tests to use JBehave and wrote detailed instructions for performing the release on Maven Central using Sonatype. To release the new version I just had to follow his instructions and get the permissions from Sonatype.

So now you can add it to your projects using:

What is new

We have 30 closed issues or pull requests

Including but not limited to:

  • a lot of bug fixing
  • improved test coverage
  • correctly support different encodings
  • improvement to the documentation
  • fix some issues with lambdas
  • removing some major performance issues
  • introduced the NamedNode interface

And now the community is already working on the next release, which will probably be JavaParser 3.0. Exciting times are coming.

Java comments parsing

Recently I have done some work on JavaParser, focusing on parsing comments and attributing them to the element being commented.

I like working on manipulating source code. I also like this problem because it does not have an obvious solution: it can be solved only by relying on heuristics and conventions.

Some notes on comments parsing as it is implemented right now, with more documentation to come soon.

 

Three different kinds of comments are parsed:

  • Line comments (from // to end line)
  • Block comments (from /* to */)
  • Javadoc comments (from /** to */)

Comments are parsed as all the other elements of the grammar, so we provide their position in the source code and their content.
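Telling the three kinds apart comes down to the opening delimiter. The sketch below shows that on a comment given as a plain string. The names CommentKindSketch and classify are hypothetical, and treating "/**/" as an empty block comment follows the Java language rules rather than anything stated here about JavaParser.

```java
final class CommentKindSketch {

    enum Kind { LINE, BLOCK, JAVADOC }

    // Classifies a complete comment by its opening delimiter.
    static Kind classify(String comment) {
        if (comment.startsWith("//")) return Kind.LINE;
        // "/**/" is an empty block comment, so it must not count as Javadoc.
        if (comment.startsWith("/**") && !comment.equals("/**/")) return Kind.JAVADOC;
        if (comment.startsWith("/*")) return Kind.BLOCK;
        throw new IllegalArgumentException("not a comment: " + comment);
    }

    public static void main(String[] args) {
        System.out.println(classify("// orphan comment"));          // prints "LINE"
        System.out.println(classify("/* Comment of the class */")); // prints "BLOCK"
        System.out.println(classify("/** Describes the class */")); // prints "JAVADOC"
    }
}
```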

We also try to understand which element each comment refers to and attribute the comment to the node we suppose to be its target. Note that to do that we use some simple heuristics. While this normally works quite well, there are limitations: it is not possible to devise an algorithm able to understand with absolute accuracy which element is targeted by a comment.

Principles used to attribute comments

  • Each element can have only one comment associated
  • Line comments which follow an element on the same line are attributed to the last element on that line which starts and ends on the line. If no element starts and ends on the line, the comment is associated with the last node ending on that line. This kind of association is stronger than the others.
  • Comments which are alone on one line (or on more than one line) are associated with the first element following them.
  • A comment cannot be associated with another comment (i.e., no comments commenting other comments)
  • Comments which are not on the same line as other nodes and which precede empty lines are considered orphans (this behavior can be changed using JavaParser.setDoNotAssignCommentsPreceedingEmptyLines(boolean) )

Not all comments can be associated with an element; the remaining ones are considered orphan comments. They are inserted in the list of orphan comments of the first node which contains them.

Typical Use Examples


class A {
// orphan comment
}

In this case there is no element immediately following the orphan comment, therefore it is listed as an orphan comment of the element containing it (class A).


/* Orphan comment */
/* Comment of the class */
class A { }

In this case the first comment is attributed to the declaration of variable a because it precedes it, while the second remains an orphan comment because empty lines separate it from the first node. If JavaParser.setDoNotAssignCommentsPreceedingEmptyLines(false) was invoked before parsing, also the second comment would have been associated to the following declaration.


int a = 0; // comment associated to the field

This comment is associated to the whole field, because it is the last (and only) node before the comment.


int a
= 0; // comment associated to zero

In this case “another comment” is associated to the variable declaration, and because only one element can be associated to a single node “a comment” remains an orphan comment.

Atypical Use Examples

Due to the liberal nature of what is considered valid with regards to comment syntax the parser has had to make a number of sensible assumptions.


/* A block comment that
// Contains a line comment
*/
public static void main(String args[]) {
}

In this case a single comment is created for the block comment, where the content is “A block comment that // Contains a line comment”


@Override
// Returns number of vowels in a name
public int countVowels(String name) {
}

In this case the line comment is attributed to the return type of the method, rather than to the method itself. This is because the start line of a method is determined by its first annotation; therefore all method comments need to precede annotations.

The up to date documentation is available at https://github.com/matozoid/javaparser/wiki/CommentsParsing.