Articles about extracting data from code, analyzing it, and programmatically transforming it. In other words, we talk about static analysis, automated refactoring, and code generation.

JArchitect: review of a tool to manage your Java project

JArchitect is a tool to analyze Java projects.

I think we need to use software to reason about software. As software developers we build software to help other professionals do their jobs, yet we rely on surprisingly little software to help us in our own work.

We like the romantic idea of being artisans. While there is a large amount of craft involved in writing software, there is also room to benefit from software that can:

  • navigate large codebases
  • identify problems in an automated way
  • recognize patterns and repetitive structures
  • perform refactoring at scale

The question is:

Can JArchitect help with any of that?

Disclaimer: I was contacted by the company behind JArchitect and I was given a license to review their product. They did not compensate me for this review and I have no incentive to write a biased review.

First impressions of JArchitect

Right after I downloaded and installed JArchitect I started it for the first time. At that point I noticed how poor the icon looked compared to the others: it stood out simply because it was the only one that did not look nice.

Let me say that I am not a design maniac: I have been using Linux as my main system for most of my professional life and I am an engineer at my core: I just want to get things done properly. However, on this screen I can see several icons, from large companies and open-source projects. They all look decent. I may like them or not, but they all look professional. The one from JArchitect is the only one that looks sloppy.

Is this relevant? Not so much.

Does it help make a good impression? Not so much.

I received the license by email. They had sent me the license for the latest version of the software; however, that version was not available for Mac or Linux, just for Windows. I guess they assumed I was running Windows. I am not a Windows-hater or anything like that, but in the last five years, when working with Java, I have seen only a small percentage of developers using Windows machines. So I take the fact that they assumed I was using Windows as a hint that they may not be very familiar with the Java development world. Or maybe I have only worked with companies that were the exception, and Windows is far more widespread among Java developers than I can see from my small, biased world.

Anyway, the version I downloaded turned out to be the previous one, not the one for which I had the license. So, after some pretty confusing screens and Windows-specific instructions, I wrote to the guys at Codergears and got the right license. The email still contained references to Windows, even though their software is not Windows-specific, but at least at this point I was able to enter the license.

Ok, the start was not frictionless but now it is time to dive in.

Selecting a project

You can select a project in different ways:

I am happy to see they support Maven but I am frustrated they do not support Gradle.

Personally I use Gradle in all my new projects, so for me this is a big limitation. You can also import an Eclipse workspace, but you cannot import an IntelliJ IDEA one. For me the most important option would be a Gradle import, followed by an IDEA project import. However, neither is supported.

For the sake of reviewing JArchitect I opened JavaParser, which is based on Maven. If you do not know JavaParser it is a great library to process Java code. Incidentally I am a contributor to that library.

What the home screen looks like

This is how JArchitect looks when you start it:

JArchitect – how it looks when you start it

One of the characteristics of JavaParser is that it has no dependencies. Yet in the image, at the center of the screen, it seems that javaparser-core has a dependency on a module. That module has a suspicious name: MISSING.

I do not know why they report this MISSING module. Maybe it is an error that happens when the project has no dependencies, or maybe they cannot handle multi-module POM files. For example, in the case of JavaParser there is a POM representing the whole project (the "parent") and then modules like javaparser-core and javaparser-testing.

Let’s look into this MISSING module.

This module apparently refers to MISSING.jar. I guess it is a default value for when something goes wrong. To me this seems very… sloppy. Instead of saying explicitly that the module is not connected to any JAR, they just used this sort of placeholder. This piece of software is intended for people who build software, and they will recognize this as a very poor programming practice. It does not help to give a good impression.

The fact that JArchitect cannot deal with a simple project defined using Maven, when Maven is basically the only build tool it supports, is surprising to me.

Exploring the tool

We have a bar that we can use to jump to different views.

Some of these buttons change the view in the tool, while others open windows in the browser. Personally I do not like this: I would expect the HTML files to be displayed inside the application. On the other hand, I appreciate that the HTML files can also be shared with people who do not have access to the tool.

For example, one of the views shows a report. This report should give an overview of the status of the project, and I like the way it is organized.

However, some of the diagrams in the report just do not make sense to me. Take this one:

 

What does this mean? Aside from the fact that the names of the modules are very small, there is basically no explanation. I am also not sure why most modules (apparently) have a very high instability (really? why?) and yet sit in the green band, which I suppose is a good sign.

Clicking on the trend charts, things work even less well. It opens a window that looks like this:

We are talking about a very simple project based on Maven. I open it with a fresh installation and I get broken pages and errors?

As a note: some of the code in JavaParser is generated, and it is marked using the standard annotation javax.annotation.Generated. From what I saw, no report or graph seems to handle generated code differently or separately. Maybe the tool does and it is just not evident to the user, or maybe the authors of the tool are not familiar with this annotation. I am not sure.

Looking at the provided queries

Honestly, the report and the charts did not provide much value to me, but the strong point of JArchitect should be its support for queries on code.

Let's first look at the queries that are provided with the tool and the results they give us. In the next section we will instead look at how we can write our own queries.

Queries are organized in groups. You can select a group on the left and see the corresponding queries on the right.

 

Queries on Java code

If you then click on a specific query you can see the code of the query itself and the matched elements in another panel:

Queries on Java code

At this point you may want to investigate some of the elements identified by the query further. I would expect to be able to click on a method to open it and look into it. You cannot. There is an item in the context menu, but I get this:

What should have been set? I imported the POM, and the source code is located according to the standard organization of Java projects, so why is it not just imported correctly?

At this point I have this bunch of methods and I cannot even inspect them from JArchitect.

Also, looking at the error messages, it seems clear it has been written using .NET. These are not error messages from the Java world.

Some of these queries are naive. For example, there is a query suggesting not to use boxed types:

Prefer primitive types to boxed primitives

But it also does not make much sense as a blanket rule: there are real differences between primitive and boxed types, and sometimes you need the boxed ones.

Another one is “Always override toString”. Really?

Or another one about "potentially dead methods": I am writing a library, so of course I do not call all the methods; some are intended to be called by users. The same goes for the methods that could have lower visibility (3219 of them, according to JArchitect).

It is hard to write queries that are useful in all cases, and I would expect to have to do some tuning to get value out of the queries provided. However, honestly, I did not find many valuable insights using these queries. The fact that I cannot even see the code of the classes or methods identified by the queries makes the user experience much more painful and significantly reduces the possibility of using the tool productively.

Writing new queries

So I started writing my first query. My goal is to identify methods with a long name. The query editor offers auto-completion and by looking at existing queries I could write my own quickly:

The problem with the result is that what it considers to be the method name is actually the method signature. The actual name (juggleArrayCreation in the picture) is most definitely not 76 characters long, as the results of the query would suggest.

It turns out I should use the SimpleName instead. This is not a term I have ever heard when discussing Java methods, but at least I can get my intended result:
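
Roughly, the query looks like this (a sketch in the CQLinq-style language the tool uses; treat the exact property names as an approximation to verify with the editor's auto-completion):

    // Hypothetical sketch: methods whose simple name is longer than 30 characters
    from m in Methods
    where m.SimpleName.Length > 30
    orderby m.SimpleName.Length descending
    select new { m, m.SimpleName }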

Or I may want to look for methods with many parameters:

If you look at the first result, it is not even a method: it is a constructor. I think there are definitely issues with the naming chosen. As a result, it is quite confusing for Java developers.

I really, really like the idea of running queries on Java code, and the basic support is there. However, the underlying metamodel does not seem to have been created with Java in mind, or by someone familiar with Java. This is a major issue because even if I learn and get used to the wrong terminology, it would still make my queries confusing and unreadable for people who know Java.

Summary

I am disappointed.

I have heard good things about NDepend, but JArchitect seems a hurried, half-baked port of it to the Java world, probably done without involving Java developers.

I like the idea of writing queries on Java code. That is great. This is why I worked on a project named effectivejava years ago and I am now working on JavaSymbolSolver.

I get that writing automated queries that make sense is not easy. Most static analysis tools are a waste of time unless they are properly configured. However, I think the company lacks experience and knowledge of how things work in the Java ecosystem.

I think the possibility of writing your own queries has great potential and it could justify a good price. However, in my opinion the quality and the current status of the project are inadequate for the price. We are not talking about an open-source project given away for free: the cheapest license, with a 1-year subscription, costs $599.

I believe in releasing early and often. This is always true for open source, and it can be useful for commercial software too, to get feedback. However, I would expect people to focus on a few core features and implement them correctly. This software seems very far from an acceptable level of quality.

I know JavaParser fairly well, and I did not get any insights about it using JArchitect. Maybe someone who is totally unfamiliar with a large project could get some help navigating it. I cannot exclude that, but to me the value was very limited. I would love to hear from people with different experiences to better understand the intended benefit of the tool.

In my opinion this is not a product that has been designed and crafted well enough to justify $599. I know that the same product can be used differently by different people and in different contexts it can provide a very different amount of value. However, at that price point I would expect correct information and a polished product, with support for mainstream options (like Gradle and IntelliJ IDEA). I have seen none of this in this product.

I do not think there are many alternatives to JArchitect. For this reason, if you need such a tool, you can either build it yourself using libraries like JavaParser and JavaSymbolSolver or choose among very few alternatives. This could lead you to buy this tool, looking past its limitations.

I would be happy to revisit my opinion once they have fixed the most obvious issues.

NDepend: Static Analysis For Software Architects

NDepend is a tool that helps .NET developers to write beautiful code

This is a definition that does not come from the official documentation, but from a post on the blog of its developer. We like it because it is more interesting than the technical one: NDepend is a static analysis tool for .NET code (i.e., .NET assemblies). More importantly, it gives a more holistic picture of what static analysis can do for your project. In fact, not all developers agree on what static analysis can and should actually do. This is also due to the fact that the concept and the tools themselves have evolved over the years.

So, what can NDepend do for you? It can:

  • calculate standard metrics to understand your project
  • check whether your code respects standard quality rules
  • provide you with a tool to create your own metrics and enforce your custom rules

It integrates with Visual Studio, continuous integration tools (e.g., JetBrains TeamCity) and all sorts of tools that work around the build process, like (Visual Studio) Team Foundation Server.

NDepend is a tool that mostly affects the life of developers, but it is relevant to the job of software architects, too. That is because it can give a high-level picture of the quality of the code, so it can provide information to the people who make decisions on how to design software. A software architect could use the tool on their own, or use it to guide developers in developing software.

How We Are Reviewing NDepend

Programming is a peculiar job in many ways. The one that interests us right now is how much the quality of its practitioners varies. This is compounded by the fact that international projects and collaborations are pretty common. In traditional professions this is an issue dealt with by people higher up the management chain; for many programmers, instead, it is simply the norm.

This leads to the issue that all sorts of people may read this article, each with different assumptions. So we think it is necessary to make a few notes to get everybody on the same page.

NDepend Is Commercial Software

NDepend is a commercial tool that is reasonably priced for a company (see the pricing model), but it can be too costly for individual developers in certain countries. We cannot say whether NDepend is worth its price for you; we can only review what it does.

If you are unfamiliar with commercial professional software you might be surprised by the fact that it has good documentation. And they even present all the features of their tool! Joking aside, this is obviously normal for commercial software, but it is sadly not the norm for open source software, even in the case of successful projects. So the official documentation gives you the information you need, though it might be a little outdated (e.g., it mentions a rumor about the versions of C# and VB.NET that would come after version 5) and the formatting could be improved.

There are also good explanations accessible while working with the tool.

NDepend example documentation

For example, if your project violates a rule, you can see what the rule means and how to fix the underlying issue.

What We Are Reviewing

NDepend mainly works as a Visual Studio plugin, so it is available for all recent editions of Visual Studio, even the free Visual Studio Community. It would not work with the older free Visual Studio Express edition. It can also be used independently of Visual Studio with its own UI, called Visual NDepend.

As said before, it can also be integrated with several tools used in the build process. We are not going to review that part because the static analysis features are the same as in the Visual Studio plugin. The only differences are obvious ones, such as the fact that it can block the build process when a quality check fails.

We have tested NDepend version v2017.2.2.

What Kind Of Static Analysis?

NDepend Dashboard for LiteDB

We have already mentioned that there are many kinds of static analysis tools. They all work on code, so they all work in the realm of developers, but they can be created for different people.

The NDepend documentation mentions an integration with SonarQube, another static analysis tool. It also says, essentially, that their tool is higher level than SonarQube:

Basically the Roslyn analyzers and SonarQube rules are good at analyzing what is happening inside a method, the code flow while the NDepend code model, on which the NDepend rules are based, is optimized for a 360 view of particular higher-scale areas including OOP, dependencies, metrics, breaking changes, mutability, naming…

Concretely Roslyn analyzers and SonarQube rules can warn about problems like a reference that doesn’t need to be tested for nullity because in the actual scope it cannot be null, while NDepend can warn you about too complex classes or components, and offer advices about how to refactor to make the code cleaner and more maintainable.

You should always be cautious with the information that a vendor gives about its own product, but this seems to us a fair comparison. Neither is presented as better; both tools have their uses. NDepend is a tool designed for a higher level of static analysis: it helps you design better software, rather than simply write more correct code.

This means that it is most useful if your work consists in taking care of the whole design process. Depending on the organization you work in, this might mean that you are a software architect, a senior developer or the only developer. It also means that some features can be of limited use to a beginner developer working on something like a library.

The Static Analysis Engine

The core of NDepend is a static analysis engine that can detect some properties of the code and, more importantly, whether it respects a certain rule.

The foundation of the engine is the code metrics: they are used by the rules and to create visualizations of the project. They range from basic statistics about the whole project and its constituent parts to complex structural measures.

An example of the first kind is the number of lines of code, which is measured at all levels: application, assembly, namespace, class (type) and method. These are easy to understand, although they are computed from PDB files, so they do not count the physical lines of code but what NDepend calls logical Lines of Code (LOC). PDB files are also used to calculate how many lines of comments there are. PDB files are generated by the compiler and contain debugging information.

An example of the second kind of measure is cyclomatic complexity. Even if you did not take a computer science course, there is a good explanation that is easy to follow. Basically, cyclomatic complexity is the number of different paths that a method can take. It measures the decision points in the code that can lead it to do different things.

For instance, if the source code contains no control flow statements the complexity is 1, since there is only a single path through the code. If the code has a single if statement, there are two paths through the code: one when the condition of the if is true and one when it is false.

Acting On The Code Metrics

We do not think that understanding the concept is the issue; the problem is what to do with it. The documentation says that you should be concerned if the number is higher than 15 and split the method if it is higher than 30. So there is some guidance, but if you do not understand its impact and how it affects larger issues like testing and QA, you might not know whether it indicates a larger design problem that cannot be solved simply by refactoring a bit.

That is where the software architect target audience becomes more evident.

An important note is that some of these measures are only available for C#, while a version for Visual Basic .NET is under development. So if your projects are written in VB.NET you have to check carefully what is available from the documentation.

A Few Interesting Metrics

Let’s see a few interesting metrics to understand the structure of the project.

Metrics For Assemblies

The main objective of the metrics for assemblies is to let you understand which ones can be hard to maintain (concrete/stable) and which ones are of little use (abstract/unstable). The first are assemblies whose types are used by many other assemblies, so they can create cascade effects in case of modifications. The second are full of abstract types (interfaces and abstract classes).

To do that it relies on the concept of efferent coupling:

The number of types outside this assembly used by child types of this assembly. High efferent coupling indicates that the concerned assembly is dependant

…and afferent coupling:

The number of types outside this assembly that depend on types within this assembly

The ratio of efferent coupling to total coupling defines instability: the closer this value is to 1, the more unstable the assembly is. At the opposite end, a value of 0 indicates an assembly that is resistant to change.
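
Using the standard definition (where Ce is the efferent coupling and Ca the afferent coupling), this is simply: Instability = Ce / (Ce + Ca).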

Metrics For Types (Classes)

The aforementioned cyclomatic complexity is also computed for types.

Type rank applies the Google PageRank algorithm to the graph of type dependencies. The higher the value, the more important the type is and the more carefully it should be checked, because a bug in it could affect the whole software.

There are a couple of interesting metrics related to class inheritance: number of children and depth of inheritance tree. They can be useful to detect design issues, but it is hard to give a general rule, because a lot of inheritance comes from the language and the standard library itself.

The most notable metric is Lack of Cohesion of Methods (LCOM), which is designed to find which classes respect the Single Responsibility Principle in SOLID. A class that respects this principle is said to be cohesive.

The underlying idea behind these formulas can be stated as follow: a class is utterly cohesive if all its methods use all its instance field
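
The documentation expresses this with a formula along these lines (our paraphrase; check the docs for the exact definition): LCOM = 1 − sum(MF) / (M × F), where M is the number of methods of the class, F the number of its instance fields, and MF the number of methods accessing a given instance field (summed over all fields). A value close to 1 therefore indicates a poorly cohesive class.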

We find this metric particularly interesting because it is a smart way to calculate an abstract principle, and an abstract principle can be hard to understand and apply without the necessary experience. Sadly, there are no metrics to easily detect problems with the other SOLID principles, but one is already one more than we expected.

Metrics For Methods

Method rank applies the Google PageRank algorithm to the graph of method dependencies. The higher the value, the more important the method is and the more carefully it should be checked, because a bug in it could affect the whole software.

IL Nesting Depth indicates the maximum number of nested scopes inside the body of a method. The number is generally the one you would expect from looking at the source code, but the compiler could have optimized the original code and changed its structure a bit.

The Default Rules

There are many default rules included in 19 categories, plus one category of samples for custom rules. These rules perform all sorts of checks:

  • about the architecture of the software (e.g., Avoid partitioning the code base through many small library Assemblies)
  • issues related to code smell (e.g., Avoid methods potentially poorly commented)
  • the correct usage of some .NET functions (e.g., Remove calls to GC.Collect(), but if you must use it at least Don’t call GC.Collect() without calling GC.WaitForPendingFinalizers())
  • …and many others

The rules also vary in their concerns. That is to say, some of the rules help enforce well-known design principles (e.g., rules about using the lowest visibility possible to increase encapsulation), while others deal with simpler issues like naming (e.g., you should not name arbitrary methods Dispose; that name should be reserved for types that implement System.IDisposable).

They are all listed in the documentation. Both in the documentation and in Visual Studio you can find a description and how to fix the corresponding issue. The online documentation also shows how each rule is implemented.

The positive thing is that there are few false positives, i.e., the kind of rules that are always matched. False positives are an issue because if all the code is reported as bad, but it actually keeps working fine, maybe the code is right and the metric is wrong. In reality things are usually more nuanced: you should not do certain things except when you have no other choice. Simpler static analysis tools cannot understand that, and if they show thousands of false positive errors you might as well check the code manually.

A Few Issues

NDepend generally does well in that regard, so there are few false positives. But there are a few issues. The most common one is the rule named Avoid methods with too many parameters, whose matches invariably end up being a long list of constructors. Not all constructors need many parameters, but when they do there is often no good alternative.

Rules about avoiding the Singleton pattern at all costs, or about types with too many lines of code, can be considered correct according to current best practices. However, some people might disagree, and if you think there are nuances, you will have to modify the rules and add the nuances yourself.

There are also rules that simply measure things: statistics concerning the evolution of the project (e.g., how many issues you have solved) or certain properties of the code (e.g., the most used methods).

Code Query LINQ

The rules are implemented using a language called Code Query LINQ (CQLinq). It is very similar to LINQ and thus quite easy to grasp for the average .NET programmer.

The language allows you to create rules that query:

  • code metrics
  • statistics about the evolution of the project (e.g., issues solved, code diff, code coverage)
  • code object model (i.e., the organization of the code: the fields and methods that each class contains, the classes that each namespace contains, etc.)
  • mutability of types
  • basic info about source code (e.g., names of things, source code paths)
  • code dependencies (i.e., which methods are called by each method, which assemblies are referenced by each project, etc.)
  • quality gates
  • technical debt

Let us concentrate on three things, the first of which is code coverage. The tool does not actually perform code coverage checks, but it can make the data generated by a third-party code coverage tool available through CQLinq. The tools listed in the documentation are NCover, JetBrains DotCover and, obviously, Visual Studio. Note that Visual Studio provides code coverage only in the Enterprise edition.

Quality gates fundamentally combine the results of rules to produce a status for the project (pass, warn, fail). They are useful because they can report the high-level status of the project and the larger problems that arise from the combination of many issues. For instance, if the rules together generate 10 critical issues you might want to see that instantly, or throw a big red fail in front of the developer.

They can be used in the Visual Studio plugin, but in that case they can only inform the developer of a problem. They are more useful in the build process, to automatically stop it or perform some action, like warning the developer responsible for a section of code or even somebody higher up in the hierarchy.

Measuring Technical Debt

Probably the most important metric is the last one: technical debt. Each rule defines something to check, but also how much time it would take to fix the underlying issue: this is the technical debt. It also defines how much time would be lost each year if the issue is left unfixed: this is the annual interest. The standard rules define these values, and you would have to do the same for your own custom rules.

These values can be combined to obtain an important piece of business information: how much time would pass before the time lost because of the issue reaches the cost of fixing it. This is called the breaking point in the documentation. For example, if an issue costs 1 day to fix (technical debt) and causes 6 hours of wasted time each year (annual interest), it would take 4 years before the time lost because of it reaches the cost of fixing it.
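
In other words: breaking point = technical debt / annual interest (the 4-year figure of the example works out if the day is counted as 24 hours: 24 h / 6 h per year = 4 years).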

NDepend debt settings

There are also a few settings related to technical debt that you can change: they concern the hours of work per day and the hourly cost of a developer. Related settings are the ones involved in the SQALE method, which is a standardized way to assess the business impact of technical debt. NDepend implements only part of said method.

The SQALE Debt Ratio expresses the effort needed to resolve the technical debt compared to the time it would take to rewrite the code from scratch. The time it would take to rewrite the code from scratch is based on the setting called Estimated number of man-days to develop 1000 logical lines and on the number of logical LOCs in the project.
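
Summarizing the computation: Debt Ratio = (effort to fix all issues) / (estimated effort to rewrite from scratch), where the rewrite effort is (logical LOCs / 1000) × the man-days-per-1000-lines setting.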

NDepend dashboard

For example, to solve all the issues of the project in this image it would take 14.39% of the time that it would take to rewrite the code from scratch.

It is obvious that by changing the setting about the time it takes to write 1000 lines of code, you can get a very different debt ratio. For instance, if the value of Estimated number of man-days to develop 1000 logical lines was set to 50, the project would have a debt of 5.18% and an A rating.
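
The numbers are consistent: the debt itself does not change, only the estimated rewrite effort grows, so 14.39% × 18 / 50 ≈ 5.18%.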

NDepend dashboard with a different setting

So it is extremely important to note that the default setting (18 days) is just an estimate. You would have to set the proper value for your business.

The default value is 18 man-days which represents an average of 55 new logical lines of code, 100% covered by unit-tests, written per day, per developer.

Note that your estimate should include the time it takes to create both the code and the corresponding unit tests to check its correct behavior.

Custom CQLinq Rules

You can use CQLinq to create custom rules. This can be done in several ways, for instance in special files or inside the source code with annotations. The annotations can also be used to target whole sections of code: you can put an annotation on each class you want to treat together and then write one rule to target all of them. You can even try the rules live and have them applied automatically, for easy testing.

They are very easy to use, and the autocompletion support makes it as easy as writing C# code. Provided, of course, that you already have a clear idea of what you want to do.

For example, let's say that you like the aforementioned Avoid methods with too many parameters, but you want to exclude constructors and initializers from the check. You could modify the original rule directly, just by clicking on it and editing the CQLinq code.

Custom CQLinq Rule in NDepend

In the image you can see that this rule matches 11 methods.

That is probably the correct approach if you are just changing a configuration parameter. In this case, the threshold for too many parameters is defined as 7 or more, so you could change it to 8 or 9. Since we want to change the meaning of the rule, we opt instead for disabling the current one and creating a new one.

We called it Avoid methods with too many parameters unless they are constructors.

All that we have changed is on line 3: we added a further check to exclude constructors and functions with the name Init.
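
The resulting rule looks roughly like this (a sketch, not the exact code shown in the screenshot; property names such as NbParameters, IsConstructor and SimpleName come from the CQLinq code model and should be verified against the documentation):

    // Hypothetical sketch: too many parameters, ignoring constructors and Init methods
    warnif count > 0
    from m in JustMyCode.Methods
    where m.NbParameters >= 7
       && !m.IsConstructor
       && m.SimpleName != "Init"
    orderby m.NbParameters descending
    select new { m, m.NbParameters }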

Custom CQLinq Rule Matches

Now there is only one match.

CQLinq is powerful enough that you could write sophisticated checks. For example, at first glance it seems that you could implement a check for something like the Law of Demeter.

Visualizing Code

The metrics can also power useful visualizations of the analyzed code.

Architectural Graphs

Several of these visualizations are graphs representing the structure of the code. The most used ones are probably dependency graphs, which show the relationships between assemblies. They are mostly relevant for understanding complex projects, because contemporary infrastructure allows normal projects (e.g., libraries) to be neatly and easily organized. For example, this is a dependency graph for AngleSharp, a very successful HTML parsing library.

Angle Sharp Dependencies Diagram

 

As you can see, there are very few complications in a typical standalone library. Actually, if you analyze a few important C# projects on GitHub you will be hard pressed to find something more complicated than this.

Of course there are a lot of complex applications out there, but it might be rare to have to deal with many of them. At worst you are probably going to work with one big project, unless you are in a position where you are responsible for many such projects.

What you are going to use more frequently is the Internal Dependencies Graph, which can really help you become familiar with how a class works.

Internal dependencies graph for the Browsing Context class

As you can see the visualization is also interactive and can give you direct access to the information on the underlying analysis.

Dependency Structure Matrix

A Dependency Structure Matrix (DSM) is a compact, tabular representation of the same information displayed in the kind of graph we have just seen. The two representations have different advantages:

  • a graph is easier to understand, but is lighter on the details and does not scale well with many nodes and edges
  • a matrix is harder to parse, but gives more data and it does it in a precise way

AngleSharp Dependency Structure Matrix

This (partial) matrix corresponds to the first graph we showed in this section. Context-sensitive information is available in case you need it. However, the matrix is fairly simple to understand:

  • the row indicates the assembly that is used
  • the column indicates the assembly that is using it
  • the number represents the members of the source assembly involved in the relationship

In short, the graph is useful to understand at a glance how things work, while the matrix is useful to detect patterns in the code. If you have to do a manual analysis of the results, you are going to use the matrix. You might need to do that, for example, to understand how costly it would be to replace a certain library.

A General Overview Of The Code

Another kind of visualization that NDepend can generate for you is a treemap. A treemap is a way to visualize a tree structure by using nested rectangles. The tree represents the usual structure of a program: assemblies contain namespaces, namespaces contain classes, etc. The size of each rectangle is proportional to the LOCs of the corresponding element. A treemap can also use color to convey information.

AngleSharp treemap visualization

In the HTML report, by default, this information is the percentage of code coverage of the visualized element. The alternative is cyclomatic complexity, if code coverage information is absent. In the NDepend plugin this can be customized to a few different values.

This visualization is called Treemap metric view in the HTML report, but Code Metrics View in the interactive representation accessible through the plugin.

If you keep all the generated versions of the treemap images you can see the evolution of the code coverage and of the size of each section of the code. Coupled with the information obtained from other tools, you can get a better sense of how to attack a particular issue and how the solution is progressing.

For example, imagine that you are tasked with adding unit testing to an old project with zero code coverage and you want to have the most impact. If you were starting without any tool you would probably begin with the most important (i.e., largest) section of the code. But using the DSM you discover that a certain small namespace actually contains a good part of the foundation of the library. So you start from there.

AngleSharp Abstractness vs Instability

NDepend can also create a nice visualization of Abstractness vs. Instability, which is helpful to identify long-term problems in the relationships between assemblies.

Reports

We have already mentioned the fact that NDepend can generate HTML reports. These have the advantage of being more accessible to everybody, both because they do not require Visual Studio and because they are simplified representations of the static analysis results.

The most important drawback is that you lose the interactive component and the connection to the source code. That is to say, with NDepend clicking on something usually gives you the chance to go directly to the corresponding source code for further analysis. You obviously cannot do that in the report.

You cannot customize the report interactively, but you can use XSL stylesheets to customize it before it is generated.

Miscellaneous Features

There are two miscellaneous features included with NDepend that can be very useful to get a grip on a project: code diff and advanced search.

The first one is in part a legacy of the long history of NDepend, from when there were no good diff tools for Visual Studio. But it is still quite useful in the context of analyzing issues. For example, it can show the differences between assemblies, relying on the .NET Reflector decompiler. It can also be combined with powerful CQLinq queries to find, for example, the methods changed since the last baseline.

In a similar fashion, you can use complex search queries based on the value of metrics like efferent or afferent coupling, type or method rank, etc. Basically you can leverage the code metrics to find all the code relevant to an issue or a design problem.

There is a third feature that could be very important: the NDepend API. This makes it possible to integrate the tool with whatever you are using, including custom software. For example, you could use the information provided by the simpler NDepend rules to do some quick automated refactoring.

NDepend And Your Business

There are a few tasks for which static analysis is indispensable, for example when a software architect has to review the code of an outsourced project or judge the status of an old one. You cannot easily walk through tens of thousands of lines of code manually.

More importantly, it is harder to point clearly at problems and potential solutions without a precise tool. You cannot just say that a lot of functions are really complicated: that is not easily actionable. But you can say that 53 functions have a high cyclomatic complexity and must be fixed, and then you can also suggest where to alter the design.

NDepend can do that, but it can also do something different. While traditional static analysis tools help you enforce style and prevent bugs, NDepend helps you with maintaining education and with simplifying the maintenance of software. NDepend can certainly be used to help a developer directly, but it is a tool best used by whoever designs software in your organization.

Note, though, that we said maintaining education, not actually educating developers. That is because you can simply force developers to avoid null problems, but you cannot block builds for design problems without explaining them first. You would look like a vengeful god speaking harsh judgements. More prosaically, some of your developers might not know how to avoid the design problems found by NDepend. After all, that is why you need a software architect to begin with.

Using Technical Debt To Drive Business Decisions

It is obviously a bad practice to use just one metric to measure the quality of someone's work. Steve Ballmer famously criticized IBM for trying to link KLOCs (thousands of lines of code) to the value contributed to software. And he was right.

On the other hand, sometimes the benefits of having one definitive measure outweigh the drawbacks linked to that simplification. In this case, the measure can be technical debt, which is well defined by standard rules. If this measurement is used with a grain of salt and built correctly, it can be of great value.

It can be particularly useful because it can be understood by many parties, such as developers and management. In this case, the hard work is done by the creator of the rules, who must have the necessary expertise to calculate a correct technical debt and annual interest.

Depending on what you want to do, even an approximate measure can be useful. For instance, it can certainly help prioritize bug fixing to maximize business value. In other words, the breaking point can be used to calculate the ROI of fixing issues. For most small things there is really no objective way to estimate this manually. So, as long as the measure is broadly correct, small problems can be solved in the optimal order, saving some days of work in a year, which can lead to meaningful savings.

It is a risky measure to adopt in a somewhat adversarial relationship. For example, it is probably a bad idea to link a low technical debt to a performance bonus with an outsourcing party, because they may try to just make the number go down, without fixing the real issues.

NDepend For Consultants

When we interviewed Erik Dietrich he said that static analysis can be helpful in analyzing open source projects. It can help limit the problems due to a lack of documentation. You run an analysis and get an overview of how the code works. This can be particularly useful if you just need to patch a bug in a library but the fix requires you to understand the inner relationships of the software.

Sometimes you are hired to analyze a codebase and write a report about it. You will still have to write the report by hand yourself, but the reporting features of NDepend can help you augment it with nice and useful visualizations. Furthermore, you can synthesize your expertise into a set of custom rules that can simplify your work. In theory, you can create a custom visualization that, together with your custom rules, produces a list of precise, actionable steps your client can take.

NDepend cannot become your only static analysis tool, because it does not spot simple errors in the code. For example, we once saw a codebase full of variable declarations immediately followed by a separate initialization of the same variable. NDepend cannot detect that, and it certainly cannot help in automatically refactoring the code.

Summary

NDepend is a tool that can be really helpful to software architects in the course of their job. It can help in tasks such as reviewing other people's code and keeping the quality of your project high.

It can be less useful to individual developers. That is mostly because they are probably working on a few projects that they already know very well, so they can develop design insight by themselves. Even in that case, though, it might help you deal with management, because it gives you numbers that the business can understand.

Note: a license was given to us free of charge; however, this had no influence on our opinion.

Recognize patterns in Java code to understand and transform applications

Code is arguably the most valuable asset for many organizations.

However the value trapped in the code is not easy to use.

Why?

In this article we are going to see how we can extract and process knowledge in code by specifying patterns. We will see:

  1. A clear example of what we mean by extracting knowledge from code
  2. An explanation on how to implement this approach in practice (with code available on GitHub)
  3. A discussion of why we want to extract knowledge from code

What do you mean? Show it to me

Let’s see what we mean in practice, with the simplest example we can come up with. And let’s see it working for real, on real code.

Of course recognizing properties is a very simple example, something we are all familiar with, but the same idea can be applied to more complex patterns.

There are many patterns that can be used:

  • patterns typical of a language: think about the for loops to iterate over collections
  • design patterns: singleton, observer, delegate to name just a few
  • patterns related to frameworks: think about all the applications based on MVC or the DAO defined to access database tables
  • project specific patterns: for example in JavaParser we use the same structure for the tens of AST classes we defined

Patterns can concern small pieces of a single method or the organization of entire applications.

Patterns can also be applied incrementally: we can start by recognizing smaller patterns, making the application simpler, and then recognize more complex patterns on the simplified application.

How knowledge ends up being trapped in code

Developers use a lot of idiomatic small patterns when writing programs. Experienced developers apply them automatically while implementing their solutions. Slowly they build these large applications that contain a lot of knowledge, enshrined in the code.

The problem is that the initial idea is not stated explicitly in the code. The developer translates it into code by using some of the idiomatic patterns typical of the language, linked to the framework being used, or specific to their organization.

Somewhere a developer is thinking "this kind of entity has this property" and translating that into a field declaration, a getter and a setter, a new parameter in a constructor, and some additional lines in the equals and hashCode methods. The idea of the property is not explicitly present in the code: you can see it if you are familiar with the programming language, but it requires some work.
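
To make this concrete, here is a minimal, purely illustrative example of how the single idea "a Person has a name" gets scattered across a Java class:

    public class Person {
        private String name;               // the field

        public Person(String name) {       // the constructor parameter...
            this.name = name;              // ...and its assignment
        }

        public String getName() {          // the getter
            return name;
        }

        public void setName(String name) { // the setter
            this.name = name;
        }

        // ...plus one more line in equals() and in hashCode(), if the class defines them
    }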

There is so much noise, so many technical details that obscure the intention.

But this is not true just for Java or for properties: when a developer determines that an object must be unique he could decide to implement a singleton, and again this means following certain steps. Or maybe he decides to implement a view, or a Data Access Object (DAO), or any other typical component required by the framework he is using. In any case, all the knowledge he had in mind ends up scattered in the code, difficult to retrieve.

Why is this bad?

For two reasons:

  • it is difficult to see the knowledge.
  • it is difficult to reuse that knowledge.

What does it mean that it is difficult to see the knowledge in the code?

A lot of work goes into understanding a problem and building a representation of the solution in code. There are a lot of micro-decisions and a lot of learning involved in the process. All of this effort leads to knowledge.

Where does this knowledge go?

This knowledge is typically not written down directly; it is instead represented in code. The fact is that on many occasions this knowledge is translated into code in a mechanical way, so the translation can be reversed. The problem is that to see the knowledge present in the code you need to look directly at the code, read it carefully and mentally reconstruct the knowledge that is there.

To understand things:

  • we need to understand programming and the typical patterns we use
  • it requires work to check the details
  • the process is done mentally: the result is just in our head and cannot be processed automatically

If the knowledge were represented directly, at a higher level of abstraction, it would be easier to check and it would be accessible to more people.

What does it mean that it is difficult to reuse the knowledge in the code?

We have seen that basically the only way we have to extract knowledge from code is reading it and understanding it. The results stay just in our head, so they are not usable by a machine. If instead we had a representation of the abstract knowledge, of the original intentions, we could process it for different goals.

We could for example use that knowledge to generate diagrams or reports.

Do you want to know how many views have been written? How many tables do we have in the database? Easy! Project managers could get their answers without having to ask, and they would always get updated, honest answers about the state of the code.

We could also use that information to re-engineer applications, even partially. Some aspects of your application could be migrated to a different version of a library, to a different framework, or even to a different language. This does not mean that complex migrations could be performed completely automatically, but it could be a start.

Implementation using the Whole Platform

Ok, we have talked about the problem, let’s now talk about a solution we can build today.

We have previously discussed how to build grammars using the Whole Platform, and we have also seen it when looking into Domain Specific Languages.

The code is available on GitHub, courtesy of Riccardo Solmi, the author of the Whole Platform.

1) Defining the higher level concepts

First of all we need to define the higher level concepts that we have in mind but that are not expressed explicitly in the code. For example, the concept of a property of a Java bean.

In the model-driven parlance we define the metamodel: i.e., the structure of those concepts.

2) Define how to recognize such concepts in Java code

Once we have defined those concepts, we need to specify how to identify them in the code. The Whole Platform uses a Domain Specific Language to specify patterns. It looks like this:

What are we saying here?

We are saying that a certain pattern should be looked for in the selected node and all its descendants. The pattern should match the given snippet: a field declaration with the private modifier, associating the label type with the type of the field and the label name with the name of the field.

What should happen when we recognize this pattern?

We should:

  1. Remove the corresponding getter. We will match a method with the expected name (calculated by the getterName function), the expected type, taking no parameters and returning a field with the expected name
  2. Remove the corresponding setter. It should be a method returning void, with the expected name, the expected parameter and assigning the parameter to the field with the expected name
  3. Replace the Java field with the higher level concept representing the whole property (here it is named Field, but Property would have been a better name)

Now, all we need to do is add this action to the contextual menu, under the group name Refactor. We can do that in Whole by defining actions.

Voilà! We now have the magic power of recognizing patterns in any piece of Java code. As easy as that.

3) Doing the opposite: expand the concepts

So far we have discussed how to recognize patterns in code and map them to higher level concepts.

However, we can also do the opposite: we can expand higher level concepts into the corresponding code. In the Whole Platform we can do this with this code:

Let’s focus exclusively on the property definition. Each instance is expanded by:

  1. Adding a field declaration
  2. Adding a getter
  3. Adding a setter
  4. Adding a parameter to the first constructor
  5. Adding an assignment in the first constructor. The assignment takes the value of the added parameter and assigns it to the corresponding field

Now, the way we recognize the pattern and the way we reverse it do not match 100% in this example; bear with us over this little discrepancy. We just wanted to show you two different ways to look at properties in Java.

What this approach could be used for

We see this approach being useful for three goals:

  • Running queries to answer specific questions on code
  • Understanding applications
  • Transforming or re-engineering applications

Queries

We could define patterns for all sorts of things. For example, patterns could be defined to recognize the views defined in our code. We could then imagine running queries to identify those patterns. These queries could be used by project managers or other stakeholders involved in the project to examine the progress of the project itself.

They could also be used by developers to navigate the code and become familiar with complex codebases. Do we need to identify all the observers or all the singletons in a legacy codebase? Just run a query!

Understanding applications

The fact is that programming languages tend to be very low level and very detailed. The amount of boilerplate code varies between languages, sure, but it is always there.

Now, one of the problems is that the sheer amount of code can hide things and make them difficult to notice. Imagine reading ten Java beans in a row. Among them there is one that implements the equals method slightly differently from what you expect: for some reason it ignores one field, or compares one field using identity instead of equality. This is a detail that has a meaning, but one that you would very probably miss while looking at the code.

Why is that?

This happens because after looking at a large amount of code and expecting to see certain patterns you become blind to those patterns. You stop reading them without even noticing it.

By recognizing patterns automatically (and precisely) we can represent higher level concepts and easily spot things that do not fit in common patterns, like slight differences in equals methods.

We have seen how to recognize bean properties, but we can go further and recognize whole beans. Consider this example.

This shows only the relevant information. All the redundant information is gone. This makes things obvious.

We can recognize increasingly complex patterns by proceeding incrementally. In this way exceptions pop up and do not go unnoticed.

Transforming applications

When we write code we translate some higher level ideas into code, following an approach that depends on the technology we have chosen. Over time the best technology can change even if the idea stays the same. We still want the same views, but maybe the way to define them has changed in the new version of our web framework. Maybe we still want to use singletons, but we have decided it is better to provide the instance through a public static method with lazy initialization instead of a public static field.
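
For instance, the same "this object is unique" idea could be re-targeted from one idiom to the other (a simplified illustration using a hypothetical Registry class; the two versions below are alternatives, not meant to coexist):

    // Old idiom: the instance exposed through a public static field
    public class Registry {
        public static final Registry INSTANCE = new Registry();
        private Registry() {}
    }

    // New idiom: a public static method with lazy initialization
    public class Registry {
        private static Registry instance;
        private Registry() {}

        public static synchronized Registry getInstance() {
            if (instance == null) {
                instance = new Registry();
            }
            return instance;
        }
    }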

By identifying higher level concepts we could decide to translate them differently, generating different code. This process is called re-engineering, and we can perform it automatically to some extent. It seems a good idea to me, and it is another advantage of using patterns to identify higher level concepts.

Summary

Code has incredible value for organizations because it captures a lot of knowledge in a form that is executable. As we evolve our applications and cover more corner cases, we improve our knowledge and our code. After years of development a codebase often becomes invaluable for the organization owning it. However, that value is frozen, so to speak: there is not much we can do with it. It is even difficult to understand exactly how much information there is in the code.

We think that one approach to extracting knowledge from the code is proceeding bottom up: recognizing the small abstractions and composing them, step by step, until we recognize larger structures and patterns and can easily represent the big picture hidden in our application.

Using the Whole Platform is invaluable for these experiments.

This article was written following a visit by Riccardo Solmi, the author of the Whole Platform. I would like to thank him for building this great product, for sharing his ideas, and for writing the code used in this article. The code is available on GitHub.

Translate Javascript to C#

Problems and Strategies to Port from Javascript to C#

Let’s say you need to automatically port some code from one language to another, how are going to do it? Is it even possible? Maybe you have already seen a conversion between similar languages, such as Java to C#. That sounds much simpler in comparison.

In this article we are going to discuss some strategies to translate Javascript to a very different language, such as C#. We will discuss the issues with that and plan some possible solutions. We will not arrive to writing code: that would be far too complicate for an introduction to the topic. Let’s avoid putting together something terribly hacky just for the sake of typing some code.

Having said that, we are going to see all the problems you may find in converting one real Javascript project: fuzzysearch, a tiny but very successful library to calculate the difference between two strings, in the context of spelling correction.

When it’s worth the effort

First of all you should ask yourself if the conversion is worth the effort. Even if you were able to successfully obtain some runnable C#, you have to consider that the style and the architecture will probably be “unnatural”. As a consequence the project could be harder to maintain than if you wrote it from scratch in C#.

This is a common problem even in carefully planned conversions, such as the one that originated Lucene.net, which started as a conversion from Java to C#. Furthermore, you will not be able to use it without manual work for every specific project, because even the standard libraries are just different. Look at the example: while you could simply capitalize length in haystack.length, you cannot just capitalize charCodeAt; you will have to map different functions between the source and destination language.

On the other hand all languages have areas of specialization which may interest you, such as Natural Language Processing in Python. And if you accept the fact that you will have to do some manual work, and you are very interested in one project, creating an automatic conversion will give you a huge head start. Though if you are interested in having a generic tool you may want to concentrate on small libraries, such as the Javascript one in the example.

Parse with ANTLR

The first step is parsing, and for that you should just use ANTLR. There are already many grammars available: they may not necessarily be up-to-date, but they are much better than starting from scratch and they will give you an idea of the scale of the project. You should use visitors, instead of listeners, because they allow you to control the flow more easily. You should parse the different elements into custom classes that can manage the small problems that arise. Once you have done this, generating C# should be easier.

The small differences

There are things that you could just skip, such as the first and last lines: they most probably don’t apply to your C# project. But you must pay attention to the small differences: the var keyword has a different meaning in Javascript and C#. By coincidence it would work most of the time, and it would be quite useful to work around the lack of strict typing in Javascript. But it’s not magic: you are just hoping that the compiler will figure it out. And sometimes it’s not a one-to-one conversion. For instance you can’t use it in C# in the way it’s used in the initialization of the for loop.

The continue with the outer label should be transformed into a goto, but when continue appears alone it works just as in C#. A difference that can be fixed quite brutally is the strict equality comparison “===/!==”, which can be replaced with “==/!=” in most cases, since it relates to problems due to the dynamic typing of Javascript. In general you can do a pre-parse check and transform the original source code to avoid some problems, or even comment out the things that cannot be easily managed.

I present you thy enemy: dynamic typing

The real problem is that Javascript uses dynamic typing while C# uses static typing. In Javascript anything could be anything, which leads to certain issues, such as the aforementioned strict equality operator, but it’s very easy to use. In C# you need to know the type of your variables, because there are checks to be made. And this information is simply not available in the Javascript source code. You might think that you could just use the var keyword, but you can’t: the compiler must be able to determine the real type at compile time, something that will not always be possible. For example you cannot use it when declaring function arguments.

You can use the dynamic keyword, which makes the type be determined at execution time. Still this doesn’t fix all the problems, such as initialization. You may check the source code for literal initialization or, in theory, even execute the original Javascript from C# and find a way to determine the correct type. But that would be quite convoluted. You might get lucky, and in a small project, such as our example, you will, but not always.

There are also problems that are easier to manage than you might imagine. For instance, assigning a function to a variable is not something that you usually do as explicitly in C# as you do in Javascript. But it’s easy using delegate types and constructs such as Func. Of course you still have to determine the correct types of the arguments, if any are present, but it doesn’t add any other difficulty per se.

Not everything is an object and other issues

In Javascript “string” is a string, but not an object, while in C# everything is an object, with no exceptions. This is a relevant issue, but it’s less problematic than dynamic typing. For instance, to convert our example we just have to wrap the function in a custom class, which is not really hard. One obvious problem is that there are different libraries in different languages; some will not be available in the destination language. On the other hand some parts of the project might not be needed in the destination language, because there are already better alternatives. Of course you still have to actually change all the related code, or wrap the real library of the destination language in a custom class that mimics the original one.

Conclusion

There are indeed major difficulties, even for small projects, in transforming code from one language to another, especially when they are as different as Javascript and C#. But let’s imagine that you are interested in something very specific, such as a very successful library and its plugins. You want to port the main library and to give the developers of the plugins a simpler way to port their work. There are probably many similarities in the code, so you can do most of the work to manage the typical problems and provide guidance for the remaining ones.

Converting code between languages so different in nature is certainly not easy, but you can apply a mixed automatic/manual approach: convert a large amount of code automatically and fix the corner cases manually. If you can also translate the tests, maybe you can later refactor the code, once it is in C#, and over time improve its quality.

Code Generation with Roslyn: a Skeleton Class from UML

Get the source code for this article on GitHub

We have already seen some examples of transformation and analysis of C# code with Roslyn. Now we are going to see how to create a more complex example of code generation with Roslyn, together with parsing using Sprache. We are going to create a skeleton class from a PlantUML file. In short, we are doing the inverse of what we have done before. The first step is to parse, of course.

As you can see, there are four entities in this example: PlantUML start and end tags and class, variable and function declarations.

Parsing all the things

We are going to parse the file line by line instead of doing it in one big sweep. This is in part because of the limitations of Sprache, but also because it’s easier to correctly parse one thing at a time instead of trying to get everything right in one go.

With CharExcept we are parsing all characters except for the one(s) indicated, which is a handy but imprecise way to collect all the text for an identifier. The roughness of this process is obvious, because we are forced to exclude all the characters that come after an identifier. If you look at the .plantuml file, at the beginning of the article, you see that there is a space after the field names, a ‘}’ after the modifier static, a ‘:’ after the argument, to divide the identifier from its type, and finally the closing parenthesis after the type. You might say that we should simply have checked for “Letters”, which would work in this specific case, but would exclude names that are legal C# identifiers.

The Modifier parser is quite uninteresting, except for lines 6 and 11, where we run into the same problem just mentioned to identify the correct name. The last case refers to something that doesn’t happen in this example, but could happen in other UML diagrams: override modifiers. The real deal is in lines 18 and 22, where we see the Ref parser, which is used, as the documentation says, to: “Refer to another parser indirectly. This allows circular compile-time dependency between parsers”. DelimitedBy is used to select many of the same items delimited by the specified rule, and finally Optional marks a rule that isn’t necessary for a correct parse, but might appear. Since the rule is optional, the value could be undefined and it must be accessed using the method shown on line 22. The rule Method is slightly more complicated, but it uses the same methods. In case you are wondering, methods without a return type are constructors.

Parsing line by line

We can see our parser at work in the main method, where we try to parse every line with every parser and, if successful, we add the value to a custom type that we are going to see later. We need a custom type because code generation requires having all the elements in their place: we can’t do it line by line, at least not if we want to use the Roslyn formatter. We could just take the information and print it ourselves, which is good enough for a small project, but complicated for larger ones. Also, we would miss all the nice automatic options for formatting. On line 13 we skip an iteration if we found a method, because methods could also be parsed, improperly, as fields, so to avoid that risk we jump over them.

Code Generation

If you remember the first lessons about Roslyn, it’s quite verbose, because it’s very powerful. You also have to remember that we can’t modify nodes, not even the ones we create ourselves and that are not, say, parsed from a file. Once you get used to using SyntaxFactory for everything, it’s all quite obvious: you just have to find the correct methods. The using directives are simply the ones usually inserted by default by Visual Studio.

Generation of methods

Let’s start by saying that Declarations and DeclarationType are fields in our custom class, which is not shown here, but you can look at it in the source code. Then we proceed to generate the methods of our skeleton C# class. MethodDeclaration allows us to choose the name and the return type of the method itself; mods refers to the modifiers, which obviously could be more than one, and so they are in a list. Then we create the parameters, which in our case need only a name and a type.

We choose to throw an exception, since we obviously cannot determine the body of the methods from the UML diagram alone. So we create a throw statement and a new object of the type NotImplementedException. This also allows us to add a meaningful body to the method. You should add a body in any case, if you use the formatter, because otherwise it will not create a correct method: there will be no body, not even the curly braces.

Generation of fields

The “field” case is easier than the “method” one and the only real new thing is on line 12, where we use a method to parse the type from a string filled by our parser.

The end of the Generate method is where we add the class created by the for loop, and use the Formatter. Notice that cu is the CompilationUnitSyntax that we created at the beginning of this method.

Limitations of this example

The unit tests are not shown because they don’t contain anything worth noting, although I have to say that Sprache is really easy to test, which is a great thing. If you run the program you will find that the generated code is correct, but it’s still missing something. It lacks some of the necessary using directives, because we can’t detect them starting just from the UML diagram. In a real life scenario, with many files and classes and without the original source code, you might identify the assemblies beforehand and then use reflection to find their namespace(s). Also, we obviously don’t implement many things that PlantUML has, such as the relationships between classes, so keep that in mind.

Conclusions

Code generation with Roslyn is not hard, but it requires knowing exactly what you are doing. It’s better to have an idea of the code you are generating beforehand, or you will have to take into account every possible case, which would make every little step hard to accomplish. I think it works best for specific scenarios and short pieces of code, for which it can become very useful. In such cases, you can create tools that are useful and productive for your project, or for yourself, in a very short period of time and benefit from them, as long as you don’t change tools or work habits. For instance, if you are a professor, you could create an automatic code generator to translate your pseudo-code of a short algorithm into real C#. If you think about it, this complexity is a good thing: otherwise, if anybody could generate whole programs from scratch, we programmers would lose our jobs.

You might think that using Sprache for such a project was a bad idea, but it’s actually a good tool for parsing single lines. And while there are limitations, this approach makes it much easier to get something working in little time, instead of having to create a complete grammar for a “real” parser. For the cases in which code generation is most useful, specific scenarios and such, this is actually the best approach, in my opinion, since it allows you to easily pick and choose which parts to use and just skip the rest.

Code Climate: a Service for Static Analysis

GitHub has been a revolution for developers. You could consider SourceForge a predecessor, in the sense that it also let people share code. But GitHub is not simply a place where you can download programs: it’s mainly a platform for developers. One thing that GitHub has brought is integrations, of which there are many. The most famous is probably Travis and, of course, there are many other services for continuous integration. There are also other integrations, such as the ones with messaging apps, which are useful, but less obvious. Today we talk about a code-related but less known one: Code Climate, a service for static analysis.

Static analysis as a service

CodeClimate integration in GitHub

We have mentioned static analysis when we talked about Roslyn, but this service delivers it. Although it doesn’t cover C#, which should probably be a crime. Following the lead of GitHub, it’s also free for open source, but it costs for businesses. This is a successful model that is also followed by Travis and AppVeyor, which have become synonymous with CI on Linux and Windows, and, among other things, by Walkmod, another tool that can modify your code to your liking.

You can hook the service to your pull requests, but there is also a local version that uses a Docker image of the CodeClimate CLI, which is great if you don’t want to wait for pushes to check your code. You can also easily integrate the JSON output with other tools you use. On the other hand, if you find a use for such a service you don’t really want to manage it yourself, and you really want to check everybody’s code, not just yours. After all, they created it for GitHub, where developers work together.

It seems to hate Javascript developers

An example of a problem found by the Code Climate analysis

The service itself seems to rely on libraries developed by other people. There is nothing wrong per se in that, of course: there is tremendous value in offering a service. And leveraging the work of everybody allows them to check for everything from bugs, security risks, excessive complexity and style issues to basically everything you can imagine; they can also check Chef cookbooks and bash scripts. People that have a not-invented-here attitude can’t do that. Though the drawback is that there is a lack of uniformity in analysis and judgement between different languages. If you look at their Open Source section, it seems that every Javascript project is judged harshly: Bootstrap, jQuery, node, etc. Of course I haven’t checked everything, but the only one that seems to be good is D3. If you look at Ruby instead, at projects like Jekyll or Rails, the picture is much kinder.

It may be that Javascript developers are all terrible, but that seems unlikely; more probably there are differences between the tools that focus on Javascript and the ones for other kinds of languages. There might be legitimate concerns, maybe because Javascript is used in a different way. Rules exist to help you create good software, but if the best programmers ignore them, are they actually good rules? Are they rules that can be followed without killing productivity? While this is a great way to humble everybody, the risk is that many warnings will soon become ignored warnings: after all, if everything sucks, why bother? We all know the story of The Boy Who Cried Wolf. So you have to be careful and take your time to configure it in a way that works for your project.

Conclusion

If you are searching for a tool to verify the quality of your code you might want to use CodeClimate, especially if you are working on open source software with many other people. Although if you use Javascript be aware that it might tell you that everything you do is wrong.

Extracting JavaDoc documentation from source files using JavaParser

A lot of people are using JavaParser for many different purposes. One of these is extracting documentation. In this short post we will see how you can print all the JavaDoc comments associated with classes or interfaces.

Code is available on GitHub: https://github.com/ftomassetti/javadoc-extractor

Getting all the Javadoc comments for classes

We are reusing DirExplorer, a supporting class presented in the introduction to JavaParser. This class permits processing a directory recursively, parsing all the Java files it contains.

We can start by iterating over all the classes and finding the associated Javadoc comments.
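A minimal sketch of how this can look with a recent JavaParser release (the original article used the DirExplorer helper and an older API; here a plain directory walk and the sourceDir path are assumptions for illustration):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.ClassOrInterfaceDeclaration;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ClassJavadocPrinter {
    public static void main(String[] args) throws IOException {
        Path sourceDir = Paths.get("src/main/java"); // assumption: the project source root
        try (Stream<Path> paths = Files.walk(sourceDir)) {
            paths.filter(p -> p.toString().endsWith(".java")).forEach(path -> {
                try {
                    CompilationUnit cu = StaticJavaParser.parse(path);
                    // For every class or interface, print its name and its Javadoc, if any
                    cu.findAll(ClassOrInterfaceDeclaration.class).forEach(decl ->
                            decl.getJavadocComment().ifPresent(javadoc ->
                                    System.out.println(decl.getNameAsString() + ":\n" + javadoc.getContent())));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }
}
```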

As you can see, getting the JavaDoc comments is fairly easy: running this on a project prints the Javadoc of each class or interface.

Getting all the Javadoc comments and finding the documented elements

In other cases we may want to start by collecting all the Javadoc comments and then find the elements they document. We can also do that easily with JavaParser:
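A possible sketch with a recent JavaParser API (the describe helper mentioned below is not reproduced here, and the file path is a hypothetical example; this only shows the comment-to-node direction):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.comments.JavadocComment;

import java.io.IOException;
import java.nio.file.Paths;

public class JavadocCollector {
    public static void main(String[] args) throws IOException {
        // Hypothetical file, used only for illustration
        CompilationUnit cu = StaticJavaParser.parse(Paths.get("src/main/java/com/example/MyClass.java"));

        // Start from the comments and go back to the element they document
        cu.getAllContainedComments().stream()
                .filter(comment -> comment instanceof JavadocComment)
                .forEach(comment -> comment.getCommentedNode().ifPresent(node ->
                        System.out.println(node.getClass().getSimpleName() + " -> " + comment.getContent())));
    }
}
```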

Here most of the code is about providing a description for the commented node (method describe).

Conclusions

Manipulating the AST and finding the Javadoc comments is quite easy. However one missing feature is the possibility to extract the information contained in the Javadoc in a structured form. For example, you may want to get only the part of the Javadoc associated with a certain parameter or with the return value. JavaParser currently does not have this feature, but I am working on it and it should be merged in the next 1-2 weeks. If you want to follow the development take a look at issue 433.

Thanks for reading and happy parsing!

Implementing Lexical Preservation for JavaParser

Many users of JavaParser are asking us to implement lexical preservation, i.e., the ability to parse a Java source file, modify the AST and get back the modified Java source code while keeping the original layout.

Currently this is not supported by JavaParser: you can parse all Java 8 code with JavaParser and get an AST, but the code you obtain back is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could it be implemented? I tried experimenting with a solution which does not require changing the JavaParser APIs or the internal implementation. It is just based on the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How would I use that?

As a user you basically need to set up a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:
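At the time this was written the feature was still a prototype; in current JavaParser releases the usage looks roughly like this (a sketch assuming a simple class rename):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.printer.lexicalpreservation.LexicalPreservingPrinter;

public class LexicalPreservationExample {
    public static void main(String[] args) {
        String original = "class A {\n    int a;\n}\n";
        CompilationUnit cu = StaticJavaParser.parse(original);

        // 1. associate the original text with the AST
        LexicalPreservingPrinter.setup(cu);

        // 2. change the AST as usual
        cu.getClassByName("A").ifPresent(c -> c.setName("B"));

        // 3. print back the code, keeping the original layout
        System.out.println(LexicalPreservingPrinter.print(cu));
    }
}
```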

What does the setup method do?

Two things:

  1. associate the original code to the starting elements of the AST
  2. attach an observer to react to changes

Let’s see how this works.

Connect the parsed nodes to the original code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which each node has a Range which indicates its position in the original source code. For example, if we parse this:
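The original snippet is not reproduced here; any file where the class declaration occupies lines 3 through 6 fits the description that follows, for instance something like this:

```java
package example;

class A {
    int a;
    int b;
}
```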

We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

putPlaceholders finds the parts of the text corresponding to children and creates a ChildNodeTextElement for each of them. In practice, at the end, for each node we will have a list of strings (StringNodeTextElement) and placeholders indicating the position of children in the text (ChildNodeTextElement).

For example for the class class A { int a;} we would have a template of three elements:

  1. StringNodeTextElement(“class A {“)
  2. ChildNodeTextElement(field a)
  3. StringNodeTextElement(“}”)
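To make the idea concrete, here is a rough conceptual sketch of such elements (these are not the actual classes of the JavaParser prototype, just stand-ins for the concepts named above):

```java
import com.github.javaparser.ast.Node;

// Conceptual stand-ins only: the text associated with a node is a list of
// elements that are either literal strings or placeholders for children.
interface NodeTextElement {}

class StringNodeTextElement implements NodeTextElement {
    final String text;
    StringNodeTextElement(String text) { this.text = text; }
}

class ChildNodeTextElement implements NodeTextElement {
    final Node child; // placeholder: the text is taken from the child's own template
    ChildNodeTextElement(Node child) { this.child = child; }
}
```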

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

If I want to remove the field f I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we change the text of the class we should also change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and templates exactly to avoid having to propagate changes up the hierarchy: a parent does not store the portion of text corresponding to a child but uses placeholders instead. For example for the class we will store something that conceptually looks like this:

So removing a child will just mean removing an element from the list of the parent which will look like this:

In other words when we remove an element we remove the corresponding ChildNodeTextElement from the text associated to the parent.

At this point we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

The corresponding list of elements will be:

If we want to remove the first parameter we would get:

This would not be valid because of the extra comma. In this case we should know that the element is part of a comma-separated list: since we are removing an element from a list with more than one element, we also need to remove one comma.

Now, these kinds of changes depend on the role of a certain node, i.e., where that node is used. For example, when a node contained in a list is removed, the method concreteListChange of our observer is called:

Now, to understand what the modified NodeList represents we use reflection, in the method findNodeListName:

If the modified NodeList is the same one we get when calling getParameters on the parent of the list, then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words we have to specify that when deleting a node from a list of parameters we have to remove the preceding comma. When removing a node from a list of methods, instead, there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something which can be parsed correctly, it would not be enough to produce an acceptable result, because we are not considering adjustments to whitespace. We could have newlines, spaces or tabs added before or after the comma in order to obtain a readable layout. When removing an element and the comma, the layout would change, and adjusting the whitespace accordingly would not necessarily be trivial.

Adding nodes

When adding nodes we have mostly the same problems we have seen when removing nodes: we could need to add separators instead of removing them. We also have another problem: we need to figure out where to insert an element.

We can have two different cases:

  • it is a single element (like the name of a class or the return type of a method)
  • it is an element that is part of a list (for example a method parameter)

In the first case we could have:

We need to specify that when adding a name to a Class Definition we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty, or if we want to insert the first element of a list, we need to use some delimiter. For example in the case of a method definition, when adding a parameter we should add it after the left parenthesis. If instead we want to add an element in a position different from the first one, we need to find the preceding element in the list, insert a delimiter (if necessary) and then place the new element.

Also in this case we would need to adapt whitespace.

Changing nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases however we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases we would need specific support for each property of each node. Some more work for us.

Associate comments to nodes

Some time ago I started working on comment attribution, i.e., finding to which node a comment refers. Why is this necessary? Because when we remove a Node we should also remove the corresponding comments, and when we move a Node around the associated comments should be moved with it. And again, we usually put some whitespace around the comments; that too needs to be handled.

Conclusions

Lexical preservation is a very important feature: we need to implement it in JavaParser and we need to get it right. However it is far from trivial to implement and there is no clear, easy-to-implement solution. There are different aspects to consider and heuristics to address the problem. For this reason we will need to collect a lot of feedback and be ready for a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Observers for AST nodes in JavaParser

We are getting closer to the first Release Candidate for JavaParser 3.0. One of the last features we added was support for observing changes to all nodes of the Abstract Syntax Tree. While I wrote the code for this feature I received precious feedback from Danny van Bruggen (a.k.a. Matozoid) and Cruz Maximilien. So I use “we” to refer to the JavaParser team.

What observers on AST nodes could be used for?

I think this is a very important feature for the ecosystem of JavaParser because it makes it easier to integrate with JavaParser by reacting to the changes made on the AST. Possible changes that can be observed are setting a new name for a class or adding a new field. Different tools could react to those changes in different ways. For example:

  • an editor could update its list of symbols, which could be used for things like auto-completion
  • some frameworks could regenerate source code to reflect the changes
  • validation could be performed to verify if the new change leads to an invalid AST
  • libraries like JavaSymbolSolver could recalculate the types for expressions

These are just a few ideas that come to mind but I think that most scenarios in which JavaParser is used could benefit from the possibility to react to changes.

The AstObserver

The JavaParser 3.0 AST is based on Nodes and NodeLists. A Node, like a TypeDeclaration for instance, can have different groups of children. When these groups can contain more than one node we use NodeLists. For example a TypeDeclaration can have multiple members (fields, methods, inner classes). So each TypeDeclaration has a NodeList to contain fields, one to contain methods, etc. Other children, like the name of a TypeDeclaration, are instead directly contained in the node.

We introduced a new interface named AstObserver. An AstObserver receives changes on the Nodes and NodeLists.

What to observe

Now we have an AstObserver and we need to decide which changes it should receive. We thought of three possible scenarios:

  1. Observing just one node, for example a ClassDeclaration. The observer would receive notifications for changes on that node (e.g., if the class changes name) but not for any of its descendants. For example if a field of the class changes name the observer would not be notified
  2. For a node and all its descendants at the moment of registration of the observer. In this case if I register an observer for the ClassDeclaration I would be notified for changes to the class and all its fields and methods. If a new field is added and later modified I would not receive notifications for those changes
  3. For a node and all its descendants, both the ones existing at the moment of registration of the observer and the ones added later.

So a Node now has this method:
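A sketch of how registration looks with the released 3.x API (the enum constants shown are, as far as I can tell, the ones that correspond to the three scenarios above):

```java
import com.github.javaparser.ast.Node;
import com.github.javaparser.ast.observer.AstObserver;
import com.github.javaparser.ast.observer.ObserverRegistrationMode;

// Sketch: registering an observer on a node, choosing how far the
// registration propagates. The three modes correspond to the three
// scenarios listed above.
class RegistrationSketch {
    static void registerExamples(Node node, AstObserver observer) {
        node.register(observer, ObserverRegistrationMode.JUST_THIS_NODE);
        node.register(observer, ObserverRegistrationMode.THIS_NODE_AND_EXISTING_DESCENDANTS);
        node.register(observer, ObserverRegistrationMode.SELF_PROPAGATING);
    }
}
```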

To distinguish these three cases we simply use an enum (ObserverRegistrationMode). Later you can see how we implemented the PropagatingAstObserver.

Implementing support for observers

If JavaParser was based on some meta-modeling framework like EMF this would be extremely simple to do. Given this is not the case I needed to add a notification call in all the setters of the AST classes (there are around 90 of those).

So when a setter is invoked on a certain node it notifies all the observers. Simple. Take for example setName in TypeDeclaration<T>:
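The original snippet is not reproduced here, but the pattern every such setter follows can be illustrated with a small self-contained sketch (this is not JavaParser's actual code, just the shape of it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Stand-alone illustration of the pattern (not JavaParser's actual code):
// a setter notifies the registered observers before applying the change,
// much like the generated setters do through notifyPropertyChange(ObservableProperty.NAME, ...).
class ObservableName {
    private final List<BiConsumer<String, String>> observers = new ArrayList<>();
    private String name;

    void register(BiConsumer<String, String> observer) {
        observers.add(observer);
    }

    void setName(String newName) {
        // observers receive the old and the new value
        observers.forEach(o -> o.accept(this.name, newName));
        this.name = newName;
    }
}
```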

Given we do not have a proper metamodel we have no definitions for properties. Therefore we added a list of properties in an enum, named ObservableProperty. In this way an Observer can check which property was changed and decide how to react.

Internal hierarchy of observers

For performance reasons each node has its own list of observers. When we want to observe all descendants of a node we simply add the same observer to all nodes and nodelists in that subtree.

However this is not enough, because in some cases you may also want to observe all nodes which are added to the subtree after you have placed your observers. We do that by using a PropagatingAstObserver. It is an AstObserver that, when it sees a new node being attached to a node it is observing, starts to observe the new node as well. Simple, eh?

Observers in action

Let’s see how this works in practice:
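A sketch of what such an example can look like with a recent JavaParser release (AstObserverAdapter and the property-change callback are from the released 3.x API; treat the details as an approximation of the original snippet):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.Node;
import com.github.javaparser.ast.observer.AstObserverAdapter;
import com.github.javaparser.ast.observer.ObservableProperty;
import com.github.javaparser.ast.observer.ObserverRegistrationMode;

public class ObserverExample {
    public static void main(String[] args) {
        CompilationUnit cu = StaticJavaParser.parse("class A { int f; }");

        // React to property changes anywhere in the existing subtree
        cu.register(new AstObserverAdapter() {
            @Override
            public void propertyChange(Node observedNode, ObservableProperty property,
                                       Object oldValue, Object newValue) {
                System.out.println(property + " changed from " + oldValue + " to " + newValue);
            }
        }, ObserverRegistrationMode.THIS_NODE_AND_EXISTING_DESCENDANTS);

        // Triggers a notification for the NAME property
        cu.getClassByName("A").ifPresent(c -> c.setName("B"));
    }
}
```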

Conclusions

I am quite excited about this new feature because I think it enables more cool stuff to be done with JavaParser. I think our work as committers is to enable other people to do things we are not foreseeing right now. We should just act as enablers and then get out of the way.

I am really curious to see what people will build. By the way, do you know any project using JavaParser that you want to make known to us? Leave a comment or open an issue on GitHub, we are looking forward to hearing from you!

Interview to Erik Dietrich on Static Analysis and a data driven approach to refactoring

Erik Dietrich is a well-known Software Architect with a long experience in consulting. His blog (DaedTech) is a source of thoughtful insights on software development. In particular I was very interested in his usage of static analysis and code metrics in his consulting work.

I am myself a relatively young consultant and I thought it would be very valuable for me to hear his opinion on these topics and in general about consulting in the software industry.

So I wrote to him and he was kind enough to answer some questions.

Introduction

Can you tell us about the typical projects you work on?

It’s hard to summarize typical projects over the course of my entire career.  It’s been an interesting mix.  Over the last couple of years, though, it’s coalesced into what I think of as pure consulting.  I come in to help managers or executives assess their situations and make strategic decisions.

Can you tell us about a project which results exceeded your expectations?

There have been a couple of projects of late that I felt pretty good about.  Both of them involved large clients (a Fortune 10 company and a state government) to whom I furnished some strategic recommendations.  In both cases, I checked back from time to time and discovered that they had taken my recommendations, executed them, and expanded on them to great success.  What I found particularly flattering wasn’t just that they realized success with recommendations (that’s just what you’d expect from a good consultant), but how foundational they found the seeds I had planted there to be.  They had taken things that I had written up and made them core parts of what they were doing and their philosophy.

What makes your work difficult? What are the conditions which reduce the possibility of success?

When it comes to purely consultative engagements, the hardest thing is probably gathering and basing recommendations on good data.

If I’m writing code, it’s a question of working with compilers and test suites and looking at fairly objective definitions of success.  But when it comes to strategic advice like “should we evolve this codebase or start over” or “how could we make our development organization more efficient,” you get into pretty subjective territory.  So the hardest part is finding ways to measure and improve that aren’t just opinion and hand-waving.  They’re out there, but you really have to dig for them.

Human factor

When promoting changes to a software development team how important is the human factor? Do you face much resistance from developers?

It’s the most important thing.  In a consultative capacity, it’s not as though I can elbow someone out of the way, roll up my sleeves, and deliver the code myself.  The recommendations can only come to fruition if the team buys in and executes them.

As for facing resistance, that happens at times.  But, honestly, not as much as you might think.  Almost invariably, I’m called in when things aren’t great, and people aren’t enjoying themselves.  I try to get buy-in by helping people understand how the recommendations that I’m making will improve their own day to day lives.  For example, you can reduce friction with a recommendation to say, have a unit test suite, by explaining that it will lead to fewer late night bug hunts and frustrating rework sessions.

How do you win support from people? Developers employed by the company can get defensive when someone from outside comes to suggest changes to their way of work. In your recent post What to do when Your Colleague Creates Spaghetti Code you suggest using data to support your considerations. How well does it work for you? Do you always find data to support your proposals?

Using data, in and of itself, tends to be a way more to settle debates than to get support.  Don’t get me wrong — it’s extremely important.  But you need to balance that with persuasion and the articulation of the value proposition that I mentioned in the previous answer.  Use the data to build an unassailable case for your position, and then use persuasion and value proposition to make the solution as palatable as possible for all involved.

Static analysis

Are there any typical low-hanging fruits for organizations which start to adopt static analysis? Anything that can be achieved with limited experience and effort?

I would say the easiest, quickest thing to do is get an analysis tool and set it up to run against the team’s codebase.  Those tool vendors know their stuff, and you’ll get some good recommendations out of the box.

Do you work with different language communities (e.g., JVM, .NET, Ruby, Python)? My impression is that on .NET there are better tools for static analysis, and maybe more maturity in that community. Is that your feeling also?

With static analysis, these days I work pretty exclusively in the .NET and Java communities, though I may expand that situationally, if the need arises.  As for comparing the static analysis offerings across ecosystems, I’ve actually never considered that.  I don’t have enough data for comparison to comment 🙂

In the Java community there are a few tools for static analysis but I think their default configuration reports way too many irrelevant warnings and this contributes to the poor reputation of static analysis in the Java world. What do you think about that?

I don’t know that this is unique to Java or to any particular tool.  Just about every static analysis tool I’ve ever seen hits you with far more than you probably want right out of the gate.  I suspect the idea is that it’s better to showcase the full feature offering and let users turn warnings off than to give the impression that they don’t do as much analysis.  

A common problem of open source software is the lack of documentation. In your post Reviewing Strangers’ Code on Github you talk about the challenge of dealing with unfamiliar GitHub projects. Do you think static analysis tools could be used to get an overview of a codebase and familiarize faster with the code?

Oh, no doubt about it.  I run custom analysis on new codebases just to get a feel for them and gather basic statistics.  What’s the length of the average method?  How systematically do they run afoul of so-called best practices?  How do the dependencies cluster?  I could go on and on.  I feel like I’d be looking at a codebase blind without the ability to analyze it this way and compare it against my experience with all of the previous ones I’ve examined.

The role of developers

Your next book, Developer Hegemony, is about a new organization of labor and the role developers could occupy in it. Can you tell us something about it? And do you think this new organization could be advantageous also for other creative jobs?

In the book, I talk extensively about the problems with and vestigial nature of the corporation in the modern world of knowledge workers.  Corporations are organized like giant pyramids, in which the people at the bottom do the delivery work and are managed by layer upon layer of people above them.  To make matters worse for software people, a lot of these organizations view their labor as an overhead cost.

I see us moving toward a world where software developers move away from working for big, product companies that make things other than software.  I see us moving out from having our salaries depressed under the weight of all of those layers of corporate overhead.  A decent, if imperfect, analog, might be to doctors and lawyers.  And, yes, I think this absolutely can apply to creative fields as well.

To add on to your post How to Get that First Programming Job, what is something that the average programmer is not aware of and could improve their work?

One of the best pieces of advice I think I can offer is the following, though it’s not exactly technical.

If you’re working and you find yourself thinking/muttering, “there’s GOT to be a better way to do this,” stop!  You are almost certainly right, and there is a better way.  Go find it.  And don’t automate something yourself without checking for its existence.

One of the most common mistakes I see in relatively inexperienced programmers is to take the newfound power of automation and apply it everywhere, indiscriminately.  They solve already-solved problems without stopping to think about it (reinventing wheels) or they brute force their way through things (e.g. copy and paste programming).  So, long story short, be sensitive to solutions that may already exist or ways to solve problems that may not always involve writing code.

You are clearly an accomplished professional. Could you share one piece of advice for advancing the career of someone in the software industry?

One thing that drives me a little nuts about our industry is something I described in this post.  We line our resumes with alphabet soups of programming languages, frameworks, and protocols that we know, but with little, if any mention of what sorts of business or consumer problems we know how to solve.  It’s “I have 8 years of C++” instead of “I can help your team cut down on the kinds of memory leaks that lead to system crashes.”

It will improve your career and focus your work if you start to think less in terms of years and tools and more in terms of what value you can provide and for whom.

Conclusions

I am very thankful to Erik for sharing his opinions. I think there are many interesting takeaways. Two ideas in particular made me think:

  • The changing role of developers: maybe some will still prefer the security of corporate jobs but there are undoubtedly many opportunities for those who are willing to go independent.
  • We as professionals have to raise the bar. One way to do that is to start taking our job more seriously by basing it on data, when that makes sense. Data cannot yet provide all the answers, but it can help and it should not be ignored.

Do you agree?