Code Generation with Roslyn: a Skeleton Class from UML

Get the source code for this article on GitHub

We have already seen some examples of transformation and analysis of C# code with Roslyn. Now we are going to build a more complex example that combines code generation with Roslyn and parsing with Sprache: we are going to create a skeleton class from a PlantUML file. In short, we are doing the inverse of what we did when we generated PlantUML diagrams from C# source code. The first step is to parse, of course.

As you can see, there are four entities in this example: PlantUML start and end tags and class, variable and function declarations.

Parsing all the things

We are going to parse the file line by line instead of doing it in one fell swoop. This is partly because of the limitations of Sprache, but also because it’s easier to correctly parse one thing at a time than to get everything right in one go.

With CharExcept we are parsing all characters except for the one(s) indicated, which is a handy but imprecise way to collect all the text for an identifier. The roughness of this process is obvious, because we are forced to exclude all the characters that can come after an identifier. If you look at the .plantuml file at the beginning of the article, you see that there is a space after the field names, a ‘}’ after the modifier static, a ‘:’ after the argument name, to separate the identifier from its type, and finally the closing parenthesis after the type. You might say that we should simply have checked for “Letters”, which would work in this specific case, but would exclude legal C# names for identifiers.
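As a minimal sketch of this idea (not the article's original listing), an identifier parser built on CharExcept could look like the following. The class and member names are hypothetical, and the set of excluded characters is just the four terminators listed above.

```csharp
using Sprache;

public static class IdentifierSketch
{
    // Take every character up to a terminator that can follow a name in the
    // .plantuml file: a space, '}', ':' or ')'. Token() trims surrounding whitespace.
    public static readonly Parser<string> Identifier =
        Parse.CharExcept(" }:)").AtLeastOnce().Text().Token();

    // Hypothetical helper: returns the first identifier found at the start of the input.
    public static string ReadName(string input)
    {
        return Identifier.Parse(input);
    }
}
```

Calling IdentifierSketch.ReadName("name : String") stops at the first space, so anything that is not one of the excluded characters is accepted as part of the name. That is exactly the imprecision discussed above.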

The Modifier parser is quite uninteresting, except for lines 6 and 11, where we see the same problem just mentioned: identifying the correct name. The last case refers to something that doesn’t happen in this example, but could happen in other UML diagrams: override modifiers. The real deal is in lines 18 and 22, where we see the Ref parser, which is used, as the documentation says, to: “Refer to another parser indirectly. This allows circular compile-time dependency between parsers”. DelimitedBy is used to select many of the same items delimited by the specified rule, and finally Optional refers to a rule that isn’t necessary to parse correctly, but might appear. Since the rule is optional, the value could be undefined and it must be accessed using the method shown on line 22. The Method rule is slightly more complicated, but it uses the same methods. In case you are wondering, methods without a return type are constructors.
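To make Ref, DelimitedBy and Optional more concrete, here is a simplified sketch of a method rule. It is an assumption about the general shape, not the article's code: it ignores modifiers, and the names ParsedMethod, MethodSketch, Word, Argument and Returns are invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sprache;

public sealed class ParsedMethod
{
    public string Name;
    public List<string> Arguments;   // each entry is "type name"
    public string ReturnType;        // null when the line declares a constructor
}

public static class MethodSketch
{
    static readonly Parser<string> Word =
        Parse.CharExcept(" (){}:,").AtLeastOnce().Text().Token();

    // One argument in PlantUML order: "name : type".
    static readonly Parser<string> Argument =
        from name in Word
        from colon in Parse.Char(':').Token()
        from type in Word
        select type + " " + name;

    // Trailing ": type" after the closing parenthesis.
    static readonly Parser<string> Returns =
        from colon in Parse.Char(':').Token()
        from type in Word
        select type;

    public static readonly Parser<ParsedMethod> Method =
        from name in Word
        from open in Parse.Char('(')
        // Ref defers the lookup of Argument; DelimitedBy needs at least one item,
        // so Optional covers the empty argument list.
        from args in Parse.Ref(() => Argument).DelimitedBy(Parse.Char(',').Token()).Optional()
        from close in Parse.Char(')')
        from ret in Returns.Optional()
        select new ParsedMethod
        {
            Name = name,
            Arguments = args.IsDefined ? args.Get().ToList() : new List<string>(),
            ReturnType = ret.GetOrDefault()
        };

    // Hypothetical helper that parses a single line.
    public static ParsedMethod ParseLine(string line)
    {
        return Method.Parse(line);
    }
}
```

An optional result has to be unwrapped explicitly: here IsDefined and Get handle the argument list, while GetOrDefault leaves ReturnType as null for a constructor such as Calculator().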

Parsing line by line

We can see our parser at work in the main method, where we try to parse every line with every parser and, if successful, we add the value to a custom type that we are going to see later. We need a custom type because code generation requires all the elements to be in place: we can’t do it line by line, at least not if we want to use the Roslyn formatter. We could just take the information and print it ourselves, which is good enough for a small project, but gets complicated for a larger one. Also, we would miss all the nice automatic options for formatting. On line 13 we skip to the next iteration if we found a method, because a method could also be parsed, improperly, as a field, so to avoid the risk we jump over.
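The loop can be sketched roughly like this. Both parsers are deliberately simplified stand-ins for the real ones, and all names are hypothetical; the point is only the order of the attempts and the continue that keeps a method line away from the field parser.

```csharp
using System.Collections.Generic;
using Sprache;

public static class LineByLineSketch
{
    static readonly Parser<string> Word =
        Parse.CharExcept(" :").AtLeastOnce().Text().Token();

    // Deliberately loose "name : type" rule; it would also accept a method line.
    static readonly Parser<string> Field =
        from name in Word
        from colon in Parse.Char(':').Token()
        from type in Word
        select type + " " + name;

    // A name immediately followed by an opening parenthesis.
    static readonly Parser<string> Method =
        from name in Parse.CharExcept(" :(").AtLeastOnce().Text().Token()
        from open in Parse.Char('(')
        select name;

    public static void Classify(IEnumerable<string> lines, List<string> methods, List<string> fields)
    {
        foreach (var line in lines)
        {
            var method = Method.TryParse(line);
            if (method.WasSuccessful)
            {
                methods.Add(method.Value);
                continue; // skip the field parser, which would also match this line
            }

            var field = Field.TryParse(line);
            if (field.WasSuccessful)
                fields.Add(field.Value);
            // anything else (tags, class header, closing brace) is ignored in this sketch
        }
    }
}
```

TryParse returns a result object instead of throwing, which is what makes the "try every parser on every line" approach practical.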

Code Generation

If you remember the first lessons about Roslyn, you know it’s quite verbose, because it’s very powerful. You also have to remember that we can’t modify nodes, not even the ones we create ourselves rather than, say, parse from a file. Once you get used to using SyntaxFactory for everything, it’s all quite obvious: you just have to find the correct methods. The using directives are simply the ones usually inserted by default by Visual Studio.

Generation of methods

Let’s start by saying that Declarations and DeclarationType are fields in our custom class, which is not shown here, but you can look at it in the source code. Then we proceed to generate the methods of our skeleton C# class. MethodDeclaration allows us to choose the name and the return type of the method itself; mods refers to the modifiers, which obviously could be more than one, and so they are in a list. Then we create the parameters, which in our case need only a name and a type.

We choose to throw an exception, since we obviously cannot determine the body of the methods just with the UML diagram. So we create a throw statement and a new object of the type NotImplementedException. This also allows us to add a meaningful body to the method. You should add a body in any case, if you use the formatter, because otherwise it will not create a correct method: there won’t be a body or the curly braces.
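A minimal sketch of such a method, assuming a single public modifier and a hypothetical BuildMethod helper (the real code reads names, modifiers and types from the custom class), might be:

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class MethodGenerationSketch
{
    public static MethodDeclarationSyntax BuildMethod(
        string name, string returnType, params (string Type, string Name)[] parameters)
    {
        // throw new NotImplementedException();
        var throwStatement = SyntaxFactory.ThrowStatement(
            SyntaxFactory.ObjectCreationExpression(
                    SyntaxFactory.ParseTypeName("NotImplementedException"))
                .WithArgumentList(SyntaxFactory.ArgumentList()));

        // Nodes are immutable: every With.../Add... call returns a new node.
        var method = SyntaxFactory
            .MethodDeclaration(SyntaxFactory.ParseTypeName(returnType), name)
            .AddModifiers(SyntaxFactory.Token(SyntaxKind.PublicKeyword))
            .WithBody(SyntaxFactory.Block(throwStatement));

        foreach (var p in parameters)
        {
            method = method.AddParameterListParameters(
                SyntaxFactory.Parameter(SyntaxFactory.Identifier(p.Name))
                    .WithType(SyntaxFactory.ParseTypeName(p.Type)));
        }
        return method;
    }
}
```

For a quick look at the result without the formatter, you can call NormalizeWhitespace().ToFullString() on the returned node.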

Generation of fields

The “field” case is easier than the “method” one, and the only really new thing is on line 12, where we use a method to parse the type from a string filled in by our parser.
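The method in question is presumably SyntaxFactory.ParseTypeName, which turns a string such as "List<string>" into a type node. A small sketch, with a hypothetical BuildField helper and a hard-coded private modifier, could be:

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class FieldGenerationSketch
{
    public static FieldDeclarationSyntax BuildField(string type, string name)
    {
        // ParseTypeName accepts the type exactly as our parser collected it.
        var declaration = SyntaxFactory
            .VariableDeclaration(SyntaxFactory.ParseTypeName(type))
            .AddVariables(SyntaxFactory.VariableDeclarator(name));

        return SyntaxFactory
            .FieldDeclaration(declaration)
            .AddModifiers(SyntaxFactory.Token(SyntaxKind.PrivateKeyword));
    }
}
```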

The end of the Generate method is where we add the class created by the for loop and use Formatter. Notice that cu is the CompilationUnitSyntax that we created at the beginning of this method.
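As a rough sketch of this last step, assuming the Microsoft.CodeAnalysis.CSharp.Workspaces package is referenced and using an empty class in place of the generated one (FormatSketch and BuildAndFormat are hypothetical names):

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Formatting;

public static class FormatSketch
{
    public static string BuildAndFormat(string className)
    {
        // The compilation unit is the root: using directives first, then members.
        var cu = SyntaxFactory.CompilationUnit()
            .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.ParseName("System")));

        var classDeclaration = SyntaxFactory.ClassDeclaration(className)
            .AddModifiers(SyntaxFactory.Token(SyntaxKind.PublicKeyword));

        cu = cu.AddMembers(classDeclaration);

        // The formatter needs a workspace; an AdhocWorkspace is enough here.
        using (var workspace = new AdhocWorkspace())
        {
            return Formatter.Format(cu, workspace).ToFullString();
        }
    }
}
```

Nodes built with SyntaxFactory carry no real whitespace, so it is the formatter that inserts the spaces and line breaks in the final text.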

Limitations of this example

The unit tests are not shown because they don’t contain anything worth noting, although I have to say that Sprache is really easy to test, which is a great thing. If you run the program you will find that the generated code is correct, but it’s still missing something. It lacks some of the necessary using directives, because we can’t detect them starting just from the UML diagram. In a real-life scenario, with many files and classes and without the original source code, you might identify the assemblies beforehand and then use reflection to find their namespace(s). Also, we obviously don’t implement many things that PlantUML has, such as the relationships between classes, so keep that in mind.

Conclusions

Code generation with Roslyn is not hard, but it requires you to know exactly what you are doing. It’s better to have an idea of the code you are generating beforehand, or you will have to take into account every possible case, which would make every little step hard to accomplish. I think it works best for specific scenarios and short pieces of code, where it can be very useful. In such cases, you can quickly create tools that are productive for your project, or for yourself, and benefit from them as long as you don’t change tools or work habits. For instance, if you are a professor, you could create an automatic code generator to translate the pseudo-code of a short algorithm into real C#. If you think about it, this complexity is a good thing: if anybody could generate whole programs from scratch, we programmers would lose our jobs.

You might think that using Sprache for such a project was a bad idea, but it’s actually a good tool for parsing single lines. And while there are limitations, this approach makes it much easier to get something working in little time, instead of waiting until you have a complete grammar for a “real” parser. For the cases in which code generation is most useful, specific scenarios and such, this is actually the best approach, in my opinion, since it allows you to easily pick and choose which parts to use and just skip the rest.

Generate diagrams from C# source code using Roslyn

The code for this post is on Github

Beyond the source code

Last week we saw how to use Roslyn to rewrite source code to your liking. That’s all well and good, but it’s not the only thing you can do when you have a compiler open and ready to do your bidding. Another possibility is to leverage the knowledge that the compiler has to support other tools that you use as a programmer, or that your co-workers need to simplify their job.

There are two great advantages to using the source code to support everything else:

  1. the source code becomes the truth, from which everything follows
  2. you can integrate the support for these tools into the processes of continuous integration that you already use

You may say that point number 1 is already true in any case. But, even for open source software, how many people are going to wade through hundreds of files to understand how to use the damn thing? The reality is that if there is no documentation, the software doesn’t exist for most people. Time is too valuable to waste on other people’s code. And this doesn’t even count the people who don’t understand code, but need to know the features of the software.

Roslyn doesn’t help just programmers

No, it’s true, Roslyn will not write documentation on its own, but it can be used to make writing it easier and even to manage other structured information. In particular, today we are talking about UML diagrams. The traditional way is to create them by hand, which tends to make them obsolete quickly, or to use programs that reverse engineer the code itself, which is costly and not easily adaptable. Roslyn, instead, allows you to easily create diagrams, or at least some kinds of diagrams, such as class diagrams. Another advantage is that by understanding the source code programmatically you can hide or show information depending on what the reader needs. For instance, you can hide private properties and methods that the user of the library doesn’t need to know about.

The plan

In short, the idea is to create text files that are compatible with PlantUML for every class of our source code and then to use PlantUML to create the actual diagrams. In real life it would then be trivial to create the diagrams programmatically, thanks to the command line, and upload the images wherever you want. Generating class diagrams by leveraging the compiler is so easy because the compiler needs to understand the source code, and so all the information is readily available to us. In fact, I didn’t even need to write much code, since there is already a small library that does it: https://github.com/pierre3/PlantUmlClassDiagramGenerator [1]. Hey, we are programmers, we are lazy, we are smart enough to leverage existing resources.

We just need to understand how it works. It’s less than 300 lines of code, including comments, so we can delve right in.

Generating the diagram

See, I wasn’t kidding, it’s easy. All the information is readily available from the Roslyn parser; we just need to take it. GetMembersModifierText (not shown) is simply a switch that associates every modifier keyword with its respective PlantUML symbol, so that SyntaxKind.PublicKeyword maps to “+”. Of course you need to learn the terminology, such as SyntaxKind or the names of the several *Syntax types, but that isn’t really hard. The only thing slightly harder than a simple “copy the value and write a string” concerns properties, which are what the developers of .NET call “syntactic sugar”, that is to say a shortcut for programmers that the compiler transforms into real functions. Since they are not a standard feature of many languages, you have to translate them for UML.
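Since GetMembersModifierText is not shown, here is a guess at what such a switch could look like. Only the “+” for public comes from the text above; the other symbols are the standard PlantUML visibility markers, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class ModifierSketch
{
    public static string ToPlantUml(SyntaxTokenList modifiers)
    {
        var parts = new List<string>();
        foreach (var modifier in modifiers)
        {
            switch (modifier.Kind())
            {
                case SyntaxKind.PublicKeyword: parts.Add("+"); break;
                case SyntaxKind.PrivateKeyword: parts.Add("-"); break;
                case SyntaxKind.ProtectedKeyword: parts.Add("#"); break;
                case SyntaxKind.InternalKeyword: parts.Add("~"); break;
                case SyntaxKind.StaticKeyword: parts.Add("{static}"); break;
                case SyntaxKind.AbstractKeyword: parts.Add("{abstract}"); break;
                // other modifiers have no PlantUML counterpart in this sketch
            }
        }
        return string.Join(" ", parts);
    }
}
```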

The main method

I don’t show the whole main method because it’s your typical console app: very simple. Since ClassDiagramGenerator is nothing more than a CSharpSyntaxWalker, we just need to gather the text, parse it, and give the order to visit the tree with our walker. The only things to notice are the opening and closing PlantUML notation lines that we add to our generated files. Now you can use PlantUML to create the diagrams.
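To give an idea of the overall flow, here is a much reduced walker. It is not the ClassDiagramGenerator from the library: the name ClassOutlineWalker is hypothetical, and it only prints class names and method signatures without parameters.

```csharp
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public class ClassOutlineWalker : CSharpSyntaxWalker
{
    private readonly StringBuilder _output = new StringBuilder();

    public override void VisitClassDeclaration(ClassDeclarationSyntax node)
    {
        _output.AppendLine("class " + node.Identifier.Text + " {");
        base.VisitClassDeclaration(node); // descend into the members
        _output.AppendLine("}");
    }

    public override void VisitMethodDeclaration(MethodDeclarationSyntax node)
    {
        // Parameters are omitted in this sketch.
        _output.AppendLine("  " + node.Identifier.Text + "() : " + node.ReturnType);
    }

    public static string Generate(string source)
    {
        // Gather the text, parse it, then visit the tree with the walker.
        SyntaxNode root = CSharpSyntaxTree.ParseText(source).GetRoot();
        var walker = new ClassOutlineWalker();
        walker.Visit(root);

        // The start and end tags wrap whatever the walker collected.
        return "@startuml\n" + walker._output + "@enduml\n";
    }
}
```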

Conclusion

[Figure: class diagram of ClassDiagramGenerator, generated by PlantUML]

Using the source code as a source of intelligence about the code itself is not exactly a free lunch, but it’s quite close. You can write code and then automatically have it translated into a form that co-workers can understand, be they other programmers or not. And you can integrate this information into the practices and tools that you already use: it’s a win-win. It’s true that in real life there is probably more setup, but the advantages are clear. The information is already there, and now Roslyn makes it easily accessible, so why not use it?


[1] I just added a few lines to include the relation between base and derived classes

Getting started with Roslyn: transforming C# code

The code for this post is on GitHub: getting-started-roslyn

Under the hood

Making a programming language actually useful is not simply about designing it well but it is also about providing supporting tools around the language: compilers, obviously, but also editors, build systems, etc.

There are few languages that give you tools to play under the hood. I am thinking about the Language Server Protocol, for example: it lets you reuse parts of a compiler to get errors or the position of a definition. Roslyn is another example. Microsoft defined the idea behind it as “compiler as a service”, or more recently, a “platform”. Ok, what the hell does it mean?

Introduction to Roslyn

Using Roslyn you can access the inner workings of the compiler and use all its knowledge to create tools to boost your productivity or simplify your life. For instance, you could finally force everybody to respect the coding style of your project or extend the functionality of the IDE. A common example is to check the correctness of your Regex, while you are writing it, eliminating the need to run the program to check it.

It is available on Windows, Linux and Mac, and it works on .NET Core.

What we are going to do

In this post we are going to make sure that every int variable is initialized, and if it is already initialized, we make sure it is initialized to the value 42. It’s a simple example, but it will touch the three main areas of interest:

  1. syntax analysis
  2. semantic analysis
  3. syntax transformation

Believe it or not, it will even be easy to understand!

Setup

We will create this example on Linux, using Visual Studio Code as an editor, but of course you could use whatever editor you want. Just make sure you install a recent version of .NET Core. Once you have done this, create a new project and open the file project.json. We have two things to do: add the dependencies needed for Roslyn and use a workaround to correct a bug; the fix is simply to add the value “portable-net45+win8+wp8+wpa81” to imports. After our edits we can restore the packages to check that everything works (i.e., the bug is fixed).

The Main method

Let’s take a look at our Program.cs. We skip CreateTestCompilation for now; the only thing to notice is that if you just wanted to look at the SyntaxTree you wouldn’t need to compile anything, you could just build it with something as simple as CSharpSyntaxTree.ParseText("Text to parse").

We loop through the source trees, that is the source files, and get the semantic model for each of them. This is needed to check the meaning of the code we are seeing.

In our example we have to be sure to initialize only integer variables and not, say, a string. Next, we give the semantic model to our InitializerRewriter and then we visit every node of the tree. InitializerRewriter is a kind of tree walker that can be used to modify the tree. More precisely, you can’t modify the original tree, but you can create a new one that is identical save for the nodes you have changed. In the end, we check whether we have modified the original source and, if so, we create a new source file. In real life you would overwrite the original one, but to ease tinkering we are creating a new one.

Programmatic compilation

I.e., where we show how you can give orders to your compiler.

CreateTestCompilation is fairly easy to understand: we need to compile the source files programmatically, and so we have to parse the text, gather the references to the assemblies needed for our program, and then give the order to compile.
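Those three steps could be sketched as follows. This is not the post's CreateTestCompilation: it takes source strings instead of files, it references only the core library, and the assembly name is arbitrary. On the early .NET Core versions used in this post you may need typeof(object).GetTypeInfo().Assembly.Location instead.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class CompilationSketch
{
    public static CSharpCompilation Create(IEnumerable<string> sources)
    {
        // 1. Parse the text of every source file into a syntax tree.
        var trees = sources.Select(text => CSharpSyntaxTree.ParseText(text)).ToList();

        // 2. Gather the references: here only the assembly that defines System.Object.
        var references = new[]
        {
            MetadataReference.CreateFromFile(typeof(object).Assembly.Location)
        };

        // 3. Give the order to compile (nothing is emitted until you ask for it).
        return CSharpCompilation.Create(
            "TestCompilation", trees, references,
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));
    }
}
```

From the returned compilation you can then call GetSemanticModel for each tree in SyntaxTrees, which is what the loop in the Main method relies on.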

Let’s initialize everything to 42

Because you know, why not?

InitializerRewriter is an implementation of the abstract class CSharpSyntaxRewriter, which is used when you want to modify the tree, while CSharpSyntaxWalker is chosen when you just want to walk through it. VisitVariableDeclaration is one of many functions that you can override, specifically the one that is invoked whenever the walker hits a VariableDeclarationSyntax node. Of course you can also override the generic Visit to get access to all nodes. SyntaxTrivia is all the things that are useful to humans and not to the compiler, such as whitespace or comments.

The first thing to notice is the first condition of the first if: it checks whether the type of the node that we are visiting is an int. Since we are looking at the Symbol of the model, the condition will be true even if the declaration is in the form “var a = 0”, that is to say we are not merely checking the syntax, but the semantic value. If the second condition is true, that is to say there isn’t an initializer, we create one and we set the value to 42. The second if checks whether there is an int variable that is initialized, but not to 42. In that case we change the initialization to 42; again, technically we create a new one.

Conclusion

The practical steps to create an initializer are three:

  1. you create a new value, in our case a “42” with a leading space
  2. create a new assignment with that value
  3. use the assignment to replace the original initializer

We can’t create the expression directly: we have to use the factory. These steps are intuitive if you have experience with compilers: first you create a value, then an expression. But if you don’t have experience with compilers it may seem superfluous: why can’t you just set the initializer to 42?

If you want to access the power of the compiler you have to understand how it thinks, how it has to manage every line of code you write. For a compiler there are always many possibilities to consider and you have to help it narrow them down. For instance, you may want to assign not a simple value, but another variable. If you understand this, three lines aren’t too much to ask to access such power.

You also have to remember that you can’t modify anything in the original tree. We create a new VariableDeclarationSyntax node with new variables, with the help of the WithVariables method.
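Putting the pieces together, a simplified rewriter might look like this. It is a sketch rather than the post's InitializerRewriter: the class name is hypothetical, it sets every int variable to 42 without first checking the existing initializer, and it doesn't bother with the leading space, so the new initializer carries no whitespace of its own.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public class InitializerRewriterSketch : CSharpSyntaxRewriter
{
    private readonly SemanticModel _model;

    public InitializerRewriterSketch(SemanticModel model)
    {
        _model = model;
    }

    public override SyntaxNode VisitVariableDeclaration(VariableDeclarationSyntax node)
    {
        // Semantic check: the symbol is System.Int32 for "int" and for an inferred "var".
        var type = _model.GetSymbolInfo(node.Type).Symbol as ITypeSymbol;
        if (type == null || type.SpecialType != SpecialType.System_Int32)
            return base.VisitVariableDeclaration(node);

        // 1. create the value
        var value = SyntaxFactory.LiteralExpression(
            SyntaxKind.NumericLiteralExpression, SyntaxFactory.Literal(42));

        // 2. create the "= 42" clause with that value
        var initializer = SyntaxFactory.EqualsValueClause(value);

        // 3. replace the initializer of every variable, keeping the original commas
        var variables = node.Variables.Select(v => v.WithInitializer(initializer));
        return node.WithVariables(
            SyntaxFactory.SeparatedList(variables, node.Variables.GetSeparators()));
    }
}
```

Visit returns a new root node and leaves the original tree untouched. You may want to call NormalizeWhitespace on the result before writing it to a file.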

You can now go back to Program.cs, add a simple variable declaration such as int one, two; or string three; and see the new source files in the new_src folder. If you run the program, you will notice that it also changes var i = 0 into var i = 42, proving that it checks the results of the compilation and not merely the syntax, and that compilation may not always do what you expect it to do.

Enjoy playing with Roslyn!

After many posts from Federico Tomassetti, this one is brought to you by Gabriele Tomassetti. Because programming is a family business.