Translate Javascript to C#

Problems and Strategies to Port from Javascript to C#

Let’s say you need to automatically port some code from one language to another: how are you going to do it? Is it even possible? Maybe you have already seen a conversion between similar languages, such as Java to C#. That sounds much simpler in comparison.

In this article we are going to discuss some strategies to translate Javascript to a very different language, such as C#. We will discuss the issues involved and plan some possible solutions. We will not get as far as writing code: that would be far too complicated for an introduction to the topic. Let’s avoid putting together something terribly hacky just for the sake of typing some code.

Having said that, we are going to see all the problems you may find in converting one real Javascript project: fuzzysearch, a tiny but very successful library to calculate the difference between two strings, in the context of spelling correction.

When it’s worth the effort

First of all, you should ask yourself if the conversion is worth the effort. Even if you were able to successfully obtain some runnable C#, you have to consider that the style and the architecture will probably be “unnatural”. As a consequence, the project could be harder to maintain than if you had written it from scratch in C#.

This is a common problem even in carefully planned conversions, such as the one that originated Lucene.Net, which started as a conversion from Java to C#. Furthermore, you will not be able to use the converter without manual work for every specific project, because even the standard libraries are different. Look at the example: while you could simply capitalize haystack.length into haystack.Length, you cannot just capitalize charCodeAt; you will have to map the different functions between the source and destination language.
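To make the mapping concrete, here is a minimal sketch, in plain C#, of what converted call sites could look like (the variable name haystack is taken from the example; the rest is illustrative):

```csharp
using System;

// JavaScript:  var hlen = haystack.length;
//              var code = haystack.charCodeAt(1);
// In C# there is no method called CharCodeAt: indexing a string yields a
// char, whose numeric code is obtained by implicit conversion to int,
// while 'length' really does just need to be capitalized.
string haystack = "code";
int hlen = haystack.Length;    // maps haystack.length
int code = haystack[1];        // maps haystack.charCodeAt(1)

Console.WriteLine($"{hlen} {code}");   // prints "4 111" ('o' is 111)
```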

On the other hand, all languages have areas of specialization which may interest you, such as Natural Language Processing in Python. If you accept that you will have to do some manual work, and you are very interested in one project, creating an automatic conversion will give you a huge head start. If instead you are interested in building a generic tool, you may want to concentrate on small libraries, such as the Javascript one in our example.

Parse with ANTLR

The first step is parsing, and for that you should just use ANTLR. There are already many grammars available; they may not necessarily be up to date, but they are much better than starting from scratch and they will give you an idea of the scale of the project. You should use visitors instead of listeners, because they allow you to control the flow more easily. You should parse the different elements into custom classes that can manage the small problems that arise. Once you have done this, generating C# should be easier.

The small differences

There are things that you can just skip, such as the first and last lines: they most probably don’t apply to your C# project. But you must pay attention to the small differences: the var keyword has a different meaning in Javascript and C#. By coincidence it would work most of the time, and it would be quite useful to work around the lack of static typing in Javascript. But it’s not magic: you are just hoping that the compiler will figure it out. And sometimes it’s not a one-to-one conversion. For instance, you can’t use var in C# the way it’s used in the initialization of a Javascript for loop.
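As a sketch of that limitation: C#’s var cannot declare two variables in a single for initializer, so a converter has to fall back to explicit types (the two-variable loop here mirrors the style of fuzzysearch, not its exact code):

```csharp
using System;

// JavaScript:  for (var i = 0, j = 0; i < hlen; i++) { ... }
// 'for (var i = 0, j = 0; ...)' does not compile in C#, because
// implicitly typed variables cannot have multiple declarators.
int hlen = 3;
int steps = 0;
for (int i = 0, j = 0; i < hlen; i++)   // explicit 'int' instead of 'var'
{
    steps++;
    j += i;   // j exists only to mirror the two-variable JS initializer
}
Console.WriteLine(steps);   // prints 3
```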

The continue before outer should be transformed into a goto, but when continue appears alone it works just as in C#. A difference that can be fixed quite brutally is the strict equality comparison “===/!==”, which can be replaced with “==/!=” in most cases, since it exists to deal with problems caused by the dynamic typing of Javascript. In general, you can do a pre-parse check and transform the original source code to avoid some problems, or even comment out the things that cannot be easily managed.
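A sketch of that transformation, on a subsequence-matching loop in the spirit of fuzzysearch (the method body here is illustrative, not the library’s exact code):

```csharp
using System;

// JavaScript uses a labeled continue to restart the outer loop:
//   outer: for (...) { ... continue outer; ... }
// C# has no labeled continue, so a converter can emit a goto targeting a
// label placed at the end of the outer loop body.
static bool FuzzyMatch(string needle, string haystack)
{
    int j = 0;
    for (int i = 0; i < needle.Length; i++)
    {
        char c = needle[i];
        while (j < haystack.Length)
        {
            if (haystack[j++] == c)
                goto outer;        // plays the role of 'continue outer'
        }
        return false;              // ran out of haystack characters
        outer: ;                   // label at the end of the loop body
    }
    return true;
}

Console.WriteLine(FuzzyMatch("fzy", "fuzzy"));    // True
Console.WriteLine(FuzzyMatch("fzyx", "fuzzy"));   // False
```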

I present you thy enemy: dynamic typing

The real problem is that Javascript uses dynamic typing while C# uses static typing. In Javascript anything can be anything, which leads to certain issues, such as the aforementioned strict equality operator, but it’s very easy to use. In C# you need to know the type of your variables, because there are checks to be made, and this information is simply not available in the Javascript source code. You might think that you could just use the var keyword, but you can’t: the compiler must be able to determine the real type at compile time, and that will not always be possible. For example, you cannot use it to declare function arguments.

You can use the dynamic keyword, which makes the type be determined at execution time. Still, this doesn’t fix all the problems, such as initialization. You may check the source code for literal initializations or, in theory, even execute the original Javascript from C# and find a way to determine the correct type, but that would be quite convoluted. You might get lucky (and in a small project, such as our example, you will), but not always.
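A minimal sketch of what emitting dynamic buys you, and what it costs: the compiler stops checking, and any mistake surfaces only at run time:

```csharp
using System;

// A converter that cannot infer a type could emit 'dynamic':
dynamic value = "fuzzy";
Console.WriteLine(value.Length);   // resolved at run time: prints 5

value = 42;                        // the same variable can change type,
Console.WriteLine(value + 1);      // just as in JavaScript: prints 43

// Calling value.Length here would still compile, but would throw a
// RuntimeBinderException at execution time, since int has no Length.
```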

There are also problems that are easier to manage than you might imagine. For instance, assigning a function to a variable is not something that you usually do as explicitly in C# as you do in Javascript, but it’s easy using a delegate type or a construct such as Func. Of course, you still have to determine the correct types of the arguments, if any are present, but this doesn’t add any other difficulty per se.
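The delegate translation can be sketched like this (the variable and argument types are a converter’s guess, as discussed above):

```csharp
using System;

// JavaScript:  var compare = function (a, b) { return a === b; };
// C#: the same idea with a Func delegate; the argument types (char here)
// must be determined by the converter.
Func<char, char, bool> compare = (a, b) => a == b;

Console.WriteLine(compare('a', 'a'));   // True
Console.WriteLine(compare('a', 'b'));   // False
```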

Not everything is an object and other issues

In Javascript a “string” is a string, but not an object, while in C# everything is an object, without exceptions. This is a relevant issue, but it’s less problematic than dynamic typing. For instance, to convert our example we just have to wrap the function in a custom class, which is not really hard. One obvious problem is that different languages have different libraries, and some will not be available in the destination language. On the other hand, some parts of the project might not be needed in the destination language, because better alternatives already exist. Of course, you still have to actually change all the related code, or wrap the real library of the destination language in a custom class that mimics the original one.
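For example, a converter could emit a wrapper like this (FuzzySearch and Search are names invented for this sketch; the one-line body stands in for the translated function):

```csharp
using System;

Console.WriteLine(FuzzySearch.Search("fuz", "fuzzysearch"));   // True

// In C# a function cannot live outside a type, so the translated
// free-standing JavaScript function gets a static class around it.
public static class FuzzySearch
{
    public static bool Search(string needle, string haystack)
        => haystack.Contains(needle);   // trivial stand-in body
}
```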

Conclusion

There are indeed major difficulties, even for small projects, in transforming code from one language to another, especially when the languages are as different as Javascript and C#. But let’s imagine that you are interested in something very specific, such as a very successful library and its plugins. You want to port the main library and to give the developers of the plugins a simpler way to port their work. There are probably many similarities in the code, so you can do most of the work needed to manage the typical problems and provide guidance for the remaining ones.

Converting code between languages so different in nature is certainly not easy, but you can apply a mixed automatic/manual approach: convert a large amount of code automatically and fix the corner cases by hand. If you can also translate the tests, you can later refactor the code, once it is in C#, and improve its quality over time.

Code Generation with Roslyn: a Skeleton Class from UML

Get the source code for this article on GitHub

We have already seen some examples of transformation and analysis of C# code with Roslyn. Now we are going to see how to create a more complex example of code generation with Roslyn, with the parsing handled by Sprache. We are going to create a skeleton class from a PlantUML file; in short, we are doing the inverse of what we did before. The first step is to parse, of course.

As you can see, there are four entities in this example: PlantUML start and end tags and class, variable and function declarations.

Parsing all the things

We are going to parse the file line by line instead of doing it all in one go. This is partly because of the limitations of Sprache, but also because it’s easier to correctly parse one thing at a time than to try to get everything right in a single pass.

With CharExcept we parse all characters except the one(s) indicated, which is a handy but imprecise way to collect all the text of an identifier. The roughness of this process is obvious: we are forced to exclude all the characters that can come after an identifier. If you look at the .plantuml file at the beginning of the article, you see that there is a space after the field names, a ‘}’ after the static modifier, a ‘:’ after the argument, to divide the identifier from its type, and finally the closing parenthesis after the type. You might say that we should simply have checked for “Letters”, which would work in this specific case, but it would exclude legal C# identifier names.
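That rough identifier rule can be sketched like this (the excluded characters follow the list above; the real parser in the project may differ slightly):

```csharp
using Sprache;

// Exclude every character that can follow an identifier in the diagram:
// space, ':', ')' and '}'. Handy, but clearly imprecise.
Parser<string> identifier =
    Parse.CharExcept(" :)}").AtLeastOnce().Text().Token();

string name = identifier.Parse("counter : int");
// name == "counter": the parser stopped at the space before ':'
System.Console.WriteLine(name);
```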

The Modifier parser is quite uninteresting, except for lines 6 and 11, where we see the same problem just mentioned of identifying the correct name. The last case refers to something that doesn’t happen in this example, but could happen in other UML diagrams: override modifiers. The real deal is on lines 18 and 22, where we see the Ref parser, which is used, as the documentation says, to “refer to another parser indirectly. This allows circular compile-time dependency between parsers”. DelimitedBy is used to select many of the same items delimited by the specified rule, and finally Optional marks a rule that isn’t necessary for a correct parse, but might appear. Since the rule is optional, its value could be undefined, and it must be accessed using the method shown on line 22. The rule Method is slightly more complicated, but it uses the same methods. In case you are wondering, methods without a return type are constructors.

Parsing line by line

We can see our parser at work in the main method, where we try to parse every line with every parser and, if successful, we add the value to a custom type that we are going to see later. We need a custom type because code generation requires having all the elements in place: we can’t do it line by line, at least not if we want to use the Roslyn formatter. We could just take the information and print it ourselves, which is good enough for small projects, but complicated for larger ones; also, we would miss all the nice automatic formatting options. On line 13 we skip a cycle if we found a method, because methods could also be parsed, improperly, as fields; to avoid that risk we jump over them.

Code Generation

If you remember the first lessons about Roslyn, it’s quite verbose, because it’s very powerful. You also have to remember that we can’t modify nodes, not even the ones we create ourselves that are not, say, parsed from a file. Once you get used to using SyntaxFactory for everything, it’s all quite obvious: you just have to find the correct methods. The using directives are simply the ones usually inserted by default by Visual Studio.

Generation of methods

Let’s start by saying that Declarations and DeclarationType are fields in our custom class, which is not shown here, but you can look at it in the source code. Then we proceed to generate the methods of our skeleton C# class. MethodDeclaration allows us to choose the name and the return type of the method itself; mods refers to the modifiers, of which there can obviously be more than one, so they are in a list. Then we create the parameters, which in our case need only a name and a type.

We choose to throw an exception, since we obviously cannot determine the body of the methods from the UML diagram alone. So we create a throw statement and a new object of the type NotImplementedException. This also allows us to give the method a meaningful body. You should add a body in any case if you use the formatter, because otherwise it will not create a correct method: there won’t be a body or the curly braces.
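That body can be sketched with SyntaxFactory roughly like this (the method name DoSomething and the modifier are illustrative, not taken from the project):

```csharp
using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using static Microsoft.CodeAnalysis.CSharp.SyntaxFactory;

// A 'throw new NotImplementedException();' statement wrapped in a block,
// ready to be attached to a generated method with WithBody().
var body = Block(
    ThrowStatement(
        ObjectCreationExpression(IdentifierName("NotImplementedException"))
            .WithArgumentList(ArgumentList())));

var method = MethodDeclaration(ParseTypeName("void"), "DoSomething")
    .AddModifiers(Token(SyntaxKind.PublicKeyword))
    .WithBody(body)
    .NormalizeWhitespace();

// Prints the generated method, formatted similarly to:
//   public void DoSomething()
//   {
//       throw new NotImplementedException();
//   }
Console.WriteLine(method.ToFullString());
```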

Generation of fields

The “field” case is easier than the “method” one, and the only really new thing is on line 12, where we use a method to parse the type from a string filled by our parser.

The end of the Generate method is where we add the class created by the for cycle and use the Formatter. Notice that cu is the CompilationUnitSyntax that we created at the beginning of the method.

Limitations of this example

The unit tests are not shown because they don’t contain anything worth noting, although I have to say that Sprache is really easy to test, which is a great thing. If you run the program you will find that the generated code is correct, but still missing something: it lacks some of the necessary using directives, because we can’t detect them starting just from the UML diagram. In a real-life scenario, with many files and classes and without the original source code, you might identify the assemblies beforehand and then use reflection to find their namespace(s). Also, we obviously don’t implement many things that PlantUML supports, such as the relationships between classes, so keep that in mind.

Conclusions

Code generation with Roslyn is not hard, but it requires knowing exactly what you are doing. It’s better to have an idea of the code you are generating beforehand, or you will have to take into account every possible case, which would make every little step hard to accomplish. I think it works best for specific scenarios and short pieces of code, where it can become very useful. In such cases, you could create tools that are useful and productive for your project, or for yourself, in a very short time, and benefit from them as long as you don’t change tools or working habits. For instance, if you are a professor, you could create an automatic code generator to translate your pseudo-code of a short algorithm into real C#. If you think about it, this complexity is a good thing: if anybody could generate whole programs from scratch, we programmers would lose our jobs.

You might think that using Sprache for such a project was a bad idea, but it’s actually a good tool for parsing single lines. And while there are limitations, this approach makes it much easier to get something working in little time, instead of waiting to complete a full grammar for a “real” parser. For the cases in which code generation is most useful, specific scenarios and the like, this is actually the best approach, in my opinion, since it allows you to easily pick and choose which parts to use and just skip the rest.

Create a simple parser in C# with Sprache

You can find the code for this article on github

Everybody loves ANTLR, but sometimes it may be overkill. On the other hand, a regular expression may not cut it, or it may be too complicated to maintain. What can a developer do in such cases? Use Sprache. As its creators say:

Sprache is a simple, lightweight library for constructing parsers directly in C# code.

It doesn’t compete with “industrial strength” language workbenches – it fits somewhere in between regular expressions and a full-featured toolset like ANTLR.

It is a simple but effective tool, whose main limitation is being character-based. In other words, it works on characters and not on tokens. The advantage is that you can work directly with code and you don’t have to use external tools to generate the parser.

The guessing game

You can look at the project website if you want to see specific real-world uses; let’s just say that it’s even credited by ReSharper and it was created more than six years ago, so it’s stable and quite good. It’s ideal for managing things like error messages created by other tools, reading logs, parsing queries like the ones you would use for a simple search library, or reading simple formats like JSON. In this article we will create a parser for a simple guessing game. We will use .NET Core and xUnit for the unit tests, so it will also work on Linux and Mac.

The objective of the game is to guess a number, and to do that you can ask if the number is greater than a number, less than a number or between two numbers. When you are ready to guess you simply ask if it’s equal to a certain number.

Setup the project

We will use VSCode instead of Visual Studio, but in the github project you will find two projects, one for each. This is because there are still some compatibility quirks related to project.json and the different .NET Core tools versions used by Visual Studio and by the standalone command line. To clarify, the project.json generated by the standalone .NET Core command line will also work with Visual Studio, but not vice versa (this might have changed by the time you read this). Also, with two projects you can easily see how Visual Studio integrates xUnit tests. The C# code itself is the same.

Create the file global.json in the directory of your project, in our case SpracheGame, then create another SpracheGame folder inside src and a SpracheGame.Tests folder inside test. Inside the nested SpracheGame folder you can create a new .NET core program with the usual:

While you are inside the SpracheGame.Tests folder you can create an xUnit test project with:

You can see the final structure here.

SpracheGame folder structure

Change both project.json, adding sprache as a dependency to the main project:

…and add the main project as a dependency for the xUnit test project.

If you are using Visual Studio you may need to add a runtimes section to both of your project.json:

See the .NET documentation for .NET Core Runtime IDentifier (RID) catalog if you need to know other platform IDs.

Create GameParser

Let’s start by creating a class called GameParser and by recognizing numbers and commands.

On line 3 there is the code to parse a number: we start with Sprache.Parse followed by a digit, of which there must be at least one; then we convert from IEnumerable<char> to string with Text(), and finally we discard whitespace with Token(). So first we choose the type of character we need, in this case Digit, then we set a quantity modifier and transform the result into something more manageable. Notice that we return Parser<string> and not an int.
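In essence, the rule just described looks like this:

```csharp
using Sprache;

// At least one digit, turned into a string, surrounding whitespace dropped.
Parser<string> number = Parse.Digit.AtLeastOnce().Text().Token();

string value = number.Parse("  42  ");
// value == "42": still a string, not an int
System.Console.WriteLine(value);
```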

On lines 5-6 we tell the parser to find a character ‘<‘ followed by a ‘>’, using Then(). We return an enum instead of a simple string. We can easily check for the presence of different options with Or(), but it’s important to remember that, just as with ANTLR, the order matters: we have to put the more specific case first, otherwise the input would match the generic case before reaching the correct one.

Now we have to combine these two simple parsers into one Play, and thanks to the LINQ-like syntax the task is very simple. Most commands require only one number, but there is one that requires two, because we have to check whether the number to guess is between two given numbers. It also has a different structure: first there is a number, then the symbol, and finally the second number. This is a more natural syntax for the user than a ‘<>’ symbol followed by two numbers. As you can see, the code is quite simple: we gather the elements with from .. in .. and then we create a new object with select.
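A sketch of the combination, assuming a Command enum and a Play class shaped like the ones described (the real project has more commands, including the two-number “between” form, gathered the same way):

```csharp
using Sprache;

var play = GameParser.PlayParser.Parse("> 50");
System.Console.WriteLine($"{play.Command} {play.Number}");   // Greater 50

enum Command { Greater, Less, Equal }

class Play
{
    public Command Command { get; set; }
    public int Number { get; set; }
}

static class GameParser
{
    static readonly Parser<Command> CommandParser =
        Parse.Char('>').Return(Command.Greater)
            .Or(Parse.Char('<').Return(Command.Less))
            .Or(Parse.Char('=').Return(Command.Equal));

    static readonly Parser<string> Number =
        Parse.Digit.AtLeastOnce().Text().Token();

    // Gather the pieces with from .. in .., build the result with select.
    public static readonly Parser<Play> PlayParser =
        from command in CommandParser.Token()
        from number in Number
        select new Play { Command = command, Number = int.Parse(number) };
}
```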

It’s time for Play

The only interesting part of the Play class is on lines 27-51, the Evaluate function, where the “magic” happens, and I use the term magic extremely loosely. The number to guess is provided to the function, then it is checked against the command and the numbers of the specific play that we are evaluating.

Unit Tests are easy

There are basically no disadvantages to using xUnit for our unit tests: it’s compatible with many platforms, it’s integrated with the Visual Studio Test Explorer, and it also has a special feature: theories. A theory is a special kind of test that allows you to supply multiple inputs to one test. Lines 3-6 show exactly how you can do it. In our case we are testing that our parser can parse numbers with many digits.
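The idea of a theory, sketched against the number rule from before (the class and test names are invented for this example):

```csharp
using Sprache;
using Xunit;

public class NumberTests
{
    static readonly Parser<string> Number =
        Parse.Digit.AtLeastOnce().Text().Token();

    // One test, three inputs: xUnit runs it once per InlineData row.
    [Theory]
    [InlineData("1")]
    [InlineData("42")]
    [InlineData("12345")]
    public void CanParseNumbersOfAnyLength(string input)
    {
        Assert.Equal(input, Number.Parse(input));
    }
}
```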

The following test is a typical one: we are checking that the symbol ‘>’ is correctly parsed as a Command.Greater. On line 27 we make sure that an exception is raised if we encounter an incorrect Play. Sprache also allows you to use TryParse instead of Parse, if you don’t want to throw an exception. As you can see, the simplicity of the tool makes it very easy to test.

Let’s put everything together

The main function doesn’t contain anything shocking: on lines 27-28 we parse the input and execute the proper command; then, on line 31, we check whether we guessed the correct number and, if so, we prepare to exit the cycle. Notice that we provide a way to exit the game even without guessing the number correctly, but we check for ‘q’ before trying to parse, because it would be an illegal command for GameParser.

Conclusions

This blog talks a lot about Language Engineering, which is a fascinating topic, but one not always used in the everyday life of the average developer. Sprache, instead, is a tool that any developer could find a use for. When a RegEx wasn’t good enough, you probably just redesigned your application, making your life more complicated. Now you don’t need to: when you meet the mortal enemy of regular expressions, that is to say nested expressions, you can just use Sprache, right in your code.

Getting started with ANTLR in C#

The code for this article is available on github.

Readers of this website will know that ANTLR is a great tool for quickly creating parsers, whether you are working with a known language or creating your own DSL. While the tool itself is written in Java, it can also be used to generate parsers in several other languages, for instance Python, C# or Javascript (with more languages supported by the newly released 4.6 version).

If you want to use C#, you can integrate ANTLR into your favorite IDE, as long as that IDE is a recent edition of Visual Studio. The runtime itself also works on Mono and can be used standalone, and you can look at the issues of the official C# target for ANTLR 4 to see if you can make it work with other setups, but the easiest way is to use Visual Studio and the provided extension to integrate the generation of the grammar into your C# project.

Setup

The first step is to install the ANTLR Language Support extension for Visual Studio: you just have to search for it under Tools > Extensions and Updates. This will let you easily integrate ANTLR into your workflow by automatically generating the parser and, optionally, the listener and visitor, starting from your grammar. Now you can add a new ANTLR 4 Combined Grammar, or an ANTLR 4 Lexer/Parser, in the same way you add any other new item. Then, for each one of your projects, you must add the NuGet package for Antlr4. If you want to manage options, for instance to disable the visitor/listener generation, you can see the official github project.

Create the Grammar

For our simple project we are going to create a grammar that parses two lines of text representing a chat between two people. This could be the basis for a chat program, or for a game in which whoever says the shortest word gets beaten up with a thesaurus. This is not relevant for the grammar itself, because it handles only the recognition of the various elements of the input; what you choose to do with these elements is managed through normal code. Add a new ANTLR 4 Combined Grammar with the name Speak. You will see that there is already some text in the new file; delete it all and replace it with the following text.

While you may create separate lexer and parser grammars, for a simple project you will want to use a combined grammar and put the parser before the lexer. That’s because as soon as ANTLR recognizes a token in the lexer part, it stops searching. So it’s also important to put the more specific tokens first and the generic ones, like WORD or ID, later. In this example, if we had inverted SAYS and WORD, SAYS would have been hidden by WORD. Another thing to notice is that you can’t use fragments outside of lexer rules.

Having said that, the lexer part is pretty straightforward: we identify a SAYS, which can be written uppercase or lowercase, a WORD, which can be composed of any letter, uppercase or lowercase, and a NEWLINE. Any text that is WHITESPACE, space or tab, is simply ignored. While this is clearly a simple case, lexer rules will rarely be much more complicated than this. Usually the worst thing that can happen is having to use semantic predicates. These are essentially statements that evaluate to true or false, and when they are false they disable the following rule. For instance, you may want a ‘/’ to begin a comment only if it is the first character of a line; otherwise it should be considered an arithmetic operator.

The parser is usually where things get more complicated, although that’s not the case this time. Every document given to the Speak grammar must contain a chat, which in turn is equal to two line rules followed by an End Of File marker. The line must contain a name, the SAYS keyword and a word. Name and word are identical rules, but they have different names because they correspond to different concepts, and they could easily change in a real program.

Visiting the tree

Just like we have seen for Roslyn, ANTLR will automatically create a tree and base visitor (and/or listener). We can create our own visitor class and change what we need. Let’s see an example.

The first line shows how to create a class that inherits from the SpeakBaseVisitor class, which is automatically generated by ANTLR. If you need to, you can restrict the type: for instance, for a calculator grammar you could use something like int or double. SpeakLine (not shown) is a custom class that contains two properties: Person and Text. Line 5 shows how to override the function that visits the specific type of node you want; you just need to use the appropriate type for the context, which contains the information provided by the parser generated by ANTLR. On line 13 we return the SpeakLine object that we just created; this is unusual, and it’s useful for the unit tests that we will create later. Usually you would want to return base.VisitLine(context) so that the visitor can continue its journey across the tree.
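Put together, a visitor along those lines could look like this (a sketch assuming the names ANTLR generates from the Speak grammar, SpeakBaseVisitor and SpeakParser.LineContext, and parser rules called name and word):

```csharp
using System.Collections.Generic;

public class SpeakLine
{
    public string Person { get; set; }
    public string Text { get; set; }
}

public class LineVisitor : SpeakBaseVisitor<SpeakLine>
{
    public List<SpeakLine> Lines { get; } = new List<SpeakLine>();

    public override SpeakLine VisitLine(SpeakParser.LineContext context)
    {
        var line = new SpeakLine
        {
            Person = context.name().GetText(),
            Text = context.word().GetText()
        };
        Lines.Add(line);
        return line;   // returned directly to help the unit tests;
                       // normally you would return base.VisitLine(context)
    }
}
```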

This code simply populates a list of SpeakLine objects that hold the name of the person and the word they have spoken. The Lines property will be used by the main program.

Putting it all together

As you can see, there is nothing particularly complicated. Lines 15-18 show how to create the lexer and then build the tree. The subsequent lines show how to launch the visitor that you have created: you get the context for whichever starting rule you use, in our case chat, and then order the visitor to visit the tree from that node.
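The wiring itself can be sketched with the Antlr4.Runtime classes (SpeakLexer and SpeakParser are the names ANTLR generates from the Speak grammar; the visitor class name is assumed):

```csharp
using Antlr4.Runtime;

var text = "john SAYS hello\nsarah SAYS hi\n";

var inputStream = new AntlrInputStream(text);   // raw characters
var lexer = new SpeakLexer(inputStream);        // characters -> tokens
var tokens = new CommonTokenStream(lexer);
var parser = new SpeakParser(tokens);           // tokens -> parse tree

var chatContext = parser.chat();                // the starting rule
var visitor = new LineVisitor();
visitor.Visit(chatContext);                     // walk the tree

System.Console.WriteLine(visitor.Lines.Count);  // one entry per line rule
```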

The program itself simply outputs the information contained in the tree. It would be trivial to modify the grammar to allow any number of lines to be added; neither the Visitor nor the main Program would need to be changed.

Unit testing

Testing is useful in all cases, but it is absolutely crucial when you are creating a grammar, to check that everything works correctly. If you are creating a grammar for an existing language you probably want to check many working source files, but in any case you want to start by unit testing the single rules. Luckily, since the creation of the Community edition there is a free version of Visual Studio that includes a unit testing framework. All you have to do is create a new Test Project, add all the necessary NuGet packages and add a reference to the project assembly you need to test.

There is nothing unexpected in these tests. One observation is that we can create a test to check the single-line visitor, or we can test the matching of the rule itself; you should obviously do both. You may wonder how the last test works, since we are trying to match a rule that doesn’t match, yet we still get the correct type of context as a return value, with some correctly matched values. This happens because ANTLR is quite robust and here it is checking only one rule: there are no alternatives, and since the input starts the correct way, it is considered a match, although a partial one.

Conclusions

Integrating an ANTLR grammar into a C# project is quite easy with the provided Visual Studio extension and NuGet packages, making it the best way to quickly create a parser for your DSL. No more piles of fragile RegExes, but don’t forget the tests.