Sometimes, we need to parse some code for all sorts of purposes: building an interpreter, generating something out of it, or perhaps building an editor for that language. When this is the case, we may just start considering our options: should we build a parser ourselves, license a commercial parser or just use an open-source one? In this article, we explore these options and share some guidelines to support you in your decision.
What kind of applications can benefit from a parser?
A parser can be used for many different goals.
Here we list the most common examples:
- You may want to transpile your code to some other language to reuse tools that support the target language. For example, you may want to transpile code in a certain language to Java or C to reuse existing Java or C compilers. To do that, a parser is needed.
- You may want to generate something out of some code. For example, you may want to generate sequence diagrams or documentation from some code. In this case you will need to parse the code to recognize its structure and extract the relevant information (e.g., the comments).
- You may want to build an interpreter or a compiler for an existing language because there is not one for your platform of interest. For example, someone at some point decided that running JavaScript outside the browser and on their desktop computer was a good idea, and they had to build a parser for JavaScript.
- You may want to design a new language, and therefore you need to provide a set of tools for your language such as an interpreter or a compiler, an editor, and maybe some less widely used tools like a simulator. All of those tools would require a parser.
What role does a parser play in these applications?
In all of the scenarios that we have seen, a parser is the first component in a larger application. All of the applications we have listed are Language Engineering applications and we define their architecture using the concept of Language Engineering Pipelines.
In these Language Engineering Pipelines we combine different components together, where each one (but the first one) is consuming the output of the previous one and each one (but the last one) is producing something useful to the next stage. The first component does not take its input from the previous component but from the external world, and the last component does not provide its output to the next component but to the external world.
In all of these Language Engineering Pipelines the first component is always a parser that takes some code and produces an Abstract Syntax Tree (AST). The AST is a data structure which contains all the information extracted from the code, in a form that simplifies implementing the rest of the Language Engineering application.
This means that, while the parser per se is not an application, it is a very useful component that enables us to build all sorts of interesting applications for a given language.
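To make the idea of a pipeline more concrete, here is a minimal sketch in Python. It is not a real parser: the `Node` class, the toy `parse` function (which only understands lines like `x = 1`), and the stage names are all hypothetical and exist only to show how each stage consumes the output of the previous one.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A minimal, hypothetical AST node: a kind, an optional value, and children."""
    kind: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def parse(code: str) -> Node:
    """First stage (toy parser): turns lines like 'x = 1' into Assignment nodes."""
    root = Node("Program")
    for line in code.splitlines():
        if "=" in line:
            name, expr = (part.strip() for part in line.split("=", 1))
            root.children.append(
                Node("Assignment", children=[Node("Name", name), Node("Literal", expr)])
            )
    return root


def collect_names(ast: Node) -> List[str]:
    """Middle stage: consumes the AST and extracts the assigned variable names."""
    return [c.children[0].value for c in ast.children if c.kind == "Assignment"]


def render(names: List[str]) -> str:
    """Last stage: produces the output handed to the external world."""
    return "Variables: " + ", ".join(names)


def run_pipeline(source: str, stages: List[Callable]):
    """Feeds the output of each stage into the next one."""
    result = source
    for stage in stages:
        result = stage(result)
    return result


print(run_pipeline("x = 1\ny = 2", [parse, collect_names, render]))
# prints: Variables: x, y
```

Only the first stage touches the source text. Every later stage works on the AST or on data derived from it, which is why the quality of the AST matters so much for the rest of the application.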
Important characteristics in a parser
If you are considering adopting a parser (possibly after having built it) we think you should consider these points:
- Implementation language: this is the language in which the parser itself is built, not the language recognized by the parser. You could for example have a parser written in Java (implementation language) that can recognize RPG code (parsed language). Why is this important? Because a parser is more often than not used as a library inside your larger application. For example, if you want to generate syntax diagrams for RPG code, you can use a parser that can recognize RPG and then use its output as an input for your logic that actually prints the diagram. If the implementation language of the parser is Java, you will be able to write such logic in Java (or in a JVM-compatible language). So you may want to be sure to choose an implementation language that your team is comfortable with.
- AST APIs: the parser produces an AST, which is then consumed by the other components of your application. The better the API exposed by the AST, the easier it is to write the following components. Some ASTs provide advanced APIs to find the elements you are interested in, filter them, transform them, or generate other ASTs from them. Others instead offer a more basic API relying on a visitor or a listener. If you are not familiar with Language Engineering applications this is a point you may tend to underestimate, but it can make a difference and significantly impact the complexity of writing and maintaining the other components in your Language Engineering application.
- Completeness/Correctness: can your parser parse all the valid files you need to process? And can it do that correctly, e.g., without producing errors that are not there or recognizing constructs incorrectly? While having a complete and correct parser is always desirable, this may be more or less important depending on the type of application you are building. A compiler that you intend to use to process tens of thousands of files may need to be more complete and correct than a parser you want to use to generate a few syntax diagrams.
- Maintainability: by this we mean how quickly an identified problem can be solved. Is this something that could take weeks and a prayer, or can you expect problems to be solved in a matter of hours or days? Depending on the situation this may be very relevant or not at all. If you use the parser inside a compiler that is vital for you, you may need to be sure to get the problem fixed quickly, while if you are doing a migration planned to take a very long time you may be more relaxed.
- Time to adopt: how urgent is it to be able to start using the parser? Can you afford waiting a few months or do you need the parser for something you would like to put in production in a couple of months?
- License: does the license under which the parser is obtained work for your goals?
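The difference between a basic, visitor-style API and a richer AST API (the "AST APIs" point above) can be illustrated with a small Python sketch. The `Node` class and its `walk` and `find_all` methods are hypothetical names chosen for this example; they do not reproduce the API of any specific library.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical AST node used only for this illustration."""
    kind: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Richer API: depth-first traversal of this node and its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find_all(self, kind: str) -> List["Node"]:
        """Richer API: all nodes of a given kind, in traversal order."""
        return [node for node in self.walk() if node.kind == kind]


class FunctionNameVisitor:
    """Basic API: the user writes a visitor class and accumulates state by hand."""

    def __init__(self) -> None:
        self.names: List[str] = []

    def visit(self, node: Node) -> None:
        if node.kind == "Function":
            self.names.append(node.name)
        for child in node.children:
            self.visit(child)


tree = Node("File", children=[
    Node("Function", "main", [Node("Call", "helper")]),
    Node("Function", "helper"),
])

# Visitor style: create the visitor, run it, then read the collected state.
visitor = FunctionNameVisitor()
visitor.visit(tree)
print(visitor.names)  # ['main', 'helper']

# Query style: a single expression on the AST gives the same result.
print([n.name for n in tree.find_all("Function")])  # ['main', 'helper']
```

Both approaches give the same answer here, but the query style needs far less boilerplate for each new question you ask of the AST, and that adds up over the life of a project.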
Building your own parser: what does it mean?
We have seen what a parser can be used for, and we have seen which characteristics are important in a parser, so let’s see what it means to build your own parser with respect to those characteristics.
Note that most of the following considerations apply also if you have someone else build a parser specifically for you (so that you end up owning the resulting codebase).
- Implementation language: if you build your own parser you can implement it in the language you prefer, provided there are parser generators for that language. While it is true that you could also build your own parser without relying on a parser generator, that requires way more effort and we would not advise doing that. We suggest using a parser generator and ANTLR in particular. ANTLR is a tool that, given a grammar, generates a parser written in any of these implementation languages: C++, C#, Dart, Go, Java, JavaScript, PHP, Python3, Swift, TypeScript.
- AST APIs: if you build your parser using ANTLR exclusively you will get a parser providing barebone APIs, however you can add a level on top of it to get more powerful APIs. To do that you can use one of our open-source libraries, which are collectively called StarLasu. At this time they are:
- Kolasu, written in Kotlin. You can find a tutorial for it here: Building advanced parsers using Kolasu
- Tylasu, written in TypeScript
- Pylasu, written in Python
- Sharplasu, written in C#
- Completeness/Correctness: here it is up to you to develop the parser up to the required standards for your use case. Besides the effort you can spend on it, you can be limited by the availability of examples, specifications, or the experience in testing parsers.
- Maintainability: also in this case you are a master of your own destiny. Which may or may not be a good thing. It may be a good thing if your team has experience in building parsers, because in that case, if you can reserve capacity for supporting the parser, every problem should be fixed in a timely manner. If you do not have the skills or you cannot protect some time for maintaining the parser, then issues can take an unpredictable time to be solved. And you may meet problems you just do not know how to solve.
- Time to adopt: building a parser from scratch can take a developer 4 to 8 months, for most typical languages. Of course the complexity of the language, the completeness of the parser, and the quality of the APIs exposed are all important factors, but this is a ballpark figure for your reflections. These values are for developers who already know how to build a parser, so you may need to add a few months for your team to get up to speed with the parsing technologies needed, if they have no previous experience.
- License: no problem at all here. If the code is yours, you can do whatever you want with it.
Using an open-source parser
Let’s see what happens when you adopt an open-source parser.
- Implementation language: here you need a bit of luck. The fact is that you are looking for a parser able to recognize a certain language and implemented in a language that your team is familiar with. For example, you may need to process Java code inside a JavaScript application. In your quest for a parser you may find JavaParser, a parser that recognizes Java, and which is written in Java. Bad luck: you cannot use it in your application. Now, if you can work with a parser with barebone APIs, you can just pick an ANTLR grammar and from that generate the parser in one of the ten implementation languages supported by ANTLR, so that could help increase your chances.
- AST APIs: most open-source parsers we encountered provide basic APIs, as they are mostly based on ANTLR and do not have a proper AST on top. There are notable exceptions, like JavaParser, which offers advanced APIs. There are also other advanced parsers for very widespread languages, like XML or JSON. Here you have way better chances if your language is very, very popular.
- Completeness/Correctness: your mileage can vary, as there are parsers which have been around for many years, and have been used and refined a lot. If you see that the project has at least 1,000 commits, chances are that some time and care have been invested in the parser. Consider that for example JavaParser at this time has more than 9,000 commits (1,663 from myself :D). If your parser has fewer than 100 commits, you can consider it as a stub that may be a good starting point to build your own parser, but do not expect it to be something you can use “as-is”.
- Maintainability: here our advice is to look at the average age of issues. If there are hundreds of issues that have been around for years, you may have two factors causing it: 1) the team of contributors may be just not big enough to handle all the requests or 2) the parser is actually hard to maintain. In the first case, you can have your own team contribute to the parser. You may need to invest some time in familiarizing yourself with the codebase, but that is an option. In the second case you have instead a problem that is harder to crack.
- Time to adopt: good news, you can start using the parser right away.
- License: you need to check if the parser is released under a license that works for you. Chances are that MIT, BSD, or Apache License v2 could work, while GPL would probably not work if you plan to use the parser in a commercial project.
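The two rules of thumb above (commit count and average age of open issues) are easy to turn into a quick check. The sketch below is only an illustration: the function names are hypothetical, the "unclear" label for projects between 100 and 1,000 commits is our assumption for the range the guideline does not cover, and the commit count and issue dates are expected to be gathered by you from the project's repository.

```python
from datetime import date
from typing import List


def classify_by_commits(commit_count: int) -> str:
    """Applies the commit-count rule of thumb from the text."""
    if commit_count >= 1000:
        return "mature"   # some time and care have likely been invested
    if commit_count < 100:
        return "stub"     # a starting point, not usable as-is
    return "unclear"      # assumption: the guideline says nothing about this range


def average_issue_age_days(opened_on: List[date], today: date) -> float:
    """Average age, in days, of the open issues given their opening dates."""
    if not opened_on:
        return 0.0
    total_days = sum((today - opened).days for opened in opened_on)
    return total_days / len(opened_on)


# Example with made-up numbers, not taken from any real project.
print(classify_by_commits(9000))  # mature
print(classify_by_commits(42))    # stub
open_issues = [date(2023, 1, 1), date(2022, 1, 1)]
print(average_issue_age_days(open_issues, today=date(2024, 1, 1)))  # 547.5
```

A high average age does not tell you which of the two causes is at play, so it is a signal to dig into the codebase and the contributor activity, not a verdict.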
Licensing a parser, how does that work?
The third option is to license a parser that someone else has built. Let’s see what it means.
- Implementation language: like when adopting an open-source parser, you need to find a parser that processes the language you are interested in and is also implemented in a language you are comfortable with. Finding a parser that satisfies these two requirements can require some luck, as there are not so many commercial parsers available out there.
- AST APIs: a proper commercial parser should have advanced APIs. It should have been built by professionals that do this all day long. Typically these vendors also provide consulting services around their parsers, and therefore have an interest in offering parsers that can be used successfully in Language Engineering projects. A key factor for that is having advanced AST APIs.
- Completeness/Correctness: if the parser has been around for a while it should be reasonably complete and correct. And you should ask the vendor to demonstrate that to you.
- Maintainability: if there is a vendor behind the parser, it is reasonable to expect that some form of support is provided.
- Time to adopt: as in the case of an open-source parser, it is just a matter of starting to use the parser. If your purchasing department does not get in the way, obviously.
- License: here you may want to read the fine print. Typically commercial parsers are provided to companies that want to use them for commercial purposes, so licenses should permit this kind of usage, but here your lawyers can earn their salary and double check that.
What is the best option in my situation?
Deciding which way to go depends on your situation, but a few points you could reflect on are:
- Does your team have time available to build or maintain the parser? If not, a licensed parser with an adequate support contract can be the way to go.
- How soon do you need to get the parser? If you have time pressure for getting the project out of the gate, then building your own parser is something you cannot afford.
- Is your project going to be open-source? If so, using a licensed parser may not be feasible.
- Is there an existing parser, open-source or licensed, that can parse the language you are interested in and is implemented in your language of choice? If not, then you can only build this parser yourself.
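If it helps to see these questions side by side, the checklist above can be written down as a small function. This is a deliberately simplistic sketch: the `Situation` fields and the `viable_options` function are hypothetical names, each rule only removes options that the corresponding question rules out, and real decisions will involve trade-offs that a few booleans cannot capture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Situation:
    """Hypothetical answers to the four questions in the checklist."""
    team_has_capacity: bool       # time to build or maintain a parser?
    under_time_pressure: bool     # is the parser needed soon?
    project_is_open_source: bool  # will the project be open-source?
    suitable_parser_exists: bool  # right parsed language AND implementation language?


def viable_options(s: Situation) -> List[str]:
    """Returns the sourcing options left after applying the checklist."""
    if not s.suitable_parser_exists:
        # Nothing to adopt or license: building is the only option left.
        return ["build"]
    options = {"build", "licensed", "open-source"}
    if s.under_time_pressure or not s.team_has_capacity:
        options.discard("build")
    if s.project_is_open_source:
        options.discard("licensed")
    return sorted(options)


# A closed-source project in a hurry, with no spare capacity in the team.
print(viable_options(Situation(False, True, False, True)))
# ['licensed', 'open-source']

# No existing parser fits: only building remains.
print(viable_options(Situation(True, False, True, False)))
# ['build']
```

Note that the rules can conflict: if no suitable parser exists and you are also under time pressure, the function still answers "build", which simply means the schedule has to change.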
Summary
Deciding how to source the parser for your Language Engineering project is a complex decision, especially if it is the first time that you look into such problems. In this article we listed a few aspects we suggest considering, based on our experience. While some are more obvious (like the license), others are sometimes overlooked (like the AST APIs or the implementation language). Hopefully this guide can help you in making up your mind. And if you are not sure, we can provide consulting options to help you make a decision.
Also, if you decide to build your own parser, we suggest taking a look at our video course on building parsers with ANTLR. You could also apply the principles of the Chisel method for building parsers, a method we developed at Strumenta. It would help in ensuring maintainability and correctness for your parser.
All the best for your decision and your project!