Introduction

JavaCC was the first popular parser generator (or compiler-compiler, hence the name) for the Java platform. It originated at Sun Microsystems and was later open-sourced.

In this tutorial, we’ll show how to convert a parser from JavaCC to ANTLR. Many legacy parsers are dated by now; those would benefit from moving to ANTLR, which is more actively maintained and has other advantages, as we’ll see.

We’ll show a semi-automatic translation path, using a software tool that can quickly migrate the most common and boring parts of a grammar, leaving only the more interesting bits to manual work.

Why Converting to ANTLR Is a Good Idea

Let’s set this straight first: converting or migrating some technology, library, language, etc. just for the sake of it, or to be “more modern”, is generally not beneficial. In fact, widely used tools such as JavaParser and JSqlParser are based on JavaCC to this day.

Still, there are several reasons to migrate – some are generic, some are specific to JavaCC and ANTLR. So, let’s look at various aspects in which ANTLR is indeed “better” than JavaCC.

Like ANTLR and other similar tools, JavaCC takes a formal description of a language as input and outputs a parser; in this case specifically, a piece of Java source code that implements the parser. The generated parser is capable of recognizing the language described by the grammar. As part of the process, it can build a structured representation of the source code you invoke it on – a parse tree. Application code can then further process the tree.

JavaCC Is Java Only

JavaCC only generates parsers in Java. ANTLR, instead, can generate parsers in several programming languages. In principle, we could maintain a single grammar from which we generate a parser in Java, another in C++, another in JavaScript, and so on.

Of course, that’s only relevant if we’re writing a library or component that we want to consume in-process from several languages or platforms. That may not appear to be an everyday requirement – unless your job is writing and selling parsers, that is. It’s part of ours, as you can see at Parser Bench, where we showcase some of the multi-platform parsers that we’ve built. However, that’s not the case in most software companies.

Still, nowadays it’s quite common to have some logic replicated on the backend (e.g. coded in Java or C#) and the frontend (JavaScript). Think of a domain-specific language that is compiled on the backend, but that users may edit on the frontend with a web code editor – or even in a desktop application made with Electron, such as Visual Studio Code. The editor may use a parser to implement better language support, such as code completion or semantic checking.

If we generate the client-side parser from the same grammar used in the backend (or parts of it), we’ll ensure that the language is consistent between the editor and the compiler, and we’ll reduce effort and errors resulting from maintaining two unrelated parsers.

Now, not all ANTLR runtimes have the same level of quality and performance; also, an ANTLR grammar can contain language-specific elements (predicates and actions) that prevent it from being reused as-is to generate a parser in another language. And, anyway, we could implement the client-server scenario above by delegating all the “language intelligence” to the server – or, in the case of VSCode, to a separate “server” process communicating via the Language Server Protocol.

Still, that doesn’t come without drawbacks and added complexity. So, being able to reuse the entire grammar, or parts of it, in multiple environments is an appealing feature of ANTLR that JavaCC lacks.

ANTLR Has a More Active Community

Community in open source is also a matter of personal preference, since an important part of fitting in has to do with human relationships and affinity of values. Still, if we look at easily measurable parameters, ANTLR has more active committers, more commits, and a bigger ecosystem of supporting tools, libraries, and runtimes.

Also, ANTLR has a single reference implementation, plus a separate performance-oriented implementation that is a superset of it in terms of syntax and features.

In contrast, JavaCC’s development has been stale for many years. Nowadays, JavaCC has both a “legacy” implementation, which receives some limited maintenance, and a newer fork/implementation called JavaCC 21, which is more actively developed. However, at the time of writing, JavaCC 21 doesn’t seem to have gained a lot of traction. That’s probably at least in part due to its quite litigious author (we’ve been bitten ourselves, so beware should you inquire about JavaCC 21). So, choosing the “right” version of JavaCC today would require some study. That said, since we’re talking about conversion, the JavaCC version in use is a given, not something that we can choose.

Anyway, both tools are used in popular open-source libraries: for example, JavaCC in JavaParser and JSqlParser, and ANTLR in Hibernate 5+ and Groovy 3, just to name a few.

ANTLR Is Simpler

The grammar language used by ANTLR is definitely more concise and readable. Compare:

literal: INTEGER_LITERAL | LONG_LITERAL | FLOATING_POINT_LITERAL | CHARACTER_LITERAL | STRING_LITERAL | booleanLiteral | nullLiteral;

With the equivalent JavaCC:

Expression Literal():
{
  Expression ret;
}
{
 (
    <INTEGER_LITERAL> {
        ret = new IntegerLiteralExpr(tokenRange(), token.image);
    }
  |
    <LONG_LITERAL> {
        ret = new LongLiteralExpr(tokenRange(), token.image);
    }
  |
    <FLOATING_POINT_LITERAL> {
        ret = new DoubleLiteralExpr(tokenRange(), token.image);
    }
  |
    <CHARACTER_LITERAL> {
     ret = new CharLiteralExpr(tokenRange(), token.image.substring(1, token.image.length()-1));
    }
  |
    <STRING_LITERAL> {
     ret = new StringLiteralExpr(tokenRange(), token.image.substring(1, token.image.length()-1));
    }
  |
    ret = BooleanLiteral()
  |
    ret = NullLiteral()
 )
 { return ret; }
}

Also, ANTLR allows us to split a grammar over multiple files. That way, we can further reduce complexity by grouping related rules together while isolating them from unrelated rules.
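For example, with ANTLR’s import statement, a main grammar can pull in rules defined in other files (the grammar names here are made up):

grammar MyLang;
// rules from the imported grammars are merged into this one
import MyLangExpressions, MyLangLexerRules;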

To be fair, JavaCC 21 comes with a much leaner syntax, close to ANTLR’s, while still supporting the legacy syntax; so, it offers a more gradual migration path. It also supports grammar includes, to split a grammar across multiple files. However, it doesn’t appear to be used much in the wild yet.

Also, JavaCC generates self-contained parsers that need no external dependencies, while ANTLR requires a runtime support library. This could have been a valid concern in the past, but it’s unlikely to be significant today: the average application already includes dozens of library dependencies, and we have powerful tools to manage them.

ANTLR Is More Powerful

Technically speaking, ANTLR4 employs an adaptive LL(*) parsing algorithm, while JavaCC generates recursive-descent LL(1)/LL(k) parsers. So, JavaCC is roughly in the same league as ANTLR3, whereas ANTLR4 is strictly more powerful in that it can operate with unbounded, adaptive lookahead.

In practice, this means that JavaCC recognizes fewer languages than ANTLR4. Also, it requires manual interventions that ANTLR4 handles automatically, such as choosing the lookahead depth and refactoring left-recursive rules. The latter applies in practice to expressions in most languages, where the naive and intuitive description looks like the following:

expr: expr (PLUS | MINUS) expr | /* other cases omitted */ | variable | NUMBER;

That’s a left-recursive rule, and neither ANTLR3 nor JavaCC supports it; we’d have to rewrite the rule in a more convoluted way.
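For instance, we could restructure it along the following lines (a sketch; the helper rule primary is made up):

expr: primary ((PLUS | MINUS) primary)*; // repetition instead of left recursion
primary: /* other cases omitted */ variable | NUMBER;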

To be fair, these points mostly matter if we’re designing a new parser. When we’re dealing with a conversion, the JavaCC grammar already exists and works within the constraints of the parser generator, so the extra power of ANTLR4 doesn’t seem to matter much.

However, with ANTLR4 we could then refactor the grammar to be more in line with ANTLR’s idioms and possibilities. We may also gain the ability to evolve our language to support new constructs that would have been hard to parse with JavaCC.

On another note, ANTLR4 generates a parse tree out of the box, with no special instructions. In JavaCC, instead, we must use the JJTree preprocessor – another tool to learn, with its own syntax and idiosyncrasies. On the other hand, with JJTree one may construct an abstract syntax tree directly from the parser – as much as it’s feasible and convenient; that was the solution used in ANTLR3, but it was abandoned in ANTLR4. With ANTLR4, we have to transform the parse tree into an AST in a separate step. Fortunately, libraries such as Kolasu greatly help in that regard. Also, JavaCC 21 does not require JJTree (it supports the same tree-building syntax in the core tool).

Automatic Conversion

We can go quite far with automatic conversion from JavaCC to ANTLR. Indeed, our own Federico Tomassetti wrote a proof-of-concept implementation, in the context of evaluating the migration of JavaParser from JavaCC to ANTLR. Thus, Federico’s tool was tested primarily against the Java grammar used in JavaParser at the time.

Let’s take a brief look at how the tool works. The basic principle is:

  1. (Parsing) Read the text of the JavaCC grammar and parse it into an in-memory representation (an abstract syntax tree, or AST).
  2. (Code generation) Walk the AST to generate ANTLR4 code from it.

This is the minimal process needed to implement a source-to-source transformation, or transpiler. In the real world, a transpiler performs one or more tree-to-tree transformations (also called model-to-model transformations) before finally generating code; that breaks the generation process down into more manageable steps. It is, for example, the strategy that JetBrains MPS extensively employs for code generation.
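In Kotlin, the skeleton of the minimal two-step pipeline could look like the following sketch (loadJavaCCGrammar is shown in the next section, while generateAntlrGrammar is a hypothetical stand-in for the code-generation step):

import java.io.File

fun convertGrammar(javaCCGrammarFile: File): String {
    val grammar = loadJavaCCGrammar(javaCCGrammarFile) // 1. parsing
    return generateAntlrGrammar(grammar)               // 2. code generation
}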

Loading a JavaCC Grammar

We can use JavaCC itself to parse a JavaCC grammar. Of course, JavaCC (like ANTLR) needs that capability to do its job. ANTLR, however, is split into two components: the tool, which reads a grammar and generates a parser, and the runtime, which supports the execution of the generated parser. So, an application built with ANTLR usually can’t parse the ANTLR grammar language; to enable that, we need to include the ANTLR tool as a dependency. Of course, that’s only possible on the JVM, since the tool part of ANTLR is written in Java. JavaCC, instead, comes as a single monolithic library comprising both the tool and the runtime.
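For instance, with Gradle, the two ANTLR components correspond to separate artifacts, and we only need the full tool when the application itself has to read ANTLR grammars (the version number is merely illustrative):

dependencies {
    // the runtime is all that a generated parser needs in order to execute
    implementation("org.antlr:antlr4-runtime:4.13.1")
    // the full tool also contains the parser for the ANTLR grammar language itself
    implementation("org.antlr:antlr4:4.13.1")
}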

So, loading a JavaCC grammar amounts to the following code (in this case, Kotlin, but it would look similar in Java):

import java.io.File
import java.io.FileInputStream
// JavaCCParser, Options and JavaCCGlobals come from the JavaCC library itself

fun loadJavaCCGrammar(javaCCGrammarFile: File): JavaCCGrammar {
    val javaccParser = JavaCCParser(FileInputStream(javaCCGrammarFile))
    // JavaCC keeps its configuration and parsing results in static state
    Options.init()
    javaccParser.javacc_input()
    return JavaCCGrammar(JavaCCGlobals.rexprlist, JavaCCGlobals.bnfproductions)
}

Note that JavaCC uses antiquated practices, such as saving mutable data in static fields – something we know today to be a poor design choice in most cases. This is just one example of JavaCC’s aging codebase, which dates back to the earliest versions of Java (before generics, collections, and most of the things we take for granted today).

Code Generation

Superficially, there’s not much to say about code generation, either: most JavaCC concepts map to ANTLR concepts, with only differences in syntax.

One evident difference is that, in ANTLR, parser rules are distinguished from lexer rules by the case of the first character in their name: parserRule vs. LexerRule. By convention, lexer rules are typically written in all caps, as in “SELECT”. However, we can find grammars where only the first letter is capitalized, as in “Identifier”.

So, we just have to ensure that we properly capitalize the rule names from the JavaCC grammar. We’ve used the built-in String.capitalize method (since deprecated in favor of replaceFirstChar) for lexer rules and the following Kotlin function for parser rules:

private fun String.uncapitalize(): String {
   return if (this.isNotEmpty() && this[0].isUpperCase()) {
       this[0].toLowerCase() + this.substring(1)
   } else {
       this
   }
}
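For example, only the first character is lowercased, which matches ANTLR’s convention for parser rule names:

"Literal".uncapitalize()          // "literal"
"CompilationUnit".uncapitalize()  // "compilationUnit"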

Then, we have to generate the code for the rules themselves. Here, we straightforwardly map each JavaCC concept to its corresponding ANTLR syntax.

For example, we have the concept of a “choice” among several alternatives, which JavaCC represents with the class Choice extends Expansion (Expansion being the shared superclass of the elements that can go into a rule’s body, or expansion).

We convert a Choice into ANTLR grammar code simply by combining the choices with the pipe operator, “|”:

"(" + this.choices.joinToString(separator = " | ") { (it as Expansion).process(lexerDefinitions, namesToUncapitalize) } + ")"

Similarly, we translate ZeroOrMore nodes into an application of the “*” operator:

"(${this.expansion.process(lexerDefinitions, namesToUncapitalize)})*"

We can see how, in both examples, we recursively apply the generation process to the nested elements – the expansions, in this codebase’s terminology.

Similarly, we handle all the other concepts, such as OneOrMore (+), ZeroOrOne (?), etc.
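Putting it together, the recursive mapping could be condensed into a single function like the following sketch (a simplification: the actual tool uses a separate process function per node type and threads the lexer definitions and the names to uncapitalize through the recursion):

// Expansion, Choice, ZeroOrMore, OneOrMore and ZeroOrOne are the JavaCC tool's own AST classes
fun toAntlr(e: Expansion): String = when (e) {
    is Choice     -> "(" + e.choices.joinToString(" | ") { toAntlr(it as Expansion) } + ")"
    is ZeroOrMore -> "(${toAntlr(e.expansion)})*"
    is OneOrMore  -> "(${toAntlr(e.expansion)})+"
    is ZeroOrOne  -> "(${toAntlr(e.expansion)})?"
    else          -> TODO("handle the remaining Expansion subclasses")
}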

Case Sensitivity

A minor point to pay attention to is the case sensitivity of lexer rules.

JavaCC, like ANTLR3, is case-sensitive by default, but we can instruct it to ignore case when lexing. ANTLR4 doesn’t have that capability, but it supports a well-known method for dealing with case-insensitive tokens.

We may leverage ANTLR’s import statement to confine case-insensitive fragments to a file that we only import when needed. Then, it’s easy to automatically translate literal strings, for example, “foo”, into F O O. It’s less straightforward, but still possible, to translate character classes such as [a-f] into their case-insensitive counterparts.
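For example, the imported file would define one fragment per letter, and case-insensitive keywords would be composed from those fragments:

fragment F: [fF];
fragment O: [oO];

FOO: F O O; // matches "foo", "FOO", "fOo", etc.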

The Devil Is in the Details

Upon further examination, we find that fully automated translation is a much harder task than it seems, because of some semantic differences between JavaCC and ANTLR. One such example is the pattern for handling multi-line comments, such as, in Java:

/* this is a
multi-line comment */

Here’s a simplified extract of the Java grammar used in JavaParser:

MORE :
{
 <ENTER_MULTILINE_COMMENT: "/*"> : IN_MULTI_LINE_COMMENT
}

<IN_MULTI_LINE_COMMENT>
SPECIAL_TOKEN :
{
 <MULTI_LINE_COMMENT: "*/" > : DEFAULT
}

<IN_MULTI_LINE_COMMENT>
MORE :
{
 <COMMENT_CONTENT: ~[] >
}

We can see how, in JavaCC, we can use the “more” action to start building a token, change lexical state, and continue lexing. Then, when the comment ends, we mark the whole token, accumulated so far, as “special”. In JavaCC, the parser ignores “special” tokens. The equivalent concept in ANTLR is lexer modes; you may want to read The ANTLR Mega Tutorial for that and other advanced concepts.

In ANTLR, we cannot use that pattern (MORE + SPECIAL_TOKEN), because we can’t combine “more” with actions that skip the token or change its channel to make it invisible to the parser. Therefore, we cannot translate the rules above as they are, one by one. We ought to recognize the semantics of the rules – what the author wanted to accomplish – and rewrite that in ANTLR style:

MULTILINE_COMMENT_START: '/*' -> pushMode(multiLineComment), channel(HIDDEN);
mode multiLineComment;
MULTILINE_COMMENT_END: '*/' -> channel(HIDDEN), popMode;
MULTILINE_COMMENT_CONTENT: . -> channel(HIDDEN);

Notice how, both in ANTLR and in JavaCC, the rule for the end of the comment comes before the rule for its content; otherwise, the “catch-all” comment content rule could match before the end-comment rule.

Then, we can rewrite the above rules in a simpler way, which is also closer to the JavaCC semantics in that it produces a single token per comment. In fact, in ANTLR we don’t need to use lexer modes if we use a non-greedy operator:

MULTILINE_COMMENT: ('/*' .*? '*/') -> channel(HIDDEN);

Here, the content of the comment consumes any character, but only up to the next star-slash sequence.

That said, algorithmically recognizing the above JavaCC pattern for comments is not simple, and indeed the proof-of-concept converter doesn’t try to do it. Instead, it uses a heuristic tailored to JavaParser’s “java.jj” grammar: if a rule contains “comment” in its name, the generated lexer grammar will skip it. However, this doesn’t work for all JavaCC grammars; it’s just an accident of how that particular Java grammar is designed.
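The heuristic boils down to a name check along these lines (a simplified sketch, not the tool’s literal code):

// skip the rule in the generated lexer if it looks like a comment rule
fun looksLikeCommentRule(ruleName: String): Boolean =
    ruleName.contains("comment", ignoreCase = true)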

A more robust approach could be to bail out entirely and output a comment or warning urging the developer to translate those rules by hand.

In conclusion, fully automated translation is a hard problem, but we can go quite far with partially automated translation followed by human intervention. As a reference, the java.jj grammar is ~2800 lines long, while the problematic comment-handling part is just 30 lines, or slightly more than 1% of the file. So, it makes sense to automatically translate the other 99% of the grammar!

The AST

We’ve repeated this in various other articles: a parser, by itself, is useless. It’s only when we consume the output of the parser that we extract value from it. So, the integration of the parser with the application or library using it is a crucial point to discuss.

The output of a parser is a tree. We call it either “parse tree” or “abstract syntax tree” (AST) according to a fuzzy measure of how close the tree is to the structure of the grammar versus the abstract concepts that define the language.

A parser built with JavaCC may use the JJTree tool to build a tree according to some rules, or it may include actions, written in Java by a developer, that imperatively build the tree. In either case, users of the legacy parser will have built their code against a certain API, which includes the classes that make up the nodes and leaves of the tree, and the methods to traverse them.

ANTLR (version 4), instead, doesn’t give the developer any leeway to control how the tree is built: it builds a parse tree with a 1:1 correspondence to the rules of the grammar. Generally, we advise against using the parse tree directly; instead, we suggest transforming it into an AST before further processing, for reasons that we won’t discuss here for brevity.

In the case of a conversion from JavaCC, we need to decide on a strategy for the AST:

  1. Should we transform the parse tree produced by ANTLR into the tree that the existing code expects?
  2. Or, should we write a new AST and break the API, forcing consumers to adapt to the new version?

There’s no universally superior answer. In general, transforming the parse tree into an existing tree structure and API is not rocket science, and it’s not markedly different from a transformation into a new AST. It may involve multiple passes and transformations, but we can find plenty of literature and examples to draw inspiration from.
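As a sketch of the first strategy, we could walk the ANTLR parse tree with a generated visitor and build the nodes of the legacy tree; all the names below (grammar, rule, token types, and AST classes) are hypothetical:

// builds legacy Expression nodes out of the ANTLR parse tree
class LegacyAstBuilder : MyLangBaseVisitor<Expression>() {
    override fun visitLiteral(ctx: MyLangParser.LiteralContext): Expression =
        when {
            ctx.INTEGER_LITERAL() != null -> IntegerLiteralExpr(ctx.INTEGER_LITERAL().text)
            ctx.STRING_LITERAL() != null  -> StringLiteralExpr(ctx.STRING_LITERAL().text.removeSurrounding("\""))
            else -> TODO("other alternatives omitted")
        }
}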

However, maybe the conversion of the grammar is part of a greater modernization effort aimed at reducing technical debt. In that context, it may pay off to rewrite the AST with more modern practices and tools (such as Kolasu). Of course, such a decision depends on the project and the goals of the migration.

Note that we may still use some methods from Kolasu with the legacy tree, for traversal and other purposes.

Conclusions

JavaCC and ANTLR use very different syntaxes, but the underlying concepts map quite nicely one-to-one between the two tools, with some exceptions that are nontrivial to handle algorithmically. So, we can translate from JavaCC to ANTLR with a mostly automated process, where some manual intervention by a human developer may still be necessary.

You can find the proof-of-concept translator on GitHub and hack on it until it suits your needs. Please keep in mind that Federico didn’t develop it for this article; it’s part of an earlier effort to migrate JavaParser to ANTLR. However, we’ve checked that, at the time of writing, it still builds on recent JDK versions.

The migration of parsers to ANTLR from older or less used parser generators (including previous ANTLR versions) is one of the services Strumenta provides to its customers, as part of our Legacy Modernization offerings.

Read more:

If you want to learn how to use ANTLR, you can read our article The ANTLR Mega Tutorial.