ANTLR is a popular parser generator or “compiler-compiler”, developed by Professor Terence Parr and several contributors. It’s been around since 1992, as an evolution of PCCTS, and it’s gone through multiple major versions. The latest incarnation of ANTLR is the 4.x branch, the first release of which is from 2013.

Here at Strumenta, we use ANTLR4 whenever possible, to develop parsers and build editors, interpreters, transpilers, and everything else that we do with programming languages and DSLs. Typically, we start our projects from scratch. However, sometimes we’re tasked with performing a migration from an older version of ANTLR.

So, in this tutorial, we’ll give some guidelines on migrating from an ANTLR2 grammar to an ANTLR4 grammar. Typically, a migration takes weeks to complete. It’s a mostly manual, tricky process, and without sticking to some good principles, it can have a less than satisfactory outcome.

The ideal target audience for this piece is a developer with a working knowledge of some version of ANTLR, or at least, of a parser generator.

Why Migrate from ANTLR2?

Sometimes, we don’t have the luxury to start from scratch. We may have to work on legacy code, to fix it, to evolve it, to adapt it to new scenarios. This includes grammars written for previous versions of ANTLR.

There’s a sound engineering principle that goes like this: if it ain’t broke, don’t fix it. There’s no point in migrating code that works and does what it’s supposed to do, just to use the shiny new version of our favorite tool.

However, when the old code is an unmaintainable mess and we’re tasked with adding significant new features, migrating to the latest version of ANTLR might be the right solution for the customer, in terms of development costs and maintainability. This is the case because of several features introduced with ANTLR v4.

Also, ANTLR2 supports only a few target platforms: Java, C#, and C++. If we need to port our grammar to a different target that ANTLR2 doesn’t support, short of rewriting it, our only option is to migrate the grammar to a newer ANTLR version that can target our chosen platform.

ANTLR4 Features and Differences

ANTLR4 has many features that ANTLR2 lacks, and that we may enjoy after a migration. Let’s look at some of them.

ANTLR4 accepts left-recursive grammars, so it’s easier to write certain kinds of rules more intuitively, for example:

expr: expr '+' expr | expr '*' expr | NUMBER;

Here, the operator precedence results from the ordering of the alternatives in the production.

In ANTLR2, we would have had to split the rule:

expr: addend ('+' addend)*;
addend: factor ('*' factor)*;
factor: NUMBER;

ANTLR4 parsers employ the adaptive LL(*) algorithm; in practice, this means that it’s not necessary to fix a value for k in LL(k) in advance, k being the upper bound to the parser’s lookahead – roughly, how many tokens the parser will examine in order to decide which path to take. ANTLR4 has no such upper bound, rather, it adjusts it dynamically; also, being adaptive, the parser warms up at runtime with each use, reaching a better peak performance than its predecessors that used static analysis.

ANTLR4 no longer builds an abstract syntax tree (AST), thus it has no syntax for tree construction and manipulation in the grammar. Instead, the output of an ANTLR4 parser is a parse tree, which is closer to the concrete syntax. Additionally, we can instruct ANTLR4 to generate a visitor or a listener, with which we can traverse the parse tree and build whatever data structure we may need for further processing.

It might look like a feature has been lost, but it’s actually a change in paradigm. ANTLR4 grammars are simpler and the AST construction instructions are moved into code in a general-purpose programming language. The downside is that such code has to be written, but we’ll offer ideas and patterns for that, too.

What about ANTLR3?

In this tutorial, we’ll focus on migrating ANTLR2 grammars and abstract syntax trees to ANTLR4; we won’t touch on ANTLR3. This is both because there’s more to say about migration from 2 to 4 and because if we’re migrating now, it makes no sense to migrate to ANTLR3, which hasn’t seen any new development in many years.

You can still benefit from this tutorial if you need to migrate to or from ANTLR3:

  • ANTLR3 syntax is closer to 4 than to 2;
  • Migrating the construction of the AST from 3 to 4 is similar in principle, although the tree construction syntax is different from ANTLR2;
  • In general, ANTLR3 semantics are closer to 2 than to 4.

For migrating from ANTLR2 to ANTLR3, you’ll also want to read this guide by Terence Parr himself.

What do I need?

For starters, we’ll need an ANTLR2 grammar. Throughout the tutorial, we’ll be including snippets from actual grammar code, but a concrete example at hand is invaluable.

We should also ensure that we’re able to build a parser from the grammar. This might seem unnecessary or obvious, but by looking at the code that ANTLR2 generates, we can reconstruct the same behavior with ANTLR4, particularly with regard to AST building and manipulation, if our grammar makes use of it.

Then, we’ll need a working ANTLR4 installation. For that, please refer to the Setup section of our ANTLR Mega Tutorial.

The Process

Let’s now look at the process that we’ll follow to implement our migration. We’ll perform the following steps:

  1. Rewrite the grammar, one rule at a time, removing all AST construction logic if present.
  2. Generate the parser and a visitor.
  3. If the consuming code needs an AST, write a visitor that builds an AST from a parse tree.
  4. If the ANTLR2 grammar included semantic actions, write a visitor that runs the actions, either from the parse tree or from the abstract syntax tree.

Let’s now examine the above points one by one in more detail.

Rewriting Grammar Rules

ANTLR2 grammar files have a .g extension, while ANTLR4 uses .g4. As a first step, we’ll create a .g4 file for each grammar and proceed to migrate each rule. We can follow the same order as the original grammar, but it’s probably better to go bottom-up from the lexer rules so as to have a runnable and testable parser at every step of the process, even if it will only recognize a progressively larger subset of the language.

In general, the formats of ANTLR2 grammars and ANTLR4 grammars are not terribly different, as the syntax is based on similar variants of an EBNF notation. Following the guide for migrating from 2 to 3 will take you quite far already, as ANTLR3 is closer to 4. Let’s see some highlights.

Grammar Declaration and Setup

In ANTLR4, we no longer declare a class that extends a Lexer or Parser, like so: class SQLParser extends Parser;. Instead, we declare a grammar like so: grammar SQLParser;. ANTLR4 will differentiate between parser rules and lexer rules by looking at their case: someRule is a parser rule, while SOME_TOKEN is a lexer rule. It’s also possible (and sometimes necessary) to split a grammar into multiple files.
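For instance, a minimal combined grammar (the rule names here are purely illustrative) might look like this in ANTLR4, with the case of the initial letter distinguishing parser rules from lexer rules:

```antlr
grammar SQLParser;

// lowercase initial letter: a parser rule
statement: SELECT IDENTIFIER;

// uppercase: lexer rules
SELECT: 'SELECT';
IDENTIFIER: [a-zA-Z_]+;
WS: [ \t\r\n]+ -> skip;
```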

Also, the declaration of additional code to include in the generated lexer/parser changes, from header { … } to @header { … } for the file header, and from { … } to @members { … } for class members. For a more thorough treatment, look into the ANTLR4 documentation. If our grammar has no header or member sections, we can ignore this part.
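As a sketch (the package name and field are invented, and we’re assuming a Java target), the migration of these sections might look like:

```antlr
// ANTLR2
header {
  package com.example.parser;
}

// ANTLR4 equivalent
@header {
  package com.example.parser;
}

@members {
  // fields and methods copied into the generated parser class
  private int nestingLevel = 0;
}
```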

Skipping tokens

In lexer rules, to skip tokens – what in ANTLR2 we may have written as $setType(Token.SKIP); – we’ll now write as -> channel(HIDDEN) or -> skip at the end of the rule, for example:

WS: (' ' | '\t') -> channel(HIDDEN);

Note the mandatory parentheses in case of rules with several alternatives.

We’ve touched on the difference between the skip action and the channel action in another article.

Similarly, to change the type of a token, we’ll use -> type(SOME_TYPE) at the end of a rule rather than { $setType(SOME_TYPE); }.
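For example (the token names here are invented), a rule that re-types what it matches might read like the following; note that the target token type has to be declared in a tokens { } section:

```antlr
tokens { BOOL_LITERAL }

TRUE_LITERAL: 'true' -> type(BOOL_LITERAL);
```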

Options

Many options are no longer necessary and/or no longer exist in ANTLR4; both parser-level options and rule-level options. Some examples:

  • caseSensitive = false is no longer supported; instead, we’ll define case-insensitive tokens as explained in depth in the ANTLR documentation.
  • k = <some number> is unnecessary as ANTLR4 is adaptive LL(*).
  • caseSensitiveLiterals = false is no longer supported either, for the same reason as caseSensitive.
  • greedy = false is replaced by non-greedy operators (e.g., .*?).

Note that specifying an option which is not supported is not an error: it will be silently ignored.
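As a sketch of the fragment-based pattern described in the ANTLR documentation (the rule names are illustrative), each letter of a case-insensitive keyword becomes a fragment matching both cases:

```antlr
NOT: N O T;

fragment N: [nN];
fragment O: [oO];
fragment T: [tT];
```

More recent ANTLR4 releases (4.10 and later) also accept a grammar-level options { caseInsensitive = true; }, which makes this pattern unnecessary.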

Other Minor Changes

A few other changes from 2 to 4 include:

  • Syntactic predicates (=>) are no longer needed as ANTLR4 is LL(*). They were used to work around LL(k) limitations in ANTLR2, and we should just eliminate them.
  • To label an element of a grammar production we’ll use the equal sign (=) instead of the colon (:), for example: expr: expr operator=(PLUS|MINUS) expr;
  • The range operator (..) has been replaced by regular expression-like character classes. So, we’ll rewrite 'a'..'z' as [a-z].
  • Literal tokens are surrounded by single quotes (') and not double quotes (").
  • We’ll replace protected lexer rules with fragment rules.
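Putting a couple of these together, an ANTLR2 lexer rule pair such as (illustrative):

```antlr
// ANTLR2
protected DIGIT: '0'..'9';
NUMBER: (DIGIT)+;
```

would be rewritten in ANTLR4 as:

```antlr
// ANTLR4
fragment DIGIT: [0-9];
NUMBER: DIGIT+;
```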

These are not the only differences, but they’re the ones that we’re most likely to encounter in our migration. For the grammar syntax that we haven’t covered in this tutorial, you can refer to the ANTLR2 meta-language specification and to the ANTLR4 documentation on GitHub.

Semantic Predicates and Actions

Semantic predicates are tests that guard a parsing decision. We’ve talked about them in our ANTLR Mega Tutorial. For example, let’s consider the following rule:

ID: [a-zA-Z]+ {isValidIdentifier()}?;

It will only match identifiers that satisfy the predicate, isValidIdentifier, that we’ll have written.

Semantic actions are code snippets associated with a rule, or portion of a rule, that ANTLR will embed in the generated parser or lexer. For example, the lexer grammar for ANTLR4’s own meta-language contains the rule:

ACTION: '}' {
    popMode();
    if ( _modeStack.size()>0 ) more(); // keep scarfing until outermost }
};

After matching a closing curly brace, ANTLR will execute the code in the action – in this case, popping out of the mode stack until there are no more closing braces.

As we can see, both actions and predicates are written in the target language (e.g., Java), as opposed to the EBNF-like ANTLR grammar specification meta-language. So, they depend on the structure of the generated code: which class the parser extends, for example. Thus, when migrating a grammar, most probably we’ll have to rewrite them, and that’s a completely manual process, that we cannot detail here.

In general, we’ve noticed that legacy grammars, especially if written by people without much expertise in language implementation techniques, tend to be heavy in embedded code (actions and predicates). This makes them hard to evolve – and hard to migrate as well. Thus, when migration is necessary, it’s a good idea to contextually move code as much as possible out of the grammar, into a listener or visitor. This will make the grammar easier to maintain after the migration.

This is particularly difficult when a semantic action alters the behavior of the parser (say, by consuming tokens in a loop under certain conditions). In a migration, such code will need to be rewritten, while maintaining the same logic as much as possible. It’s preferable to move such logic into the grammar if it’s possible because the grammar is declarative and has a higher level of abstraction than source code in a general-purpose programming language. However, if it’s not possible to express the same logic in the ANTLR meta-language, we can also simplify the grammar to only recognize a lump of text (e.g., delimited by some known tokens), that we’ll process further in a later stage. So, when assessing a migration and estimating the effort that will be needed, we have to pay special attention to actions that interact with the parser or the lexer themselves, as those will probably require the most of the manual work.

A Note About Error Handling

In ANTLR2, we have the option defaultErrorHandler = falseto disable default error handling, typically to install our own error handlers programmatically or by specifying them in the grammar.

ANTLR4 doesn’t have that option as part of the grammar definition language; instead, we can register and deregister listeners and error handlers on the generated lexer and parser. We still have the option to add a catch clause to our rules. We’ve talked about error handling in other articles, for example, the ANTLR Mega Tutorial.

AST Building Instructions in the Grammar

Unless we’ve turned off the buildAST option, the output of an ANTLR2 parser is an abstract syntax tree. ANTLR2 has a default strategy to build the AST for you without any specific instruction. However, a grammar author can instruct it to build an AST with a different shape. For that purpose, ANTLR2 includes syntax for describing these kinds of tree manipulations. This syntax includes:

  • The AST root postfix operator, ^
  • The AST exclude postfix operator, !
  • Semantic actions manipulating the AST with an ad-hoc language. We can recognize those by the use of the sharp character (#) and square brackets ([]). For example: { #decl = #([DECL,"decl"], #decl); }

All the information about tree construction operators is available in ANTLR2’s documentation. We’ll give a quick overview in a later section.

When migrating a grammar, we’ll completely remove these instructions, as ANTLR4 doesn’t support them (and doesn’t even build an AST by itself). Instead, we’ll build the AST with a visitor (or listener). We’ll return to the topic of AST construction soon.

Generating Lexer, Parser, and Visitor

Once we’ve migrated some rules, we’ll want to generate a parser and test it. Here, we won’t go into detail about this part, as it’s documented elsewhere, for example in our ANTLR Mega Tutorial.

The output of the generation process will be:

  • A lexer class. This produces a stream of tokens from a stream of characters.
  • A parser class. This produces a parse tree from a stream of tokens.
  • A listener and/or a visitor, according to the options we’ve fed to ANTLR. These we can extend to attach custom logic to a parse tree, for example, to further transform it into an AST. In this tutorial, we’ve chosen to show a visitor that builds an AST.

Depending on the target language, we may have additional tools at our disposal. For example, with Java lexers and parsers, we can use TestRig/grun, a tool for testing and visualizing the grammar. Or, if our target is TypeScript and we’re writing an editor, we can use antlr4-c3 for code completion (this might be the topic for another article), and so on.

Building an Abstract Syntax Tree

A parser is only useful as the first stage in a process; for example, interpretation, syntax highlighting, source-to-source translation, code completion, compilation, etc.

Usually, subsequent phases benefit from using an abstract syntax tree, which removes some of the complexity of the grammar and is closer to the domain of the language. And, if we’re migrating from ANTLR2, most probably we already have generated code in place that builds the AST, and hand-written code that consumes it.

Differences Between AST and Parse Tree

Often, it’s hard to explain the difference between a parse tree and an AST, as they’re both tree structures that represent elements of a language. So the difference really lies in how we intend to use the tree, rather than in its contents.

A parse tree closely mirrors the grammar. For example, for a production like:

ifExpr: IF expr THEN expr (ELSE expr)?;

The parse tree will have a node for each element:

ifExpr
  IF
  expr
    ...
  THEN
  expr
    ...
  ELSE
  expr
    ...

If we were to write an AST for the same if construct, we might, for example, omit unnecessary tokens, and only store the logical information that is important for analyzing or transforming the code:

ifExpr
  condExpr
    ...
  thenExpr
    ...
  elseExpr
    ...

But it’s not just that. In the AST, we could also merge different notations for the same concept (syntax sugar) into a single tree representation. For example, when parsing JavaScript, we may want to represent both foo.bar and foo['bar'] as a PropertyAccess(foo, 'bar') node in the AST – even though their parse trees will be different.

Also, with an AST we can eliminate some artifacts of the grammar, e.g., that we’ve split a rule in two for clarity. We can refactor the grammar without impact on the code that consumes the AST because the AST doesn’t have to reflect the structure of the grammar like a parse tree does.

Of course, we could just use the parse tree as our AST, without further transformations. However, it’s good programming practice to structure code into multiple layers of abstraction, getting progressively closer to the program domain (and the end-user).

Building an AST in ANTLR4

As we’ve previously mentioned, ANTLR4 no longer builds an AST for you directly from the grammar. We’ll have to write code to do that. We have two options:

  1. Produce the very same AST that the ANTLR2 parser produced: same classes, same structure.
  2. Write an AST from scratch.

Both options have pros and cons and it depends on our project which one can have the best outcome.

Option 1. has the lowest impact on the existing code, however, the AST might have accumulated cruft and technical debt we’ll still be carrying after the migration. Also, depending on the ANTLR target, there may be technical obstacles in mixing ANTLR4 parser runtime code with ANTLR2 AST code. We’ve verified that it’s possible to do so with the Java runtime and the C++ runtime because ANTLR2 and ANTLR4 use different packages/namespaces.

On the other hand, option 2. requires changing the consuming code, which might even not be under our control. That may pose integration issues. However, it also allows us to use more modern technology and to improve the design of our AST, eliminating all technical debt as part of the migration. Also, depending on the complexity of the code consuming the AST, this option could be more straightforward, as it doesn’t rely on mixing different versions of the ANTLR runtime.

Migrating an AST: a Case Study in C++

Recently we’ve been tasked with the migration of several grammars from ANTLR2 to ANTLR4. The generated parsers are integrated into a larger C++ software solution. The parsers all produce customized ASTs.

In the process, we’ve learned some patterns for migrating AST construction logic, and we’ll now share them.

ANTLR2 builds the AST while it parses the input. In ANTLR4, instead, we’ll build the AST from the parse tree resulting from the parser, using a visitor (or listener; there’s no substantial difference between the two approaches).

Regular AST Construction

When given no specific instructions, ANTLR2 will build an AST for each rule as follows:

  • Start with an empty AST, that we’ll call T.
  • For each matched element of the rule:
    • If it’s a lexer rule (token), create a terminal node and add it to T.
    • If it’s a parser rule, recursively build an AST for it and add it to T.

From the above, it follows that an ANTLR2 AST by default is made only of nodes corresponding to the matched tokens (lexer rules), and not parser rules. However, as we’ll see, it’s possible to manipulate the construction of the tree to introduce synthetic nodes.

When we write, “add it to T”, we mean: if the AST is empty, the added node becomes the root; otherwise, it becomes the last child of the root. For this operation, the C++ ANTLR2 runtime uses a type called ASTPair, which holds a reference to a root node and a reference to the last of its children.

In the C++ runtime, the AST is made of objects that use a reference counting scheme for managing memory and ownership. Without going into much detail, there’s a built-in, fixed hierarchy of AST classes and a parameterized ASTRefCount<T> reference class to wrap them.

We’re not supposed to instantiate these directly; instead, ANTLR2 uses an ASTFactory class, and we’ll use that as well in our code.

Similarly, for tokens, the ANTLR2 C++ runtime defines a hierarchy culminating in the class, CommonToken, and wraps them into reference-counted RefToken instances.

Users can also define new AST node subtypes, by extending the built-in AST classes and providing an AST factory that returns the custom node types. Then, they can set the default node type for the grammar:

options {
  ASTLabelType = "name of the AST class to use";
}

As well as registering a specific node type for a certain token type:

tokens {
  SOME_TOKEN<AST=SomeClass>;
}

Note that, for the C++ runtime, the ASTLabelType is a reference-counting type, such as RefMyAST in:

typedef ANTLR_USE_NAMESPACE(antlr)ASTRefCount<MyAST> RefMyAST;

The AST class in the tokens section, instead, is a class in the AST hierarchy, such as:

class ANTLR_API SOME_TOKEN_Node : public MyAST

ANTLR2 will then use the corresponding RefSOME_TOKEN_Node type, which we’ll have defined like so:

typedef ANTLR_USE_NAMESPACE(antlr)ASTRefCount<SOME_TOKEN_Node> RefSOME_TOKEN_Node;

Looking at the code that ANTLR2 generates will solve any remaining doubts.

AST Building Code

So, simplifying it a bit, the code that ANTLR2 generates for normal AST construction follows a pattern similar to the following:

antlr::ASTPair currentAST;
RefAST returnAST = antlr::nullAST;

//For parser rules
returnAST = match_some_rule();

//or, alternatively, for tokens. LT(1) is “the next token to be consumed”.
returnAST = astFactory.create(LT(1));
match(SOME_TOKEN);

//Add the node to the AST
astFactory.addASTChild(currentAST, returnAST);

//Match other rules or tokens, according to the grammar
//...

//Then, finally:
return currentAST.root;

The actual generated code is more complicated, but this gives the idea.

Handling Tree Manipulation Instructions in the Grammar

As we’ve seen, ANTLR2 grammars can include instructions to control the structure and shape of the AST. We’ll now look into how those translate to code so that we’ll be able to replicate them in our listener or visitor.

The easiest operator is the AST ignore operator, !. This marks a token or subtree for exclusion from the AST. For example:

functionCall: IDENTIFIER OPEN_PAREN! (argument (COMMA! argument)*)? CLOSE_PAREN!;

The AST won’t contain any nodes for the open and close paren and for the commas that separate the arguments. In the AST building code, ANTLR2 will simply omit the call to addASTChild for those nodes. The tree for the expression f(a, 1, "s") could look like the following:

IDENTIFIER(f)
    IDENTIFIER(a)
    NUMBER(1)
    STRING(s)

Then, let’s examine the AST root operator, ^. This applies to the preceding rule or token and makes it the root of the AST. Compare:

sum: NUMBER PLUS NUMBER;
Here, 1 + 2 parses into the AST: (NUMBER(1) PLUS NUMBER(2))

with:

sum: NUMBER PLUS^ NUMBER;
Here, 1 + 2 parses into the AST: (PLUS NUMBER(1) NUMBER(2))

In the generated parser, this translates to instructions similar to the following:

plusAST = astFactory.create(LT(1));
astFactory.makeASTRoot(currentAST, plusAST);

instead of the usual addASTChild call.

Finally, the most complex instructions are those that take full control of node creation, like so:

someRule: A B C { #someRule = #([NODE_LABEL, "someText"], #someRule); };

This will create an AST with a synthetic root node labeled NODE_LABEL with text “someText”. The above translates to C++ code similar to the following:

someRule_AST = RefAST(currentAST.root);
someRule_AST = RefAST(astFactory.make((new antlr::ASTArray(2))->add(astFactory.create(NODE_LABEL, "someText"))->add(someRule_AST)));

In general, we’ll want to consult the generated code when we find instructions of this kind, as they’re too generic for us to identify a pattern.

A Sketch of a Visitor in C++

We’ll now sketch an ANTLR4 visitor to produce the AST, using ANTLR2 classes. However, we’ll omit some of the details.

In particular, the ANTLR4 runtime uses an opaque, dynamically typed Any type, and converting back and forth is tedious, so we’ll not show that in the code.

The visitor class generated by ANTLR4 takes care of processing each node, by visiting its children. When we want to add our own processing code, we’ll override the method for the corresponding grammar rule, like so:

antlrcpp::Any ASTBuilderVisitor::visitSomeGrammarRule(MyParser::SomeGrammarRuleContext *ctx) {
    doMyStuff(ctx);
    return visitChildren(ctx);
}

visitChildren is a built-in method that traverses the child nodes of some parse tree node and recursively invokes the visitor. Since visitor methods may return some value – in our case, a portion of an AST – ANTLR4 has some built-in logic, that we can override, to combine the return values for each child node into an aggregate return value for the parent node.

However, this built-in logic does not fit well with the ANTLR2 tree building code, so we’ll just replace the built-in visitChildren like so:

antlrcpp::Any ASTBuilderVisitor::visitChildren(antlr4::tree::ParseTree *node) {
    antlr::ASTPair ast;
    return visitChildren(ast, node);
}

antlrcpp::Any ASTBuilderVisitor::visitChildren(antlr::ASTPair &ast, antlr4::tree::ParseTree *node) {
    //Iterate through children, and then
    return ast.root;
}

Regular AST Construction

Now, for iterating through the children in the default case (when no instructions are specified in the grammar) we could write a loop like the following:

size_t n = node->children.size();
for (size_t i = 0; i < n; i++) {
    auto child = node->children[i];
    RefAST childAST = visit(child);
    astFactory->addASTChild(ast, childAST);
}

Remember that we’re omitting all the dance to convert back and forth to ANTLR4’s Any class. The actual code is more complex, but this is the core idea.

Note the reference to astFactory in the code. We’ve chosen to declare the factory as a field of the visitor:

protected:
    ASTFactory* astFactory;
    explicit ASTBuilderVisitor(ASTFactory *astFactory) : astFactory(astFactory) {}

So, we’ll provide a new ASTFactory() when we create it. But other choices are possible, of course.

Ignored Nodes

Let’s now see how we might handle ignored nodes. Our loop blindly adds all child nodes to the AST, but in the ANTLR2 grammar, some nodes could be excluded using the bang (!) operator.

So, we’ll keep a set of ignored nodes:

protected:
    set<tree::ParseTree*> ignore;
    //other fields...

Then, in the loop, we’ll filter ignored nodes:

auto ignored = ignore; //Copy the set, so that we don't interfere with recursive calls to visitChildren
ignore.clear();
for (size_t i = 0; i < n; i++) {
    auto child = node->children[i];
    RefAST childAST = visit(child);
    bool addToAST = ignored.empty() || (ignored.find(child) == ignored.end());
    if (addToAST) {
        astFactory->addASTChild(ast, childAST);
    }
}

As we can see, we clear the set, because the visitor calls visitChildren recursively. Also, we visit the child no matter what, even if we’ll throw the resulting AST away. This is consistent with what ANTLR2 does and allows us to perform side-effects when visiting the tree, independently of AST construction. That said, if such side-effects are numerous and unrelated with the AST, we should consider writing a separate visitor that performs them, either directly from the parse tree or in a subsequent phase, by further processing the AST.

Finally, we’ll provide a method to ignore a given parse tree node:

void ASTBuilderVisitor::astIgnore(tree::ParseTree *node) {
    if (node != nullptr) {
        ignore.insert(node);
    }
}

Root Nodes

Similarly, we’ll want to be able to set some nodes as the root of a given subtree – what the ANTLR2 grammar accomplished with the ^ operator.

So, we’ll add a new field to the visitor:

tree::ParseTree* root = nullptr;

And we’ll update the loop to handle that as well:

auto ignored = ignore;
auto theRoot = root; //Again, we copy then null this for recursive calls
ignore.clear();
root = nullptr;
for (size_t i = 0; i < n; i++) {
    auto child = node->children[i];
    RefAST childAST = visit(child);
    if (theRoot == child) {
        astFactory->makeASTRoot(ast, childAST);
    } else {
        bool addToAST = ignored.empty() || (ignored.find(child) == ignored.end());
        if (addToAST) {
            astFactory->addASTChild(ast, childAST);
        }
    }
}

As before, to abstract the node-visiting code from our implementation details, we’ll provide a utility method to mark some node as the root:

void ASTBuilderVisitor::astRoot(tree::ParseTree *node) {
    root = node;
}

Custom Tree Manipulation Code

With the above, we’ve covered the two operators to manipulate the AST by annotating grammar elements.

For other kinds of tree manipulations, written in semantic actions, there isn’t a general solution. We can look at the generated parser and reproduce the logic in our visitor, using the building blocks that we’ve shown so far.

Writing an AST From Scratch

We may decide instead to avoid using the ANTLR2 classes altogether and to rewrite the AST from scratch using another technology.

We don’t have much to say here: the implementation of the tree, and the mapping to it from the ANTLR4 parse tree, is for its developers to choose. However, we can suggest that you evaluate our open-source library, Kolasu, if you’re writing your AST in Java or Kotlin.
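As an illustration only (the class and function names are invented, and no ANTLR types are involved), a hand-written C++ AST for the arithmetic grammar shown earlier might start out like this; a visitor would then construct these nodes instead of ANTLR2’s RefAST objects:

```cpp
#include <memory>
#include <string>
#include <utility>

// A minimal, hand-written AST node hierarchy, independent of any ANTLR runtime.
struct AstNode {
    virtual ~AstNode() = default;
};

struct NumberLiteral : AstNode {
    double value;
    explicit NumberLiteral(double v) : value(v) {}
};

struct BinaryExpr : AstNode {
    std::string op;
    std::shared_ptr<AstNode> left;
    std::shared_ptr<AstNode> right;
    BinaryExpr(std::string op, std::shared_ptr<AstNode> l, std::shared_ptr<AstNode> r)
        : op(std::move(op)), left(std::move(l)), right(std::move(r)) {}
};

// In a real migration, a visitor method like visitExpr would build these nodes
// from the parse tree context; here we just construct the tree for 1 + 2 * 3 by hand.
std::shared_ptr<BinaryExpr> buildSample() {
    auto mul = std::make_shared<BinaryExpr>("*",
        std::make_shared<NumberLiteral>(2),
        std::make_shared<NumberLiteral>(3));
    return std::make_shared<BinaryExpr>("+",
        std::make_shared<NumberLiteral>(1), mul);
}
```

Smart pointers replace ANTLR2’s reference-counting scheme, and the node types carry only the information that the consuming code actually needs.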

Conclusions

In this tutorial, we’ve given some guidelines and examples that can help us perform a successful migration from ANTLR2 to ANTLR4.

As we’ve seen, migration is a complex process, but we can build some utilities that help us in common cases. We’ll still have to resort to manual translation of code (Java, C++, C#, etc.) for the less common cases.

Read more:

If you want to understand how to use ANTLR you can read our article The ANTLR Mega Tutorial.