In this article we are going to talk about a little known, but quite useful feature of ANTLR: pattern matching. It is a feature that allows you to find nodes that follow a specified pattern. This makes it easier to do stuff like collecting all static methods inside a class declaration.

What is ANTLR Pattern Matching

ANTLR offers visitors and listeners to execute some action when you encounter a node of a certain type, for example a node of type VariableDeclaration. This works great when you need to perform the same operation every time. However, sometimes you need to find patterns in the tree rather than individual nodes.

For example, you need to do something whenever you find a method inside a class declaration. You might also want to do something for some concrete case, for instance whenever you find all assignments to 0. This is where pattern matching shines. You just ask to find all instances matching a specific pattern, and you get them.

You could also use pattern matching in place of visitors or listeners, for example to find all assignments. Instead of executing an action whenever you match a node of type Assignment, you ask to match all assignment patterns. You may opt for this solution because it is easier or quicker than building a whole listener or visitor.

Using Patterns

The concept is very easy, so let’s see an example. We are going to match some patterns for Java code. The grammar we are using is mostly the Java9 grammars from the grammars repository. The little change that we made is to create a rule interfaces, that parses a file of interface declarations.

The example file looks like the following.

interface Formula {
    // body
}

public interface CallableProcessingInterceptor {
    // body
}

@FunctionalInterface
public interface RouterFunction<T extends ServerResponse> {
    // body
}

This is the code to use pattern matching.

Java9Parser.InterfacesContext itsc = javaParser.interfaces();

for (ParseTree t: itsc.normalInterfaceDeclaration())
{
    ParseTreePattern p = javaParser.compileParseTreePattern("interface <identifier> <interfaceBody>", Java9Parser.RULE_normalInterfaceDeclaration);
    ParseTreeMatch m = p.match(t);
    if (m.succeeded())
    {
        System.out.println(m.get("identifier").getText());
    }
}

In the first line we parse the source file using the aforementioned rule interfaces. We then get all the interfaces matched and try to match each interface to our pattern:

interface <identifier> <interfaceBody>

The elements between angle brackets are handled like tokens or rules of the grammar. So <identifier> will match the subrule identifier in the rule normalInterfaceDeclaration or any of its subrules. This means that, in this example, you could try to match elements inside the rule interfaceBody. Instead, the element interface is considered a literal because it is not enclosed in any delimiter. 

For our example file, this code would print:

formula

So, it would match just the first interface.

Understanding Pattern Matching

The method compileParseTreePattern tries to match the pattern to the specific referenced rule, in this case the rule is normalInterfaceDeclaration. It is important to understand well what that means.

It means that the pattern must identify a valid match of the rule. This means that the elements of the pattern must be present in the referenced rule or any subrule. So, we can use a pattern like this one:

interface <identifier> { <interfaceMemberDeclaration>  <interfaceMemberDeclaration> }

That is because these are the rules normalInterfaceDeclaration and interfaceBody in the grammar:

normalInterfaceDeclaration : interfaceModifier* 'interface' identifier typeParameters? extendsInterfaces? interfaceBody ;
interfaceBody
: '{' interfaceMemberDeclaration* '}'
;

Since interfaceMemberDeclaraton is a subrule of interfaceBody, we can use it in the pattern matching.

This is also a valid pattern:

interface <identifier> { }

However, this is not:

interface <identifier>

That is because this does not identify a valid match of the rule normalInterfaceDeclaration. We cannot use a partial pattern match, we need a valid match. In this example, normalInterfaceDeclaration requires an interfaceBody, so we need one in our pattern. 

This is an important limitation of pattern matching. In practical terms, this means that we will often use the pattern as seen in this example. First, we parse the whole input, then we apply our pattern matching to specific parts of the tree for which we can define a manageable pattern. We are rarely going to try to match a pattern for a whole file, at least when parsing programming languages, because it would be too complicated to define a valid pattern.

We can Match More Than Just Rules

In our patterns, we can also use literals instead of rule references. For example, this is a valid pattern:

interface Formula { }

This pattern will match all empty interfaces with the exact identifier Formula. This identifies a valid match for the rule normalInterfaceDeclaration, so this is an acceptable pattern. It is also an easy way to match things like expressions or all assignments to some number, like 0. It could be useful to do things like changing the default initialization value in a file.

Finally, you can also change the delimiters used to identify rules and the escape character used to escape the delimiters. The default escape character is the backslash \.

ParseTreePatternMatcher pm = new ParseTreePatternMatcher(javaLexer, javaParser);
pm.setDelimiters("(", ")", "/");
ParseTreePattern ptp = pm.compile("interface (identifier)<(typeParameterList)> (interfaceBody)", Java9Parser.RULE_normalInterfaceDeclaration);
ParseTreeMatch ptm = ptp.match(t);

In this example we changed the start delimiter to (, the end one to ) and the escape character to the / symbol. This makes it easier to parse generic arguments, that are delimited by angle brackets, like this one:

public interface RouterFunction<T extends ServerResponse> {}

Handling Matches

Once we get a pattern, we can find a match and extract the parts of the match that we care about using the name of the rule.

ParseTreeMatch m = p.match(t);
if (m.succeeded())
{
    System.out.println(m.get("identifier").getText());
}

In this example we get the node identified by the rule identifier. We can also define labels in our pattern matching, defining the name of the label followed by a colon.

interface <label:identifier> <interfaceBody>

In this pattern, the name of the label is label.

ParseTreeMatch m = p.match(t);
if (m.succeeded())
{
    System.out.println(m.get("label").getText());
}

We can then recover the node identified by the label just as if it were a rule name. In fact, this will also recover any rule name with the same name, so this will match any rule, token or label named label

The match method of ParseTreeMatch returns a matching parse tree, if any. You can also use a matches method, that returns a boolean, in case it finds a match.

Another useful method is findAll, which returns all matches. This method accepts as first argument a ParseTree and as second a pattern. However, this pattern is not a tree pattern, it is an XPath string.

Finding Nodes with XPath

XPath is a simple language originally used to query an XML document. In our case it is used to identify nodes or subtrees of a parse tree. It is simple, but quite powerful and effective to identify specific sections of a tree. If you do not have familiarity with XPath, you can think about something like the glob patterns used in .gitignore files to identify files and directories to exclude from a git repository.

XPath works well in conjunction with parse tree patterns because of the limitation of parse tree pattern we mentioned earlier. It is difficult to define valid patterns for a whole input. You can first identify some subtree using XPath and then apply the parse tree pattern to this manageable subtree in order to find what you need.

For example, you can use XPath to find the first method of a class and then apply a pattern to that method.

These are the basic expressions that you can use in ANTLR XPath.

ExpressionExplanationExample
namenodes with the corresponding token or ruleenumBody -> matches all nodes with the rule enumBody
‘text’tokens with the corresponding text‘var’ -> matches all tokens with the text “var”
/direct children of the node must obey the next pattern /compilationUnit/ordinaryCompilation/typeDeclaration/interfaceDeclaration -> matches all interfaceDeclaration that are children of /compilationUnit/ordinaryCompilation/typeDeclaration
//any descendants of the node must obey the next pattern//literal -> matches any literal node, anywhere in the tree
!negates the next pattern//!* -> matches nothing, anywhere in the tree

Let’s see some examples.

Using XPath

You can use XPath standalone, using the following general approach.

List<String> nodes = new ArrayList<String>();
for (ParseTree t : XPath.findAll(tree, xpath, parser) ) {
    if ( t instanceof RuleContext) {
        RuleContext r = (RuleContext)t;
        nodes.add(r.getText());   
    }     
    else {
        TerminalNode token = (TerminalNode)t;
        nodes.add(token.getText());
    }     
}

You can use the method findAll of the class XPath to find all nodes that follow the XPath identified by the second argument.

For example, you can use the following XPath to match all identifiers inside normal interface declaration, i.e., the name of the interface.

String xpath = "/compilationUnit/ordinaryCompilation/typeDeclaration/interfaceDeclaration/normalInterfaceDeclaration/identifier";

As we mentioned, you can also use XPath in conjunction with tree patterns.

ParseTree tree = javaParser.compilationUnit();
String xpath = "//literal";
String treePattern = "5";
ParseTreePattern p = javaParser.compileParseTreePattern(treePattern, Java9Parser.RULE_literal);
List<ParseTreeMatch> matches = p.findAll(tree, xpath);
for(ParseTreeMatch m: matches) {
    System.out.println(m.getTree().getText());
}

In this particular example we first find all nodes literal using XPath, then we find all literals with the text 5 using tree pattern. This combination works well because you can use the querying power of XPath to find the nodes in the tree you care about based on the structure of the tree. Then you can check the content of the nodes for the pattern you care about. So, XPath to identify a subtree and then pattern match to find the nodes you care about.

Conclusion

ANTLR offers powerful pattern matching capabilities that you can use to query and find specific nodes inside trees. This is not a new feature, it has been present since ANTLR 4.2, so it is stable and well-tested. 

The only issue with ANTLR pattern matching is that it has limited applicability. For one, you often prefer to move from the parse tree to an AST tailored for your application and use case. This is what we usually do, in fact we created our Starlasu libraries to simplify creating and manipulating ASTs. The parse tree is designed for good parsing: performance and manageability. However, this rarely makes it easy to work with in the rest of your code. So, when you transform the parse tree in the AST, you lose this capability.

The second reason because its has limited application in the context of parsing. Most often you need a parser to handle the whole input of the parser. For instance, to create a transpiler or to read a data format. This makes the ability of selecting specific nodes not applicable. You need to systematically collect all the content.

ANTLR pattern matching is useful for limited cases, when you need to work on the concrete syntax and to make small changes or take small bits of information. For example, it has its uses if you are doing some static analysis, making systematic changes to your code or some editor-related work.

 
https://strumenta-s-r-l.kit.com/ab55ba037c/index.js