Parsing any language in Java in 5 minutes using ANTLR: for example Python

I like processing code for several purposes, like static analysis or automated refactoring. The interesting part to me is reasoning about the models you build from the Abstract Syntax Tree (AST). To get there you need a way to get the AST from your source files. This can be done easily using ANTLR and the collection of complete grammars available here: https://github.com/antlr/grammars-v4

Thank you folks for all the grammars!

We are just going to take the one for Python 3, which should also work fine for Python 2. If we need to make minor adjustments, we can easily do that starting from this base.

Getting the grammar

First things first: let’s get the grammar.

Just visit https://github.com/antlr/grammars-v4 and take the grammar you need. Most grammars have a very permissive license.

There are dozens of grammars for languages such as R, Scala, Python, Swift, PHP and many others. There is also one for Java, but for Java you probably prefer to use JavaParser, am I right?

Just copy the grammar into your new project, under src/main/antlr

Setting up the project using Gradle

Now we are going to set up a build script with Gradle.

We will use the ANTLR4 plugin from melix, because I find it more flexible than the one described in the official documentation.

We will generate the code in a specific package (me.tomassetti.pythonast.parser) and therefore in a directory derived from that package (build/generated-src/me/tomassetti/pythonast/parser).
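The output directory mirrors the package because the Java compiler expects sources to sit in a folder structure that matches their package name. A minimal sketch of that mapping follows; the class and method names are hypothetical and only illustrate the convention, they are not part of the build script or of the plugin.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class GeneratedSourceLayout {
    // Maps a package name to the directory holding its sources under a root,
    // turning each dot-separated segment into one directory level.
    static Path directoryFor(String root, String packageName) {
        Path dir = Paths.get(root);
        for (String segment : packageName.split("\\.")) {
            dir = dir.resolve(segment);
        }
        return dir;
    }

    public static void main(String[] args) {
        Path dir = directoryFor("build/generated-src", "me.tomassetti.pythonast.parser");
        System.out.println(dir);
    }
}
```

With the root and package used in this post, the result corresponds to build/generated-src/me/tomassetti/pythonast/parser.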

I also added a fatJar task. That task produces a JAR containing all the dependencies. I use it to import the parser into JetBrains MPS more easily.

To generate the parser from the grammar you can just run gradle antlr4.

You then have to tell your IDE that it should treat the code under build/generated-src as source code.

How to invoke the parser

Now let’s see how we can invoke the parser.

Our ParserFacade has just one public method named parse. It takes a File and returns an AST. It could hardly be simpler than that.

Let’s look at some ASTs

Let’s take a simple file:

And now get the AST. We can print it using this code:
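A minimal sketch of such a printer, assuming a hypothetical TreeNode type that stands in for ANTLR's generated context classes (TreeNode and TreePrinter are illustrative names, not part of ANTLR or of the project). The idea is a plain recursive walk that writes one rule name per line and indents by depth. The rule names in the example are only sample labels.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse-tree node: a rule name plus its children.
class TreeNode {
    final String ruleName;
    final List<TreeNode> children;

    TreeNode(String ruleName, TreeNode... children) {
        this.ruleName = ruleName;
        this.children = Arrays.asList(children);
    }
}

class TreePrinter {
    // Renders the tree one node per line, indenting two spaces per level.
    static String render(TreeNode root) {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    private static void render(TreeNode node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.ruleName).append('\n');
        for (TreeNode child : node.children) {
            render(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        TreeNode tree = new TreeNode("file_input",
                new TreeNode("stmt",
                        new TreeNode("simple_stmt",
                                new TreeNode("expr_stmt"))));
        System.out.print(render(tree));
    }
}
```

Returning a String instead of printing directly keeps the walk easy to test; with a real ANTLR tree the same recursion would read the rule name and the children from each context object.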

If we parse the simple example and print it with AstPrinter we get a super complex AST. The first lines look like:

Because of the way the parser is built, there are a lot of nested rules. That makes sense while parsing, but it produces a very polluted AST. I think there are two different ASTs: a parsing AST, which is easy to produce, and a logic AST, which is easy to reason about. Luckily we can transform the former into the latter without too much effort.

One simple way is to list all the rules that are just wrappers and skip them, taking their only child instead. We may have to refine this, but as a first approximation let’s just skip the nodes which have exactly one child, where that child is another parser rule (not a terminal).
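The skipping rule can be sketched on a toy tree. This is an illustration under assumptions, not the project's code: AstNode and WrapperCollapser are hypothetical names, and the node type is a simplified stand-in that only records a label, whether the node is a terminal, and its children.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parse-tree node; not part of ANTLR's API.
class AstNode {
    final String label;
    final boolean terminal; // true for tokens, false for parser rules
    final List<AstNode> children = new ArrayList<>();

    private AstNode(String label, boolean terminal) {
        this.label = label;
        this.terminal = terminal;
    }

    static AstNode rule(String name, AstNode... children) {
        AstNode node = new AstNode(name, false);
        for (AstNode child : children) {
            node.children.add(child);
        }
        return node;
    }

    static AstNode token(String text) {
        return new AstNode(text, true);
    }

    // Number of nodes in the subtree rooted here, this node included.
    int size() {
        int count = 1;
        for (AstNode child : children) {
            count += child.size();
        }
        return count;
    }
}

class WrapperCollapser {
    // Returns a simplified copy: a rule whose only child is another rule
    // is treated as a wrapper and replaced by that (simplified) child.
    static AstNode simplify(AstNode node) {
        if (node.terminal) {
            return node;
        }
        boolean wrapper = node.children.size() == 1 && !node.children.get(0).terminal;
        if (wrapper) {
            return simplify(node.children.get(0));
        }
        AstNode copy = AstNode.rule(node.label);
        for (AstNode child : node.children) {
            copy.children.add(simplify(child));
        }
        return copy;
    }

    // A small hand-built tree, loosely shaped like the parse of "x = 1".
    static AstNode sample() {
        return AstNode.rule("file_input",
                AstNode.rule("stmt",
                        AstNode.rule("simple_stmt",
                                AstNode.rule("expr_stmt",
                                        AstNode.rule("test", AstNode.rule("atom", AstNode.token("x"))),
                                        AstNode.token("="),
                                        AstNode.rule("test", AstNode.rule("atom", AstNode.token("1")))))),
                AstNode.token("<EOF>"));
    }

    public static void main(String[] args) {
        AstNode parsed = sample();
        AstNode logic = simplify(parsed);
        System.out.println(parsed.size() + " nodes before, " + logic.size() + " after");
    }
}
```

On this toy tree the wrappers stmt, simple_stmt and both test nodes disappear, so it prints "12 nodes before, 8 after". A rule that wraps a single terminal, like atom here, is kept because its child is not a parser rule.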

In this way we go from 164 nodes to 28. The resulting logic AST is:

In this tree everything should be mapped to a concept we understand, with no artificial nodes in the way, that is, nodes created only for parsing reasons.

Conclusions

Writing parsers is not where we can produce the most value. We can easily reuse existing grammars, generate parsers, and build our smart applications using those parsers.

There are several parser generators out there and most of them are good enough for most goals you can have. Among them I tend to use ANTLR more than others: it is mature, it is supported, it is fast. The ASTs it produces can be navigated using both heterogeneous APIs (a separate class is generated for each kind of node) and homogeneous APIs (we can ask each node which rule it represents and for the list of its children).
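To make the two navigation styles concrete, here is a small sketch with hand-written classes. All names (Ctx, NameCtx, AssignmentCtx, ApiStyles) are hypothetical stand-ins for what a generator would produce; they are not ANTLR's actual classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Homogeneous view: every node answers the same two questions.
abstract class Ctx {
    abstract String ruleName();
    abstract List<Ctx> children();
}

class NameCtx extends Ctx {
    final String text;

    NameCtx(String text) {
        this.text = text;
    }

    String ruleName() { return "name"; }
    List<Ctx> children() { return Collections.emptyList(); }
}

class AssignmentCtx extends Ctx {
    private final NameCtx target;
    private final NameCtx value;

    AssignmentCtx(NameCtx target, NameCtx value) {
        this.target = target;
        this.value = value;
    }

    // Heterogeneous view: typed accessors that exist only on this node kind.
    NameCtx target() { return target; }
    NameCtx value() { return value; }

    String ruleName() { return "assignment"; }
    List<Ctx> children() { return Arrays.<Ctx>asList(target, value); }
}

class ApiStyles {
    // Generic traversal that relies only on the homogeneous view.
    static int countNodes(Ctx node) {
        int count = 1;
        for (Ctx child : node.children()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        AssignmentCtx assignment = new AssignmentCtx(new NameCtx("x"), new NameCtx("y"));
        System.out.println(assignment.target().text); // typed access to one specific child
        System.out.println(countNodes(assignment));   // works for any kind of node
    }
}
```

The homogeneous view suits generic tools such as printers or node counters, while the typed accessors are more convenient when the logic depends on the specific kind of node.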

Another great benefit of ANTLR is the availability of ready-to-use grammars. Building grammars requires experience and some work, especially for complex general-purpose languages (GPLs) like Java or Python. It also requires very extensive testing. We are still finding minor issues with the Java 8 grammars behind JavaParser even though we have parsed literally hundreds of thousands of files using it. This is a very good reason not to write your own grammar if you can avoid it.

By the way, all the code is available on GitHub: python-ast

Question: are you interested in parsing? Do you prefer other parser generators to ANTLR?

 

Comments

  1. Would you please share with us the generated files that you got after running gradle antlr4?
    I followed the same steps while working on an ObjC grammar, and the parser that I got has no method called file_input()

    PS: in the github repo I only find your own classes.

  2. Hi, the files are not included in the repository because they are generated. You should re-generate them locally by running “gradle compileJava” in the root of the project.

    The name of the method is the name of the root rule. You should look for the rule at the top of your ObjC Grammar and invoke the method with the corresponding name.

  5. Hi Federico, thanks for such a great tutorial.
    However, I am facing the same problem as Sarra. Would you please help me find the correct rule to call for the C language?
    https://github.com/antlr/grammars-v4/tree/master/c
    I tried to call ‘primaryExpression()’ as you suggested, but no success.

    Thanks in advance
    Jan

  6. Hi Jan, the rule you are looking for is “compilationUnit”

  7. Thank you, Federico!
