Written by Federico Tomassetti
in ANTLR, Parsing

    I am a Language Engineer: I use several tools to define and process languages.

    Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

    However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

    • ANTLR is a very good building block, but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST, and I do not see an ecosystem of libraries around ANTLR
    • ANTLR does not produce a metamodel of the grammar: without one it becomes very difficult to build generic tools around ANTLR

    Let me explain that:

    • For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
    • For the others: read the next paragraph

    Why we need a metamodel

    Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

    Well, given a ParserRuleContext I can take the rule index and find the rule name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

    Python3Parser.Single_inputContext astRoot = pythonParse(...my code...);
    String ruleName = Python3Parser.ruleNames[astRoot.getRuleIndex()];
    

    Let’s look at the class Single_inputContext:

    public static class Single_inputContext extends ParserRuleContext {
        public TerminalNode NEWLINE() { return getToken(Python3Parser.NEWLINE, 0); }
        public Simple_stmtContext simple_stmt() {
            return getRuleContext(Simple_stmtContext.class,0);
        }
        public Compound_stmtContext compound_stmt() {
            return getRuleContext(Compound_stmtContext.class,0);
        }
        public Single_inputContext(ParserRuleContext parent, int invokingState) {
            super(parent, invokingState);
        }
        @Override public int getRuleIndex() { return RULE_single_input; }
        @Override
        public void enterRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterSingle_input(this);
        }
        @Override
        public void exitRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitSingle_input(this);
        }
    }

    In this case I would like to obtain something like this:

    <Single_input NEWLINES="...">
       <Simple_stmt>...</Simple_stmt>
       <Compound_stmt>...</Compound_stmt>
    </Single_input>

    Good. It is very easy for me to look at the class and recognize these elements. However, how can I do that automatically?

    Reflection, obviously, you will think.
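    As a rough illustration of the reflective approach, here is a minimal sketch that turns a node into XML by calling its zero-argument accessors: terminals become attributes and child rules become nested elements. So that it compiles without ANTLR, the classes TerminalNode, ParserRuleContext and the two contexts are simplified, hypothetical stand-ins for the real runtime and generated classes, and the names ReflectiveXml, toXml and sample are made up for this example. It does no XML escaping and ignores accessors that return lists.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

public class ReflectiveXml {
    // Hypothetical stand-ins for the ANTLR runtime and the generated contexts.
    public static class TerminalNode {
        private final String text;
        public TerminalNode(String text) { this.text = text; }
        public String getText() { return text; }
    }
    public static class ParserRuleContext {}
    public static class Simple_stmtContext extends ParserRuleContext {}
    public static class Compound_stmtContext extends ParserRuleContext {}
    public static class Single_inputContext extends ParserRuleContext {
        public TerminalNode newline;
        public Simple_stmtContext simple;
        public Compound_stmtContext compound;
        public TerminalNode NEWLINE() { return newline; }
        public Simple_stmtContext simple_stmt() { return simple; }
        public Compound_stmtContext compound_stmt() { return compound; }
    }

    // Single_inputContext -> Single_input
    static String elementName(Class<?> contextClass) {
        String name = contextClass.getSimpleName();
        String suffix = "Context";
        return name.endsWith(suffix) ? name.substring(0, name.length() - suffix.length()) : name;
    }

    public static String toXml(ParserRuleContext node) {
        // Sorted maps, because getDeclaredMethods() returns methods in no particular order.
        Map<String, String> attributes = new TreeMap<>();
        Map<String, String> children = new TreeMap<>();
        try {
            for (Method m : node.getClass().getDeclaredMethods()) {
                if (m.getParameterCount() != 0 || m.isSynthetic()) continue;
                Class<?> returned = m.getReturnType();
                boolean terminal = TerminalNode.class.isAssignableFrom(returned);
                boolean rule = ParserRuleContext.class.isAssignableFrom(returned);
                if (!terminal && !rule) continue;   // e.g. getRuleIndex() or list accessors
                Object value = m.invoke(node);
                if (value == null) continue;        // optional element not present
                if (terminal) attributes.put(m.getName(), ((TerminalNode) value).getText());
                else children.put(m.getName(), toXml((ParserRuleContext) value));
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot inspect " + node.getClass(), e);
        }
        String tag = elementName(node.getClass());
        StringBuilder xml = new StringBuilder("<").append(tag);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            xml.append(' ').append(a.getKey()).append("=\"").append(a.getValue()).append('"');
        }
        xml.append('>');
        for (String child : children.values()) xml.append(child);
        return xml.append("</").append(tag).append('>').toString();
    }

    // A hand-built node, standing in for the result of a real parse.
    public static Single_inputContext sample() {
        Single_inputContext root = new Single_inputContext();
        root.newline = new TerminalNode("NL");
        root.simple = new Simple_stmtContext();
        return root;
    }

    public static void main(String[] args) {
        System.out.println(toXml(sample()));
    }
}
```

    The sketch relies only on the shape of the accessors, so it knows nothing about the Python grammar itself.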

    Yes. That would work. However, what happens when we have multiple elements? Take this class:

    public static class File_inputContext extends ParserRuleContext {
        public TerminalNode EOF() { return getToken(Python3Parser.EOF, 0); }
        public List<TerminalNode> NEWLINE() { return getTokens(Python3Parser.NEWLINE); }
        public TerminalNode NEWLINE(int i) {
            return getToken(Python3Parser.NEWLINE, i);
        }
        public List<StmtContext> stmt() {
            return getRuleContexts(StmtContext.class);
        }
        public StmtContext stmt(int i) {
            return getRuleContext(StmtContext.class,i);
        }
        public File_inputContext(ParserRuleContext parent, int invokingState) {
            super(parent, invokingState);
        }
        @Override public int getRuleIndex() { return RULE_file_input; }
        @Override
        public void enterRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterFile_input(this);
        }
        @Override
        public void exitRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitFile_input(this);
        }
    }
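
    Here NEWLINE() and stmt() return lists, so we also need the element type of each list. Reflection can recover it from the generic return type:

    // needs java.lang.reflect.Method, ParameterizedType and Type
    Class<?> clazz = Python3Parser.File_inputContext.class;
    // the zero-argument overload of stmt() returns List<StmtContext>
    Method method = clazz.getMethod("stmt");
    Type listType = method.getGenericReturnType();
    if (listType instanceof ParameterizedType) {
        // the first type argument is the element type of the list
        Type elementType = ((ParameterizedType) listType).getActualTypeArguments()[0];
        System.out.println("ELEMENT TYPE "+elementType);
    }

    This prints:

    ELEMENT TYPE class me.tomassetti.antlrplus.python.Python3Parser$StmtContext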

    To define metamodels I would not try to come up with anything fancy. I would use the classical schema which is at the base of EMF and is similar to what is available in MPS.

    I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those entities as the root Entity.

    Each Entity would have:

    • a name
    • an optional parent Entity (from which it inherits properties and relations)
    • a list of properties
    • a list of relations

    Each Property would have:

    • a name
    • a type chosen among the primitive types. In practice I expect to use just Strings and Integers, possibly enums in the future
    • a multiplicity (1 or many)

    Each Relation would have:

    • a name
    • the kind: containment or reference. For now the AST knows only about containments; however, later we could implement symbol resolution and model transformations, and at that stage we will need references
    • a target type: another Entity
    • a multiplicity (1 or many)
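    The structure described above could be written down roughly as follows. This is only a sketch: the class MetamodelSketch, the enum names, the allProperties helper and the tiny example metamodel (including the "line" property on Stmt) are illustrative assumptions, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;

public class MetamodelSketch {
    public enum Multiplicity { ONE, MANY }
    public enum PrimitiveType { STRING, INTEGER }
    public enum RelationKind { CONTAINMENT, REFERENCE }

    public static final class Property {
        public final String name;
        public final PrimitiveType type;
        public final Multiplicity multiplicity;
        public Property(String name, PrimitiveType type, Multiplicity multiplicity) {
            this.name = name;
            this.type = type;
            this.multiplicity = multiplicity;
        }
    }

    public static final class Relation {
        public final String name;
        public final RelationKind kind;
        public final Entity target;
        public final Multiplicity multiplicity;
        public Relation(String name, RelationKind kind, Entity target, Multiplicity multiplicity) {
            this.name = name;
            this.kind = kind;
            this.target = target;
            this.multiplicity = multiplicity;
        }
    }

    public static final class Entity {
        public final String name;
        public final Entity parent;   // null when the Entity has no parent
        public final List<Property> properties = new ArrayList<>();
        public final List<Relation> relations = new ArrayList<>();
        public Entity(String name, Entity parent) {
            this.name = name;
            this.parent = parent;
        }
        // Inherited properties first, then the ones declared on this Entity.
        public List<Property> allProperties() {
            List<Property> all = new ArrayList<>();
            if (parent != null) all.addAll(parent.allProperties());
            all.addAll(properties);
            return all;
        }
    }

    public static final class Package {
        public final String name;
        public final List<Entity> entities = new ArrayList<>();
        public Entity root;           // optional root Entity
        public Package(String name) { this.name = name; }
    }

    // A tiny hand-written metamodel; the entity and property names are illustrative.
    public static Package example() {
        Entity stmt = new Entity("Stmt", null);
        stmt.properties.add(new Property("line", PrimitiveType.INTEGER, Multiplicity.ONE));
        Entity simpleStmt = new Entity("Simple_stmt", stmt);
        Entity fileInput = new Entity("File_input", null);
        fileInput.properties.add(new Property("NEWLINE", PrimitiveType.STRING, Multiplicity.MANY));
        fileInput.relations.add(
                new Relation("stmt", RelationKind.CONTAINMENT, stmt, Multiplicity.MANY));
        Package pkg = new Package("python");
        pkg.entities.add(stmt);
        pkg.entities.add(simpleStmt);
        pkg.entities.add(fileInput);
        pkg.root = fileInput;
        return pkg;
    }

    public static void main(String[] args) {
        Package pkg = example();
        System.out.println(pkg.name + " has " + pkg.entities.size() + " entities, root " + pkg.root.name);
    }
}
```

    Keeping Property and Relation as separate classes mirrors the split between primitive values and links to other Entities.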

    Next steps

    I would start by building a metamodel and later build generic tools that take advantage of it.
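    One possible first step, sketched under assumptions, is to derive the features of an Entity from a generated context class by reflection. The mapping used here (terminals become String properties, child rules become containment relations, a List return type means multiplicity "many") is my reading of the idea, not a settled design. The class EntityDeriver and the stand-in types are hypothetical, so that the example runs without ANTLR, and features are reported as plain strings to keep it short.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EntityDeriver {
    // Hypothetical stand-ins shaped like the ANTLR runtime and generated classes.
    public static class TerminalNode {}
    public static class ParserRuleContext {}
    public static class StmtContext extends ParserRuleContext {}
    public static class File_inputContext extends ParserRuleContext {
        public TerminalNode EOF() { return null; }
        public List<TerminalNode> NEWLINE() { return null; }
        public TerminalNode NEWLINE(int i) { return null; }
        public List<StmtContext> stmt() { return null; }
        public StmtContext stmt(int i) { return null; }
    }

    // StmtContext -> Stmt
    public static String entityName(Class<?> contextClass) {
        String name = contextClass.getSimpleName();
        String suffix = "Context";
        return name.endsWith(suffix) ? name.substring(0, name.length() - suffix.length()) : name;
    }

    // Maps each feature name to a short description, sorted by name.
    public static Map<String, String> derive(Class<?> contextClass) {
        Map<String, String> features = new TreeMap<>();
        for (Method m : contextClass.getDeclaredMethods()) {
            // Only zero-argument accessors: the indexed overloads add no information.
            if (m.getParameterCount() != 0 || m.isSynthetic()) continue;
            Class<?> target = m.getReturnType();
            String multiplicity = "one";
            Type generic = m.getGenericReturnType();
            if (List.class.isAssignableFrom(target) && generic instanceof ParameterizedType) {
                Type element = ((ParameterizedType) generic).getActualTypeArguments()[0];
                if (!(element instanceof Class)) continue;   // wildcards etc. are out of scope here
                target = (Class<?>) element;
                multiplicity = "many";
            }
            if (TerminalNode.class.isAssignableFrom(target)) {
                features.put(m.getName(), "property String " + multiplicity);
            } else if (ParserRuleContext.class.isAssignableFrom(target)) {
                features.put(m.getName(), "containment " + entityName(target) + " " + multiplicity);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(entityName(File_inputContext.class) + " " + derive(File_inputContext.class));
    }
}
```

    Whether every terminal deserves to become a property is debatable: tokens such as EOF carry no information, so a real tool would probably need a way to filter them out.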

    There are other things that I typically need:

    • transformations: the AST which I generally get from ANTLR is determined by how I am forced to express the grammar to obtain something parsable. Sometimes I also have to do some refactoring to improve performance. I want to transform the AST after parsing to obtain something closer to the logical structure of the language.
    • unmarshalling: from the AST I want to produce the text back
    • symbol resolution: this could be absolutely not trivial, as I have found out building a symbol solver for Java

    Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally, Xtext comes with a lot of dependencies.

    Do not get me wrong: I think Xtext is an amazing solution in a lot of contexts. However, there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

    By the way, years ago I built something similar for .NET and I called it NetModelingFramework.
