I am a Language Engineer: I use several tools to define and process languages.

Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

  • ANTLR is a very good building block but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST and I do not see an ecosystem of libraries around ANTLR
  • ANTLR does not produce a metamodel of the grammar: without it becomes very difficult to build generic tools around ANTLR

Let me explain that:

  • For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
  • For the others: read next paragraph

Why we need a metamodel

Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

Well, given a ParseRuleContext I can take the rule index and find the name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

Python3Parser.Single_inputContext astRoot = pythonParse(...my code...);
String ruleName = Python3Parser.ruleNames[astRoot.getRuleIndex()];

Let’s look at the class Single_inputContext:

public static class Single_inputContext extends ParserRuleContext {
    public TerminalNode NEWLINE() { return getToken(Python3Parser.NEWLINE, 0); }
    public Simple_stmtContext simple_stmt() {
        return getRuleContext(Simple_stmtContext.class,0);
    }
    public Compound_stmtContext compound_stmt() {
        return getRuleContext(Compound_stmtContext.class,0);
    }
    public Single_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_single_input; }
    @Override
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterSingle_input(this);
    }
    @Override
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitSingle_input(this);
    }
}

In this case I would like to:

I should obtain something like this:

<Single_input NEWLINES="...">
   <Simple_stmt>...</Simple_stmt>
   <Compund_stmt>...</Compunt_stmt>
</root>

Good. It is very easy for me to look at the class and recognize these elements, however how can I do that automatically?

Reflection, obviously, you will think.

Yes. That would work. However what if when we have multiple elements? Take this class:

public static class File_inputContext extends ParserRuleContext {
    public TerminalNode EOF() { return getToken(Python3Parser.EOF, 0); }
    public List<TerminalNode> NEWLINE() { return getTokens(Python3Parser.NEWLINE); }
    public TerminalNode NEWLINE(int i) {
        return getToken(Python3Parser.NEWLINE, i);
    }
    public List<StmtContext> stmt() {
        return getRuleContexts(StmtContext.class);
    }
    public StmtContext stmt(int i) {
        return getRuleContext(StmtContext.class,i);
    }
    public File_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_file_input; }
    @Override
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterFile_input(this);
    }
    @Override
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitFile_input(this);
    }
}
Class clazz = Python3Parser.File_inputContext.class;
Method method = clazz.getMethod("stmt");
Type listType = method.getGenericReturnType();
if (listType instanceof ParameterizedType) {
    Type elementType = ((ParameterizedType) listType).getActualTypeArguments()[0];
    System.out.println("ELEMENT TYPE "+elementType);
}
ELEMENT TYPE class me.tomassetti.antlrplus.python.Python3Parser$StmtContext

To define metamodels I would not try to come up anything fancy. I would use the classical schema which is at the base of EMF and it is similar to what it is available in MPS.

I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those entity as the root Entity.

Each Entity would have:

  • a name
  • an optional parent Entity (from which it inherits properties and relations)
  • a list of properties
  • a list of relations

Each Property would have:

  • a name
  • a type chosen among the primitive type. In practice I expect to use just String and Integers. Possibly enums in the future
  • a multiplicity (1 or many)

Each Relation would have:

  • a name
  • the kind: containment or reference. Now, the AST knows only about containments, however later we could implement symbol resolution and model transformations and at that stage we will need references
  • a target type: another Entity
  • a multiplicity (1 or many)

Next steps

I would start building a metamodel and later building generic tools taking advantage of the metamodel.

There are other things that typically need:

  • transformations: the AST which I generally get from ANTLR is determined by how I am force to express the grammar to obtain something parsable. Sometimes I have also to do some refactoring to improve performance. I want to transform the AST after parsing to obtain closer to the logical structure of the language.
  • unmarshalling: from the AST I want to produce the test back
  • symbol resolution: this could be absolutely not trivial, as I have found out building a symbol solver for Java

Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally Xtext is coming with a lot of dependencies.

Do not get my wrong: I think Xtext is an amazing solution in a lot of contexts. However there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

By the way years ago I built something similar for .NET and I called it NetModelingFramework.