I am a Language Engineer: I use several tools to define and process languages.
Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.
However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:
- ANTLR is a very good building block but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST and I do not see an ecosystem of libraries around ANTLR
- ANTLR does not produce a metamodel of the grammar: without one it becomes very difficult to build generic tools around ANTLR
Let me explain that:
- For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
- For the others: read the next paragraph
Why we need a metamodel
Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?
Well, given a ParserRuleContext I can take the rule index and find the name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:
    Python3Parser.Single_inputContext astRoot = pythonParse(...my code...);
    String ruleName = Python3Parser.ruleNames[astRoot.getRuleIndex()];
Let’s look at the class Single_inputContext:
    public static class Single_inputContext extends ParserRuleContext {
        public TerminalNode NEWLINE() { return getToken(Python3Parser.NEWLINE, 0); }
        public Simple_stmtContext simple_stmt() {
            return getRuleContext(Simple_stmtContext.class,0);
        }
        public Compound_stmtContext compound_stmt() {
            return getRuleContext(Compound_stmtContext.class,0);
        }
        public Single_inputContext(ParserRuleContext parent, int invokingState) {
            super(parent, invokingState);
        }
        @Override public int getRuleIndex() { return RULE_single_input; }
        @Override
        public void enterRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterSingle_input(this);
        }
        @Override
        public void exitRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitSingle_input(this);
        }
    }
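In this case I would like to use the rule name as the element name, turn the NEWLINE terminal into an attribute, and turn simple_stmt and compound_stmt into child elements.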
I should obtain something like this:
    <Single_input NEWLINE="...">
        <Simple_stmt>...</Simple_stmt>
        <Compound_stmt>...</Compound_stmt>
    </Single_input>
Good. It is very easy for me to look at the class and recognize these elements. But how can I do that automatically?
Reflection, obviously, you will think.
Yes. That would work. However, what happens when we have multiple elements? Take this class:
    public static class File_inputContext extends ParserRuleContext {
        public TerminalNode EOF() { return getToken(Python3Parser.EOF, 0); }
        public List<TerminalNode> NEWLINE() { return getTokens(Python3Parser.NEWLINE); }
        public TerminalNode NEWLINE(int i) {
            return getToken(Python3Parser.NEWLINE, i);
        }
        public List<StmtContext> stmt() {
            return getRuleContexts(StmtContext.class);
        }
        public StmtContext stmt(int i) {
            return getRuleContext(StmtContext.class,i);
        }
        public File_inputContext(ParserRuleContext parent, int invokingState) {
            super(parent, invokingState);
        }
        @Override public int getRuleIndex() { return RULE_file_input; }
        @Override
        public void enterRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterFile_input(this);
        }
        @Override
        public void exitRule(ParseTreeListener listener) {
            if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitFile_input(this);
        }
    }
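Reflection still helps here, because the generic return type of the method tells us the element type of the list:
    // Types used below come from java.lang.reflect: Method, Type, ParameterizedType
    Class<?> clazz = Python3Parser.File_inputContext.class;
    Method method = clazz.getMethod("stmt");
    Type listType = method.getGenericReturnType();
    if (listType instanceof ParameterizedType) {
        // For List<StmtContext> the first type argument is StmtContext
        Type elementType = ((ParameterizedType) listType).getActualTypeArguments()[0];
        System.out.println("ELEMENT TYPE "+elementType);
    }
This prints:
    ELEMENT TYPE class me.tomassetti.antlrplus.python.Python3Parser$StmtContext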
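To make the idea more concrete, here is a minimal sketch of how reflection could classify every accessor of a context class at once. It assumes that terminals become properties and sub-rules become relations. The types TerminalNode, ParserRuleContext, StmtContext and FileInputContext below are small stand-ins I wrote so the example runs without the ANTLR runtime; they are not the generated classes, and the name ContextIntrospector is purely illustrative.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ContextIntrospector {

    // Stand-ins for the ANTLR runtime types (assumption: only used for type checks here).
    interface TerminalNode {}
    static class ParserRuleContext {}

    // Hypothetical context classes mirroring the accessors of File_inputContext.
    static class StmtContext extends ParserRuleContext {}
    static class FileInputContext extends ParserRuleContext {
        public TerminalNode EOF() { return null; }
        public List<TerminalNode> NEWLINE() { return Collections.emptyList(); }
        public TerminalNode NEWLINE(int i) { return null; }
        public List<StmtContext> stmt() { return Collections.emptyList(); }
        public StmtContext stmt(int i) { return null; }
    }

    /** Describes each zero-argument accessor as "name: kind multiplicity". */
    static List<String> describe(Class<?> contextClass) {
        List<String> result = new ArrayList<>();
        for (Method method : contextClass.getDeclaredMethods()) {
            if (method.getParameterCount() != 0) continue; // skip indexed variants such as stmt(int)
            Type returnType = method.getGenericReturnType();
            boolean many = false;
            Class<?> elementClass = null;
            if (returnType instanceof ParameterizedType
                    && ((ParameterizedType) returnType).getRawType() == List.class) {
                // List<X>: the element type is the first type argument
                Type arg = ((ParameterizedType) returnType).getActualTypeArguments()[0];
                if (arg instanceof Class) { elementClass = (Class<?>) arg; many = true; }
            } else if (returnType instanceof Class) {
                elementClass = (Class<?>) returnType;
            }
            if (elementClass == null) continue;
            String kind;
            if (TerminalNode.class.isAssignableFrom(elementClass)) kind = "property";
            else if (ParserRuleContext.class.isAssignableFrom(elementClass)) kind = "relation";
            else continue; // e.g. getRuleIndex() returning int
            result.add(method.getName() + ": " + kind + " " + (many ? "many" : "one"));
        }
        Collections.sort(result); // getDeclaredMethods() returns methods in no particular order
        return result;
    }

    public static void main(String[] args) {
        describe(FileInputContext.class).forEach(System.out::println);
    }
}
```

This works, but every generic tool would have to repeat this kind of inspection and guess the same conventions, which is exactly the information a metamodel would state explicitly.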
What should the metamodel look like?
To define metamodels I would not try to come up with anything fancy. I would use the classical schema which is at the base of EMF and which is similar to what is available in MPS.
I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those Entities as the root Entity.
Each Entity would have:
- a name
- an optional parent Entity (from which it inherits properties and relations)
- a list of properties
- a list of relations
Each Property would have:
- a name
- a type chosen among the primitive types. In practice I expect to use just Strings and Integers. Possibly enums in the future
- a multiplicity (1 or many)
Each Relation would have:
- a name
- the kind: containment or reference. For now the AST knows only about containments; however, later we could implement symbol resolution and model transformations, and at that stage we will need references
- a target type: another Entity
- a multiplicity (1 or many)
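The structure described above could be sketched in Java roughly as follows. This is only one possible encoding: the class names (Metamodel, Entity, Property, Relation), the enums, and the hand-written pythonFragment example are my own illustrative choices, not an existing library.

```java
import java.util.ArrayList;
import java.util.List;

public class MetamodelSketch {

    enum Multiplicity { ONE, MANY }
    enum PrimitiveType { STRING, INTEGER }
    enum RelationKind { CONTAINMENT, REFERENCE }

    static final class Property {
        final String name;
        final PrimitiveType type;
        final Multiplicity multiplicity;
        Property(String name, PrimitiveType type, Multiplicity multiplicity) {
            this.name = name; this.type = type; this.multiplicity = multiplicity;
        }
    }

    static final class Relation {
        final String name;
        final RelationKind kind;
        final Entity target;
        final Multiplicity multiplicity;
        Relation(String name, RelationKind kind, Entity target, Multiplicity multiplicity) {
            this.name = name; this.kind = kind; this.target = target; this.multiplicity = multiplicity;
        }
    }

    static final class Entity {
        final String name;
        final Entity parent; // null when the Entity has no parent
        final List<Property> properties = new ArrayList<>();
        final List<Relation> relations = new ArrayList<>();
        Entity(String name, Entity parent) { this.name = name; this.parent = parent; }

        /** Inherited properties first, then the ones declared on this Entity. */
        List<Property> allProperties() {
            List<Property> all = new ArrayList<>();
            if (parent != null) all.addAll(parent.allProperties());
            all.addAll(properties);
            return all;
        }
    }

    static final class Metamodel {
        final String name;
        final List<Entity> entities = new ArrayList<>();
        Entity root; // optional root Entity
        Metamodel(String name) { this.name = name; }
    }

    /** Hand-written fragment describing File_input; a real tool would derive it from the grammar. */
    static Metamodel pythonFragment() {
        Metamodel mm = new Metamodel("Python3");
        Entity stmt = new Entity("Stmt", null);
        Entity fileInput = new Entity("File_input", null);
        fileInput.properties.add(new Property("NEWLINE", PrimitiveType.STRING, Multiplicity.MANY));
        fileInput.relations.add(new Relation("stmt", RelationKind.CONTAINMENT, stmt, Multiplicity.MANY));
        mm.entities.add(stmt);
        mm.entities.add(fileInput);
        mm.root = fileInput;
        return mm;
    }

    public static void main(String[] args) {
        Metamodel mm = pythonFragment();
        System.out.println(mm.name + " root: " + mm.root.name);
    }
}
```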
Next steps
I would start building a metamodel and later building generic tools taking advantage of the metamodel.
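As an illustration of what such a generic tool might look like, here is a small sketch of an XML serializer that works on any tree of generic nodes. The Node class is hypothetical: I assume that a layer driven by the metamodel has already converted the ANTLR contexts into nodes carrying an entity name, properties and contained children. That conversion is not shown, and the "text" property in the example is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenericXmlSketch {

    /** Hypothetical generic node: an instance of some Entity of the metamodel. */
    static final class Node {
        final String entityName;
        final Map<String, String> properties = new LinkedHashMap<>(); // keeps insertion order
        final List<Node> children = new ArrayList<>();                // contained nodes
        Node(String entityName) { this.entityName = entityName; }
    }

    /** Escapes the characters that are not allowed verbatim in an XML attribute value. */
    static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Entity name becomes the element, properties become attributes, children are nested. */
    static String toXml(Node node) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(node.entityName);
        for (Map.Entry<String, String> e : node.properties.entrySet()) {
            sb.append(' ').append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
        }
        if (node.children.isEmpty()) {
            return sb.append("/>").toString();
        }
        sb.append('>');
        for (Node child : node.children) {
            sb.append(toXml(child));
        }
        return sb.append("</").append(node.entityName).append('>').toString();
    }

    public static void main(String[] args) {
        Node root = new Node("Single_input");
        Node stmt = new Node("Simple_stmt");
        stmt.properties.put("text", "x < 1"); // made-up property, for illustration only
        root.children.add(stmt);
        System.out.println(toXml(root));
        // prints: <Single_input><Simple_stmt text="x &lt; 1"/></Single_input>
    }
}
```

The point is that toXml knows nothing about Python: the same code would serve any grammar once its AST is exposed through the metamodel.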
There are other things that I typically need:
- transformations: the AST which I generally get from ANTLR is determined by how I am forced to express the grammar to obtain something parsable. Sometimes I also have to do some refactoring to improve performance. I want to transform the AST after parsing to obtain something closer to the logical structure of the language.
- unmarshalling: from the AST I want to produce the text back
- symbol resolution: this can be far from trivial, as I found out while building a symbol solver for Java
Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally, Xtext comes with a lot of dependencies.
Do not get me wrong: I think Xtext is an amazing solution in a lot of contexts. However, there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.
By the way, years ago I built something similar for .NET and I called it NetModelingFramework.