Introduction
In this article, we will learn how to implement parsers in Python using Pylasu and ANTLR. In order, we will:
- Create an ANTLR grammar for a simple ‘kinda-functional’ programming language called Slang and generate a parser from it;
- Define an Abstract Syntax Tree (AST) using Pylasu and learn how to build these from ANTLR parse trees;
- Integrate our parser with a Command-Line Interface (CLI) application, allowing users to parse Slang code from both strings and files and visualize a JSON representation of the corresponding AST;
In order to keep the article from being too verbose, some code details might be omitted. Check out the complete project source code on GitHub. Feel free to fork the project, play with it and share your improvements, comments or ideas with us!
Parsers and Abstract Syntax Trees
Maybe as part of our latest attempt to create a shiny-looking innovative programming language or data format, or maybe to dust off an existing one – teaching computers to understand languages is a fun drive!
Here, the very first step consists of transforming text into something that can be easily manipulated by computers. At first glance, venturing into the dark world of regular expressions might seem like the right path. After all, we are talking about text, aren’t we?
Unfortunately, regular expressions do not provide enough expressive power to process the syntax of most language constructs. Most existing languages, indeed, require Context-Free Grammars (CFGs) – thanks, Prof. Chomsky. Manually written regular expressions also tend to become cryptic and hard to maintain in this context.
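To make the limitation concrete, here is a small illustrative sketch (our own toy example, not part of the Slang project): a fixed regular expression can only cope with parenthesis nesting up to the depth it was written for, while even a trivial hand-written check handles arbitrary depth.

```python
import re

# A regex only handles nesting up to a fixed depth; every extra level
# requires another hand-written alternative in the pattern.
two_levels = re.compile(r"^\((?:[^()]|\([^()]*\))*\)$")

def balanced(text: str) -> bool:
    """A trivial counter-based check that handles any nesting depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(two_levels.match("((a))")))    # the regex still copes here
print(bool(two_levels.match("(((a)))")))  # ...but fails one level deeper
print(balanced("(((a)))"))                # the counter does not care
```

This is exactly the kind of construct – arbitrarily nested expressions – that context-free grammars capture naturally and regular expressions cannot.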
In practice, CFGs are used to describe the syntax of most computer languages in terms of rules and dedicated software components are implemented to recognize syntactic constructs in conformance with such grammars. Put simply, we use Parsers.
The typical output of a parser is a tree-like data structure called a parse tree. In it, the root and the intermediate nodes represent non-terminal rule matches, while leaf nodes correspond to terminals. For example, here is a possible parse tree representation for a simple series of JavaScript statements.
Numerous parsing algorithms exist in the literature, providing different trade-offs between expressive power and computational complexity.
Parsing algorithms can be implemented manually, but a definitely more convenient approach is to rely on parser generators. These take a grammar specification and produce a conforming parser implementation.
Among parser generators, we love ANTLR at Strumenta – we use it in all sorts of parsing projects on a daily basis and it always worked well for us!
The tree-like nature of parse trees makes it easy to navigate and manipulate the input. At some point, however, the purely syntactic nodes they contain become irrelevant. For example, parentheses are useful to determine the order of operations in the text and how the parse tree is organized; once the parse tree is created, they no longer provide useful information and can be removed. This is different from identifiers or literals, which remain valuable as they carry information through their content.
For this reason, additional transformation steps are implemented in parsers to extract what is called an Abstract Syntax Tree (AST). Going back to our JavaScript example, this is a possible abstract syntax tree representation where purely syntactic keywords and punctuation have been removed.
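As an illustrative sketch (with toy node shapes, not the real JavaScript grammar), here is how pruning purely syntactic children turns a parse tree into an AST:

```python
# toy parse tree for "(1 + 2) * 3", with the grouping parentheses
# represented as explicit syntactic nodes
parse_tree = ("mul", ("group", "(", ("add", "1", "+", "2"), ")"), "*", "3")

def to_ast(node):
    """Unwrap grouping nodes and drop operator/parenthesis tokens."""
    if isinstance(node, str):
        return node
    kind, *children = node
    if kind == "group":
        # "(" and ")" carry no information once the tree is built
        return to_ast(children[1])
    if kind in ("add", "mul"):
        left, _operator, right = children
        return (kind, to_ast(left), to_ast(right))
    return node

print(to_ast(parse_tree))  # -> ('mul', ('add', '1', '2'), '3')
```

Note how the grouping still shaped the tree – the addition ended up below the multiplication – so no information is lost by dropping the parentheses themselves.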
The Slang Language
Slang is a toy programming language showcasing some typical language constructs found in both imperative and functional languages. As such, feel free to interpret each construct with your preferred semantics; the following discussion will focus only on the syntactic aspects of the language.
Example – Fibonacci using Slang
The function computing the Nth number of the Fibonacci series seems to be a quite popular example to give a taste of the syntax of a programming language. We like the Fibonacci series, so here is a possible outstanding implementation using Slang:
```
function fibonacci(n) {
    if (n <= 1) return n;
    else return fibonacci(n - 1) + fibonacci(n - 2);
}
```
In the following paragraphs, we introduce the various language constructs for Slang. For each construct, we provide the corresponding ANTLR parser grammar rule. The complete ANTLR parser and lexer grammar specifications can be found in SlangParser.g4 and SlangLexer.g4 files on GitHub, respectively.
Workspaces
In Slang, each file contains a single workspace definition. This acts as a container for zero or more function definitions and standalone statements.
```antlr
workspace: functions+=function* statements+=statement* ;
```
Functions
Function definitions consist of a name, zero or more parameters and zero or more statements. Parameters are identified by their name and do not include type constraints or specifications.
```antlr
function: FUNCTION name=NAME
          LPAREN (parameters+=NAME (parameters+=NAME)*)? RPAREN
          LBRACE statements+=statement* RBRACE ;
```
Statements
Slang provides support for the following kinds of statement:
- Return Statement – can be used to return some value, specified through an expression, from a function;
- Print Statement – can be used to display the representation of a value somewhere, e.g. standard output;
- Conditional (aka If-Then-Else) Statement – can be used to execute different logic depending on some condition;
- Binding Statement – can be used to associate names to values;
- Expression Statement – can be used to wrap expressions as statements;
Each statement, except for conditionals having multiple statements in their body, ends with a semicolon. Curly brackets are used to wrap multiple statements in conditionals and can be omitted otherwise.
```antlr
statement
    : RETURN value=expression COLON                       #returnStatement
    | PRINT argument=expression COLON                     #printStatement
    | IF LPAREN condition=expression RPAREN
      (LBRACE positive+=statement* RBRACE | positive+=statement)
      (ELSE (LBRACE negative_branch+=statement* RBRACE
             | negative_branch+=statement))?              #conditionalStatement
    | name=NAME BND value=expression COLON                #bindingStatement
    | expression COLON                                    #expressionStatement
    ;
```
Expressions
Slang provides support for the following kinds of expressions:
- Grouping Expression – can be used to override the ordinary operator evaluation order using parentheses;
- Unary Operation Expression – can be used to apply both logical and arithmetic negation, as well as arithmetic identity;
- Binary Operation Expression – can be used to perform arithmetic calculations and comparisons;
- Invocation Expression – can be used to invoke functions with expression arguments;
- Reference Expression – can be used to reference the value of parameters and variables;
- Literal Expression – can be used to define constant integer values;
```antlr
expression
    : LPAREN expression RPAREN                                   #groupingExpression
    | operator=(NOT|ADD|SUB) operand=expression                  #unaryOperationExpression
    | left=expression operator=(MUL|DIV) right=expression        #binaryOperationExpression
    | left=expression operator=(ADD|SUB) right=expression        #binaryOperationExpression
    | left=expression operator=(LT|GT|LTQ|GTQ) right=expression  #binaryOperationExpression
    | left=expression operator=(EQ|NQ) right=expression          #binaryOperationExpression
    | target=NAME LPAREN
      (arguments+=expression (COMMA arguments+=expression)*)?
      RPAREN                                                     #invocationExpression
    | target=NAME                                                #referenceExpression
    | value=NUMBER                                               #literalExpression
    ;
```
Generating Parsers with ANTLR
In order to generate a parser from a grammar specification using ANTLR, a couple of tools need to be installed first. Generally, ANTLR is used by downloading and running its JAR – hence requiring Java to be installed on our machine. However, there is a simpler solution for us – antlr4-tools.
Using this, the only requirement is Python, which typically comes preinstalled on most operating systems. The tool creates two executables to generate and invoke parsers – respectively antlr4 and antlr4-parse. The first time either of these commands is invoked, the installation of both Java and ANTLR will be handled silently.
Check out the antlr4-tools GitHub repository for detailed installation instructions.
Let’s get back to our example! We are using the PDM package and dependency manager, hence the following command is executed to install antlr4-tools:
```shell
pdm add -dG antlr antlr4-tools
```
Given our SlangParser.g4 and SlangLexer.g4 grammar files, we can then generate an ANTLR 4.9.3 parser targeting Python 3 with the following command:
```shell
antlr4 -v 4.9.3 -Dlanguage=Python3 -no-visitor -no-listener SlangLexer.g4 SlangParser.g4
```
The process will generate our parser in SlangParser.py, while the SlangLexer.py file will contain the lexer. The -no-visitor and -no-listener options specify that we are not interested in generating ANTLR4 base visitors and listeners in this case.
Check out our ANTLR Mega Tutorial for further details and start your path towards becoming an ANTLR ninja!
In order to rapidly test out the generated parser, we can use the antlr4-parse command as follows:
```shell
# parse and print out the tokens
antlr4-parse SlangParser.g4 SlangLexer.g4 workspace <file_path> -tokens

# parse and print out the parse tree
antlr4-parse SlangParser.g4 SlangLexer.g4 workspace <file_path> -tree
```
Otherwise, we could also install the ANTLR Python runtime and programmatically invoke the parser as follows:
```python
# Install antlr4-python-runtime with: pdm add antlr4-python-runtime
from antlr4 import CommonTokenStream, InputStream

from slang.parser.antlr.SlangLexer import SlangLexer
from slang.parser.antlr.SlangParser import SlangParser

# the code that we want to parse
code = "...slang code..."

# tokenize the input using the lexer
lexer = SlangLexer(InputStream(code))  # or FileStream(absolute_path) for files

# parse the token stream using the parser
parser = SlangParser(CommonTokenStream(lexer))

# retrieve the parse tree root
parse_tree = parser.workspace()

# print out a formatted representation of the parse tree
print(parse_tree.toStringTree())
```
Printing out the parse tree for our fibonacci function will produce the following output – don’t worry, we will make it more readable soon:
```text
(workspace:1 (function:1 function fibonacci ( n ) { (statement:3 if ( (expression:5 (expression:8 n) <= (expression:9 1)) ) (statement:1 return (expression:8 n) ;) else (statement:1 return (expression:4 (expression:7 fibonacci ( (expression:4 (expression:8 n) - (expression:9 1)) )) + (expression:7 fibonacci ( (expression:4 (expression:8 n) - (expression:9 2)) ))) ;)) }))
```
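As a convenience, a few lines of plain Python are enough to re-indent that one-line output. This helper is our own sketch, not part of ANTLR or Pylasu:

```python
def indent_tree(text: str, step: str = "  ") -> str:
    """Re-print a one-line '(rule child ...)' string with indentation."""
    lines, depth, token = [], 0, ""
    for char in text:
        if char in "()":
            if token.strip():
                # flush whatever accumulated between the parentheses
                lines.append(step * depth + token.strip())
                token = ""
            if char == "(":
                lines.append(step * depth + "(")
                depth += 1
            else:
                depth -= 1
                lines.append(step * depth + ")")
        else:
            token += char
    return "\n".join(lines)

print(indent_tree("(workspace:1 (function:1 function fibonacci))"))
```

Each nested rule match ends up on its own indented line, which makes larger trees much easier to inspect.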
Defining ASTs with Pylasu
Until now, we defined an ANTLR grammar specification for our language, generated a parser from it and saw how to print out the parse tree for a given file or string. Let us now introduce an abstract syntax tree using Pylasu.
Pylasu is an open-source library supporting the StarLasu methodology in Python. Other equivalent libraries exist to enable support in Kotlin (and the JVM), TypeScript and C#.
Abstract syntax tree nodes can be defined through classes extending the Node base class in Pylasu. Nodes come with various built-in features supporting the implementation of recurring tasks when implementing language processing applications:
- Position – each node keeps track of its corresponding position in the source text in terms of line and column number;
- Traversal API – various functions can be used to navigate nodes, their children and properties;
- Transformation API – dedicated classes and functions can be used to transform nodes into other nodes or text;
- Origin and Destination – when involved in a series of transformations, each node keeps track of its origin and destination, that is, the nodes it originates from and those produced from it;
In addition to these core features, StarLasu libraries provide support for advanced concepts, such as symbol resolution, type computation, serialization and testing. Check out the documentation for further details!
Going back to our language, the following snippet shows how the node representing function definitions can be defined using a Python dataclass:
```python
from dataclasses import dataclass, field
from typing import List

from pylasu.model import Node

@dataclass
class Function(Node):
    name: str = field(default_factory=str)
    parameters: List[str] = field(default_factory=list)
    statements: List['Statement'] = field(default_factory=list)
```
That’s it, really! Extending the Node base class will integrate our node specification into the Pylasu infrastructure. Therefore, we will be able to navigate through its children, access its corresponding position in the source code (if coming from a textual source) and so on.
Node classes can also have no properties at all, or extend one another. For example, we decided to define a dedicated AST node for each unary operation. In the parse tree, instead, these are all represented using a single UnaryOperationExpression node. The following snippet illustrates how unary operation expression nodes can be modeled for our purpose:
```python
from dataclasses import dataclass, field
from typing import Optional

from pylasu.model import Node

@dataclass
class Expression(Node):
    '''AST node representing an expression'''
    pass

@dataclass
class UnaryOperation(Expression):
    '''AST node representing a unary operation expression'''
    operand: Optional[Expression] = field(default=None)

@dataclass
class Not(UnaryOperation):
    '''AST node representing logical negation operations'''
    pass

@dataclass
class Minus(UnaryOperation):
    '''AST node representing arithmetic negation operations'''
    pass

@dataclass
class Plus(UnaryOperation):
    '''AST node representing the arithmetic identity operation'''
    pass
```
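To see why dedicated subclasses pay off downstream, here is a dependency-free sketch: plain dataclasses stand in for the Pylasu nodes above (the Node base class is omitted, and the Literal node is hypothetical), and a later phase dispatches on node classes instead of inspecting operator strings.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Expression:
    pass

@dataclass
class Literal(Expression):
    value: int = 0

@dataclass
class UnaryOperation(Expression):
    operand: Optional[Expression] = field(default=None)

@dataclass
class Not(UnaryOperation):
    pass

@dataclass
class Minus(UnaryOperation):
    pass

def evaluate(node: Expression) -> int:
    # dispatch on the node class instead of matching operator strings
    if isinstance(node, Literal):
        return node.value
    if isinstance(node, Minus):
        return -evaluate(node.operand)
    if isinstance(node, Not):
        return 0 if evaluate(node.operand) else 1
    raise TypeError(f"unsupported node: {node}")

print(evaluate(Minus(operand=Literal(value=3))))  # -> -3
```

Every phase after the transformation – interpretation, validation, code generation – benefits from this kind of class-based dispatch.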
The rest of the AST node definitions are omitted for brevity; you can find them inside the nodes.py module.
From ANTLR Parse Trees to Pylasu ASTs
We are now ready to transform our Parse Tree into an AST.
In Pylasu, transformations can be implemented by creating and configuring ASTTransformer instances. In the specific case of a transformation mapping ANTLR Parse Trees to ASTs, the dedicated ParseTreeToASTTransformer class can be used. A transformer instance can be created as follows:
```python
from pylasu.mapping.parse_tree_to_ast_transformer import (
    ParseTreeToASTTransformer
)

transformer = ParseTreeToASTTransformer(allow_generic_node=False, issues=[])
```
The idea behind transformers is to support a mostly declarative specification of transformation rules using NodeFactories, while taking care of all the details concerning node positions – that is, the line and column numbers at which nodes appear in the text.
There are various ways a NodeFactory can be defined; the simplest consists of specifying the source and target node types, along with their children – used to guide the transformation. For example, to map a WorkspaceContext parse tree node into our Workspace AST node, the following factory can be registered:
```python
from pylasu.transformation import PropertyRef
from slang.ast.nodes import Workspace
from slang.parser.antlr.SlangParser import SlangParser as _

# workspace (constructor-node-factory-with-child)
(
    transformer.register_node_factory(_.WorkspaceContext, Workspace)
    .with_child(PropertyRef('functions'), PropertyRef('functions'))
    .with_child(PropertyRef('statements'), PropertyRef('statements'))
)
```
In short, we are instructing the transformer to create a Workspace instance whenever a WorkspaceContext is encountered in the parse tree, and populate the functions and statements properties by mapping each element with the appropriate factory – the rules mapping FunctionContext with our Function node, for example. PropertyRef is a Pylasu class used to represent references to node properties.
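The mechanism behind the scenes can be pictured with a dependency-free sketch – the class and method names below are our own invention, not the Pylasu API: a registry maps parse-tree node types to factory callables, and transforming a node is a lookup plus a call.

```python
class MiniTransformer:
    """Dependency-free sketch of the factory-registry idea."""

    def __init__(self):
        self.factories = {}

    def register(self, source_type, factory):
        # associate a parse-tree node class with an AST node factory
        self.factories[source_type] = factory

    def transform(self, source):
        # look up the factory for the source node's class, if any
        factory = self.factories.get(type(source))
        return factory(source, self) if factory else None

# toy parse-tree and AST node classes (hypothetical names)
class NumberContext:
    def __init__(self, text):
        self.text = text

class NumberNode:
    def __init__(self, value):
        self.value = value

mini_transformer = MiniTransformer()
mini_transformer.register(
    NumberContext, lambda source, t: NumberNode(int(source.text))
)
print(mini_transformer.transform(NumberContext("42")).value)  # -> 42
```

The real ParseTreeToASTTransformer adds much more on top – child mapping, position tracking, issue collection – but the core dispatch works along these lines.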
Sometimes, however, we might need a bit more freedom. Passing a reference to the class constructor – e.g. Workspace – requires it to be parameter-less. Other classes could have mandatory constructor parameters; therefore, support is also provided for more complex kinds of factories – any Callable producing the right AST node instance.
Let’s use lambdas, for example, to define our node factory mapping BindingContext parse tree nodes into Binding AST nodes, as follows:
```python
from pylasu.transformation import PropertyRef
from slang.ast.nodes import Binding
from slang.parser.antlr.SlangParser import SlangParser as _

# statement - binding
transformer.register_node_factory(
    _.BindingStatementContext,
    lambda source: Binding(name=source.name.text)  # handle 'name' property
).with_child(PropertyRef('value'), PropertyRef('value'))
```
As you can see, we still use children references to map the value property, but the name property is manually handled. In particular, we are setting it using the text property of ANTLR terminal nodes – the NAME token in this case.
Defining lambdas for complex logic is cumbersome, often not even possible, and generally discouraged in Python. In these cases, we can use references to user-defined functions – defs.
As mentioned in the previous section, while unary operation expressions are represented using a single kind of parse tree node, we decided to define dedicated nodes in our AST. Therefore, we need some slightly more sophisticated logic to map UnaryOperationExpressionContext parse tree nodes into UnaryOperation AST node subclasses, as follows:
```python
from pylasu.parsing.parse_tree import to_position
from pylasu.validation import Issue, IssueSeverity, IssueType
from slang.ast.nodes import Minus, Not, Plus
from slang.parser.antlr.SlangParser import SlangParser as _

# node factory from user-defined function
def unary_operation_node_factory(source: _.UnaryOperationExpressionContext):
    # common builder function for unary operations
    def build_unary_operation(unary_operation_constructor):
        unary_operation = unary_operation_constructor()
        # manually invoke the transformer to transform the 'operand' property
        unary_operation.operand = transformer.transform(source.operand)
        return unary_operation
    # branch depending on the operator
    if source.NOT():
        # create a 'Not' AST node instance
        return build_unary_operation(Not)
    elif source.ADD():
        # create a 'Plus' AST node instance
        return build_unary_operation(Plus)
    elif source.SUB():
        # create a 'Minus' AST node instance
        return build_unary_operation(Minus)
    else:
        # oops...
        issues.append(
            Issue(
                IssueType.SYNTACTIC,
                f"Unsupported unary operation: {source.operator.text}",
                IssueSeverity.ERROR,
                to_position(source)
            )
        )
        return None

# register the node factory
transformer.register_node_factory(
    _.UnaryOperationExpressionContext,
    unary_operation_node_factory
)
```
See the oops there? This example also illustrates how issues can be added during the transformation. In this case, we are creating a syntactic error issue in case an unexpected unary operator is found. This should not happen given that the ANTLR parser would not recognize other unary operators…but better be safe than surprised.
All the other mappings fall into one of these categories and are omitted in this article for brevity. Please check the complete source code in the transformations.py module on GitHub.
‘Packaging’ our Parser
As mentioned at the beginning of this article, our objective is to implement a parser and integrate it into a CLI allowing users to parse strings and files.
Before doing so, it is convenient to wrap our parser capabilities into a well-defined interface. More specifically, we expose two functions from a main.py module, namely parse_string and parse_file:
```python
from antlr4 import FileStream, InputStream
from pylasu.validation import Result

def parse_string(code: str) -> Result:
    return parse_input_stream(InputStream(code))

def parse_file(filename: str) -> Result:
    return parse_input_stream(FileStream(filename))
```
Both functions wrap their input into an ANTLR input stream and delegate the actual work to another function named parse_input_stream. The result is a Result instance, a utility class provided by Pylasu to represent parsing results. More specifically, it keeps a list of the issues encountered during the parsing process and a reference to the root of the AST.
The issues list can be populated with errors raised by the internal ANTLR parser or with user-defined ones that emerged during subsequent phases, e.g. as seen with our unary operation node factory in the previous section.
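Conceptually, Result can be pictured like the following stand-in – a simplified sketch of ours, not the actual Pylasu class:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class MiniResult:
    """Simplified stand-in pairing the AST root with collected issues."""
    root: Any = None
    issues: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # a parse is clean when no issues were collected along the way
        return not self.issues

result = MiniResult(root="<ast-root>", issues=[])
print(result.ok)  # -> True
```

Pairing the tree with its issues lets callers decide how strict to be: fail on any issue, or keep working with a partial AST.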
What is happening inside the parse_input_stream function? Well, we put all the previous steps together and build a Result instance out of it, as follows:
```python
from typing import List

from antlr4 import InputStream
from pylasu.validation import Issue, Result

def parse_input_stream(input_stream: InputStream) -> Result:
    # keep track of all issues
    issues: List[Issue] = []
    # 1. build the ANTLR Parse Tree
    parse_tree = _build_slang_parse_tree(input_stream, issues)
    # 2. transform the ANTLR Parse Tree into the Pylasu AST
    abstract_syntax_tree = _build_slang_abstract_syntax_tree(parse_tree, issues)
    # 3. build a Result instance and return it back
    return _build_pylasu_result(abstract_syntax_tree, issues)
```
The complete source code is omitted for brevity; please check out the main.py module on GitHub for further details.
A Pretty-Cool Command-Line Interface
We are finally ready to implement our PC-CLI!
For this, we are going to cheat a bit and delegate the ‘make-it-cool’ task to Typer, a Python package supporting the creation of command-line interfaces with built-in auto-completion, argument validation and so on.
Sort of a version of Click on steroids – psss…that is actually what they are using inside! Check out their documentation for a complete overview of the supported features and feel free to make our CLI even cooler.
We can install the package as follows (if you are using our project setup):
```shell
pdm add "typer[all]"
```
Once the installation is complete, setting up a series of commands is pretty straightforward. In our case, we want to structure our interface so that users can run the following commands:
```shell
slang parse string "<slang_code>"  # parse slang code from a string
slang parse file <file_path>       # parse slang code from a file
```
In order to do so, we define the following decorated functions:
```python
from pathlib import Path

from typer import Argument, Typer
from typing_extensions import Annotated

from slang.parser import parse_file, parse_string

# initialize a Typer application
app = Typer()

# add the command group 'parse'
parser_app = Typer()
app.add_typer(parser_app, name="parse",
              help="Parse slang code from strings or files.")

# add command 'string' to the 'parse' group
@parser_app.command("string", help="Parse slang code from a string.")
@track_progress("Parsing")
def from_string(code: Annotated[str, Argument(help="...")]):
    return build_report(parse_string(code))

# add command 'file' to the 'parse' group
@parser_app.command("file", help="Parse slang code from a file.")
@track_progress("Parsing")
def from_file(path: Annotated[Path, Argument(help="...")]):
    return build_report(parse_file(str(path.absolute())))
```
The track_progress decorator function keeps the user entertained while the parsing process is running, i.e. it shows a progress bar. Instead, the build_report function just serializes our AST into JSON. Their complete specification is omitted for brevity, please check out their source code on GitHub.
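Serializing a dataclass-based AST to JSON is almost free thanks to dataclasses.asdict; the following sketch (with simplified stand-in nodes, not the real build_report or the real Slang AST classes) shows the idea:

```python
import json
from dataclasses import asdict, dataclass, field

# simplified stand-in nodes (hypothetical, for illustration only)
@dataclass
class Reference:
    target: str = ""

@dataclass
class Binding:
    name: str = ""
    value: Reference = field(default_factory=Reference)

# asdict recursively converts nested dataclasses into plain dictionaries,
# which json.dumps can then serialize directly
binding = Binding(name="x", value=Reference(target="y"))
print(json.dumps(asdict(binding)))  # -> {"name": "x", "value": {"target": "y"}}
```

The real build_report also needs to handle fields that are not plain data – positions, parent references and so on – but the nested-dictionary approach is the core of it.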
There are various ways to expose scripts from a Python package; you can find the complete configuration inside our pyproject.toml file. In our case, we just added a script in the tool.pdm.scripts configuration to be able to execute commands like pdm run slang, as follows:
```toml
[tool.pdm.scripts]
slang = { call = "slang.cli:app" }
```
In order to install the package and be able to run it using python -m slang, we can just add a __main__.py file into our package, create an instance of our command-line interface and change its default name, as follows:
```python
from .cli import app

app(prog_name='slang')
```
One last thing: update the pyproject.toml file with a project.scripts configuration as follows to be able to install the package and use it as an ordinary prompt command:
```toml
[project.scripts]
slang = "slang.cli:app"
```
Now, we are finally able to play with our CLI. Here are some execution examples, feel free to run these by yourself and have fun!
Asking for help
Help…again
Parsing strings
The full output is omitted here, run it to see the entire AST.
…and parsing files
In this case, we parsed our fibonacci function and successfully got back the corresponding AST with no issues, yay!
Conclusions
In this article, we implemented a parser for the Slang toy programming language in Python using Pylasu and ANTLR.
First, we defined the ANTLR grammars and saw how antlr4-tools can be used to trigger the generation process. Then, we defined our AST using Pylasu and implemented a model-to-model transformation to create instances of it from ANTLR parse trees. Finally, we integrated our parser into a CLI providing users the possibility to parse Slang code from both strings and files.
Pss…check out the project repository to see how the whole ANTLR generation process can be integrated into the build workflow using Tox. This might be a topic for a future article, let us know what you think about it.
You did it, you reached the end of this article and we hope this was a fun and interesting drive! Get in touch with us on social media, through the project repository – or carrier pigeons if you like – for comments, feedback, everything really…and have a nice day!