Introduction

In this article, we will show you how to use Sharplasu to build advanced parsers, using the Python language as an example. We chose Python 3 as the language to be parsed so that developers can follow a real-world example they are familiar with – involving imports, function definitions, and parameters – and easily understand the structure produced by the parser. However, the same approach can be applied to any language that ANTLR can build parsers for.

Sharplasu is a library that enables developers to create advanced parsers that can be used for many applications. These parsers offer APIs to define and produce an Abstract Syntax Tree (AST), making it really easy to build many different kinds of applications.
Sharplasu is just one of the Starlasu libraries, each of which lets you write parsers in a different programming language – in Sharplasu’s case, C#.

Sharplasu uses ANTLR for the first stage of the parsing process, which produces a parse tree of the source code given as input. Then, developers using Sharplasu have access to convenient APIs to define the nodes of an AST, navigate through it, and work with it.

By reading this article, you will learn how to use Sharplasu to parse Python 3 files, and how to extract and work with the structure of the parsed source code to, e.g., analyse its content.

Set up a project with Sharplasu

The source code of the project presented in this article is available at the following GitHub repository: https://github.com/Strumenta/sharplasu-tutorial

The repository has a few tags marking different stages of this article.

First of all, we want to create a .NET solution composed of two projects: a library and a console project. The library will contain the parser itself, while we will use the console project to write a few simple unit tests that will help us explore the APIs that Sharplasu offers.

This point corresponds to the Git tag: project-creation

  • Includes the project structure as it is when newly created

Generating the first-stage parser with ANTLR

For this project we use an ANTLR grammar available at the following GitHub repository: https://github.com/antlr/grammars-v4/tree/master/python/python3/Python3

In this case, the grammar is composed of a Lexer and a Parser. Additionally, there are a couple of C# source files that take care of some particularities of the Python language that would be difficult to express in an ANTLR grammar. We do not need to read or understand them; we can just use them as black boxes.

You can download and use the grammar files as they are; while for the .cs files, you will need to add the Strumenta.Python3Parser namespace, which we are using in our project.

At this point, your IDE should give you some errors, e.g. in src/Python3LexerBase.cs

For instance, Python3Parser.DEDENT is not recognised. This issue will be fixed as soon as we invoke the ANTLR tool to generate the first-stage parser, as we will see in the following paragraphs.

Now we need the ANTLR complete tool to generate the parser C# classes Sharplasu will be using. You can download the ANTLR4 complete tool from here: https://www.antlr.org/download/antlr-4.12.0-complete.jar

You can then create a script that invokes ANTLR to generate the first-stage parser source code. For instance:

#!/bin/bash

ANTLR_JAR=antlr-4.12.0-complete.jar

java -jar "tools/$ANTLR_JAR" \
 -Dlanguage=CSharp \
 -package "Strumenta.Python3Parser" \
 -o "parser/src/generated" \
 -Xexact-output-dir \
 -no-listener \
 -no-visitor \
 "parser/src/antlr/Python3Lexer.g4" \
 "parser/src/antlr/Python3Parser.g4"

Once you run the ANTLR tool, the source code for the first-stage parser should be generated. Make sure the sources are in a folder that the compiler detects. At this point, the errors in src/Python3LexerBase.cs and src/Python3ParserBase.cs should disappear, and you should be able to build the solution successfully.

Now we have the first-stage parser.

This point corresponds to the Git tag: antlr-configuration

  • Contains the properly configured .g4, .cs and .sh files

Setting up our Sharplasu Python parser implementation

At this stage, we have the ANTLR tool generating the first-stage parser in place. We can now begin using Sharplasu APIs to invoke the first-stage parsing and return the parse tree.

However, in order to compile, we need to define a few types needed by our Sharplasu parser. Let’s start with the AST nodes.

We can define our first AST node that represents the structure of a whole Python file. We can call it CompilationUnit. A Python file is composed of a list of statements. For now, let’s represent these statements as strings, but we will define AST nodes for them as well later.

Create a file: parser/src/Ast.cs

using Strumenta.Sharplasu.Model;

namespace Strumenta.Python3Parser;

public class CompilationUnit : Node
{
   public List<string> Statements { get; set; }
}

Now we need to define a class implementing a Sharplasu parser using the AST node that we just defined. We can create a Python3SharplasuParser class extending the SharpLasuParser type, specifying, respectively, the types of: the root AST node, the first-stage parser, and the root parse tree node. The class should look as follows:

using Antlr4.Runtime;
using Strumenta.Sharplasu.Parsing;
using Strumenta.Sharplasu.Validation;

namespace Strumenta.Python3Parser;

public class Python3SharplasuParser : SharpLasuParser<CompilationUnit, Python3Parser, Python3Parser.File_inputContext>
{
   public override Lexer InstantiateLexer(ICharStream charStream)
   {
       return new Python3Lexer(charStream);
   }

   public override Python3Parser InstantiateParser(ITokenStream tokenStream)
   {
       return new Python3Parser(tokenStream);
   }

   protected override CompilationUnit ParseTreeToAst(Python3Parser.File_inputContext parseTreeRoot, bool considerPosition = true, List<Issue> issues = null)
   {
       throw new NotImplementedException();
   }

   protected override Python3Parser.File_inputContext InvokeRootRule(Python3Parser parser)
   {
       return parser.file_input();
   }
}

The InstantiateLexer and InstantiateParser methods instantiate the lexer and parser generated by ANTLR. The ParseTreeToAst method is, as the name suggests, where Sharplasu converts the parse tree into the AST we define. Notice that this method returns the root node of the AST, which we defined to be CompilationUnit.

Taking a look at the parse tree

In order to test our Python parser, we need of course some Python source code. For this purpose, we can download the following Python file: https://raw.githubusercontent.com/Mindwerks/worldengine/master/worldengine/plates.py

And place it in test/data/plates.py

In order for our application to be able to read this Python file when executing the test project, add the following configuration to test.csproj:

<ItemGroup>
 <None Include="data/**" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>

Let’s finally see how to invoke the Python parser and get our AST. Create test/ParserTest.cs and copy the following code:

using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace Strumenta.Python3Parser.Test;

[TestClass]
public class ParserTest
{
   private string GetExamplePythonFileContent()
   {
       return File.ReadAllText(@"data/plates.py");
   }

   [TestMethod]
   public void TestFirstStageParsing()
   {
       var parser = new Python3SharplasuParser();
       var result = parser.ParseFirstStage(GetExamplePythonFileContent());
       Assert.IsTrue(result.Correct);
   }
}

Invoking the first-stage parsing is as simple as calling the ParseFirstStage method on a Python source code string. We get back a FirstStageParsingResult instance containing the parse tree and other information, such as any errors. If the parsing produces no errors, the Correct flag will be set to true. Try running the unit test and verify that the parsing completes correctly by looking at this Correct flag.
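Conceptually, the relationship between the collected issues and the Correct flag can be pictured as follows (an illustrative Python sketch, not the actual Sharplasu types):

```python
class ParsingResult:
    """Illustrative stand-in for a first-stage parsing result."""
    def __init__(self, root, issues=None):
        self.root = root            # the parse tree root, if any
        self.issues = issues or []  # errors collected while parsing

    @property
    def correct(self):
        # The result is "correct" only when no issues were collected.
        return len(self.issues) == 0

assert ParsingResult(root="tree").correct
assert not ParsingResult(root=None, issues=["syntax error at line 3"]).correct
```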

This point corresponds to the Git tag: parser-setup

Initial version

We can start by revising the dummy implementation we made for the CompilationUnit AST node. Let’s keep it easy and add pieces step by step. First of all, let’s see how to link AST nodes – that is, how to define what children a node type can have. We redefine the Statements property as a list of nodes of type Statement, meaning that a compilation unit node can have several statement nodes as its children. Therefore, we need to define this Statement type.

Being familiar with programming languages, we know we can have many different types of statements, therefore we can model our Statement AST node as an abstract class, which other specific implementations will extend.

For now let’s keep it simple and let’s just define a dummy statement implementation that is intended to capture the content of any statement as a string. This is intended just to test our parser step by step.

using Strumenta.Sharplasu.Model;

namespace Strumenta.Python3Parser;

public class CompilationUnit : Node
{
   public List<Statement> Statements { get; set; }
}

public abstract class Statement : Node {}

public class DummyStatement : Statement
{
   public string Content { get; set; }
}

Now that we defined our first AST nodes, we need them to be produced by the parse tree nodes generated by ANTLR. Our approach is to write an extension method for each type of parse tree node. In this tutorial, we will only cover a few nodes, as they are found in the downloaded example Python file. More specifically, we will only support the import statements, the function definitions and their parameter declarations. We will not cover the body of the functions and the instructions contained inside.

First of all, let’s create a new file parser/src/ParseTreeToAST.cs 

using Strumenta.Python3Parser;
using Strumenta.Sharplasu.Validation;
using static Strumenta.Python3Parser.Python3Parser;

namespace ExtensionMethods;

public static class ParseTreeToASTExtensions
{

   public static CompilationUnit ToAst(this File_inputContext context, List<Issue> issues)
   {
       return new CompilationUnit
       {
           Statements = context.stmt().Select(s => s.ToAst(issues)).ToList()
       };
   }

   public static Statement ToAst(this StmtContext context, List<Issue> issues)
   {
       return new DummyStatement
       {
           Content = context.GetText()
       };
   }
}

The two ToAst methods are extension methods for: File_inputContext, which is the node produced by our starting point in the ANTLR grammar; and StmtContext, which is the type of node that represents a generic statement.

What we are defining here are two parse tree-AST conversions:

  • file_input -> CompilationUnit
  • stmt -> Statement

Since in the ANTLR grammar, file_input can contain multiple stmt nodes, we modeled CompilationUnit accordingly to contain a list of Statement nodes. To produce these Statement nodes from a file_input node, we map each statement parse tree node to its respective AST node by calling its ToAst extension method. The extension method for StmtContext (the ANTLR type representing a stmt node) will for now just return a DummyStatement containing whatever text ANTLR captured in that subtree.

So, the conversion of file_input returns a CompilationUnit; this CompilationUnit contains one Statement node for each statement that file_input contains. Now we need to hook this file_input -> CompilationUnit conversion into the parser. We do so by overriding the Sharplasu method ParseTreeToAst. Let’s modify parser/src/Python3SharplasuParser.cs:

using ExtensionMethods;

[...]

public class Python3SharplasuParser : SharpLasuParser<CompilationUnit, Python3Parser, Python3Parser.File_inputContext>
{
[...]
   protected override CompilationUnit ParseTreeToAst(Python3Parser.File_inputContext parseTreeRoot, bool considerPosition = true, List<Issue> issues = null)
   {
       var cu = parseTreeRoot.ToAst(issues);
       cu.AssignParents();

       return cu;
   }
[...]
}

Since we decided to use extension methods to perform the parse tree-AST transformation, all we need to do is invoke the ToAst method on the root of the parse tree.

Then we call the Sharplasu method AssignParents, which takes care of setting the parent reference of each AST node.
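Conceptually, assigning parents is a simple top-down traversal that points every child back at its parent. A minimal illustrative sketch (Python; not the actual Sharplasu implementation):

```python
class Node:
    """Illustrative AST node: a list of children and a parent reference."""
    def __init__(self, children=None):
        self.children = children or []
        self.parent = None

def assign_parents(node):
    # Top-down traversal: each child gets a reference to its parent.
    for child in node.children:
        child.parent = node
        assign_parents(child)

leaf = Node()
root = Node(children=[Node(children=[leaf])])
assign_parents(root)
assert leaf.parent.parent is root  # the leaf's grandparent is the root
```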

At this point, we can add a unit test in ParserTest.cs and verify everything is working as expected:

[TestMethod]
public void TestParsing()
{
   var parser = new Python3SharplasuParser();
   var result = parser.GetTreeForText(GetExamplePythonFileContent());
   Assert.IsTrue(result.Correct);

   // We expect 8 statements (the top level statements in the file)
   Assert.AreEqual(8, result.Root.Statements.Count);

   // We expect the Compilation Unit to have 8 children, exactly 1 for each Statement
   Assert.AreEqual(8, result.Root.Children.Count);
}

To get more familiar with the result, we can print each statement’s content, which will be a string very similar to the corresponding one in the Python file:

// Print the statements
result.Root.Statements
   .ForEach(statement => Console.WriteLine(((DummyStatement) statement).Content));

This will print the textual content of each top-level statement.

This point corresponds to the Git tag: testing-parsetree-ast

Expanding the AST

In our Python file, the top level statements are either import statements or function definitions. Let’s try to find their corresponding rules in the ANTLR grammar.

For the imports we can refer to the following rules:

import_stmt: import_name | import_from;

import_name: 'import' dotted_as_names;

// note below: the ('.' | '...') is necessary because '...' is tokenized as ELLIPSIS
import_from: ('from' (('.' | '...')* dotted_name | ('.' | '...')+)
             'import' ('*' | '(' import_as_names ')' | import_as_names));

import_as_name: name ('as' name)?;

dotted_as_name: dotted_name ('as' name)?;

import_as_names: import_as_name (',' import_as_name)* ','?;

dotted_as_names: dotted_as_name (',' dotted_as_name)*;

dotted_name: name ('.' name)*;

For the function definitions:

funcdef: 'def' name parameters ('->' test)? ':' block;

In both cases we are looking at statements:

stmt: simple_stmts | compound_stmt;

simple_stmts: simple_stmt (';' simple_stmt)* ';'? NEWLINE;

simple_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
            import_stmt | global_stmt | nonlocal_stmt | assert_stmt);
[...]
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt | match_stmt;

We can try and expand our AST nodes according to what we just found.

Let’s begin with the import statements. In our Python example file, we have two cases:

import platec

and

from worldengine.model.world import World, Size, GenerationParameters

For simplicity, we can decide to model both of them as a single ImportStatement AST node with an optional From property. Let’s add to Ast.cs:

public class ImportStatement : Statement
{
   public string? From { get; set; }
   public string Name { get; set; }
}

As usual, our approach is to model our AST step by step, so for now we do not care about modelling From and Name as child nodes; capturing them as strings will suffice.

Let’s not forget to define the ToAst conversion methods responsible for converting the import nodes of the parse tree into the respective AST nodes. Let’s modify the ToAst method for StmtContext in ParseTreeToAST.cs as follows:

public static List<Statement> ToAst(this StmtContext context, List<Issue> issues)
{
   if (context.simple_stmts() != null && context.simple_stmts().simple_stmt() != null)
   {
       return context.simple_stmts().simple_stmt().Select(s => s.ToAst(issues)).ToList();
   }
   else if (context.compound_stmt() != null)
   {
       // TODO
       return new List<Statement>();
   }

   issues.Add(new Issue(
       IssueType.SYNTACTIC,
       $"Unexpected node: {context}",
       null,
       IssueSeverity.Error));

   return new List<Statement>();
}

Since stmt has two alternatives (simple_stmts and compound_stmt), we need to check which one is not null in order to understand which path the parser took when creating the parse tree. For now, we only implemented the simple_stmts() part, which is the one that leads to our import statements. As for the compound_stmt() part, we can just return an empty list of statements for now, to allow our parser to run without errors.

Then, we are finally using that issues argument we have been carrying along in all our ToAst method calls. This is just an example of how we can signal an error that occurred during parsing – in this case, a parse tree node we didn’t expect or forgot to implement – without breaking the execution of the parser, by returning an empty list of statements instead. By doing this, the users of our parser will receive a Result object containing the Issue we just created and will be able to handle it.

Now that we modified the return type of StmtContext.ToAst to a list, we need to adjust the corresponding call in File_inputContext.ToAst accordingly:

return new CompilationUnit
{
   Statements = context.stmt().SelectMany(s => s.ToAst(issues)).ToList()
};

Then we need to implement the ToAst method for Simple_stmtContext:

public static Statement ToAst(this Simple_stmtContext context, List<Issue> issues)
{
   if (context.import_stmt() != null)
   {
       return context.import_stmt().ToAst(issues);
   }

   throw new NotImplementedException($"{context}");
}

As usual, we check whether the rule we are interested in – in this case, import_stmt – is not null, and proceed to call the ToAst method for that rule, a method that we also need to implement.

public static ImportStatement ToAst(this Import_stmtContext context, List<Issue> issues)
{
    if (context.import_name() != null)
    {
        return new ImportStatement
        {
            Name = context.import_name().dotted_as_names().GetText()
        };
    }
    else if (context.import_from() != null)
    {
        return new ImportStatement
        {
            From = context.import_from().dotted_name().GetText(),
            Name = context.import_from().import_as_names().GetText()
        };
    }

    throw new NotImplementedException($"{context}");
}

At this level of the grammar, once we have an Import_stmtContext node, we are interested in the import_name and the import_from rules, the only possible ones.

By looking at the grammar, we know the import_name rule is chosen when we have an import statement without the from part, while the import_from rule is chosen when the import statement does have a from clause. As usual, we can tell these two cases apart by null-checking the children of Import_stmtContext. In each case, we can create an instance of our ImportStatement AST node and set the value of the Name property accordingly, while only in the import_from case we need to set the value of the From property.

In both the last two ToAst methods, we decided to throw an exception. This serves as a guard in case the parse tree contains a node we either didn’t expect or forgot to implement the conversion for. This can be useful when implementing an AST with an incremental approach.

The complete file now looks like this:

using Antlr4.Runtime;
using Strumenta.Python3Parser;
using Strumenta.Sharplasu.Validation;
using static Strumenta.Python3Parser.Python3Parser;

namespace ExtensionMethods;

public static class ParseTreeToASTExtensions
{
   public static CompilationUnit ToAst(this File_inputContext context, List<Issue> issues)
   {
       return new CompilationUnit
       {
           Statements = context.stmt().SelectMany(s => s.ToAst(issues)).ToList()
       };
   }

   public static List<Statement> ToAst(this StmtContext context, List<Issue> issues)
   {
       if (context.simple_stmts() != null && context.simple_stmts().simple_stmt() != null)
       {
           return context.simple_stmts().simple_stmt().Select(s => s.ToAst(issues)).ToList();
       }
       else if (context.compound_stmt() != null)
       {
           // TODO
           return new List<Statement>();
       }

       issues.Add(new Issue(
           IssueType.SYNTACTIC,
           $"Unexpected node: {context}",
           null,
           IssueSeverity.Error));

       return new List<Statement>();
   }

   public static Statement ToAst(this Simple_stmtContext context, List<Issue> issues)
   {
       if (context.import_stmt() != null)
       {
           return context.import_stmt().ToAst(issues);
       }

       throw new NotImplementedException($"{context}");
   }

   public static ImportStatement ToAst(this Import_stmtContext context, List<Issue> issues)
   {
       if (context.import_name() != null)
       {
           return new ImportStatement
           {
               Name = context.import_name().dotted_as_names().GetText()
           };
       }
       else if (context.import_from() != null)
       {
           return new ImportStatement
           {
               From = context.import_from().dotted_name().GetText(),
               Name = context.import_from().import_as_names().GetText()
           };
       }

       throw new NotImplementedException($"{context}");
   }
}

So far, we have defined the following parse tree-to-AST node conversions:

  • file_input -> CompilationUnit
  • stmt -> List<Statement>
  • simple_stmt -> Statement
  • import_stmt -> ImportStatement

At this point, the project is buildable, but the unit test TestParsing will fail, saying it only found 5 statements instead of the expected 8. This is because we have not implemented the function definition nodes yet (these are the 3 remaining statements). Let’s do it now.

We saw before that funcdef is one of the compound_stmt alternatives; therefore, it makes sense to model our FunctionDefinition AST node as a Statement. As for its properties, it must certainly have a name, a list of parameters and a body. The name is simply a string, each parameter can be another AST node, and the body can be a list of statements. Therefore:

public class FunctionDefinition : Statement
{
   public string Name { get; set; }
   public List<ParameterDeclaration> Parameters { get; set; }
   public List<Statement> Body { get; set; }
}

public class ParameterDeclaration : Statement
{
   public string Name { get; set; }
   public Expression DefaultValue { get; set; }
}

As we see in our Python file, each parameter can have a default value. We can define it as a generic Expression node, and then extend it with more specialised nodes, one for each type:

public abstract class Expression : Node {}

public abstract class LiteralExpression : Expression
{
   public string Value { get; set; }
}

public class NumberLiteral : LiteralExpression {}
public class BooleanLiteral : LiteralExpression {}
public class StringLiteral : LiteralExpression {}

We identify that an expression can be a literal expression, and we start by defining three different types: numbers, booleans, and strings. One way to model these is to consider that they all have a string as a value – which is straightforward for string literals, but could seem odd for numbers and booleans. For these two, we still want to use a string in order to capture the actual text in the source code, and to be able to perform checks and analyses over it if needed.
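As a small illustration of why keeping the source text is useful, consider checking how a number was written; the function name below is made up for the example (Python sketch):

```python
# Literals keep the exact text found in the source code, so later analyses can
# still see *how* a value was written, not just what it evaluates to.
def uses_scientific_notation(literal_text):
    text = literal_text.lower()
    return not text.startswith("0x") and "e" in text

number_literals = ["0.1", "1e9", "0x10", "42"]
assert [t for t in number_literals if uses_scientific_notation(t)] == ["1e9"]
```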

Now we need to define the parse tree-AST conversion. We start by editing the ToAst method for the StmtContext:

else if (context.compound_stmt() != null)
{
   return new List<Statement> { context.compound_stmt().ToAst(issues) };
}

And implementing the remaining ToAst methods for the other nodes we are interested in:

public static Statement ToAst(this Compound_stmtContext context, List<Issue> issues)
{
   if (context.funcdef() != null)
   {
       return context.funcdef().ToAst(issues);
   }
   // TODO
   throw new NotImplementedException($"{context}");
}

Above, we follow our usual approach: null-check the child node deriving from the grammar rule we are interested in, then call the respective ToAst method, or raise an exception in the other cases, so that the process breaks and we can implement the missing parts as the parser finds them.

For creating an instance of our FunctionDefinition AST node, we can ignore the body for this tutorial. The name is trivial. The parameters part needs a bit more attention.

public static FunctionDefinition ToAst(this FuncdefContext context, List<Issue> issues)
{
   return new FunctionDefinition
   {
       Name = context.name().GetText(),
       Parameters = context.parameters().ToAst(issues),
       // TODO: Body = context.block().ToAst(issues)
   };
}

public static List<ParameterDeclaration> ToAst(this ParametersContext context, List<Issue> issues)
{
   return context.typedargslist() == null ? new List<ParameterDeclaration>() : context.typedargslist().ToAst(issues);
}

For the parameters, due to the way the grammar is designed, the parse tree node ANTLR gives us will have two arrays: tfpdef and test. We then need to figure out which test node (a default value) is associated with which tfpdef node (a parameter). Our approach is to sort both lists of nodes and derive the associations from each node’s position.

public static List<ParameterDeclaration> ToAst(this TypedargslistContext context, List<Issue> issues)
{
   var parameters = context.tfpdef().OrderBy(t => t.Position().Start).ToList();
   var defaultValues = context.test().OrderBy(v => v.Position().Start).ToList();
   var defaultValuesByParameter = new Dictionary<TfpdefContext, TestContext>();

   foreach (var defaultValue in defaultValues)
   {
        var parameter = parameters.LastOrDefault(p =>
            p.Position().End < defaultValue.Position().Start);

       if (parameter != null)
           defaultValuesByParameter.Add(parameter, defaultValue);
   }

   var parametersWithDefaultValues = parameters
       .Select(p => new KeyValuePair<TfpdefContext, TestContext>(
           p,
           defaultValuesByParameter.ContainsKey(p) ? defaultValuesByParameter[p] : null));

   return parametersWithDefaultValues
       .Select(pair => new ParameterDeclaration
       {
           Name = pair.Key.name().GetText(),
           DefaultValue = new StringLiteral { Value = pair.Value?.GetText() }
           // TODO: DefaultValue = pair.Value.ToAst(issues)
       })
       .ToList();
}

Since this is getting a bit complicated, let’s leave the implementation of the DefaultValue node conversion for later; for now, let’s just create a dummy StringLiteral with a string value and check that what we have built so far works.
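To sanity-check the association logic independently of ANTLR, here is a small self-contained sketch of the same idea, with made-up (start, end) offsets standing in for the contexts’ positions (illustrative only):

```python
def associate_defaults(parameters, defaults):
    """Pair each default value with the last parameter ending before it starts.

    Both arguments are lists of (text, start, end) tuples; positions are
    hypothetical character offsets standing in for ANTLR positions."""
    parameters = sorted(parameters, key=lambda p: p[1])
    defaults = sorted(defaults, key=lambda d: d[1])
    pairing = {}
    for text, start, _end in defaults:
        # All parameters that end before this default value begins...
        preceding = [p for p in parameters if p[2] < start]
        if preceding:
            # ...the nearest one is the parameter this default belongs to.
            pairing[preceding[-1][0]] = text
    return pairing

# def world_gen(seed, verbose=get_verbose()):  -- offsets below are made up
params = [("seed", 14, 18), ("verbose", 20, 27)]
defaults = [("get_verbose()", 28, 41)]
assert associate_defaults(params, defaults) == {"verbose": "get_verbose()"}
```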

Now the unit test should pass once again, and by debugging we can have a look at the AST.

Our AST correctly has a CompilationUnit root node, which contains 5 ImportStatement nodes and 3 FunctionDefinition nodes. By inspecting the FunctionDefinition nodes, we can also see their Parameters, and for each of them the ParameterDeclaration nodes; where present in the code, these have a DefaultValue, currently implemented as a dummy StringLiteral node. Our AST is beginning to have a structure!

Focusing on the following extract of Python code:

import platec
[...]
def world_gen([...], verbose=get_verbose()):
   [...]

We can inspect the corresponding AST in the debugger.

This point corresponds to the Git tag: ast-extension-1

We can now continue our implementation from where we left it: let’s return to the ToAst method for TypedargslistContext.

The function parameter declarations can have a default value. This value may or may not be present; therefore we first edit the declaration of parametersWithDefaultValues to make TestContext nullable:

var parametersWithDefaultValues = parameters
   .Select(p => new KeyValuePair<TfpdefContext, TestContext?>(
       p,
       defaultValuesByParameter.ContainsKey(p) ? defaultValuesByParameter[p] : null));

Then we initialise DefaultValue with a proper AST node (replacing the dummy StringLiteral we used before) when the Value (that is, the TestContext) of our key-value pair is present:

return parametersWithDefaultValues
   .Select(pair => new ParameterDeclaration
   {
       Name = pair.Key.name().GetText(),
        DefaultValue = pair.Value?.ToAst(issues)
   })
   .ToList();

The grammar rules for test and for all the AND, OR, etc. expressions are defined in an unintuitive way, essentially because they are meant to handle operator precedence. However, let’s focus on the parts we care about:

test: or_test ('if' or_test 'else' test)? | lambdef;

[...]

or_test: and_test ('or' and_test)*;

and_test: not_test ('and' not_test)*;

not_test: 'not' not_test | comparison;

comparison: expr (comp_op expr)*;

[...]

expr: xor_expr ('|' xor_expr)*;

xor_expr: and_expr ('^' and_expr)*;

and_expr: shift_expr ('&' shift_expr)*;

shift_expr: arith_expr (('<<'|'>>') arith_expr)*;

arith_expr: term (('+'|'-') term)*;

term: factor (('*'|'@'|'/'|'%'|'//') factor)*;

factor: ('+'|'-'|'~') factor | power;

power: atom_expr ('**' factor)?;

atom_expr: AWAIT? atom trailer*;

atom: '(' (yield_expr|testlist_comp)? ')'
  | '[' testlist_comp? ']'
  | '{' dictorsetmaker? '}'
  | name | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False' ;

Implementing the conversion will be a bit tedious and repetitive, as we need to go through all the precedence levels until we reach the rule for the actual “expression value”. However, we do not need to consider binary expressions or other cases for this tutorial.
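The pattern used at each precedence level is the same: when a node has exactly one child, simply forward the conversion to that child. As a language-agnostic illustration of this unwrapping (Python sketch; real nodes are ANTLR context objects):

```python
def unwrap(node):
    """Descend through precedence-level wrappers that have exactly one child."""
    while node["kind"] != "atom" and len(node["children"]) == 1:
        node = node["children"][0]
    return node

# test -> or_test -> atom, each level wrapping a single child
tree = {"kind": "test", "children": [
    {"kind": "or_test", "children": [
        {"kind": "atom", "value": "42", "children": []}]}]}
assert unwrap(tree)["value"] == "42"
```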

public static Expression ToAst(this TestContext context, List<Issue> issues)
{
   if (context.test() == null && context.or_test().Length == 1)
       return context.or_test()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this Or_testContext context, List<Issue> issues)
{
   if (context.and_test().Length == 1)
       return context.and_test()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this And_testContext context, List<Issue> issues)
{
   if (context.not_test().Length == 1)
       return context.not_test()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this Not_testContext context, List<Issue> issues)
{
   if (context.comparison() != null)
       return context.comparison().ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this ComparisonContext context, List<Issue> issues)
{
   if (context.expr().Length == 1)
       return context.expr()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this ExprContext context, List<Issue> issues)
{
   if (context.xor_expr().Length == 1)
       return context.xor_expr()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this Xor_exprContext context, List<Issue> issues)
{
   if (context.and_expr().Length == 1)
       return context.and_expr()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this And_exprContext context, List<Issue> issues)
{
   if (context.shift_expr().Length == 1)
       return context.shift_expr()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this Shift_exprContext context, List<Issue> issues)
{
   if (context.arith_expr().Length == 1)
       return context.arith_expr()[0].ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

public static Expression ToAst(this Arith_exprContext context, List<Issue> issues)
{
   if (context.term().Length == 1 &&
       context.term()[0].factor().Length == 1 &&
       context.term()[0].factor()[0].power() != null)
       return context.term()[0].factor()[0].power().atom_expr().atom().ToAst(issues);
   // TODO
   throw new NotImplementedException($"{context}");
}

Finally, we have arrived at the part of the grammar defining the actual literals we are interested in. For now, we can implement the three literals we defined AST nodes for:

public static Expression ToAst(this AtomContext context, List<Issue> issues)
{
   if (context.NUMBER() != null)
       return new NumberLiteral { Value = context.NUMBER().GetText() };

   if (context.STRING() != null && context.STRING().Length > 0)
       return new StringLiteral { Value = string.Join("", context.STRING().Select(s => s.GetText())) };

   if (context.TRUE() != null || context.FALSE() != null)
       return new BooleanLiteral { Value = context.TRUE() != null ? context.TRUE().GetText() : context.FALSE().GetText() };

   // TODO
   throw new NotImplementedException($"{context}");
}

Once again, let’s try to run the unit test to see if everything works so far… whoops! We get a NotImplementedException!

System.NotImplementedException: [1221 1213 1210 1200 1192 1184 1176 1168 1160 1133 1130 1120 1112 1083 326 313 304 689 492 260]

   at ExtensionMethods.ParseTreeToASTExtensions.ToAst(AtomContext context, List`1 issues) in [...]/ParseTreeToAST.cs:line 238

   at ExtensionMethods.ParseTreeToASTExtensions.ToAst(Arith_exprContext context, List`1 issues) in [...]/ParseTreeToAST.cs:line 222

In AtomContext.ToAst() we only implemented cases for context.NUMBER, context.STRING, context.TRUE and context.FALSE – respectively, number literals, string literals and boolean literals. However, as we can see by inspecting the context variable in the debugger, ANTLR built a parse tree with a different kind of node: we are in the atom -> name case. Let’s see what it is about:

atom: '(' (yield_expr|testlist_comp)? ')'
  | '[' testlist_comp? ']'
  | '{' dictorsetmaker? '}'
  | name | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False' ;

name : NAME | '_' | 'match' ;

As we mentioned before, the task of building an AST can be performed incrementally. This is exactly what happened here: we tried parsing a source file and found a case we have not covered yet, so we need to model it in our AST and implement the parse tree-AST conversion accordingly.

In Ast.cs let’s add the following:

public class ReferenceExpression : Expression
{
   public string Reference { get; set; }
}

Basically, what we want to do is support a reference to a variable. At a minimum, we need a string holding the name that references the variable: we implement that with the Reference property.

In ParseTreeToAST.cs, we add this case to our AtomContext.ToAst method:

if (context.name() != null)
   return new ReferenceExpression { Reference = context.name().GetText() };

Once again, by running a new parse, we will see yet another NotImplementedException. With the help of the debugger we can now identify the missing case:

atom -> '[' testlist_comp? ']'

[...]

testlist_comp: (test|star_expr) ( comp_for | (',' (test|star_expr))* ','? );

That is, an array value. We can find one in our Python source: it’s the default value of the temps parameter here:

def _plates_simulation(name, width, height, seed, temps=
                      [.874, .765, .594, .439, .366, .124], humids=
                      [.941, .778, .507, .236, 0.073, .014, .002], gamma_curve=1.25,
                      curve_offset=.2, num_plates=10, ocean_level=1.0,
                      step=Step.full(), verbose=get_verbose()):

At this point, it should be clear how to proceed. First, we add an AST node for the array literal:

public class ArrayLiteral : LiteralExpression
{
   public List<Expression> Elements { get; set; }
}

Secondly, the parse tree-AST conversion:

if (context.OPEN_BRACK() != null)
   return new ArrayLiteral
   {
       Value = context.GetText(),
       Elements = context.testlist_comp().test()
           .Select(t => t.ToAst(issues)).ToList()
   };

As usual, we start by considering only the easiest part of the grammar, the case where testlist_comp is just a list of test. Notice that we already implemented the ToAst method for TestContext, so we just need to call it over each test node of this subtree.

This time our unit test will run successfully, meaning both the ANTLR parsing and our parse tree-AST conversion went smoothly! 

Let’s add an additional check to our unit test. For instance, we may want to test whether there is a function definition for a function named _plates_simulation:

// We expect a function definition for _plates_simulation
var platesSimulationFuncDef = result.Root
   .SearchByType<FunctionDefinition>()
   .FirstOrDefault(funcDef => funcDef.Name == "_plates_simulation");

Assert.IsNotNull(platesSimulationFuncDef);

The above SearchByType method requires the ExtensionMethods namespace:

using ExtensionMethods;

Additional checks we could do:

// We expect _plates_simulation to contain a definition of an array called temps
var temps = platesSimulationFuncDef
   .Parameters
   .FirstOrDefault(p => p.Name == "temps");
Assert.IsNotNull(temps);

// We also test the expected value of the temps parameter
Assert.IsTrue(temps.DefaultValue is ArrayLiteral);
Assert.IsTrue(((ArrayLiteral) temps.DefaultValue).Elements.SequenceEqual(
   new List<Expression>()
   {
       new NumberLiteral { Value = ".874" },
       new NumberLiteral { Value = ".765" },
       new NumberLiteral { Value = ".594" },
       new NumberLiteral { Value = ".439" },
       new NumberLiteral { Value = ".366" },
       new NumberLiteral { Value = ".124" }
   }));

In order for the SequenceEqual method to work as expected, we have to implement the Equals and GetHashCode methods for the LiteralExpression type:

public abstract class LiteralExpression : Expression
{
   public string Value { get; set; }

   public override bool Equals(object? obj)
   {
       if (obj == null || GetType() != obj.GetType())
       {
           return false;
       }
       // Cast to the base type: the GetType() check above already guarantees
       // that the runtime types match
       var o = (LiteralExpression) obj;
       return o.Value == Value;
   }

   public override int GetHashCode()
   {
       unchecked
       {
           int hash = 17;
           hash = hash * 23 + Value.GetHashCode();
           return hash;
       }
   }
}
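With Equals in place, two literals compare equal when they have the same runtime type and the same Value, which is exactly what SequenceEqual needs. A quick sanity check (hypothetical, not part of the test suite):

```csharp
var a = new NumberLiteral { Value = ".874" };
var b = new NumberLiteral { Value = ".874" };
var c = new StringLiteral { Value = ".874" };

Assert.IsTrue(a.Equals(b));     // same type, same Value
Assert.IsFalse(a.Equals(c));    // different runtime types, so Equals is false
Assert.IsFalse(a.Equals(null)); // the null guard handles this case
```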

Now the unit test should complete successfully.

This point corresponds to the Git tag: ast-extension-2

Testing the AST

Sharplasu offers methods to test the equality of two ASTs, which can be used for unit testing.

Now, if we wish to test our AST structure to check just the presence of the _plates_simulation function definition and its temps parameter, we can initialize an instance of the expected AST and test it against the one returned by the parser, by calling the Sharplasu method AssertASTsAreEqual. However, we do not need to build the whole AST: we can build just the parts we are interested in, and ignore the subtrees we don’t care about by using a special Sharplasu collection, IgnoreChildren.

First of all, let’s add to ParserTest.cs:

using Strumenta.Sharplasu.Testing;
using static Strumenta.Sharplasu.Testing.Asserts;

Then we can add these lines to our unit test:

// Test the function declaration: _plates_simulation
var expectedPlatesSimulation = new FunctionDefinition
{
   Name = "_plates_simulation",
   Parameters = new IgnoreChildren<ParameterDeclaration>()
};

AssertASTsAreEqual(expectedPlatesSimulation, platesSimulationFuncDef);

var expectedTempsArgument = new ParameterDeclaration
{
   Name = "temps",
   DefaultValue = new ArrayLiteral
   {
       Value = "[.874,.765,.594,.439,.366,.124]",
       Elements = new IgnoreChildren<Expression>()
   }
};

AssertASTsAreEqual(expectedTempsArgument, temps);

When AssertASTsAreEqual finds an instance of IgnoreChildren in the expected AST, it does not test them against the corresponding ones in the actual AST. It just ignores them and continues with whatever is left to be checked.

We could also want to test the AST starting from the root node, maybe like this:

[TestMethod]
public void TestAST()
{
   var expectedAST = new CompilationUnit
   {
       Statements = new List<Statement>
       {
           new ImportStatement { Name = "platec" },
           new ImportStatement { Name = "time" },
           new ImportStatement { Name = "numpy" },
           new ImportStatement
           {
               Name = "Step,add_noise_to_elevation,center_land,generate_world,get_verbose,initialize_ocean_and_thresholds,place_oceans_at_map_borders",
               From = "worldengine.generation"
           },
           new ImportStatement
           {
               Name = "World,Size,GenerationParameters",
               From = "worldengine.model.world"
           },
           new FunctionDefinition
           {
               Name = "generate_plates_simulation",
               Parameters = new IgnoreChildren<ParameterDeclaration>()
           },
           new FunctionDefinition
           {
               Name = "_plates_simulation",
               Parameters = new IgnoreChildren<ParameterDeclaration>()
           },
           new FunctionDefinition
           {
               Name = "world_gen",
               Parameters = new IgnoreChildren<ParameterDeclaration>()
           }
       }
   };

   var parser = new Python3SharplasuParser();
   var result = parser.GetTreeForText(GetExamplePythonFileContent());
   AssertASTsAreEqual(expectedAST, result.Root);
}

You may want to take a moment to compare the expected AST we just wrote with the actual Python file.

This point corresponds to the Git tag: ast-testing

Traversing the AST

We already saw, previously in this tutorial, how to look for children of a certain type starting from a node, and then filter the result. We can see this again: for example, if we wanted to find out which function parameters are named height, we could write the following code, where we expect to find exactly 3 parameter declarations named “height”:

[TestMethod]
public void TestTraversing()
{
   var parser = new Python3SharplasuParser();
   var result = parser.GetTreeForText(GetExamplePythonFileContent());

   // Search all function parameters named "height"
   var heightParameters = result.Root
       .SearchByType<ParameterDeclaration>()
       .Where(p => p.Name == "height")
       .ToList();

   Assert.AreEqual(3, heightParameters.Count);
}

Another thing we commonly may want to do is to search the closest ancestor of a given type. For instance, let’s see the function definitions for each of the “height” parameters we just found:

// We can search the closest ancestor of a given type.
// For each of the "height" parameters we just found,
// we expect the closest FunctionDefinition to be respectively:
// generate_plates_simulation, _plates_simulation and world_gen
var functionDefinitionNames = heightParameters
   .Select(param => param.FindAncestorOfType<FunctionDefinition>())
   .Select(funcDef => funcDef.Name)
   .ToList();

Assert.IsTrue(
   new [] { "generate_plates_simulation", "_plates_simulation", "world_gen" }
       .SequenceEqual(functionDefinitionNames));

We expect to find that each “height” parameter declaration we previously found has a function definition ancestor, respectively: generate_plates_simulation, _plates_simulation and world_gen.

Sharplasu offers a few other methods that we can use for traversing the tree: 

  • WalkChildren, which returns all the direct children of a node
  • WalkDescendants, which looks at the whole subtree of a node (the node itself is excluded)
  • WalkLeavesFirst, which returns the nodes starting from the leaves and going up
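As an illustration, and assuming these methods return an IEnumerable of nodes like the rest of the Sharplasu traversal APIs, we could for instance count the descendants of the root grouped by node type:

```csharp
// Sketch: tally how many nodes of each type appear below the root.
// Assumes the result variable from the traversal test above.
var nodeCounts = result.Root
   .WalkDescendants()
   .GroupBy(node => node.GetType().Name)
   .ToDictionary(group => group.Key, group => group.Count());

// For our example file, we would expect nodeCounts["FunctionDefinition"] to be 3
```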

This point corresponds to the Git tag: ast-traversing

Conclusions

In this tutorial, we introduced how to build a Python parser using Sharplasu. Even though we just scratched the surface of this topic, it should already be clear by now how a library like Sharplasu can help build applications that need to parse files and use their structural information.

Everything we presented in this tutorial is available in our GitHub repository for you to try: https://github.com/Strumenta/sharplasu-tutorial

Sharplasu is the youngest of the StarLasu libraries. Its development is still ongoing, but it’s catching up with its older siblings, allowing C# developers to produce great tools with a simple yet reliable approach.

Read more:

To understand how to use ANTLR, you can read The ANTLR Mega Tutorial

To discover more about parsing in C#, you can read Parsing in C#: Tools and Libraries

For more about our open-source libraries and the Starlasu methodology, you can read Building advanced parsers in Kolasu