Parsing is typically where we begin and where we invest most of our enthusiasm. However, completing the parser is just the beginning. Sadly, I must inform you that additional steps are necessary. One step that enhances the value of the Abstract Syntax Trees (ASTs) obtained from parsing is semantic enrichment. In this article, we’ll explore what semantic enrichment is, what it enables us to do, and we’ll share some code along the way.

As always, the code for this article is available on GitHub. You can find it at https://github.com/Strumenta/an-entity-language-sharplasu. In this article we only talk about semantic enrichment, so if you want details about the parsing and the creation of the AST you will have to look at the repository.

It All Starts With Parsing, but the Assembly Line Continues

When we parse, we recognize the structure of the code, by analyzing its syntax.

But what does it mean to parse? It means to:

  1. Check that the code is syntactically correct
  2. Build the Abstract Syntax Tree (AST): we create nodes representing the structures we recognize. For example, we could process the code dog.fly() and return a node representing a method call.

All is good, except for the fact that dogs cannot fly.

The code is syntactically correct but semantically incorrect.

When we process code, we start by verifying that it is syntactically correct.

If it is, we move on to semantic analysis. When the code is semantically correct, we can enrich the AST with semantic information.

In essence, we are interested in two things:

  1. Connecting references to the corresponding declarations. We call this symbol resolution
  2. Calculating the types of elements that can be typed. We call this type calculation
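
To make these two steps concrete, here is a minimal sketch, with hypothetical node classes (not the ones we will use later in the article), of what semantic enrichment adds on top of the plain AST:

// Hypothetical node shapes, for illustration only
public class VariableDeclaration
{
    public string Name;
    public string TypeName;
}

public class VariableReference
{
    public string Name;                      // what the parser gives us: just a name
    public VariableDeclaration Declaration;  // filled in by symbol resolution
    public string Type;                      // filled in by type calculation
}

// After parsing:            a VariableReference with Name = "dog" and nothing else
// After symbol resolution:  Declaration points to the declaration of "dog"
// After type calculation:   Type is "Dog", so we can tell that fly() is not a valid call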

The Link Between Symbol Resolution and Type Calculation

I am an engineer by trade, and I really like the step-by-step approach to problem-solving: you solve one part of the problem and only then move to the next one. So you may wonder why I am conflating two apparently different problems, like symbol resolution and type calculation. As for most things I do, it is because there is no better alternative.

The two mechanisms are interconnected: one depends on the other. For example, let’s say that I have this code:

class TimeMachine {
    Time move(TimeIncrement)
    Point move(Direction)
}

class Point {
    Point add(Direction)
}

class Time {
    Time add(TimeIncrement)
}

myTimeMachine.move(foo).add(bar)

Let’s say my goal is to figure out the type of the overall expression myTimeMachine.move(foo).add(bar).

To answer this question I need to figure out which add method we are referring to. If it is the one declared in Point, then the overall result will be of type Point. If instead we are referring to the add method declared in Time, then the overall result will be of type Time.

Ok, but how do I know which add method I am calling? Well, it depends on the type of myTimeMachine.move(foo):

  • If that expression has type Point and bar has type Direction, then we are calling the add method in Point
  • If instead that expression has type Time, and the expression bar has type TimeIncrement, then we are calling the add method in Time.

This means I need to figure out the type of myTimeMachine.move(foo). To do so, I first need to figure out whether I am calling the first or the second overloaded variant of TimeMachine.move. And that depends on the type of foo.

So, you see, I cannot extricate the two problems: they affect each other’s results, and therefore, in principle, we treat them together. In practice, for very simple languages we can get away with treating them separately. Typically, you need to treat the two problems in a combined way if there are composite types or cascading function/method calls.
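
To see this circularity in code form, here is a deliberately simplified sketch, with hypothetical classes and method names (this is not Sharplasu code), where symbol resolution and type calculation end up calling each other:

using System.Linq;

// Hypothetical, deliberately simplified model: just enough to show the circular dependency
record MethodDecl(string OwnerType, string Name, string ParamType, string ReturnType);

abstract record Expr;
record Name(string Text, string Type) : Expr;               // a reference whose type we already know
record Call(Expr Receiver, string Method, Expr Arg) : Expr; // e.g. myTimeMachine.move(foo)

static class CircularityDemo
{
    static readonly MethodDecl[] Decls =
    {
        new("TimeMachine", "move", "TimeIncrement", "Time"),
        new("TimeMachine", "move", "Direction",     "Point"),
        new("Point",       "add",  "Direction",     "Point"),
        new("Time",        "add",  "TimeIncrement", "Time"),
    };

    // Type calculation: typing a call requires resolving it first...
    public static string TypeOf(Expr e) =>
        e switch { Name n => n.Type, Call c => Resolve(c).ReturnType, _ => "?" };

    // Symbol resolution: ...and resolving a call requires typing the receiver and the argument first
    public static MethodDecl Resolve(Call c) =>
        Decls.First(d => d.OwnerType == TypeOf(c.Receiver)
                      && d.Name == c.Method
                      && d.ParamType == TypeOf(c.Arg));
}

Typing myTimeMachine.move(foo).add(bar) then bounces between the two functions: TypeOf on the outer call needs Resolve to pick the right add, Resolve needs TypeOf on the inner move call, which in turn needs Resolve again, all the way down to the type of foo.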

If you want to read about symbol resolution for a language like Java, you can look at How to Build a Symbol Solver for Java, in Clojure or Resolve method calls in Java code using the JavaSymbolSolver.

A Prerequisite for Any Non-Trivial Operation

Semantic Enrichment is a prerequisite for most programmatic handling of code.

You can perhaps implement a linter or a code formatter without semantic enrichment, but the most typical language engineering applications do require it:

  • Interpretation: To execute code, we need to connect function invocations to their definitions
  • Migrations: To migrate code in any nontrivial way, we want to take into account the type of the different elements. For example, if we were translating the operation a + b, depending on the target language and the type of the operands, we may want to translate it as a + b or perhaps a.concat(b) or even a + b.toDouble().
  • Static analysis and refactoring: Automated code modifications, such as renaming variables or moving functions, depend on knowing which references are linked to which declarations.
  • Editors: Autocompletion and error checking depend on semantic enrichment, and so do go-to-definition and find-usages. In essence, the difference between a basic editor and an advanced one lies in its support for semantic enrichment for the language of interest.

The StarLasu Approach and Semantic Enrichment

When it comes to parsing, at Strumenta we apply the principles of the Chisel Method. They are quite established at this point, after years of refinement. For Semantic Enrichment, things are not as crystallized: we evolve the approach with each new project, finding new solutions to new challenges. That said, we are finding patterns that work and incorporating them into our core libraries, and into Sharplasu in particular.

At this stage, Sharplasu has a module called SymbolResolution which provides a reasonably good approach to symbol resolution. Type calculation, instead, is still implemented ad hoc for each project. So we call the type calculation logic from symbol resolution (and vice versa); it is just that we have standard APIs for symbol resolution and not for type calculation.

Let’s See an Example of Semantic Enrichment

In our example we will work with a simple language that permits us to define entities. These entities have typed fields, called features, which can be initialized with expressions.

This is an example:

module example

import standard

type address

class base {
	name string
	description string
}

class aged : base {
	age integer
}

class person : aged {
	speed integer = 2
}

class athlete : person {
	speed integer = person.speed * 2
}

class car : aged {
	kilometers integer = age * 10000	
}

This language, while simple, contains some of the elements that we can find in most languages:

  • We can import modules
  • We can define new types
  • We have built-in types such as string and integer
  • We can reference features either without specifying the context (thus using the current object as context) or by specifying it explicitly

Symbol Resolution

Let’s take a look at a portion of our SymbolResolver.

Scope ModuleLevelTypes(Node ctx)
{
    var scope = new Scope();
    var module = ctx.FindAncestorOfType<Module>();
    if (module != null)
    {
        // let's define types
        module.Types.ForEach(type => scope.Define(type));
        foreach (var import in module.Imports)
        {
            SymbolResolver.ResolveSymbols(import);
            if (import.Module.Referred?.Entities != null)
            {
                foreach (var entity in import.Module.Referred.Entities)
                {
                    scope.Define(entity);
                }
            }
            if (import.Module.Referred?.Types != null)
            {
                foreach (var type in import.Module.Referred.Types)
                {
                    scope.Define(type);
                }
            }
        }
    }

    return scope;
}

[..]

public ExampleSemantics(IModuleFinder moduleFinder)
{
    ModuleFinder = moduleFinder;

    SymbolResolver = new DeclarativeLocalSymbolResolver(Issues);
    [..]
    SymbolResolver.ScopeFor(typeof(FeatureDecl).GetProperty("Type"), (FeatureDecl feature) =>
    {
        var scope = ModuleLevelTypes(feature);                
        return scope;
    });
    [..]
    SymbolResolver.ScopeFor(typeof(Import).GetProperty("Module"), (Import import) =>
    {
        var scope = new Scope();
        if(moduleFinder.FindModule(import.Module.Name) != null)
        {
           scope.Define(moduleFinder.FindModule(import.Module.Name));
        }                    
        return scope;
    });

    TypeCalculator = new EntityTypeCalculator(SymbolResolver);                
}

Here we want to see the basics of symbol resolution and how importing symbols from other elements works. We first see that symbol resolution depends on moduleFinder. The moduleFinder is the object containing the list of available code for a project: the files of the project and any library available to it. You can see in the repository that, for this project, it is just a Dictionary that keeps track of each name and the corresponding Module object. A Module object is the root of the AST. Given the previous example file, there will be a Module with the name "example" representing it. The important part is that this is the object that tells you if and where modules outside the current one are located.
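
As a reference, a module finder can be as simple as a wrapper around a Dictionary. The sketch below shows a plausible shape; the Register method is our own invention here, and the actual SimpleModuleFinder in the repository may differ in its details:

using System.Collections.Generic;

// A minimal module finder: it just maps module names to already-parsed Module roots
public class SimpleModuleFinder : IModuleFinder
{
    private readonly Dictionary<string, Module> modules = new Dictionary<string, Module>();

    // register a parsed module so that other modules can import it by name (hypothetical helper)
    public void Register(Module module) => modules[module.Name] = module;

    // return the module with the given name, or null if it is unknown
    public Module FindModule(string name) =>
        modules.TryGetValue(name, out var module) ? module : null;
}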

Let’s look at how we solve imports, such as:

import standard

To solve a symbol you need a scope. You can think of a scope as a container of available definitions. In Sharplasu, a Scope can have a parent Scope, so you can properly nest them. For example, there could be a global scope to solve imports and a class scope to solve features.

So, we create a scope, and then we ask the moduleFinder if there is a module with that name and, if so, we define that module. Defining a symbol means telling our SymbolResolver object that there is a definition of the argument in the current scope. The SymbolResolver object is based on a Sharplasu class, so you can use that class in all your projects. Later, when we ask our symbol resolver to resolve the symbols, it will look in the scope of each reference and check whether there is a valid definition for a reference with that name.
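
For instance, using only the Scope operations mentioned in this article (the constructor, Define, and the Parent property), nesting a class scope inside a module scope could look like the sketch below. The method itself is just illustrative, not part of the example project:

// Illustrative only: a class scope whose lookups fall back to a module scope
Scope ClassScopeWithModuleFallback(Module module, ClassDecl classDecl)
{
    var moduleScope = new Scope();
    module.Types.ForEach(type => moduleScope.Define(type));        // module-level definitions
    module.Entities.ForEach(entity => moduleScope.Define(entity));

    var classScope = new Scope { Parent = moduleScope };           // resolution falls back to the module scope
    classDecl.Features.ForEach(feature => classScope.Define(feature));
    return classScope;
}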

How Importing Modules Affects Types

You will notice that defining modules is necessary to solve the types of features.

class base {
	name string
	description string
}

Here, string after name and description is a type, and name string is the definition of a feature.

To solve the references to the types of features you call the ModuleLevelTypes method. This method will:

  • look for definitions of types in the current module
  • loop through the imported modules, making sure that all references in the imported modules are solved
  • then define the types found in each imported module

Solving imports is therefore crucial to solving types, particularly because in this example, as in many real languages, types are often the ones from the standard library.

class athlete : person {
	speed integer = person.speed * 2
}
class car : aged {
	kilometers integer = age * 10000	
}

Solving ReferenceExpression

In our language a ReferenceExpression, like person.speed or age, can only have:

  • an optional parent/context element that is a class (like person)
  • a target element that references a feature (like speed or age)

No nesting or multiple levels are allowed.
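
Based on how it is used in the scope definitions below, the ReferenceExpression node of this simple version has roughly the following shape. This is only our reconstruction: the actual class in the repository may have more members, and later in the article Context will become an Expression.

// Approximate shape of the node, inferred from its usage in this article
public class ReferenceExpression : Expression
{
    // optional context, e.g. "person" in person.speed; null in a plain reference like age
    public ReferenceByName<ClassDecl> Context { get; set; }

    // the feature being referenced, e.g. "speed" or "age"
    public ReferenceByName<FeatureDecl> Target { get; set; }
}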

So, solving a reference to the context element and solving one to the target element are similar operations.

SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Context"), (ReferenceExpression reference) =>
{
    var scope = new Scope();
    var classParent = reference.FindAncestorOfType<ClassDecl>();
    if (classParent != null)
        scope = ClassHierarchyEntities(reference.FindAncestorOfType<ClassDecl>());
    else
        Issues.Add(Issue.Semantic("The class containing this expression has no superclasses. The Context cannot be solved.", reference.Position));
    return scope;
});
SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Target"), (ReferenceExpression reference) =>
{
    var scope = new Scope();
    if (reference.Context == null)
    {
        var classParent = reference.FindAncestorOfType<ClassDecl>();
        if (classParent != null)
            scope.Parent = ClassLevelTypes(classParent);
    }
    else if (reference.Context.Resolved)                
    {
        reference.Context.Referred.Features.ForEach(it => scope.Define(it));
    }
    return scope;
});

One difference is that if the current expression contains a Context element but the class containing the expression has no superclass, we report an issue, because the reference cannot be solved. The other is that, to solve Target, we must first solve Context, to make sure we only consider the features of the Context element.

Otherwise, we look for the proper elements in the parent class. Let’s just look at how to solve the hierarchy of classes.

Scope ClassHierarchyEntities(ClassDecl ctx)
{
    var scope = new Scope();
    var superclass = ctx.Superclass;            
    if (superclass != null && superclass.Resolved)
    {
        // let's define the superclass
        scope.Define(superclass.Referred);                
        scope.Parent = ClassHierarchyEntities(superclass.Referred);
    }

    return scope;
}

If the reference to the superclass of the current class has been solved, we define the current superclass. Then we walk up through the hierarchy of classes to define them all.

class base {
	name string
	description string
}

class aged : base {
	age integer
}

class person : aged {
	speed integer = 2
}

class athlete : person {
	speed integer = person.speed * 2
}

Basically, considering the previous example, to solve the reference to person in person.speed we define person, because it is the superclass of athlete (the class containing the expression), then aged and base.
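
Schematically, the chain of scopes built by ClassHierarchyEntities for a reference inside athlete looks like this:

// Scope chain for a reference inside athlete (schematic)
//
//   scope                 defines person   (direct superclass of athlete)
//   scope.Parent          defines aged     (superclass of person)
//   scope.Parent.Parent   defines base     (superclass of aged)
//
// The reference to person is found in the first scope of the chain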

Symbol Resolution Patterns

We can then see a few rules:

  • The way in which we resolve imports is by delegating to the moduleFinder object. This is the case because we need some logic to find other files and parse them on demand, possibly managing loops (what if a file imports itself, directly or indirectly?)
  • When looking for a superclass we use a global scope, valid for all classes in the module
  • For solving references to features we do not define any element at that level: we just look for features declared in the classes containing them. To do this, we set a parent scope
  • The case of ModuleLevelTypes is interesting because there we can see a combination of elements:
    • We get all the types declared in the current module
    • We get all the types declared in the imported modules
    • We could also get all the built-in entities, but we choose instead to force the user to import a standard module to get them

These few rules cover many of the most common patterns we see in languages we work with, either Domain Specific Languages we design or legacy languages for which we build parsers.

Type Calculation

Let’s take a look at our simple Type Calculator. Notice that we managed to separate type calculation from symbol resolution: our type calculation needs symbol resolution, but not vice versa. We are going to see this simpler case first, and then what happens in the more general case.

public override IType CalculateType(Node node)
{
    switch (node)
    {
        case OperatorExpression opExpr:
            var leftType = GetType(opExpr.Left);
            var rightType = GetType(opExpr.Right);
            if (leftType == null || rightType == null)
                return null;
            switch (opExpr.Operator)
            {
                case Operator.Addition:
                    if (leftType == EntityStandardLibrary.StringType && rightType == EntityStandardLibrary.StringType)
                        return EntityStandardLibrary.StringType;
                    else if (leftType == EntityStandardLibrary.StringType && rightType == EntityStandardLibrary.IntegerType)
                        return EntityStandardLibrary.StringType;
                    else if (leftType == EntityStandardLibrary.IntegerType && rightType == EntityStandardLibrary.IntegerType)
                        return EntityStandardLibrary.IntegerType;
                    else
                        throw new NotImplementedException($"Unsupported operand types for addition: {leftType}, {rightType}");
               [..]
            }
        case ReferenceExpression refExpr:
            if(refExpr.Context == null)
                return GetTypeOfReference<ReferenceExpression, FeatureDecl>(refExpr, typeof(ReferenceExpression).GetProperty("Target"));
            else
            {
                SymbolResolver.ResolveNode(refExpr);
                return GetTypeOfReference<ReferenceExpression, FeatureDecl>(refExpr, typeof(ReferenceExpression).GetProperty("Target"));
            }
       [..]
        case StringLiteralExpression _:
            return EntityStandardLibrary.StringType;
        case BooleanLiteralExpression _:
            return EntityStandardLibrary.BooleanType;
        case IntegerLiteralExpression _:
            return EntityStandardLibrary.IntegerType;                
        default:
            throw new NotImplementedException($"Type calculation not implemented for node type {node.GetType()}");
    }
}

It shows common patterns for type calculation:

  • At the bottom you can see that we solve types for literals: we assign a standard type to each literal
  • We solve types for binary operations by finding the types of the two individual elements (left and right) of the expression and then defining rules for their combination. For example, the addition of a string and an integer is considered a concatenation, so the resulting type is a string. This will vary very much by language and by how you choose to handle type conversion between compatible types
  • To solve the type of reference expressions, we need to solve the reference and then get its type

Calculating the Type of References

To solve the type of references, let’s take a look at the GetTypeOfReference method. This method takes as type arguments the class that can hold the reference and the type of the referred declaration, and as arguments the node holding the reference and the PropertyInfo of the property that holds the reference. Notice that in this case we could avoid using the first type argument, since ReferenceExpression is the only kind of expression containing a reference. However, this code shows that it is easy to generalize this method.

private IType GetTypeOfReference<T, S>(T refHolder, PropertyInfo refAccessor)
    where T : Node
    where S : Node, Named            
{
    ReferenceByName<S> refValue = refAccessor.GetValue(refHolder) as ReferenceByName<S>;
    // if the reference has not been solved yet, trigger symbol resolution first
    if (refValue != null && !refValue.Resolved)
    {
        SymbolResolver?.ResolveProperty(refAccessor, refHolder);                             
    }
    // if the reference is (now) solved, return the type of the referred declaration
    if (refValue != null && refValue.Resolved)
        return GetType(refValue.Referred as Node);
    
    return null;
}

The method uses a bit of reflection. In essence, it checks whether the provided property corresponds to a reference that has already been solved. If it has not, it triggers symbol resolution, provided we have a SymbolResolver. Then we get the type of the referred element. GetType is simply a way to access a Dictionary matching nodes with their types.

These are the common patterns used to calculate a type. One that we are missing here is a special type for void or unit, representing the absence of a type.

Again, it may sound quite boring, but these are the kinds of patterns we routinely see. Of course, things can get more exciting if we throw generics and type inference into the mix, but for this time, let’s keep things simple.

Our types are all classes that implement an interface called IType. This interface is not part of Sharplasu; we created it for this example, but it is so simple that you can look it up on your own in the repository.
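
As a sketch, the interface and the built-in types could look like the code below. This is our guess at a minimal shape, consistent with how EntityStandardLibrary and Name are used in this article; check the repository for the real definitions.

// A minimal guess at the shape of IType and the built-in types
public interface IType
{
    string Name { get; }
}

public class PrimitiveType : IType
{
    public string Name { get; }
    public PrimitiveType(string name) { Name = name; }
}

public static class EntityStandardLibrary
{
    // each built-in type is a singleton, so it can be compared with ==
    public static readonly IType StringType = new PrimitiveType("string");
    public static readonly IType IntegerType = new PrimitiveType("integer");
    public static readonly IType BooleanType = new PrimitiveType("boolean");
}

Note that later in the article GetType(reference.Context) is cast to ClassDecl, which suggests that class declarations themselves also implement IType.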

When Type Calculation and Symbol Resolution Intertwine

We could avoid making type calculation and symbol resolution depend on each other for a couple of reasons: our references had only two levels, and we knew that the first one always referred to a superclass of the current class. Imagine we change that.

class address {
	note base
	city string
	street string
	number integer	
}

class person : aged {
	location address
	speed integer = 2
}

class athlete : person {			
	deliveryNote string = location.note.description
	luckyNumber integer = location.number + 3
}

Now features can have a class as their type, in addition to scalar types. Our references now have a Context property, which is an Expression. So, the ReferenceExpression node for location.note.description will have the following structure: at the first level we have a ReferenceExpression with Target description and, as Context, another ReferenceExpression with Target note, and so on.
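
Schematically, the node structure for location.note.description is:

ReferenceExpression (Target = description)
    Context: ReferenceExpression (Target = note)
        Context: ReferenceExpression (Target = location)
            Context: null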

This means that now we cannot determine statically what the actual type of a Context object will be: it could be a scalar type or a class. So, to solve a reference in Target, we need to dynamically calculate the type of Context, which will depend on what the reference in Context resolves to. So, how do we accomplish this? For starters, we need to make a small change in the ModuleLevelTypes method and make sure that we define all entities at the module level.

module.Entities.ForEach(type => scope.Define(type));

Apart from that all we need to change is how we solve symbols for the Target property.

SymbolResolver.ScopeFor(typeof(ReferenceExpression).GetProperty("Target"), (ReferenceExpression reference) =>
{
    var scope = new Scope();

    var classParent = reference.FindAncestorOfType<ClassDecl>();
    if (classParent != null)
        scope.Parent = ClassLevelTypes(classParent);                

    if (reference.Context != null)
    {
        SymbolResolver.ResolveNode(reference.Context);

        var type = TypeCalculator.GetType(reference.Context) as ClassDecl;

        if (type != null)
        {
            type.Features.ForEach(it => scope.Define(it));
        }
    }

    return scope;
});

The pattern is simple:

  • We ensure we have resolved the Context node, so we know which feature Context resolves to
  • This allows us to get the type, i.e. the class the feature has
  • We can now define the features of the class

And voilà, we can now solve the current reference.

Using Semantic Enrichment

It is easy to glue together symbol resolution and type calculation.

public List<Issue> SemanticEnrichment(Node node)
{
    SymbolResolver.ResolveSymbols(node);
    node.WalkDescendants<Expression>().ToList().ForEach(expression => {
        TypeCalculator.SetTypeIfNeeded(expression);
    });
    return Issues;
}

So, we can take any module and trigger semantic enrichment. We will then get an AST whose references have been resolved. Also, the types will be stored in a cache in the TypeCalculator.

public abstract class TypeCalculator
{
    public virtual IType GetType(Node node)
    {
        return SetTypeIfNeeded(node);
    }

    public IType StrictlyGetType(Node node)
    {
        var type = SetTypeIfNeeded(node);
        if (type == null)
            throw new InvalidOperationException($"Cannot get type for node {node}");
        return type;
    }

    public abstract IType CalculateType(Node node);

    public virtual IType SetTypeIfNeeded(Node node)
    {
        if (node.GetTypeSemantics() == null)
        {
            var calculatedType = CalculateType(node);
            node.SetTypeSemantics(calculatedType);
        }
        return node.GetTypeSemantics();
    }
}

This means that after invoking semantic enrichment, we can look at each of our nodes of class Expression and get its type using mynode.GetTypeSemantics(). Easy, right?

A Simple Test

What is life without tests? Here at Strumenta we do not want to imagine such a sorry existence, so our repository has a few tests. Let’s see a simple one:

        [TestMethod]
        public void TestTypeCalculation()
        {
            EntitySharplasuParser parser = new EntitySharplasuParser();
            string code = @"module example

import standard

class base {
	name string
	description string
}

class aged : base {
	age integer
}

class person : aged {
	location address
	speed integer = 2
}

class address {
	note base
	city string
	street string
	number integer	
}

class athlete : person {			
	deliveryNote string = location.note.description
	luckyNumber integer = location.number + 3
}

class car : aged {
	kilometers integer = age * 10000	
}";
            var result = parser.Parse(code);

            SimpleModuleFinder moduleFinder = new SimpleModuleFinder();
            ExampleSemantics semantics = new ExampleSemantics(moduleFinder);
            List<Issue> issues = semantics.SemanticEnrichment(result.Root);
            Assert.AreEqual(0, issues.Count);
            result.Root.AssertAllExpressionsHaveTypes();
            Assert.AreEqual("string",
                result.Root.Entities[4].Features[0].Value.GetTypeSemantics().Name
            );
        }

We test that all expressions have a type, then we check one specific expression. The expression we check (the initialization value of deliveryNote in athlete) should have type string, since location is of class address, which has the note feature of class base, which in turn has a feature description of type string.

Conclusions

While parsing organizes code into a syntactic structure, Semantic Enrichment uncovers its meaning by resolving symbols and determining types. Without this critical step, advanced operations such as code generation, interpretation, and refactoring would be impossible. Semantic Enrichment is not trivial to implement. 

With this article, we wanted to share some of the principles behind it. And through the support built into Sharplasu, we want to provide a way to simplify the implementation of advanced Language Engineering solutions. For us, it has been working pretty well, and we hope it will work similarly well for you.

Have fun with your Language Engineering project!