Introduction: the ANTLR4 Code Completion Core

When we think about an editor for code, we think about syntax highlighting: graphically marking different elements of programming constructs, so that we can better interpret the structure of the code at a glance. We’ve written a few articles about that already, for example in Kotlin/Java and in TypeScript/JavaScript.

However, a good editor is not limited to syntax highlighting. The IDEs and editors that we use as programmers have many more features; for example, they can mark the source code with reports of programming errors and warnings, and they can suggest meaningful completions of what we’re typing (what Microsoft calls IntelliSense).

In this tutorial, we’ll learn how to use the code completion engine, antlr4-c3: the ANTLR4 Code Completion Core. As the name suggests, antlr4-c3 targets ANTLR4 grammars. So, a little working knowledge of ANTLR4 is necessary to fully understand this tutorial. ANTLR4 novices may want to read our ANTLR Mega Tutorial first.

We won’t see how to integrate code completion in an actual editor, however. We’ll keep the focus on the completion engine. As a company, we’ve got experience with the Language Server Protocol and Visual Studio Code, among others, and that may become the topic for another article. For another take on the same subject, that also shows integration into an editor, you may read our article, Building autocompletion for an editor based on ANTLR, which doesn’t use antlr4-c3.

The Journey Ahead

Code completion isn’t an easy topic, but with all the information we’ve packed here, you’ll be able to rapidly get it working for you.

We’ll start with an introduction about how antlr4-c3 works and how to set it up. We’ll code a simple example to see the basics of the engine in action, based on a grammar for the Kotlin language by JetBrains. The initial example will only know how to suggest keywords, such as var and try.

Then, we’ll gradually improve our example by:

  • suggesting identifiers using a symbol table;
  • respecting the scoping of the identifiers;
  • completing partial matches;
  • integrating a fuzzy search library;
  • improving accuracy;
  • giving performance tips.

In the end, we’ll have a component that we could use in a real-world application with some further tuning.

Setup

For this tutorial, we’ll be using antlr4-c3 with TypeScript, specifically the antlr4ts backend, although antlr4-c3 also comes in a Java version and a C# version.

We’ll assume you already have an ANTLR4 grammar and parser. For the sake of the example, we’ll be using the sample Kotlin grammar in the project antlr4-grammars, but the techniques that we present here are not specific to a particular grammar. The only prerequisite: the grammar must be suitable for generating a parser with the TypeScript backend, so it must either be free of semantic predicates and actions, or those must be written against the TypeScript backend.

We’ll also assume you already have a TypeScript project in place, with an existing package.json file. If that’s not the case, you can follow a tutorial to set it up.

Having said that, we’ll now add these dependencies to our package.json file (in the “dependencies” section):

"antlr4-c3": "^1.1.12",
"antlr4ts": "^0.5.0-alpha.3",

And this development-time dependency (in the “dev-dependencies” section):

"antlr4ts-cli": "^0.5.0-alpha.3",

antlr4-c3 is the code completion engine itself, while antlr4ts is the ANTLR4 TypeScript backend and antlr4ts-cli is its accompanying command-line tool.

Depending on how the project is set up, we can use a package manager such as npm or Yarn to install those dependencies instead of modifying the file by hand.
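For example, with npm we might issue the following commands (the package names are the ones listed above):

npm install --save antlr4-c3 antlr4ts
npm install --save-dev antlr4ts-cli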

We could also automate the generation of the parser from the grammar, in the “scripts” section, for example:

"antlr4ts": "antlr4ts my-language.g4",
"build": "npm run antlr4ts && tsc -b",

Then, every time we build the project, our build tool will generate the lexer, parser, and accompanying classes before going on to transpile the TypeScript code to JavaScript.

How It Works

antlr4-c3 works by computing the set of applicable tokens at a given point in a parse tree. For example, let’s consider the following Kotlin statement:

fun test() {
    try {
        doSomething()
    }
}

If we place the caret after the first closing brace, we can expect the engine to suggest ‘catch’ and ‘finally’ tokens, because according to the grammar those can go in that spot in order to form a valid program.

To compute the set of candidate tokens, antlr4-c3 uses the same data structures and algorithms that the parsers generated by ANTLR use internally to do their job; basically, it simulates the parser’s state machine. Thus, we can rest assured that the code completion engine is always aligned with the language grammar.

We should note that antlr4-c3 works in terms of tokens; we don’t get back strings that we can readily show to the user. Translating those tokens into meaningful strings will be the topic of some of the following sections.

Basic Usage: Suggesting Keywords

Let’s now look at the basic usage of the code completion engine. We’ll see how to set up the engine and how to ask it for tokens.

As a first step, we create the parser and parse the desired code:

import { CharStreams, CommonTokenStream } from 'antlr4ts';
import { KotlinLexer } from './KotlinLexer';   //adjust the path to wherever the generated lexer lives
import { KotlinParser } from './KotlinParser'; //likewise for the parser

let input = CharStreams.fromString(code);
let lexer = new KotlinLexer(input);
let parser = new KotlinParser(new CommonTokenStream(lexer));
parser.kotlinFile();

Note that the parse doesn’t need to be successful; we can also work with incomplete or malformed code (which is often the case in an editor). Fortunately, ANTLR4 is capable of recovering from parsing errors, and if we use the default error strategy, it will build a parse tree in every case.

A tree built from malformed input will contain error nodes, but we can still use it for our purposes – even though we may have to take into account the fact that some nodes are missing the information that we’d normally expect.
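Incidentally, since in an editor we’ll routinely parse incomplete code, we may also want to detach ANTLR’s default error listener, which prints every syntax error to the console. For example:

lexer.removeErrorListeners();
parser.removeErrorListeners();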

Anyway, the engine itself doesn’t use the parse tree to compute its suggestions. In principle, running the parser is only necessary to fill the token stream. However, we’ll actually make use of the parse tree to fine-tune the engine.

Once we’ve filled the token stream, we may ask the engine for a list of completion candidates:

let index = /* we’ll show how to compute this later */;
let core = new CodeCompletionCore(parser);
let candidates = core.collectCandidates(index);

In the following sections, we’ll describe the inputs and the outputs of the completion engine, then we’ll gradually improve our usage of the library to provide more appropriate suggestions.

Determining the Position of the Caret

The number that we pass to the completion engine, which we called index in the preceding examples, represents the position of a token in the stream that we’ve provided to the parser. To a first approximation, we could say that it points to the token under the caret. However, we’ll see over the course of the tutorial that things are a bit more nuanced.

It’s very important to calculate this index so as to find the most appropriate suggestions. To get started, however, we just need to compute a position that is good enough to produce reasonable results most of the time.

We’ll represent the position of the caret in the document as a pair of coordinates: line and column. Given how ANTLR parsers work, it would be simpler to use a single number, the position of the character in the input stream; however, using line and column allows for easier integration with the Language Server Protocol and probably editors in general.

So, our first attempt might be to write a function such as:

import { ParseTree } from 'antlr4ts/tree/ParseTree';
import { TerminalNode } from 'antlr4ts/tree/TerminalNode';

export type CaretPosition = { line: number, column: number };
export function computeTokenIndex(parseTree: ParseTree, caretPosition: CaretPosition): number | undefined {
    if(parseTree instanceof TerminalNode) {
        return computeTokenIndexOfTerminalNode(parseTree, caretPosition);
    } else {
        return computeTokenIndexOfChildNode(parseTree, caretPosition);
    }
}

Two additional functions will perform the actual work, one for terminal nodes and one for intermediate nodes:

function computeTokenIndexOfTerminalNode(parseTree: TerminalNode, caretPosition: CaretPosition) {
    let start = parseTree.symbol.charPositionInLine;
    let stop = parseTree.symbol.charPositionInLine + parseTree.text.length;
    if (parseTree.symbol.line == caretPosition.line && start <= caretPosition.column && stop >= caretPosition.column) {
        return parseTree.symbol.tokenIndex;
    } else {
        return undefined;
    }
}

function computeTokenIndexOfChildNode(parseTree: ParseTree, caretPosition: CaretPosition) {
    for (let i = 0; i < parseTree.childCount; i++) {
        let index = computeTokenIndex(parseTree.getChild(i), caretPosition);
        if (index !== undefined) {
            return index;
        }
    }
    return undefined;
}

Note that we’ve written the code above under the assumption that tokens in our language – at least, tokens that we may want to complete – do not span multiple lines. This is reasonable, because in typical languages the tokens that can span multiple lines only include comments and perhaps strings, and we don’t expect completion to work inside comments and strings.

Even in the case of string interpolation, or documentation comments with references to program elements, which we may want to complete, we would handle those by tokenizing and parsing the contents of comments and strings, so they wouldn’t be represented as a single token spanning multiple lines.

Should we encounter any unusual language where identifiers can span multiple lines, we’d have to amend the above code to handle such tokens.

Computing Code Suggestions

Now that we know how to compute a meaningful token index, let’s return to our initial example, which included the line:

let candidates = core.collectCandidates(index);

The collectCandidates method returns a list of potential completions at the given position (token index), in the form of an object of type CandidatesCollection. This includes two types of results: tokens and rules:

class CandidatesCollection {
    tokens: Map<number, TokenList>;
    rules: Map<number, RuleList>;
}

Let’s concentrate on tokens for now. Each key in the map is a token type, as defined in the parser’s vocabulary. We can reconstruct a user-friendly name for the token type from its definition in the lexer grammar:

let tokenName = parser.vocabulary.getSymbolicName(tokenType).toLowerCase();

In many cases, we can offer that string directly as a suggestion:

let completions = []; 
candidates.tokens.forEach((_, k) => completions.push(parser.vocabulary.getSymbolicName(k).toLowerCase()));
return completions;

However, this only works for tokens whose rule name matches the parsed text (uppercased), as is common practice for keywords. For example:

PACKAGE: 'package' ;

In this case, for the token type PACKAGE, we’ll suggest the string ‘package’. However, not all keywords respect this pattern. We could then handle the few exceptions to the rule when necessary, simply by checking for each of them. For example, the Kotlin lexer defines the following rule:

NOT_IN: '!in' (WS | NL)+;

We may want to handle it like this:

if(k == KotlinParser.NOT_IN) {
    completions.push("!in");
} else {
    completions.push(parser.vocabulary.getSymbolicName(k).toLowerCase());
}
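Alternatively, for tokens that the lexer defines with a single string literal, ANTLR records the literal itself in the vocabulary, so we can retrieve it instead of reconstructing it from the symbolic name. A sketch of this approach:

//getLiteralName returns the quoted literal (e.g. "'package'") for rules
//such as PACKAGE: 'package' ;, and undefined for composite rules like NOT_IN
let literal = parser.vocabulary.getLiteralName(k);
if(literal) {
    completions.push(literal.substring(1, literal.length - 1)); //strip the quotes
} else {
    completions.push(parser.vocabulary.getSymbolicName(k).toLowerCase());
}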

We can already see that how we write the grammar has an impact on code completion. This will be more evident as we go on with the tutorial.

Beyond Keywords; Excluding Tokens

In any case, the naive approach that we’ve just seen falls short as soon as the engine suggests a token which is not a keyword, say, LPAREN or Identifier [*].

In the case of an LPAREN token, or SEMICOLON, etc., we probably won’t want to suggest anything at all, since these are just syntactic markers.

When the candidate is an Identifier token, instead, we don’t want to suggest the literal string ‘identifier’; rather, we want to suggest a symbol that can meaningfully go where the user is typing. This could be, for example, the name of a variable in the current scope, or the name of a class in the current namespace. Typically, most of the value of code completion lies in suggesting identifiers.

Now, excluding tokens from candidate suggestions is simply a matter of configuring the engine:

core.ignoredTokens = new Set([ KotlinParser.LPAREN, KotlinParser.SEMICOLON ]);

Then, collectCandidates will return neither LPAREN nor SEMICOLON.

To suggest identifiers, however, we need to do more work. This will be the topic of the following section.

[*] note: the Kotlin grammar that we’ve used spells the name of identifier tokens like we did here, Identifier with a capital I followed by lowercase letters.

Suggesting Identifiers

We can determine which identifiers can go under the caret if we gather two pieces of information:

  • what kinds of identifiers we can insert at that point;
  • which identifiers of those kinds are visible in the active scope.

Let’s see how we might compute each set.

Which Kinds of Identifiers Can Go Here?

In general, at any given location in the source code, only certain kinds of identifiers will be valid. For example, when writing an expression such as 1 + …, we could insert the name of a variable, a function, or a method, but not the name of a type or namespace [*].

How can we compute which kinds of identifiers are valid? If we’re integrating an existing interpreter, compiler, or transpiler, we probably already have facilities to process and analyze the source code after parsing it, and we may be able to leverage those for our purpose. Remember that such facilities will have to be robust against the parse trees containing error nodes that result from malformed input – which is not a typical requirement in a translator.

If, however, we cannot readily use the information we already have or know how to compute, or if we only have a parser (as in our example), we can still use the grammar to our advantage, combined with antlr4-c3.

In fact, recall that candidates returned by the engine comprise both tokens and rules. So far, we’ve only considered tokens; it’s time to give rules some love.

[*] Technically, we can, if we want to refer to a static member of some class, but let’s keep this simple.

Rules Rule!

If we don’t tell it anything, antlr4-c3 won’t populate the rules map, ever. But, we can instruct the engine to return rules as well, by telling it which parser rules we’re interested in. Let’s start from simpleIdentifier:

core.preferredRules = new Set([ KotlinParser.RULE_simpleIdentifier ]);

simpleIdentifier is a rule in the Kotlin parser grammar. It wraps the Identifier lexer rule, also allowing a few “soft keywords” to be used as identifiers. Soft keywords are words, such as ‘abstract’ or ‘annotation’, that only act as keywords when used in certain statements. Outside of those, we can use them as names, e.g., of variables.
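For reference, a simplified sketch of what such a rule looks like (the actual rule in the Kotlin grammar lists many more soft keywords):

simpleIdentifier
   : Identifier
   | ABSTRACT
   | ANNOTATION
   //...and so on
   ;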

Now, when we receive completion candidates, we can check whether we may suggest an identifier:

if(candidates.rules.has(KotlinParser.RULE_simpleIdentifier)) {
    completions.push(...suggestIdentifiers());
}

Let’s not forget that we still don’t want to suggest the literal keyword “identifier”:

candidates.tokens.forEach((_, k) => {
    if(k == KotlinParser.Identifier) {
        //Skip, we’ve already handled it above
    } else if(...) { ... } else {
        completions.push(parser.vocabulary.getSymbolicName(k).toLowerCase());
    }
});

Note also that we cannot avoid the above if statement by telling the engine to ignore Identifier tokens. If we did that, it would stop returning the simpleIdentifier rule as well, because of how antlr4-c3 works.

Refactoring the Grammar to Convey More Information

Now, simpleIdentifier tells us that we can insert a generic identifier, but we were trying to be more specific than that; indeed, many grammars don’t even have such a rule, and use an IDENTIFIER token directly.

However, the mechanism that we’ve just seen allows us to use the grammar to instruct the completion engine so that it’ll give more specific suggestions. The trick is to invent a new parser rule – say, variableRead – that delegates to simpleIdentifier:

variableRead: simpleIdentifier;

Then, we’ll use that rule whenever we can be sure syntactically that the identifier at that point can only refer to a variable. In order to do that, we may have to refactor the grammar. Since every grammar is different, if and how such a refactoring is possible will depend on how the grammar is structured.

For example, the Kotlin grammar that we’ve used lumps together variables and function/method calls with this rule:

postfixUnaryExpression
   : (atomicExpression | callableReference) postfixUnaryOperation*
   ;


In fact, atomicExpression includes simpleIdentifier among its alternatives, while postfixUnaryOperation includes function application (callSuffix):

postfixUnaryOperation
   : INCR | DECR | EXCL EXCL
   | callSuffix
   ...

So, we can restructure the grammar to accommodate our use case, in three steps:

  • Remove the function call syntax, callSuffix, from postfixUnaryOperation;
  • Remove the simpleIdentifier alternative in atomicExpression;
  • Disambiguate between variable reads and other identifiers in postfixUnaryExpression.

In the end, we might refactor postfixUnaryExpression like this:

postfixUnaryExpression
   : (variableRead | atomicExpression) postfixUnaryOperation*
   | (simpleIdentifier | callableReference | atomicExpression) (postfixUnaryOperation | callSuffix)*
   ;


If we perform this kind of refactoring, we have to check each use of the rules that we’ve modified. Fortunately, in this case, the only uses of atomicExpression and postfixUnaryOperation are in postfixUnaryExpression, so the impact is limited to the rules we’ve just modified. But watch out for mutually recursive rules! Of course, having a comprehensive test suite for the grammar helps considerably.

At last, we can now get the code completion engine to tell us exactly when we might insert a variable:

core.preferredRules = new Set([ …, KotlinParser.RULE_variableRead, ... ]);

Since the variableRead rule is only activated when an expression involves reading the contents of a variable, the engine will return this rule only when we can be sure that we can indeed suggest a variable.

If refactoring is not possible, or the language doesn’t make it easy to distinguish syntactically where a variable name might go, we can still traverse the parse tree to determine if a variable can really appear at the given point. However, that kind of work is beyond the scope of this tutorial.

The Symbol Table

With all that we’ve done so far, we still have a problem: how do we know which variables are visible in the scope that’s active under the caret?

Compilers and other translators typically employ a data structure called a symbol table to hold information about symbols (identifiers), such as name, scope, type, and other characteristics. We could build and use such a structure to keep track of which variables are visible where, so as to give appropriate suggestions.

Luckily for us, antlr4-c3 provides a symbol table implementation out of the box, which is pretty generic and adaptable to many languages. In particular, it already has classes to represent several kinds of symbols (variables and functions, classes and namespaces, etc.) and a class hierarchy that we can extend. We can also use it to model nested scopes and associate each symbol with a node in the parse tree.
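To give an idea of the API, here’s a minimal sketch (the symbol names are made up; the classes come from antlr4-c3):

import { ScopedSymbol, SymbolTable, VariableSymbol } from 'antlr4-c3';

//A symbol table with one global variable and one variable in a nested scope
const table = new SymbolTable("example", {});
table.addNewSymbolOfType(VariableSymbol, undefined, "globalVar");
const scope = table.addNewSymbolOfType(ScopedSymbol, undefined);
table.addNewSymbolOfType(VariableSymbol, scope, "localVar");
//getNestedSymbolsOfType recursively collects the variables in all scopes
let names = table.getNestedSymbolsOfType(VariableSymbol).map(s => s.name);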

Building the Symbol Table With a Visitor

We can populate such a symbol table with a parse tree visitor. Here, we won’t be showing a detailed implementation, but we’ll sketch a very simple visitor that only handles basic scoping, just to provide a track that you could follow.

Our visitor can make use of the classes and interfaces that antlr4ts provides and generates:

export class SymbolTableVisitor extends AbstractParseTreeVisitor<SymbolTable> implements KotlinParserVisitor<SymbolTable>

The visitor returns a symbol table: we’ll give this object a lifetime equal to or longer than that of the visitor, which makes the visitor not reusable:

protected defaultResult(): SymbolTable {
    return this.symbolTable;
}

That’s only one of the possible design choices; we could have created a new symbol table for each run of the visitor, instead.

We’ll initialize the symbol table in the constructor, and we’ll also use a field to keep track of the current lexical scope:

constructor(
    protected readonly symbolTable: SymbolTable = new SymbolTable("", {}),
    protected scope = symbolTable.addNewSymbolOfType(ScopedSymbol, undefined)) {
    super();
}

We’ll talk about scoping in the next section.

Now comes the important part: we have to record variable declarations whenever we find them. We do this by providing a visitVariableDeclaration callback:

visitVariableDeclaration = (ctx: VariableDeclarationContext) => {
   this.symbolTable.addNewSymbolOfType(VariableSymbol, this.scope, ctx.simpleIdentifier().text);
   return this.visitChildren(ctx);
};

Et voilà, we can now use this simple symbol table to suggest variable names:

if(candidates.rules.has(KotlinParser.RULE_variableRead)) {
    let symbolTable = new SymbolTableVisitor().visit(parseTree);
    completions.push(...suggestVariables(symbolTable));
}

We might define suggestVariables naively as follows:

function suggestVariables(symbolTable: SymbolTable) {
    return symbolTable.getNestedSymbolsOfType(VariableSymbol).map(s => s.name);
}

Handling the Scope of Identifiers

However, the above example doesn’t handle scoping at all: we suggest every variable name in the program at any point where a variable name could go, regardless of the language’s scoping rules.

As with the symbol table, we may already have code to compute the scoping of identifiers if we have an interpreter or compiler and if we can reuse parts of it. But, if we don’t, this section will help us build it.

So, let’s now see how we might handle scoping. Here, we’ll only introduce a new scope for each function declaration, to demonstrate the idea:

visitFunctionDeclaration = (ctx: FunctionDeclarationContext) => {
    return this.withScope(ctx, RoutineSymbol, [ctx.identifier().text], () => this.visitChildren(ctx));
};

Symbols in antlr4-c3’s symbol table act as scopes as well. In fact, they form a hierarchy, so it’s easy to create a chain of scopes from most specific to least specific and to keep track of them as we traverse the tree:

protected withScope<T>(tree: ParseTree, type: new (...args: any[]) => ScopedSymbol, args: any[], action: () => T): T {
    const scope = this.symbolTable.addNewSymbolOfType(type, this.scope, ...args);
    scope.context = tree;
    this.scope = scope;
    try {
        return action();
    } finally {
        this.scope = scope.parent as ScopedSymbol;
    }
}

We’ve copied the complex signature of the type parameter from the addNewSymbolOfType method, which we invoke to add the new symbol (scope) to the current symbol (scope).

Note that we also track the scope’s context. In our case, this is the portion of the parse tree where the scope is in effect: the function definition. We’ll use it later, to provide scoped suggestions.

Putting It All Together

Now that we track the scope of identifiers, we can use it to provide appropriate suggestions. We’ll rewrite our suggestVariables method to make use of the scope that we’ve computed at the previous step:

function suggestVariables(symbolTable: SymbolTable, context: ParseTree) {
    const scope = getScope(context, symbolTable);
    let symbols: Symbol[];
    if(scope instanceof ScopedSymbol) { //Local scope
        symbols = getAllSymbolsOfType(scope, VariableSymbol);
    } else { //Global scope
        symbols = symbolTable.getSymbolsOfType(VariableSymbol);
    }
    return symbols.map(s => s.name);
}

We’ll show the implementation of getAllSymbolsOfType soon, but first, notice that suggestVariables now needs the context in which to search for symbols. This context is, basically, the portion of the parse tree that lies under the caret, from which we can determine the applicable scopes.

The best possible place to compute the context is where we already compute the index of the token under the caret. So, we’ll change the computeTokenIndex… methods to computeTokenPosition… and we’ll make them return both the index and the context:

export type TokenPosition = { index: number, context: ParseTree };
export function computeTokenPosition(parseTree: ParseTree, caretPosition: CaretPosition): TokenPosition {
    //...
}

And in computeTokenPositionOfTerminalNode we’ll now return a TokenPosition:

return { index: parseTree.symbol.tokenIndex, context: parseTree };

Now that we’re all set, we can look at how to determine the applicable scope and all the variable names that are visible there.

First, to determine the scope, we leverage antlr4-c3’s symbol table support for scopes, starting from our context and going up the parse tree until we find a scope that matches:

function getScope(context: ParseTree, symbolTable: SymbolTable) {
    if(!context) {
        return undefined;
    }
    const scope = symbolTable.symbolWithContext(context);
    if(scope) {
        return scope;
    } else {
        return getScope(context.parent, symbolTable);
    }
}

Then, we’ll compute the set of visible variables by starting from the closest scope and going up:

function getAllSymbolsOfType<T extends Symbol>(scope: ScopedSymbol, type: new (...args: any[]) => T): T[] {
    let symbols = scope.getSymbolsOfType(type);
    let parent = scope.parent;
    while(parent && !(parent instanceof ScopedSymbol)) {
        parent = parent.parent;
    }
    if(parent) {
        symbols.push(...getAllSymbolsOfType(parent as ScopedSymbol, type));
    }
    return symbols;
}

Notes About Scoping

Our scoping implementation is only meant as a starting point. We haven’t handled several real-world concerns, including:

Symbols from other files: in our language, we might refer to names from a library that we imported, or from a superclass, etc. Here, we’re dealing with one file at a time.

Indexing: we’ve modeled scopes that form a linear nesting hierarchy. However, in a project with many source files with cross-references, scopes could easily branch into a big tree, and linearly searching it could become too inefficient. Real-world IDEs build indexes for our source code, libraries, platform code, etc.

Shadowing: many languages allow a variable to be redeclared in a narrower scope, thus shadowing the variable with the same name in some enclosing scope. In such cases, our code would return the same name twice.

Sorting: we haven’t explicitly sorted the results in any way – they’re implicitly sorted from most specific to least specific. That could be OK, but we may want to present them alphabetically, for example.
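As a hint for the last two points, here’s a minimal sketch that removes shadowed duplicates (keeping the first, most specific occurrence of each name) and sorts the result alphabetically:

function cleanSuggestions(names: string[]): string[] {
    //Names come ordered from the most specific scope to the least specific,
    //so keeping the first occurrence of each name preserves shadowing
    const unique = names.filter((name, i) => names.indexOf(name) == i);
    return unique.sort();
}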

How to deal with these and other improvements is left to the reader.

Towards a Real-World Implementation

We’ve gotten quite far already. In this section, we’ll look at some ways to improve what we’ve got, to smooth the user experience and advance towards a practical, real-world solution.

Partial Matches

Often, when we ask for code completion, we’ve already typed a part of a keyword or identifier and we expect the editor to complete it for us.

For example, let’s consider the following Kotlin statement:

fun test() {
    try {
        doSomething()
    } ca|
}

Here, we’ve represented the caret with a vertical bar. Having already typed the letters ‘ca’, we expect the engine to suggest the ‘catch’ keyword and not other keywords that do not begin with ‘ca’, such as ‘finally’.

However, the code that we’ve written so far does not take into account the characters that the user may have already typed. Let’s now see how we can use that information to improve our suggestions.

First of all, along with the token position under the caret, we’ll want to return the text to match:

export type TokenPosition = { index: number, context: ParseTree, text: string };

And in computeTokenPositionOfTerminalNode:

return {
    index: parseTree.symbol.tokenIndex,
    context: parseTree,
    text: parseTree.text.substring(0, caretPosition.column - start)
};

We can see that the text to match is the portion of the token up to the position of the caret. Or, we could have chosen to always use the full text of the token. The best strategy depends on the expectations of our users and the conventions of the editors that we’ll target. Anyway, most often the caret is at the end of the token when we ask for code completion, so the difference is not terribly important.

Filtering Completion Candidates

Now, let’s see how to make use of the new information when we compute code suggestions. We’ll want to execute a partial match only when the token under the caret is not one of the types that we’ve excluded, as we don’t want to match against a closed parenthesis or a semicolon:

const isIgnoredToken =
    position.context instanceof TerminalNode &&
    ignored.indexOf(position.context.symbol.type) >= 0;
const textToMatch = isIgnoredToken ? '' : position.text;

With that in mind, let’s filter the keywords that we suggest:

candidates.tokens.forEach((_, k) => {
    let candidate;
    if(k == KotlinParser.Identifier) {
        //Skip, we’ve already handled it above
    } /* else if(...) { //Tokens with names that don’t match with a keyword
        candidate = ...;
    }*/ else {
        candidate = parser.vocabulary.getSymbolicName(k).toLowerCase();
    }
    if(candidate) {
        maybeSuggest(candidate, textToMatch, completions);
    }
});

We’ll define the maybeSuggest function as follows:

function maybeSuggest(completion: string, text: string, completions: any[]) {
    if (tokenMatches(completion, text)) {
        completions.push(completion);
    }
}

In turn, it invokes the tokenMatches function:

function tokenMatches(completion: string, text: string) {
    return text.trim().length == 0 ||
           completion.toLowerCase().startsWith(text.toLowerCase());
}


We can see that we only filter when there’s actual text to match: if the text under the caret is empty or only whitespace, we consider every candidate a match.

Similarly, when suggesting variables, we’ll only execute a partial match if the caret is on a variableRead expression:

let variable = position.context;
while(!(variable instanceof VariableReadContext) && variable.parent) {
    variable = variable.parent;
}
//After the loop, variable is either a VariableReadContext or the root of the
//tree, so we check its type rather than just its truthiness
const textToMatch = variable instanceof VariableReadContext ? position.text : '';
return symbols.map(s => s.name).filter(n => tokenMatches(n, textToMatch));

And that’s it! We can now complete partial identifiers and keywords.

Introducing Fuzzy Matching

Many modern editors go one step further when implementing code completion, and employ fuzzy search algorithms to match identifiers.

For example, in a Java IDE, we could type the characters “.noa” after some variable name, and the editor might suggest the notifyAll() method call, which is available for all Java objects. notifyAll does not start with “noa”, but a fuzzy search algorithm matches it nevertheless, because all the characters of the search string appear in the target, in the same order.

We can easily achieve the above result by integrating a library called “fuzzysort”:

"dependencies": {
    "@types/node": "^13.9.8",
    "antlr4-c3": "^1.1.12",
    "antlr4ts": "^0.5.0-alpha.3",
    "fuzzysort": "^1.1.4",
    ...
}

We import it like this in TypeScript:

import * as fuzzysort from 'fuzzysort';

And we use it simply by invoking the method go:

fuzzysort.go(text, targets)

go returns a list of matching results sorted by relevance. Thus, we could simply rewrite the tokenMatches method like this:

export function tokenMatches(completion: string, text: string) {
    return text.trim().length == 0 ||
           fuzzysort.go(text, [completion]).length > 0;
}

and leave everything else as it is. That is, we’ll still be returning suggestions in the order that the engine computes them.

Sorting by Relevance

However, to use fuzzysort to its full potential and return results sorted by relevance, we need to change how we filter suggestion candidates: from matching each token separately, to collecting and matching them in a single invocation.

That is, we’ll turn the matcher method into a filter method:

export function filterTokens(text: string, candidates: string[]) {
    if(text.trim().length == 0) {
        return candidates;
    } else {
        return fuzzysort.go(text, candidates).map(r => r.target);
    }
}

Notice the .map(r => r.target); at the end: fuzzysort returns a list of results that includes a score and other information, but we’re only interested in the target text.
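For example (a hypothetical call; the exact ordering of multiple matches depends on fuzzysort’s scoring):

//Only "notifyAll" contains the characters n, o, a in that order
filterTokens("noa", ["notifyAll", "notify", "wait"]); //=> ["notifyAll"]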

Now, we’ll change how we compute suggestions. Let’s do it for variables first, by modifying the return expression in the suggestVariables function:

return filterTokens(textToMatch, symbols.map(s => s.name));

Second, let’s also change how we suggest keywords. Recall that we won’t be filtering them one by one; instead, we’ll accumulate them and then filter them all together:

let tokens = [];
candidates.tokens.forEach((_, k) => {
    if(k == KotlinParser.Identifier) {
        //Skip, we’ve already handled it above
    }/* else if(...) { //Tokens with names that don’t match with a keyword
        token.push(...);
    }*/ else {
        tokens.push(parser.vocabulary.getSymbolicName(k).toLowerCase());
    }
});
completions.push(...filterTokens(textToMatch, tokens));

Now we have fuzzy completion for both identifiers and keywords!

Increasing Accuracy

With all the improvements we’ve implemented, we’re approaching a respectable level of quality. However, our code completion component still has a weak spot: it can’t reliably determine a caret position that yields useful suggestions.

For example, consider this piece of code, where the caret is represented with a vertical bar:

val v = 1
fun test() {
    print(|)
}

If we placed the caret between the parentheses of the print call, as in the above code, and asked for code suggestions, we would expect the variable, v, to be proposed as a candidate for completion. However, that does not happen.

In general, poor results may have several causes. In our experience, the most common are:

  1. The lexer grammar skips some tokens, therefore our token stream has holes in it (positions in the character stream that correspond to no token)
  2. The computation of the caret position requires some manual tuning to produce optimal suggestions, because of the way antlr4-c3 works.

Let’s address both of the above.

Don’t Skip Whitespace!

In our case, point 1 is somewhat mitigated for Kotlin. Because some whitespace tokens, such as newlines, are significant in the language, the lexer doesn’t skip them. However, as in many other ANTLR grammars, the lexer still skips some characters, such as spaces and tabs.

This can be problematic when determining the correct position of the caret; for example, let’s consider the following statement:

val x = 1

The resulting parse tree won’t contain any whitespace tokens. Furthermore, since those are skipped, there will be no trace of them even in the underlying token stream. Therefore, if we were to place the caret between, say, the equal sign and the literal 1, our code would either find that the caret is on one of these two tokens or nowhere at all. That, of course, will not produce the meaningful suggestions that we desire.

So, as a first step, we should ensure that our grammar skips no tokens, and directs unwanted tokens to another channel. This expedient is suggested in the antlr4-c3 documentation.

For example, we might change the Kotlin lexer grammar to spell:

WS
   : [\u0020\u0009\u000C]
     -> channel(HIDDEN)
   ;

where channel(HIDDEN) is used instead of the skip action. Of course, we can also use another channel instead of HIDDEN, which is built into ANTLR.
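For instance, we could declare a dedicated channel in the lexer grammar (the name WHITESPACE here is our own choice):

channels { WHITESPACE }

WS
   : [\u0020\u0009\u000C]
     -> channel(WHITESPACE)
   ;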

If we follow that route, we should note that we’ve implemented getScope by going up the parse tree, but whitespace tokens won’t belong to any parse tree; rather, they’ll be TerminalNode instances that are disconnected from the tree. So, to fully benefit from this change, we’ll have to adapt getScope (and/or computeTokenPosition) to return the closest parse tree that contains the token, in case it has no parent. This is left as an exercise for the reader since it impacts several of the functions that we’ve written.

Adding Heuristics

Returning to the print() example, we can see that, regardless of whitespace, where the caret falls has implications for the suggestions that are produced. Is the caret at the end of the open parenthesis, or at the beginning of the closing parenthesis? This is important as antlr4-c3 suggests tokens that may go after the caret, but not before.

We could further improve accuracy by playing with heuristics, such as, “if the caret is at the beginning of a closed parenthesis token, then compute suggestions as if it was on the token before” (i.e. return the index of the preceding token when computing the position).
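A minimal sketch of such a heuristic, applied after computing the position (the specifics are our own invention):

//If the caret sits at the very start of a closing parenthesis,
//compute suggestions as if it were on the token before it
if(position.context instanceof TerminalNode &&
   position.context.symbol.type == KotlinParser.RPAREN &&
   caretPosition.column == position.context.symbol.charPositionInLine) {
    position = { ...position, index: position.index - 1, text: '' };
}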

Another trick that we’ve found helpful is the use of placeholder rules. Here, for example, we’ve introduced a placeholder rule to aid the engine in suggesting function arguments:

valueArguments
   : LPAREN (suggestArgument valueArgument (COMMA suggestArgument valueArgument)*)? RPAREN
   ;

suggestArgument: /* This is a placeholder for code completion */;

suggestArgument is a placeholder rule; it’s empty, so it always matches, and it remains in the parse tree, giving a hint to the engine.

Of course, we’ve changed the variable suggestion code accordingly:

if(candidates.rules.has(KotlinParser.RULE_variableRead) ||
   candidates.rules.has(KotlinParser.RULE_suggestArgument)) {
    let symbolTable = new SymbolTableVisitor().visit(parseTree);
    completions.push(...suggestVariables(symbolTable, position));
}

In general, this is not an exact science. You’ll need to perform some experimentation to identify the solution that provides the best accuracy. We’ve built our experience with many different languages, and while we’ve developed our little toolbox of tricks to improve code completion, we can say that there’s no single one-size-fits-all solution.

Improving Performance

The performance of code completion is generally good but not predictable. As we said, antlr4-c3 works by simulating the parser’s state machine and computing all the reachable states from the selected token. Thus, it’s possible to incur exponential growth of the state space to explore. In such a case, computing suggestions even for simple expressions can take an unacceptably long time.

To reduce the engine’s search space, we can pass a second parameter to collectCandidates:

core.collectCandidates(tokenIndex, context);

The context parameter refers to a portion of the parse tree; the search will be constrained to only examine candidate nodes that fit inside it. The context could be the current block, or function, or class; it’s a matter of balancing performance with accuracy and number of results.
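For example, here’s a sketch that constrains the search to the enclosing function declaration, if any, starting from the context we computed for the caret (FunctionDeclarationContext is the class generated from the grammar’s functionDeclaration rule):

//Walk up from the node under the caret to the enclosing function declaration
let node: ParseTree | undefined = position.context;
while(node && !(node instanceof FunctionDeclarationContext)) {
    node = node.parent;
}
//If we found one, the engine only explores that subtree;
//otherwise, it searches the whole file
let candidates = core.collectCandidates(position.index, node);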

Even with a reasonable context parameter, some expressions may still take too long to complete. Then, with a little hack, we can at least bail out of completion altogether, to avoid rendering the editor unresponsive:

//Note: TimeoutException is our own error class, and patching the prototype
//affects every CodeCompletionCore instance
class TimeoutException extends Error {}

let processRuleMethod = Object.getPrototypeOf(core).processRule;
const start = new Date().getTime();
const MAX_DURATION_MS = 1000;
Object.getPrototypeOf(core).processRule = (...args) => {
    let now = new Date().getTime();
    if (now - start > MAX_DURATION_MS) {
        throw new TimeoutException();
    }
    return processRuleMethod.call(core, ...args);
}

However, that’s rarely necessary in our experience.

Conclusions

In this tutorial, we’ve implemented a code completion engine for a small subset of Kotlin. We started simple and gradually introduced the techniques and tricks that make the difference between a toy example and a real-world component.

For further questions and inquiries, contact us at Strumenta: this is our daily bread.

The full code accompanying this tutorial is available on GitHub.

Read more:

If you want to know how to use ANTLR, you can read our article about The ANTLR Mega Tutorial.
