Writing an Editor for PlantUML

Written by

Alessio Stalla

Introduction

In this tutorial, we’ll write an editor for a subset of PlantUML, as an extension to Visual Studio Code. However, we won’t start from scratch like most tutorials do (including our own on implementing high-quality code completion in VSCode with ANTLR and the Language Server Protocol). Instead, we’ll use this example to show how we typically work, with the hope of giving an idea of what language engineering is in practice.

PlantUML is a tool to author UML diagrams using plain text. With it, we can describe diagrams of several kinds (including class diagrams, sequence diagrams, activity diagrams, and others) using a syntax that resembles pseudo-code, or what one would write in words and symbols on a piece of paper.

To edit PlantUML files, we can use any text editor. However, as we’ve often said in other articles, using an editor that understands the language being written has several advantages. This “language intelligence” allows for features such as syntax highlighting, real-time error reporting, and code completion. The code becomes easier to read, write, and maintain. New users can learn and become proficient more quickly.

PlantUML is also interesting because as a language it’s not as formally strict as most programming languages and DSLs; it doesn’t have a formal grammar and it’s meant to be embedded in any plain text document. We’ll see how this and other design decisions in the language can impact the effort required to write a good editor for it.

This tutorial is structured as follows:

Laying the Groundwork, where we do preliminary research;
Setting Up, where we prepare our environment for the actual development;
Evolving the Grammar, where we show how to improve on an existing PlantUML ANTLR grammar;
Improving the Extension, where we integrate a proper parser into an existing extension for Visual Studio Code;
Wrapping Up and Further Improvements, where we draw our conclusions and provide some ideas for further development.

All the code accompanying this tutorial is on GitHub.

A Visual Studio Code editor panel with some sample PlantUML code, with a code completion menu showing that the editor understands class names from the code. — Final result at the end of the tutorial: an editor with syntax highlighting, error reporting and code completion

Laying the Groundwork

In the following sections, we’ll look at what options we have and plan a course of action. Then, we’ll be ready for the actual coding.

Choosing an Editor Platform

So, we want to write an editor for PlantUML. Where do we start? An editor is a complex application, and one doesn’t usually start from a blank canvas. We have many existing editors to choose from that offer some kind of plugin or extension mechanism and take care of all the complicated rendering and editing work.

Which editor platform to choose largely depends on two factors:

the needs and customs of our users;
our learned skills and the tools at our disposal.

If we and our users are hardcore Linux hackers, then vim is a fine choice. If we and/or our users have a preference for Emacs, and a solid Lisp background, then Emacs is the right choice. Et cetera.

In general, Visual Studio Code is a popular editor platform these days. Its modern UI is appealing for users on major desktop operating systems, and developers who are familiar with JavaScript/TypeScript can quickly pick it up. It’s also pretty mainstream, so it’s got a thriving community that’s developing lots of extensions.

Plus, it leaves the door open to porting the editor to the web without a complete rewrite. This is because VSCode is written in JavaScript using Web technologies, and you can either reuse parts of it in web applications – most notably, the Monaco editor – or even deploy full-blown extensions into a web IDE thanks to Eclipse Theia.

So, absent any other considerations, VSCode looks like a fine default choice. Actually, here at Strumenta we have plenty of experience with VSCode; we’ve developed internal libraries and open-source tools that take care of many common tasks, and written several articles about it. It’s our go-to platform for writing desktop editors. However, that won’t be true for everyone, so let’s not take it into consideration.

Doing Some Research

So, we have a hint that VSCode could be our platform of choice. Let’s do some more research. After all, someone else might already have faced our same problem.

Searching for “PlantUML editor” on popular search engines returns a bunch of editors, for the desktop and the web, that all lack the language intelligence features we’re after – such as syntax highlighting and code completion. Also, they don’t look like they’re built on an editor platform, so we have no idea how easy it would be to extend them – we’d need to study them, and that would quite possibly require more work than actually extending the editor!

Scrolling a bit further down the search results, however, we find:

https://marketplace.visualstudio.com/items?itemName=jebbs.plantuml
PlantUML – Visual Studio Marketplace
Apr 6, 2021 — Generate URLs. Multi-Page Diagram support. From local or server. Image map (cmapx) support. Editing Supports. Format PlantUML code.

This looks promising. It’s an open-source extension on the VSCode marketplace. The code is hosted on GitHub. It’s got advanced features such as Auto-update (the diagram image automatically updates to reflect the text), it’s got syntax highlighting, and some form of code completion.

We can easily try it since it’s on the marketplace.

Indeed, we can confirm that the extension has the features that are described on its marketplace page. Notice that code completion doesn’t seem to be particularly aware of the context – it suggests more or less the same words everywhere.

However, to better evaluate what we have, we need to dig into the code.

Evaluating the Extension’s Source Code

Our next step is thus to clone the extension from GitHub and study the source:

git clone https://github.com/qjebbs/vscode-plantuml

Actually, for writing this article, we’ve forked the extension on GitHub and cloned the forked repository, where we’ll also host our work.

The structure of the extension is not trivial; we can find several directories under src. Roughly, the features of the extension come from the providers found in src/providers, that in turn use other services.

Without going too deep into the details, we can note several facts:

The extension doesn’t include a grammar.
Code completion is indeed not context-dependent, since the tool doesn’t really understand the structure of PlantUML diagrams; it takes the list of keywords from PlantUML itself, invoked from the command line, and assumes that every word which is not a keyword is a variable, regardless of its position.
It can only report two kinds of warnings: when you don’t name your diagram, and when you use the same diagram name twice.

So, to take the extension to the next level, we’ll want to integrate a PlantUML parser.

A PlantUML Grammar

To provide the best language intelligence efficiently, we leverage ANTLR and the libraries and tools built upon it, when possible. We could also use other tools, or even write a parser by hand; however, that would sacrifice accuracy and ease of maintenance, and/or require significantly more time and effort. So, the next step in our research is to look for an existing ANTLR grammar and parser for PlantUML.

The first stop is the official website. However, PlantUML itself doesn’t have an official grammar. The language is apparently implemented with regular expressions and ad-hoc code.

The list of open-source ANTLR4 grammars on GitHub doesn’t mention PlantUML, either. So, we’ll have to search for a suitable project.

It turns out that several grammars exist, but the most popular ones are written in PEG.js or other notations. Eventually, we found an open-source (BSD license), partial PlantUML ANTLR grammar here: https://github.com/jgoppert/pumlg/blob/master/src/pumlg/parser/Pumlg.g4

By looking at it, we can see that it only supports Class diagrams, but it’s a start.

Issues With the Grammar

The quality of a parser is the combination of several factors:

How many valid code listings it can parse
How the resulting tree fits with the intended purposes of the parser (validating code in an editor, providing code completion, translating the language, …)
The quality of error messages and error recovery
Performance
And possibly others.

When using a parser generator such as ANTLR, some of those are properties of the generator itself, while others depend on how we write the grammar. In particular, ANTLR guarantees good error handling and generally acceptable performance without tweaking the grammar.

The other characteristics are up to us as grammar authors. Here, we’re mostly interested in the following aspects:

We want nice automatic code completion;
We want the parser to be somewhat lenient.

The last point merits further discussion. The grammar that we’ve found, as is typical for formal languages, doesn’t accept much “noise” in the code; it assumes a strict format.

However, we have several reasons to desire a more lenient parser. Code in an editor is often malformed, and on top of that, PlantUML is designed so that we can include it as a section in some text documentation file (e.g. written in Markdown, or even just plain text). Also, our grammar only recognizes a subset of PlantUML, but we want to be able to extract as much information as possible from the input. For example, we’ll want to recognize class declarations and macro declarations, even if they’re in the middle of valid code that we can’t parse (i.e., code that our parser cannot distinguish from malformed code).

So, we’ll have to accommodate more “noise” in the grammar. This will create some tension with its accuracy, because we’ll increase the possibility of wrong parses – detecting a structure that the writer didn’t really intend to express in the code. We’ll also risk degrading the quality of code completion: if we allow for all sorts of tokens to go more or less everywhere, then the code completion engine will always suggest tokens that are not really meaningful in the specific context. We’ll return to this when we discuss code completion.

Planning Before Coding

So many words spent, and not a line of code yet? Here, we’re trying to show the processes that lead to writing the code one way and not another. So, we still have to lay out our plan.

We have an editor that lacks some features, or doesn’t implement them adequately. We have a partial grammar and we’ve identified some of its limitations. So, our plan is, unsurprisingly, to integrate the grammar into the editor, with the goal of providing better error reporting and code completion.

To do that, we’ll modify the grammar to better deal with malformed input and with parts of PlantUML that we don’t support yet. Then, we’ll integrate a code completion engine (we’ve talked extensively about that in other articles).

Setting Up

So, let’s get into the code. We’ve already forked/cloned the extension repository: we now need to prepare it for actual development.

The first step is to install the dependencies:

npm install

Then, we can open the extension in Visual Studio Code. There, from the run/debug panel, we can either launch a new instance of VSCode with the extension installed to try it out, or we can run the tests – again, in a fresh instance of VSCode.

The run/debug menu in VSCode, showing the two options “Run Extension” and “Launch Tests”. — The run/debug menu in VSCode

Let’s launch the extension and open some PlantUML files. From the package.json descriptor, we learn that the editor is used for files with the following extensions: .wsd, .pu, .puml, .plantuml, .iuml.

However, if we locate and open such a file, or create a new one, we’ll notice that syntax highlighting is missing. In fact, in order to get syntax highlighting in the editor, we have to generate an XML descriptor from the YAML syntax description that comes with the project. There’s a script to do that:

npm run build-syntax

You can study what the script does if you want, to learn more about how syntax highlighting works in Visual Studio Code. The entry point is the scripts section in the package.json file.

Also, note that to activate some features we need PlantUML itself, which doesn’t come bundled with the extension. So, we can download plantuml.jar from the official website and place it under the project root, next to the package.json file.

We can now try running the tests if we so wish. In theory, we can do it from the command line with:

npm run test

However, we weren’t able to get that working while writing this tutorial. Fortunately, tests appear to run fine if launched from the run/debug menu in the VSCode GUI.

Extending the Extension

Now that we have the extension project ready for development, we can add the bits that are missing. We’ll start with the grammar.

First, we can download it from https://github.com/jgoppert/pumlg/blob/master/src/pumlg/parser/Pumlg.g4 into the folder src/grammar that we’ll have created.

Next, we ought to generate a parser from the grammar. Usually, one does it with the ANTLR Java executable. However, since we’ll be using the antlr4ts (TypeScript) runtime as we’ve done in other tutorials, we’ll have to use the antlr4ts-cli project instead.

So, let’s add antlr4ts-cli to the devDependencies section of our package.json (dependencies that are only available during development, and not in the delivered extension):

"antlr4ts-cli": "^0.5.0-alpha.4",

Then, let’s add the antlr4ts runtime itself to the dependencies section, since it contains the classes that are necessary at runtime to implement the parser:

"antlr4ts": "^0.5.0-alpha.4",

We can now add a new script to build the parser, in the scripts section:

"build-parser": "antlr4ts -o src/parser -Xexact-output-dir -visitor src/grammar/*.g4",

Running it with npm run build-parser should create some TypeScript files under src/parser. We may have to manually create the directory the first time. Also, it’s a good idea to exclude that directory from version control, since it only contains generated files.

Writing Our First Test

Our extension now contains a parser, but nothing is making use of it. We can write a test that exercises the parser on some simple input. Let’s create a file called parser.test.ts under the test directory, with the following content:

import { expect } from "chai";
import { suite, test } from "mocha";
import {CharStreams, CommonTokenStream, Token} from "antlr4ts";
import {PumlgLexer} from "../src/parser/PumlgLexer";
import {PumlgParser} from "../src/parser/PumlgParser";

suite("Parser Tests", () => {
   test("Empty diagram", () => {
       const cs = CharStreams.fromString("@startuml\n@enduml");
       const ts = new CommonTokenStream(new PumlgLexer(cs));
       const parser = new PumlgParser(ts);
       parser.uml();
       // 1) The parser reported no syntax errors
       expect(parser.numberOfSyntaxErrors).to.equal(0);
       // 2) The parser consumed the entire input: the next token is EOF
       expect(parser.inputStream.LA(1)).to.equal(Token.EOF);
   });
});

We test that 1) the parser does not produce any syntax errors, and 2) the parser consumes the entire input. This is a common gotcha: the absence of errors doesn’t necessarily mean that the entire input is valid; it depends on how the grammar is written. So, in general, it’s good practice to check that the input stream was indeed consumed in full. We’ll say more about this later.
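To make the gotcha concrete without involving ANTLR at all, here is a minimal toy model. The hypothetical matchSequence function plays the role of a parser rule that doesn't end with EOF: it matches what it expects and then simply stops. All the names below are made up for this illustration, and this is not how ANTLR works internally.

```typescript
// Toy model of a parser rule that doesn't end with EOF (not ANTLR code).
interface MatchResult {
    errors: number;   // expected tokens that were not found in the input
    consumed: number; // how many input tokens the "rule" consumed
}

// Hypothetical helper: matches `expected` against the start of `input`, then stops.
function matchSequence(input: string[], expected: string[]): MatchResult {
    let errors = 0;
    let consumed = 0;
    for (const tokenText of expected) {
        if (consumed < input.length && input[consumed] === tokenText) {
            consumed++;
        } else {
            errors++;
        }
    }
    return { errors, consumed };
}

// The second check of our test: was the whole input consumed?
function consumedEverything(input: string[], result: MatchResult): boolean {
    return result.consumed === input.length;
}

const umlRule = ["@startuml", "@enduml"];
const cleanInput = ["@startuml", "@enduml"];
const trailingInput = ["@startuml", "@enduml", "class", "Foo"];

const cleanResult = matchSequence(cleanInput, umlRule);
const trailingResult = matchSequence(trailingInput, umlRule);
```

Both results report zero errors, but only the first one passes the consumedEverything check. The trailing "class Foo" is silently ignored, which is exactly what the second assertion in our test is meant to catch.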

We can run just our test from the command line with:

mocha -r ts-node/register test/parser.test.ts

This test will also be run as part of the entire test suite when we launch it from VSCode.

Also, note that in the GitHub repository we’ve updated all the test libraries. You may have to do the same to actually run the test successfully.

Evolving the Grammar

As we’ve said, the grammar that we’ve found has several shortcomings, and we’ll now proceed to improve it. However, there isn’t a single dimension along which to improve it, and some goals can actually conflict with each other (for example, better performance vs better maintainability).

Here, we want to use the grammar for code completion, so our improvements will go in that direction. We may make the grammar actually worse for other tasks (e.g. compilation of a diagram into an image).

Code Completion and Discoverability

However, we’ve still been a bit too vague when talking about code completion. We can actually refer to two slightly different goals with that term.

One target is speeding up the work of the experienced coder. Such a person already knows what to type, most of the time; they just need to do it faster, because spelling all the characters in vendingMachine.makeMyCoffeeAsUsual() is cumbersome and interrupts the flow. So, they will start typing a few characters – v, e, n for example for “vending”, or m-m-c for “make-my-coffee” – and the editor will suggest the rest, completing a partially typed identifier or keyword.

A different goal is making the language more discoverable by inexperienced users. These people will often not know what to type in advance – they won’t remember the syntax and won’t know the standard library. So, the editor would help them by suggesting potential keywords and identifiers without a partial match. They would use a keyboard shortcut, or trigger completion after one of a few selected special characters (such as the dot in most object-oriented languages in the C-like family).

These two objectives overlap and we use the same technologies and techniques to reach them, but sometimes they can be at odds with each other – altering the grammar to improve completion of partial identifiers may worsen the quality of suggestions when no partial word has been typed, or vice versa. We should keep that in mind to decide what to prioritize in case a conflict arises.
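As a rough illustration of the two goals, the following sketch shows one hypothetical suggest function serving both. With an empty input it returns every candidate (discoverability); with some typed characters it narrows the list by prefix or by camel-case initials (speed). The function names and the matching rules are assumptions made for this example, not the logic of the extension or of VSCode.

```typescript
// Hypothetical helpers, not part of the extension.

// Splits a camelCase identifier into lowercase words,
// e.g. "makeMyCoffee" -> ["make", "my", "coffee"].
function camelWords(identifier: string): string[] {
    const parts: string[] = identifier.match(/[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])/g) || [];
    return parts.map(p => p.toLowerCase());
}

function matchesTyped(identifier: string, typed: string): boolean {
    const lower = typed.toLowerCase();
    // Plain prefix match: "ven" matches "vendingMachine".
    if (identifier.toLowerCase().indexOf(lower) === 0) {
        return true;
    }
    // Initials match: "mmc" matches "makeMyCoffee".
    const initials = camelWords(identifier).map(w => w.charAt(0)).join("");
    return initials.indexOf(lower) === 0;
}

function suggest(candidates: string[], typed: string): string[] {
    // Nothing typed yet: the "discoverability" case, so we offer every candidate.
    if (typed.length === 0) {
        return candidates.slice();
    }
    // Something typed: the "speed" case, so we narrow the list down.
    return candidates.filter(c => matchesTyped(c, typed));
}

const knownWords = ["vendingMachine", "makeMyCoffee", "class", "enum"];
const byPrefix = suggest(knownWords, "ven");   // ["vendingMachine"]
const byInitials = suggest(knownWords, "mmc"); // ["makeMyCoffee"]
const everything = suggest(knownWords, "");    // all four candidates
```

Note that the quality of the first case depends entirely on how good the candidate list is, which is where the grammar comes into play.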

Making the Grammar More Lenient

Having said that, we’ll now start improving on our grammar. In particular, we want to make it more lenient, i.e. tolerant of malformed input. Actually, as we’ve said, “malformed input” is possibly valid code that uses parts of PlantUML that we don’t support yet.

So, let’s identify some potential problems in our grammar. The main entry point is the following rule (edited for readability):

uml: '@startuml' (NEWLINE | class_diagram) '@enduml';

This has a few issues:

A UML diagram can be embedded in surrounding text, which is not parsed and can be anything. Our grammar doesn’t handle that.
It only handles class diagrams – the parse fails with other diagrams.
It doesn’t end with EOF. Therefore, it allows for arbitrary text to go after an instance of “uml” has been matched. This is not a problem in our particular case, since we want to allow for arbitrary leading and trailing text anyway; however, it’s worth mentioning because it’s a common pitfall.

Actually, if we study the class_diagram rule and its subrules, we’ll see that this grammar doesn’t even support the whole class diagram language. On the PlantUML website, the first example of a class diagram is the following:

@startuml
abstract        abstract
abstract class  "abstract class"
annotation      annotation
circle          circle
()              circle_short_form
class           class
diamond         diamond
<>              diamond_short_form
entity          entity
enum            enum
interface       interface
@enduml

However, nowhere in our grammar do we see support for the diamond or circle elements, for example.

Clearly, we want to recognize as many diagram elements as possible, and provide code completion while the user is writing them, even when we don’t recognize some other parts of the diagram.

So, we can change our grammar’s first few rules into:

umlFile: (text=.*? embeddedUml)* text=.*? EOF;
uml: embeddedUml EOF;
embeddedUml: STARTUML ident? NEWLINE diagram? ENDUML;

diagram: class_diagram;

See how we’ve added the EOF token and we’ve provided two entry points – one for an entire file with multiple UML sections, each preceded and followed by any combination of tokens, and one for a single, extracted UML section. We’ll see which one works best for our editor.

We’ve also added an extra diagram rule in case we’ll want to add new types of diagrams in the future. For now, it simply continues into a class_diagram.

Then, we’ll also modify the class_diagram rule to allow for some extra “noise”, i.e., text that we can’t parse properly, but we want to ignore (instead of producing spurious parse errors and possibly causing proper code not to be understood as well):

class_diagram:
   (class_diagram_noise_line*
    (class_declaration | connection | enum_declaration | hide_declaration)
    NEWLINE
    class_diagram_noise_line*)+
   ;
class_diagram_noise_line: (~(CLASS | ENUM | HIDE | CONNECTOR | NEWLINE) .*?)? NEWLINE;

Here, every line that does not start with one of the known class diagram keywords is considered “noise” and added to the parse tree as it is. The consuming code will be free to ignore it and extract the information it needs from the valid parts (declarations and connections).
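The idea behind the noise rule can be approximated with a few lines of plain TypeScript that work on text lines rather than tokens. This is only a toy: the keyword list stands in for the CLASS, ENUM and HIDE tokens, connections are left out for simplicity, and the function names are hypothetical.

```typescript
// Toy, line-based approximation of the class_diagram_noise_line idea (not ANTLR code).
// Assumption: these words stand in for the CLASS, ENUM and HIDE tokens of the grammar.
const KNOWN_LINE_STARTS = ["class", "enum", "hide"];

interface ClassifiedLine {
    kind: "declaration" | "noise";
    text: string;
}

// A line is a declaration if its first word is a known keyword, noise otherwise.
function classifyLines(diagramBody: string): ClassifiedLine[] {
    return diagramBody.split("\n").map((line): ClassifiedLine => {
        const firstWord = line.trim().split(/\s+/)[0];
        const known = KNOWN_LINE_STARTS.indexOf(firstWord) >= 0;
        return { kind: known ? "declaration" : "noise", text: line };
    });
}

// The consuming code ignores the noise and keeps what it understands.
function declarationsOf(diagramBody: string): string[] {
    return classifyLines(diagramBody)
        .filter(l => l.kind === "declaration")
        .map(l => l.text.trim());
}

const sampleBody = "class Foo\ncircle bar\nenum Color\n<> diamond_short_form";
const understood = declarationsOf(sampleBody); // ["class Foo", "enum Color"]
```

The unsupported circle and diamond lines are kept as noise instead of being reported as errors, and the two declarations around them are still recognized.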

Making the Grammar Better Suited for Code Completion

As it is, the grammar has other significant issues with respect to code completion.

First, it skips whitespace:

WS : [ ]+ -> skip ; // toss out whitespace

That’s a problem for the code completion engine we’ll be using, as we’ve shown in another tutorial. Fortunately, the fix is easy:

WS: [ ]+ -> channel(HIDDEN);

Interestingly, the grammar does correctly push comments to another channel rather than skipping them. This is specified immediately above the WS rule that we’ve just modified. We can improve on comment handling, too, but let’s not get distracted now.

Another important issue is the absence of a fallback token. What does this mean? Our present lexer does not assign a type to all characters; it will report a lexer error if it encounters, e.g., an accented character such as “à”, and (with the default error strategy) it will skip the token entirely. This is problematic for two reasons:

In general, the input of the code completion engine is a stream of tokens, and if a token is missing, we may get less accurate results.
Specifically in our grammar, we’ve introduced “arbitrary text” sections, but we’ve only done that in the parser – the lexer will still report an error every time it encounters a character that we’ve not mapped.

Again, this is pretty easy to fix by adding the following rule at the end of the lexer grammar:

ANYTHING_ELSE: .;

The ANYTHING_ELSE rule will match any character (that was not matched by previous rules). Since we’ve not used ANYTHING_ELSE in any of the parser rules, it will generally cause a parsing error down the line.

The only exceptions are the “text” and “noise” rules in the parser, where we’ve included a dot (.) that matches any possible token coming from the lexer – including ANYTHING_ELSE. So, in unparsed text lines, we allow for any sequence of characters, while in code that we know how to parse, unknown characters will cause a parse error.
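To see the effect of a fallback token in isolation, here is a small hand-written lexer. It is a toy that only knows words and spaces, it is not what ANTLR generates, and the error message format is merely an imitation. With the fallback disabled, unknown characters are reported and dropped; with the fallback enabled, they become ANYTHING_ELSE tokens and nothing is lost.

```typescript
// Toy lexer (not generated by ANTLR) showing the effect of a catch-all rule.
interface ToyToken {
    type: "WORD" | "WS" | "ANYTHING_ELSE";
    text: string;
}

interface LexResult {
    tokens: ToyToken[];
    errors: string[];
}

function toyLex(input: string, withFallback: boolean): LexResult {
    const tokens: ToyToken[] = [];
    const errors: string[] = [];
    let i = 0;
    while (i < input.length) {
        const rest = input.substring(i);
        const word = /^[A-Za-z_][A-Za-z0-9_]*/.exec(rest);
        const ws = /^[ ]+/.exec(rest);
        if (word) {
            tokens.push({ type: "WORD", text: word[0] });
            i += word[0].length;
        } else if (ws) {
            tokens.push({ type: "WS", text: ws[0] });
            i += ws[0].length;
        } else if (withFallback) {
            // Last rule: any single character that no previous rule matched
            tokens.push({ type: "ANYTHING_ELSE", text: input.charAt(i) });
            i++;
        } else {
            // No rule matches: report an error and drop the character
            errors.push("token recognition error at: '" + input.charAt(i) + "'");
            i++;
        }
    }
    return { tokens, errors };
}

const strictLex = toyLex("citt\u00e0 bella", false);
const lenientLex = toyLex("citt\u00e0 bella", true);
```

In strictLex the accented character produces an error and disappears from the token list, while in lenientLex it survives as a token that the parser can accept in "text" and "noise" sections or reject elsewhere.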

Side Note: Handling Comments

Now that we’re at it, we can improve the handling of comments that we set aside earlier. In fact, we can note that newlines (represented with the NEWLINE token) are sometimes significant in the language – for example, to separate the attributes of a class. As it is, the grammar correctly parses the following class declaration:

class Foo {
    bar
    baz
}

Indeed, we can verify that the following expression evaluates to true:

parser.class_declaration().attribute().length == 2

However, let’s look at what happens if we add a comment:

class Foo {
    bar //Yes, I said bar
    baz
}

Now the count of attributes is 1! The parser mistakenly fuses together the “bar” and “baz” attributes into a single “barbaz” attribute. This is because it “swallows” the newline character as part of the comment. We can fix that by treating line comments as if they were newlines:

LINE_COMMENT: ('/' '/' .*? '\n') -> type(NEWLINE);
BLOCK_COMMENT: ('/*' .*? '*/') -> channel(HIDDEN);
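The effect of this change can be reproduced with a toy model, again independent of ANTLR. In the sketch below, a hypothetical lexBody function either drops line comments together with their newline, or replaces them with a NEWLINE token, and countAttributes counts runs of words separated by NEWLINE. The names and the simplified notion of "attribute" are assumptions made for this example.

```typescript
// Toy model (not ANTLR code): attributes in a class body are separated by NEWLINE tokens.
type BodyToken = { type: "WORD"; text: string } | { type: "NEWLINE" };

function lexBody(body: string, commentsAsNewlines: boolean): BodyToken[] {
    const tokens: BodyToken[] = [];
    // Alternatives, in order: a line comment including its newline, a bare newline, a word.
    const pattern = /\/\/[^\n]*\n|\n|[A-Za-z_]\w*/g;
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(body)) !== null) {
        const text = m[0];
        if (text.indexOf("//") === 0) {
            // Hiding the comment hides its newline too, unless we retype it as NEWLINE.
            if (commentsAsNewlines) {
                tokens.push({ type: "NEWLINE" });
            }
        } else if (text === "\n") {
            tokens.push({ type: "NEWLINE" });
        } else {
            tokens.push({ type: "WORD", text: text });
        }
    }
    return tokens;
}

// Here, an attribute is a run of consecutive WORD tokens between NEWLINEs.
function countAttributes(tokens: BodyToken[]): number {
    let count = 0;
    let insideAttribute = false;
    for (const token of tokens) {
        if (token.type === "WORD") {
            if (!insideAttribute) {
                count++;
            }
            insideAttribute = true;
        } else {
            insideAttribute = false;
        }
    }
    return count;
}

const classBody = "bar //Yes, I said bar\nbaz\n";
const swallowed = countAttributes(lexBody(classBody, false)); // 1
const preserved = countAttributes(lexBody(classBody, true));  // 2
```

When the comment swallows the newline, "bar" and "baz" end up in the same run and we count a single attribute; when the comment is turned into a NEWLINE, we count two, as expected.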

But wait, there’s more. The grammar actually gets comment characters wrong; it uses Java-style comments, while PlantUML uses the quote character. And even PlantUML itself contradicts its own documentation – it actually parses an attribute followed by a comment starting on the same line as if the text of the comment were part of the attribute! Check this out:

@startuml
'This comment is correctly skipped
class Foo {
   bar 'This ought to be a comment, but
   'This is skipped as well
   baz /' PlantUML really
has a bug here '/ quux ' and it's parsing everything as a single attribute!
}
@enduml

An image of a UML class element where PlantUML fails to ignore comments, inserting them as part of attribute names.

This shows the importance of having an actual grammar rather than relying on regular expressions, as well as of having a good test suite for your parser.

Naming Tokens

Our grammar uses character strings in parser rules for keywords and punctuation, like so:

uml: '@startuml' (NEWLINE | class_diagram) '@enduml';

This has a few issues:

In general, the precedence among tokens is not clear.
For keywords, implicit tokens prevent us from reusing the token name as a suggestion for code completion, which we can do when the token name matches the keyword (e.g. FOR: 'for';).
Generated token names have no connection to their meaning; it’s much more readable to exclude the token COMMA from code completion than, say, token T__0.

So, before going forward, we’ll replace all the character string literals in the parser rules with meaningful token names.

Distinguishing Among Identifier Types

The last improvement that we can readily apply to the grammar to get better results from code completion is a way of distinguishing among identifiers. We’ve talked about it in another tutorial, so for a more thorough explanation, please refer to that one.

Basically, the code completion engine is able to tell us which rules can start at the position we give it as input. However, if we have overly generic rules such as “identifier” (or ident in the present PlantUML grammar), we don’t have enough information to suggest only the appropriate names – e.g., variable names but not function names.

For example, in a PlantUML class diagram we can represent connections between classes. The left side of a connection has the following syntax:

connection_left: ident (DQUOTE attrib=ident MULTIPLICITY? DQUOTE)?;

The right side is similar to the left:

connection_right: (DQUOTE attrib=ident MULTIPLICITY? DQUOTE)? ident;

However, we know that we don’t want every possible identifier there, only class names. We can therefore refactor the grammar to convey that kind of information without modifying the language that it recognizes:

connection_left: class_name (DQUOTE attrib=ident MULTIPLICITY? DQUOTE)?;
connection_right: (DQUOTE attrib=ident MULTIPLICITY? DQUOTE)? class_name;

class_name: ident;

In general, we’d apply similar transformations over the entire grammar; however, for the sake of the example, we’ll only handle class names.

Improving the Extension

We’re finally ready to integrate the parser into the extension, to reap the benefits of our changes. Recall that we use the build-parser script to regenerate the parser from the grammar:

npm run build-parser

Reporting Errors

Now let’s hook the parser into the error reporting system, so that parse errors will be underlined in red in the editor. This is boilerplate code that doesn’t usually change much from one project to another – we need to add an error listener to both the parser and the lexer that reports any error as a diagnostic message for VSCode:

class ReportingLexerErrorListener implements ANTLRErrorListener<number> {
   syntaxError? = <T extends number>(recognizer: Recognizer<T, any>, offendingSymbol: T | undefined, line: number, charPositionInLine: number, msg: string, e: RecognitionException | undefined) => {
       let range = null;
       if(e) {
           let token = e.getOffendingToken();
           range = rangeOfToken(token, d);
       }
       let diagnostic: vscode.Diagnostic = {
           severity: vscode.DiagnosticSeverity.Error,
           range: range,
           message: msg,
           source: 'PlantUML syntax checker'
       };
       diagnostics.push(diagnostic);
   };
}
lexer.addErrorListener(new ReportingLexerErrorListener());

class ReportingParserErrorListener implements ANTLRErrorListener<Token> {
   syntaxError? = <T extends Token>(recognizer: Recognizer<T, any>, offendingSymbol: T | undefined, line: number, charPositionInLine: number, msg: string, e: RecognitionException | undefined) => {
       let range;
       if(e) {
           let token = e.getOffendingToken();
           range = rangeOfToken(token, d);
       } else {
           range = rangeOfToken(offendingSymbol, d);
       }
       let diagnostic: vscode.Diagnostic = {
           severity: vscode.DiagnosticSeverity.Error,
           range: range,
           message: msg,
           source: 'PlantUML syntax checker'
       };
       diagnostics.push(diagnostic);
   };
}
parser.addErrorListener(new ReportingParserErrorListener());

We’re missing three key pieces of information here:

where to put the above code;
how to compute the range of the offending token (the portion of text that will be underlined in the editor);
how to build and invoke the parser.

Doing a search for “diagnostic” we can easily see that the extension defines a Diagnoser class in the file src/providers/diagnoser.ts. We can put our code in its diagnose method.

Typically, we would parse an entire document and report the errors we find, but since PlantUML code is embedded in the surrounding text, the diagnose method works by extracting just the individual diagrams and processing them separately. We can thus run the parser on each diagram:

let diagrams = diagramsOf(document);
diagrams.map(d => {
    //Code omitted...
    const text = document.getText(new vscode.Range(d.start, d.end));
    const input = CharStreams.fromString(text);
    const lexer = new PumlgLexer(input);
    const tokenStream = new CommonTokenStream(lexer);
    const parser = new PumlgParser(tokenStream);
    //Then, we register the error listeners as we’ve shown above (omitted)
});

It helps now that we have two entry points in the grammar – umlFile for an entire file, and uml for a single UML diagram extracted from a file. In this case, we’ll use the latter (after having registered the error listeners):

parser.uml();

The last piece we’re missing is how to compute the range to underline. Since we’re parsing individual diagrams, and not the whole documents, the locations reported by the parser will be relative to the diagram, and we’ll have to offset them by the position of the diagram in the document:

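function rangeOfToken(token: Token, d: Diagram) {
   if (token) {
       // Token positions are relative to the diagram text, so we shift them by the diagram's start
       return {
           start: new vscode.Position(token.line, token.charPositionInLine).translate(d.start.line, d.start.character),
           // The end column is the start column plus the length of the token
           end: new vscode.Position(token.line, token.charPositionInLine + token.text.length).translate(d.start.line, d.start.character),
       }
   }
}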

Fortunately, as we can see, the diagramsOf function built into the extension already computes such an offset.
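For readers who want to reason about the offset arithmetic in isolation, here is a plain-TypeScript sketch that doesn't depend on the vscode or antlr4ts APIs. It rests on two assumptions that should be checked against the actual extension code: that token lines are 1-based (as ANTLR reports them) while document lines are 0-based (as in VSCode), and that the column offset of the diagram only matters for tokens on the diagram's first line. All type and function names are hypothetical.

```typescript
// Plain-TypeScript sketch, independent of the vscode and antlr4ts APIs.
interface TextPosition {
    line: number;      // 0-based, as in VSCode
    character: number; // 0-based
}

interface RelativeToken {
    line: number;               // 1-based, as ANTLR reports it
    charPositionInLine: number; // 0-based
    text: string;
}

// Hypothetical helper: maps a position inside the diagram text to a position in the document.
function toDocumentPosition(line1Based: number, column: number, diagramStart: TextPosition): TextPosition {
    const relativeLine = line1Based - 1;
    return {
        line: diagramStart.line + relativeLine,
        // Only the first line of the diagram can share its line with preceding document text.
        character: relativeLine === 0 ? diagramStart.character + column : column,
    };
}

function documentRangeOf(token: RelativeToken, diagramStart: TextPosition) {
    const endColumn = token.charPositionInLine + token.text.length;
    return {
        start: toDocumentPosition(token.line, token.charPositionInLine, diagramStart),
        end: toDocumentPosition(token.line, endColumn, diagramStart),
    };
}

// A diagram that starts at line 10, column 4 of the document
const sampleDiagramStart: TextPosition = { line: 10, character: 4 };
const firstLineRange = documentRangeOf({ line: 1, charPositionInLine: 0, text: "@startuml" }, sampleDiagramStart);
const thirdLineRange = documentRangeOf({ line: 3, charPositionInLine: 6, text: "Foo" }, sampleDiagramStart);
```

Here firstLineRange spans columns 4 to 13 of line 10, while thirdLineRange spans columns 6 to 9 of line 12, since the diagram's column offset no longer applies after its first line.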

Note that we’ve corrected the diagramsOf function because the original didn’t work well with malformed input. Suppose a document already contains a diagram and the user adds a new one above it. The user types this:

@startuml

while down below there’s already this well-formed diagram:

@startuml
class Foo
@enduml

The original diagramsOf function would return two overlapping diagrams, which would report multiple parse errors over the same lines/tokens. That would be confusing for the user, so we’ve changed it so as to never return overlapping diagrams – in the example above, it will match the first @startuml with the @enduml at the end. This is not yet ideal because we’re working with regular expressions – using the actual parser for this would be better as we would properly recognize well-formed diagrams. We’ll leave that as an exercise to the reader.
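The non-overlapping behavior described above can be sketched as follows. This is a simplified, hypothetical stand-in that works on an array of lines; the real diagramsOf function in the extension is based on regular expressions and returns richer objects, so take this only as an illustration of the matching policy.

```typescript
// Hypothetical, simplified stand-in for the extension's diagramsOf function.
interface LineSpan {
    startLine: number; // line of the opening @startuml
    endLine: number;   // line of the matching @enduml
}

function nonOverlappingDiagrams(lines: string[]): LineSpan[] {
    const spans: LineSpan[] = [];
    let openLine = -1; // -1 means that we're outside of any diagram
    for (let i = 0; i < lines.length; i++) {
        const trimmed = lines[i].trim();
        if (trimmed.indexOf("@startuml") === 0) {
            // A second @startuml before any @enduml doesn't open a new diagram
            if (openLine < 0) {
                openLine = i;
            }
        } else if (trimmed.indexOf("@enduml") === 0 && openLine >= 0) {
            spans.push({ startLine: openLine, endLine: i });
            openLine = -1;
        }
    }
    return spans;
}

// The scenario from the text: a new, unfinished diagram typed above a well-formed one
const editedDocument = ["@startuml", "", "@startuml", "class Foo", "@enduml"];
const spansFound = nonOverlappingDiagrams(editedDocument); // a single span, lines 0 to 4
```

With this policy the parser sees one diagram containing a stray @startuml, so it reports the problem once instead of reporting overlapping errors on the same lines.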

Improving Code Completion

Similarly, we can use our parser to improve the extension’s code completion capabilities. We won’t go too much into the details here as we’ve already covered the topic in other tutorials.

Code completion is already implemented in the file src/providers/completion.ts, where we can see that a few completion sources are used:

let diagram = diagramAt(document, position);
return Promise.all([
    MacroCompletionItems(diagram, position, token),
    LanguageCompletionItems(),
    VariableCompletionItems(diagram, position, token),
]).then(results => [].concat(...results));

We can leave macros and variables alone and replace LanguageCompletionItems – which suggests a fixed list of keywords – with a more intelligent function:

return Promise.all([
   MacroCompletionItems(diagram, position, token),
   this.computeSuggestions(diagram, position),
   VariableCompletionItems(diagram, position, token),
]).then(results => [].concat(...results));

Computing Suggestions

We won’t be showing the entire computeSuggestions function; it’s in the GitHub repository and it follows the same approach we’ve used in other tutorials on code completion.

Let’s just see a couple of highlights. We use a visitor to traverse the parse tree and record the class names:

export class ExtractNamesVisitor extends AbstractParseTreeVisitor<void> implements PumlgVisitor<void>{
   protected defaultResult(): void {}

   classNames: Set<string> = new Set<string>();

   visitClass_declaration = (ctx: Class_declarationContext) => {
       this.classNames.add(ctx.ident().text);
   };
}

This way, when the engine tells us that we may suggest a class_name – the special rule we’ve introduced earlier to mark identifiers that we know to refer to class names – we can return the names we’ve recorded:

core.preferredRules = new Set<number>([PumlgParser.RULE_class_name]);
let candidates = core.collectCandidates(tkPos.index, parseTree);
const suggestions: CompletionItem[] = [];
if (candidates.rules.size > 0) {
   const visitor = new ExtractNamesVisitor();
   if (candidates.rules.has(PumlgParser.RULE_class_name)) {
       visitor.visit(parseTree);
       for (const name of visitor.classNames) {
           suggestions.push({label: name, kind: CompletionItemKind.Class, sortText: `00_${name}`});
       }
   }
}

Note the use of the sortText property to sort the results, in this case, by kind (represented with two digits) and then by name. By default, VSCode sorts by name only.
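
To see how such keys produce the ordering, here is a minimal sketch that sorts plain objects with a comparator on sortText, falling back to the label when sortText is absent. It only approximates the editor's behaviour, since VSCode's actual ranking also takes into account how well each item matches the text typed so far. The Item type, the bySortText function, and the "02" prefix used for keywords here are illustrative assumptions.

```typescript
// Hypothetical minimal item shape; the real CompletionItem has many more fields.
interface Item { label: string; sortText?: string; }

// Compares two items by sortText, falling back to the label when it is absent.
function bySortText(a: Item, b: Item): number {
    const keyA = a.sortText ?? a.label;
    const keyB = b.sortText ?? b.label;
    return keyA < keyB ? -1 : keyA > keyB ? 1 : 0;
}

const items: Item[] = [
    { label: "class", sortText: "02_class" },
    { label: "Foo", sortText: "00_Foo" },
    { label: "abstract", sortText: "02_abstract" },
    { label: "Bar", sortText: "00_Bar" },
];

// Class names (prefix 00) come before keywords (prefix 02); within each
// group, the items are ordered by name.
const ordered = [...items].sort(bySortText).map(item => item.label);
console.log(ordered);
```

Because the comparison is made on strings, the prefix for the kind has to have a fixed width, which is why two digits are used.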

As usual we ignore punctuation tokens and, for keywords, we use the token name; the following is a simplified version of the relevant code:

const symbolicName = parser.vocabulary.getSymbolicName(k)?.toLowerCase();
if (symbolicName) {
    suggestions.push({label: symbolicName, kind: CompletionItemKind.Keyword, sortText: `02_${symbolicName}`});
}

As an exception, we explicitly list the available connectors:

if (k === PumlgParser.CONNECTOR) {
   const connectors = [
       '--', '..', '-->', '<--', '--*', '*--', '--o', 'o--', '<|--', '--|>', '..|>', '<|..',
       '*-->', '<--*', 'o-->', '<--o', '.', '->', '<-', '-*', '*-', '-o', 'o-', '<|-', '-|>', '.|>', '<|.',
       '*->', '<-*', 'o->', '<-o'];
   connectors.forEach((c, i) => {
       // Zero-pad the index: sortText is compared as a string, so "01_10" would sort before "01_2".
       suggestions.push({label: c, kind: CompletionItemKind.Keyword, sortText: `01_${String(i).padStart(2, '0')}`});
   });
   return;
}

Here we’ve chosen to show classes first, then connectors, then keywords, and finally everything else, but you’re free to choose another strategy.

Also, note how we can say which kind of word we’re suggesting; VSCode will represent these with different icons in the code completion menu.

Several kinds of code completion elements, taken from another extension by Strumenta.

Wrapping Up and Further Improvements

In this tutorial, we’ve shown how we typically work to implement an editor, that is, by finding existing building blocks when they’re available, and combining and extending them to add value for the users.

We’ve intentionally skipped over many details that we’ve covered better in other articles, in order to keep the size of the tutorial manageable. Also, we’ve left out some further improvements and considerations that readers may want to evaluate on their own, such as:

performance, particularly by caching parsing results to avoid repeatedly parsing the same text;
tuning code completion, in particular with regard to trigger characters and word boundaries. By default, VSCode does not play nicely with words starting with the @ sign, for example;
replacing the uses of regular expressions with invocations of the parser whenever appropriate;
other useful features such as “outline” (or document symbols, in the Language Server Protocol).
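
As an illustration of the first point, here is a minimal sketch of a cache keyed by document URI and version. It relies on the fact that a VSCode TextDocument exposes a version number that increases on every change; everything else (the Versioned shape, the ParseCache class, and the stand-in parse function) is hypothetical and not part of the extension.

```typescript
// Hypothetical view of a document: just the fields that the cache needs.
interface Versioned { uri: string; version: number; text: string; }

// Remembers the last parse result for each document and reuses it for as
// long as the version of the document doesn't change.
class ParseCache<T> {
    private entries = new Map<string, { version: number; result: T }>();
    private parse: (text: string) => T;

    constructor(parse: (text: string) => T) {
        this.parse = parse;
    }

    get(doc: Versioned): T {
        const hit = this.entries.get(doc.uri);
        if (hit !== undefined && hit.version === doc.version) {
            return hit.result; // same version: reuse the previous result
        }
        const result = this.parse(doc.text);
        this.entries.set(doc.uri, { version: doc.version, result });
        return result;
    }
}

// Stand-in for the real parser: it counts the lines and records each call.
let parseCount = 0;
const cache = new ParseCache<number>(text => { parseCount++; return text.split("\n").length; });

const v1: Versioned = { uri: "file:///a.puml", version: 1, text: "@startuml\n@enduml" };
cache.get(v1);
cache.get(v1); // served from the cache
cache.get({ uri: v1.uri, version: 2, text: "@startuml\nclass Foo\n@enduml" });
console.log(parseCount);
```

In the extension, the same idea could be applied per diagram rather than per document, so that editing one diagram doesn't invalidate the results for the others.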

As usual, all the code is on GitHub. The tag “article-v1.0” matches the code at the time the article was written (bugs included). We may update it from time to time, especially if we use it as the basis for another tutorial.
