Introduction
In this tutorial, we’ll give an introduction to Peggy. As usual in our articles, all the code is on GitHub.
Peggy is the successor of peg.js, an open-source library for writing parsers in JavaScript. It’s based on parsing expression grammars or PEGs, hence the name. PEGs are a powerful parsing formalism with a gentler learning curve than many alternatives, so Peggy is particularly well suited for developers with little experience with parsers. We’ll expand on PEGs throughout the tutorial.
Also, Peg.js/Peggy is used in many JavaScript projects, including Kibana. While it’s not the only parser generator for JavaScript, it’s one of the most popular ones, and at the time of writing it’s the first Google search result for “JavaScript parser generator”.
Our readers may know that our go-to choice for writing parsers is ANTLR4. Many of the concepts that apply to ANTLR also apply to other parser generators, including Peggy. Over the course of the article, we’ll rehash some of those concepts to present Peggy in action. However, if you’re new to parsers, you may want to start with one of our ANTLR tutorials. In fact, we’ll review our reasons to prefer ANTLR, but we’ll also list a few points in favour of Peggy.
Parsers and JavaScript?
Usually, we associate parsers with compilers, interpreters, and static analysis tools. These are software applications that typically run on a developer’s machine as a command-line tool, or perhaps an IDE plugin. Sometimes, we run these kinds of tools on a server, e.g., a continuous build server.
Because of that, it may be hard to see the point of writing a parser in JavaScript, a language that we primarily associate with web development. However, we can identify several reasons to do so.
First of all, JavaScript is no longer a browser-only language, thanks to node.js and other similar technologies. We can use JavaScript to write single-user applications or multi-user web services.
Second, parsing is ubiquitous in computing, it’s not limited to the implementation of programming languages. From configuration languages to markup or templating languages, it’s hard to find a software application that doesn’t use at least some kind of parsing technique, however limited and informal.
Finally, even in the browser, it’s getting more common for web applications to feature advanced UI components. These may include a powerful editor for writing some form of code, not necessarily program code, as we’ve seen. These editors need a parser to go beyond syntax coloring and to offer features like error reporting and code completion.
Using Peggy
A Sample Grammar
In Peggy’s documentation, and in the repository accompanying this article, we can find the following grammar that implements a simple expression calculator:
start = additive additive = left:multiplicative "+" right:additive { return left + right; } / multiplicative multiplicative = left:primary "*" right:multiplicative { return left * right; } / primary primary = integer / "(" additive:additive ")" { return additive; } integer "integer" = digits:[0-9]+ { return parseInt(digits.join(""), 10); }
Now, in almost every programming language or DSL, we’ll have arithmetic expressions, and the grammar above is a minimal implementation of those. In a real-world language, the grammar for expressions would be much more complex and cover many different cases, for example other operators and function calls, but the concepts are the same.
That said, we can observe several interesting facts in our grammar. We have five rules or productions. The first rule is always the entry point into the parser – the starting rule that it’ll try to match. Then, each rule may call into other rules to do the matching.
Eventually, every rule will match some patterns, also known as terminals. In this sample grammar, patterns are just literal strings and the 0-9 character range, but Peggy supports more patterns, similarly to ANTLR lexer rules and simple regular expressions.
Also, typically we can combine patterns and rules to form a new rule. If a rule or pattern follows another, Peggy will try to match both in sequence. Or, if we want the parser to choose between two or more alternatives, we use the choice operator, the slash character “/”. We can see that for example to match the rule “primary” the parser must either match the rule “integer” or a sequence of an open parenthesis followed by the “additive” rule and a closing parenthesis.
We should note that, in PEGs, the choice operator is ordered, and there is no backtracking: once one of the possible choices matches, the others are discarded, even if by moving forward in the input the match then fails and perhaps trying a different choice would have actually matched the input correctly.
Semantic Actions
We can also observe that each rule except the first has an associated action. Indeed, this grammar implements a simple expression interpreter, not just a parser. Each action here returns a numeric value, which is the result of evaluating the corresponding arithmetic expression.
In a real-world grammar, we would likely build a parse tree using the actions, then an AST, and finally process the AST to interpret it, or transpile it to another target, etc. That’s the process that we follow for all the parsers we write, and we’ve developed open-source support libraries for that. In some cases, with Peggy, we may construct parts of the AST directly in the grammar, rather than building a parse tree first.
Finally, we can see how the rules are closely integrated with the JavaScript actions. We can give a name to matched subrules (left and right, for example), and those names are available as bound variables in the scope of the semantic action.
The value of each variable is the result of matching the rule, i.e., the return value of its associated action. In this case, it’s a number, but it could be a parse tree node for example, or another kind of object, depending on the application.
Generating a Parser from a Grammar
As usual for JavaScript tools, we can use Peggy (the parser generator, as opposed to the generated parser) either as a command-line tool or as a library. Usage from the command line is more common, so we’ll start showing that. In any case, we’ll have to install Peggy first, for example, with npm:
npm install peggy
Then, using Peggy as a command-line tool, we can generate a parser from our grammar. Suppose we’ve saved the grammar that we’ve encountered earlier in the file “simple_math.pegjs”. We would invoke Peggy like this:
peggy simple_math.pegjs
The result is a simple_math.js file implementing the parser. We can then use it in our node.js application:
$ node Welcome to Node.js v16.13.0. Type ".help" for more information. > sm = require("./simple_math.js") { SyntaxError: [Function: peg$SyntaxError] { buildMessage: [Function (anonymous)] }, parse: [Function: peg$parse] } > sm.parse("1+10") 11
This is what happens when the input is malformed:
> sm.parse("1+abc") Uncaught: [peg$SyntaxError: Expected "(" or integer but "a" found. ] { expected: [ { type: 'other', description: 'integer' }, { type: 'literal', text: '(', ignoreCase: false }, { type: 'other', description: 'integer' }, { type: 'literal', text: '(', ignoreCase: false }, { type: 'other', description: 'integer' }, { type: 'literal', text: '(', ignoreCase: false }, { type: 'other', description: 'integer' }, { type: 'literal', text: '(', ignoreCase: false } ], found: 'a', location: { source: undefined, start: { offset: 2, line: 1, column: 3 }, end: { offset: 3, line: 1, column: 4 } } }
The parser aborts with an exception, so we have no possibility to recover from the error and we don’t know about any subsequent error. Here, for example, the “bc” characters after “a” are also an error, but Peggy doesn’t report them. We’ll expand on that in a later section.
That said, the information that Peggy reports is quite rich, as we also get to know what the stream should have contained for the parsing to continue, in the “expected” array. So, if we were integrating this parser into an editor, for example, we could mark the start of the erroneous code and also suggest possible solutions to the user.
The peggy command accepts numerous flags that are detailed in the documentation. These include the --format
flag to control the module format we want Peggy to output, if we want to load the parser into a browser rather than node.js.
Generating the Parser at Runtime
Another option is to call into the Peggy API at runtime, to generate the parser on the fly. While it’s better to do it ahead of time if possible, perhaps for a simple enough application that doesn’t have a build system it could make sense to generate the parser at runtime.
Also, generating a parser at runtime opens up the possibility of integrating Peggy into other tools. For example, a build tool or plugin could achieve finer-grained integration by calling Peggy directly instead of going through the command line.
The documentation explains this extensively, so we’ll just show the gist of it. First, we import Peggy, for example, on node.js:
const peggy = require("peggy");
Or, for the browser (either directly, or using a bundler such as Webpack):
import * as peggy from "peggy";
Then, we can generate a parser from a grammar string:
const parser = peggy.generate( "start = (a / b)+\n" + "a = 'a'\n" + "b = 'b'");
Here, parser is a parser object that we can readily use. However, we can also tell Peggy to generate the source code of the parser, e.g. if we’re writing a build script or tool:
const parserSource = peggy.generate("start = ('a' / 'b')+", { output: "source" });
Source Maps
With this method, can even generate a source map for our parser. This is an encoded object that allows browsers and other tools to reconstruct the source code from compiled JavaScript, be it minified or transpiled (e.g. from TypeScript).
Typically, we generate source maps to aid developers in debugging. Since compiled or minified code is pretty hard to read, browsers can use a source map to show the original source as the developer wrote it, and map each location in the compiled version to a corresponding location in the source code. This makes it much easier to follow along with what the code is actually doing.
To generate a source map, alongside the code, we have to specify output: "source-and-map"
as well as a grammarSource string describing the origin of the source:
const parser = peggy.generate( "start = (a / b)+\n" + "a = 'a'\n" + "b = 'b'", { output: "source-and-map", grammarSource: "<generated>" });
grammarSource could be a file name, or any string indicating where the code comes from. The resulting parser is a SourceNode object, from which we can extract the code:
const code = parser.toString();
But we can also extract a source map:
const sourceMap = parser.toStringWithSourceMap().map.toString();
This will be the string representation of a JSON object that browsers and other tools can interpret. It looks like this:
'{"version":3,"sources":["<generated>"],"names":["start","a","b"],"mappings":";;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAAAA,cAAK;;;;;;;;;;;;;;;;;;;;;;;WACLC,UAAC;;;;;;;;;;;;;;WACDC,UAAC"}'
The interesting part is the strange-looking mappings field. This is encoded in a format known as VLQ that we can decode using a library or an online tool. In the example above, the source map contains the following intervals:
Compiled line, column => source line, column From 0,0 => 337,11 to 0,5 => 337,25 From 1,0 => 360,11 to 1,1 => 360,21 From 2,0 => 374,11 to 2,1 => 374,21
Indeed, the source from 0,0 to 0,5 is “start” while the compiled code from 337,11 to 337,25 is the declaration of the function “peg$parsestart”:
> parser.toStringWithSourceMap().code.toString().split('\n')[337] ' function peg$parsestart() {'
Now all this is unlikely to be relevant when we use Peggy in a software application, but it’s interesting to look under the hood a bit to better understand how these tools work.
When to Use Peggy
Here we’ll do mainly a comparison with ANTLR as that’s the parser generator that we usually employ. However many of these considerations apply to other parsers as well.
Formalism: PEG vs LL(*)
ANTLR implements a parsing algorithm called LL(*). It uses advanced techniques to avoid having a fixed lookahead, that is, a maximum number of characters or tokens that the parser can check beyond its current position in the input stream.
In practical terms, this means ANTLR parsers have two qualities out of the box:
- reasonable performance
- good error reporting and recovery.
Peggy, as we’ve said, uses the PEG formalism (parsing expression grammars). As such, it generates a recursive-descent parser, which is similar to what a human programmer would intuitively write. Unless they’re an expert in parsing algorithms, maybe.
Which formalism is “more powerful” or more efficient is a matter of debate and we don’t want to go there. Some languages are better suited for PEG, others for LL(*). Developer experience is also very important, and the “worse” formalism can still guarantee better productivity if the developers know it better than the alternative.
For sure, we can say that PEG supports left-recursive rules better. These are grammar rules such as:
expr: expr ‘+’ expr | NUMBER;
That is, rules that are defined in terms of themselves, with no prefix to break the recursion. In the example, an expression can be the sum of two expressions or an individual number. That way, recursively, we model all possible sum expressions.
ANTLR4 specifically supports that kind of rule, in contrast with previous versions of ANTLR that required the developer to rewrite them in a more convoluted form. However, ANTLR4 can only work with direct left-recursion, and it still requires a manual rewrite for mutually left-recursive rules.
This may look like highly technical talk for specialists, but in day-to-day work with ANTLR, it can happen that the intuitive way of writing some set of rules is, indeed, mutually left-recursive. Then, the inexperienced developer will have to learn about the issue and find a solution, which makes the learning curve steeper. With PEG parsers such as Peggy, that doesn’t happen.
One drawback is that the advanced prediction algorithm of ANTLR is absent in PEG parsers. In fact, as we’ve seen, PEG grammars have a deterministic way to resolve ambiguities, the ordered choice operator. Among multiple alternatives, Peggy will always select the first rule that matches, without any prediction or backtracking.
Error Handling
As we’ve seen earlier, a parser may have to deal with malformed input: code that doesn’t respect the rules of the grammar. This could be the result of a mistake by the user, for example.
However, malformed code is also a frequent occurrence in an editor. While the user is writing correct code, the editor will reparse it every few keystrokes in order to understand it and give content assistance. So, it will often try to parse code that is only partially written, and therefore incorrect. For example, imagine typing console.
: the code is incomplete, but by running the parser, the editor may reconstruct enough information to suggest that “log” is a method in the console object.
When it encounters malformed code, a parser can do two things:
- Report the error so that the calling code can notify the user, and perhaps halt the operation (e.g. a compiler will refuse to compile malformed code);
- Make an attempt at recovering from the error to continue parsing. This is especially useful in an editor or in an interpreter, where we don’t want an error at the beginning of the source file to render the whole file useless. But also in a compiler, where even if the compilation process is aborted in case of errors, it’s useful if the parser is able to continue after an error and report multiple errors in one pass.
Recursive-descent parsers such as those that Peggy generates are not very good at reporting errors. Indeed, as we’ve seen earlier, Peggy throws an exception as soon as it encounters the first error, therefore aborting the parsing altogether.
ANTLR works differently: encountering an error doesn’t stop an ANTLR parser; it will still build a parse tree, although it will contain some error nodes. And, it will report all the errors that it finds to one or more listeners.
Then, the developer can decide what to do with the errors. They may choose to present all errors to the user – for example, marking code in red in an editor – or filter them and, for example, only print to the console the first n errors or those most likely to be relevant.
Furthermore, ANTLR comes with built-in strategies for error recovery that work in the most common cases, which is not possible in Peggy. ANTLR for example can usually detect and correct a missing semicolon or closing parenthesis and continue parsing. And if the built-in strategies don’t work out, we can write a custom one.
Another point in favor of ANTLR is the amount of prediction information that it computes to generate the parser. This allows applications that go beyond parsing, such as simulators for debugging the parser, or semi-automatic code completion engines.
ANTLR and Java
Developers who are familiar with Web technologies, but not with ANTLR, may vaguely recall ANTLR as a tool that is specific to Java. While, indeed, ANTLR has roots in the JVM ecosystem, we can use it from a long list of languages.
In fact, ANTLR is a parser generator. It reads a grammar file and outputs the source code of a parser. While the generator itself (the “ANTLR tool”) is indeed written in Java, we can tell it to target its output (the source code of the actual parser) to another language. The list of languages that ANTLR supports includes JavaScript, Python, C++, C# and many others.
A JavaScript developer may still want to avoid ANTLR for these reasons:
- ANTLR requires to step outside the typical JavaScript build setup. The developer would need to install the JVM on their development machine (and any other build environment) and download the ANTLR tool’s JAR file, which is unavailable as an NPM package.
- If the project is in TypeScript, since ANTLR doesn’t have a native TypeScript target, the developer would have to generate a parser in JavaScript and lose TypeScript’s checks and IDE support.
Actually, there is an ANTLR implementation that addresses both these points: antlr4ts. It’s in TypeScript, available from the NPM registry, and generates a parser in TypeScript.
Still, Peggy is a native JavaScript library and, as such, doesn’t suffer from any inherited Java-ism in its design. Also, besides the parser generator itself, ANTLR comes with other tools for debugging and testing, and it has various IDE plugins; these are almost all JVM-based.
Parse Trees vs Semantic Actions
The output of an ANTLR parser is a parse tree. This is a tree with a node for each matched grammar rule and a leaf (terminal) for each matched token.
From the parse tree, which is tied to the grammar, we usually distill an abstract syntax tree (AST). That represents the conceptual structure of the language, independently of any artifacts of the grammar. ANTLR doesn’t cover this part; we’ve built our own libraries to aid us, including one for TypeScript and JavaScript.
Finally, we can process the AST to compute the necessary information; for example, to resolve references to names, compute types, or to transpile the code for another platform.
This process seems convoluted, and it is overkill for simple languages and transformations. In those cases, one may adopt the simpler approach of navigating the parse tree directly. However, that quickly becomes inadequate as the languages increase in complexity and the transformations and computations we need to execute on the code become more sophisticated.
ANTLR also allows decorating the grammar with code, in the form of semantic predicates and semantic actions. We use predicates to alter the control flow of the parser. A semantic action instead adds side-effects to a parser rule: when the parser successfully matches the rule, it also executes the code of the corresponding action. However, for clarity, reuse, and performance, we discourage using semantic predicates and actions.
Instead, Peggy relies heavily on semantic actions. In fact, Peggy doesn’t build any kind of tree by itself. Without actions embedded in the grammar, the output of a Peggy parser is just either success or failure (i.e., an exception is thrown or not). If we want to build a tree or process the structures matched by the parser in any other way, we have to use semantic actions.
Because of that, together with the fact that Peggy only targets JavaScript while ANTLR is multi-target, Peggy offers tighter integration with embedded JavaScript code. For example, as we’ve seen, grammar labels are immediately available as JavaScript variables in semantic actions.
Lexer and Parser
Finally, we should mention that Peggy is a so-called “scannerless” parser.
Many parsers, including those generated with ANTLR, actually work in two phases: first, they break the textual source code into tokens (lexing), then they run the actual parser on the resulting stream of tokens. That’s because those parsers are basically large state machines; tokenizing the input reduces the number of states and transitions.
In fact, if a lexer first breaks the input into tokens, then the parser won’t have to track each character in the input, as multiple characters will be lumped together in a single token. Each token or character is a state with incoming and outgoing transactions in the parser’s state machine, and reducing the number of such states saves memory and execution time.
Peggy does not follow that approach, therefore a Peggy parser does the work of a lexer and a parser all in one. This makes it arguably easier to pick it up and allows a quicker turnaround when developing the grammar, but it has other drawbacks beyond performance.
In fact, having a separate lexer means that we can filter out some “noise” so that the parser doesn’t have to deal with it. Typical examples of such “noise” are spaces, newlines, and comments. These are characters or strings that can appear basically everywhere.
If the grammar has to consider those, then rule definitions become much longer and harder to read, because between any two meaningful tokens we have to account for the possibility of having “noise”. Instead, with a lexer, we can silently consume these “noise” tokens without forwarding them to the parser.
So, with Peggy, it’s advisable to pre-process the input to remove these kinds of unwanted characters and strings from the source code.
Conclusions
In this article, we’ve reviewed Peggy, a JavaScript-native parser generator based on the PEG formalism. While at Strumenta we prefer to use ANTLR, Peggy’s simplicity and gentle learning curve make it a valid alternative for JavaScript developers with little experience in parsers.
The code in this article is hosted on GitHub.