Posts

EBNF: How to describe the grammar of a language

EBNF: The extended Backus-Naur form

EBNF: The extended Backus-Naur form

The EBNF is the most commonly used formalism to describe the structure of languages.

In this article we are going to see:

Does it sound like a plan?

Parsing: Tools and Libraries

Parsing_-_tools_and_libraries_-_cover

Receive the guide to your inbox to read it on all your devices when you have time. Learn about parsing in Java, Python, C#, and JavaScript

We won't send you spam. Unsubscribe at any time. Powered by ConvertKit

What is the EBNF?

The EBNF is a way to specify a formal language grammar. It can be considered a metalanguage because it is a language to describe other languages.

A formal language is a language with a precise structure, like programming languages, data languages, or Domain Specific Languages (DSLs). Java, XML, and CSS are all examples of formal languages.

A grammar can be used to define two opposite things:

  • how to recognize the different portions in a piece of code written in the formal language
  • the possible ways to build a valid piece of code in the formal language

For example, a simple grammar could tell us that a document in our language is composed by a list of declarations, and declarations are defined by the sequence of a keyword Xyz and a a name.

Based on this we could:

  • recognize in the code sequences of the keyword Xyz and a name as instances of these declarations we have considered
  • we could generate valid documents in our language by writing a list of declarations, each time inserting a keyword Xyz and a name

While there are two possible usages for a grammar, we are typically interested only in the first one: recognizing if a piece of code is valid for a given language and identifying the different structures typical of the language (like functions, methods, classes, etc.).

What EBNF means?

Ok, but what EBNF stands for? EBNF stands for Extended Backus-Naur Form. It will not surprise you to read that it is an extended version of the Backus-Naur form (BNF).

There is at least another format derived from BNF which is called ABNF, for Augment Backus-Naur Form. ABNF main purpose is to describe bidirectional communications protocols. EBNF is the most used one.

While there is a standard for EBNF it is common to see different extensions or slightly different syntaxes to be used. In the rest of the article we will add more comments when looking a specific parts of EBNF.,

Examples of EBNF grammars

We are going to see some examples of grammars taken from a list available on github. Later we could refer to them while explaining the rules.

Note also that for each language you could have different equivalent grammars. So for the same languages you could find grammars which are not exactly the same and yet they are correct anyway.

An EBNF grammar for HTML

An EBNF grammar for TinyC

TinyC is a simplified version of C. We picked it because a grammar for a common programming language would be way too complex to serve as an example. We would be typically look at grammars longer than 1,000 lines.

How we typically define a grammar using EBNF?

We define a grammar by specifying how to combine single elements in significant structures. As a first approximation we can consider single words to be elements. The structures correspond to sentences, periods, paragraphs, chapters, and entire documents.

A grammar tells us what are the correct ways to put together single words to obtain sentences, and how we can progress by combining sentences into periods, periods into paragraphs and so on until we get the entire document.

EBNF: Terminals and Non-Terminals

EBNF: Terminals and Non-Terminals

Terminals and non-terminals

In the approximation we used, single words correspond to terminals, while all the structures built on top of them (sentences, periods, paragraphs, chapters, and entire documents) correspond to non-terminals.

How terminals look like

Terminals are sometimes also called tokens. They are the smallest block we consider in our EBNF grammars.

A terminal could be either:

  1. a quoted literal
  2. a regular expression
  3. a name referring to a terminal definition. This option is not part of the EBNF standard, but it is used very frequently. Typically uppercase names are used for these terminal definitions, to distinguish them from non-terminal definitions. These latter definitions are the proper production rules

For example, in the grammar you could use "when" to indicate that exact string. Also regular expressions are used, like /[a-z]+/. Finally we could group terminal definitions somewhere and then use their names to refer to them. Using definitions have the advantage of permitting to reuse them multiple times.

Let’s see some typical terminals:

  • identifiers: these are the names used for variables, classes, functions, methods and so on. Typically most languages use different conventions for different names. For example in Java class names start with an upper case letter, static constants are written using all upper case letters, while methods and variable names start with a lowercase letter. However these are just best practices: in Java there is just one type of identifier that can be used everywhere. This is not the case for all the languages. In languages like Haskell, identifiers used for types must start with an uppercase letter. Another thing to consider is that the definition of identifiers typically overlaps with the definitions of keywords. The latters should take precedence. I.e., if a token could be either an identifier or a keyword than it should be recognized as a keyword.
  • keywords: almost every language uses keywords. They are exact strings that are used to indicate the start of a definition (think about class in Java or def in Python), a modifier (public, private, static, final, etc.) or control flow structures (while, for, until, etc.)
  • literals: these permit to define values in our languages. We can have string literals, numeric literal, char literals, boolean literals (but we could consider them keywords as well), array literals, map literals, and more, depending on the language
  • separators and delimiters: like colons, semicolons, commas, parenthesis, brackets, braces
  • whitespaces: spaces, tabs, newlines. They are typically not meaningful and they can be used everywhere in your code
  • comments: most languages have one or more ways to define comments and commonly they can be used everywhere. Some languages could have more structured forms of documentation comments

Whitespaces and comments are typically ignored in EBNF grammars. This is because usually they could be used everywhere in the language, so they should be reported all over the grammar. Some tools have specific options to mark some terminals as terminals to ignore.

How terminals are defined

Terminals are defined using string constants or regular expressions. Let’s see some typical definitions.

CategoryTerminalDefinitionNote
IdentifierJava identifier/[a-zA-Z$_][a-zA-Z0-9$_]*/This is a simplification because it is technically possible to use also UTF escape sequences in Java identifiers
Haskell type identifier/[A-Z][a-zA-Z0-9$_’]*/
Haskell value identifier/[a-z][a-zA-Z0-9_’]*/
KeywordSome Java keywords“abstract”, “assert”, “boolean”, “break”, “byte”, “case”, “catch”, “char”, “class”, “const”…“const” is a keyword in Java even if it is not used by any construct. It is a reserved keyword for future usages (same is true for “goto”)
Some Python 3 keywords/td>“def”, “return”, “raise”, “from”, “import”, “as”, “global”, “nonlocal”, “assert”…
LiteralJava string literal/'”‘ (~[“\\] | ‘\\’ [btnfr”‘\\])* ‘”‘/This is a simplification. We are ignoring the octal and unicode escape sequences
Java character literal/’\” (~[‘\\] |’\\’ [btnfr”‘\\]) ‘\”
Java integer literal/[“0”-“9”](([“0”-“9″,”_”])*[“0”-“9”])?/A Java integer literal can actually be expressed in decimal, hexadecimal, octal or binary format. We are just considering the decimal format here. This is true also for Java long literals
Java long literal/[“0”-“9”](([“0”-“9″,”_”])*[“0”-“9”])?(‘l’|’L’)
Java float literal/[“0”-“9”](([“0”-“9″,”_”])*[“0”-“9”])?’.'([“0”-“9”](([“0”-“9″,”_”])*[“0”-“9”])?)?(‘f’|’F’)A Java float literal can actually be expressed in decimal or hexadecimal format. We are just considering the decimal format here. We are also ignoring the possibility of specifying the exponent
Java boolean literal/”true”|”false”/
Separator/DelimiterSome Java separators and delimiters“(“, “)”, “{“, “}”, “,”, “;”…
Some Ruby separators and delimiters“,”, “;”…
WhitespaceJava whitespace/[ \t\r\n\u000C]+/
Ruby whitespace/(‘ ‘|’\t’)+/
CommentJava line comment/’//’ ~[\r\n]*/
Java block comment/’/*’ .*? ‘*/’/
Python line comment/’#’ ~[\r\n\f]*/

How non-terminals look like

Non-terminals are obtained by grouping terminals and other non-terminals in a hierarchy. After all our goal is to obtain an Abstract Syntax Tree, which is a tree. Our tree will have a root: one non-terminal representing our entire document. The root will contain other non-terminals that will contain other non-terminals and so on.

The picture below show how we can go from a stream of tokens (or terminals) to an AST, which groups terminals into a hierarchy of non-terminals.

EBNF - From stream of tokens to AST

EBNF – From stream of tokens to AST

In the grammar we will define the parser rules that determine how the AST is built.

We have seen that non-terminals represent structures at different levels. Examples of non-terminals are:

  • program/document: represent the entire file
  • module/classes: group several declarations togethers
  • functions/methods: group statements together
  • statements: these are the single instructions. Some of them can contain other statements. For example the loops have a body which is a list of other statements
  • expressions: are typically used within statements and can be composed in various ways

How non terminals are defined? We are going to see it in the next section.

Definition of production rules

An EBNF grammar is substantially a list of production rules. Each production rule tells us how a non-terminal can be composed. We are now going to see the different elements that we can use in such rules.

Terminal

We have seen that a terminal can be defined in-line, specifying a string or a regular expression, or can be defined elsewhere and simply referred to in a rule. The latter method is not technically part of EBNF, but it is commonly used.

Refer to a non-terminal

Similarly to references to terminals we can also refer to non-terminals. However this can lead to left-recursive rules that are typically forbidden. For more details see the paragraph on recursion in grammars.

Sequence

A sequence is simply represented by specifying two elements one after the other.

It is also called concatenation, and in the standard EBNF commas are used between elements, while typically you just use spaces to separate the elements of the sequence.

Alternative

There are some constructs, or portion of constructs, that can be defined in different ways. To represent this case we use alternatives. The different alternatives are separated by the pipe sign (“|“)

Optional (zero or one time)

An element can appear zero or one time. In other words it can be optional.

In the standard EBNF optional elements are represented inside square brackets. However it is more common to find them represented by a question mark following the optional element.

Zero or more times

An element can appear zero or more times (no upper limit).

In the standard EBNF optional elements are represented inside curly brackets. However it is more common to find them represented by an asterisk following the element to repeat.

One or more time

An element can appear one or more times (no upper limit).

In this example we are saying that a program should have at least one statement.

This is not present in standard EBNF, however this is equivalent to a sequence in which the same elements is present twice and the second time it is followed by an asterisk, so that the same element is effectively present always at least once.

Grouping

We can group multiple elements together by using round parenthesis. This is typically used because a modifier (optional, zero-or-more, one-or-more) must be applied to a set of elements. It can also be used to control the precedence of operators.

Without using the grouping we would have an alternative between the sequence of SCRIPT_OPEN SCRIPT_BODY (first alternative) and SCRIPT_SHORT_BODY (second alternative).

A few things to consider

We have seen what constructs we can use to define production rules. Now let’s dive in some aspects that we should consider when writing grammars.

Recursion in grammars: left or right?

EBNF lets us define recurring grammars. Recurring grammars are grammars that have recurring production rules, i.e., production rules that refers to themselves and they do so at the beginning of the production rule (left recurring grammars) or at the end (right recurring grammars).

Left recurring grammars are more common. Consider this example:

This production rule is left recurring because the expression could start with… an expression.

The fact is that many tools that process EBNF grammars cannot deal with that because they risk to enter infinite loops. There are ways to refactor left recurring rules, however they lead to less clear grammars. So, if you can, just use modern tooling that deal with left and right recurring grammars.

Precedence

Precedence could refer to two things: precedence in the EBNF grammar (i.e., in which order operators in grammars are applied) and precedence in the languages defined by EBNF.

This is very important because it would permit to interpret correctly expressions like:

In this case we all know that the multiplication has precedence on the addition. In the languages we are going to define we can have many different rules and we need to define a specific order to consider them.

Let’s talk about the second one.

The traditional way to manage precedence is to define a list of different rules that refers to each other. At the top level we will have the more abstract rules, like expression, at the bottom level we will have rules representing the smallest components, like a single term. The typical example is shown in TinyC:

The grammar as it is defined makes first parse single terms (id, integer or expressions between parenthesis). Later this can be possibly combined using the plus or the minus sign. Then the less than operator can be used and this can be part of an expr, the top level rule considered.

This is a relatively simple example but in richer languages we could have 7-12 intermediate rules like test and sum, to represent different groups of operators with the same precedence. Thing about the multiplication, division, power, comparison operators, logical operators, array access, etc. . There are many possible ways to build expressions and you need to define an order of precedence.

Now, some modern tools use just the order in which alternatives are defined to derive the precedence rules. For example in ANTLR you could write the previous example like:

Typical pattern: defining lists

A typical thing we want to define in an EBNF grammar is a list.

We have list of parameters, list of variables, all sorts of lists.

Typically we have a separator between elements (like a COMMA). We could have lists that must have at least one element and lists which can have zero elements. Let’s take a look at how we can implement both:

Syntactic vs Semantic validity

We have seen that the EBNF can be used to define a grammar, we should consider that a grammar defines what is syntactically valid, but it tells us nothing about what is semantically valid.

What does it mean? Consider these examples:

  • a document containing multiple declarations of elements with the same name
  • a sum between two variables: an integer and a boolean
  • a reference to a variable that was not declared before

We can imagine a language where all these examples are syntactically correct, i.e., they are built according to the grammar of the language. However they are all semantically incorrect, because they do not respect additional constraints on how the language should be used. These semantics constraints are used in addition to the grammar, after the structures of the language have been recognized. In the grammar we can say things like a declaration should be composed by a certain keyword and a name, with semantic constraints we can say two declarations should not have the same name. By combining syntactic and semantic rules we can express what is valid in our language.

How do you define semantic rules? By writing code that works on the AST. You cannot do that in the EBNF grammar.

Where to go from here?

An EBNF grammar is useful to support discussion and to communicate with other langue designers. Typically you may want to do more with it: you want to build actual parsers and tools to process languages from your EBNF grammars. I personally like to work with ANTLR. My suggestion is to take a look at the ANTLR Mega Tutorial.

You could be interested in learning more about parsing. We have prepared different articles explaining how to proceed depending on the language you prefer to work with:

And if you are lost and unsure about how to move forward just let us know.

Summary

In this article we aimed to be practical: we have discussed what is done in practice and what you need to start writing grammars using EBNF. We have discussed the differences between typical usages and the standard and tried to give a complete and correct picture.

There are things we did not discuss: for example the history that lead to the creation of EBNF or its roots in the work from Chomsky and other scientists. We have not discussed what a context free grammar is. The theory tells us that EBNF cannot be used to describe all possible forms of grammars. However it is powerful enough to describe all formal languages you should be interested in writing.

In the rest of this website you should find articles about the next steps like building a compilers or an editor for your language. So, have fun!