Building a compiler for your own language: validation

So you have parsed your code and built a clean AST for it. Now it is time to check if what the user has expressed make sense at all. We should perform validation, identifying semantical errors, to communicate together with lexical and syntactical errors (provided by the parser).

This post is the part of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

After writing this series of posts I refined my method, expanded it, and clarified into this book titled How to create pragmatic, lightweight languages

Code

Code is available on GitHub under the tag 07_validation

Implement semantic checks

In the previous post we have seen how to implement the function process to execute an action on all the nodes of our AST. A typical case is that we want to execute certain operation only on certain nodes. We want still to use process to navigate the tree. We can do that by creating this function named specificProcess.

fun <T: Node> Node.specificProcess(klass: Class<T>, operation: (T) -> Unit) {
    process { if (klass.isInstance(it)) { operation(it as T) } }
}

Let’s see how to use specificProcess to:

find all the VariableDeclarations and check they are not re-declaring a variable already declared
find all the VarReferences and verify they are not referring to a variable that has been not declared or has been declared after the VarReference
perform the same check done on VarReferences also for Assignments

data class Error(val message: String, val position: Point)

fun SandyFile.validate() : List<Error> {
    val errors = LinkedList<Error>()

    // check a variable is not duplicated
    val varsByName = HashMap<String, VarDeclaration>()
    this.specificProcess(VarDeclaration::class.java) {
        if (varsByName.containsKey(it.varName)) {
            errors.add(Error("A variable named '${it.varName}' has been already declared at ${varsByName[it.varName]!!.position!!.start}",
                    it.position!!.start))
        } else {
            varsByName[it.varName] = it
        }
    }

    // check a variable is not referred before being declared
    this.specificProcess(VarReference::class.java) {
        if (!varsByName.containsKey(it.varName)) {
            errors.add(Error("There is no variable named '${it.varName}'", it.position!!.start))
        } else if (it.isBefore(varsByName[it.varName]!!)) {
            errors.add(Error("You cannot refer to variable '${it.varName}' before its declaration", it.position!!.start))
        }
    }
    this.specificProcess(Assignment::class.java) {
        if (!varsByName.containsKey(it.varName)) {
            errors.add(Error("There is no variable named '${it.varName}'", it.position!!.start))
        } else if (it.isBefore(varsByName[it.varName]!!)) {
            errors.add(Error("You cannot refer to variable '${it.varName}' before its declaration", it.position!!.start))
        }
    }

    return errors
}

Ok, so invoking validate on the root of the AST will return all possible semantic errors.

Getting all errors: lexical, syntactic, and semantic

We first need to invoke the ANTLR parser and get:

the parse tree
the list of lexical and syntactic error

data class AntlrParsingResult(val root : SandyFileContext?, val errors: List<Error>) {
    fun isCorrect() = errors.isEmpty() && root != null
}

fun String.toStream(charset: Charset = Charsets.UTF_8) = ByteArrayInputStream(toByteArray(charset))

object SandyAntlrParserFacade {

    fun parse(code: String) : AntlrParsingResult = parse(code.toStream())

    fun parse(file: File) : AntlrParsingResult = parse(FileInputStream(file))

    fun parse(inputStream: InputStream) : AntlrParsingResult {
        val lexicalAndSyntaticErrors = LinkedList<Error>()
        val errorListener = object : ANTLRErrorListener {
            override fun reportAmbiguity(p0: Parser?, p1: DFA?, p2: Int, p3: Int, p4: Boolean, p5: BitSet?, p6: ATNConfigSet?) {
                // Ignored for now
            }

            override fun reportAttemptingFullContext(p0: Parser?, p1: DFA?, p2: Int, p3: Int, p4: BitSet?, p5: ATNConfigSet?) {
                // Ignored for now
            }

            override fun syntaxError(recognizer: Recognizer<*, *>?, offendingSymbol: Any?, line: Int, charPositionInline: Int, msg: String, ex: RecognitionException?) {
                lexicalAndSyntaticErrors.add(Error(msg, Point(line, charPositionInline)))
            }

            override fun reportContextSensitivity(p0: Parser?, p1: DFA?, p2: Int, p3: Int, p4: Int, p5: ATNConfigSet?) {
                // Ignored for now
            }
        }

        val lexer = SandyLexer(ANTLRInputStream(inputStream))
        lexer.removeErrorListeners()
        lexer.addErrorListener(errorListener)
        val parser = SandyParser(CommonTokenStream(lexer))
        parser.removeErrorListeners()
        parser.addErrorListener(errorListener)
        val antlrRoot = parser.sandyFile()
        return AntlrParsingResult(antlrRoot, lexicalAndSyntaticErrors)
    }

}

Then we map the parse tree to the AST and perform the semantic validation. Finally we return the AST and all the errors combined.

data class ParsingResult(val root : SandyFile?, val errors: List<Error>) {
    fun isCorrect() = errors.isEmpty() && root != null
}

object SandyParserFacade {

    fun parse(code: String) : ParsingResult = parse(code.toStream())

    fun parse(file: File) : ParsingResult = parse(FileInputStream(file))

    fun parse(inputStream: InputStream) : ParsingResult {
        val antlrParsingResult = SandyAntlrParserFacade.parse(inputStream)
        val lexicalAnsSyntaticErrors = antlrParsingResult.errors
        val antlrRoot = antlrParsingResult.root
        val astRoot = antlrRoot?.toAst(considerPosition = true)
        val semanticErrors = astRoot?.validate() ?: emptyList()
        return ParsingResult(astRoot, lexicalAnsSyntaticErrors + semanticErrors)
    }

}

In the rest of the system we will simply call the SandyParserFacade without the need to invoke the ANTLR parser directly.

Test validation

Will it fly? Let’s verify that.

class ValidationTest {

    @test fun duplicateVar() {
        val errors = SandyParserFacade.parse("""var a = 1
                                               |var a =2""".trimMargin("|")).errors
        assertEquals(listOf(Error("A variable named 'a' has been already declared at Line 1, Column 0", Point(2,0))), errors)
    }

    @test fun unexistingVarReference() {
        val errors = SandyParserFacade.parse("var a = b + 2").errors
        assertEquals(listOf(Error("There is no variable named 'b'", Point(1,8))), errors)
    }

    @test fun varReferenceBeforeDeclaration() {
        val errors = SandyParserFacade.parse("""var a = b + 2
                                               |var b = 2""".trimMargin("|")).errors
        assertEquals(listOf(Error("You cannot refer to variable 'b' before its declaration", Point(1,8))), errors)
    }

    @test fun unexistingVarAssignment() {
        val errors = SandyParserFacade.parse("a = 3").errors
        assertEquals(listOf(Error("There is no variable named 'a'", Point(1,0))), errors)
    }

    @test fun varAssignmentBeforeDeclaration() {
        val errors = SandyParserFacade.parse("""a = 1
                                               |var a =2""".trimMargin("|")).errors
        assertEquals(listOf(Error("You cannot refer to variable 'a' before its declaration", Point(1,0))), errors)

    }

Conclusions

This is all nice and well: with a simple call we can get a list of all errors we have. For each of them we have a description and the position. This is enough for our compiler but we would now need to show these errors in the editor. We will do that in out of the future posts.

Building a compiler for your own language: validation

Code

Implement semantic checks

Getting all errors: lexical, syntactic, and semantic

Test validation

Conclusions

Categories

More on Language design

Build a rich JSON validator website with Tylasu

Teaching Programming With Hedy