Chisel: an open-source method for parsers

In this article, we introduce the Chisel Method, an open-source method for developing parsers and transpilers. It addresses the challenges of building parsers for modern, complex programming languages by focusing on clear, objective goals and measurable outcomes.

The article covers the development lifecycle under Chisel, from initial development to maintenance and expansion, and highlights features such as robust connectivity, comprehensive documentation, and the importance of making the method teachable.

Introduction

In the landscape of software engineering, the development of parsers remains a cornerstone, critical for interpreting and processing the programming languages that power the digital world. Despite the well-established theories underpinning parsing, the actual process of building parsers is often unpredictable and driven by trial and error.

The Chisel Method attempts to turn the often chaotic process of building parsers into a streamlined, rule-based process that is reliable and efficient. Whether you’re a seasoned developer or new to the field, this article explains how the Chisel Method changes the approach to parser construction, turning a complex process into an achievable, predictable task.

Parser Development Challenges

It is a common misconception, in the context of parsing programming languages, that creating parsers is an already solved problem.

While the basic principles of parsing are well-understood, the practical application of these principles presents ongoing challenges.

Modern programming languages are becoming increasingly complex with rich syntax and semantics. This complexity means that creating parsers that can accurately and efficiently interpret these languages is an ongoing challenge. Each new language feature or syntactic sugar can introduce parsing ambiguities and complexities that need to be addressed.

Programming languages are not static: they evolve over time. New versions often come with changes or additions to the syntax, which parsers need to adapt to.

Maintaining parsers for evolving languages is a continual process, not a one-time solution.

The Chisel Method

The Chisel Method addresses the challenges in parser development through its innovative principles, offering a solution that is both efficient and effective.

The diagram below represents an overview of the process.

Clear Goal Definition

At its heart, Chisel establishes a clear, objective goal that is mutually understood and agreed upon by both the user of the parser and the Language Engineering Team.

The Chisel Method deliberately avoids using language specifications as the primary means of setting goals for parser development. This choice is based on several practical considerations:

Absence of Written Specifications: Many languages, especially domain-specific ones, may not have formal, written specifications. Relying solely on specifications to define the goals of a parser can be impractical or impossible in these cases. The Chisel Method needs to be adaptable to a wide range of languages, including those without formal documentation.
Difficulty in Measuring Progress: Measuring progress based on specifications can be misleading. For example, quantifying progress by the number of covered chapters or sections of a specification does not accurately reflect the actual work completed. Some sections may be more complex or time-consuming than others. This approach can give a false sense of progress and does not provide a clear indicator of how close the parser is to being functional or complete.
Verification Challenges: Ensuring that a parser fully “covers” a specification is a manual and subjective process. It’s challenging to automate this verification, making it inefficient and prone to errors. Without automation, the process becomes labor-intensive and can significantly slow down the development cycle.
Subjectivity and Interpretation Issues: Specifications can be open to interpretation, leading to disagreements between the Language Engineering Team and the client. What one party considers to be in compliance with the specifications may be viewed differently by the other. This subjectivity can lead to conflicts, revisions, and delays in the development process.

Instead, the Chisel Method focuses on concrete, measurable goals, such as the ability to parse a pre-selected set of example files and validate the correctness of the generated ASTs. This approach offers a more objective, quantifiable, and automated way to gauge progress and success, leading to a more efficient and harmonious development process. By sidestepping the ambiguities and limitations of language specifications, Chisel provides a more pragmatic and effective pathway to developing robust and reliable parsers.

Development

Chisel predicates its approach on having a well-defined, objective, and measurable goal. This clarity is crucial in aligning the efforts of all team members and ensuring that every step taken is towards a common end.

Validation Checks

The primary goal of the parser revolves around two essential validation checks on a pre-selected set of example files:

Parsing Capability: The first check involves the parser’s ability to parse all the example files. Successfully parsing these files means that the parser can construct an Abstract Syntax Tree (AST) for each example. This is a fundamental requirement, as the AST is a critical component in understanding and manipulating the structure of the source code.
AST Validation: The second check focuses on validating the AST built for each language construct within the example files. This step is crucial to ensure that the ASTs are not just formed but are also correct and accurately represent the intended structure of the parsed language.
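The two checks lend themselves to automation. The following is a minimal sketch, not the tooling used by Chisel: `run_checks`, `CheckReport`, and the toy parser and validator are hypothetical names introduced only for illustration, and a real project would plug in its own parser and AST validation logic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CheckReport:
    """Outcome of running both checks over the example set (hypothetical type)."""
    parsed: List[str] = field(default_factory=list)    # files that produced an AST
    failed: List[str] = field(default_factory=list)    # files the parser rejected
    invalid: Dict[str, List[str]] = field(default_factory=dict)  # file -> AST problems

    @property
    def complete(self) -> bool:
        # The goal is met only when every file parses and every AST validates.
        return not self.failed and not self.invalid


def run_checks(
    examples: Dict[str, str],
    parse: Callable[[str], Any],
    validate: Callable[[Any], List[str]],
) -> CheckReport:
    report = CheckReport()
    for name, source in examples.items():
        try:
            ast = parse(source)        # check 1: parsing capability
        except Exception:
            report.failed.append(name)
            continue
        report.parsed.append(name)
        problems = validate(ast)       # check 2: AST validation
        if problems:
            report.invalid[name] = problems
    return report


def toy_parse(source: str) -> Dict[str, Any]:
    # Toy grammar (assumption): integers joined by "+"; int() raises on bad input.
    return {"type": "Sum", "terms": [int(term) for term in source.split("+")]}


def toy_validate(ast: Dict[str, Any]) -> List[str]:
    # Toy rule (assumption): a Sum node must have at least two terms.
    return [] if len(ast["terms"]) >= 2 else ["Sum needs at least two terms"]
```

With the toy functions above, a file containing `1+` fails the first check, a file containing `7` parses but fails the second, and `complete` becomes true only when both the `failed` list and the `invalid` mapping are empty.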

In the development phase, it is crucial to collect a significant number of examples for a given programming language. GitHub is a good starting point for the search if you don’t have licensing concerns. Google hosts a public dataset on Google BigQuery that contains more than 2.8 million open-source GitHub repositories. For example, you can extract the available Java files by running the following query. The LIMIT 1 clause limits the number of extracted files, since Google charges for BigQuery usage.
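A rough sketch of such a query, wrapped in a hypothetical Python helper (`build_files_query`) so that the extension and the limit can be changed. It assumes the `bigquery-public-data.github_repos.files` table of the public GitHub dataset and its `repo_name`, `path`, and `id` columns; these names should be checked against the current dataset schema before use.

```python
def build_files_query(extension: str, limit: int = 1) -> str:
    """Build a BigQuery Standard SQL query listing files with a given extension.

    Assumption: the public table `bigquery-public-data.github_repos.files`
    exposes `repo_name`, `path` and `id` columns; verify against the live schema.
    """
    if not extension.isalnum():
        # Keep the value safe to embed in the SQL text.
        raise ValueError("extension must be alphanumeric, e.g. 'java'")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT repo_name, path, id\n"
        "FROM `bigquery-public-data.github_repos.files`\n"
        f"WHERE ENDS_WITH(path, '.{extension}')\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_files_query("java", limit=1))
```

The resulting string can be pasted into the BigQuery console or passed to a client library. This sketch only lists file paths; retrieving the file contents would require a further query.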

The number of examples matters: it is safe to start with a reasonable number and then add more examples to detect edge cases.

At Strumenta, we have developed tools and plugins integrated with the IDE that assist and streamline the process.

These tools allow monitoring progress in parser development, providing information about Parsing Capability (the number of files successfully parsed) and AST Validation (the language constructs within the example files whose ASTs have been validated). The process is not linear: in our experience, most of the effort goes into parser development, and progress tends to slow down when most of the remaining constructs are edge cases.

Completion and Delivery

Once the parser passes these two checks, it can be considered complete for its initial version. This milestone marks a significant achievement in the development process, allowing for the delivery of the first version of the parser.

Maintenance and Expansion

The maintenance phase begins post-delivery, where the parser undergoes continual refinement and improvement. During this phase, additional files can be added to the validation set, effectively setting new goals for the parser. This continuous improvement cycle ensures that the parser remains effective and up-to-date with evolving language specifications and use cases.
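As an illustration of how adding files sets a new goal, the sketch below extends a validation set and separates regressions (files that used to be in the set and now fail) from new work (newly added files that do not parse yet). The function names and the `parses` callable are hypothetical and are not part of any Chisel tooling.

```python
from typing import Callable, Dict, Set, Tuple


def failing_files(examples: Dict[str, str], parses: Callable[[str], bool]) -> Set[str]:
    """Names of the example files the parser cannot handle."""
    return {name for name, source in examples.items() if not parses(source)}


def extend_goal(
    validation_set: Dict[str, str],
    new_examples: Dict[str, str],
    parses: Callable[[str], bool],
) -> Tuple[Dict[str, str], Set[str], Set[str]]:
    """Add files to the validation set and report what the new goal requires.

    Returns (extended set, regressions, new work).
    """
    extended = {**validation_set, **new_examples}
    failing = failing_files(extended, parses)
    known = set(validation_set)
    regressions = failing & known   # previously accepted files that now fail
    new_work = failing - known      # new files that define the next goal
    return extended, regressions, new_work
```

Keeping the two sets apart makes it clear whether a failing run means the parser broke something it already handled or simply has not reached the new goal yet.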

Adoption

A parser’s true value is realized only when it is effectively integrated and operational within a system. This phase is dedicated to ensuring that the transition is smooth, efficient, and free of unnecessary complications.

The Chisel Method provides three key features to achieve this goal:

1. Providing Good Connectivity

The use of the StarLasu open-source libraries ensures that the parser offers robust and user-friendly APIs for native integration. These APIs are crucial for allowing the parser to communicate effectively with other components, ensuring that data is passed and processed efficiently.

StarLasu provides a library for each of the following platforms:

Kolasu, for the implementation on the JVM (and in particular with Kotlin and Java)
Tylasu, for the implementation on Node.js and in the browser, using TypeScript or JavaScript
Pylasu, for the implementation with Python
Sharplasu, for the implementation with C#

StarLasu supports the Adapter architecture for cross-language integration, which enables interaction between components written in different programming languages.

The adapters are not an alternative to the native libraries; they enable reuse of existing components, such as using a parser written in Java in the development of a transpiler written in Python.

The serialization/deserialization available in StarLasu makes it possible to exchange the ASTs between components written in different languages organized as a Language Engineering Pipeline.
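To illustrate the idea of exchanging ASTs through serialization, here is a minimal sketch that round-trips a tree through JSON. The `Node` class and the JSON layout are assumptions made for this example; they are not the StarLasu API or its actual serialization format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A generic AST node (hypothetical shape, for illustration only)."""
    type: str
    props: Dict[str, Any] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def to_dict(node: Node) -> Dict[str, Any]:
    return {
        "type": node.type,
        "props": node.props,
        "children": [to_dict(child) for child in node.children],
    }


def from_dict(data: Dict[str, Any]) -> Node:
    return Node(
        type=data["type"],
        props=dict(data.get("props", {})),
        children=[from_dict(child) for child in data.get("children", [])],
    )


def serialize(node: Node) -> str:
    # The producing component writes the tree as text...
    return json.dumps(to_dict(node), sort_keys=True)


def deserialize(text: str) -> Node:
    # ...and the consuming component, possibly in another language, rebuilds it.
    return from_dict(json.loads(text))
```

Because the exchanged artifact is plain text, the component that produces it and the component that consumes it do not need to share a runtime, only an agreement on the layout.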

The Language Engineering Pipeline is a structured approach to language processing and translation. It is rooted in Model-Driven Development, where models (in this case, the Abstract Syntax Trees) play a crucial role.

In this architecture, the output of a component is used as input for the next component in the pipeline.

For example, an RPG-to-Java transpiler would be composed of the following components:

RPG Parser: Transforms RPG code into a Plain RPG AST. This step is foundational, as it converts the source code into a tree structure that represents its syntax.
Semantic Enricher: Enriches the Plain RPG AST by resolving symbols and calculating types, resulting in an Enriched RPG AST. This enrichment adds semantic context to the syntax tree, making it more meaningful for subsequent transformation.
AST Transformer: Converts the Enriched RPG AST into a Java AST. This transformation is the core of the transpilation process, mapping the constructs of one language to another.
Java Code Generator: Takes the Java AST and generates executable Java code. This final step turns the abstract tree structure into concrete, runnable code in the target language.

The diagram below provides a graphical representation of the pipeline:

(In this example, note that each component could be written in any of the languages supported by StarLasu. For example, the RPG Parser and the Semantic Enricher could be written in Java for performance reasons, while the AST Transformer and the Java Code Generator could be written in Python.)
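The four-stage structure described above can be sketched as a chain of functions in which each stage consumes the output of the previous one. This is a toy under stated assumptions: the source language is a single `name = integer` assignment rather than RPG, all stages live in one process, and every function name is hypothetical.

```python
from functools import reduce
from typing import Any, Callable, Dict, Sequence

Stage = Callable[[Any], Any]


def pipeline(stages: Sequence[Stage]) -> Stage:
    """Chain stages so that each one receives the previous stage's output."""
    def run(source: Any) -> Any:
        return reduce(lambda value, stage: stage(value), stages, source)
    return run


def toy_parser(source: str) -> Dict[str, str]:
    # Source text -> plain AST.
    name, value = (part.strip() for part in source.split("=", 1))
    return {"type": "Assignment", "name": name, "value": value}


def semantic_enricher(ast: Dict[str, str]) -> Dict[str, str]:
    # Plain AST -> enriched AST (here, only a type is calculated).
    if not ast["value"].isdigit():
        raise ValueError("the toy language only supports integer literals")
    return {**ast, "valueType": "int"}


def ast_transformer(ast: Dict[str, str]) -> Dict[str, str]:
    # Enriched source AST -> target (Java-like) AST.
    return {
        "type": "VariableDeclaration",
        "javaType": ast["valueType"],
        "name": ast["name"],
        "initializer": ast["value"],
    }


def code_generator(java_ast: Dict[str, str]) -> str:
    # Target AST -> target source text.
    return f"{java_ast['javaType']} {java_ast['name']} = {java_ast['initializer']};"


transpile = pipeline([toy_parser, semantic_enricher, ast_transformer, code_generator])
```

Calling `transpile("count = 3")` runs the four stages in order. Since every stage only depends on the shape of its input, any of them could be replaced by a component written in another language that exchanges serialized ASTs instead of in-memory objects.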

The Language Engineering Pipeline architecture can be applied in many use cases such as interpreters or code editors.

2. Providing Documentation

The use of an internal documentation tool automates the generation of comprehensive, clear, and up-to-date documentation. This documentation is essential for users and developers to understand how to integrate and utilize the parser effectively. Good documentation reduces the learning curve and speeds up the integration process.

3. Making the Method Teachable

Ensuring the method is teachable is key to its adoption. By training both the internal team and potentially the client’s team, the Chisel Method establishes a knowledgeable base of users who can maintain and evolve the parser.

Automation plays a crucial role in simplifying the learning process. By automating tasks that are prone to human error, such as AST validation and parsing capability checks, Chisel reduces the likelihood of common mistakes that can be discouraging for learners. This automation not only speeds up the development process but also allows learners to focus on understanding the core concepts and methodologies, rather than getting bogged down in tedious, error-prone details.

Furthermore, Chisel’s systematic approach, which breaks down the development process into clearly defined stages, provides a structured learning path. This structured approach makes it easier to teach and learn, as it organizes the process into manageable segments, each with specific objectives and outcomes. Learners can focus on one aspect of the process at a time, building their knowledge and skills incrementally.

This approach significantly reduces the risk for the client, as they are not solely reliant on external support for maintaining the system.

Conclusion

The Chisel Method attempts to innovate in software engineering, specifically in the area of parser and transpiler development. Chisel offers a structured, pragmatic approach that prioritizes clear objectives, measurable goals, and practical solutions.

Its focus on parsing capabilities, AST validation, and the use of modern tools and resources marks a significant departure from traditional methods, positioning it as a versatile and efficient method for developers.

The Chisel Method not only simplifies the parser development process but also ensures adaptability and sustainability in a field characterized by continual evolution, proposing a more streamlined and effective approach for developers and engineers in Language Engineering.

More resources

Building Advanced Parsers in Kolasu