In this article, we present our SAS parser, a commercially licensed parser for SAS. We are going to see how and why you can use it.

What is SAS?

SAS is a programming language used for statistical analysis. It is a part of the homonymous system used by large organizations to perform statistical computations on data. The system includes both graphical tools and the language itself to allow any kind of user to perform analysis.

What Do You Need a SAS Parser For?

SAS is a language to handle data, so one typical use case is to use a SAS parser to perform analysis and data lineage. In other words, you can use the parser to look at the structure of the data to understand its path, and how the program uses it. Since the language itself is geared toward data analysis, you would not need a parser to analyze the data itself, but it is useful to analyze how you organize the data.

SAS is still used in many companies, but it was developed decades ago. It works great when you are doing things the system is designed to do, but it can be complicated to integrate with new technologies. When such challenges outweigh the benefits of SAS, you want to build a transpiler. This is what some of our clients do with the SAS parser. In most cases, we build the transpiler ourselves, tailored to the needs of the client, but you can use the parser to build a transpiler on your own.

Since SAS is a whole system, moving away from SAS is quite complex, and every company chooses a different approach. So your use case might be different, but we have built a few transpilers from SAS to SQL, Python and platforms like Spark and Databricks.

Why Use a Ready-to-go Parser?

In some cases there are open-source parsers available, so why consider buying a commercial one?

You need a parser that has been thoroughly tested, with good documentation and someone to call in case you encounter any problems.

We are experts and we have built tons of parsers for our clients. This means that we completely understand the importance of this component and we have a solid methodology.

Our SAS Parser has been built for the needs of our clients, so it is battle-tested and implements the main features of SAS. However, the language is so large that we have not yet built support for everything it offers. It supports fundamental features like the DATA step and the PROC step (i.e., procedures). In particular, it supports embedded SQL. These features support data lineage, a common need of our clients. So our SQL support includes handling of expressions and SELECT statements for the purpose of determining the path of data. This way you can understand where a column or table comes from.

We also support macros, although only partially. Macros in SAS are more powerful than what you see in languages like C or C++, so the difficulty is higher. It is difficult to give a good idea of our level of support for macros in a few words, so you can read a whole article on the topic of SAS macros: Challenges in Parsing Legacy Languages: The Case of SAS Macros.

What is a Parser (And What is Not)

We work with parsers every day, so for us it is very clear what a parser is and what it can do. However, this might not always be the case for our clients. So, let’s spend a few words on this point.

In general terms, a parser is software that can understand the syntax, but not the semantics, of some code. Basically, a parser can read the code, but it cannot execute it. For example, a parser can recognize both a variable declaration and an expression. What it cannot do is link the two and understand where a variable used in an expression was declared. This is called symbol resolution, and it is functionality built on top of the parser.
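To make the distinction concrete, here is a minimal sketch in plain Java. It does not use our parser or Kolasu: the Declaration and Reference classes and the resolve method are hypothetical names invented for this illustration. The two lists stand for what a parser can produce on its own, and resolve is the separate pass that links each reference to its declaration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model: these are not the AST classes of the SAS parser.
final class SymbolResolutionSketch {

    /** Something a parser can recognize on its own: a variable declaration. */
    static final class Declaration {
        final String name;

        Declaration(String name) {
            this.name = name;
        }
    }

    /** Something a parser can recognize on its own: a name used in an expression. */
    static final class Reference {
        final String name;
        Declaration target; // stays null until symbol resolution links it

        Reference(String name) {
            this.name = name;
        }
    }

    /**
     * Symbol resolution: a pass that runs after parsing and links each
     * reference to the declaration with the same name.
     * Returns the names that could not be resolved.
     */
    static List<String> resolve(List<Declaration> declarations, List<Reference> references) {
        Map<String, Declaration> byName = new HashMap<>();
        for (Declaration declaration : declarations) {
            byName.put(declaration.name, declaration);
        }
        List<String> unresolved = new ArrayList<>();
        for (Reference reference : references) {
            reference.target = byName.get(reference.name);
            if (reference.target == null) {
                unresolved.add(reference.name);
            }
        }
        return unresolved;
    }

    public static void main(String[] args) {
        List<Declaration> declarations = List.of(new Declaration("total"));
        List<Reference> references = List.of(new Reference("total"), new Reference("count"));
        System.out.println("Unresolved names: " + resolve(declarations, references));
    }
}
```

A real resolver also has to deal with scopes and with the naming rules of the language, which this sketch ignores on purpose.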

Given the needs of our clients, we implemented some advanced functionality that is usually outside the scope of a parser. For example, as we mentioned, our parser can extract the tables and columns used in embedded SQL. We are going to see some examples of what our parser can do later. However, this is an important point to keep in mind if you are comparing different parsers.

StarLasu Methodology

The SAS parser is based on the StarLasu methodology. StarLasu is the missing link between source code and a convenient structure for its interpretation and manipulation: an AST. When building an interpreter, transpiler, compiler, editor, static analysis tool, etc., at Strumenta we always implement the following pipeline:

Source code — ANTLR4 parser → Parse tree — StarLasu → AST — Further processing → … → Result

StarLasu is both the above methodology and a collection of runtime libraries to support it in Java, Kotlin, Python, JavaScript and TypeScript.

At its core, StarLasu lets you define ASTs, on which all other functionality is built. You can navigate and transform ASTs to do everything from reading the original values to simplifying your code. With the features provided by the library you can do anything from analyzing a codebase to building a transpiler.

Kolasu Library

StarLasu identifies a methodology and a series of libraries that implement it. Kolasu is the open-source library for Kotlin and Java. The core of Kolasu is the class Node.

Extend your AST classes from Node to get these features:

  • Navigation: utility methods to traverse, search, and modify the AST
  • Printing: print the AST as XML, as JSON, as a parse tree
  • EMF interoperability: ASTs and their metamodel can be exported to EMF

Classes can have a name, and classes can reference a name. We supply utilities for resolving these references.

Kolasu tries to be non-invasive and implements this functionality by introspecting the AST. All properties, and therefore the whole tree structure, will be detected automatically.
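The idea of detecting the tree structure by introspection can be sketched with plain Java reflection. This is only an illustration of the concept, under our own simplifying assumptions: SketchNode and the other classes below are hypothetical, and this is not how Kolasu's Node is implemented. The node classes declare ordinary fields and nothing else; the base class finds the children by inspecting those fields.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustration only: a tiny stand-in for the idea, not Kolasu's Node class.
final class IntrospectionSketch {

    /** Base class: subclasses only declare fields; children are found by reflection. */
    abstract static class SketchNode {

        /** Returns every field value (or collection element) that is itself a node. */
        List<SketchNode> children() {
            List<SketchNode> result = new ArrayList<>();
            // Simplification: only fields declared on the concrete class are inspected.
            for (Field field : getClass().getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                    continue;
                }
                field.setAccessible(true);
                Object value;
                try {
                    value = field.get(this);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
                if (value instanceof SketchNode) {
                    result.add((SketchNode) value);
                } else if (value instanceof Collection) {
                    for (Object item : (Collection<?>) value) {
                        if (item instanceof SketchNode) {
                            result.add((SketchNode) item);
                        }
                    }
                }
            }
            return result;
        }
    }

    static final class Literal extends SketchNode {
        final int value;

        Literal(int value) {
            this.value = value;
        }
    }

    static final class Sum extends SketchNode {
        final List<SketchNode> operands;

        Sum(List<SketchNode> operands) {
            this.operands = operands;
        }
    }

    static final class Assignment extends SketchNode {
        final String target;
        final SketchNode value;

        Assignment(String target, SketchNode value) {
            this.target = target;
            this.value = value;
        }
    }

    /** Counts the nodes of a tree; no node class has to describe its own children. */
    static int countNodes(SketchNode root) {
        int count = 1;
        for (SketchNode child : root.children()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        SketchNode tree = new Assignment("x", new Sum(List.of(new Literal(1), new Literal(2))));
        System.out.println("Nodes in the tree: " + countNodes(tree));
    }
}
```

The benefit of this design is that adding a new node type requires no traversal code at all: declaring the fields is enough.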

How to Set Up the Parser

The parser needs only two dependencies: the sas-parser and Kolasu packages. For example, in a Java project built with Maven you would write something like this.

<dependencies>
    <dependency>
        <groupId>com.strumenta</groupId>
        <artifactId>sas-parser</artifactId>
        <version>1.4.6</version>
    </dependency>
    <dependency>
        <groupId>com.strumenta.kolasu</groupId>
        <artifactId>kolasu-javalib</artifactId>
        <version>1.4.6</version>
    </dependency>
</dependencies>

This would use the sas-parser and kolasu-javalib, the version of Kolasu tailored for use from Java.

You can easily adapt this for another build system like Gradle.

dependencies {
    implementation "com.strumenta.kolasu:kolasu-core:1.4.6"
    implementation "com.strumenta.kolasu:kolasu-javalib:1.4.6"
    implementation "com.strumenta:sas-parser:1.4.6"
}

How to Use the Parser

You can use the parser JAR just as a command line tool, to create an EMF representation of the code, if you wish. It is as easy as this.

java -jar sas-parser-<version>-jar-with-dependencies.jar <input.sas> emf <output.xmi>

Then you could take advantage of any EMF library to work with the result.

However, most people will want to integrate the parser into a larger project, for example a Java project. That is almost as easy, as you can see from this example.

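import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Assuming IOUtils and StringUtils come from Apache Commons IO and Commons Lang 3
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;

// plus the imports for the Kolasu and sas-parser classes used below
// (Node, ParsingResult, Traversing, Processing, SASLanguage, SourceFile)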

public class Covid19NYT {

    public static final String INDENTATION = "  ";

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ?
                args[0] :
                "https://raw.githubusercontent.com/sassoftware/covid-19-sas/master/Data/import-data-nyt.sas";
        try(InputStream inputStream = new URL(url).openStream()) {
            //We could parse directly from the input stream, without storing the code into a string, however we later
            //extract some text from the code string for demonstration purposes.
            //The parser keeps the entire text in memory anyway, as long as there are live references to the parse tree.
            String code = IOUtils.toString(inputStream, StandardCharsets.UTF_8);
            SASLanguage sas = new SASLanguage();
            sas.optimizeForSpeed(); // Or sas.optimizeForMemory();
            System.out.print("Parsing " + url + "...");
            ParsingResult<SourceFile> result = sas.parse(code, true, true);

            if(result.getCorrect()) {
                System.out.println(" no issues found.");
            } else {
                System.out.println(" there are issues.");
            }
            result.getIssues().forEach(i -> {
                switch (i.getSeverity()) {
                    case INFO:
                        System.out.println("INFO: " + i.getMessage() + (i.getPosition() != null ? " @ " + i.getPosition() : ""));
                        break;
                    case WARNING:
                        System.err.println("WARNING: " + i.getMessage() + (i.getPosition() != null ? " @ " + i.getPosition() : ""));
                        break;
                    case ERROR:
                        System.err.println("ERROR: " + i.getMessage() + (i.getPosition() != null ? " @ " + i.getPosition() : ""));
                        break;
                }
            });

            System.out.println();

            Traversing.walk(result.getRoot()).forEach(node -> {
                for(Node parent = node.getParent(); parent != null; parent = parent.getParent()) {
                    System.out.print(INDENTATION);
                }
                System.out.println(
                        node.getClass().getName().substring("com.strumenta.sas.ast.".length()) +
                        " @ " + node.getPosition() +
                        " with text \"" + StringUtils.abbreviate(node.getPosition().text(code), 30) + "\"");
                Processing.processProperties(
                        node,
                        p -> {
                            if(!p.getProvideNodes()) {
                                for (Node parent = node.getParent(); parent != null; parent = parent.getParent()) {
                                    System.out.print(INDENTATION);
                                }
                                System.out.print(INDENTATION);
                                System.out.println(p.getName() + " = " + p.getValue());
                            }
                            return null;
                        });
            });
        }
    }
}

These lines are the core of the example:

SASLanguage sas = new SASLanguage();
sas.optimizeForSpeed(); 
ParsingResult<SourceFile> result = sas.parse(code, true, true);

We create a SAS parser on the first line, choose the optimization on the second, and parse the code on the third. Now the result variable gives access to the AST through result.getRoot(). In the subsequent lines of the full example we report any issues, then traverse the AST and print the properties we find.

You can see the full example in our dedicated repository.

SAS Procedures

In addition to the DATA step, our SAS parser supports part of the PROC step (also known as procedures). It does not cover every procedure of the SAS system; it supports the following ones:

  • SQL, The SQL procedure implements Structured Query Language (SQL) for SAS
  • Append, The APPEND procedure adds the observations from one SAS data set to the end of another SAS data set
  • Compare, The COMPARE procedure compares the contents of two SAS data sets, selected variables in different data sets, or variables within the same data set
  • Contents, The CONTENTS procedure shows the contents of a SAS data set and prints the directory of the SAS library
  • Datasets, The DATASETS procedure is a utility procedure that manages your SAS files
  • Delete, The DELETE procedure deletes members in a SAS library
  • Expand, The EXPAND procedure converts time series from one sampling interval or frequency to another and interpolates missing values in time series
  • Export, PROC EXPORT reads data from a SAS data set and writes it to an external data source
  • Format, The FORMAT procedure enables you to define your own informats and formats for variables. Informats determine how raw data values are read and stored. Formats determine how variable values are printed.
  • Freq, The FREQ procedure produces one-way to n-way frequency and contingency (crosstabulation) tables
  • Http, PROC HTTP issues HTTP requests
  • Import, PROC IMPORT reads data from an external data source and writes it to a SAS data set
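  • Means, The MEANS procedure computes descriptive statistics for variables across all observations and within groups of observations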
  • Model, The MODEL procedure analyzes models in which the relationships among the variables form a system of one or more nonlinear equations
  • PrintTo, The PRINTTO procedure defines destinations, other than ODS destinations, for SAS procedure output and for the SAS log
  • Sort, The SORT procedure orders SAS data set observations by the values of one or more character or numeric variables
  • Summary, The SUMMARY procedure provides data summarization tools that compute descriptive statistics for variables across all observations or within groups of observations
  • Timeseries, The TIMESERIES procedure analyzes time-stamped transactional data with respect to time and accumulates the data into a time series format
  • TModel, The TMODEL procedure is a new, experimental version of the MODEL procedure
  • Transpose, The TRANSPOSE procedure creates an output data set by restructuring the values in a SAS data set, transposing selected variables into observations

In particular, support for the SQL procedure is crucial, since SQL is the most widely used technology for handling data. Because the parser supports the SQL procedure, you can understand the core of your SAS code: the data handling.

Our parser also supports data lineage for embedded SQL. Data lineage can be a complicated concept, and it is even harder to implement. In our case, we support extracting the columns and tables used inside the SQL code, so that you can trace the path of data through your system. Let’s see an example.

Extracting Columns and Tables From Embedded SQL

Let’s assume that you need to extract all the columns resulting from some CreateTableStatement. All you need to do is use our parser and you can accomplish that in 17 lines or less.

SASLanguage sas = new SASLanguage();
ParsingResult<SourceFile> result = sas.parse(code, true, true);
SourceFile sf = result.getRoot();
Traversing.walkDescendantsBreadthFirst(sf, SqlProcedure.class, node -> {
    ((SqlProcedure) node).getStatements().forEach(stmt -> {
        if (stmt instanceof CreateTableStatement) {
            CreateTableStatement create = (CreateTableStatement) stmt;
            System.out.println("Table:" + create.getTable().getTable());
            create.getQuery().getProjections().forEach(p -> {
                if (p.getAlias() != null)
                    System.out.println("Label: " + p.getAlias().getName());
                else if (p.getExpression() instanceof ColumnExpression)
                    System.out.println("Name: " + ((ColumnExpression) p.getExpression()).getColumn().getColumn());
            });
        }
    });
});

You create the SASLanguage parser, as seen in the previous example, and then you get the full list of statements and declarations in the SAS file. We do this on lines 1-4. As a note, you need to use the handy Traversing class in Java, but you can use an extension method in Kotlin.

Then you just need to filter the statements to get just the ones of type CreateTableStatement and get the query of the statement. At this point the parser automatically calculates the projections used in the SQL query, i.e., the expressions that will create each column. So all you need to do is to check for each projection whether it has a label: if there is one, the label will be the name of the resulting column as seen by the CreateTableStatement. If there is no label, then we assume that the projection is simply the result of a direct column reference.

For example, in this case:

proc sql;
  create table work.country_states as
    select distinct country, state, avg(latitude) as latitude, avg(longitude) as longitude
      from work.novel_corona_virus_states
        where latitude ne . and longitude ne .
          group by country, state
  ;
quit;

The 4 projections of the CreateTable will result in 4 columns with names country, state, latitude and longitude.

Name: country
Name: state
Label: latitude
Label: longitude

This is just an example of what you can do with the data lineage support in our parser. You can extract columns and tables information from the common embedded SQL statements in a SAS file.
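If you want to experiment with the naming rule itself without the parser at hand, the following standalone sketch reproduces it. The Projection class and its two fields are hypothetical simplifications made for this illustration; they are not the AST classes of the parser. A projection gets its name from the alias when there is one, and from the referenced column otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; they are not the parser's AST classes.
final class ColumnNamingSketch {

    /** One item of a SELECT list, reduced to what the naming rule needs. */
    static final class Projection {
        final String referencedColumn; // non-null only for a direct column reference
        final String alias;            // the name after "as", or null when absent

        Projection(String referencedColumn, String alias) {
            this.referencedColumn = referencedColumn;
            this.alias = alias;
        }
    }

    /** Alias first, then direct column reference; null when neither applies. */
    static String resultingColumnName(Projection projection) {
        if (projection.alias != null) {
            return projection.alias;
        }
        return projection.referencedColumn;
    }

    /** Names of the columns the created table would have, skipping unnamed projections. */
    static List<String> resultingColumns(List<Projection> projections) {
        List<String> names = new ArrayList<>();
        for (Projection projection : projections) {
            String name = resultingColumnName(projection);
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // select distinct country, state, avg(latitude) as latitude, avg(longitude) as longitude
        List<Projection> projections = List.of(
                new Projection("country", null),
                new Projection("state", null),
                new Projection(null, "latitude"),
                new Projection(null, "longitude"));
        System.out.println(resultingColumns(projections));
    }
}
```

A projection that is neither aliased nor a direct column reference, such as a bare avg(latitude), is skipped here, which matches the behavior of the Java example in this section.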

Macro Support: It is Complicated

Our parser includes limited macro support. The macro language in SAS is not overly complicated in terms of syntax, but it is integrated in a way that makes it hard to support it fully. You can understand the reason in our article on supporting macros in SAS:

However, recall that the SAS macro language is interpreted and operates on the stream consumed by the SAS interpreter. Macro statements are better understood as instructions to modify a stream of code while it’s being read.

Considering this approach of the SAS system, we have to make some compromises to get good performance and extensive parsing support. We parse the whole SAS file first, with only basic parsing of the macro sections of the code. This allows us to parse most SAS macro code with good performance. Then, when you need to access specifically the body of a macro we lazily fully parse that section.
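The lazy part of this strategy follows a common pattern: keep the raw text, and run the expensive parse only on first access. The sketch below shows the pattern in plain Java under our own assumptions. MacroDefinition, getBody and parseStatements are hypothetical names for this illustration, and the stand-in "full parser" merely splits on semicolons; none of this is the parser's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of lazy, cached parsing; all names are hypothetical.
final class LazyParsingSketch {

    static final class MacroDefinition {
        final String name;
        private final String rawBody;                            // kept as text by the fast first pass
        private final Function<String, List<String>> bodyParser; // the expensive full parse
        private List<String> parsedBody;                         // null until first requested

        MacroDefinition(String name, String rawBody, Function<String, List<String>> bodyParser) {
            this.name = name;
            this.rawBody = rawBody;
            this.bodyParser = bodyParser;
        }

        /** Fully parses the body on first access and reuses the result afterwards. */
        List<String> getBody() {
            if (parsedBody == null) {
                parsedBody = bodyParser.apply(rawBody);
            }
            return parsedBody;
        }

        boolean isBodyParsed() {
            return parsedBody != null;
        }
    }

    /** Stand-in for a full parser: splits the body into statements on ';'. */
    static List<String> parseStatements(String rawBody) {
        List<String> statements = new ArrayList<>();
        for (String part : rawBody.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        MacroDefinition macro = new MacroDefinition(
                "load", "data a; set b; run;", LazyParsingSketch::parseStatements);
        System.out.println("Parsed before access: " + macro.isBodyParsed());
        System.out.println("Statements: " + macro.getBody());
        System.out.println("Parsed after access: " + macro.isBodyParsed());
    }
}
```

With this pattern, a file with many macros pays the full parsing cost only for the macro bodies that the analysis actually visits.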

This approach is not perfect, but we believe it works best for most needs: 

Still, we’ve found our approach to be effective at extracting a great amount of information from SAS source code with macros, while keeping adequate error handling and acceptable performance. Only, as we’ve said from the beginning, this is not a perfect solution, guaranteed to parse 100% of possible SAS code. However, if the objective is to extract information from SAS source code to perform some analysis, or to transpile a single SAS codebase (which will use only a subset of valid SAS, that we can tune our parser to), this approach fits the bill.

Summary

We hope you have seen how easy it is to use our SAS parser and be productive with it. It is not perfect, but it is battle-tested: various companies use it in production.

If this interests you, you can read more about our parsers at Parser-ready-to-go and schedule a meeting with us to get to know more.