In this interview, we spoke with Sebastian Dremo, a .NET developer turned language engineer, who shared his journey into SQL parsing, his experiences building parsers for multiple SQL dialects, and his insights into the tools and techniques essential for success. From beginnings in Ethereum blockchain projects to leading parser development at Dataedo, Sebastian’s story highlights the value of embracing niche skills and continuous learning.

Federico: Hello Sebastian. Thank you for being here with us today. How are you?

Sebastian: Yeah, thank you for having me. I’m good.

Federico: Good. Today we have a few questions that we want to ask you about your experience with language engineering and SQL parsing. But let’s get started by asking you if you can introduce yourself for our audience.

Sebastian: Yeah. Hello everybody. I’m Sebastian, a developer from Poland. I’ve been a developer in the .NET world since 2018, I think. Right now I’m more of a language engineer, if I may say so, working on SQL parsing. Not only parsing, more like analyzing SQL queries.

Federico: Yeah. Good. And can you tell us about your experience with language engineering, this niche? Did it start with your current position? Was it something that you were passionate about before?

Sebastian: Yeah. So me going into language engineering was totally by accident. It started in my previous job in the Ethereum world. My previous company wanted to make something like a language, a small DSL for analyzing the flow of data through the chain. So if a user wanted to write his little DSL script to, say, watch the mempool for some information, we wanted to support that. It was my first time going into the language engineering world. There I learned about Antlr and the whole machinery for creating parsers. Then I wanted to get another job, and I just put a keyword in my skills, Antlr, and that’s it. I didn’t mention anything else about it. And one of the companies I sent my resume to was like, “Oh my God, we’ve actually been looking for someone with this skill for a year now. Can you make a simple program which parses this SQL? If you do that, you get the job.” And I was like, that is kind of easy to parse. So yeah, I got the job done. And he was like, “Please come join us.” At first I was like, all right, it’s not really my idea to work in this field, but why not try it out part time? So I did, and the product moved forward. I was the sole engineer on it. It was a sub-product of our main product. I was working on it and then I fell in love with it. Before that, most of the tasks I got were pretty trivial, or not trivial, but numbing for me. And there I found actual problems that I wanted to work on, so I went full time on that, and yeah, that’s basically it. I’ve been doing SQL parsing for three years now or more.

Federico: Okay. So for young developers listening, maybe it’s worth spending time learning Antlr.

Sebastian: Oh yeah. Not only Antlr, but these little niche skills in general. I’ve seen what it can do for your career development to just learn a little niche skill and wait for it to pay dividends in a few years or so.

Federico: Yeah, that’s nice. And maybe sometimes it’s counterintuitive, because I think a lot of young developers think, “Oh, I just want to learn the most popular programming language, so I will get a lot of job opportunities.” But in reality, specialization helps. Good. But you mentioned parsing SQL, and SQL is a huge world. So can you tell us more about what you mean by parsing SQL, and whether you have experience parsing all sorts of SQL dialects or just some?

Sebastian: So at the start it was like, let’s try to parse as many SQL dialects as we can with one parser, and it backfired really hard. I remember this one time that I needed to show my progress and nothing was working, just because I needed to support all dialects. And I was like, okay, let’s pull an all-nighter and let’s make it like five parsers, one for each dialect. And it worked. So right now we are supporting T-SQL for the MS SQL world, which comes with Azure and stuff, PostgreSQL as well as MariaDB, PL/SQL, so the Oracle world, MySQL and Snowflake SQL. They’re separate parsers, but we call all of those parsers “the parser”, the big parser. So if I give my product to another team, we do not specify the dialect, we just give them the SQL parser.

Federico: Well, I can say that I’ve made the same mistake: beginning with a single parser and running into issues.

Sebastian: Yeah, everybody is like, “Man, those are just simple selects. What are you talking about?” And then they come with the CTEs, the sub-queries, the built-in functions and their syntactic sugar, and they expect it to just work.

Federico: Okay, so you are covering a lot of different SQL dialects, but can I ask you why you are parsing SQL? For which purpose?

Sebastian: Yeah, so like I mentioned before, right now I’m working for kind of a startup. Not really, because it’s been like 10 years in business, but it’s still a small business. It’s called Dataedo, and they document databases for their users. The problem is that databases have some API to get metadata from them, but not everything can be done with it. We can, however, read the queries from users. So why not create a parser for the queries, analyze them, and with that get even more data to give way more value to the users? And yeah, they were searching for a guy to do it, because if they were to write a parser just by hand, it was such a huge task that it could not be done. Then they found me with the Antlr experience, and I tried to do it and did it. But why? Most of it is lineage, so showing the user where the data comes from. Let’s say they use some view in a script, and we show them where the data in this view comes from, and that this view is used here and here, and so on. Another thing is DDLs. Some users do not want to use a third-party application to document their databases, because to document a database we need to connect to it, and even though it’s on-prem, it’s still third party. So they can just give us the creation script of the database, so CREATE TABLE, CREATE VIEW, and so on, and we can create the whole documentation just from that. Next, Oracle has something called packages. A package is a thing which holds functions and procedures in one place. The problem was that the Oracle API returned a package just as one object, not as a collection of objects. So we needed to parse that. And there are a lot of little uses on top of that, like formatting. With MySQL, for example, if you read a query from the API, it comes back as unformatted text, all in one line. So we need to break that up, and so on and so on. I could just go on.
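Editor’s note: the lineage idea Sebastian describes (where the data in a view comes from, and where that view is used in turn) can be pictured as a walk over a dependency graph. The sketch below is only an illustration under stated assumptions: the object names and the `DIRECT_SOURCES` mapping are hypothetical, the edges are written by hand rather than extracted by a parser, and nothing here reflects how Dataedo actually stores or computes lineage.

```python
from collections import defaultdict

# Hypothetical lineage edges: each object maps to the objects it reads from.
# In a real tool these edges would be extracted from parsed SQL, not typed in.
DIRECT_SOURCES = {
    "v_sales_report": ["v_sales", "dim_customer"],
    "v_sales": ["orders", "order_items"],
}


def upstream(obj, direct_sources):
    """Return every object that `obj` depends on, directly or indirectly."""
    seen = set()
    stack = [obj]
    while stack:
        current = stack.pop()
        for source in direct_sources.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


def downstream(obj, direct_sources):
    """Return every object that reads from `obj`, directly or indirectly."""
    # Invert the edges, then reuse the same graph walk.
    readers = defaultdict(list)
    for target, sources in direct_sources.items():
        for source in sources:
            readers[source].append(target)
    return upstream(obj, readers)


print(sorted(upstream("v_sales_report", DIRECT_SOURCES)))
print(sorted(downstream("orders", DIRECT_SOURCES)))
```

Asking for the upstream set answers “where does this view’s data come from?”, and the downstream set answers “where is this table or view used?”. The hard part in practice is producing the edges, which is what the SQL parsers are for.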

Federico: Nice. And so I know it’s a very broad question, but what kind of results did you get? Were you able to parse all the code that you wanted? Did you get good coverage?

Sebastian: Well, not at the start. Like I said, it’s a pretty big task to support all those dialects. So at first we focused on small pieces of value just to bring in users, and then with time more and more scripts were parsed and we gave more value to the users. I would say that the parser was like 80% done in two years or so. It was not like I worked on it for three or four months and it was done. No, it was constant improvement of the parser. Because every time you think that you’re done, a user comes with their little stuff and it breaks your code. Trees are not working properly, and so on and so on. And you have to improve, improve, improve all the time. Right now we have a thousand or more test cases and it’s still not enough.

Federico: And you said you don’t just parse code, but you also calculate lineage. So I think that is additional complexity.

Sebastian: So the whole process looks like this: we get the script and create a parse tree from it with Antlr. Then we run the tree walker on it, with a listener and so on. From that we generate statements, so we break the code into simple statements like CREATE, UPDATE, or SELECT. Then we move into the statements and create metadata on them: what was updated, from where, what was used, and so on. So there’s a lot going on in the project. It’s not simple parsing.

Federico: And have you also built parsers for other languages? Well, you mentioned the DSL that you built at your previous job.

Sebastian: So my first project was with this DSL. It was kind of like SQL, but rather than SELECT, there was WATCH. You specify what is watched, so in blockchain, the mempool or other places where the data comes in, and you basically tell the DSL what you need to watch and it streams the data. It was my first project. Other than SQL parsing, not really. Maybe some toy languages targeting the .NET framework and stuff, but not anymore.
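Editor’s note: the stages Sebastian outlines for the SQL analysis (take a script, break it into statements, then attach metadata to each one) can be shown in miniature. This is a deliberately naive sketch, assuming plain string splitting and regular expressions in place of an Antlr parse tree and listener; the `Statement` class, the patterns, and the sample script are all hypothetical, and the approach would fail on real-world SQL such as semicolons inside string literals, CTEs, or sub-queries.

```python
import re
from dataclasses import dataclass


@dataclass
class Statement:
    kind: str  # first keyword, e.g. "CREATE", "UPDATE", "SELECT"
    text: str  # statement text without the trailing semicolon


def split_statements(script):
    """Naively split on ';'. Breaks on semicolons inside strings or comments."""
    parts = (part.strip() for part in script.split(";"))
    return [Statement(kind=p.split()[0].upper(), text=p) for p in parts if p]


# Hypothetical patterns for illustration only; a grammar handles far more.
TARGET_PATTERNS = {
    "UPDATE": re.compile(r"^UPDATE\s+([\w.]+)", re.IGNORECASE),
    "CREATE": re.compile(r"^CREATE\s+(?:TABLE|VIEW)\s+([\w.]+)", re.IGNORECASE),
}
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def describe(statement):
    """Build a small metadata record: what is written and what is read."""
    pattern = TARGET_PATTERNS.get(statement.kind)
    match = pattern.match(statement.text) if pattern else None
    return {
        "kind": statement.kind,
        "target": match.group(1) if match else None,
        "sources": SOURCE_PATTERN.findall(statement.text),
    }


SCRIPT = """
CREATE VIEW v_sales AS SELECT o.id, i.price FROM orders o JOIN order_items i ON i.order_id = o.id;
UPDATE customers SET active = 0 WHERE id = 7;
SELECT * FROM v_sales;
"""

metadata = [describe(s) for s in split_statements(SCRIPT)]
for record in metadata:
    print(record)
```

The gap between this toy and a production tool is exactly the point made in the interview: the regexes cover a handful of shapes, while a grammar-based parser has to cope with whatever users actually write.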

Federico: Good. And about the techniques that you use, so that maybe we can give advice to someone who wants to learn about this field: you mentioned Antlr, you mentioned data lineage. Does something else come to mind about techniques or tools that someone could learn?

Sebastian: Tools and techniques. So I would say if you want to parse some text, not only SQL, any text, and you don’t have a library for it, do not create a parser by hand. It’s not worth it. The problem has been solved and you can just use some library for it. Antlr is one of them. So if you have to do it, learn Antlr. It’s not that hard, to be honest. If you want to create some simple parser, it’s like one hour of reading documentation and you’re good to go. With Antlr, if you want to move forward with it, to improve your parser, you need to learn some theory about parsing. So what is a parser? What is an AST? How can you traverse the tree? Then you move into some books, like the two books from the author of Antlr, Terence Parr. There’s a huge volume of knowledge there. But that’s basically it. The further you go, the deeper the rabbit hole gets.
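Editor’s note: for readers new to the theory questions Sebastian lists (what an AST is and how you traverse it), here is a minimal sketch. It assumes a tiny hand-built tree for the expression `price * quantity + 5`; the node classes and function names are hypothetical and are not Antlr’s API. Antlr generates its own parse-tree, listener, and visitor classes from a grammar, so this only illustrates the concept.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: object
    right: object


def walk(node):
    """Yield every node in the tree, parents before children (pre-order)."""
    yield node
    if isinstance(node, BinaryOp):
        yield from walk(node.left)
        yield from walk(node.right)


def columns_used(node):
    """Collect the column names referenced anywhere in the expression."""
    return [n.name for n in walk(node) if isinstance(n, Column)]


# Hand-built AST for the expression: price * quantity + 5
tree = BinaryOp("+", BinaryOp("*", Column("price"), Column("quantity")), Number(5))

print(columns_used(tree))
```

The tree keeps only the structure that matters (operators and operands), and a traversal answers a question about it, here “which columns does this expression use?”. Listeners and visitors in parser generators follow the same idea at a larger scale.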

Federico: Yeah. And have you found any challenges in entering this field, maybe difficulty in learning certain topics or in finding resources, or anything else?

Sebastian: Yeah. So the biggest problem for me was the lack of documentation on basically everything. Even Antlr’s own documentation is lacking for me. So I had to dig really deep to find anything. I was going through the PRs on GitHub of Antlr and the Antlr grammars just to learn from the guys how they did something, even though that was lacking too. So thank God there are books, and they help really well. Then I actually found your blog on Strumenta. The post about performance improvements was pretty helpful. But still, for me one of the problems here is resources to learn from. We expect everybody to have knowledge about everything. Let’s say I worked on improving the performance of some of the parsers. I knew about ambiguities, left recursion and so on, but I actually didn’t know where to start. So even though it could have taken one week to finish the job, I had to take like two weeks to do it. So yeah, that is one of the biggest problems. Another problem is with Antlr tooling. Even though the tools to work with it are pretty good, the problem is that you have like three IDEs which are mainly used for working with Antlr. The biggest ones are from JetBrains, so Rider and so on. When it comes to C#, you have VS Code and Visual Studio besides that. There are free plugins that come with them, and the plugins all have their pros and cons. I would love to have just one plugin which has it all. Visual Studio Code has great graphs: when you debug the parser, it can create the ATN graph, which shows you how the parser works. It’s great. JetBrains doesn’t have that, but JetBrains has really good performance benchmarking. Each of them has something. I would just take all of the pros of each one and put them in one product. It would be great. Other than that, little stuff.

Federico: Good! Well, we discussed a few suggestions and pointers for someone entering the field. But for someone already working in the field, based on your experience, do you see any challenge or open problem in the language engineering field for which you think we should find a better solution?

Sebastian: For someone who’s in the field, I guess the biggest problem for the hard stuff is profilers for the grammars. I only know about the JetBrains plugin when it comes to profiling, and it shows you how many ambiguities the rules have and so on, but it’s not showing enough for me. I would love to have it go to a deeper level. I don’t know, just make it better. That’s it. I don’t have one thing in particular to brag about. I just think that tooling should be done better. Even the guys in the Antlr space, in the Antlr repositories, create tools for themselves, even though they could be made for the community. And if they do it for themselves, why not make it into an official plugin or something, given that there is no official plugin? If it helps you, give it to the community.

Federico: Yeah, I see your point. Good. I think we can agree that better tooling would help. Maybe coming back to people entering the field, you mentioned a couple of books that people can look into. The books from Terence Parr, I think “Language Implementation Patterns” and “The Definitive ANTLR 4 Reference”.

Sebastian: Yeah, yeah. If you’re starting, the Antlr book for sure. Then if you want to go deeper, the language patterns one.

Federico: Yeah, good. So are these the best resources that you would suggest, or is there anything else that people should look into?

Sebastian: Well, they’re the easiest to get into. I also read some books about creating compilers and interpreters and stuff. But when it comes to just learning how to get into it, how to deliver value for the customer, they’re the best ones. If you want to start with Antlr, the Antlr reference book is the best. You can read four or five chapters, which are, I don’t know, 150 pages, and that’s it. You know how to use Antlr. Then, like I said, to improve, you go to the second book by Parr.

Federico: Makes sense. Well, first of all, thank you for sharing a lot of your experience. But before we wrap it up, can you tell us where people can find more about your work and the work of your company?

Sebastian: So the company is called Dataedo. It’s based in Poland, but we are serving customers all over the world. As for me, I’m more of a quiet worker, so I don’t post any stuff on social media anymore. I did in the past, but no more. But if I were to post anything, I would post it on my LinkedIn, so look there.

Federico: Good. Thank you. And finally, do you want to add anything else that I didn’t ask or any final remark?

Sebastian: Not really. As for resources, when I was learning all this stuff, the books were enough. But if I wanted to watch how other people were doing it, there is this streamer from Russia on YouTube called Tsoding, T-S-O-D-I-N-G. He’s pretty great. He’s not a language engineer specifically, but he does a lot of cool stuff around language engineering and I learned a lot from him. Bonus: there’s this programmer called Tim dgb. He has a blog. He writes debuggers and you can learn a lot from it. It’s pretty high level, but just go check it out. You won’t regret it.

Federico: We will look for the link and add it to the article. But thank you a lot, Sebastian, for sharing your experience. I think it has been very valuable, and yeah, thank you.

Sebastian: Thank you. Thank you Federico. Bye.

Federico: Bye