AI Coding Feels Like Using an Unreliable Compiler

Written by

Federico Tomassetti

Every developer I know is asking roughly the same questions about LLMs.

Yes, there are LLM-fanatics and LLM-skeptics, but most are LLM-confused: sometimes LLMs seem amazing and sometimes they seem so dumb one wonders why we are using them in the first place.

So we keep asking: where can they help us? Where do they waste our time? Where are they dangerous? Where do they really change the way we build software?

In other words, how can we use them to reliably write better software, or to write software more productively?

While I think most of us are asking ourselves these questions, we each ask them based on our own experience, with our own years of scars from software development.

And from my point of view this reminds me of the time when we did not fully trust compilers. Believe it or not, there was a time when compilers felt a bit buggy, so when an issue in the code came up we could ask ourselves “could it be a bug in the compiler and not in my code?”. 99% of the time the problem was in the code, but from time to time… the compiler of a language as complex as C++ could be wrong. For example, I remember an issue with a compiler from Microsoft incorrectly handling variable declarations in for-loops.

So there was a bit of mistrust towards the thing that translated what we typed into what was then executed.

And this is exactly how LLMs make me feel today: LLMs used for coding today feel like unreliable compilers.

What I mean is that, when we use an LLM to transform our intention into code, the transformation is not faithful, repeatable, or reliable enough for us to just trust the output without any verification. So we cannot simply write the prompt and move on. We must inspect the generated code, understand it, test it, and often correct it.

And that means that now the bottleneck is reviewing code, instead of writing it.

This makes for a very un-sexy and un-exciting story. How can we make that better?

The Many Ways We Use LLMs to Code

There are several levels at which we can use AI for programming.

At the beginning, many of us used LLMs in the most primitive possible way: we opened a chat, pasted a piece of code, asked a question, copied the answer, and pasted it back into the project. The user had to do most of the work. We had to decide which files to paste, how much context to include, which details to omit, how to explain the architecture, and how to integrate the answer back into the codebase. The LLM was doing the generation, but we were manually building the context window around it.

Then the tools moved closer to the code: first through IDE plugins and completions, then through AI-native editors such as Cursor, and finally through command-line coding harnesses such as Claude Code and Codex. These systems can inspect a repository, modify several files, run commands, iterate on failures, and produce a change that looks much more like the work of a developer.

So now the shell around the model counts a lot. It gives the model tools. It decides when to search the codebase, when to read a file, when to run tests, when to use grep or git, when to inspect the diff, and when to retry after a failure.

And this shell is not an LLM: it is ordinary software engineering. It has deterministic logic, predictable commands, tool protocols, heuristics, guardrails, context management, and feedback loops.

This matters.

It means that the quality of AI coding tools does not depend only on the underlying model. It depends on the engineering around the model.

That is why two tools using comparable models can produce very different results. Cursor, Claude Code, Codex, or any other coding environment may behave differently not only because the model is better or worse, but because the harness around the model is better or worse.

The tool that gives the model the right context, exposes the right operations, constrains the right choices, and validates the right outcomes will usually produce better results.
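To make the point concrete, here is a minimal sketch of the feedback loop such a harness runs. It is an illustration under stated assumptions, not how Cursor, Claude Code, or Codex are actually built: `run_harness`, `CheckResult`, `fake_model`, and `check_add` are hypothetical names, and the model call is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    passed: bool
    log: str


def run_harness(
    propose: Callable[[str, str], str],   # hypothetical stand-in for the model call
    check: Callable[[str], CheckResult],  # deterministic verification (tests, compiler, ...)
    task: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, verify it, and feed failures back until the budget runs out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(task, feedback)
        result = check(candidate)
        if result.passed:
            return candidate, attempt
        feedback = result.log  # the failure log becomes context for the next attempt
    return None, max_attempts


def fake_model(task: str, feedback: str) -> str:
    # Stub: pretends the model only gets it right after seeing a failure log.
    return "return a + b" if feedback else "return a - b"


def check_add(body: str) -> CheckResult:
    # Deterministic check: build the function and run one assertion against it.
    namespace = {}
    exec(f"def add(a, b):\n    {body}", namespace)
    ok = namespace["add"](2, 3) == 5
    return CheckResult(ok, "" if ok else "add(2, 3) did not return 5")


candidate, attempts = run_harness(fake_model, check_add, "write add(a, b)")
print(candidate, attempts)  # return a + b 2
```

Everything except `propose` is plain, predictable code: the retry budget, the check, and the decision about what feedback goes back to the model.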

In a sense, software engineering comes back through the window.

A Typical Session with a Coding Harness

Imagine I am working on a system that stores Abstract Syntax Trees in memory.

Now suppose I want to change the memory strategy of this system.

I could tell Claude Code something like:

This system currently keeps all ASTs in memory. Change it so that we keep only the 100 most frequently used ASTs in memory and store the others on disk. ASTs generated by our own transformations can be evicted more aggressively than ASTs obtained from parsing, because the latter are more likely to be reused soon.

The agent starts working.

It reads files. It searches for classes. It proposes a plan. It changes code. It adds a cache. It serializes ASTs to disk. It may add tests. It may run them. It may find a failure and correct itself. It may spend a few minutes working.

At the end, I get a diff.

And then the real work starts.

Because now I need to answer questions such as:

Did it understand what an AST means in this system?
Did it serialize the ASTs correctly? And deserialize them?
Did it handle invalidation?
Did it choose a good eviction policy?
Did it understand the difference between ASTs obtained from transformation and ASTs obtained from parsing?
Did it add a cache in a place where a cache makes architectural sense?
Did it make the code harder to evolve?
Did it create a performance improvement, or did it just move the cost somewhere harder to observe?

The code was generated quickly.

The review is not quick.
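To show how many decisions hide behind that one-paragraph prompt, here is a deliberately small sketch of the kind of cache it describes. Everything in it is an assumption made for illustration: the `AstCache` name, the use of `pickle` for serialization, the origin weights, and the idea that keys are safe to use as file names. An agent could plausibly pick any of these, and each one is something the reviewer has to notice and judge.

```python
import pickle
import tempfile
from pathlib import Path


class AstCache:
    """Keeps at most `capacity` ASTs in memory and spills the rest to disk."""

    # Hypothetical weights: a parsed AST counts as twice as "hot" as a derived one.
    ORIGIN_WEIGHT = {"parsed": 2.0, "transformed": 1.0}

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity        # assumed to be at least 1
        self.spill_dir = Path(spill_dir)
        self.in_memory = {}             # key -> AST
        self.hits = {}                  # key -> access count
        self.origin = {}                # key -> "parsed" or "transformed"

    def _score(self, key):
        # Lower score means evicted sooner.
        return self.hits[key] * self.ORIGIN_WEIGHT[self.origin[key]]

    def _path(self, key):
        # Assumes keys are valid file names; real keys may not be.
        return self.spill_dir / f"{key}.pickle"

    def put(self, key, ast, origin):
        # Note: a stale copy of `key` may remain on disk. Invalidation is not handled.
        self.origin[key] = origin
        self.hits.setdefault(key, 0)
        self.in_memory[key] = ast
        self._evict_if_needed(protect=key)

    def get(self, key):
        self.hits[key] += 1
        if key not in self.in_memory:
            with self._path(key).open("rb") as handle:
                self.in_memory[key] = pickle.load(handle)
            self._evict_if_needed(protect=key)
        return self.in_memory[key]

    def _evict_if_needed(self, protect):
        # Spill the lowest-scoring entries, never the one just touched.
        while len(self.in_memory) > self.capacity:
            candidates = [k for k in self.in_memory if k != protect]
            victim = min(candidates, key=self._score)
            with self._path(victim).open("wb") as handle:
                pickle.dump(self.in_memory.pop(victim), handle)


with tempfile.TemporaryDirectory() as spill_dir:
    cache = AstCache(capacity=2, spill_dir=spill_dir)
    cache.put("a", {"kind": "File"}, origin="parsed")
    cache.put("b", {"kind": "File"}, origin="transformed")
    cache.get("a")
    cache.get("b")
    cache.put("c", {"kind": "File"}, origin="parsed")  # "b" has the lowest score
    print(sorted(cache.in_memory))  # ['a', 'c']
```

Even at this size the sketch leaves several of the review questions open: is `pickle` a faithful serialization for these ASTs, is a 2:1 weight the right reading of "more aggressively", and what happens when an AST changes after it has been spilled?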

Fast Code Is Not the Same as Cheap Code

This is the first unpleasant truth.

LLMs can make code cheaper to produce, but not necessarily cheaper to own.

This distinction matters. A lot.

LLMs seem to offer another jump in abstraction: instead of writing code, we write intentions.

But that jump is incomplete.

The problem is that our intention is not compiled into code in a reliable way. It is interpreted probabilistically by a system that has no real obligation to preserve all the constraints we care about. And it may not even be aware of those constraints. It may produce something plausible. Plausible is not enough.

In software, plausible code is often worse than obviously wrong code.

Obviously wrong code fails immediately. Plausible code survives long enough to become your problem.

The Review Cliff

There is a hidden cliff in LLM-based programming.

Small tasks feel magical.

Ask for a simple script, a test helper, a small refactoring, a CSS adjustment, a parser example, a function that transforms one data structure into another. The output is short enough that you can read it quickly. If it is wrong, you notice. If it is slightly ugly, you fix it.

The ratio is favorable.

You write a short prompt, get a short result, review it in a few minutes, and move on.

But this is not enough. This is not what we are dreaming of. We want LLMs to write large chunks of the system, autonomously.

You write twenty lines of prompt and get five thousand lines of code.

At that point the economics change completely.

The generation cost is still low. The review cost explodes.

This is the review cliff: the moment where the LLM produces more code than you can responsibly understand at the speed at which it is produced.

You start one instance of Claude Code on the caching strategy. Another on a new API. Another on a migration script. Another on UI improvements. They all work. They all finish. They all produce diffs.

Congratulations: you have become the bottleneck.

“But It Compiles”

The obvious response is that we can automate verification.

And yes, we should.

We can check that the code compiles. We can run the tests. We can run linters. We can enforce formatting. We can run static analyzers. We can run benchmarks. We can add approval tests.

If we want LLM-generated code to be acceptable, we need verification machinery.
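As a rough sketch, the simplest form of that machinery is a gate that runs a list of checks and rejects the change if any of them fails. The `run_checks` function and the commands in `CHECKS` are placeholders invented for this example; a real project would plug in its own compiler, test runner, linter, and so on.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, command) pair and collect the names of the failing ones."""
    failures = []
    for name, command in checks:
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            failures.append(name)
    return failures


# Placeholder commands, each one a tiny Python process standing in for a real tool.
CHECKS = [
    ("compiles", [sys.executable, "-c", "compile('x = 1', '<demo>', 'exec')"]),
    ("tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    ("lint", [sys.executable, "-c", "import sys; sys.exit(1)"]),  # simulated failure
]

failed = run_checks(CHECKS)
print("rejected: " + ", ".join(failed) if failed else "accepted")  # rejected: lint
```

The gate itself is easy to write. What it can tell us depends entirely on what the individual checks are able to detect.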

But let us not pretend this is trivial.

Compilation tells us something. It does not tell us enough.

Tests tell us something. They do not tell us enough either, especially if the tests were generated by the same tool that generated the code.

A benchmark tells us something only if the benchmark represents the real workload. Otherwise it tells us that we optimized a toy. Designing and running benchmarks takes time.

A linter tells us whether we violated certain syntactic or stylistic constraints. It does not tell us whether the architecture is deteriorating.

Static analysis can help, but it needs rules. It produces false positives and false negatives.

None of this assesses architectural coherence, or fit for the problem at hand.

And here we hit the real issue: many of the properties we care about are not cheap to specify.

The Things We Actually Care About

Suppose the LLM changes the AST storage mechanism.

What do we want to verify?

We want the code to compile. Fine.

We want the tests to pass. Fine.

We may want to check performance. But which performance?

Total processing time?
Peak memory usage?
Retained memory after processing?
Number of allocations?
Disk usage?
Cache hit rate?
Latency for user-provided ASTs?
Throughput when processing a large codebase?

Each answer implies a different measurement strategy.
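A small sketch of what "different measurement strategy" means in practice, using only Python's standard library. The helper names and the workload are made up for the example. Note that the two measurements are taken in separate runs, because tracing allocations slows the code down and would distort the timing.

```python
import time
import tracemalloc


def wall_time(workload, repeats=5):
    """Best-of-N wall-clock time in seconds, to dampen scheduling noise."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - started)
    return min(timings)


def peak_python_memory(workload):
    """Peak bytes allocated through Python's allocator during one run."""
    tracemalloc.start()
    try:
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_many_small_trees():
    # Stand-in workload: NOT representative data, just enough to run the sketch.
    return [{"kind": "node", "children": [i, i + 1]} for i in range(50_000)]


print(f"time: {wall_time(build_many_small_trees):.4f} s")
print(f"peak: {peak_python_memory(build_many_small_trees) / 1_000_000:.1f} MB")
```

Retained memory, cache hit rate, or disk usage would each need yet another helper, and none of these numbers means much until the workload resembles the real one.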

Then we need representative data.

Not random data. Not tiny examples. Not examples that happen to fit in a blog post. Representative data.

LLMs do not remove this need.

They make it more urgent.

If an LLM can produce changes faster, then our verification process must become better, not lazier.

And performance is only one dimension.

We also care about:

code complexity
duplication
dead code
maintainability
consistency with local patterns
quality of abstractions
observability
debuggability
error messages
extension points
whether the solution fits the phase of the project

Some of these can be partially measured. Many cannot be fully measured in a way that is cheap, automatic, and reliable.

If they could, we would already have automated them before LLMs arrived!

This is the part of the AI coding discussion that is often skipped.

The Missing Context Problem

There is another problem: the LLM does not know everything I know.

And I do not mean only technical information.

I mean context.

For example, am I fixing a bug for a demo this afternoon?

If yes, I may accept a direct, ugly, localized fix. I may even prefer it. A clean redesign that takes three days is the wrong answer if I need the demo to work at 15:00.

Am I changing a foundational building block that should remain maintainable for the next twenty years?

Then the trade-offs are completely different.

Is this internal tooling?

Is this code going to be delivered to a client?

Is this a prototype?

Is this part of a long-term platform?

Is this code expected to become an API used by other teams?

Is this an experiment I will throw away next week?

These questions matter. They shape the solution.

Yet they are often not in the prompt.

Not because we are stupid. Because they are implicit in our head. We know the project. We know the client. We know the history. We know the political constraints. We know which part of the system is fragile. We know which module is scheduled to be replaced. We know which component was written by someone who left five years ago and should not be touched unless absolutely necessary. We know that while a certain solution would be better, our colleague Paul will disagree and keep complaining until we move to another approach he favors. Now try typing that into some markdown file in a way that the LLM takes into account, but that does not offend Paul.

The LLM does not know any of that unless we say it.

And saying all of it is hard.

The Website Exception

There are domains where this problem is less severe.

For example, suppose I ask an LLM to generate a simple marketing website. No sensitive data. No complex backend. No deep domain logic. No long-term maintenance expectations. I can open the browser, look at the result, click around, and decide whether it is good enough.

In that case, the verification loop is visual and cheap.

Does the page look right? Does it adapt to mobile? Do the buttons work? Are there obvious layout problems? Does the copy say what I want?

For a brochure site, verification may be cheap.

For a caching layer in a language engineering platform, verification is not cheap.

The value of LLMs depends not only on how fast they generate the artifact.

It depends on how fast and reliably we can decide that the artifact is acceptable.

The Babysitting Future

There is a future of software development that I find deeply unattractive.

In that future, developers do not design systems. They do not build a deep understanding of the codebase. They do not improve abstractions. They do not shape languages, models, and tools.

They babysit agents.

They run five coding agents in parallel. Each agent produces a diff. The developer reviews the diffs, tries to understand unfamiliar code, asks for corrections, runs tests, accepts some changes, rejects others, and slowly loses touch with the system.

This may look productive from a dashboard.

Number of pull requests: up.

Lines of code produced: up.

Velocity: allegedly up.

Developer satisfaction: probably down.

System coherence: ask again in six months.

I know some people outside software may not care. They may think: “Fine, we will always find someone desperate enough to review generated code.”

Maybe.

But I doubt this produces good software.

More likely, one of two things happens.

Either we still need skilled developers to review the generated code, and they become the bottleneck in an increasingly frustrating process.

Or we stop reviewing properly.

The second option is faster.

It is also how systems rot.

Code Generation Has Always Had This Problem

This discussion is not entirely new.

We have had code generation for a long time.

I have written about it in A Guide to Code Generation. Code generation can be extremely valuable. It can remove repetitive work, enforce consistency, and let us operate at a higher level of abstraction.

But traditional code generation has one crucial property: the generator is deterministic.

If we feed the same model to the same generator, we expect the same output.

If the output is wrong, we can fix the generator or the model. Over time, the system becomes more reliable. The generated code may be ugly, but at least it is ugly in predictable ways.
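A toy illustration of that property, assuming a made-up model format (a class name plus a list of field and type pairs) and a hypothetical `generate_class` function. It is far simpler than any real generator and assumes at least one field, but it shows the guarantee we rely on: the same model always renders to the same text.

```python
def generate_class(name, fields):
    """Render a Python class from a tiny model: a name plus (field, type) pairs."""
    params = ", ".join(f"{field}: {type_}" for field, type_ in fields)
    lines = [
        f"class {name}:",
        f"    def __init__(self, {params}):",
    ]
    for field, _ in fields:
        lines.append(f"        self.{field} = {field}")
    return "\n".join(lines) + "\n"


model = ("Point", [("x", "int"), ("y", "int")])
first = generate_class(*model)
second = generate_class(*model)

assert first == second  # same model, same generator, same output
print(first)
```

If the generated constructor turned out to be wrong, we would fix `generate_class` once and every future output would be fixed with it.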

LLM-based generation is different.

It is not just a generator. It is a generator with taste, guesses, incomplete memory, and occasional hallucinations.

This can be useful during exploration.

It is much less comforting when we need engineering guarantees.

A traditional generator is a machine we can improve.

An LLM is a collaborator we must supervise.

That distinction matters.

The Interesting Direction: Raise the Level of Review

So where do we go from here?

I do not think the answer is to reject LLMs.

But I also do not think the sustainable future is simply “write short prompts and accept larger diffs.”

The interesting direction is different:

We need to raise the level at which humans review the work.

Today, the gap is often absurd.

Human intention:

Keep the 100 most frequently used ASTs in memory and store the rest on disk, with different eviction priorities for generated and user-provided ASTs.

LLM output:

Thousands of lines of code spread across cache classes, serializers, tests, configuration, and lifecycle management.

The human must review the second artifact to determine whether it matches the first.

That is a huge semantic jump.

If LLMs produced output in the tens of lines, we could give our input and be part of the conversation. We could review quickly and feel that we have a role in the creative process, rather than just reviewing the code vomited by the LLM.

That is where LLMs could become genuinely liberating.

Not by replacing engineering judgment, but by helping us work at a level where engineering judgment is more effective.

This Is Where DSLs Become More Relevant, Not Less

This brings us back to Domain Specific Languages.

A DSL is not just “a cute syntax.”

A DSL is a way to capture domain concepts precisely enough that people and tools can reason about them. See The complete guide to external Domain Specific Languages.

The mistake would be to say: “Now that we have LLMs, we do not need DSLs.”

I think the opposite may be true.

Now that we have LLMs, we need better DSLs.

Because LLMs make it cheap to generate code, but they do not make it cheap to understand code.

What I Want from AI Coding Tools

I do not want an AI tool that produces the maximum amount of code.

That is easy. That is not the problem.

I want an AI tool that minimizes the amount of code I need to distrust.

That means:

smaller diffs
clearer intent
explicit trade-offs
generated artifacts at the right level of abstraction
the ability to say “this cannot be safely changed without more information”

That last point is important.

A junior developer who confidently changes code they do not understand is dangerous.

An LLM that confidently changes code it does not understand is not fundamentally different.

The difference is scale.

The LLM can be confidently wrong much faster.

A Better Collaboration Model

The best collaboration model I can imagine is not:

Human writes vague prompt.
LLM writes lots of code.
Human reviews lots of code.

That is the babysitting model.

A better model is:

Human and LLM discuss the problem at the domain level.
LLM helps produce a precise higher-level artifact.
Human reviews that artifact.
Tools validate it.
Deterministic generators or constrained transformations produce implementation code.
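As a sketch of what such a higher-level artifact could look like for the AST cache example, consider a small policy object that a human can review in seconds, a validator, and a deterministic rendering step. The `CachePolicy` fields, the validation rules, and the output format are all assumptions invented for this illustration, not an existing tool or format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    in_memory_limit: int
    overflow: str          # "disk" is the only target this sketch knows
    eviction_weight: dict  # origin -> weight; lower weight is evicted sooner


def validate(policy):
    """Return a list of problems; an empty list means the policy is acceptable."""
    problems = []
    if policy.in_memory_limit < 1:
        problems.append("in_memory_limit must be at least 1")
    if policy.overflow != "disk":
        problems.append(f"unsupported overflow target: {policy.overflow!r}")
    for origin in ("parsed", "transformed"):
        if policy.eviction_weight.get(origin, 0) <= 0:
            problems.append(f"missing or non-positive weight for {origin!r}")
    return problems


def generate_config(policy):
    """Deterministically render the reviewed policy into implementation settings."""
    weights = ", ".join(
        f"{origin}={weight}"
        for origin, weight in sorted(policy.eviction_weight.items())
    )
    return f"capacity={policy.in_memory_limit}; spill={policy.overflow}; weights[{weights}]"


# The artifact the human reviews: three lines instead of thousands.
policy = CachePolicy(
    in_memory_limit=100,
    overflow="disk",
    eviction_weight={"parsed": 2.0, "transformed": 1.0},
)

assert validate(policy) == []
print(generate_config(policy))
# capacity=100; spill=disk; weights[parsed=2.0, transformed=1.0]
```

In this arrangement the LLM's contribution is the policy, which is small enough to discuss and correct, while the step from policy to implementation stays predictable.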

Conclusion: The Bottleneck Is Trust

LLMs are useful. Whether we like it or not, and whatever our opinions, they are here to stay.

But the central problem remains trust.

If I cannot trust the transformation from intention to code, I must review the code. If the generated code is large, the review becomes expensive. If I skip the review, I accumulate risk. If I keep reviewing everything, I become the bottleneck.

This is why I say that LLMs are currently like unreliable compilers.

They translate a higher-level input into a lower-level artifact, but we cannot fully trust the translation.

And an unreliable compiler is not a small inconvenience. It changes the economics of development. It changes how we debug. It changes where we spend our attention. It forces us to verify what should have been guaranteed.

It makes humans drown in details.

The solution is not to abandon LLMs.

The solution is to stop pretending that natural language prompts plus generated code are enough.

We need better intermediate representations. Better DSLs. Better models. Better validation. Better tooling. Better ways to capture context. Better ways to review intent rather than inspect endless diffs.

The future of AI-assisted development should not be developers babysitting armies of code-generating agents.

It should be developers working at higher levels of abstraction, with AI helping them express, refine, validate, and implement decisions in forms that both humans and machines can understand.

That is a future I can get excited about.

The other one just sounds like reviewing pull requests forever.

Software development is not typing. It has never been typing. If typing were the bottleneck, the history of software engineering would have ended with better keyboards.

We have spent decades trying to raise the level of abstraction because the hard part is not producing more characters. The hard part is controlling complexity. I wrote about this in Raising the level of abstraction: what if we tried to do that bottom up?: as we move away from registers, pointers, and low-level details, we can reason in terms of larger building blocks.

I have spent a large part of my career building exactly this kind of machinery. In The complete guide to external Domain Specific Languages, I argued that DSLs are useful because they help us express domain concepts precisely and build tooling around them. In Building a language: tool support, I made the point even more directly: a language without proper tool support is not enough.
