AI RPG Migration Can Compile and Still Break Business Logic

Written by

Federico Tomassetti

AI-assisted RPG migration tools are being evaluated on the wrong metric. Compilation success and test-pass rates measure whether the translated code runs, not whether it produces the same results as the original. RPG carries implicit behavioral contracts that are invisible to compilers, absent from LLM training data, and untestable by suites generated from the same source. The tools miss them. The tests miss them. The bugs ship. And if I have understood one thing about IBM i users, it is that they value stability and safety above almost everything else.

Is this another random claim by someone who is trying to sell me something? No: the numbers are unambiguous. Every single-prompt LLM baseline in the AgentModernize study (arxiv 2605.17535) compiled and ran successfully, and scored 0.0% on behavioral equivalence tests. The best multi-agent system in the same study reached 19.4%. No RPG migration vendor has published a semantic accuracy benchmark. Not IBM, not EvolveWare, not anyone.

Where does that leave the user? If your payroll program migrates cleanly, as in "it compiles and the generated unit tests pass," all you get is a false sense of security (a hallmark of all things LLM). Three months later, a year-end batch job silently produces wrong totals, but only for employees whose bonus calculation involves a division that doesn't terminate evenly. A fraction of a cent per record, accumulated across thousands of employees. The source RPG program truncated. The migrated Java program rounded. Nobody wrote a test for that. Nobody knew to.

That’s the key takeaway for any company planning an RPG migration: compiling and passing generated unit tests is not enough to sign off the project.

What the Numbers Actually Show

The AgentModernize study is the most precise public benchmark available. Their multi-agent framework with iterative feedback loops, the most sophisticated architecture in the study, achieved a Behavioral Equivalence Rate (BER) of only 9.4% with GPT-4o-mini, and 19.4% with GPT-5.3-codex as the best result. Without the feedback loop, BER dropped to 0.0% across all models regardless of architecture.

More telling: their Behavioral Specification Graph correctly extracted 91.2% of gold-standard business rules from the source. End-to-end code generation reached at best 19.4% behavioral equivalence. That is a gap of roughly 72 points between "the tool understood the rules" and "the tool correctly implemented them." Understanding the business logic and translating it faithfully are separate problems, and current LLMs fail badly at the second one even when they succeed at the first.

The PL/SQL-to-Java study from a large Dutch financial institution (arxiv 2508.19663, 2.5 million lines of PL/SQL) found success rates ranging from 0% to 80% across individual files, with most files in the 26–60% range. Even there, the authors explicitly called out that “functional correctness of the generated code remains relatively low.” And PL/SQL is not RPG: it does not carry the same volume of implicit runtime contracts baked into the language specification itself.

I want to be in the room when the guy who approved the LLM-based migration tells his boss that half the code was translated correctly. And then, after a pause: we do not know which half. That is not a migration status report a CIO wants to deliver.

The Vendor Statements Are Telling

EvolveWare’s CEO Miten Marfatia told IT Jungle in November 2025: “if the context is not provided along with your prompts, then things go haywire.” He was talking about their own tool. This was not a competitor taking a shot at them. And, by the way, I find it refreshing that someone speaks with some realism in this space. I am more than fed up with incredible, unsubstantiated claims.

Golden Path Digital’s founder stated in April 2026 that the same input produces “wildly different output two out of 10 times.” Again, not so reassuring, in my opinion.

The Five Behaviors Compilers Cannot See

The reason the numbers are this bad is not model capability in general. It is that RPG’s implicit contracts are invisible to compilers, to LLMs trained on syntactic patterns, and to test suites generated from abstract source representations. Here are the specific behaviors where migrations fail silently.

Packed decimal truncation. RPG’s default arithmetic without the (H) extender is truncation, not rounding. Java’s BigDecimal.divide() without a MathContext throws ArithmeticException on non-terminating expansions; with a MathContext, it rounds using the context's mode, which is HALF_UP for a context built from a precision alone and HALF_EVEN for the predefined DECIMAL32, DECIMAL64, and DECIMAL128 contexts. Those are three distinct behaviors, and none of them is truncation. A migration that maps RPG numeric division to Java BigDecimal without explicitly replicating truncation produces wrong results, but only on values where the dropped digits are non-zero. Test data generated from “normal” business amounts will miss this entirely. The usual objection is that reproducing these details is tedious and makes the code too verbose. That may be true, and a deviation can be acceptable in certain contexts, provided you know you are making it and understand the consequences.
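To make the difference concrete, here is a minimal Java sketch. It only models the final step, the assignment of a quotient to a result field with two decimal places, and it does not reproduce RPG's intermediate precision rules. The class name, the scale of 2, and the amounts are my assumptions for illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

class PackedDivision {
    // Mimics an RPG result field with `scale` decimals and no (H) extender:
    // excess digits are dropped toward zero, never rounded.
    static BigDecimal divideTruncating(BigDecimal a, BigDecimal b, int scale) {
        return a.divide(b, scale, RoundingMode.DOWN);
    }

    // What a naive translation often does instead: half-adjust, like RPG with (H).
    static BigDecimal divideRounding(BigDecimal a, BigDecimal b, int scale) {
        return a.divide(b, scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal bonus = new BigDecimal("200.00");
        BigDecimal parts = new BigDecimal("3");

        System.out.println(divideTruncating(bonus, parts, 2)); // 66.66
        System.out.println(divideRounding(bonus, parts, 2));   // 66.67

        // A MathContext fixes significant digits, not decimal places, and rounds HALF_EVEN here.
        System.out.println(bonus.divide(parts, MathContext.DECIMAL64));

        try {
            bonus.divide(parts); // no MathContext, non-terminating quotient
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException: " + e.getMessage());
        }
    }
}
```

One cent of difference on one record is nothing. The point is that the two methods agree on every input whose quotient terminates within two decimals, which is exactly the kind of input people use when writing tests by hand.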

*LOVAL for character fields. In RPG, *LOVAL on a character field is hexadecimal x'00' in every byte: null bytes, not an empty string and not blanks. *HIVAL is x'FF' bytes. A Java migration that translates SETLL *LOVAL as positioning to an empty or blank string key can misbehave on any key that begins with a byte below the blank character (x'20' in ASCII, x'40' in the EBCDIC encodings used on IBM i).
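A small Java sketch of how this can go wrong, assuming the translation models *LOVAL as a blank-filled key. The TreeMap stands in for a keyed file and ceilingEntry for SETLL followed by a read; both are simplifications of mine, as are the class name and the sample keys. It needs Java 9 or later for Arrays.compareUnsigned.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

class LovalPositioning {
    // Keys are ordered as unsigned bytes, the way raw character keys sort.
    static TreeMap<byte[], String> newKeyedFile() {
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        return new TreeMap<>(unsignedOrder);
    }

    // *LOVAL for a character key of the given length: every byte is x'00'.
    static byte[] loval(int length) {
        return new byte[length]; // Java zero-fills new arrays
    }

    // The tempting but different translation: a key filled with blanks.
    static byte[] blanks(int length, byte blank) {
        byte[] key = new byte[length];
        Arrays.fill(key, blank);
        return key;
    }

    public static void main(String[] args) {
        byte ebcdicBlank = 0x40; // blank in EBCDIC; in ASCII it would be 0x20
        TreeMap<byte[], String> file = newKeyedFile();
        file.put(new byte[] {0x05, (byte) 0xC1, (byte) 0xC2}, "record with a low first byte");
        file.put(new byte[] {(byte) 0xC1, (byte) 0xC2, (byte) 0xC3}, "record ABC");

        // Positioning to the first key greater than or equal to the search argument.
        System.out.println(file.ceilingEntry(loval(3)).getValue());               // record with a low first byte
        System.out.println(file.ceilingEntry(blanks(3, ebcdicBlank)).getValue()); // record ABC
    }
}
```

The blank-filled key silently skips the first record. On a file where no key starts below blank, both variants behave identically, so the bug stays hidden until the data changes.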

Activation group persistence. In RPG, a program running in *CALLER or a named activation group does not reinitialize its global variables on each call, as long as it returns without setting on LR. They persist for the lifetime of the activation group, potentially the entire job. If the migrated Java service keeps that state in instance-level fields of an object created per call, it reinitializes on every call. If it uses static fields incorrectly, it shares state across unrelated callers. Either way, the behavior diverges. No single-call test will catch this. You need a test that calls the program twice within the same job and verifies that the second call picks up state from the first.
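One way to picture the required scoping in Java is a holder object per job. This is a sketch under my own assumptions: InvoiceCounter, JobScope, and the program name INVNBR are hypothetical, and a real design would also have to decide when the scope ends and what happens when the program returns with LR on.

```java
import java.util.HashMap;
import java.util.Map;

class InvoiceCounter {
    // Plays the role of an RPG global variable: it survives between calls
    // for as long as the owning scope (the "activation group") lives.
    private int lastInvoiceNumber = 1000;

    int nextInvoice() {
        lastInvoiceNumber += 1;
        return lastInvoiceNumber;
    }
}

class JobScope {
    // One set of program state per job: not per call, and not shared by every caller.
    private final Map<String, InvoiceCounter> programs = new HashMap<>();

    InvoiceCounter program(String name) {
        return programs.computeIfAbsent(name, n -> new InvoiceCounter());
    }
}

class ActivationGroupDemo {
    public static void main(String[] args) {
        JobScope jobA = new JobScope();
        JobScope jobB = new JobScope();
        System.out.println(jobA.program("INVNBR").nextInvoice()); // 1001
        System.out.println(jobA.program("INVNBR").nextInvoice()); // 1002, state carried over
        System.out.println(jobB.program("INVNBR").nextInvoice()); // 1001, another job has its own state
        System.out.println(new InvoiceCounter().nextInvoice());   // 1001, a per-call instance forgets
    }
}
```

The second line of output is the one a single-call test never sees.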

CHAIN no-record-found leaves stale data. When a CHAIN operation fails to find a record, the externally described data structure fields retain their previous values. RPG programs routinely check %FOUND after CHAIN without first clearing the data structure, relying on the stale values deliberately — sometimes as a sentinel, sometimes by accident that became a feature. A Java migration that initializes the result object to null or zero on not-found will produce different output on any execution path that depends on those retained values. And since the pattern only shows up when CHAIN fails, it is invisible to tests built on happy-path data.
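Replicating this in Java means the lookup must write into a long-lived buffer instead of returning a fresh object. A minimal sketch, where CustomerLookup, its fields, and the in-memory map standing in for the file are all hypothetical names of mine:

```java
import java.util.HashMap;
import java.util.Map;

class CustomerLookup {
    // Hypothetical stand-ins for the externally described record fields.
    String custName = "";
    int creditLimit = 0;
    private boolean found = false;

    private static final class Row {
        final String name;
        final int limit;
        Row(String name, int limit) { this.name = name; this.limit = limit; }
    }

    private final Map<Integer, Row> file = new HashMap<>();

    void write(int custNo, String name, int limit) {
        file.put(custNo, new Row(name, limit));
    }

    // CHAIN-like read: the fields are overwritten only when a record is found.
    void chain(int custNo) {
        Row row = file.get(custNo);
        found = row != null;
        if (found) {
            custName = row.name;
            creditLimit = row.limit;
        }
        // Not found: custName and creditLimit keep what the previous read left there.
    }

    boolean found() { return found; }

    public static void main(String[] args) {
        CustomerLookup cust = new CustomerLookup();
        cust.write(100, "ACME", 5000);
        cust.chain(100);
        cust.chain(999); // no such customer
        System.out.println(cust.found() + " " + cust.custName + " " + cust.creditLimit); // false ACME 5000
    }
}
```

An idiomatic Java lookup returning Optional.empty() or null would be cleaner, and it would also be a behavior change for every caller that reads the fields after a failed CHAIN.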

RPG cycle logic. RPG II and RPG III programs contain an implied program cycle: automatic record reading, control break detection, level indicator management, LR indicator handling that triggers final totals output, and implicit population of special variables like UDATE, UYEAR, and PAGE. A Java migration that replaces cycle logic with an explicit loop and sequential record reads will not match the sequencing of control breaks, output groups, or LR-triggered totals unless the developer explicitly models the entire RPG state machine. IBM’s own Granite training effort, as of mid-2025, treats this as an explanation problem, not yet a translation problem — the model needs to understand what the cycle is doing before it can claim to replace it. Most migrated code skips this entirely, producing a linear main method that compiles cleanly and produces wrong totals on any program that used level indicators.
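To give an idea of what "modeling the state machine" means, here is a deliberately reduced Java sketch of a single control level (L1) plus LR totals. It assumes input sorted by the control field and ignores matching records, overflow, higher levels, and the exact empty-file behavior of the cycle. The names and the output format are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ControlBreakReport {
    // depts[i] is the control field of record i, amounts[i] its amount.
    static List<String> run(String[] depts, int[] amounts) {
        List<String> out = new ArrayList<>();
        String currentDept = null;
        int deptTotal = 0;
        int grandTotal = 0;

        for (int i = 0; i < depts.length; i++) {
            // Control break: the control field changed, so total-time output for the
            // previous group happens before detail processing of the new record.
            if (currentDept != null && !currentDept.equals(depts[i])) {
                out.add("L1 " + currentDept + " " + deptTotal);
                deptTotal = 0;
            }
            currentDept = depts[i];
            deptTotal += amounts[i];
            grandTotal += amounts[i];
        }

        // End of file sets LR, which also triggers the pending L1 totals
        // before the final totals. Skipping L1 on empty input is my simplification.
        if (currentDept != null) {
            out.add("L1 " + currentDept + " " + deptTotal);
        }
        out.add("LR " + grandTotal);
        return out;
    }

    public static void main(String[] args) {
        String[] depts = {"A", "A", "B"};
        int[] amounts = {10, 5, 7};
        run(depts, amounts).forEach(System.out::println);
        // L1 A 15
        // L1 B 7
        // LR 22
    }
}
```

A linear translation that only accumulates inside the loop and prints once at the end produces the LR line and loses both L1 lines, including the last group's subtotal that only LR flushes.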

The “Human in the Loop” Argument Fails Here

Both IBM and EvolveWare position developer review as the answer to the accuracy problem. Review the suggestions, run the tests, validate incrementally. This is a reasonable response to the general code generation problem. It is not a reasonable response to this specific one.

The developers reviewing AI-generated RPG-to-Java translations are not RPG semantics experts. They are the same developers who are migrating away from RPG because they don’t want to maintain it. If you had plenty of RPG developers deeply familiar with all the intricacies of your system, you would not be migrating.

What a Real Behavioral Equivalence Test Corpus Contains

A test suite generated by an LLM from the source code is not a test. An evaluation that runs the migrated code on the same inputs used in development is not sufficient.

A behavioral equivalence corpus for an RPG migration needs:

Production data distributions, not representative samples. The packed decimal truncation bug only fires on values where division is non-terminating. Year-end financial amounts, accumulated balances, tax calculations — these are the inputs that expose the gap.
Legacy date values. Dates before 1940 and after 2039 are out of range for the *MDY format. A program that has been in production since 1985 has almost certainly met dates outside that window, for example birth dates before 1940 or maturity dates after 2039. Migrated code tested only on current-year dates will not expose mismatches with the DATFMT that was fixed at compile time.
Multiple-call sequences within a single job. Any program that relies on activation group persistence, file cursor position, or stale-CHAIN behavior will pass all single-call tests and fail on the second invocation in a real job stream.
Not-found paths explicitly. CHAIN, READE, READ on empty result sets. Every path where a search returns nothing needs to be in the corpus with verification of what the output state is — not just whether the program returns successfully.
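The legacy date item in the list above is easy to demonstrate. The sketch below applies the two-digit-year window (40 to 99 means 1940 to 1999, 00 to 39 means 2000 to 2039) and compares it with what java.time does for the "yy" pattern, which resolves against base year 2000. The class and method names are mine, and the six-digit MMDDYY input without separators is an assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class MdyWindow {
    // Two-digit year window: 40-99 means 1940-1999, 00-39 means 2000-2039.
    static int expandYear(int yy) {
        if (yy < 0 || yy > 99) {
            throw new IllegalArgumentException("yy must be between 0 and 99");
        }
        return yy >= 40 ? 1900 + yy : 2000 + yy;
    }

    // Parses six digits in MMDDYY order using the window above.
    static LocalDate parseMdy(String mmddyy) {
        int month = Integer.parseInt(mmddyy.substring(0, 2));
        int day = Integer.parseInt(mmddyy.substring(2, 4));
        int yy = Integer.parseInt(mmddyy.substring(4, 6));
        return LocalDate.of(expandYear(yy), month, day);
    }

    // True when the date can be represented with a windowed two-digit year at all.
    static boolean fitsMdy(LocalDate date) {
        return date.getYear() >= 1940 && date.getYear() <= 2039;
    }

    public static void main(String[] args) {
        System.out.println(parseMdy("123185")); // 1985-12-31
        // The stock formatter disagrees by a century on the same digits:
        System.out.println(LocalDate.parse("123185", DateTimeFormatter.ofPattern("MMddyy"))); // 2085-12-31
        System.out.println(fitsMdy(LocalDate.of(1938, 5, 17))); // false
    }
}
```

A test corpus made only of dates from the current year will pass with either parser, which is why the old records belong in it.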

The teams that will catch these failures are not the teams that run the vendor’s demo data. They are the teams that feed the migrated system their actual 1987 payroll records and verify the cents column matches.
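In code, "verify the cents column matches" can be as simple as replaying captured inputs and comparing against the outputs the legacy program actually produced. The following Java sketch assumes a corpus of input and legacy-output pairs has already been extracted; the Case type, the identifiers, and the divide-by-three bonus rule are all hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class EquivalenceCheck {
    // One captured production case: an input and the output the legacy program produced for it.
    static final class Case {
        final String id;
        final BigDecimal input;
        final BigDecimal legacyOutput;

        Case(String id, String input, String legacyOutput) {
            this.id = id;
            this.input = new BigDecimal(input);
            this.legacyOutput = new BigDecimal(legacyOutput);
        }
    }

    // Runs the migrated logic on every case and reports each numeric difference.
    static List<String> mismatches(List<Case> corpus, UnaryOperator<BigDecimal> migrated) {
        List<String> report = new ArrayList<>();
        for (Case c : corpus) {
            BigDecimal actual = migrated.apply(c.input);
            // compareTo ignores trailing zeros; switch to equals if the scale matters too.
            if (actual.compareTo(c.legacyOutput) != 0) {
                report.add(c.id + ": legacy=" + c.legacyOutput + " migrated=" + actual);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<Case> corpus = List.of(
                new Case("emp-0001", "300.00", "100.00"),
                new Case("emp-0002", "200.00", "66.66"));

        // Migrated rule that rounds where the legacy program truncated.
        UnaryOperator<BigDecimal> migrated =
                amount -> amount.divide(new BigDecimal("3"), 2, RoundingMode.HALF_UP);

        mismatches(corpus, migrated).forEach(System.out::println);
        // emp-0002: legacy=66.66 migrated=66.67
    }
}
```

The harness is trivial. The hard and valuable part is the corpus: the expected values must come from the running legacy system, not from anyone's reading of the source.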

Until that benchmark exists, a vendor’s accuracy claim is a statement about syntax, not behavior. Those are not the same thing, and the difference is your money and your liability.

If you want to understand the full scope of what a proper RPG migration actually requires, we wrote Migrating RPG Code to Modern Languages as a practical playbook for exactly this.

If your team needs to migrate RPG code and cannot afford semantic drift, Strumenta can help with a deterministic migration approach: explicit modeling of the source language, controlled transformations, and behavioral validation instead of hoping that an AI translation happens to be right.

We run a fixed-price Migration Pilot for both RPG-to-Java and RPG-to-Python targets, if you want to see what that looks like on your own code.
