I want to talk about what I think is the biggest impact of Large Language Models (LLMs) on the localization and translation industry – and why they’re fundamentally different from neural machine translation (NMT).
At its core, this is about one thing: how LLMs overcome the limits that cap NMT quality, and what you can expect as a result.
How far NMT quality can go
NMT has given the industry a big productivity boost. But it comes with an intrinsic ceiling.
You can push quality up by choosing the right NMT model, cleaning your data, and carefully evaluating trained models. For a given language pair, you usually have two paths:
- Use stock NMT models, pay about $20 per million characters, and get a baseline quality level.
- Or customize NMT with your translation memories and glossaries, pay roughly 4× more per translated character, and invest effort into customization and evaluation to reach the next quality tier.
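To see what these two paths mean in dollar terms, here is a back-of-the-envelope comparison in Python. The $20-per-million-characters price and the roughly 4× multiplier come from the paths above; the annual volume is an assumed, illustrative figure.

```python
# Rough cost comparison for the two NMT paths described above.
# Pricing from the text: stock NMT at ~$20 per million characters,
# customized NMT at roughly 4x that. The volume is an illustrative assumption.

STOCK_PRICE_PER_M_CHARS = 20.0      # USD per million characters (stock NMT)
CUSTOM_MULTIPLIER = 4               # customized NMT costs roughly 4x more
annual_volume_chars = 10_000_000    # assumed yearly volume: 10M characters

stock_cost = annual_volume_chars / 1_000_000 * STOCK_PRICE_PER_M_CHARS
custom_cost = stock_cost * CUSTOM_MULTIPLIER

print(f"Stock NMT:      ${stock_cost:,.0f}/year (baseline quality tier)")
print(f"Customized NMT: ${custom_cost:,.0f}/year (next tier, plus the effort "
      f"spent on customization and evaluation)")
```

Neither figure includes the human post-editing that typically sits on top of MT output in these workflows.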
Do this well, and you can climb to around 50–60% of “perfect” translation for languages like Spanish, depending on your quality requirements. Do it badly, and you may only get 15–20%.
After that, the quality plateaus. Moving from 60% to 70% takes a lot of time. You keep post-editing the same kinds of mistakes over and over until you’ve collected enough new data to retrain. Even then, you’re unlikely to reach 90%, let alone 100%.
That’s the ceiling. You can keep spending on technology and data, but returns shrink. Improvement occurs at the rate you create new translation memories, and that rate is limited.
How enterprises work around NMT limitations
Because of this ceiling, enterprise localization doesn’t rely on a single “MT solves everything” path. Content is split into several workflows based on requirements and risk tolerance:
- Full automation
- MT with human review
- MT with post-editing
- Traditional human translation
Localization teams update glossaries, retrain MT with new post-edits, and squeeze more productivity out of translators. If things go well, next year’s rates will improve by 5–10%.
What usually doesn’t change is this: content that sits in “MT + post-editing” almost never moves to “full automation.” The tiers stay where they are. MT reshapes cost within each tier, but it rarely lets you reclassify a whole content stream from “needs a human” to “machine-only.”
That’s the NMT story: decent, repeatable gains inside a box defined by the quality ceiling.
Why NMT hits the quality ceiling
The ceiling isn’t just “we need more data.” It’s structural.
1. Improvement through training only
You may have translation data from the last 10 years, but your style guide could have changed 100 times over that period. Most organizations lack the technology or resources to clean up this outdated data. And once a new stylistic or terminology requirement is introduced, there is no data yet to learn from. It’s like moving forward while looking backwards.
2. Frequency bias in training data
Your requirements (terminology, style guides, brand rules, legal phrasing) show up with very different frequencies in translation memories. MT can only learn what it sees often enough; anything rarer gets treated as noise. It handles high-frequency patterns relatively well and regularly fails on low-frequency ones. Many localization rules simply don’t appear in the data often enough to be learned.
3. Only positive feedback
MT is trained by showing it good translations, and that alone isn’t enough: it’s very hard to teach anything using only positive examples. Every translation is “good” only relative to a particular set of quality requirements, and there is no way to communicate that distinction when training the model.
4. No real context
Classic MT doesn’t really know who is speaking, what the product category is, or who the audience is. Those are central to choosing tone, gender forms, and other context-dependent language patterns.
5. Sentence-level operation
Most MT works at the sentence level. That limits consistency across paragraphs and documents and makes some kinds of meaning harder to preserve.
Put together, these limitations create a hard ceiling. There is no way to radically improve quality with MT alone, no matter how much you tweak. You can hire a vendor to train and maintain your MT stack optimally, but after the first big optimization it becomes very hard to show another step up of the same size.
It feels like hitting a speed-of-light barrier: you can get close, and then you stop getting faster.
LLMs: quality without a built-in cap
Large Language Models (LLMs) change the picture because they don’t have this same intrinsic quality ceiling and don’t depend on new training data for every new rule or behavior.
Once you’ve reached what’s possible with your existing data assets, you can still move up a level by engineering a multi-agent solution on top of LLMs, without needing to bring in more training data.
LLMs can handle almost any language task without additional training. That lets you treat them as a programmable layer, not a fixed engine:
- You can create different AI agents for different roles in the translation workflow.
- You can instruct them in natural language, not just by retraining on more data.
- When you notice a recurring deviation from requirements, you can fix it with prompt engineering and system design instead of waiting months for more training examples.
Quality stops being a static property of a single model. It becomes a property of the system you build around it.
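To make “instructing in natural language” concrete, here is a minimal sketch of that programmable layer, using the OpenAI Python SDK’s chat-completions interface. The model name, the style rules, and the `translate` helper are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch: an LLM as a programmable translation layer.
# The style rules below are exactly the kind of low-frequency requirement
# NMT struggles to pick up from training data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STYLE_RULES = """
- Address the reader informally ("tú") for the Spanish consumer audience.
- Keep the brand name "Acme" untranslated.
- Write dates as DD/MM/YYYY.
"""

def translate(source_text: str, target_lang: str) -> str:
    """Translate one segment, enforcing requirements via instructions
    instead of retraining on new parallel data."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; any capable LLM works here
        messages=[
            {"role": "system",
             "content": f"You are a professional translator into {target_lang}. "
                        f"Follow these rules exactly:\n{STYLE_RULES}"},
            {"role": "user", "content": source_text},
        ],
        temperature=0,  # favor consistency over creativity
    )
    return response.choices[0].message.content

print(translate("Your Acme order ships on 3 May 2025.", "Spanish"))
```

The point is not the specific SDK call: when a reviewer spots a recurring deviation, the fix is an edit to STYLE_RULES or the prompt, not another retraining cycle.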
From one engine to a team of agents
With NMT, you have one model predicting translations. With LLMs, you can design a translation flow that looks more like a coordinated team.
For example, you might have:
- An agent that cleans and clarifies source content, so the rest of the pipeline isn’t fighting with bad inputs
- A translator agent producing the initial translation
- A post-editor agent focused on terminology and style
- One or more reviewer agents, each checking a specific set of style guide or compliance rules
- A proofreader agent for final polish
Each of these agents can see the relevant metadata and context: product category, market, audience, risk profile, brand voice. They don’t have to guess these things from a sentence in isolation.
This lets you:
- Enforce low-frequency rules that never showed up often enough in your translation memory
- Treat different content streams differently based on audience and risk
- Improve consistency and quality across sentences and documents, not just within one sentence at a time
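A minimal sketch of what such a pipeline could look like is below, assuming a generic `run_agent` helper that wraps a single LLM call (for example, one like the `translate` sketch above). The agent roles and metadata fields mirror the lists in this section; everything else is illustrative.

```python
# Sketch of a multi-agent translation pipeline. `run_agent` stands in for any
# LLM call that takes a role prompt, the working text, and shared job context.
from dataclasses import dataclass

@dataclass
class JobContext:
    product_category: str
    market: str
    audience: str
    risk_profile: str
    brand_voice: str

def run_agent(role_prompt: str, text: str, ctx: JobContext) -> str:
    """Placeholder for one LLM call; a real system would build a prompt from
    the role, the context, and the text. Here it is a no-op."""
    return text

PIPELINE = [
    "Clean and clarify the source content; resolve ambiguity and fix typos.",
    "Translate into the target language, respecting brand voice and audience.",
    "Post-edit: enforce approved terminology and the style guide.",
    "Review against the compliance rules for this market and risk profile.",
    "Proofread for final polish: punctuation, formatting, consistency.",
]

def translate_document(source: str, ctx: JobContext) -> str:
    text = source
    for role_prompt in PIPELINE:
        # Every agent sees the same metadata, so none of them has to guess
        # tone, register, or gender forms from an isolated sentence.
        text = run_agent(role_prompt, text, ctx)
    return text

ctx = JobContext(
    product_category="consumer electronics",
    market="es-ES",
    audience="general consumers",
    risk_profile="low",
    brand_voice="friendly, concise",
)
print(translate_document("Press and hold the button for 3 seconds.", ctx))
```

Splitting the work this way is also what makes low-frequency rules enforceable: each reviewer agent can be given one small, explicit rule set instead of hoping a single model has absorbed all of them.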
This kind of system costs more to build, run, and maintain than NMT, and it’s slower. But it opens a path to quality improvements that MT alone simply can’t reach.
In fact, with enough metadata and clear business requirements, you can fully automate any translation using this approach. The limiting factor is no longer “does the model know this pattern from training data,” but “did we design the system and instructions well enough, and is the use case worth the investment?”
Economics: when the LLM approach is worth it
LLM-based translation will not always be less expensive than human translation. The economics matter.
- For small or irregular translation volumes, building a complex multi-agent system doesn’t make sense. You’d essentially be paying for AI engineering instead of paying a good translator.
- For large, recurring content streams that need fast turnaround and high quality, the calculation changes. That’s where multi-agent LLM systems can be very effective.
LLMs let you build premium-quality solutions for virtually any translation task. The trade-off is cost and time:
- For some tasks, the total cost of ownership of such a system can exceed the return you get from automation.
- For larger, stable use cases, the ability to move them to full automation at around half the price and roughly 10× the speed of human translation can absolutely be worth the investment.
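One way to sanity-check that trade-off is a simple break-even calculation like the sketch below. Only the “around half the per-word price of human translation” ratio comes from this article; the human rate, build cost, run cost, and volume are assumed, illustrative numbers.

```python
# Illustrative break-even check for a multi-agent LLM translation system.
# All absolute figures are assumptions for the example; only the "about half
# the price of human translation" ratio comes from the text above.

human_rate_per_word = 0.12                    # assumed human rate, USD/word
llm_rate_per_word = human_rate_per_word / 2   # "around half the price"
system_build_cost = 150_000                   # assumed one-off engineering, USD
system_annual_run_cost = 40_000               # assumed yearly maintenance, USD
annual_words = 5_000_000                      # assumed recurring volume, words/year

human_cost = annual_words * human_rate_per_word
llm_cost = annual_words * llm_rate_per_word + system_annual_run_cost
annual_saving = human_cost - llm_cost

print(f"Human translation: ${human_cost:,.0f}/year")
print(f"LLM system:        ${llm_cost:,.0f}/year (plus ${system_build_cost:,} to build)")
if annual_saving > 0:
    print(f"Build cost pays back in ~{system_build_cost / annual_saving:.1f} years")
else:
    print("At this volume, the system never pays back its build cost.")
```

Run the same numbers with a small or irregular volume and the build cost never pays back, which is exactly the point made above.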
This is where LLMs go beyond MT: they make it possible to move more use cases to full automation, not just squeeze a bit more efficiency out of post-editing.
From localization niche to the whole translation market
With MT alone, quality is capped and heavily dependent on training data. Most translation still goes through human review, post-editing, or human-only workflows, and automation is mainly concentrated in the small localization software market.
With LLMs, you can build premium-quality solutions for virtually any translation task by creating AI agents for different roles, instructing them via prompts, and providing detailed metadata and business requirements.
In practical terms, this means AI translation companies can address not just the small localization software market but the translation market as a whole. Translation automation expands from a niche into the full market, which means more venture money and better products and solutions over the next 2–3 years.
How to think about translation quality in the LLM era
A simple way to frame it:
- With NMT alone, the quality you get depends on the training data you have and runs into inherent limits. At some point, you are just pushing against the ceiling.
- With LLM-based multi-agent systems, quality is bounded mainly by your design: how clearly you define requirements, how well you use context and metadata, and how much you’re willing to invest.
With LLMs, the trade-off is cost and time, but they can give you:
- An approach that is effective for large, recurring content streams needing quick delivery with high quality standards
- The ability to move those use cases to full automation at around half the price and roughly 10× the speed of human translation
- Premium-quality solutions for virtually any translation task
For enterprises that need the highest possible quality at scale, and for vendors building these systems, that is the real shift: translation quality stops being capped by the model and starts being driven by system design and business priorities.


