A translation can look “better” by generic metrics and still be worse for your business.
That happens when the output stays close to a reference, but misses what your team actually checks: terminology, tone, formatting, consistency, and tag/markup rules. That’s what creates the real work.
In our webinar, Daria Sinitsyna (Lead AI Engineer at Intento) and Jure Dernovšek (Solution Engineer Coordinator at memoQ) made a simple point: workflows built around your business and language requirements beat “pick the best engine” workflows because they turn repeated edits into rules—and apply them automatically in the tools you already use.
“We’re moving away from generic translation quality scores and toward client-specific requirements.”
It sounds subtle, but it isn’t. It changes what “quality” means in everyday production.
Requirements are not “extra.” They are quality.
Many machine translation (MT) programs still optimize for broad quality scores, but that doesn’t match real production, where “good” means meeting clear constraints:
- Terminology is correct and consistent.
- Tone stays on brand.
- Formatting and tags stay intact.
- Wording stays consistent across the whole document, not just per segment.
These aren’t edge cases. They’re the everyday reasons reviewers keep editing.
“If you want better quality with less post-editing time, focusing on requirements matters more than chasing generic quality scores.”
In The State of Translation Automation 2025, we evaluate translation setups against specific business and language requirements—like terminology, tone of voice, and formatting—instead of treating them as optional. That reflects what reviewers actually fix, and it’s a better guide to effort and rework than a single generic score.
Generic scores can reward the wrong thing
Daria shared a pattern from testing: raw MT can look best by generic metrics because it stays closer to a reference. Automated post-editing can score worse simply because it makes more changes.
But we found that raw MT produced many more major and critical errors, while post-edited output produced far fewer because it matched what reviewers check.
This is the core problem: a score can go up while your team still has to fix the same issues—and a score can stay flat while reviewer time drops sharply.
Post-editing shows what “quality” means
Post-editing exists for practical reasons: missing context, voice drift, consistency issues, terminology fixes, and formatting or markup problems.
Jure pointed out a real production risk: LLM output can sound confident and polished even when the source is unclear—so the translation may read well while still being wrong.
A more useful way to think about post-editing is this: it’s not just cleanup—it’s a feedback loop. Every recurring change is a sign the workflow is missing a clear requirement.
Recurring edits are rarely random. Capture them, turn them into clear rules, and make the workflow apply those rules automatically. Better outcomes come from that discipline, not from treating one overall score as “quality.”
Practically, this means keeping reviewer feedback connected to the translation workflow and reusable assets like translation memory, so the same fixes don’t repeat project after project.
Teams can also quantify what’s happening by extracting post-editing data and comparing changes between raw MT, translator edits, and reviewer edits (manually, with templates, or via API).
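For teams that want a concrete starting point, that comparison can be as simple as an edit-distance pass over aligned segments. The sketch below uses only Python’s standard library; the segment data and field names are invented, and a real setup would pull the same alignments from a TMS export or an API.

```python
# Minimal sketch: quantify how much each stage changes the text.
# The field names and example segments are hypothetical; real data would
# come from your TMS export or API.
from difflib import SequenceMatcher

def change_rate(before: str, after: str) -> float:
    """Share of characters changed between two versions (0 = identical)."""
    return 1.0 - SequenceMatcher(None, before, after).ratio()

segments = [
    {
        "raw_mt": "Haga clic en el botón Guardar para guardar los cambios.",
        "translator": "Haga clic en Guardar para guardar los cambios.",
        "reviewer": "Haga clic en Guardar para guardar los cambios.",
    },
    # ...more aligned segments...
]

for i, seg in enumerate(segments, 1):
    mt_to_translator = change_rate(seg["raw_mt"], seg["translator"])
    translator_to_reviewer = change_rate(seg["translator"], seg["reviewer"])
    print(f"segment {i}: translator changed {mt_to_translator:.0%}, "
          f"reviewer changed {translator_to_reviewer:.0%}")
```

Aggregated over a project, those per-stage change rates show where effort actually goes and which requirements are still being fixed by hand.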
That turns “reviewer preferences” into something operational: clear, checkable requirements—strong inputs for a multi-agent setup that can check compliance and apply fixes automatically.
Terminology control that holds up in production
Terminology is one of the highest-impact requirements because it’s repeatable, easy to validate, and expensive to fix manually at scale. Jure described it as the backbone of consistency.
“Terminology is now the key pillar when it comes to consistency.”
He also made a practical point many teams miss: you don’t need huge glossaries to get consistent terms. “We can use really small glossaries to steer the translation and make the terminology consistent.”
When terminology is treated as a requirement, the workflow enforces it: the output stays fluent while term choices stay consistent across segments, including inflections and plurals.
Set the preferred term once, and the workflow keeps it consistent everywhere, cutting repeat term fixes in review and making output more predictable at scale.
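As a rough illustration of what a terminology requirement can look like when it is checkable rather than aspirational, here is a minimal glossary check. The glossary entries are invented, and the plain substring match ignores morphology; a production check would handle inflections through lemmatization or the term base’s own matching.

```python
# Minimal sketch of a glossary compliance check. The glossary entries are
# invented examples; matching is deliberately simple (word/substring match).
import re

GLOSSARY = {
    # source term -> (approved target term, forbidden target variants)
    "booking": ("reserva", ["reservación"]),
    "dashboard": ("panel de control", ["tablero"]),
}

def check_terms(source: str, target: str) -> list[str]:
    """Return human-readable terminology issues for one translated segment."""
    issues = []
    for src_term, (approved, forbidden) in GLOSSARY.items():
        if re.search(rf"\b{re.escape(src_term)}\b", source, re.IGNORECASE):
            if not re.search(re.escape(approved), target, re.IGNORECASE):
                issues.append(f"expected '{approved}' for '{src_term}'")
            for variant in forbidden:
                if re.search(re.escape(variant), target, re.IGNORECASE):
                    issues.append(f"forbidden variant '{variant}' used")
    return issues

print(check_terms("Open the booking dashboard.",
                  "Abra el tablero de reservas."))
# -> ["expected 'panel de control' for 'dashboard'", "forbidden variant 'tablero' used"]
```

Even a glossary this small, applied at translation time and as a gate before review, removes a whole class of repeat fixes.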
Turning post-editing patterns into clear requirements
The idea is simple: capture what reviewers keep fixing, turn it into requirements, then apply them automatically.
Capture requirements where they show up: Start with what you already have (brand, style, terminology), then add what post-edits reveal. Requirements are often both explicit and implicit.
Turn recurring edits into rules: Replace “quality is low” with clear rules: approved terms, forbidden variants, tone constraints, formatting expectations, and markup rules (see the sketch after this list).
Keep requirements connected to the workflow: Store and reuse linguistic assets across projects (translation memories, term bases, configuration) so the same issues don’t come back.
Apply requirements during translation, not after review: Steer output upfront with term bases and requirement-driven workflows, then use post-editing agents where needed to bring output into compliance automatically.
Measure what matters: Track reviewer time, major/critical errors, and rework. Treat generic scores as secondary signals, not the goal.
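To make this concrete, here is one way those requirements might be written down as data plus checks rather than prose. The rule names, values, and helper functions are illustrative only; the example exercises just the markup rule, which corresponds to the “tags stay intact” requirement above.

```python
# Illustrative sketch: requirements captured as data, with one automated check.
# Rule names and values are invented; only the markup rule is exercised below.
import re

REQUIREMENTS = {
    "terminology": {"booking": "reserva"},    # approved term pairs
    "forbidden": ["reservación", "tablero"],  # variants reviewers keep removing
    "tone": "formal",                         # e.g. usted, never tú
    "markup": "placeholders_must_match",      # tags and placeholders stay intact
}

def placeholders(text: str) -> list[str]:
    """Collect tags and placeholders such as <b>, </b>, {0}, %s."""
    return re.findall(r"</?\w+>|\{\d+\}|%\w", text)

def check_markup(source: str, target: str) -> list[str]:
    """Flag segments where tags or placeholders were dropped."""
    target_found = placeholders(target)
    return [f"missing placeholder {p}"
            for p in placeholders(source) if p not in target_found]

print(check_markup("Click <b>Save</b> to keep {0} changes.",
                   "Haga clic en Guardar para conservar los cambios."))
# -> ['missing placeholder <b>', 'missing placeholder </b>', 'missing placeholder {0}']
```

Once requirements live as data, the same definitions can steer translation up front and drive automated checks and fixes afterward.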
Amadeus: generic scores stayed flat, reviewer time dropped
Amadeus kept the infrastructure their teams already used and layered requirements-based automation on top.
Daria described the approach: collect requirements (including terminology), extract more requirements from post-editing patterns, then roll out a post-editing agent, language by language—EN→ES first, then EN→FR and EN→IT, with more language pairs added over time.
The result was operational, not cosmetic:
“Even when the generic scores were basically the same, reviewers spent about 60%–70% less time on post-editing.”
That kind of impact comes from a structured setup, not luck: “We do not just throw an LLM and hope for the best.”
The bottom line
Translation automation gets easier when “quality” stops being a single score and becomes a set of requirements: terminology, tone, formatting, and consistency.
Consistency is a good example. When translation happens in isolated segments, models can’t see earlier choices, so wording drifts across the document. One practical fix is batching: translate with document context, then split back into segments for the usual workflow.
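Here is a rough sketch of that round trip, assuming segment markers the engine is asked to preserve. The marker format is an assumption, not a memoQ or Intento convention, and the actual MT or LLM call is omitted.

```python
# Minimal sketch of batching: join segments into one document so the engine
# sees full context, then split the translated result back into segments.
# The marker format is an assumption; the translation call itself is omitted.
import re

MARKER = "<<<SEG {i}>>>"

def batch(segments: list[str]) -> str:
    """Join segments into one marked-up document."""
    return "\n".join(f"{MARKER.format(i=i)}\n{seg}" for i, seg in enumerate(segments))

def split_back(translated_doc: str, expected: int) -> list[str]:
    """Recover per-segment translations, failing loudly if markers were lost."""
    parts = [p.strip() for p in re.split(r"<<<SEG \d+>>>", translated_doc) if p.strip()]
    if len(parts) != expected:
        raise ValueError("engine dropped or merged a segment marker")
    return parts

segments = ["The dashboard shows all bookings.", "Open it to review them."]
doc = batch(segments)
# ...send `doc` to the MT engine or LLM with instructions to keep the markers...
print(split_back(doc, expected=len(segments)))  # round trip on the untranslated text
```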
The rest is execution: capture what reviewers keep fixing, turn it into rules, automate the checks and fixes, and measure impact where it shows up—reviewer time and major errors, not a score that hides the real work.


