As a company automating localization for more than 50 Fortune 500 clients, we closely monitor and evaluate every significant update in translation technology. Google’s translation AI is a crucial component of our solutions, so we need to understand exactly how it performs.
We publish our comprehensive State of Machine Translation report annually, with our latest edition released in June 2024. With the major release of new Google Translation AI models, we wanted to check what has changed since then. Google has kindly included us in the Early Access program, so we were able to evaluate:
- Recent changes in Google’s Neural Machine Translation
- The capabilities of the updated Gemini models
- How Translation LLM performs and where it fits in the current AI translation landscape
The Evolution of Neural Machine Translation
A major turning point came with Google’s 2017 paper “Attention Is All You Need,” which introduced the Transformer architecture and changed how we approach machine translation. Yet despite all the progress, here’s an interesting fact: about 95% of enterprise translation budgets still go to human translators. Why? Because global companies need translations that meet very specific localization requirements, which off-the-shelf machine translation models aren’t aware of.
In 2019, Google took a significant step forward with AutoML for Translation, which lets companies customize Google’s pre-trained models with their own translation data. However, some language patterns and terminology either appear too rarely in training data to be learned or depend on context, so certain tasks still require human effort, even with customized NMT. Companies therefore hire human translators to fix NMT drafts so that they meet the missing requirements, which often means hours of janitorial work, such as changing the tone of voice and gender forms in thousands of otherwise good translations.
Large language models (LLMs) represent the next big step in translation technology. They can follow instructions much as human post-editors do, and they’re designed to be both fast and efficient.
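To make instruction-following concrete, here is a minimal sketch of how post-editing rules like those mentioned above (tone of voice, gender forms) might be packaged into a prompt for an LLM. The function name `build_post_edit_prompt` and the template wording are hypothetical; this is not a Google or Intento format, and no model is called.

```python
def build_post_edit_prompt(source: str, draft: str, instructions: list[str]) -> str:
    """Assemble a post-editing prompt from a source segment, an MT draft, and rules.

    Hypothetical template for illustration only; not a vendor-defined format.
    """
    # Number the rules so the model (and a human reviewer) can refer to each one.
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(instructions, start=1))
    return (
        "You are a professional post-editor.\n"
        "Revise the machine translation so that it follows every rule below.\n"
        f"Rules:\n{rules}\n\n"
        f"Source: {source}\n"
        f"Machine translation: {draft}\n"
        "Return only the revised translation."
    )


if __name__ == "__main__":
    prompt = build_post_edit_prompt(
        source="Thanks for your order!",
        draft="Vielen Dank für Ihre Bestellung!",
        instructions=[
            "Address the reader informally (du, not Sie).",
            "Use gender-inclusive forms where a person is referred to.",
        ],
    )
    print(prompt)
```

Keeping the rules as a list makes it easy to apply the same client-specific requirements to thousands of segments, which is the repetitive janitorial work described above.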
At Intento, we combine all these technologies in our Enterprise Language Hub. We use them to build high-quality automatic translation workflows tuned to meet every enterprise localization requirement, so that our clients can offer customer and coworker experiences in any language that are as good and authentic as they are in English.
Evaluation of Google’s Translation Models
The data for this analysis was prepared by Intento and e2f for the annual State of Machine Translation report and was never published online or shared with Google or any other AI vendor. For this small-scale study, we compared Google’s technologies with another LLM and with three other top neural machine translation models.
Translation Quality
According to COMET scores, the translation quality of Google NMT and Translation LLM improved significantly for all languages except Arabic and German. Gemini’s translation quality dropped slightly in certain domains because of safety filters, which highlights the need for dedicated models in sensitive areas such as medical and legal translation.
For both Google NMT and Translation LLM, the median COMET score improvement is around 1%. By Intento LQA, the median improvement is 1.3%, with per-language improvements of up to 3%.
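The report does not say whether these improvements are relative or absolute. As a minimal sketch, assuming a relative change per language (earlier scores vs. current scores) followed by the median across languages, the headline figure could be computed as below. The function name and data layout are illustrative, and no study data is included.

```python
from statistics import median


def median_relative_improvement(before: dict[str, float], after: dict[str, float]) -> float:
    """Median, across languages, of the relative score change in percent.

    `before` and `after` map a language code to a system-level score (e.g. COMET)
    for the earlier and the newer model version. Only languages present in both
    are compared; a regression shows up as a negative change.
    """
    common = sorted(before.keys() & after.keys())
    if not common:
        raise ValueError("no languages in common")
    gains = [100.0 * (after[lang] - before[lang]) / before[lang] for lang in common]
    return median(gains)
```

A median is robust to outliers, so one language with a large gain (or a drop, as for Arabic and German) does not move the headline number the way it would move a mean.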
Intento LQA shows that across all domain ✕ language pair combinations, Gemini models produce the most segments with no or only minor issues.
According to the MQM typology, minor issues are described as having a ‘slight impact on meaning.’ This broad definition leads to a large proportion of segments being classified as having minor issues.
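To illustrate why a broad definition of “minor” inflates this bucket, here is a minimal sketch of counting segments with “no or minor issues” from MQM-style annotations. It assumes each segment carries a list of severity labels (`minor`, `major`, `critical`); the data layout and function name are hypothetical, not the Intento LQA format.

```python
# Rank MQM-style severities so the worst issue in a segment can be found.
SEVERITY_RANK = {"minor": 1, "major": 2, "critical": 3}


def share_no_or_minor(segments: list[list[str]]) -> float:
    """Fraction of segments whose worst annotated issue is at most 'minor'.

    Each segment is a list of severity labels; an empty list means no issues.
    """
    if not segments:
        raise ValueError("no segments")
    ok = sum(
        1
        for issues in segments
        if max((SEVERITY_RANK[s] for s in issues), default=0) <= SEVERITY_RANK["minor"]
    )
    return ok / len(segments)
```

A segment with three minor issues lands in the same bucket as a flawless one, so this share says more about the absence of serious errors than about polish.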
Out of 90 language pair ✕ domain combinations, at least one Google model was chosen as the best in 85.
Google NMT is among the best-performing real-time NMT models for 97% of the language-domain combinations we’ve tested (87 out of 90), which is 15% more than the closest competitor.
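The study does not spell out the criterion behind “among the best-performing.” As a minimal sketch, assuming each language pair ✕ domain combination has one quality score per provider, and that a provider counts as “among the best” when it is within a chosen tolerance of the top score, the tallies could be computed as follows. The tolerance, data layout, and function name are assumptions.

```python
def combos_among_best(
    scores: dict[tuple[str, str], dict[str, float]],
    provider: str,
    tolerance: float = 0.0,
) -> int:
    """Count combinations where `provider` scores within `tolerance` of the best.

    `scores` maps (language_pair, domain) to {provider_name: score}.
    With tolerance=0.0 this counts outright (or tied) wins.
    """
    count = 0
    for by_provider in scores.values():
        if provider not in by_provider:
            continue  # provider not evaluated on this combination
        if by_provider[provider] >= max(by_provider.values()) - tolerance:
            count += 1
    return count


# The headline share is then a simple ratio: 87 of 90 combinations,
# which the report rounds to 97%.
print(f"{87 / 90:.1%}")  # prints 96.7%
```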
Human Assessment
In human post-editing assessments for German, Spanish, and Chinese, Google models consistently required the least human effort, demonstrating their efficiency and accuracy.
In our evaluation, Google NMT and Gemini 1.5 Pro emerged as top performers, delivering the highest number of perfect segments among all assessed providers for English-German translations. Meanwhile, Gemini 1.5 Pro was consistently chosen as the best provider for English-Spanish translations, not only achieving the highest ratings but also requiring the fewest corrections. Gemini 1.5 Flash excelled in English-Chinese translations, boasting the most perfect segments and being recognized for its superior fluency across the board.
Translation Errors
Google models exhibited the lowest number of critical translation errors, underscoring their reliability in producing accurate and fluent translations.
Translation Performance
In real-time translation scenarios, Google NMT was the fastest per-segment translator, closely followed by Translation LLM and Gemini 1.5 Flash.
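Per-segment speed comparisons depend on how latency is measured. A minimal sketch of one way to do it: send segments one at a time, as a live chat would, and take the median wall-clock time per call. The `translate` callable stands in for any provider’s client and is an assumption; network conditions, region, and batching all affect real numbers.

```python
import time
from statistics import median
from typing import Callable


def median_latency_ms(translate: Callable[[str], str], segments: list[str]) -> float:
    """Translate segments one at a time and return the median latency in milliseconds."""
    if not segments:
        raise ValueError("no segments")
    timings = []
    for segment in segments:
        start = time.perf_counter()
        translate(segment)  # result is ignored; only the elapsed time matters here
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)
```

The median is used instead of the mean so that an occasional slow request (a cold start or a network hiccup) does not dominate the comparison.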
Cost Comparison
Contrary to popular belief, LLMs* are more cost-effective for translation than traditional models, although they require more processing time.
* LLM prices were converted to per-character prices assuming an average of 2.83 characters per token.
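The conversion in the footnote is simple arithmetic: at 2.83 characters per token, one million characters is about 353,357 tokens, so a price per million tokens divided by 2.83 gives the price per million characters. A minimal sketch (the function name is illustrative):

```python
CHARS_PER_TOKEN = 2.83  # average ratio stated in the footnote


def price_per_million_chars(
    price_per_million_tokens: float, chars_per_token: float = CHARS_PER_TOKEN
) -> float:
    """Convert a per-token LLM price into a price per million characters."""
    tokens_per_million_chars = 1_000_000 / chars_per_token  # about 353,357 tokens
    return price_per_million_tokens * tokens_per_million_chars / 1_000_000
```

LLM translation is typically billed for both input tokens (the source text plus any prompt) and output tokens (the translation), often at different rates. The footnote does not say how these were combined, so a single converted figure is best read as a rough comparison with per-character NMT pricing.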
Conclusions
Over the past six months, Google’s Translation AI models have made significant strides in improving translation quality. Notably, Google NMT, a longstanding industry staple, has shown remarkable improvement. While LLMs occupy the top spots for translation quality, they are best suited for non-real-time tasks where quality is paramount.
The cost-effectiveness of Gemini Pro, together with its ability to follow fine-grained translation instructions, makes it a powerful tool for multilingual tasks.
A Recipe for Automatic Translation in 2024
Looking ahead, we see the approach to automatic translation as follows:
- For organizations without prior data, reliable stock Neural Machine Translation (NMT) models, such as Google NMT, provide a solid foundation.
- When feedback starts coming in and adjustments are needed, two options emerge:
a. Translation LLM with quick adaptation
b. Google AutoML, to fine-tune the baseline Google NMT model to your business needs
- Both approaches support real-time translation (live chats, on-the-fly website translation) and can be enhanced with glossaries.
- For translations with high and specific requirements, the gap between system output and the desired quality can be narrowed by adding extra translation AI steps (at Intento, we call them AI agents) built on powerful LLMs such as Gemini Pro.
- To save time, money, and energy when processing texts with powerful LLMs, consider adding lightweight quality evaluation steps based on:
- Translation Quality Prediction scores such as MetricX
- Customizable Translation Quality Evaluation scores based on lightweight LLMs like Gemini Flash (particularly useful for checking specific requirements)
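The quality-gating idea in the last two bullets can be sketched as a routing function: translate with a fast model, score the draft with a lightweight quality estimator, and send only low-scoring segments to the more expensive LLM step. All three callables and the threshold are hypothetical placeholders for whatever NMT, quality-estimation, and LLM services a workflow uses. Some quality scores are error-style, where lower is better (MetricX is one example), which is why the direction is a parameter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedTranslation:
    translation: str
    draft_score: float  # quality score of the first-pass draft
    refined: bool       # True if the expensive LLM step was used


def translate_with_qe_gate(
    source: str,
    translate: Callable[[str], str],                # fast first pass, e.g. NMT (placeholder)
    estimate_quality: Callable[[str, str], float],  # lightweight scorer (placeholder)
    refine: Callable[[str, str], str],              # LLM "agent" step (placeholder)
    threshold: float,
    higher_is_better: bool = True,
) -> RoutedTranslation:
    """Run the expensive refinement step only when the draft's score falls short."""
    draft = translate(source)
    score = estimate_quality(source, draft)
    good_enough = score >= threshold if higher_is_better else score <= threshold
    if good_enough:
        return RoutedTranslation(draft, score, refined=False)
    return RoutedTranslation(refine(source, draft), score, refined=True)
```

The threshold acts as a cost-versus-quality dial: a stricter threshold sends more segments through the LLM step.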
To implement these workflows effectively, enterprises need a localization-specific framework such as Intento’s Enterprise Language Hub, one that integrates seamlessly with existing software systems and delivers consistent, authentic experiences to global audiences.