Read about

Blog/GenAI

Generative AI for Translation in 2024

February 9, 2024

In 2023, as new Generative AI models emerged, we rigorously tested their translation capabilities and shared our findings in three blog posts (gpt-3, gpt-3.5, gpt-4) and our annual State of the Machine Translation report.

As the year drew to a close, the GenAI field saw a flurry of exciting updates, including new models from Anthropic, Google, and OpenAI. Rather than waiting for our next annual report, we conducted a dedicated study immediately. Here are the results!

Experimental setting

GenAI models

We’ve chosen 9 out of all the available large language models (LLMs) for this study:

claude-2.1 (Anthropic)
gpt-4–1106-preview (OpenAI)
gpt-4 (upd June 2023) (OpenAI)
gpt-3.5-turbo (updated June 2023) (OpenAI)
llama-2–70b-chat-hf (Meta AI)
text-bison@002 (PaLM 2 — Google)
chat-bison@002 (PaLM 2 — Google)
text-unicorn@001 (PaLM 2 — Google)
gemini-pro (Google)

We have also added eight specialized Machine Translation models for comparison:

Alibaba Cloud E-Commerce Edition
Amazon Translate
Baidu Translate API
DeepL API
Google Cloud Advanced Translation API
Microsoft Translator API v3.0
ModernMT Enterprise Edition
SYSTRAN PNMT API

As you may see, out of the whole zoo of open-source LLMs we chose just one for this experiment (LLaMA 2). It outperforms all other open-source models in multilingual capabilities based on multiple internal evaluations. However, it still falls short compared to commercial models in this aspect.

Datasets

Just like our past research, we utilized a portion of our Machine Translation dataset, developed in collaboration with e2f. We focused on English-to-Spanish and English-to-German translations, conducting two sub-studies: General domain translation (both EN>ES and EN>DE), and Domain-specific translation (Legal and Healthcare for the EN>DE pair).

Prompts

When engineering prompts, we adhered to the top guidelines for using LLMs as MT. We also leaned on our past research experience with GPT models.

Since LLMs are mainly built for conversation, they add extra explanations to translations. We had to tackle this in the prompts.

For the general domain, here’s our system message:

You are a professional translator. Translate the text provided by the user from {source_lang} to {target_lang}. The output should only contain the translated text.

For specific domains (Legal and Healthcare):

You are a professional medical translator. Translate the text provided by the user from {source_lang} to {target_lang}. The output should only contain the translated text.

You are a professional legal translator. Translate the text provided by the user from {source_lang} to {target_lang}. The output should only contain the translated text.

It worked well for OpenAI models, but others required additional prompt and software engineering. For example, LLaMA-2 added the following (\n\n(Note:):

Las vistas magníficas se pueden disfrutar desde los senderos y el Rim Rock Drive, que se wind along the plateau. (Note: “”se wind”” is the correct translation of “”winds”” in this context, as it refers to the drive winding along the plateau, rather than the wind blowing.)

Still, the models provided a significant amount of explanations:

General domain: EN>DE — 70 out of total 473 segments; EN>ES — 70 out of total 471 segments;
Legal domain EN>DE — 132 out of 484 segments;
Healthcare domain EN>DE — 112 out of 481 segments.

Since we can configure multi-step translation workflows in Intento Language Hub, we dealt with those additions via post-processing.

In the case of text-bison and text-unicorn, we chose the following prompt as it showed the least issues in translation:

You are a professional translator. Translate the following from {source_lang} to {target_lang}.\\n{source_lang}: {source_segment}\\n{target_lang}:

For chat-bison and gemini-pro we chose the following context:

You are a professional translator. Translate the following from {source_lang} to {target_lang}. Return nothing but the {target_lang} translation.

Evaluation results

We conducted all translations from December 1st to 21st, 2023. Some systems we tested were in limited availability, which may have impacted their performance and accuracy.

Performance

One of the biggest hurdles in using GenAI for translation is its slow speed, so let’s start there. In the image below, we’re using a logarithmic scale for clarity.

Time to translate 480 segments in seconds.

As we can see, LLMs are significantly slower (100–1000 times) than specialized models.

It’s important to note that the translation speed of gemini-pro and text-unicorn@001 was greatly impacted by limited data processing quotas at the time.

Translation quality — COMET

Just like in our past tests with GPT-3, GPT-3.5, and GPT-4, we’ve used a COMET semantic similarity score to gauge how closely the machine translations match the original human reference.

General Domain translation, English to Spanish

Translating English to Spanish for general content is considered one of the “easiest” tasks due to the wealth of training data available. Many current machine translation models perform so well that their output is virtually indistinguishable by semantic similarity scores. So, it’s interesting to see models that lag significantly behind the cutting edge.

In the image below, the vertical axis shows the average COMET score for a specific model across the test data, with black ticks marking confidence intervals.

Translation quality for English to Spanish direction, general domain

As you can see, 9 out of the 17 models we tested fall into the same top-tier category. LLaMa2 is the weakest of the LLMs. Both Claude-2.1 and the updated GPT-4 perform notably lower than the best group, but not by a large margin. Interestingly, the updated GPT-4 model underperforms compared to its Turbo version and even GPT-3.5.

You might also notice a broad confidence interval for Gemini Pro. This model often returned empty results for our queries, which likely means safety filters were triggered, even though we set them to their lowest setting, “Block high only”. Aside from this, it’s on par with Claude and the updated GPT-4.

General Domain translation, English to German

Translating English to German is tougher, so it’s useful for pinpointing more sophisticated models.