Translating with GPT-4: the Latest & the Greatest

Konstantin Savenkov

CEO and Сo-founder of Intento

Only a week after completing our evaluation of ChatGPT (gpt-3.5-turbo) Machine Translation capabilities, OpenAI kindly granted us beta API access to GPT-4. We promptly integrated it into our MT evaluation framework and conducted our initial tests.

What’s new?

Well, many things.

  1. Enhanced factual accuracy, although not crucial for translation, as the goal is to accurately translate the source sentence, even if it isn’t entirely factual.
  2. Visual input compatibility, potentially enabling in-context translation, such as product descriptions using photos or UI elements using screenshots.
  3. Larger prompt capacity, allowing up to 8K tokens (~6K words) in beta and 32K tokens (~24K words) in the full version. This could enable the full-text translation and submission of extensive information or metadata to guide translations, like glossaries, fuzzy TM matches, or comprehensive human-readable translation style guides.
  4. Improved multilingual capabilities, which we’ll be exploring further in this article.

Let’s dive in!

Evaluation approach

The general approach is the same one we have used to evaluate GPT-3.

In our ChatGPT evaluation, we used API and prompts supplemented with system messages. GPT-3.5 documentation suggests future models will pay greater attention to these engineering prompts, which might be true for GPT-4.

We tested both text generation (GPT-3-style) and chat completion (ChatGPT-style) prompts. As shown below, there aren’t many differences, making text generation a preferable option due to its shorter prompts.

General domain translation

Surprise! GPT-4 achieves the highest average COMET score for English-to-Spanish translations. However, considering the confidence intervals (black ticks on the chart), this lead isn’t statistically significant, placing it in the same range as Google, GPT-3.5, Amazon, DeepL, Yandex, and Microsoft.

Determining the true leader would require human judgment, and we’ll share more details once we’ve had the opportunity to conduct such an evaluation.

COMET scores for GPT-4 English to Spanish translation, general domain

For General Domain texts in English to German, DeepL is still the best by far — however, GPT-4 has the highest scores in a group of the top-runners.

COMET scores for GPT-4 English to German translation, general domain

In-domain translation

For the German in Legal and Healthcare, GPT-4 ends up being the best out of the GPT engines, however, still performs slightly worse than the first-tier engines:

COMET scores for GPT-4 English to German translation, healthcare domain


COMET scores for GPT-4 English to German translation, legal domain



The 8K-token model available to us costs $0.03 per 1K tokens for the prompt and $0.06 per 1K tokens for completion. This equates to approximately $18 ($6 for the prompt and $12 for the completion) to translate 1M characters, making the cost similar to or slightly cheaper than Google Translate or DeepL.

However, it’s about 10 times more expensive than GPT-3.5-turbo, which costs $1.5 per million characters for machine translation.

Technical performance

Similar to what we’ve discovered for GPT-3, GPT-4 experiences multiple API errors during translation. It also takes nearly twice as long as GPT-3 and GPT-3.5-turbo, averaging 46 minutes to an hour for a dataset of 500 segments. This results in a 5–7 second processing time per segment, almost 15–20 times longer than typical commercial engines.

The output may contain technical noise (like quotation marks), which is easily removable. In 0.5–1% of cases, the model returns an empty completion without a message, making retries essential.


GPT-4 boasts significant improvements in machine translation, including better factual accuracy, visual input compatibility, larger prompt capacity, and enhanced multilingual capabilities. Despite high average COMET scores in general domains, it falls short of specialized MT systems for in-domain translation. Its cost is similar to popular services like Google Translate or DeepL but 10x pricier than GPT-3.5-turbo.

While GPT-4’s technical performance lags behind commercial engines and occasionally encounters API errors, it shows great promise in machine translation. The rapid advancement in this well-known field highlights the ever-evolving technology landscape. Expect more sophisticated and efficient solutions from OpenAI and other industry leaders in the future.

Read more

Continue reading the article after registration
Already a member? Sign In

We know how to make your business multilingual and productive. Let's talk.