10 Ways to Optimize Text for Machine Translation

Guidelines for creating machine-translation-friendly source text
Vlada Klimova

Head of Customer Support at Intento

In 2020, machine translation is unlikely to embarrass you. It’s widely used by companies, students, book publishers, translation providers, and your foreign friends who want to read your post on social media. And then they put a heart next to it: they got it! Machine translation has come a long way indeed.

We’ve been evaluating the performance of machine translation engines for years, and believe us, we’ve seen a lot. This article summarizes our experience (sometimes funny, sometimes curious or painful) with one mission: to help you get the most out of machine translation and, well, avoid embarrassment.

In this article, we focus on stock (i.e., publicly available) machine translation systems, because custom models can be trained on your data samples to handle the specifics of your text style, while generic models have to deal with all kinds of text.

Sounds good? Let’s dive in.

1. Use a formal writing style.

It might be a good idea to remove or substitute the following:

  • Slang words [e.g., wooot, buddy or dude]
  • Loanwords and neologisms [e.g., Grand Prix, e-bike]
  • Idioms and specialized local terms [e.g., break the ice = “get the conversation going”]
  • Ambiguous words and words with a different meaning in source language dialects, for example:

a) words ending in -ed or -ing

b) the word “table”, which could mean a piece of furniture or a list, depending on the context

c) the word “glass”, which could mean a material or tableware, etc.

  • Phrases based on local humor, customs, sayings, and biases
  • Ad-hoc abbreviations [e.g., French has a lot of abbreviations used in everyday communication: BJR = bonjour, BZ = bisous, bises, etc.]

Instead, use phrases based on common knowledge [e.g., “Earth is a planet”].

2. Use a simplified sentence structure.

  • Make sentences consistent and self-sufficient.
  • Don’t indulge in complex sentences with subordinate clauses.
  • If you can, avoid passive voice.
  • If needed, split a complex sentence.
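
As a rough aid for spotting candidates to split, the sketch below flags sentences above a word limit. The limit of 25 words and the function name `long_sentences` are assumptions made for illustration, and the splitter is deliberately naive (it will also break on abbreviations such as “e.g.”).

```python
import re

def long_sentences(text: str, max_words: int = 25) -> list[str]:
    """Return sentences longer than max_words (an arbitrary, adjustable limit)."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]
```

Flagged sentences still need a human decision: some long sentences are fine, and some short ones are ambiguous.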

3. Unify terms.

For example, instead of using both “client” and “customer” to describe the same thing, pick one and stick with it.
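
One possible way to enforce this before sending text to an MT engine is a small find-and-replace pass. This is a minimal sketch: the mapping `PREFERRED_TERMS` and the function `unify_terms` are hypothetical names, and the mapping keys are assumed to be lowercase.

```python
import re

# Hypothetical mapping: each lowercase variant maps to the preferred term.
PREFERRED_TERMS = {"client": "customer", "clients": "customers"}

def unify_terms(text: str, mapping: dict[str, str] = PREFERRED_TERMS) -> str:
    """Replace whole-word variants with the preferred term, keeping a leading capital."""
    if not mapping:
        return text
    # Longest variants first, so "clients" is tried before "client".
    variants = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )

    def replace(match: re.Match) -> str:
        found = match.group(0)
        preferred = mapping[found.lower()]
        # Preserve capitalization at the start of a sentence.
        return preferred.capitalize() if found[0].isupper() else preferred

    return pattern.sub(replace, text)
```

A blind replacement can be wrong when the two words are not true synonyms in your domain, so it is worth reviewing the changes.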

4. Check spelling, punctuation, and typos.

Mistyped words can be mistranslated: “void gaps” instead of “avoid gaps” completely changes the meaning of the sentence. Our office once had to take a break when we saw the machine-translated result of the mistyped word “assked.”

5. Unify formats for:

  • Prices and currencies [e.g., $1.000]
  • Units of measurement [e.g., kg]
  • Numerals [try to use digits instead of number words, e.g., use “1” instead of “one”]
  • Dates and times [e.g., 2020-08-12, 14:45]
  • All other specified data and terms that could be unified
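
Format unification can often be scripted. The sketch below rewrites dates as YYYY-MM-DD, under the assumption that the source writes them day-first as DD.MM.YYYY or DD/MM/YYYY. The function name `normalize_dates` is made up for this example, and month-first sources would need a different pattern.

```python
import re
from datetime import date

# Assumption: dates in the source are day-first, written DD.MM.YYYY or DD/MM/YYYY.
DAY_FIRST_DATE = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")

def normalize_dates(text: str) -> str:
    """Rewrite day-first dates as YYYY-MM-DD; leave impossible dates untouched."""
    def to_iso(match: re.Match) -> str:
        day, month, year = (int(part) for part in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            # Not a real calendar date (e.g., 31.02.2020): keep the original text.
            return match.group(0)

    return DAY_FIRST_DATE.sub(to_iso, text)

print(normalize_dates("The offer ends on 12.08.2020."))
# prints: The offer ends on 2020-08-12.
```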

6. Use lowercase as much as possible.

  • Avoid unnecessary capitalization [e.g., use “counterparty” instead of “Counterparty”]
  • Avoid all caps [e.g., the word “HERO” may be left untranslated]
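
To find such words before translation, a simple check can list them for review. This is only a sketch: `KEEP_UPPERCASE` is a hypothetical allow-list for acronyms you do want in capitals, and the pattern covers ASCII letters only.

```python
import re

# Hypothetical allow-list of acronyms that should stay uppercase.
KEEP_UPPERCASE = frozenset({"HTML", "MT", "URL"})

def find_all_caps(text: str, keep: frozenset = KEEP_UPPERCASE) -> list[str]:
    """Return all-caps words (two or more ASCII letters) that are not on the allow-list."""
    return [word for word in re.findall(r"\b[A-Z]{2,}\b", text) if word not in keep]
```

The function only reports candidates. Lowercasing them automatically would also flatten legitimate acronyms that are missing from the allow-list.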

7. Mind e-mails, file paths, and URLs.

For example, the e-mail address “” can be machine-translated as flower @ yard, which is probably not the intended outcome.
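
One common workaround is to swap such strings for placeholders before translation and restore them afterwards. The sketch below covers e-mail addresses and http(s) URLs only. The placeholder format `__PH0__` and the names `mask` and `unmask` are assumptions for illustration, and you would need to verify that your MT engine leaves the placeholders untouched.

```python
import re

PROTECTED = re.compile(
    # URLs: stop before trailing punctuation that ends the sentence.
    r"https?://\S+?(?=[.,;:!?)]*(?:\s|$))"
    # E-mail addresses (simplified pattern, not a full RFC 5322 matcher).
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
)

def mask(text: str) -> tuple[str, list[str]]:
    """Replace e-mails and URLs with numbered placeholders; return text and saved values."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return PROTECTED.sub(stash, text), saved

def unmask(text: str, saved: list[str]) -> str:
    """Put the saved values back in place of their placeholders."""
    return re.sub(r"__PH(\d+)__", lambda m: saved[int(m.group(1))], text)
```

The intended flow is `masked, saved = mask(source)`, then translate `masked`, then call `unmask` on the result. File paths are not handled here; the pattern would need to be extended for them.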

8. Use glossaries for specialized terms.

  • Add sites [physical locations] and addresses [e.g., “Language Street” could be translated into the target language literally, as the target-language words for “language” + “street”]
  • Add product and service names [e.g., a translated product name could differ from the one in your company’s product name guide]
  • Add names and acronyms to a glossary [e.g., the acronym “WORLD” could be translated as the word “world”]

9. Use a unified approach for toponym translation.

  • Choose whether to translate toponyms like La Grand-Place or keep their original names.
  • Follow grammatical rules when keeping foreign words in their original language in translated texts. For example, if you need to use some original French words in a translated English text, follow English grammar rules.

10. And finally, to get better MT results when sending translation requests, make sure to specify:

  • Source text language. If the source text language is not specified, language detection kicks in. Not only does language detection take time, but it can also produce wrong (or at least unexpected) results in some cases. For example, Kungens Kurva is the name of a street in Stockholm (it means King’s Curve in Swedish, by the way). If you try to translate it with no source language specified, it might be autodetected as Croatian or even Polish. Consequently, the translation will be very far from the original meaning.
  • Source text format. When you specify TEXT as the format, you’ll get plain text back. When you specify HTML, be ready to handle HTML entities in the translation result. For example, if you translate “Jag är mammas son” from Swedish to English with the HTML format, you might get back something like “I&#39;m my mother&#39;s son”, with each apostrophe encoded as an entity.
  • When translating tagged text, consider sticking to standard HTML tags, as some MT engines treat non-standard tags as sentence breaks. For example, try translating “She <o>rose <o>and <o>left” to French. You might get “Elle <o> Rose <o> et <o> la gauche” back while expecting something like “Elle s’est levée et est partie”.
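
Both HTML-related points can be checked with the Python standard library. In this sketch, `STANDARD_TAGS` is a small, non-exhaustive set chosen for illustration, and the helper names are made up for this example. `html.unescape` and `html.parser.HTMLParser` are the real standard-library pieces it relies on.

```python
import html
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags treated as "standard" here.
STANDARD_TAGS = {"a", "b", "br", "em", "i", "p", "span", "strong", "u"}

class TagCollector(HTMLParser):
    """Collect the names of all opening tags seen in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def nonstandard_tags(markup: str) -> list[str]:
    """Return tag names in the source that are not in STANDARD_TAGS."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return sorted(set(collector.tags) - STANDARD_TAGS)

def decode_entities(translated: str) -> str:
    """Turn entities such as &#39; or &amp; back into plain characters."""
    return html.unescape(translated)
```

Running `nonstandard_tags` on the source before translation shows which tags may confuse an engine. `decode_entities` is meant for results that will be displayed as plain text rather than rendered as HTML.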

If you’ve ticked all of the boxes, you’re likely to be happy with the result. If, however, you feel like you need to take some significant bits out of your text to make it machine-translatable, here is a trick: take them out, have it translated, and then bring them back, sprinkling your already decent-looking translated text with some spice and flavor.
