{"id":11,"date":"2022-11-23T15:25:18","date_gmt":"2022-11-23T15:25:18","guid":{"rendered":"https:\/\/mtuniversity.inten.to\/?p=11"},"modified":"2023-05-02T10:20:48","modified_gmt":"2023-05-02T10:20:48","slug":"automated-scoring-and-evaluation-of-mt-engines","status":"publish","type":"post","link":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/","title":{"rendered":"Automated scoring and evaluation of MT engines"},"content":{"rendered":"<h2><span style=\"font-weight: 400;\">MT Evaluation Goals<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Evaluating machine translation may appear to be solely focused on determining translation quality, but there is more to it than that. The chosen models must be approved by various departments, including security, legal, and procurement, ensuring they meet all requirements.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To effectively evaluate MT engines, know why you are using MT and what you hope to achieve. 
Here are some common goals for MT evaluation:<\/span><\/p>\n<ol>\n<li><b>Selecting the best MT model<\/b><span style=\"font-weight: 400;\">: Pick the machine translation model that best fits your domain, language pair, and content type.<\/span><\/li>\n<li><b>Gathering data to enhance MT and identifying bottlenecks<\/b><span style=\"font-weight: 400;\">: Collect relevant data to improve MT performance and identify areas for improvement, ensuring a smoother translation process.<\/span><\/li>\n<li><b>Assessing risk factors<\/b><span style=\"font-weight: 400;\">: Identify potential risk factors associated with implementing MT, such as data security, to mitigate any negative impact on your project.<\/span><\/li>\n<li><b>Evaluating the source content&#8217;s compatibility with MT<\/b><span style=\"font-weight: 400;\">: Assess how well your source content aligns with MT capabilities, and make any adjustments needed for a seamless translation.<\/span><\/li>\n<li><b>Gaining end-user trust and confidence<\/b><span style=\"font-weight: 400;\">: Foster confidence in end-users by collecting their feedback and demonstrating the effectiveness of your chosen MT solution.<\/span><\/li>\n<li><b>Establishing fair machine translation post-editing (MTPE) rates<\/b><span style=\"font-weight: 400;\">: Measure MTPE effort to set rates that fairly compensate editors for their work.<\/span><\/li>\n<li><b>Implementing translation triage<\/b><span style=\"font-weight: 400;\">: Apply a translation prioritization system to allocate resources effectively, ensuring high-quality translations for the most critical content.<\/span><\/li>\n<li><b>Estimating return on investment (ROI)<\/b><span style=\"font-weight: 400;\">: Calculate ROI for your MT project, considering cost savings, improved efficiency, and overall translation quality.<\/span><\/li>\n<\/ol>\n<h3><span style=\"font-weight: 400;\">MT ROI Framework for 
Localization Use-Case<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">In this section, we will focus on the post-editing case of the MT ROI Framework. There are two approaches to evaluation: automatic and human evaluation.<\/span><\/p>\n<figure id=\"attachment_527\" aria-describedby=\"caption-attachment-527\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img fetchpriority=\"high\" decoding=\"async\" class=\"size-large wp-image-527\" src=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-1024x401.png\" alt=\"\" width=\"800\" height=\"313\" srcset=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-1024x401.png 1024w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-300x118.png 300w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-768x301.png 768w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-1536x602.png 1536w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-2048x803.png 2048w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-527\" class=\"wp-caption-text\">Figure 32. 
MT ROI framework for Localization use-case<\/figcaption><\/figure>\n<h2><span style=\"font-weight: 400;\">MT Evaluation Types<\/span><\/h2>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Automatic Evaluation:<\/span>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Reference-Based Scoring: Compare MT output to a human-generated reference translation and score their similarity with automatic metrics.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">MTQE Scoring: Use Machine Translation Quality Estimation (MTQE) metrics to predict translation quality without a reference.<\/span><\/li>\n<\/ol>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Human Evaluation:<\/span>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Linguistic Quality Assessment (LQA): Have translators and editors measure post-editing effort and translation quality, considering factors such as accuracy, style, and consistency.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Holistic Evaluation: Gather feedback from end-users who assess the overall translation quality.<\/span><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Human evaluation is often considered the benchmark for determining translation quality. Human evaluators possess linguistic and cultural expertise, enabling them to comprehend nuances, idiomatic expressions, and context-specific meanings in both the source and target languages. They can identify subtle errors or inconsistencies that automated evaluation methods might overlook.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, human evaluation is labor-intensive and time-consuming, particularly when examining thousands of content segments. 
Human evaluators can review only a small fraction of a large body of content. Consequently, evaluators often use sampling, selecting a subset of content for assessment. While this method saves time and resources, a random sample covers only part of the content, which adds the risk of missing business-critical errors.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We combine automatic and human evaluation, using smart sampling to direct reviewers&#8217; attention to the most relevant segments and make the process reliable. Smart sampling is designed to capture critical information and context-specific nuances, leading to more accurate and reliable translation assessments. The goal is for the effort spent evaluating 1,000 words to be representative of the effort required for half a million words in production.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The primary objective is to reduce the content review volume by a factor of approximately 200, with a corresponding reduction in review time.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Translation quality scores (metrics)<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Translation quality scores are numerical values assigned to machine-translated content to measure how close the translation is to a \u201cgolden\u201d reference. 
There are several ways to calculate them:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Corpus-based scores (e.g., BLEU): Not comparable across datasets, unstable at the segment level, and provide no measure of statistical significance<\/span><\/li>\n<li><span style=\"font-weight: 400;\">N-gram-based scores: Do not tolerate alternative translations<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Embedding-based scores: Tolerate alternative translations but may have review bias depending on training data (e.g., tone of voice preference); some embedding-based scores are customizable<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Note that scores should not be compared between languages, as different languages are tokenized differently. Absolute values of scores are less important than how engines are ranked. Focus on the correlation between scores rather than their absolute values. Understanding how each score works will help identify potential miscorrelations.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Reference-based scores<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Reference-based scores compare the actual machine translation to a reference translation. There are two main types: <\/span><b>syntactic<\/b><span style=\"font-weight: 400;\"> similarity scores, primarily based on n-grams, and <\/span><b>semantic<\/b><span style=\"font-weight: 400;\"> similarity scores, which compare meaning using word embeddings. Syntactic similarity scores are less tolerant of alternative translations and less effective for languages with complex morphology. 
Semantic similarity scores are more tolerant of alternative translations.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Examples of reference-based scores<\/span><\/h3>\n<ol>\n<li><span style=\"font-weight: 400;\"> hLEPOR (<\/span><i><span style=\"font-weight: 400;\">Syntactic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span style=\"font-weight: 400;\">Compares token-based n-gram similarity<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Penalizes omissions, additions, paraphrases, synonyms, and different-length translations<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/1703.08748.pdf\"><span style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/aaronlifenghan\/aaron-project-hlepor\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n<ol start=\"2\">\n<li><span style=\"font-weight: 400;\"> BERTScore (<\/span><i><span style=\"font-weight: 400;\">Semantic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span style=\"font-weight: 400;\">Computes cosine similarity between BERT token representations of the MT output and the human reference<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Does not penalize paraphrases or synonyms<\/span><\/li>\n<li><span style=\"font-weight: 400;\">May be unreliable for specific domains and for languages underrepresented in BERT<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1904.09675\"><span style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/Tiiiger\/bert_score\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n<ol start=\"3\">\n<li><span style=\"font-weight: 400;\"> TER (<\/span><i><span style=\"font-weight: 400;\">Syntactic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span 
style=\"font-weight: 400;\">Measures the number of edits required to transform the MT output into the reference translation; lower is better<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Penalizes paraphrases, synonyms, and different-length translations<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/aclanthology.org\/2006.amta-papers.25.pdf\"><span style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/jhclark\/tercom\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n<ol start=\"4\">\n<li><span style=\"font-weight: 400;\"> PRISM (<\/span><i><span style=\"font-weight: 400;\">Semantic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span style=\"font-weight: 400;\">Scores the MT output as a paraphrase of the human reference translation<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Penalizes fluency and adequacy errors<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Does not penalize paraphrases or synonyms<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/aclanthology.org\/2020.emnlp-main.8.pdf\"><span style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/thompsonb\/prism\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n<ol start=\"5\">\n<li><span style=\"font-weight: 400;\"> COMET (<\/span><i><span style=\"font-weight: 400;\">Semantic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span style=\"font-weight: 400;\">Predicts machine translation quality using information from both the source input and the reference translation<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Achieved state-of-the-art correlation with human judgment when it was published<\/span><\/li>\n<li><span style=\"font-weight: 400;\">May penalize paraphrases and synonyms<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/aclanthology.org\/2020.emnlp-main.213.pdf\"><span 
style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/Unbabel\/COMET\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n<ol start=\"6\">\n<li><span style=\"font-weight: 400;\"> SacreBLEU (<\/span><i><span style=\"font-weight: 400;\">Syntactic<\/span><\/i><span style=\"font-weight: 400;\"> similarity)<\/span><\/li>\n<\/ol>\n<ul>\n<li><span style=\"font-weight: 400;\">Compares the token-based n-gram overlap between the MT output and the reference; aggregates n-gram statistics over the entire corpus rather than averaging segment scores<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Penalizes omissions, additions, paraphrases, synonyms, and different-length translations<\/span><\/li>\n<\/ul>\n<p><a href=\"https:\/\/aclanthology.org\/P02-1040.pdf\"><span style=\"font-weight: 400;\">Paper<\/span><\/a><span style=\"font-weight: 400;\"> + <\/span><a href=\"https:\/\/github.com\/mjpost\/sacrebleu\"><span style=\"font-weight: 400;\">code<\/span><\/a><\/p>\n<p>Read more about MT Quality metrics <a href=\"https:\/\/help.inten.to\/hc\/en-us\/articles\/4413600059282-Evaluate-Models#h_01GB4ZMCBHZ8NCQA9TA5B88SX1\">here<\/a>.<\/p>\n<figure id=\"attachment_528\" aria-describedby=\"caption-attachment-528\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" class=\"size-large wp-image-528\" src=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-1024x556.png\" alt=\"\" width=\"800\" height=\"434\" srcset=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-1024x556.png 1024w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-300x163.png 300w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-768x417.png 768w, 
https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-1536x834.png 1536w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-33.-Scoring-2048x1111.png 2048w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-528\" class=\"wp-caption-text\">Figure 33. Examples of using Translation quality scores in <a href=\"https:\/\/inten.to\/mt-studio\/\">Intento MT Studio<\/a><\/figcaption><\/figure>\n<h3><span style=\"font-weight: 400;\">MT Quality Estimation (MTQE) metrics<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">MTQE scores indicate how likely a translation is to be correct when you have no human reference to compare it against. Various tools and models can be used for this purpose, including open-source options like LaBSE and PRISM and commercial solutions like ModelFront and COMET QE.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These scores suit specific evaluation goals but not choosing the MT model or estimating ROI, as pre-trained quality estimation models often show minor discrepancies between MTQE and LQA results. Customizing an MTQE model may help reduce these discrepancies.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Nevertheless, MTQE scores are helpful for data cleaning by detecting mistranslations and directing reviewers&#8217; attention to potentially risky content. 
While they may not be suitable for broader MT evaluation purposes, they can be valuable in addressing specific evaluation concerns and ensuring data accuracy, as shown in Figure 34.<\/span><\/p>\n<figure id=\"attachment_529\" aria-describedby=\"caption-attachment-529\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" class=\"size-large wp-image-529\" src=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-35.-Correlation-between-MTQE-and-LQA-1024x555.jpg\" alt=\"\" width=\"800\" height=\"434\" srcset=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-35.-Correlation-between-MTQE-and-LQA-1024x555.jpg 1024w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-35.-Correlation-between-MTQE-and-LQA-300x163.jpg 300w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-35.-Correlation-between-MTQE-and-LQA-768x416.jpg 768w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-35.-Correlation-between-MTQE-and-LQA.jpg 1476w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-529\" class=\"wp-caption-text\">Figure 34. Correlation between MTQE and LQA<\/figcaption><\/figure>\n<h3><span style=\"font-weight: 400;\">Reducing the amount of data for human evaluation using automated corpus-level scoring<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The first step in reducing data for human evaluation is to assess only the models with the highest corpus scores. Typically, this involves selecting the top three or four models, or the models whose scores differ from the rest with statistical significance, judged by their confidence intervals.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A confidence interval is a statistical measure that estimates the range within which a specific parameter or value is likely to fall. 
It indicates the estimate&#8217;s uncertainty, and its confidence level is often expressed as a percentage (for example, 95%). A confidence interval captures the possible range of values for a given estimate, with a specified level of confidence that the actual value lies within that range.<\/span><\/p>\n<figure id=\"attachment_530\" aria-describedby=\"caption-attachment-530\" style=\"width: 800px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-530\" src=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-34.-MT-provider-ranking-by-COMET-1024x595.png\" alt=\"\" width=\"800\" height=\"465\" srcset=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-34.-MT-provider-ranking-by-COMET-1024x595.png 1024w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-34.-MT-provider-ranking-by-COMET-300x174.png 300w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-34.-MT-provider-ranking-by-COMET-768x447.png 768w, https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-34.-MT-provider-ranking-by-COMET.png 1204w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-530\" class=\"wp-caption-text\">Figure 35. MT provider ranking by COMET<\/figcaption><\/figure>\n<h3><span style=\"font-weight: 400;\">Aspects of scoring<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">When using scores to evaluate MT quality, consider the following:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Hyperparameters<\/b><span style=\"font-weight: 400;\">: Be aware that scores may not be comparable across different algorithms or implementations, as they may use different hyperparameters. 
For instance, BLEU scores from different implementations cannot be directly compared when their tokenization, smoothing, or other parameters differ.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Scaling, Normalization, Standardization<\/b><span style=\"font-weight: 400;\">: Consider these data preprocessing techniques to ensure scores are consistent and comparable across different data sets or models.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Absolute Scoring vs. Ranking<\/b><span style=\"font-weight: 400;\">: Instead of relying on absolute scores, which provide a specific value for translation quality, use ranking, which orders translations based on their relative performance.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Statistical Significance and Confidence Intervals<\/b><span style=\"font-weight: 400;\">: Evaluate the statistical significance of your results and use confidence intervals to compare mean scores, so you can judge the reliability and validity of your translation quality assessments.<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Calculating the scores. Libraries<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Many MT evaluation scores can be found in research papers and open-source platforms. One way to calculate these scores is to use the Python packages that implement them, such as COMET or SacreBLEU.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Intento has also implemented scores on its platform, allowing customers to calculate various scores via API. 
Note that for scores requiring a GPU, you must manage deployment, provisioning, and de-provisioning yourself to avoid unnecessary expenses.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another option is<\/span><a href=\"https:\/\/help.inten.to\/hc\/en-us\/sections\/4413578103442-Evaluation-Projects\"><span style=\"font-weight: 400;\"> Intento MT Studio<\/span><\/a><span style=\"font-weight: 400;\">, which offers a simple user interface to run translations through several stock engines and calculate scores. By leveraging these tools and resources, you can efficiently evaluate machine translation models and determine the best fit for your needs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For more on how to make sampling smart and run LQA, see the next chapter, <\/span><i><span style=\"font-weight: 400;\">Linguistic quality analysis and ROI estimation<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Key Takeaways<\/span><\/h2>\n<ol>\n<li><span style=\"font-weight: 400;\"> Identify MT evaluation goals: Understand your objectives when evaluating MT engines, such as selecting the best model, improving performance, assessing risks, and estimating ROI.<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Combine automatic and human evaluation: Use reference-based and MTQE scoring for automatic evaluation while incorporating LQA and holistic evaluation from human reviewers.<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Sample smart for efficient evaluation: Implement smart sampling to focus reviewers&#8217; attention on the most relevant segments, ensuring accurate and reliable translation assessments while reducing review time.<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Understand and select scoring methods: Consider various scoring types, including corpus-based, n-gram-based, and embedding-based scores, as well as their limitations and benefits, to make informed 
decisions during the evaluation process.<\/span><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Using MT quality estimated scores<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/<\/title>\n<meta name=\"robots\" content=\"noindex, nofollow\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/\" \/>\n<meta property=\"og:description\" content=\"Using MT quality estimated scores\" \/>\n<meta property=\"og:url\" content=\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/\" \/>\n<meta property=\"og:site_name\" content=\"inten.to\/machine-translation-university\/\" \/>\n<meta property=\"article:published_time\" content=\"2022-11-23T15:25:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-05-02T10:20:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-1024x401.png\" \/>\n<meta name=\"author\" content=\"sergei.polikarpov@inten.to\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"sergei.polikarpov@inten.to\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/\",\"url\":\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/\",\"name\":\"Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/\",\"isPartOf\":{\"@id\":\"https:\/\/inten.to\/machine-translation-university\/#website\"},\"datePublished\":\"2022-11-23T15:25:18+00:00\",\"dateModified\":\"2023-05-02T10:20:48+00:00\",\"author\":{\"@id\":\"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/1aa9e5874e74cbf37313324ccc703af0\"},\"breadcrumb\":{\"@id\":\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"MT University\",\"item\":\"https:\/\/inten.to\/machine-translation-university\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Automated scoring and evaluation of MT engines\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/inten.to\/machine-translation-university\/#website\",\"url\":\"https:\/\/inten.to\/machine-translation-university\/\",\"name\":\"inten.to\/machine-translation-university\/\",\"description\":\"Intento MT 
University\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/inten.to\/machine-translation-university\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/1aa9e5874e74cbf37313324ccc703af0\",\"name\":\"sergei.polikarpov@inten.to\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1fbab3532c586e5c65e28bb673c63bb7?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1fbab3532c586e5c65e28bb673c63bb7?s=96&d=mm&r=g\",\"caption\":\"sergei.polikarpov@inten.to\"},\"url\":\"https:\/\/inten.to\/machine-translation-university\/author\/sergei-polikarpovinten-to\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/","robots":{"index":"noindex","follow":"nofollow"},"og_locale":"en_US","og_type":"article","og_title":"Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/","og_description":"Using MT quality estimated scores","og_url":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/","og_site_name":"inten.to\/machine-translation-university\/","article_published_time":"2022-11-23T15:25:18+00:00","article_modified_time":"2023-05-02T10:20:48+00:00","og_image":[{"url":"https:\/\/inten.to\/machine-translation-university\/wp-content\/uploads\/2022\/11\/Figure-32.-MT-ROI-framework-for-Localization-use-case-1024x401.png"}],"author":"sergei.polikarpov@inten.to","twitter_card":"summary_large_image","twitter_misc":{"Written by":"sergei.polikarpov@inten.to","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/","url":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/","name":"Automated scoring and evaluation of MT engines - inten.to\/machine-translation-university\/","isPartOf":{"@id":"https:\/\/inten.to\/machine-translation-university\/#website"},"datePublished":"2022-11-23T15:25:18+00:00","dateModified":"2023-05-02T10:20:48+00:00","author":{"@id":"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/1aa9e5874e74cbf37313324ccc703af0"},"breadcrumb":{"@id":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/inten.to\/machine-translation-university\/automated-scoring-and-evaluation-of-mt-engines\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"MT University","item":"https:\/\/inten.to\/machine-translation-university\/"},{"@type":"ListItem","position":2,"name":"Automated scoring and evaluation of MT engines"}]},{"@type":"WebSite","@id":"https:\/\/inten.to\/machine-translation-university\/#website","url":"https:\/\/inten.to\/machine-translation-university\/","name":"inten.to\/machine-translation-university\/","description":"Intento MT University","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/inten.to\/machine-translation-university\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/1aa9e5874e74cbf37313324ccc703af0","name":"sergei.polikarpov@inten.to","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/inten.to\/machine-translation-university\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1fbab3532c586e5c65e28bb673c63bb7?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1fbab3532c586e5c65e28bb673c63bb7?s=96&d=mm&r=g","caption":"sergei.polikarpov@inten.to"},"url":"https:\/\/inten.to\/machine-translation-university\/author\/sergei-polikarpovinten-to\/"}]}},"_links":{"self":[{"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/posts\/11"}],"collection":[{"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/comments?post=11"}],"version-history":[{"count":6,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/posts\/11\/revisions"}],"predecessor-version":[{"id":531,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/posts\/11\/revisions\/531"}],"wp:attachment":[{"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/media?parent=11"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/categories?post=11"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inten.to\/machine-translation-university\/wp-json\/wp\/v2\/tags?post=11"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}