LEPOR (Length Penalty, Precision, n-gram Position difference Penalty and Recall) is an automatic, language-independent machine translation evaluation metric with tunable parameters and reinforced factors.

Background

Since IBM proposed and implemented BLEU as an automatic metric for machine translation (MT) evaluation, many other methods have been proposed to revise or improve it, such as TER and METEOR. However, the traditional automatic evaluation metrics have some problems. Some metrics perform well on certain languages but poorly on others, which is usually called the language bias problem. Some metrics rely on many language-specific features or a large amount of linguistic information, which makes it difficult for other researchers to repeat the experiments. LEPOR is an automatic evaluation metric that tries to address some of these problems. It is designed with augmented factors and corresponding tunable parameters to address the language bias problem. Furthermore, the improved version of LEPOR, hLEPOR, tries to use optimized linguistic features extracted from treebanks. Another advanced version is the nLEPOR metric, which adds n-gram features to the previous factors. The LEPOR metric has thus been developed into a LEPOR series.

LEPOR metrics have been studied and analyzed by researchers from different fields, such as machine translation, natural-language generation, and search, among others. They are receiving growing attention from researchers in natural language processing.

Design

LEPOR is designed with four factors: an enhanced length penalty, precision, an n-gram word order penalty, and recall. The enhanced length penalty ensures that the hypothesis translation, which is usually produced by a machine translation system, is penalized if it is longer or shorter than the reference translation. The precision score reflects the accuracy of the hypothesis translation. The recall score reflects the fidelity of the hypothesis translation to the reference translation or source language. The n-gram based word order penalty accounts for differences in word position between the hypothesis translation and the reference translation. The usefulness of a word order penalty has been shown by several researchers, for example Wong and Kit (2008).

Because metrics based on surface string matching have been criticized for lacking syntactic and semantic awareness, the further developed metric hLEPOR investigates the integration of linguistic features such as part of speech (POS). POS is introduced as a feature that carries both syntactic and semantic information. For example, if a token in the output sentence is a verb where a noun is expected, it is penalized; if the POS is the same but the exact word differs, e.g. good vs nice, the candidate gains some credit. The overall hLEPOR score is then calculated as a combination of the word-level score and the POS-level score with a set of weights.

n-gram knowledge inspired by language modelling is also extensively explored in nLEPOR. In addition to its use in calculating the n-gram position difference penalty, n-gram knowledge is applied to n-gram precision and n-gram recall in nLEPOR, and n is an adjustable parameter.

In addition to the POS knowledge in hLEPOR, phrase structure from parsing information is included in a further variant, HPPR. In HPPR, phrase structure categories such as noun phrase, verb phrase, prepositional phrase, and adverbial phrase are considered when matching the candidate text to the reference text.
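The way the four basic factors of this section fit together can be illustrated with a minimal Python sketch. It assumes the formulation usually given for the basic LEPOR metric in Han et al. (2012): a length penalty of exp(1 - longer/shorter), a position difference penalty of exp(-NPD), and a weighted harmonic mean of recall and precision. The function names, the whitespace tokenization, the default weights alpha = beta = 1, and the greedy nearest-position word alignment are illustrative assumptions and not the reference implementation; the published metric uses n-gram context to choose among several matching candidates, and its parameters are tunable.

```python
import math


def length_penalty(c, r):
    """Enhanced length penalty: 1.0 for equal lengths, below 1.0 otherwise."""
    if c == 0 or r == 0:
        return 0.0
    if c == r:
        return 1.0
    # Penalize both shorter (c < r) and longer (c > r) hypotheses.
    return math.exp(1 - r / c) if c < r else math.exp(1 - c / r)


def lepor_sketch(hypothesis, reference, alpha=1.0, beta=1.0):
    """Sentence-level LEPOR-style score (simplified, assumption-laden sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    c, r = len(hyp), len(ref)
    if c == 0 or r == 0:
        return 0.0

    used = set()      # reference positions that are already aligned
    pos_diffs = []    # |normalized hyp position - normalized ref position|
    for i, token in enumerate(hyp):
        candidates = [j for j, ref_token in enumerate(ref)
                      if ref_token == token and j not in used]
        if not candidates:
            continue
        # Simplification: pick the candidate closest in relative position.
        j = min(candidates, key=lambda k: abs((i + 1) / c - (k + 1) / r))
        used.add(j)
        pos_diffs.append(abs((i + 1) / c - (j + 1) / r))

    matched = len(pos_diffs)
    if matched == 0:
        return 0.0

    precision = matched / c
    recall = matched / r
    # Weighted harmonic mean of recall (weight alpha) and precision (weight beta).
    harmonic = (alpha + beta) / (alpha / recall + beta / precision)
    # Position difference penalty, normalized by the hypothesis length.
    npd = sum(pos_diffs) / c
    return length_penalty(c, r) * math.exp(-npd) * harmonic
```

With this sketch an exact copy of the reference scores 1.0, a hypothesis that shares no words with the reference scores 0.0, and a hypothesis with the right words in a different order scores strictly between the two because only the position penalty applies.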

Software implementation

LEPOR metrics were originally implemented in the Perl programming language. More recently, a Python version has been made available by other researchers and engineers, with a press announcement from the Logrus Global language service company.

Performance

The LEPOR series has shown good performance in the ACL's annual international workshop on statistical machine translation (ACL-WMT). ACL-WMT is held by the special interest group on machine translation (SIGMT) of the international Association for Computational Linguistics (ACL). ACL-WMT 2013 had two translation and evaluation tracks, English-to-other and other-to-English. The "other" languages were Spanish, French, German, Czech, and Russian.

In the English-to-other direction, nLEPOR achieved the highest system-level correlation with human judgments as measured by the Pearson correlation coefficient, and the second highest as measured by the Spearman rank correlation coefficient. In the other-to-English direction, nLEPOR performed moderately and METEOR yielded the highest correlation with human judgments. This is attributed to the fact that, apart from the officially provided training data, nLEPOR uses only one concise linguistic feature, part-of-speech information, whereas METEOR uses many other external resources, such as synonym dictionaries, paraphrases, and stemming.

An extended study of LEPOR's performance under different conditions, including pure word-surface form, POS features, and phrase tag features, is described in a thesis from the University of Macau (Han, 2014).

A detailed statistical analysis of hLEPOR and nLEPOR performance in WMT13 shows that it performed as one of the best metrics "in both the individual language pair assessment for Spanish-to-English and the aggregated set of 9 language pairs"; see Graham et al. (2015), "Accurate Evaluation of Segment-level Machine Translation Metrics", NAACL (https://www.aclweb.org/anthology/N15-1124, https://github.com/ygraham/segment-mteval).
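The system-level correlations with human judgments mentioned in this section can be illustrated with a short Python sketch of the Pearson and Spearman coefficients. The function names and the five pairs of scores are invented for illustration and are not WMT13 results; the sketch also assumes that neither score list is constant, since the Pearson coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five MT systems (invented numbers).
metric_scores = [0.62, 0.55, 0.71, 0.48, 0.66]
human_scores = [0.58, 0.50, 0.69, 0.52, 0.60]

print("Pearson: ", round(pearson(metric_scores, human_scores), 3))
print("Spearman:", round(spearman(metric_scores, human_scores), 3))
```

A metric whose scores rank the systems in nearly the same order as the human judgments obtains a Spearman coefficient close to 1; on the invented data above, only two adjacent systems are swapped and the coefficient is 0.9.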

Applications

The LEPOR automatic metric series has been applied by researchers from different fields. Within MT, it has been used in both standard MT and neural MT. Outside the MT community, LEPOR has been applied in search evaluation, mentioned as an option for evaluating code (programming language) generation, and applied in image captioning evaluation (Qiu et al., 2020). Novikova et al. (2017) investigated automatic evaluation of natural language generation with metrics including LEPOR, and argued that automatic metrics can help system-level evaluations.

References

* Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation" in ''ACL-2002: 40th Annual meeting of the Association for Computational Linguistics'', pp. 311–318.
* Han, A.L.F., Wong, D.F., and Chao, L.S. (2012). "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors" in ''Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters'', pp. 441–450. Mumbai, India.
* Han, A.L.F., Wong, D.F., Chao, L.S., He, L., Lu, Y., Xing, J., and Zeng, X. (2013a). "Language-independent Model for Machine Translation Evaluation with Reinforced Factors" in ''Proceedings of the Machine Translation Summit XIV (MT SUMMIT 2013)'', pp. 215–222. Nice, France. International Association for Machine Translation.
* Han, A.L.F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Wang, Y., and Zhou, J. (2013b). "A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task" in ''Proceedings of the Eighth Workshop on Statistical Machine Translation, ACL-WMT13'', pp. 414–421. Sofia, Bulgaria. Association for Computational Linguistics.
* ACL-WMT. (2013).
* Wong, B. T-M, and Kit, C. (2008). "Word choice and word position for automatic MT evaluation" in ''Workshop: MetricsMATR of the Association for Machine Translation in the Americas (AMTA)'', short paper, Waikiki, US.
* Banerjee, S. and Lavie, A. (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in ''Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of Computational Linguistics (ACL-2005)'', Ann Arbor, Michigan, June 2005.
* Han, Lifeng. (2014). "LEPOR: An Augmented Machine Translation Evaluation Metric". Thesis for Master of Science in Software Engineering. University of Macau, Macao.
* Graham, Y., Baldwin, T., and Mathur, N. (2015). "Accurate evaluation of segment-level machine translation metrics" in ''NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies'', Denver, Colorado, USA, May 31 - June 5, 2015, pp. 1183–1191.
* Novikova, J., Dušek, O., Cercas Curry, A., and Rieser, V. (2017). "Why we need new evaluation metrics for NLG" in ''Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing'', pp. 2241–2252. Copenhagen, Denmark. Association for Computational Linguistics.
* D Qiu, B Rothrock, T Islam, AK Didier, VZ Sun… (2020). "SCOTI: Science Captioning of Terrain Images for data prioritization and local image search". Planetary and Space. Elsevier.