NIST is a method for evaluating the quality of text which has been translated using machine translation. Its name comes from the US National Institute of Standards and Technology.
Machine translation, sometimes referred to by the abbreviation MT is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
The National Institute of Standards and Technology (NIST) is a physical sciences laboratory, and a non-regulatory agency of the United States Department of Commerce. Its mission is to promote innovation and industrial competitiveness. NIST's activities are organized into laboratory programs that include nanoscale science and technology, engineering, information technology, neutron research, material measurement, and physical measurement. The American AI initiative has called NIST to lead the development of appropriate technical standards for reliable, robust, trustworthy, secure, portable, and interoperable AI systems.
It is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given. [1]
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
For example, if the bigram "on the" is correctly matched, it will receive lower weight than the correct matching of bigram "interesting calculations", as this is less likely to occur.
NIST also differs from BLEU in its calculation of the brevity penalty insofar as small variations in translation length do not impact the overall score as much.
BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.
In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
METEOR is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
Conversion of units is the conversion between different units of measurement for the same quantity, typically through multiplicative conversion factors.
The litre or liter is an SI accepted metric system unit of volume equal to 1 cubic decimetre (dm3), 1,000 cubic centimetres (cm3) or 1/1,000 cubic metre. A cubic decimetre occupies a volume of 10 cm×10 cm×10 cm and is thus equal to one-thousandth of a cubic metre.
The pound or pound-mass is a unit of mass used in the imperial, United States customary and other systems of measurement. Various definitions have been used; the most common today is the international avoirdupois pound, which is legally defined as exactly 0.45359237 kilograms, and which is divided into 16 avoirdupois ounces. The international standard symbol for the avoirdupois pound is lb; an alternative symbol is lbm, #, and ℔ or ″̶.
The International System of Units is the modern form of the metric system, and is the most widely used system of measurement. It comprises a coherent system of units of measurement built on seven base units, which are the ampere, kelvin, second, metre, kilogram, candela, mole, and a set of twenty prefixes to the unit names and unit symbols that may be used when specifying multiples and fractions of the units. The system also specifies names for 22 derived units, such as lumen and watt, for other common physical quantities.
The tonne, commonly referred to as the metric ton in the United States and Canada, is a non-SI metric unit of mass equal to 1,000 kilograms or one megagram. It is equivalent to approximately 2,204.6 pounds, 1.102 short tons (US) or 0.984 long tons (UK). Although not part of the SI, the tonne is accepted for use with SI units and prefixes by the International Committee for Weights and Measures.
The mole is the base unit of amount of substance in the International System of Units (SI). Effective 20 May 2019, the mole is defined as the amount of a chemical substance that contains exactly 6.02214076×1023 (Avogadro constant) constitutive particles, e.g., atoms, molecules, ions or electrons.
In chemistry, the molar massM is a physical property defined as the mass of a given substance divided by the amount of substance. The base SI unit for molar mass is kg/mol. However, for historical reasons, molar masses are almost always expressed in g/mol.
The kilogram-force, or kilopond, is a gravitational metric unit of force. It is equal to the magnitude of the force exerted on one kilogram of mass in a 9.80665 m/s2 gravitational field. Therefore, one kilogram-force is by definition equal to 9.80665 N. Similarly, a gram-force is 9.80665 mN, and a milligram-force is 9.80665 μN. One kilogram-force is approximately 2.204622 pound-force.
Round-trip translation (RTT), also known as back-and-forth translation, recursive translation and bi-directional translation, is the process of translating a word, phrase or text into another language, then translating the result back into the original language, using machine translation (MT) software. It is often used by laypeople to evaluate a machine translation system, or to test whether a text is suitable for MT when they are unfamiliar with the target language. Because the resulting text can often differ substantially from the original, RTT can also be a source of entertainment.
The sections below give objective criteria for evaluating the usability of machine translation software output.
Various methods for the evaluation for machine translation have been employed. This article focuses on the evaluation of the output of machine translation, rather than on performance or usability evaluation.
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
The Europarl Corpus is a corpus that consists of the proceedings of the European Parliament from 1996 to the present. In its first release in 2001, it covered eleven official languages of the European Union. With the political expansion of the EU the official languages of the ten new member states have been added to the corpus data. The latest release (2012) comprised up to 60 million words per language with the newly added languages being slightly underrepresented as data for them is only available from 2007 onwards. This latest version includes 21 European languages: Romanic, Germanic, Slavic, Finno-Ugric, Baltic, and Greek.
LEPOR is an automatic language independent machine translation evaluation metric with tunable parameters and reinforced factors.
Paraphrase or Paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as semantic parsing and generation of new samples to expand existing corpora.
NIST 2005 Machine Translation Evaluation Official Results
This technology-related article is a stub. You can help Wikipedia by expanding it. |