Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing the overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality". Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated. MOS is a commonly used measure for video, audio, and audiovisual quality evaluation, but it is not restricted to those modalities.

ITU-T has defined several ways of referring to a MOS in Recommendation ITU-T P.800.1, depending on whether the score was obtained from audiovisual, conversational, listening, talking, or video quality tests.

Rating scales and mathematical definition

The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is the lowest perceived quality and 5 is the highest perceived quality. Other MOS ranges are also possible, depending on the rating scale that has been used in the underlying test. The Absolute Category Rating (ACR) scale is very commonly used; it maps ratings between ''Bad'' and ''Excellent'' to numbers between 1 and 5: 5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad. Other standardized quality rating scales exist in ITU-T Recommendations (such as ITU-T P.800 or ITU-T P.910). For example, one could use a continuous scale ranging from 1 to 100. Which scale is used depends on the purpose of the test. In certain contexts there are no statistically significant differences between ratings for the same stimuli when they are obtained using different scales.

The MOS is calculated as the arithmetic mean over single ratings performed by human subjects for a given stimulus in a subjective quality evaluation test. Thus:

MOS = \frac{\sum_{n=1}^{N} R_n}{N}

where R_n are the individual ratings for a given stimulus by N subjects.
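As a minimal sketch of this calculation, the following Python snippet averages the individual ratings for one stimulus. The function name `mean_opinion_score`, the `ACR_SCORES` lookup, and the five sample ratings are hypothetical and only serve as illustration; the label-to-number mapping follows the ACR scale described above.

```python
# ACR labels mapped to numeric scores (1 = Bad ... 5 = Excellent).
ACR_SCORES = {"Bad": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5}


def mean_opinion_score(ratings):
    """Return the arithmetic mean of the individual ratings for one stimulus."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


# Hypothetical ratings given by five subjects to the same stimulus.
labels = ["Good", "Fair", "Excellent", "Good", "Poor"]
ratings = [ACR_SCORES[label] for label in labels]

mos = mean_opinion_score(ratings)
print(mos)  # (4 + 3 + 5 + 4 + 2) / 5 = 3.6
```

The empty-input check is a design choice of this sketch: a MOS over zero ratings is undefined, so raising an error seems safer than returning a placeholder value.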

Properties of the MOS

The MOS is subject to certain mathematical properties and biases. In general, there is an ongoing debate on the usefulness of the MOS to quantify Quality of Experience in a single scalar value.

When the MOS is acquired using a categorical rating scale, it is based on an ordinal scale, similar to Likert scales. In this case, the ranking of the scale items is known, but their interval is not. Therefore, it is mathematically incorrect to calculate a mean over individual ratings in order to obtain the central tendency; the median should be used instead. However, in practice and in the definition of MOS, it is considered acceptable to calculate the arithmetic mean.

It has been shown that for categorical rating scales (such as ACR), the individual items are not perceived as equidistant by subjects. For example, there may be a larger "gap" between ''Good'' and ''Fair'' than there is between ''Good'' and ''Excellent''. The perceived distance may also depend on the language into which the scale is translated. However, there exist studies that could not prove a significant impact of scale translation on the obtained results.

Several other biases are present in the way MOS ratings are typically acquired (Zielinski, Slawomir, Francis Rumsey, and Søren Bech. "On some biases encountered in modern audio quality listening tests-a review." Journal of the Audio Engineering Society 56.6 (2008): 427-451). In addition to the above-mentioned issues with scales that are perceived non-linearly, there is a so-called "range-equalization bias": subjects, over the course of a subjective experiment, tend to give scores that span the entire rating scale. This makes it impossible to compare two different subjective tests if the range of presented quality differs. In other words, the MOS is never an absolute measure of quality, but only relative to the test in which it has been acquired.

For the above reasons, and due to several other contextual factors influencing the perceived quality in a subjective test, a MOS value should only be reported if the context in which the values have been collected is known and reported as well. MOS values gathered from different contexts and test designs therefore should not be directly compared. Recommendation ITU-T P.800.2 prescribes how MOS values should be reported. Specifically, P.800.2 says:

it is not meaningful to directly compare MOS values produced from separate experiments, unless those experiments were explicitly designed to be compared, and even then the data should be statistically analysed to ensure that such a comparison is valid.
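The mean-versus-median point raised in this section can be illustrated with a short Python sketch. The `summarize` helper and the sample ratings are hypothetical; the snippet only uses the standard-library `statistics` and `collections` modules to show that the two measures of central tendency can disagree for the same set of categorical ratings, and that the per-category counts carry information a single scalar does not.

```python
from collections import Counter
from statistics import mean, median


def summarize(ratings):
    """Return mean, median, and per-category counts for a list of ratings."""
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        "counts": dict(sorted(Counter(ratings).items())),
    }


# Hypothetical ACR ratings (1 = Bad ... 5 = Excellent) for one stimulus.
ratings = [4, 4, 4, 5, 1]

summary = summarize(ratings)
print(summary["mean"])    # 3.6, pulled down by the single rating of 1
print(summary["median"])  # 4, the middle value of the sorted ratings
print(summary["counts"])  # {1: 1, 4: 3, 5: 1}
```

Reporting the rating counts alongside the MOS is one possible way to retain what the scalar hides; it is shown here as an assumption of this sketch, not as a requirement of any Recommendation.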

MOS for speech and audio quality estimation

MOS historically originates from subjective measurements where listeners would sit in a "quiet room" and score a telephone call quality as they perceived it. This kind of test methodology had been in use in the telephony industry for decades and was standardized in Recommendation ITU-T P.800. It specifies that "the talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200–300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." Requirements for other modalities were similarly specified in later ITU-T Recommendations.

MOS estimation using quality models

Obtaining MOS ratings may be time-consuming and expensive, as it requires the recruitment of human assessors. For use cases such as codec development or service quality monitoring, where quality should be estimated repeatedly and automatically, MOS scores can also be predicted by objective quality models, which typically have been developed and trained using human MOS ratings.

A question that arises from using such models is whether the MOS differences produced are noticeable to users. For example, when rating images on a five-point MOS scale, an image with a MOS equal to 5 is expected to be noticeably better in quality than one with a MOS equal to 1. In contrast, it is not evident whether an image with a MOS equal to 3.8 is noticeably better in quality than one with a MOS equal to 3.6. Research on the smallest MOS difference that is perceptible to users for digital photographs showed that a MOS difference of approximately 0.46 is required for 75% of users to be able to detect the higher-quality image. Nevertheless, image quality expectation, and hence MOS, changes over time as user expectations change. As a result, minimum noticeable MOS differences determined using such analytical methods may change over time.
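The comparison described above can be sketched as a simple threshold check. The function `likely_noticeable` and the constant `PHOTO_MOS_THRESHOLD` are hypothetical names; the value 0.46 is the figure quoted in the text for digital photographs at a 75% detection rate, and it should not be assumed to hold for other content types, test designs, or points in time.

```python
# Approximate minimum noticeable MOS difference quoted in the text for
# digital photographs (75% of users detect the higher-quality image).
# Illustrative default only; not a general-purpose constant.
PHOTO_MOS_THRESHOLD = 0.46


def likely_noticeable(mos_a, mos_b, threshold=PHOTO_MOS_THRESHOLD):
    """Return True if the absolute MOS difference reaches the threshold."""
    return abs(mos_a - mos_b) >= threshold


print(likely_noticeable(3.8, 3.6))  # False: difference of about 0.2
print(likely_noticeable(5.0, 1.0))  # True: difference of 4.0
```

Both MOS values passed to such a check would need to come from the same model or the same test context, since values from different experiments should not be directly compared.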
