Lip reading, also known as speechreading, is a technique of understanding a limited range of speech by visually interpreting the movements of the lips, face and tongue without sound. Estimates of the range of lip reading vary, with some figures as low as 30%, because lip reading relies on context, language knowledge, and any residual hearing. Although lip reading is used most extensively by deaf and hard-of-hearing people, most people with normal hearing can infer some speech information by observing a speaker's mouth.
Process
Although speech perception is considered to be an auditory skill, it is intrinsically multimodal, since producing speech requires the speaker to make movements of the lips, teeth and tongue which are often visible in face-to-face communication. Information from the lips and face supports aural comprehension, and most fluent listeners of a language are sensitive to seen speech actions (see McGurk effect). The extent to which people make use of seen speech actions varies with the visibility of the speech action and the knowledge and skill of the perceiver.
Phonemes and visemes
The phoneme is the smallest detectable unit of sound in a language that serves to distinguish words from one another; /pit/ and /pik/ differ by one phoneme and refer to different concepts. Spoken English has about 44 phonemes. For lip reading, the number of visually distinctive units - visemes - is much smaller, thus several phonemes map onto a few visemes. This is because many phonemes are produced within the mouth and throat, and are hard to see. These include glottal consonants and most gestures of the tongue.
Voiced and unvoiced pairs look identical, such as [p] and [b], [k] and [g], [t] and [d], [f] and [v], and [s] and [z]; likewise for nasalisation (e.g. [m] vs. [b]).
Homophenes are words that look similar when lip read, but which contain different phonemes. Because there are about three times as many phonemes as visemes in English, it is often claimed that only 30% of speech can be lip read. Homophenes are a crucial source of mis-lip reading.
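The many-to-one collapse from phonemes to visemes, and the homophene classes it creates, can be sketched in a few lines of Python. The phoneme-to-viseme mapping and the four-word lexicon below are simplified illustrations, not a standard inventory:

```python
from collections import defaultdict

# Toy sketch (illustrative mapping, not a standard viseme inventory): many
# phonemes collapse onto one viseme, so distinct words can share the same
# visible mouth pattern (homophenes).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",  # lips pressed together
    "f": "labiodental", "v": "labiodental",             # teeth on lower lip
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "a": "open", "i": "spread",
}

def viseme_string(phonemes):
    """Collapse a phoneme sequence to the sequence a lipreader can see."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)

def homophene_classes(lexicon):
    """Group words whose viseme strings are identical (look-alikes on the lips)."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[viseme_string(phonemes)].append(word)
    return [words for words in groups.values() if len(words) > 1]

lexicon = {
    "pat": ("p", "a", "t"),
    "bat": ("b", "a", "t"),
    "mat": ("m", "a", "t"),
    "fat": ("f", "a", "t"),
}
print(homophene_classes(lexicon))  # [['pat', 'bat', 'mat']]; 'fat' stays distinct
```

Here ten phonemes reduce to five visemes, so 'pat', 'bat' and 'mat' become a single homophene class, while 'fat' remains visually distinct because the labiodental gesture is easy to see.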
Co-articulation
Visemes can be captured as still images, but speech unfolds in time. The smooth articulation of speech sounds in sequence can mean that mouth patterns may be 'shaped' by an adjacent phoneme: the 'th' sound in 'tooth' and in 'teeth' appears very different because of the vocalic context. This feature of dynamic speech-reading affects lip-reading 'beyond the viseme'.
How can it 'work' with so few visemes?
While visemes offer a useful starting point for understanding lipreading, distinctions within a viseme class can often still be discerned and can help support word identification.
Moreover, the statistical distribution of phonemes within the lexicon of a language is uneven. While there are clusters of words which are phonemically similar to each other ('lexical neighbors', such as spit/sip/sit/stick...etc.), others are unlike all other words: they are 'unique' in terms of the distribution of their phonemes ('umbrella' may be an example). Skilled users of the language bring this knowledge to bear when interpreting speech, so it is generally harder to identify a heard word with many lexical neighbors than one with few neighbors. Applying this insight to seen speech, some words in the language can be unambiguously lip-read even when they contain few visemes - simply because no other words could possibly 'fit'.
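This 'lexical uniqueness' effect can be illustrated with a toy model; the viseme labels and the three-word lexicon below are invented for the example:

```python
from collections import Counter

# Illustrative mapping and mini-lexicon (both hypothetical): a word can be
# lip-read without ambiguity when no other word in the lexicon shares its
# viseme string, even if the word itself contains few visemes.
VISEME = {
    "p": "B", "b": "B", "m": "B",   # bilabials look alike
    "t": "A", "d": "A",             # alveolars look alike
    "s": "S", "i": "I", "a": "O",
}

def viseme_string(phonemes):
    return "".join(VISEME[p] for p in phonemes)

def unambiguous_words(lexicon):
    """Words whose viseme string is unique across the whole lexicon."""
    counts = Counter(viseme_string(p) for p in lexicon.values())
    return sorted(w for w, p in lexicon.items() if counts[viseme_string(p)] == 1)

lexicon = {
    "pit": ("p", "i", "t"),   # 'pit' and 'bit' collide: both map to "BIA"
    "bit": ("b", "i", "t"),
    "sat": ("s", "a", "t"),   # no visual neighbour: identifiable on its own
}
print(unambiguous_words(lexicon))  # ['sat']
```

'pit' and 'bit' are mutual lexical neighbours under the viseme mapping and so can never be separated by vision alone, whereas 'sat' has no look-alike and is recoverable despite the impoverished visual signal.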
Variation in readability and skill
Many factors affect the visibility of a speaking face, including illumination, movement of the head/camera, frame-rate of the moving image and distance from the viewer. Head movement that accompanies normal speech can also improve lip-reading, independently of oral actions. However, when lip-reading
connected speech, the viewer's knowledge of the spoken language, familiarity with the speaker and style of speech, and the context of the lip-read material are as important as the visibility of the speaker. While most hearing people are sensitive to seen speech, there is great variability in individual speechreading skill. Good lipreaders are often more accurate than poor lipreaders at identifying phonemes from visual speech.
A simple visemic measure of 'lipreadability' has been questioned by some researchers. The 'phoneme equivalence class' measure takes into account the statistical structure of the lexicon and can also accommodate individual differences in lip-reading ability.
In line with this, excellent lipreading is often associated with more broad-based cognitive skills, including general language proficiency, executive function and working memory.
Lipreading and language learning in hearing infants and children
The first few months
Seeing the mouth plays a role in the very young infant's early sensitivity to speech, and prepares them to become speakers at 1–2 years. In order to imitate, a baby must learn to shape their lips in accordance with the sounds they hear; seeing the speaker may help them to do this. Newborns imitate adult mouth movements such as sticking out the tongue or opening the mouth, which could be a precursor to further imitation and later language learning. Infants are disturbed when audiovisual speech of a familiar speaker is desynchronized and tend to show different looking patterns for familiar than for unfamiliar faces when matched to (recorded) voices. Infants are sensitive to
McGurk illusions months before they have learned to speak. These studies and many more point to a role for vision in the development of sensitivity to (auditory) speech in the first half-year of life.
The next six months; a role in learning a native language
Until around six months of age, most hearing infants are sensitive to a wide range of speech gestures - including ones that can be seen on the mouth - which may or may not later be part of the phonology of their native language. But in the second six months of life, the hearing infant shows perceptual narrowing for the phonetic structure of their own language - and may lose the early sensitivity to mouth patterns that are not useful. The speech sounds /v/ and /b/, which are visemically distinctive in English but not in Castilian Spanish, are accurately distinguished in Spanish-exposed and English-exposed babies up to the age of around six months. However, older Spanish-exposed infants lose the ability to 'see' this distinction, while it is retained for English-exposed infants. Such studies suggest that rather than hearing and vision developing in independent ways in infancy, multimodal processing is the rule, not the exception, in (language) development of the infant brain.
Early language production: one to two years
Given the many studies indicating a role for vision in the development of language in the pre-lingual infant, the effects of congenital blindness on language development are surprisingly small. 18-month-olds learn new words more readily when they hear them, and do not learn them when they are shown the speech movements without hearing. However, children blind from birth can confuse /m/ and /n/ in their own early production of English words – a confusion rarely seen in sighted hearing children, since /m/ and /n/ are visibly distinctive but auditorily confusable. The role of vision in children aged 1–2 years may be less critical to the production of their native language, since, by that age, they have attained the skills they need to identify and imitate speech sounds. However, hearing a non-native language can shift the child's attention to visual and auditory engagement, by way of lipreading and listening, in order to process, understand and produce speech.
In childhood
Studies with pre-lingual infants and children use indirect, non-verbal measures to indicate sensitivity to seen speech. ''Explicit'' lip-reading can be reliably tested in hearing preschoolers by asking them to 'say aloud what I say silently'. In school-age children, lipreading of familiar closed-set words such as number words can be readily elicited. Individual differences in lip-reading skill, as tested by asking the child to 'speak the word that you lip-read', or by matching a lip-read utterance to a picture, show a relationship between lip-reading skill and age.
In hearing adults: lifespan considerations
While lip-reading silent speech poses a challenge for most hearing people, adding sight of the speaker to heard speech improves speech processing under many conditions. The mechanisms for this, and the precise ways in which lip-reading helps, are topics of current research.
Seeing the speaker helps at all levels of speech processing, from phonetic feature discrimination to the interpretation of pragmatic utterances.
The positive effects of adding vision to heard speech are greater in noisy than in quiet environments, where, by making speech perception easier, seeing the speaker can free up cognitive resources, enabling deeper processing of speech content.
As hearing becomes less reliable in old age, people may tend to rely more on lip-reading, and are encouraged to do so. However, greater reliance on lip-reading may not always compensate for the effects of age-related hearing loss. Cognitive decline in aging may be preceded by and/or associated with measurable hearing loss. Thus lipreading may not always be able to fully compensate for the combined hearing and cognitive age-related decrements.
In specific (hearing) populations
A number of studies report anomalies of lipreading in populations with distinctive developmental disorders.
Autism: People with autism may show reduced lipreading abilities and reduced reliance on vision in audiovisual speech perception. This may be associated with gaze-to-the-face anomalies in these people.
Williams syndrome: People with Williams syndrome show some deficits in speechreading, which may be independent of their visuo-spatial difficulties.
Specific language impairment: Children with SLI are also reported to show reduced lipreading sensitivity, as are people with dyslexia.
Deafness
Debate has raged for hundreds of years over the role of lip-reading ('oralism') compared with other communication methods (most recently, total communication) in the education of deaf people. The extent to which one or other approach is beneficial depends on a range of factors, including level of hearing loss of the deaf person, age of hearing loss, parental involvement and parental language(s). Then there is a question concerning the aims of the deaf person and their community and carers. Is the aim of education to enhance communication generally, to develop sign language as a first language, or to develop skills in the spoken language of the hearing community? Researchers now focus on which aspects of language and communication may be best delivered by what means and in which contexts, given the hearing status of the child and her family, and their educational plans.
Bimodal bilingualism (proficiency in both speech and sign language) is one dominant current approach in language education for the deaf child.
Deaf people are often better lip-readers than people with normal hearing. Some deaf people practice as professional lipreaders, for instance in forensic lipreading. In deaf people who have a cochlear implant, pre-implant lip-reading skill can predict post-implant (auditory or audiovisual) speech processing. In adults, the later the age of implantation, the better the visual speechreading abilities of the deaf person. For many deaf people, access to spoken communication can be helped when a spoken message is relayed via a trained, professional 'lip-speaker'.
In connection with lipreading and literacy development, children born deaf typically show
delayed development of literacy skills which can reflect difficulties in acquiring elements of the spoken language. In particular, reliable
phoneme-grapheme mapping may be more difficult for deaf children, who need to be skilled speech-readers in order to master this necessary step in literacy acquisition. Lip-reading skill is associated with literacy abilities in deaf adults and children and training in lipreading may help to develop literacy skills.
Cued speech uses lipreading with accompanying hand shapes that disambiguate the visemic (consonant) lipshape. Cued speech is said to be easier for hearing parents to learn than a sign language, and studies, primarily from Belgium, show that a deaf child exposed to cued speech in infancy can make more efficient progress in learning a spoken language than from lipreading alone. The use of cued speech in cochlear implantation for deafness is likely to be positive. A similar approach, involving the use of handshapes accompanying seen speech, is Visual Phonics, which is used by some educators to support the learning of written and spoken language.
Teaching and training
The aim of teaching and training in lipreading is to develop awareness of the nature of lipreading, and to practice ways of improving the ability to perceive speech 'by eye'. While the value of lipreading training in improving 'hearing by eye' was not always clear, especially for people with acquired hearing loss, there is evidence that systematic training in alerting students to attend to seen speech actions can be beneficial.
Lipreading classes, often called ''lipreading and managing hearing loss classes'', are mainly aimed at adults who have hearing loss. The highest proportion of adults with hearing loss have an
age-related, or
noise-related loss; with both of these forms of hearing loss, the high-frequency sounds are lost first. Since many of the consonants in speech are high-frequency sounds, speech becomes distorted. Hearing aids help but may not cure this. Lipreading classes have been shown to be of benefit in UK studies commissioned by the
Action on Hearing Loss charity (2012).
Trainers recognise that lipreading is an inexact art. Students are taught to watch the lips, tongue and jaw movements, to follow the stress and rhythm of language, to use their residual hearing, with or without hearing aids, to watch expression and body language, and to use their ability to reason and deduce. They are taught the lipreaders' alphabet: groups of sounds that look alike on the lips (visemes), such as p, b, m, or f, v. The aim is to get the gist, so as to have the confidence to join in conversation and avoid the damaging social isolation that often accompanies hearing loss. Lipreading classes are recommended for anyone who struggles to hear in noise, and they help with adjusting to hearing loss.
Tests
Most tests of lipreading were devised to measure individual differences in performing specific speech-processing tasks and to detect changes in performance following training. Lipreading tests have been used with relatively small groups in experimental settings, or as clinical indicators with individual patients and clients. That is, most lipreading tests to date have limited validity as markers of lipreading skill in the general population.
Lipreading and lip-speaking by machine
Automated lip-reading has been a topic of interest in computational engineering, as well as in science fiction movies. The computational engineer
Steve Omohundro, among others, pioneered its development. In
facial animation, the aim is to generate realistic facial actions, especially mouth movements, that simulate human speech actions. Computer algorithms to deform or manipulate images of faces can be driven by heard or written language. Systems may be based on detailed models derived from facial movements (motion capture); on anatomical modelling of actions of the jaw, mouth and tongue; or on mapping of known viseme-phoneme properties. Facial animation has been used in speechreading training (demonstrating how different sounds 'look'). These systems are a subset of speech synthesis modelling, which aims to deliver reliable 'text-to-(seen)-speech' outputs. A complementary aim, the reverse of making faces move in speech, is to develop computer algorithms that can deliver realistic interpretations of speech (i.e. a written transcript or audio record) from natural video data of a face in action: this is facial speech recognition. These models too can be sourced from a variety of data. Automatic visual speech recognition from video has been quite successful in distinguishing different languages (from a corpus of spoken language data). Demonstration models, using machine-learning algorithms, have had some success in lipreading speech elements, such as specific words, from video, and in identifying hard-to-lipread phonemes from visemically similar seen mouth actions. Machine-based speechreading is now making successful use of neural-net based algorithms, which use large databases of speakers and speech material (following the successful model for auditory automatic speech recognition).
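One way machine lipreading can resolve noisy or ambiguous viseme sequences is by constraining recognition to a lexicon, as described above. The sketch below illustrates the idea as a nearest-match search; the viseme transcriptions, the word list and the use of plain edit distance are all simplified illustrations rather than how any particular system works:

```python
# Hypothetical sketch: pick the lexicon word whose (assumed) viseme
# transcription is closest, by edit distance, to the viseme sequence
# recognised from the video frames.
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

LEXICON = {                 # word -> made-up viseme transcription
    "tooth": ["A", "U", "A"],
    "papa":  ["B", "O", "B", "O"],
    "mama":  ["B", "O", "B", "O"],  # homophene of 'papa': vision alone cannot split them
}

def decode(observed_visemes):
    """Return the lexicon word closest to the observed viseme sequence."""
    return min(LEXICON, key=lambda w: edit_distance(LEXICON[w], observed_visemes))

print(decode(["A", "U", "O"]))  # 'tooth' (one frame misrecognised, still closest)
```

Note that homophenes such as 'papa' and 'mama' tie under any visual distance measure, which is why practical systems add language models or audio to break such ties.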
Uses for machine lipreading could include automated lipreading of video-only records, automated lipreading of speakers with damaged vocal tracts, and speech processing in face-to-face video (i.e. from videophone data). Automated lipreading may help in processing noisy or unfamiliar speech. It may also contribute to biometric person identification, replacing password-based identification.
The brain
Following the discovery that auditory brain regions, including Heschl's gyrus, were activated by seen speech, the neural circuitry for speechreading was shown to include supra-modal processing regions, especially the superior temporal sulcus (all parts), as well as posterior inferior occipital-temporal regions, including regions specialised for the processing of faces and biological motion.
In some but not all studies, activation of Broca's area is reported for speechreading, suggesting that articulatory mechanisms can be activated in speechreading. Studies of the time course of audiovisual speech processing showed that sight of speech can prime auditory processing regions in advance of the acoustic signal. Better lipreading skill is associated with greater activation in (left) superior temporal sulcus and adjacent inferior temporal (visual) regions in hearing people. In deaf people, the circuitry devoted to speechreading appears to be very similar to that in hearing people, with similar associations of (left) superior temporal activation and lipreading skill.
Bibliography
* D. Stork and M. Henneke (Eds) (1996) Speechreading by Humans and Machines: Models, Systems and Applications. NATO ASI Series F: Computer and Systems Sciences, Vol. 150. Springer, Berlin, Germany
* G. Bailly, P. Perrier and E. Vatikiotis-Bateson (Eds) (2012) Audiovisual Speech Processing. Cambridge University Press, Cambridge, UK
* B. Dodd and R. Campbell (Eds) (1987) Hearing by Eye. Lawrence Erlbaum Associates, Hillsdale, NJ, USA
* R. Campbell, B. Dodd and D. Burnham (Eds) (1997) Hearing by Eye II. Psychology Press, Hove, UK
* D. W. Massaro (1987, reprinted 2014) Speech Perception by Ear and by Eye. Lawrence Erlbaum Associates, Hillsdale, NJ
Further reading
* Lipreading Classes in Scotland: the way forward (2015 report)
* AVISA: International Speech Communication Association special interest group focussed on lip-reading and audiovisual speech
* Speechreading for information gathering: a survey of scientific sources
* Successful Online Speechreading Training
See also
*
Automated Lip Reading (ALR)