Audio visual speech recognition (AVSR) is a technique that uses
image processing
An image is a visual representation of something. It can be two-dimensional, three-dimensional, or somehow otherwise feed into the visual system to convey information. An image can be an artifact, such as a photograph or other two-dimensiona ...
capabilities in
lip reading to aid
speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...
systems in recognizing undeterministic
phones
A telephone is a telecommunications device that permits two or more users to conduct a conversation when they are too far apart to be easily heard directly. A telephone converts sound, typically and most efficiently the human voice, into ele ...
or giving preponderance among near probability decisions.
Each system of
lip reading and
speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...
works separately, then their results are mixed at the stage of
feature fusion
Feature may refer to:
Computing
* Feature (CAD), could be a hole, pocket, or notch
* Feature (computer vision), could be an edge, corner or blob
* Feature (software design) is an intentional distinguishing characteristic of a software item ...
. As the name suggests, it has two parts. First one is the audio part and second one is the visual part. In audio part we use features like log mel spectrogram, mfcc etc. from the raw audio samples and we build a model to get feature vector out of it . For visual part generally we use some variant of convolutional neural network to compress the image to a feature vector after that we concatenate these two vectors (audio and visual ) and try to predict the target object.
External links
IBM Research - Audio Visual Speech TechnologiesLooking to listen at cocktail party
Computational linguistics
Speech recognition
Applications of computer vision
Multimodal interaction
{{comp-ling-stub