Audio-visual speech recognition (AVSR) is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterminate phones or in choosing among decisions of nearly equal probability.
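To illustrate how visual evidence can settle a near-tie between phones, the following minimal sketch combines per-phone audio and visual probabilities with a weighted log-linear rule. This is one common way to combine scores, not necessarily how any particular AVSR system does it. The phone labels, the probability values, the function name `combine_scores`, and the weight `visual_weight` are all assumptions made up for illustration.

```python
import math


def combine_scores(audio_probs, visual_probs, visual_weight=0.3):
    """Weighted log-linear combination of per-phone probabilities.

    visual_weight is a hypothetical tuning parameter in [0, 1]; the
    result is renormalized so the combined values sum to 1.
    """
    scores = {}
    for phone in audio_probs:
        scores[phone] = ((1.0 - visual_weight) * math.log(audio_probs[phone])
                         + visual_weight * math.log(visual_probs[phone]))
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {p: math.exp(s - max_score) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: v / total for p, v in exps.items()}


# Made-up example: the audio model can barely separate "b" from "d",
# while the visual model sees closed lips and strongly favors "b".
audio_probs = {"b": 0.48, "d": 0.47, "g": 0.05}
visual_probs = {"b": 0.80, "d": 0.10, "g": 0.10}

combined = combine_scores(audio_probs, visual_probs)
best_phone = max(combined, key=combined.get)
print(best_phone, {p: round(v, 3) for p, v in combined.items()})
```

With these illustrative numbers the audio scores alone leave "b" and "d" almost tied, and the visual scores tip the decision toward "b".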
The lip reading and speech recognition systems each work separately, and their results are then combined at the feature fusion stage. As the name suggests, AVSR has two parts: an audio part and a visual part. In the audio part, features such as the log-mel spectrogram or mel-frequency cepstral coefficients (MFCCs) are extracted from the raw audio samples, and a model is built to produce a feature vector from them. In the visual part, some variant of a convolutional neural network is generally used to compress the image into a feature vector. The two vectors (audio and visual) are then concatenated and used to predict the target.
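The concatenation step can be sketched as follows. This is a minimal, standard-library-only illustration, not a real AVSR pipeline: the per-frame log energy stands in for log-mel or MFCC features, average pooling stands in for a convolutional encoder, and the classifier is an untrained linear layer with random weights. All function names, sizes, and input values are assumptions chosen for the example.

```python
import math
import random


def audio_features(samples, frame_len=4):
    """Stand-in for log-mel/MFCC extraction: log energy of each frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-8))
    return feats


def visual_features(image, pool=2):
    """Stand-in for a CNN encoder: average pooling over pool x pool blocks."""
    feats = []
    for r in range(0, len(image) - pool + 1, pool):
        for c in range(0, len(image[0]) - pool + 1, pool):
            block = [image[r + i][c + j] for i in range(pool) for j in range(pool)]
            feats.append(sum(block) / len(block))
    return feats


def fuse(audio_vec, visual_vec):
    """Feature fusion by concatenating the two vectors."""
    return audio_vec + visual_vec


def predict(fused, weights, biases):
    """Linear layer followed by softmax over the target classes."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
samples = [rng.uniform(-1, 1) for _ in range(16)]             # fake audio: 4 frames
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # fake 4x4 mouth crop

audio_vec = audio_features(samples)
visual_vec = visual_features(image)
fused = fuse(audio_vec, visual_vec)

num_classes = 3  # hypothetical number of target classes
weights = [[rng.uniform(-0.5, 0.5) for _ in fused] for _ in range(num_classes)]
biases = [0.0] * num_classes
probs = predict(fused, weights, biases)
print(len(audio_vec), len(visual_vec), len(fused), probs)
```

In a trained system the weights would be learned jointly with the two encoders. Here they are random, so only the shapes and the data flow are meaningful.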
External links
IBM Research - Audio Visual Speech Technologies
Looking to listen at cocktail party