Audio visual speech recognition (AVSR) is a technique that uses

image processing An image or picture is a visual representation. An image can be two-dimensional, such as a drawing, painting, or photograph, or three-dimensional, such as a carving or sculpture. Images may be displayed through other media, including a pr ...

capabilities in

lip reading Lip reading, also known as speechreading, is a technique of understanding a limited range of speech by visually interpreting the movements of the lips, face and tongue without sound. Estimates of the range of lip reading vary, with some figures as ...

to aid

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

systems in recognizing undeterministic phones or giving preponderance among near probability decisions. Each system of

and

works separately, then their results are mixed at the stage of feature fusion. As the name suggests, it has two parts. First one is the audio part and second one is the visual part. In audio part we use features like log mel spectrogram, mfcc etc. from the raw audio samples and we build a model to get feature vector out of it . For visual part generally we use some variant of convolutional neural network to compress the image to a feature vector after that we concatenate these two vectors (audio and visual ) and try to predict the target object.

External links

IBM Research - Audio Visual Speech TechnologiesLooking to listen at cocktail party
Computational linguistics Speech recognition Applications of computer vision Multimodal interaction {{comp-ling-stub