
Visual words, as used in image retrieval systems, are small parts of an image that carry information about its features (such as color, shape, or texture) or about changes occurring in the pixels, as captured by filtering or by low-level feature descriptors (such as SIFT or SURF).
History
The approaches of text retrieval systems (or information retrieval, IR, systems), which were developed over 40 years, are based on keywords or terms. The advantage of these approaches is that they are effective and fast. Text search engines are able to quickly find documents among hundreds or millions of documents (by using a vector space model). While text retrieval systems have had huge success, standard image retrieval systems (such as simple search by color or shape) have a large number of limitations. Consequently, researchers have tried to apply text retrieval techniques to image retrieval. This can be accomplished by treating images as textual documents, which is the visual words approach.
Analogy text-image
Consider that the pixels of an image, which are the smallest parts of a digital image and cannot be divided into smaller ones, are like the letters of an alphabetic language. A set of pixels in an image (a patch, or array of pixels) is then a word. Each word can be passed through a morphological system to extract the term related to that word. Several words can share the same meaning, in which case they all refer to the same term (as in any language); that is, they carry the same information. From this viewpoint, researchers can apply text retrieval techniques to image retrieval systems.
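The analogy can be illustrated with a minimal sketch that cuts a small grayscale image into patches. The function name `extract_patches`, the choice of non-overlapping square patches, and the toy 4x4 image are assumptions made for illustration; practical systems often sample patches around detected keypoints instead.

```python
def extract_patches(image, size):
    """Split a 2-D grid of pixel values into non-overlapping size x size patches.

    Each patch is returned as a flat list of pixel values, read row by row.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            patch = [image[r + dr][c + dc]
                     for dr in range(size)
                     for dc in range(size)]
            patches.append(patch)
    return patches


# A toy 4x4 grayscale image: each pixel plays the role of a letter.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]

# Each 2x2 patch plays the role of a word.
patches = extract_patches(image, 2)
```

Here the image yields four patches, one per quadrant, in the same way that a string of letters is split into words.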
Visual definitions

The same principle can be applied to images to determine what the words and terms of an image are. The idea is to describe an image as a collection of "visual words".
Definition 1: Visual word
A small patch of the image that can carry any kind of information in any feature space, such as color changes or texture changes.
In general, visual words (VWs) exist in a feature space of continuous values, which implies a huge number of words and therefore a huge language. Since image retrieval systems need to use text retrieval techniques that depend on natural languages, which have a limited number of terms and words, the number of visual words must be reduced.
Several solutions to this problem exist. One is to divide the feature space into ranges, each having common characteristics, so that all values in a range are considered the same word. However, this solution raises many issues, such as the choice of division strategy and the size of the ranges in the feature space. Another solution proposed by researchers is to use a clustering mechanism to classify and merge words carrying common information into a finite number of terms.
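The range-division idea can be sketched as fixed-width binning. This is only an illustration under stated assumptions: the function name `quantize`, the two-dimensional feature vectors, and the bin width of 0.25 are all hypothetical.

```python
def quantize(feature, bin_width):
    """Map a continuous feature vector to a tuple of integer range indices."""
    return tuple(int(value // bin_width) for value in feature)


# Hypothetical 2-D feature vectors with values in [0, 1).
features = [(0.12, 0.90), (0.18, 0.95), (0.71, 0.30)]
words = [quantize(f, 0.25) for f in features]

# Two nearly identical values on either side of a range boundary
# end up as different words, which shows why the range size matters.
left = quantize((0.24,), 0.25)
right = quantize((0.26,), 0.25)
```

The first two vectors fall into the same ranges and so count as the same word, while `left` and `right` differ despite being close, which is one of the issues with this strategy mentioned above.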
Definition 2: Visual term
The result of clustering in the feature space (the centers of the clusters). More than one patch can carry nearly the same information in the feature space, so such patches are considered to belong to the same term.
Just as a term in a text (such as the infinitive of a verb, a noun, or an article) refers to many common words having the same characteristics, a visual term (a clustering result) refers to all the visual words that share the same information in the feature space.
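The text does not name a particular clustering algorithm. As one possible illustration, the sketch below uses plain k-means (Lloyd's algorithm), a common choice for this purpose; the function names, the toy descriptors, and the simplification of seeding the centers with the first k points are assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; the first k points seed the centers (a simplification)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(axis) / len(group) for axis in zip(*group)]
    return centers


# Hypothetical 2-D descriptors of image patches (the visual words).
descriptors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
               (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]

# The cluster centers play the role of visual terms.
vocabulary = kmeans(descriptors, k=2)
```

Every descriptor close to the first center is then treated as an occurrence of the first visual term, and likewise for the second.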
Lastly, if all images refer to the same set of visual terms, then all images can speak the same language (or visual language).
Definition 3: Visual language
A set of visual words and visual terms. The visual terms considered alone form the “visual vocabulary”, which serves as the reference on which the retrieval system depends for retrieving images.
All images will be represented with this visual language as a collection of visual words, or bag of visual words.
Definition 4: Bag of visual words
A collection of visual words which together give information on the meaning of part or all of the image.
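A bag of visual words is often stored as a histogram that counts how many patches of an image map to each visual term. The sketch below assumes a tiny, hand-written vocabulary and nearest-center assignment; all names and values are hypothetical.

```python
def nearest_term(descriptor, vocabulary):
    """Index of the visual term (cluster center) closest to the descriptor."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def bag_of_visual_words(descriptors, vocabulary):
    """Count how many descriptors are assigned to each visual term."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_term(descriptor, vocabulary)] += 1
    return counts


# Hypothetical vocabulary of three visual terms in a 2-D feature space.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

# Hypothetical descriptors extracted from the patches of one image.
image_descriptors = [(0.5, 0.2), (9.0, 9.5), (0.1, 0.3), (1.0, 9.0)]

histogram = bag_of_visual_words(image_descriptors, vocabulary)
```

The histogram discards the positions of the patches and keeps only how often each term occurs, as a bag of words does for a text document.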
Based on this kind of image representation, it is possible to use text retrieval techniques to design an image retrieval system. However, since all text retrieval systems depend on terms, the user's query image must first be converted into a set of visual terms. The system then compares these visual terms with all the visual terms in the database.
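The comparison step can be sketched with cosine similarity between term histograms, a measure commonly used with the vector space model in text retrieval. The text does not specify which measure is used, so this choice, the function names, and the toy database are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, database):
    """Return image names ordered from most to least similar to the query."""
    return sorted(database,
                  key=lambda name: cosine_similarity(query, database[name]),
                  reverse=True)


# Hypothetical database: image name -> histogram over three visual terms.
database = {
    "beach": [5, 0, 1],
    "forest": [0, 6, 1],
    "city": [1, 1, 5],
}

# Histogram of visual terms computed from the user's query image.
query = [4, 1, 1]

ranking = rank(query, database)
```

In this toy example the query shares most of its visual terms with the "beach" image, so that image is ranked first.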
See also
* Content-based Image and Video Retrieval
* Face Recognition
* Text Information Retrieval
* Bag-of-words model in computer vision
External links
* A tribute to visual words and how they revolutionized computer vision
* Bag-of-Visual-Words lecture from Carnegie Mellon University
* Bag of visual words model: recognizing object categories
* Visual Word based Location Recognition in 3D models using Distance Augmented Weighting