In computer vision, the bag-of-words model (BoW model), sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a ''bag of visual words'' is a vector of occurrence counts of a vocabulary of local image features.
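As a rough illustration of the text-side definition, the sketch below builds a sparse histogram of word counts for one short document. The function name `bag_of_words`, the whitespace tokenizer, and the sample sentence are assumptions made for this example, not part of any particular library.

```python
from collections import Counter

def bag_of_words(text):
    """Return a sparse histogram mapping each word to its occurrence count."""
    # Lowercase and split on whitespace; a real tokenizer would also
    # handle punctuation and other details.
    return Counter(text.lower().split())

doc = "the cat sat on the mat"  # hypothetical sample document
hist = bag_of_words(doc)

print(hist["the"])  # 2
print(hist["dog"])  # 0: words absent from the document are not stored
```

Only words that actually occur are stored, which is what makes the representation sparse; word order is discarded.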
Image representation based on the BoW model
To represent an image using the BoW model, the image is treated as a document, and the "words" in an image must be defined as well. This usually involves three steps: feature detection, feature description, and codebook generation.
The BoW model can be defined as a "histogram representation based on independent features". Content-based image indexing and retrieval (CBIR) appears to be the early adopter of this image representation technique.
Feature representation
After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent these patches as numerical vectors, which are called feature descriptors. A good descriptor should be able to handle variations in intensity, rotation, scale, and affine transformation to some extent. One of the most famous descriptors is the scale-invariant feature transform (SIFT). SIFT converts each patch to a 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of the vectors is of no importance.
Codebook generation
The final step of the BoW model is to convert vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogous to a word dictionary). A codeword can be considered a representative of several similar patches. One simple method is to perform k-means clustering over all the vectors. Codewords are then defined as the centers of the learned clusters, and the number of clusters is the codebook size (analogous to the size of the word dictionary).
Thus, each patch in an image is mapped to a certain codeword through the clustering process, and the image can be represented by the histogram of the codewords.
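The codebook step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the descriptors are toy 2-dimensional points instead of 128-dimensional SIFT vectors, the helper names (`build_codebook`, `bow_histogram`, `nearest`) are hypothetical, and the cluster centers are initialized from evenly spaced descriptors to keep the run deterministic, where practical systems typically use random or k-means++ initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centers):
    """Index of the center closest to the given vector."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(vector, centers[i]))

def build_codebook(descriptors, k, iterations=20):
    """Lloyd's k-means; returns k cluster centers (the codewords)."""
    # Assumption: deterministic init from evenly spaced descriptors.
    step = len(descriptors) // k
    centers = [list(descriptors[i * step]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bow_histogram(descriptors, codebook):
    """Count how many descriptors map to each codeword."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist

# Toy training descriptors forming two well-separated groups.
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = build_codebook(training, k=2)

# Descriptors of one hypothetical image: three near the first group,
# one near the second.
image_descriptors = [(0.2, 0.1), (0.9, 0.3), (0.1, 0.8), (10.5, 10.2)]
hist = bow_histogram(image_descriptors, codebook)
print(hist)  # [3, 1]
```

The histogram length equals the codebook size, and the counts sum to the number of patches in the image, regardless of the order in which the descriptors are listed.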
Learning and recognition based on the BoW model
Computer vision researchers have developed several learning methods that leverage the BoW model for image-related tasks, such as object categorization. These methods can roughly be divided into two categories: unsupervised and supervised models. For multiple-label categorization problems, the confusion matrix can be used as an evaluation metric.
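To make the evaluation step concrete, the following sketch computes a confusion matrix for a small categorization result. The class names and label lists are made up for illustration, and `confusion_matrix` here is a hypothetical helper written for this example, not a library call.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical ground truth and classifier output for six images.
classes = ["car", "face", "tree"]
y_true = ["car", "car", "face", "tree", "tree", "face"]
y_pred = ["car", "face", "face", "tree", "car", "face"]

cm = confusion_matrix(y_true, y_pred, classes)
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]

# Overall accuracy is the diagonal sum divided by the number of samples.
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(y_true)
```

Off-diagonal entries show which categories are confused with each other; here one car is labelled as a face and one tree as a car.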
Unsupervised models
Here is some notation for this section. Suppose the size of the codebook is $V$.
* $w$: each patch $w$ is a $V$-dimensional vector that has a single component equal to one and all other components equal to zero (in the k-means clustering setting, the component equal to one indicates the cluster that $w$ belongs to). The $v$th codeword in the codebook can be represented as $w^v = 1$ and $w^u = 0$ for $u \neq v$.
*
: each image is represented by