In computer vision, the bag-of-words model (BoW model) sometimes called bag-of-visual-words model can be applied to

image classification Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the human ...

retrieval Retrieval could refer to: Computer science * RETRIEVE, Tymshare database that inspired dBASE and others * Data retrieval * Document retrieval * Image retrieval * Information retrieval * Knowledge retrieval * Medical retrieval * Music informati ...

, by treating

image feature In computer vision and image processing, a feature is a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may be specific structures in the image such as poi ...

s as words. In

document classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...

, a

bag of words The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding g ...

is a

sparse vector In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. There is no strict definition regarding the proportion of zero-value elements for a matrix to qualify as sparse b ...

of occurrence counts of words; that is, a sparse

histogram A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to " bin" (or " bucket") the range of values—that is, divide the ent ...

over the vocabulary. In

computer vision Computer vision is an Interdisciplinarity, interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate t ...

, a ''bag of visual words'' is a vector of occurrence counts of a vocabulary of local image features.

Image representation based on the BoW model

To represent an image using the BoW model, an image can be treated as a document. Similarly, "words" in images need to be defined too. To achieve this, it usually includes following three steps: feature detection, feature description, and codebook generation. A definition of the BoW model can be the "histogram representation based on independent features". Content based image indexing and retrieval (CBIR) appears to be the early adopter of this image representation technique.

Feature representation

After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent the patches as numerical vectors. These vectors are called feature descriptors. A good descriptor should have the ability to handle intensity, rotation, scale and affine variations to some extent. One of the most famous descriptors is

Scale-invariant feature transform The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local '' features'' in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, ...

(SIFT). SIFT converts each patch to 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of different vectors is of no importance.

Codebook generation

The final step for the BoW model is to convert vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogy to a word dictionary). A codeword can be considered as a representative of several similar patches. One simple method is performing

k-means clustering ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers o ...

over all the vectors. Codewords are then defined as the centers of the learned clusters. The number of the clusters is the codebook size (analogous to the size of the word dictionary). Thus, each patch in an image is mapped to a certain codeword through the clustering process and the image can be represented by the

of the codewords.

Learning and recognition based on the BoW model

Computer vision researchers have developed several learning methods to leverage the BoW model for image related tasks, such as object categorization. These methods can roughly be divided into two categories, unsupervised and supervised models. For multiple label categorization problem, the

confusion matrix In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a ...

can be used as an evaluation metric.

Unsupervised models

Here are some notations for this section. Suppose the size of codebook is

V

. *

w

: each patch

w

is a V-dimensional vector that has a single component equal to one and all other components equal to zero (For k-means clustering setting, the single component equal one indicates the cluster that

w

belongs to). The

v

th codeword in the codebook can be represented as

w^v=1

and

w^u = 0

for

u\neq v

. *

\mathbf

: each image is represented by

\mathbf=_1, w_2, \cdots, w_N /math>, all the patches in an image
* d_j : the j th image in an image collection
* c : category of the image
* z : theme or topic of the patch
* \pi : mixture proportion

Since the BoW model is an analogy to the BoW model in NLP, generative models developed in text domains can also be adapted in computer vision. Simple Naïve Bayes model and hierarchical Bayesian models are discussed.

Naïve Bayes

The simplest one is

Naïve Bayes In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Bay ...

classifier. Using the language of

graphical models ''Graphical Models'' is an academic journal in computer graphics and geometry processing publisher by Elsevier. , its editor-in-chief is Jorg Peters of the University of Florida. History This journal has gone through multiple names. Founded in 1 ...

, the Naïve Bayes classifier is described by the equation below. The basic idea (or assumption) of this model is that each category has its own distribution over the codebooks, and that the distributions of each category are observably different. Take a face category and a car category for an example. The face category may emphasize the codewords which represent "nose", "eye" and "mouth", while the car category may emphasize the codewords which represent "wheel" and "window". Given a collection of training examples, the classifier learns different distributions for different categories. The categorization decision is made by :

c^*=\arg \max_c p(c, \mathbf) = \arg \max_c p(c)p(\mathbf, c)=\arg \max_c p(c)\prod_^Np(w_n, c)

Since the Naïve Bayes classifier is simple yet effective, it is usually used as a baseline method for comparison.

Hierarchical Bayesian models

The basic assumption of Naïve Bayes model does not hold sometimes. For example, a natural scene image may contain several different themes.

Probabilistic latent semantic analysis Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one ca ...

(pLSA) and

latent Dirichlet allocation In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. The LDA is an exa ...

(LDA) are two popular topic models from text domains to tackle the similar multiple "theme" problem. Take LDA for an example. To model natural scene images using LDA, an analogy is made with document analysis: * the image category is mapped to the document category; * the mixture proportion of themes maps the mixture proportion of topics; * the theme index is mapped to topic index; * the codeword is mapped to the word. This method shows very promising results in natural scene categorization o
13 Natural Scene Categories

Supervised models

Since images are represented based on the BoW model, any discriminative model suitable for text document categorization can be tried, such as

support vector machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...

(SVM) and

AdaBoost AdaBoost, short for ''Adaptive Boosting'', is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types ...

Kernel trick In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example c ...

is also applicable when kernel based classifier is used, such as SVM. Pyramid match kernel is newly developed one based on the BoW model. The local feature approach of using BoW model representation learnt by machine learning classifiers with different kernels (e.g., EMD-kernel and

X^2

kernel) has been vastly tested in the area of texture and object recognition. Very promising results on a number of datasets have been reported. This approach has achieved very impressive results i
the PASCAL Visual Object Classes Challenge

Pyramid match kernel

Pyramid match kernel is a fast algorithm (linear complexity instead of classic one in quadratic complexity) kernel function (satisfying

Mercer's condition In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most n ...

) which maps the BoW features, or set of features in high dimension, to multi-dimensional multi-resolution histograms. An advantage of these multi-resolution histograms is their ability to capture co-occurring features. The pyramid match kernel builds multi-resolution histograms by binning data points into discrete regions of increasing size. Thus, points that do not match at high resolutions have the chance to match at low resolutions. The pyramid match kernel performs an approximate similarity match, without explicit search or computation of distance. Instead, it intersects the histograms to approximate the optimal match. Accordingly, the computation time is only linear in the number of features. Compared with other kernel approaches, the pyramid match kernel is much faster, yet provides equivalent accuracy. The pyramid match kernel was applied t
ETH-80 database
an

with promising results.

Limitations and recent developments

One of the notorious disadvantages of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation. Researchers have proposed several methods to incorporate the spatial information. For feature level improvements,

correlogram In the analysis of data, a correlogram is a chart of correlation statistics. For example, in time series analysis, a plot of the sample autocorrelations r_h\, versus h\, (the time lags) is an autocorrelogram. If cross-correlation is plotte ...

features can capture spatial co-occurrences of features. For generative models, relative positions of codewords are also taken into account. The hierarchical shape and appearance model for human action introduces a new part layer (

Constellation model The constellation model is a probabilistic, generative model for category-level object recognition in computer vision. Like other part-based models, the constellation model attempts to represent an object class by a set of ''N'' parts under mutual g ...

) between the mixture proportion and the BoW features, which captures the spatial relationships among parts in the layer. For discriminative models, spatial pyramid match performs pyramid matching by partitioning the image into increasingly fine sub-regions and compute histograms of local features inside each sub-region. Recently, an augmentation of local image descriptors (i.e.

SIFT A sieve, fine mesh strainer, or sift, is a device for separating wanted elements from unwanted material or for controlling the particle size distribution of a sample, using a screen such as a woven mesh or net or perforated sheet materia ...

) by their spatial coordinates normalised by the image width and height have proved to be a robust and simple Spatial Coordinate Coding approach which introduces spatial information to the BoW model. The BoW model has not been extensively tested yet for view point invariance and scale invariance, and the performance is unclear. Also the BoW model for object segmentation and localization is not well understood. A systematic comparison of classification pipelines found that the encoding of first and second order statistics (Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV)) considerably increased classification accuracy compared to BoW, while also decreasing the codebook size, thus lowering the computational effort for codebook generation. Moreover, a recent detailed comparison of coding and pooling methods for BoW has showed that second order statistics combined with

Sparse Coding Neural coding (or Neural representation) is a neuroscience field concerned with characterising the hypothetical relationship between the stimulus and the individual or ensemble neuronal responses and the relationship among the electrical activit ...

and an appropriate pooling such as Power Normalisation can further outperform Fisher Vectors and even approach results of simple models of

Convolutional Neural Network In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Netwo ...

on some object recognition datasets such as Oxford Flower Dataset 102.

References

External links

Bag of Visual Words in a Nutshell
a short tutorial by Bethea Davida.

by L. Fei-Fei, R. Fergus, and A. Torralba.
Caltech Large Scale Image Search Toolbox
a Matlab/C++ toolbox implementing Inverted File search for Bag of Words model. It also contains implementations for fast approximate nearest neighbor search using randomized

k-d tree In computer science, a ''k''-d tree (short for ''k-dimensional tree'') is a space-partitioning data structure for organizing points in a ''k''-dimensional space. ''k''-d trees are a useful data structure for several applications, such as sea ...

locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...

, and

hierarchical k-means A hierarchy (from Greek: , from , 'president of sacred rites') is an arrangement of items (objects, names, values, categories, etc.) that are represented as being "above", "below", or "at the same level as" one another. Hierarchy is an important ...

.
DBoW2 library
a library that implements a fast bag of words in C++ with support for

OpenCV OpenCV (''Open Source Computer Vision Library'') is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage then Itseez (which was later acquired by I ...

. {{DEFAULTSORT:Bag Of Words Model In Computer Vision Object recognition and categorization it:Modello della borsa di parole