Bag-of-words model in computer vision

Last updated
Bag of words.JPG

In computer vision, the bag-of-words model (BoW model) sometimes called bag-of-visual-words model [1] [2] can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Contents

Image representation based on the BoW model

To represent an image using the BoW model, an image can be treated as a document. Similarly, "words" in images need to be defined too. To achieve this, it usually includes following three steps: feature detection, feature description, and codebook generation. [1] [2] [3] A definition of the BoW model can be the "histogram representation based on independent features". [4] Content based image indexing and retrieval (CBIR) appears to be the early adopter of this image representation technique. [5]

Feature representation

After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent the patches as numerical vectors. These vectors are called feature descriptors. A good descriptor should have the ability to handle intensity, rotation, scale and affine variations to some extent. One of the most famous descriptors is the scale-invariant feature transform (SIFT). [6] SIFT converts each patch to 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of different vectors is of no importance.

Codebook generation

The final step for the BoW model is to convert vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogy to a word dictionary). A codeword can be considered as a representative of several similar patches. One simple method is performing k-means clustering over all the vectors. [7] Codewords are then defined as the centers of the learned clusters. The number of the clusters is the codebook size (analogous to the size of the word dictionary).

Thus, each patch in an image is mapped to a certain codeword through the clustering process and the image can be represented by the histogram of the codewords.

Learning and recognition based on the BoW model

Computer vision researchers have developed several learning methods to leverage the BoW model for image related tasks, such as object categorization. These methods can roughly be divided into two categories, unsupervised and supervised models. For multiple label categorization problem, the confusion matrix can be used as an evaluation metric.

Unsupervised models

Here are some notations for this section. Suppose the size of codebook is .

Since the BoW model is an analogy to the BoW model in NLP, generative models developed in text domains can also be adapted in computer vision. Simple Naive Bayes model and hierarchical Bayesian models are discussed.

Naive Bayes

The simplest one is Naive Bayes classifier. [2] Using the language of graphical models, the Naive Bayes classifier is described by the equation below. The basic idea (or assumption) of this model is that each category has its own distribution over the codebooks, and that the distributions of each category are observably different. Take a face category and a car category for an example. The face category may emphasize the codewords which represent "nose", "eye" and "mouth", while the car category may emphasize the codewords which represent "wheel" and "window". Given a collection of training examples, the classifier learns different distributions for different categories. The categorization decision is made by

Since the Naive Bayes classifier is simple yet effective, it is usually used as a baseline method for comparison.

Hierarchical Bayesian models

The basic assumption of Naive Bayes model does not hold sometimes. For example, a natural scene image may contain several different themes. Probabilistic latent semantic analysis (pLSA) [8] [9] and latent Dirichlet allocation (LDA) [10] are two popular topic models from text domains to tackle the similar multiple "theme" problem. Take LDA for an example. To model natural scene images using LDA, an analogy is made with document analysis:

  • the image category is mapped to the document category;
  • the mixture proportion of themes maps the mixture proportion of topics;
  • the theme index is mapped to topic index;
  • the codeword is mapped to the word.

This method shows very promising results in natural scene categorization on 13 Natural Scene Categories. [3]

Supervised models

Since images are represented based on the BoW model, any discriminative model suitable for text document categorization can be tried, such as support vector machine (SVM) [2] and AdaBoost. [11] Kernel trick is also applicable when kernel based classifier is used, such as SVM. Pyramid match kernel is newly developed one based on the BoW model. The local feature approach of using BoW model representation learnt by machine learning classifiers with different kernels (e.g., EMD-kernel and kernel) has been vastly tested in the area of texture and object recognition. [12] Very promising results on a number of datasets have been reported. This approach [12] has achieved very impressive results in the PASCAL Visual Object Classes Challenge.

Pyramid match kernel

Pyramid match kernel [13] is a fast algorithm (linear complexity instead of classic one in quadratic complexity) kernel function (satisfying Mercer's condition) which maps the BoW features, or set of features in high dimension, to multi-dimensional multi-resolution histograms. An advantage of these multi-resolution histograms is their ability to capture co-occurring features. The pyramid match kernel builds multi-resolution histograms by binning data points into discrete regions of increasing size. Thus, points that do not match at high resolutions have the chance to match at low resolutions. The pyramid match kernel performs an approximate similarity match, without explicit search or computation of distance. Instead, it intersects the histograms to approximate the optimal match. Accordingly, the computation time is only linear in the number of features. Compared with other kernel approaches, the pyramid match kernel is much faster, yet provides equivalent accuracy. The pyramid match kernel was applied to ETH-80 database and Caltech 101 database with promising results. [13] [14]

Limitations and recent developments

One of the notorious disadvantages of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation. Researchers have proposed several methods to incorporate the spatial information. For feature level improvements, correlogram features can capture spatial co-occurrences of features. [15] For generative models, relative positions [16] [17] of codewords are also taken into account. The hierarchical shape and appearance model for human action [18] introduces a new part layer (Constellation model) between the mixture proportion and the BoW features, which captures the spatial relationships among parts in the layer. For discriminative models, spatial pyramid match [19] performs pyramid matching by partitioning the image into increasingly fine sub-regions and compute histograms of local features inside each sub-region. Recently, an augmentation of local image descriptors (i.e. SIFT) by their spatial coordinates normalised by the image width and height have proved to be a robust and simple Spatial Coordinate Coding [20] [21] approach which introduces spatial information to the BoW model.

The BoW model has not been extensively tested yet for view point invariance and scale invariance, and the performance is unclear. Also the BoW model for object segmentation and localization is not well understood. [4]

A systematic comparison of classification pipelines found that the encoding of first and second order statistics (Vector of Locally Aggregated Descriptors (VLAD) [22] and Fisher Vector (FV)) considerably increased classification accuracy compared to BoW, while also decreasing the codebook size, thus lowering the computational effort for codebook generation. [23] Moreover, a recent detailed comparison of coding and pooling methods [21] for BoW has showed that second order statistics combined with Sparse Coding and an appropriate pooling such as Power Normalisation can further outperform Fisher Vectors and even approach results of simple models of Convolutional Neural Network on some object recognition datasets such as Oxford Flower Dataset 102.

See also

Related Research Articles

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, image processing and signal processing communities with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. The parameter in this family is referred to as the scale parameter, with the interpretation that image structures of spatial size smaller than about have largely been smoothed away in the scale-space level at scale .

<span class="mw-page-title-main">Automatic image annotation</span>

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

Structure from motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences that may be coupled with local motion signals. It is studied in the fields of computer vision and visual perception.

Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing.

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.

Part-based models refers to a broad class of detection algorithms used on images, in which various parts of the image are used separately in order to determine if and where an object of interest exists. Amongst these methods a very popular one is the constellation model which refers to those schemes which seek to detect a small number of features and their relative positions to then determine whether or not the object of interest is present.

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc 'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

<span class="mw-page-title-main">Pedestrian detection</span> Computer technology

Pedestrian detection is an essential and significant task in any intelligent video surveillance system, as it provides the fundamental information for semantic understanding of the video footages. It has an obvious extension to automotive applications due to the potential for improving safety systems. Many car manufacturers offer this as an ADAS option in 2017.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de-facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced -- in some cases -- by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques labels the pixels to be a part of pixels with certain characteristics at a particular time. Here, the pixels are segmented depending on its relative movement over a period of time i.e. the time of the video sequence.

The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. Since 2010, the ImageNet project runs an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.

In the domain of physics and probability, the filters, random fields, and maximum entropy (FRAME) model is a Markov random field model of stationary spatial processes, in which the energy function is the sum of translation-invariant potential functions that are one-dimensional non-linear transformations of linear filter responses. The FRAME model was originally developed by Song-Chun Zhu, Ying Nian Wu, and David Mumford for modeling stochastic texture patterns, such as grasses, tree leaves, brick walls, water waves, etc. This model is the maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where for each filter tuned to a specific scale and orientation, the marginal histogram is pooled over all the pixels in the image domain. The FRAME model is also proved to be equivalent to the micro-canonical ensemble, which was named the Julesz ensemble. Gibbs sampler is adopted to synthesize texture images by drawing samples from the FRAME model.

<span class="mw-page-title-main">Video super-resolution</span> Generating high-resolution video frames from given low-resolution ones

Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.

<span class="mw-page-title-main">Vision transformer</span> Machine learning model for vision processing

A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

<span class="mw-page-title-main">3D Face Morphable Model</span> Generative model for 3D textured faces

In computer vision and computer graphics, the 3D Face Morphable Model (3DFMM) is a generative technique for modeling textured 3D faces. The generation of new faces is based on a pre-existing database of example faces acquired through a 3D scanning procedure. All these faces are in dense point-to-point correspondence, which enables the generation of a new realistic face (morph) by combining the acquired faces. A new 3D face can be inferred from one or multiple existing images of a face or by arbitrarily combining the example faces. 3DFMM provides a way to represent face shape and texture disentangled from external factors, such as camera parameters and illumination.

FAISS (Facebook AI Similarity Search) is an open-source library for similarity search and clustering of vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

References

  1. 1 2 Video Google: A Text Retrieval Approach to Object Matching in Videos. 13-16 October 2003. 2003.
  2. 1 2 3 4 G. Csurka; C. Dance; L.X. Fan; J. Willamowski & C. Bray (2004). "Visual categorization with bags of keypoints". Proc. of ECCV International Workshop on Statistical Learning in Computer Vision.
  3. 1 2 Fei-Fei Li; Perona, P. (2005). "A Bayesian Hierarchical Model for Learning Natural Scene Categories". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 2. pp. 524–531. doi:10.1109/CVPR.2005.16. ISBN   978-0-7695-2372-9. S2CID   6387937.
  4. 1 2 L. Fei-Fei; R. Fergus & A. Torralba. "Recognizing and Learning Object Categories, CVPR 2007 short course".
  5. Qiu, G. (2002). "Indexing chromatic and achromatic patterns for content-based colour image retrieval" (PDF). Pattern Recognition. 35 (8): 1675–1686. Bibcode:2002PatRe..35.1675Q. doi:10.1016/S0031-3203(01)00162-5.
  6. Vidal-Naquet; Ullman (1999). "Object recognition with informative features and linear classification" (PDF). Proceedings Ninth IEEE International Conference on Computer Vision. pp. 1150–1157. CiteSeerX   10.1.1.131.1283 . doi:10.1109/ICCV.2003.1238356. ISBN   978-0-7695-1950-0. S2CID   15620181.
  7. T. Leung; J. Malik (2001). "Representing and recognizing the visual appearance of materials using three-dimensional textons" (PDF). International Journal of Computer Vision. 43 (1): 29–44. doi:10.1023/A:1011126920638. S2CID   14915716.
  8. T. Hoffman (1999). "Probabilistic Latent Semantic Analysis" (PDF). Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Archived from the original (PDF) on 2007-07-10. Retrieved 2007-12-10.
  9. Sivic, J.; Russell, B.C.; Efros, A.A.; Zisserman, A.; Freeman, W.T. (2005). "Discovering objects and their location in images" (PDF). Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1. p. 370. CiteSeerX   10.1.1.184.1253 . doi:10.1109/ICCV.2005.77. ISBN   978-0-7695-2334-7. S2CID   206769491. Archived from the original (PDF) on 2020-01-31. Retrieved 2007-12-10.
  10. D. Blei; A. Ng & M. Jordan (2003). Lafferty, John (ed.). "Latent Dirichlet allocation" (PDF). Journal of Machine Learning Research. 3 (4–5): 993–1022. doi:10.1162/jmlr.2003.3.4-5.993. Archived from the original (PDF) on 2008-08-22. Retrieved 2007-12-10.
  11. Serre, T.; Wolf, L.; Poggio, T. (2005). "Object Recognition with Features Inspired by Visual Cortex" (PDF). 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 2. p. 994. CiteSeerX   10.1.1.71.5276 . doi:10.1109/CVPR.2005.254. ISBN   978-0-7695-2372-9. S2CID   260426. Archived from the original (PDF) on 2017-07-06. Retrieved 2007-12-10.
  12. 1 2 Jianguo Zhang; Marcin Marszałek; Svetlana Lazebnik; Cordelia Schmid (2007). "Local Features and Kernels for Classification of Texture and Object Categories: a Comprehensive Study" (PDF). International Journal of Computer Vision. 73 (2): 213–238. doi:10.1007/s11263-006-9794-4. S2CID   1486613.
  13. 1 2 Grauman, K.; Darrell, T. (2005). "The pyramid match kernel: discriminative classification with sets of image features" (PDF). Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1. p. 1458. CiteSeerX   10.1.1.644.6159 . doi:10.1109/ICCV.2005.239. ISBN   978-0-7695-2334-7. S2CID   13036203.
  14. Jianchao Yang; Kai Yu; Yihong Gong; Huang, T. (2009). "Linear spatial pyramid matching using sparse coding for image classification". 2009 IEEE Conference on Computer Vision and Pattern Recognition. p. 1794. doi:10.1109/CVPR.2009.5206757. ISBN   978-1-4244-3992-8. S2CID   440212. Archived from the original on 2019-03-20. Retrieved 2011-09-09.
  15. Savarese, S.; Winn, J.; Criminisi, A. (2006). "Discriminative Object Class Models of Appearance and Shape by Correlatons" (PDF). 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR'06). Vol. 2. p. 2033. CiteSeerX   10.1.1.587.8853 . doi:10.1109/CVPR.2006.102. ISBN   978-0-7695-2597-6. S2CID   1457124. Archived from the original (PDF) on 2013-10-29. Retrieved 2007-12-10.
  16. Sudderth, E.B.; Torralba, A.; Freeman, W.T.; Willsky, A.S. (2005). "Learning hierarchical models of scenes, objects, and parts" (PDF). Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1. p. 1331. CiteSeerX   10.1.1.128.7259 . doi:10.1109/ICCV.2005.137. ISBN   978-0-7695-2334-7. S2CID   6153430. Archived from the original (PDF) on 2019-02-03. Retrieved 2007-12-10.
  17. E. Sudderth; A. Torralba; W. Freeman & A. Willsky (2005). "Describing Visual Scenes using Transformed Dirichlet Processes" (PDF). Proc. of Neural Information Processing Systems.
  18. Niebles, Juan Carlos; Li Fei-Fei (2007). "A Hierarchical Model of Shape and Appearance for Human Action Classification" (PDF). 2007 IEEE Conference on Computer Vision and Pattern Recognition. p. 1. CiteSeerX   10.1.1.173.2667 . doi:10.1109/CVPR.2007.383132. ISBN   978-1-4244-1179-5. S2CID   9213242.
  19. Lazebnik, S.; Schmid, C.; Ponce, J. (2006). "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories" (PDF). 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR'06). Vol. 2. p. 2169. CiteSeerX   10.1.1.651.9183 . doi:10.1109/CVPR.2006.68. ISBN   978-0-7695-2597-6. S2CID   2421251. Archived from the original (PDF) on 2018-05-08. Retrieved 2007-12-10.
  20. Koniusz, Piotr; Yan, Fei; Mikolajczyk, Krystian (2013-05-01). "Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection". Computer Vision and Image Understanding. 117 (5): 479–492. doi:10.1016/j.cviu.2012.10.010. ISSN   1077-3142.
  21. 1 2 Koniusz, Piotr; Yan, Fei; Gosselin, Philippe Henri; Mikolajczyk, Krystian (2017-02-24). "Higher-order occurrence pooling for bags-of-words: Visual concept detection" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (2): 313–326. doi:10.1109/TPAMI.2016.2545667. hdl: 10044/1/39814 . ISSN   0162-8828. PMID   27019477.
  22. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. (2010-06-01). "Aggregating local descriptors into a compact image representation". 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (PDF). pp. 3304–3311. doi:10.1109/CVPR.2010.5540039. ISBN   978-1-4244-6984-0. S2CID   1912782.
  23. Seeland, Marco; Rzanny, Michael; Alaqraa, Nedal; Wäldchen, Jana; Mäder, Patrick (2017-02-24). "Plant species classification using flower images—A comparative study of local feature representations". PLOS ONE. 12 (2): e0170629. Bibcode:2017PLoSO..1270629S. doi: 10.1371/journal.pone.0170629 . ISSN   1932-6203. PMC   5325198 . PMID   28234999.