Caltech 101

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories (faces, watches, ants, pianos, etc.) and a background category. Provided with the images are a set of annotations describing the outline of the object in each image, along with a Matlab script for viewing them.

Purpose

Most computer vision and machine learning algorithms function by training on example inputs. They require a large and varied set of training data to work effectively. For example, the real-time face detection method of Paul Viola and Michael J. Jones was trained on 4,916 hand-labeled faces. [1]

Cropping, re-sizing and hand-marking points of interest are tedious and time-consuming tasks.

Historically, most data sets used in computer vision research have been tailored to the specific needs of the project being worked on. A major problem in comparing computer vision techniques is that most groups use their own data sets. Each set may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images, and level of occlusion and clutter present can lead to varying results. [2]

The Caltech 101 data set aims at alleviating many of these common problems.

However, a recent study [3] demonstrates that tests based on uncontrolled natural images (like the Caltech 101 data set) can be seriously misleading, potentially guiding progress in the wrong direction.

Data set

Images

The Caltech 101 data set consists of a total of 9,146 images, split between 101 different object categories, as well as an additional background/clutter category.

Each object category contains between 40 and 800 images. Common and popular categories such as faces tend to have a larger number of images than others.

Each image is roughly 300×200 pixels. Images of oriented objects such as airplanes and motorcycles were mirrored to be left-right aligned, and vertically oriented structures such as buildings were rotated to be off-axis.

Annotations

A set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.

A Matlab script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a Matlab figure.
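Outside Matlab, the annotation files can be read with `scipy.io.loadmat`. The sketch below shows the coordinate bookkeeping involved: it assumes the annotation stores the bounding box as `[top, bottom, left, right]` and the outline as a 2×N contour given relative to that box. Both field layouts are assumptions about the released `.mat` files, not something guaranteed by the text above.

```python
import numpy as np

def contour_to_image_coords(obj_contour, box_coord):
    """Shift a 2xN outline given relative to its bounding box into
    absolute image coordinates. Assumes row 0 holds x values, row 1
    holds y values, and box_coord = [top, bottom, left, right]."""
    top, bottom, left, right = box_coord
    xs = obj_contour[0] + left  # x offsets are relative to the box's left edge
    ys = obj_contour[1] + top   # y offsets are relative to the box's top edge
    return np.stack([xs, ys])
```

With the contour in image coordinates, the outline can be overlaid on the image with any plotting library, mirroring what the provided Matlab script does.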

Uses

The Caltech 101 data set has been used to train and test several computer vision recognition and classification algorithms. The first paper to use Caltech 101 was an incremental Bayesian approach to one-shot learning, [4] an attempt to classify an object using only a few examples by building on prior knowledge of other classes.

The Caltech 101 images, along with the annotations, were used for another one-shot learning paper at Caltech. [5]

Several other computer vision papers report results using the Caltech 101 data set. [6] [7] [8] [9] [10] [11] [12] [13] [16]

Analysis and comparison

Advantages

Caltech 101 has several advantages over other similar data sets:

  - Uniform size and presentation: images are cropped and re-sized to roughly the same dimensions, with objects centered and aligned.
  - Low levels of clutter and occlusion in most images.
  - Detailed, human-specified object outlines are provided for every image.

Weaknesses

Some weaknesses of the Caltech 101 data set [3] [14] may be conscious trade-offs, but others are inherent limitations of the data set. Papers that rely solely on Caltech 101 for evaluation are frequently rejected.

Weaknesses include:

  - The images are too clean: objects tend to be centered, in a stereotypical pose, and free of the clutter, occlusion and viewpoint variation found in uncontrolled natural images. [3]
  - The number of categories is limited, and many categories contain relatively few images.
  - The artificial mirroring and rotation applied to some images introduces aliasing and other artifacts.


References

  1. Viola, Paul; Jones, Michael J. (2004). "Robust Real-Time Face Detection". International Journal of Computer Vision. 57 (2): 137–154. doi:10.1023/B:VISI.0000013087.49260.fb. S2CID   2796017.
  2. Oertel, Carsten; Colder, Brian; Colombe, Jeffrey; High, Julia; Ingram, Michael; Sallee, Phil (2008). "Current challenges in automating visual perception". 2008 37th IEEE Applied Imagery Pattern Recognition Workshop. pp. 1–8. doi:10.1109/AIPR.2008.4906457. ISBN   978-1-4244-3125-0. S2CID   36669995.
  3. Pinto, Nicolas; Cox, David D.; Dicarlo, James J. (2008). "Why is Real-World Visual Object Recognition Hard?". PLOS Computational Biology. 4 (1): e27. Bibcode:2008PLSCB...4...27P. doi: 10.1371/journal.pcbi.0040027 . PMC   2211529 . PMID   18225950.
  4. L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
  5. L. Fei-Fei; R. Fergus; P. Perona (April 2006). "One-Shot learning of object categories" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (4): 594–611. doi:10.1109/TPAMI.2006.79. PMID   16566508. S2CID   6953475. Archived from the original (PDF) on 2007-06-09. Retrieved 2008-01-16.
  6. The Pyramid Match Kernel:Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005
  7. Holub, AD; Welling, M; Perona, P. Combining Generative Models and Fisher Kernels for Object Class Recognition. International Conference on Computer Vision (ICCV), 2005. Archived from the original on 2007-08-14. Retrieved 2008-01-16.
  8. Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005
  9. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006
  10. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006
  11. Empirical study of multi-scale filter banks for object categorization, M.J. Marín-Jiménez and N. Pérez de la Blanca. December 2005
  12. Multiclass Object Recognition with Sparse, Localized Features, Jim Mutch and David G. Lowe, pp. 11–18, CVPR 2006, IEEE Computer Society Press, New York, June 2006
  13. G. Wang; Y. Zhang; L. Fei-Fei (2006). "Using Dependent Regions for Object Categorization in a Generative Framework" (PDF). IEEE Comp. Vis. Patt. Recog. Archived from the original (PDF) on 2007-06-09. Retrieved 2008-01-16.
  14. J. Ponce; T. L. Berg; M. Everingham; D. A. Forsyth; M. Hebert; S. Lazebnik; M. Marszalek; C. Schmid; B. C. Russell; A. Torralba; C. K. I. Williams; J. Zhang; A. Zisserman (2006). J. Ponce; M. Hebert; C. Schmid; A. Zisserman (eds.). "Dataset Issues in Object Recognition" (PDF). Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science. Archived from the original (PDF) on 2016-12-24. Retrieved 2008-02-08.
  15. F. Tanner, B. Colder, C. Pullen, D. Heagy, C. Oertel, & P. Sallee, Overhead Imagery Research Data Set (OIRDS) – an annotated data library and tools to aid in the development of computer vision algorithms, June 2009, http://sourceforge.net/apps/mediawiki/oirds/index.php?title=Documentation (Archived 2012-11-09 at the Wayback Machine; retrieved 28 December 2009)
  16. "L. Ballan, M. Bertini, A. Del Bimbo, A.M. Serain, G. Serra, B.F. Zaccone. Combining Generative and Discriminative Models for Classifying Social Images from 101 Object Categories. Int. Conference on Pattern Recognition (ICPR), 2012" (PDF). Archived from the original (PDF) on 2014-08-26. Retrieved 2012-07-11.