The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million [1] [2] images have been hand-annotated by the project to indicate what objects are pictured, and in at least one million of the images, bounding boxes are also provided. [3] ImageNet contains more than 20,000 categories, [2] with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. [4] The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. [5] Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes. [6]
AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006. At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms. [7] In 2007, Li met with Princeton professor Christiane Fellbaum, one of the creators of WordNet , to discuss the project. As a result of this meeting, Li went on to build ImageNet starting from the roughly 22,000 nouns of WordNet and using many of its features. [8] She was also inspired by a 1987 estimate [9] that the average person recognizes roughly 30,000 different kinds of objects. [10]
As an assistant professor at Princeton, Li assembled a team of researchers to work on the ImageNet project. They used Amazon Mechanical Turk to help with the classification of images. Labeling started in July 2008 and ended in April 2010; overall, the labeling effort took 2.5 years to complete. [8] The budget was sufficient to have each of the 14 million images labeled three times. [10]
The original plan called for 10,000 images per category across 40,000 categories, or 400 million images in total, each verified three times. The team found that humans can classify at most two images per second; at this rate, the labeling was estimated to require 19 human-years of labor (without rest). [11]
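The 19-year figure follows from straightforward arithmetic, assuming each of the three verifications counts as one classification performed at the two-images-per-second rate:

$$\frac{400{,}000{,}000 \times 3\ \text{labels}}{2\ \text{labels per second}} = 6 \times 10^{8}\ \text{seconds} \approx 19\ \text{years}$$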
They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida, titled "ImageNet: A Preview of a Large-scale Hierarchical Dataset". [12] [8] [13] [14] The poster was reused at Vision Sciences Society 2009. [15]
In 2009, Alex Berg suggested adding object localization as a task. Li approached the PASCAL Visual Object Classes (VOC) contest team in 2009 to propose a collaboration. This resulted in the ImageNet Large Scale Visual Recognition Challenge, starting in 2010, which has 1,000 classes and object localization, compared with PASCAL VOC, which had just 20 classes and 19,737 images (in 2010). [6] [8]
On 30 September 2012, a convolutional neural network (CNN) called AlexNet [16] achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner up. Using convolutional neural networks was feasible due to the use of graphics processing units (GPUs) during training, [16] an essential ingredient of the deep learning revolution. According to The Economist , "Suddenly people started to pay attention, not just within the AI community but across the technology industry as a whole." [4] [17] [18]
In 2015, AlexNet was outperformed by Microsoft's very deep CNN with over 100 layers, which won the ImageNet 2015 contest. [19]
ImageNet crowdsources its annotation process. Image-level annotations indicate the presence or absence of an object class in an image, such as "there are tigers in this image" or "there are no tigers in this image". Object-level annotations provide a bounding box around the (visible part of the) indicated object. ImageNet uses a variant of the broad WordNet schema to categorize objects, augmented with 120 categories of dog breeds to showcase fine-grained classification. [6]
In 2012, ImageNet was the world's largest academic user of Mechanical Turk. The average worker identified 50 images per minute. [2]
The original plan called for the full ImageNet to contain roughly 50 million clean, diverse, full-resolution images spread over approximately 50,000 synsets. [13] This was not achieved.
The project published summary statistics of the dataset on April 30, 2010. [20]
The categories of ImageNet were filtered from WordNet concepts. Because each concept can contain multiple synonyms (for example, "kitty" and "young cat"), each concept is called a "synonym set" or "synset". WordNet 3.0 contains more than 100,000 synsets, the majority of which (over 80,000) are nouns. ImageNet filtered these down to 21,841 synsets corresponding to countable nouns that can be visually illustrated.
Each synset in WordNet 3.0 has a "WordNet ID" (wnid), which is a concatenation of its part of speech and its "offset" (a unique identifying number). Every wnid starts with "n" because ImageNet only includes nouns. For example, the wnid of the synset "dog, domestic dog, Canis familiaris" is "n02084071". [21]
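As an illustration, the wnid above can be reproduced with NLTK's WordNet interface. This is a minimal sketch assuming NLTK is installed and the WordNet corpus has been fetched with nltk.download('wordnet'); NLTK ships the WordNet 3.0 offsets used by ImageNet.

```python
from nltk.corpus import wordnet as wn

# Look up the synset "dog, domestic dog, Canis familiaris" in WordNet 3.0.
synset = wn.synset("dog.n.01")

# A wnid is the part-of-speech letter followed by the zero-padded 8-digit offset.
wnid = f"{synset.pos()}{synset.offset():08d}"
print(wnid)   # n02084071
```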
The categories in ImageNet fall into 9 levels, from level 1 (such as "mammal") to level 9 (such as "German shepherd"). [11]
The images were scraped from online image search engines (Google, Picsearch, MSN, Yahoo, Flickr, etc.) using synonyms in multiple languages, for example: German shepherd, German police dog, German shepherd dog, Alsatian, ovejero alemán (Spanish), pastore tedesco (Italian), and 德国牧羊犬 (Chinese). [22]
ImageNet consists of RGB images at varying resolutions. For example, in the ImageNet 2012 "fish" category, resolutions range from 4288 x 2848 down to 75 x 56. In machine learning, these images are typically preprocessed to a standard constant resolution and whitened before further processing by neural networks.
For example, in PyTorch, ImageNet images are by default normalized by scaling the pixel values to lie between 0 and 1, then subtracting the per-channel means [0.485, 0.456, 0.406] and dividing by the per-channel standard deviations [0.229, 0.224, 0.225]. These are the means and standard deviations computed over ImageNet, so this standardization whitens the input data. [23]
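The following is a minimal sketch of this standard preprocessing pipeline using torchvision. The 256-pixel resize, 224 x 224 center crop and the example file path are conventional choices assumed here for illustration, not requirements of the dataset itself.

```python
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),        # resize so the shorter side is 256 pixels
    transforms.CenterCrop(224),    # crop to a standard constant resolution
    transforms.ToTensor(),         # scale pixel values into [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # per-channel ImageNet means
                         std=[0.229, 0.224, 0.225]),   # per-channel ImageNet stds
])

img = Image.open("example.jpg").convert("RGB")   # placeholder image path
x = preprocess(img).unsqueeze(0)                 # tensor of shape (1, 3, 224, 224)
```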
Each image is labelled with exactly one wnid.
Dense SIFT features (raw SIFT descriptors, quantized codewords, and the coordinates of each descriptor/codeword) for ImageNet-1K were made available for download, designed for bag-of-visual-words models. [24]
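The following is a minimal sketch of the bag-of-visual-words representation these features were designed for, using OpenCV and scikit-learn: local descriptors are quantized into a codebook of visual words, and each image becomes a histogram of codeword counts. The released ImageNet features were computed on a dense grid, whereas this sketch uses OpenCV's keypoint-based SIFT for brevity; the function names are illustrative.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path: str) -> np.ndarray:
    """Extract 128-dimensional SIFT descriptors from one image file."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(stacked_descriptors: np.ndarray, k: int = 1000) -> KMeans:
    """Quantize descriptors into k visual words ("codewords") with k-means."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(stacked_descriptors)

def bovw_histogram(desc: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Represent one image as a normalized histogram of codeword counts."""
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```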
The bounding boxes of objects were available for about 3000 popular synsets [25] with on average 150 images in each synset. [26]
Furthermore, some images have attribute annotations. The project released 25 attributes for roughly 400 popular synsets. [27] [28]
The full original dataset is referred to as ImageNet-21K. ImageNet-21K contains 14,197,122 images divided into 21,841 classes. Some papers round this up and call it ImageNet-22K. [29]
The full ImageNet-21K was released in the fall of 2011 as fall11_whole.tar. There is no official train-validation-test split for ImageNet-21K. Some classes contain only 1-10 samples, while others contain thousands. [29]
Various subsets of the ImageNet dataset are used in various contexts, sometimes referred to as "versions". [16]
One of the most widely used subsets of ImageNet is the "ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012–2017 image classification and localization dataset". This is also referred to in the research literature as ImageNet-1K or ILSVRC2017, reflecting the original ILSVRC challenge that involved 1,000 classes. ImageNet-1K contains 1,281,167 training images, 50,000 validation images and 100,000 test images. [30]
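For illustration, the training and validation splits can be loaded with torchvision, assuming the ILSVRC2012 archives have already been obtained and placed in a local directory (torchvision does not download ImageNet itself, the path below is a placeholder, and the 100,000 test images are not exposed through this interface).

```python
from torchvision import datasets

root = "/data/imagenet"   # placeholder: must contain the ILSVRC2012 devkit and archives
train_set = datasets.ImageNet(root, split="train")
val_set = datasets.ImageNet(root, split="val")
print(len(train_set), len(val_set))   # expected: 1281167 50000
```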
Each category in ImageNet-1K is a leaf category, meaning that it has no child nodes below it, unlike in ImageNet-21K. For example, ImageNet-21K contains some images categorized simply as "mammal", whereas ImageNet-1K contains only images in leaf categories such as "German shepherd", which has no child categories below it. [22]
In the winter of 2021, ImageNet-21K was updated: 2,702 categories in the "person" subtree were filtered out to prevent "problematic behaviors" in trained models. Also in 2021, ImageNet-1K was updated by annotating the faces appearing in the 997 non-person categories; training models on the dataset with these faces blurred was found to cause minimal loss in performance. [31]
ImageNetV2 was a new dataset containing three test sets of 10,000 images each, constructed by the same methodology as the original ImageNet. [32]
ImageNet-21K-P was a filtered and cleaned subset of ImageNet-21K, with 12,358,688 images from 11,221 categories. [29]
Name | Published | Classes | Training | Validation | Test | Size
---|---|---|---|---|---|---
PASCAL VOC | 2005 | 20 | | | |
ImageNet-1K | 2009 | 1,000 | 1,281,167 | 50,000 | 100,000 | 130 GB
ImageNet-21K | 2011 | 21,841 | 14,197,122 | | | 1.31 TB
ImageNetV2 | 2019 | 1,000 | | | 30,000 |
ImageNet-21K-P | 2021 | 11,221 | 11,797,632 | 561,052 | |
The ILSVRC aims to "follow in the footsteps" of the smaller-scale PASCAL VOC challenge, established in 2005, which contained only about 20,000 images and twenty object classes. [6] To "democratize" ImageNet, Fei-Fei Li proposed to the PASCAL VOC team a collaboration, beginning in 2010, where research teams would evaluate their algorithms on the given data set, and compete to achieve higher accuracy on several visual recognition tasks. [8]
The resulting annual competition is now known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The ILSVRC uses a "trimmed" list of only 1000 image categories or "classes", including 90 of the 120 dog breeds classified by the full ImageNet schema. [6]
The 2010s saw dramatic progress in image processing.
The first competition in 2010 had 11 participating teams. The winning entry was a linear support vector machine (SVM) trained on features consisting of a dense grid of HoG and LBP descriptors, sparsified by local coordinate coding and pooling. [33] It achieved 52.9% classification accuracy and 71.8% top-5 accuracy. It was trained for four days on three 8-core machines (dual quad-core 2 GHz Intel Xeon CPUs). [34]
The second competition, in 2011, had fewer teams; another linear SVM won, with a top-5 error rate of 25%. [10] The winning team, XRCE (Florent Perronnin and Jorge Sanchez), used a linear SVM running on quantized [35] Fisher vectors, [36] [37] achieving 74.2% top-5 accuracy.
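The following toy sketch illustrates the general recipe shared by these early winning systems: hand-engineered local features pooled into a fixed-length vector, followed by a linear SVM. It uses scikit-image HOG descriptors and scikit-learn on random placeholder data, and is not a reconstruction of the actual local-coordinate-coding or Fisher-vector pipelines.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))      # placeholder stand-ins for grayscale images
labels = rng.integers(0, 2, size=20)   # placeholder binary class labels

def hog_feature(image: np.ndarray) -> np.ndarray:
    """Histogram-of-oriented-gradients descriptor over a dense cell grid."""
    return hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

X = np.stack([hog_feature(img) for img in images])   # fixed-length feature vectors
clf = LinearSVC(C=1.0).fit(X, labels)                # linear SVM on top of the features
print(clf.predict(X[:3]))
```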
In 2012, a deep convolutional neural net called AlexNet achieved 84.7% in top-5 accuracy, a great leap forward. [38] In the next couple of years, top-5 accuracy grew to above 90%. While the 2012 breakthrough "combined pieces that were all there before", the dramatic quantitative improvement marked the start of an industry-wide artificial intelligence boom. [4]
By 2014, more than fifty institutions participated in the ILSVRC. [6] In 2017, 29 of 38 competing teams had greater than 95% accuracy. [39] In 2017 ImageNet stated it would roll out a new, much more difficult challenge in 2018 that involves classifying 3D objects using natural language. Because creating 3D data is more costly than annotating a pre-existing 2D image, the dataset is expected to be smaller. The applications of progress in this area would range from robotic navigation to augmented reality. [1]
By 2015, researchers at Microsoft reported that their CNNs exceeded human ability at the narrow ILSVRC tasks. [19] [40] However, as one of the challenge's organizers, Olga Russakovsky, pointed out in 2015, the contest involves only 1,000 categories; humans can recognize a far larger number of categories and, unlike the programs, can also judge the context of an image. [41]
It is estimated that over 6% of the labels in the ImageNet-1K validation set are wrong. [42] It has also been found that around 10% of ImageNet-1K contains ambiguous or erroneous labels, and that, when presented with a model's prediction alongside the original ImageNet label, human annotators preferred the prediction of a state-of-the-art model from 2020 trained on the original ImageNet, suggesting that ImageNet-1K has been saturated. [43]
A 2019 study of the history of the multiple layers (taxonomy, object classes and labeling) of ImageNet and WordNet described how bias is deeply embedded in most classification approaches for all sorts of images. [44] [45] [46] [47] ImageNet is working to address various sources of bias. [48]
One downside of relying on WordNet is that its categories may be more "elevated" than would be optimal for ImageNet: "Most people are more interested in Lady Gaga or the iPod Mini than in this rare kind of diplodocus."
Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. "Understanding" in this context signifies the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
In machine learning (ML), boosting is an ensemble metaheuristic used primarily to reduce bias. It can also improve the stability and accuracy of ML classification and regression algorithms. Hence, it is prevalent in supervised learning for converting weak learners into strong learners.
Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.
In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, even though the images of the objects may vary somewhat across different viewpoints, sizes and scales, or even when the objects are translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.
Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc 'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.
In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in a fully connected layer, 10,000 weights would be required to process an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 learnable weights are required for a kernel that processes 5 × 5 tiles of the image. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
Fei-Fei Li is a Chinese-American computer scientist, known for establishing ImageNet, the dataset that enabled rapid advances in computer vision in the 2010s. She is the Sequoia Capital professor of computer science at Stanford University and former board director at Twitter. Li is a co-director of the Stanford Institute for Human-Centered Artificial Intelligence and a co-director of the Stanford Vision and Learning Lab. She served as the director of the Stanford Artificial Intelligence Laboratory from 2013 to 2018.
In computer vision, a saliency map is an image that highlights either the region on which people's eyes focus first or the most relevant regions for machine learning models. The goal of a saliency map is to reflect the degree of importance of a pixel to the human visual system or an otherwise opaque ML model.
AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto. It had 60 million parameters and 650,000 neurons.
Olga Russakovsky is an associate professor of computer science at Princeton University. Her research investigates computer vision and machine learning. She was one of the leaders of the ImageNet Large Scale Visual Recognition challenge and has been recognised by MIT Technology Review as one of the world's top young innovators.
Imageability is a measure of how easily a physical object, word or environment will evoke a clear mental image in the mind of any person observing it. It is used in architecture and city planning, in psycholinguistics, and in automated computer vision research. In automated image recognition, training models to connect images with concepts that have low imageability can lead to biased and harmful results.
80 Million Tiny Images is a dataset intended for training machine learning systems constructed by Antonio Torralba, Rob Fergus, and William T. Freeman in a collaboration between MIT and New York University. It contains 79,302,017 32×32 pixel color images, scaled down from images extracted from the World Wide Web in 2008 using automated web search queries on a set of 75,062 non-abstract nouns derived from WordNet. The words in the search terms were then used as labels for the images. The researchers used seven web search resources for this purpose: Altavista, Ask.com, Flickr, Cydral, Google, Picsearch and Webshots.
Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.
Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples. One sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.
A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches, serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
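The patch-embedding step can be written in a few lines of PyTorch, since a strided convolution whose kernel size equals its stride is mathematically equivalent to the per-patch matrix multiplication described above. The 16-pixel patch size and 768-dimensional embedding are the common ViT-Base defaults, assumed here for illustration.

```python
import torch

x = torch.randn(1, 3, 224, 224)   # a batch containing one RGB image
patch_size, embed_dim = 16, 768
proj = torch.nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = proj(x).flatten(2).transpose(1, 2)   # shape (1, 196, 768): one embedding per patch
print(tokens.shape)
```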
The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.
When Li, who had moved back to Princeton to take a job as an assistant professor in 2007, talked up her idea for ImageNet, she had a hard time getting faculty members to help out. Finally, a professor who specialized in computer architecture agreed to join her as a collaborator.
Having read about WordNet's approach, Li met with professor Christiane Fellbaum, a researcher influential in the continued work on WordNet, during a 2006 visit to Princeton.