LabelMe

Last updated March 27, 2021

LabelMe is a project created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) which provides a dataset of digital images with annotations. The dataset is dynamic, free to use, and open to public contribution. The most applicable use of LabelMe is in computer vision research. As of October 31, 2010, LabelMe has 187,240 images, 62,197 annotated images, and 658,992 labeled objects.

Motivation

The motivation behind creating LabelMe comes from the history of publicly available data for computer vision researchers. Most available data was tailored to a specific research group's problems and caused new researchers to have to collect additional data to solve their own problems. LabelMe was created to solve several common shortcomings of available data. The following is a list of qualities that distinguish LabelMe from previous work.

Designed for recognition of a class of objects instead of single instances of an object. For example, a traditional dataset may have contained images of dogs, each of the same size and orientation. In contrast, LabelMe contains images of dogs in multiple angles, sizes, and orientations.
Designed for recognizing objects embedded in arbitrary scenes instead of images that are cropped, normalized, and/or resized to display a single object.
Complex annotation: Instead of labeling an entire image (which also limits each image to containing a single object), LabelMe allows annotation of multiple objects within an image by specifying a polygon bounding box that contains the object.
Contains a large number of object classes and allows the creation of new classes easily.
Diverse images: LabelMe contains images from many different scenes.
Provides non-copyrighted images and allows public additions to the annotations. This creates a free environment.

Annotation Tool

The LabelMe annotation tool provides a means for users to contribute to the project. The tool can be accessed anonymously or by logging into a free account. To access the tool, users must have a compatible web browser with JavaScript support. When the tool is loaded, it chooses a random image from the LabelMe dataset and displays it on the screen. If the image already has object labels associated with it, they will be overlaid on top of the image in polygon format. Each distinct object label is displayed in a different color.

If the image is not completely labeled, the user can use the mouse to draw a polygon containing an object in the image. For example, in the adjacent image, if a person was standing in front of the building, the user could click on a point on the border of the person, and continue clicking along the outside edge until returning to the starting point. Once the polygon is closed, a bubble pops up on the screen which allows the user to enter a label for the object. The user can choose whatever label the user thinks best describes the object. If the user disagrees with the previous labeling of the image, the user can click on the outline polygon of an object and either delete the polygon completely or edit the text label to give it a new name.

As soon as changes are made to the image by the user, they are saved and openly available for anyone to download from the LabelMe dataset. In this way, the data is always changing due to contributions by the community of users who use the tool. Once the user is finished with an image, the Show me another image link can be clicked and another random image will be selected to display to the user.

Problems with the data

The LabelMe dataset has some problems. Some are inherent in the data, such as the objects in the images not being uniformly distributed with respect to size and image location. This is due to the images being primarily taken by humans who tend to focus the camera on interesting objects in a scene. However, cropping and rescaling the images randomly can simulate a uniform distribution.^[1] Other problems are caused by the amount of freedom given to the users of the annotation tool. Some problems that arise are:

The user can choose which objects in the scene to outline. Should an occluded person be labeled? Should an occluded part of an object be included when outlining the object? Should the sky be labeled?
The user has to describe the shape of the object themselves by outlining a polygon. Should the fingers of a hand on a person be outlined with detail? How much precision must be used when outlining objects?
The user chooses what text to enter as the label for the object. Should the label be person, man, or pedestrian?

The creators of LabelMe decided to leave these decisions up to the annotator. The reason for this is that they believe people will tend to annotate the images according to what they think is the natural labeling of the images. This also provides some variability in the data, which can help researchers tune their algorithms to account for this variability.^[2]

Extending the data

Using WordNet

Since the text labels for objects provided in LabelMe come from user input, there is a lot of variation in the labels used (as described above). Because of this, analysis of objects can be difficult. For example, a picture of a dog might be labeled as dog, canine, hound, pooch, or animal. Ideally, when using the data, the object class dog at the abstract level should incorporate all of these text labels.

WordNet is a database of words organized into a structural way. It allows assigning a word to a category, or in WordNet language: a sense. Sense assignment is not easy to do automatically. When the authors of LabelMe tried automatic sense assignment, they found that it was prone to a high rate of error, so instead they assigned words to senses manually. At first, this may seem like a daunting task since new labels are added to the LabelMe project continuously. To the right is a graph comparing the growth of polygons to the growth of words (descriptions). As you can see, the growth of words is small compared with the continuous growth of polygons, and therefore is easy enough to keep up to date manually by the LabelMe team.^[3]

Once WordNet assignment is done, searches in the LabelMe database are much more effective. For example, a search for animal might bring up pictures of dogs, cats and snakes. However, since the assignment was done manually, a picture of a computer mouse labeled as mouse would not show up in a search for animals. Also, if objects are labeled with more complex terms like dog walking, WordNet still allows the search of dog to return these objects as results. WordNet makes the LabelMe database much more useful.

Object-part hierarchy

Having a large dataset of objects where overlap is allowed provides enough data to try and categorize objects as being a part of another object. For example, most of the labels assigned wheel are probably part of objects assigned to other labels like car or bicycle. These are called part labels. To determine if label P is a part label for label O:^[4]

Let $\mathrm {I} _{\mathrm {O} }\,$ denote the set of images containing an object (e.g. car)
Let $\mathrm {I} _{\mathrm {P} }\,$ denote the set of images containing a part (e.g. wheel)
Let the overlap score between object O and part P, $\mathrm {S} _{\mathrm {O} ,\mathrm {P} }\,$ , be defined as the ratio of the intersection area to the area of the part polygon. (e.g. ${\frac {\mathrm {A} (\mathrm {O} \cap \mathrm {P} )}{\mathrm {A} (\mathrm {P} )}}\,$ )
Let $\mathrm {I} _{\mathrm {O} ,\mathrm {P} }\subseteq \mathrm {I} _{\mathrm {P} }\,$ denote the images where object and part polygons have $\mathrm {S} _{\mathrm {O} ,\mathrm {P} }>\beta \,$ where $\beta \,$ is some threshold value. The authors of LabelMe use $\beta =0.5\,$
The object-part score for a candidate label is ${\frac {\mathrm {N} _{\mathrm {O} ,\mathrm {P} }}{\mathrm {N} _{\mathrm {P} }+\alpha }}\,$ where $\mathrm {N} _{\mathrm {O} ,\mathrm {P} }\,$ and $\mathrm {N} _{\mathrm {P} }\,$ are the number of images in $\mathrm {I} _{\mathrm {O} ,\mathrm {P} }\,$ and $\mathrm {I} _{\mathrm {P} }\,$ , respectively, and $\alpha \,$ is a concentration parameter. The authors of LabelMe use $\alpha =5\,$ .

This algorithm allows the automatic classification of parts of an object when the part objects are frequently contained within the outer object.

Object depth ordering

Another instance of object overlap is when one object is actually on top of the other. For example, an image might contain a person standing in front of a building. The person is not a part label as above since the person is not part of the building. Instead, they are two separate objects that happen to overlap. To automatically determine which object is the foreground and which is the background, the authors of LabelMe propose several options:^[5]

If an object is completely contained within another object, then the inner object must be in the foreground. Otherwise, it would not be visible in the image. The only exception is with transparent or translucent objects, but these occur rarely.
One of the objects could be labeled as something that cannot be in the foreground. Examples are sky, ground, or road.
The object with more polygon points inside the intersecting area is most likely the foreground. The authors tested this hypothesis and found it to be highly accurate.
Histogram intersection^[6] can be used. To do this, a color histogram in the intersecting areas is compared to the color histogram of the two objects. The object with the closer color histogram is assigned as the foreground. This method is less accurate than counting the polygon points.

Matlab Toolbox

The LabelMe project provides a set of tools for using the LabelMe dataset from Matlab. Since research is often done in Matlab, this allows the integration of the dataset with existing tools in computer vision. The entire dataset can be downloaded and used offline, or the toolbox allows dynamic downloading of content on demand.

Related Research Articles

In geometry, the Minkowski sum of two sets of position vectors A and B in Euclidean space is formed by adding each vector in A to each vector in B, i.e., the set

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple segments. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Scientific visualization is an interdisciplinary branch of science concerned with the visualization of scientific phenomena. It is also considered a subset of computer graphics, a branch of computer science. The purpose of scientific visualization is to graphically illustrate scientific data to enable scientists to understand, illustrate, and glean insight from their data.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.

In digital image processing, thresholding is the simplest method of segmenting images. From a grayscale image, thresholding can be used to create binary images.

Shot transition detection also called cut detection is a field of research of video processing. Its subject is the automated detection of transitions between shots in digital video with the purpose of temporal segmentation of videos.

The following are common definitions related to the machine vision field.

Histogram equalization method in image processing of contrast adjustment using the images histogram

Histogram equalization is a method in image processing of contrast adjustment using the image's histogram.

A region of interest, are samples within a data set identified for a particular purpose. The concept of a ROI is commonly used in many application areas. For example, in medical imaging, the boundaries of a tumor may be defined on an image or in a volume, for the purpose of measuring its size. The endocardial border may be defined on an image, perhaps during different phases of the cardiac cycle, for example, end-systole and end-diastole, for the purpose of assessing cardiac function. In geographical information systems (GIS), a ROI can be taken literally as a polygonal selection from a 2D map. In computer vision and optical character recognition, the ROI defines the borders of an object under consideration. In many applications, symbolic (textual) labels are added to a ROI, to describe its content in a compact manner. Within a ROI may lie individual points of interest (POIs).

Curved mirror Mirror with a curved reflecting surface

A curved mirror is a mirror with a curved reflecting surface. The surface may be either convex or concave. Most curved mirrors have surfaces that are shaped like part of a sphere, but other shapes are sometimes used in optical devices. The most common non-spherical type are parabolic reflectors, found in optical devices such as reflecting telescopes that need to image distant objects, since spherical mirror systems, like spherical lenses, suffer from spherical aberration. Distorting mirrors are used for entertainment. They have convex and concave regions that produce deliberately distorted images. They also provide highly magnified or highly diminished (smaller) images when the object is placed at certain distances.

As applied in the field of computer vision, graph cut optimization can be employed to efficiently solve a wide variety of low-level computer vision problems, such as image smoothing, the stereo correspondence problem, image segmentation, object co-segmentation, and many other computer vision problems that can be formulated in terms of energy minimization. Many of these energy minimization problems can be approximated by solving a maximum flow problem in a graph. Under most formulations of such problems in computer vision, the minimum energy solution corresponds to the maximum a posteriori estimate of a solution. Although many computer vision algorithms involve cutting a graph, the term "graph cuts" is applied specifically to those models which employ a max-flow/min-cut optimization.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc 'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate Computer Vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

The Point Cloud Library (PCL) is an open-source library of algorithms for point cloud processing tasks and 3D geometry processing, such as occur in three-dimensional computer vision. The library contains algorithms for filtering, feature estimation, surface reconstruction, 3D registration, model fitting, object recognition, and segmentation. Each module is implemented as a smaller library that can be compiled separately. PCL has its own data format for storing point clouds - PCD, but also allows datasets to be loaded and saved in many other formats. It is written in C++ and released under the BSD license.

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

In computer vision, a saliency map is an image that shows each pixel's unique quality. The goal of a saliency map is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. For example, if a pixel has a high grey level or other unique color quality in a color image, that pixel's quality will show in the saliency map and in an obvious way. Saliency is a kind of image segmentation.

Computer Vision Annotation Tool free and open source, web-based image and video annotation tool

Computer Vision Annotation Tool (CVAT) is a free, open source, web-based image and video annotation tool which is used for labeling data for computer vision algorithms. CVAT was developed for use by a professional data annotation team, with a user interface optimized for computer vision annotation tasks. Try it online cvat.org.

VoTT is a free and open source electron app for image annotation and labeling developed by Microsoft. The software is written in the TypeScript programming language and used for building end to end object detection models from image and videos assets for computer vision algorithms.

References

Bibliography

Russell, Bryan C.; Torralba, Antonio; Murphy, Kevin P.; Freeman, William T. (2008). "Label Me: A Database and Web-Based Tool for Image Annotation" (PDF). International Journal of Computer Vision. 77 (1–3): 157–173. doi:10.1007/s11263-007-0090-8. S2CID 1900911.
Swain, Michael J.; Ballard, Dana H. (1991). "Color indexing". International Journal of Computer Vision. 7: 11–32. doi:10.1007/BF00130487. S2CID 8167136.

External links

http://labelme.csail.mit.edu/ - LabelMe - The open annotation tool

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Russell et al. 2008 , Section 2.5

[2] Russell et al. 2008 , Section 2.2

[3] Russell et al. 2008 , Section 3.1

[4] Russell et al. 2008 , Section 3.2

[5] Russell et al. 2008 , Section 3.3

[6] Swain & Ballard 1991

[1]

[2]

[3]

[4]

[5]

[6]