80 Million Tiny Images

80 Million Tiny Images is a dataset intended for training machine learning systems. It was constructed by Antonio Torralba, Rob Fergus, and William T. Freeman in a collaboration between MIT and New York University, and was published in 2008.

Contents

The dataset is 760 GB in size. It contains 79,302,017 color images of 32×32 pixels each, scaled down from images scraped from the World Wide Web over 8 months. The images are classified into 75,062 classes, each of which is a non-abstract noun in WordNet. An image may appear in more than one class. The dataset was motivated by non-parametric models of neural activations in the visual cortex upon seeing images. [1] [2]

The CIFAR-10 and CIFAR-100 datasets use subsets of the images in this dataset, but with independently generated labels, because the original labels were not reliable. CIFAR-10 has 6,000 examples of each of 10 classes, and CIFAR-100 has 600 examples of each of 100 non-overlapping classes. [3]

Construction

The dataset was first described in a technical report in April 2007, midway through construction, when it contained only 73 million images. [4] The full dataset was published in 2008. [1]

The authors began with all 75,846 non-abstract nouns in WordNet and, for each noun, scraped seven image search engines: Altavista, Ask.com, Flickr, Cydral, Google, Picsearch, and Webshots. After 8 months of scraping, they had obtained 97,245,098 images. Because they did not have enough storage for the full-size images, they downsized each image to 32×32 pixels as it was scraped.

After gathering, the authors removed images with zero variance and images that were duplicated within the same word, which produced the final dataset.
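The cleaning step can be illustrated with a minimal sketch. It assumes each image is held as a raw `bytes` object and treats "duplicate" as a byte-for-byte match; the source does not say how the authors detected duplicates, so that criterion and the function names `has_zero_variance` and `clean_word_images` are hypothetical.

```python
from typing import Dict, List


def has_zero_variance(pixels: bytes) -> bool:
    """Return True when every byte is identical, i.e. the image is a flat color."""
    return len(set(pixels)) <= 1


def clean_word_images(images_by_word: Dict[str, List[bytes]]) -> Dict[str, List[bytes]]:
    """Drop flat images and exact duplicates within each word's image list.

    Assumption: duplicates are byte-for-byte identical images. Duplicates
    across different words are kept, since an image may belong to several classes.
    """
    cleaned: Dict[str, List[bytes]] = {}
    for word, images in images_by_word.items():
        seen = set()   # images already kept for this word
        kept = []
        for pixels in images:
            if has_zero_variance(pixels) or pixels in seen:
                continue
            seen.add(pixels)
            kept.append(pixels)
        cleaned[word] = kept
    return cleaned
```

Because the set of seen images is reset for every word, the same picture can survive under two different nouns, which matches the statement that images may appear in more than one class.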

Of the 75,846 nouns, only 75,062 returned any results, so the remaining 784 nouns do not appear in the final dataset.

The number of images per noun follows a Zipf-like distribution, with 1,056 images per noun on average. To prevent a few nouns from accounting for too many images, the authors capped each noun at 3,000 images. [1]
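The average can be checked from the totals reported in this article, and the cap can be sketched in a few lines. This is only an illustration: the source does not say which images were kept when a noun exceeded the limit, so keeping the first ones in list order is an assumption, and the names `cap_per_noun` and `mean_images_per_noun` are hypothetical.

```python
from typing import Dict, List

MAX_IMAGES_PER_NOUN = 3000  # upper bound stated in the text


def cap_per_noun(images_by_noun: Dict[str, List[bytes]],
                 limit: int = MAX_IMAGES_PER_NOUN) -> Dict[str, List[bytes]]:
    """Keep at most `limit` images per noun (assumption: the first ones in order)."""
    return {noun: images[:limit] for noun, images in images_by_noun.items()}


def mean_images_per_noun(total_images: int, noun_count: int) -> float:
    """Average number of images per noun."""
    return total_images / noun_count


# Totals taken from the article: 79,302,017 images across 75,062 nouns.
mean = mean_images_per_noun(79_302_017, 75_062)
print(f"{mean:.2f}")
```

The result is about 1056.49 images per noun, consistent with the rounded figure of 1,056.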

Retirement

The 80 Million Tiny Images dataset was retired by its creators in 2020, [5] after a paper by researchers Abeba Birhane and Vinay Prabhu found that the labels of several publicly available image datasets, including 80 Million Tiny Images, contained racist and misogynistic slurs, which caused models trained on them to exhibit racial and sexual bias. The dataset also contained offensive images. [6] [7] Following the release of the paper, the dataset's creators removed the dataset from distribution and asked other researchers to stop using it and to delete their copies. [5]

References

  1. Torralba, Antonio; Fergus, Rob; Freeman, William T. (November 2008). "80 million tiny images: a large data set for nonparametric object and scene recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 30 (11): 1958–1970. doi:10.1109/TPAMI.2008.128. ISSN 1939-3539. PMID 18787244. S2CID 7487588.
  2. "80 Million Tiny Images". IPAM Workshop on Numerical Tools and Fast Algorithms for Massive Data Mining, Search Engines and Applications. October 23, 2007.
  3. Krizhevsky, A. (2009). "Learning multiple layers of features from tiny images". Tech. Rep., University of Toronto.
  4. Torralba, A.; Fergus, R.; Freeman, W. T. (2007). "Tiny images". Tech. Rep. MIT-CSAIL-TR-2007-024.
  5. "80 Million Tiny Images". groups.csail.mit.edu. Retrieved 2020-07-02.
  6. Prabhu, Vinay Uday; Birhane, Abeba (2020-06-24). "Large image datasets: A pyrrhic win for computer vision?". arXiv:2006.16923 [cs.CY].
  7. Quach, Katyanna (1 July 2020). "MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs". www.theregister.com. Retrieved 2020-07-02.