Labeled data

Last updated

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor.

Contents

Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data.

Crowdsourced labeled data

In 2006, Fei-Fei Li, the co-director of the Stanford Human-Centered AI Institute, set out to improve the artificial intelligence models and algorithms for image recognition by significantly enlarging the training data. The researchers downloaded millions of images from the World Wide Web and a team of undergraduates started to apply labels for objects to each image. In 2007, Li outsourced the data labeling work on Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled database for outline of object recognition. [1]

Automated data labelling

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data. [2]

Data-driven bias

Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on bias labeled data will result in prejudices and omissions in a predictive model, despite the machine learning algorithm being legitimate. The labeled data used to train a specific machine learning algorithm needs to be a statistically representative sample to not bias the results. [3] Because the labeled data available to train facial recognition systems has not been representative of a population, underrepresented groups in the labeled data are later often misclassified. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter skinned humans respectively. [4]

Related Research Articles

<span class="mw-page-title-main">Supervised learning</span> A paradigm in machine learning

Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent pattern. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc 'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.

Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user, to label new data points with the desired outputs. The human user must possess knowledge/expertise in the problem domain, including the ability to consult/research authoritative sources when necessary. In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle.

In computer science, landmark detection is the process of finding significant landmarks in an image. This originally referred to finding landmarks for navigational purposes – for instance, in robot vision or creating maps from satellite images. Methods used in navigation have been extended to other fields, notably in facial recognition where it is used to identify key points on a face. It also has important applications in medicine, identifying anatomical landmarks in medical images.

<span class="mw-page-title-main">Feature learning</span> Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

<span class="mw-page-title-main">Domain adaptation</span> Field associated with machine learning and transfer learning

Domain adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning a model from a source data distribution and applying that model on a different target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user to a new user who receives significantly different emails. Domain adaptation has also been shown to be beneficial to learning unrelated sources. Note that, when more than one source distribution is available the problem is referred to as multi-source domain adaptation.

DeepFace is a deep learning facial recognition system created by a research group at Facebook. It identifies human faces in digital images. The program employs a nine-layer neural network with over 120 million connection weights and was trained on four million images uploaded by Facebook users. The Facebook Research team has stated that the DeepFace method reaches an accuracy of 97.35% ± 0.25% on Labeled Faces in the Wild (LFW) data set where human beings have 97.53%. This means that DeepFace is sometimes more successful than human beings. As a result of growing societal concerns Meta announced that it plans to shut down Facebook facial recognition system, deleting the face scan data of more than one billion users. This change will represent one of the largest shifts in facial recognition usage in the technology's history. Facebook planned to delete by December 2021 more than one billion facial recognition templates, which are digital scans of facial features. However, it did not plan to eliminate DeepFace which is the software that powers the facial recognition system. The company has also not ruled out incorporating facial recognition technology into future products, according to Meta spokesperson.

<span class="mw-page-title-main">Manifold regularization</span>

In machine learning, Manifold regularization is a technique for using the shape of a dataset to constrain the functions that should be learned on that dataset. In many machine learning problems, the data to be learned do not cover the entire input space. For example, a facial recognition system may not need to classify any possible image, but only the subset of images that contain faces. The technique of manifold learning assumes that the relevant subset of data comes from a manifold, a mathematical structure with useful properties. The technique also assumes that the function to be learned is smooth: data with different labels are not likely to be close together, and so the labeling function should not change quickly in areas where there are likely to be many data points. Because of this assumption, a manifold regularization algorithm can use unlabeled data to inform where the learned function is allowed to change quickly and where it is not, using an extension of the technique of Tikhonov regularization. Manifold regularization algorithms can extend supervised learning algorithms in semi-supervised learning and transductive learning settings, where unlabeled data are available. The technique has been used for applications including medical imaging, geographical imaging, and object recognition.

The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. Since 2010, the ImageNet project runs an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.

<span class="mw-page-title-main">Artificial intelligence in healthcare</span> Overview of the use of artificial intelligence in healthcare

Artificial intelligence in healthcare is a term used to describe the use of machine-learning algorithms and software, or artificial intelligence (AI), to copy human cognition in the analysis, presentation, and understanding of complex medical and health care data, or to exceed human capabilities by providing new ways to diagnose, treat, or prevent disease. Specifically, AI is the ability of computer algorithms to arrive at approximate conclusions based solely on input data.

<span class="mw-page-title-main">Algorithmic bias</span> Technological phenomenon with social implications

Algorithmic bias describes systematic and repeatable errors in a computer system that create "unfair" outcomes, such as "privileging" one category over another in ways different from the intended function of the algorithm.

<span class="mw-page-title-main">Joy Buolamwini</span> Computer scientist and digital activist

Joy Adowaa Buolamwini is a Canadian-American computer scientist and digital activist based at the MIT Media Lab. She founded the Algorithmic Justice League (AJL), an organization that works to challenge bias in decision-making software, using art, advocacy, and research to highlight the social implications and harms of artificial intelligence (AI).

<span class="mw-page-title-main">Timnit Gebru</span> Computer scientist

Timnit Gebru is an Eritrean Ethiopian-born computer scientist who works in the fields of artificial intelligence (AI), algorithmic bias and data mining. She is an advocate for diversity in technology and co-founder of Black in AI, a community of Black researchers working in AI. She is the founder of the Distributed Artificial Intelligence Research Institute (DAIR).

Weak supervision is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

Amazon Rekognition is a cloud-based software as a service (SaaS) computer vision platform that was launched in 2016. It has been sold to, and used by, a number of United States government agencies, including U.S. Immigration and Customs Enforcement (ICE) and Orlando, Florida police, as well as private entities.

Olga Russakovsky is an associate professor of computer science at Princeton University. Her research investigates computer vision and machine learning. She was one of the leaders of the ImageNet Large Scale Visual Recognition challenge and has been recognised by MIT Technology Review as one of the world's top young innovators.

Imageability is a measure of how easily a physical object, word or environment will evoke a clear mental image in the mind of any person observing it. It is used in architecture and city planning, in psycholinguistics, and in automated computer vision research. In automated image recognition, training models to connect images with concepts that have low imageability can lead to biased and harmful results.

<span class="mw-page-title-main">Algorithmic Justice League</span> Digital advocacy non-profit organization

The Algorithmic Justice League (AJL) is a digital advocacy non-profit organization based in Cambridge, Massachusetts. Founded in 2016 by computer scientist Joy Buolamwini, the AJL uses research, artwork, and policy advocacy to increase societal awareness regarding the use of artificial intelligence (AI) in society and the harms and biases that AI can pose to society. The AJL has engaged in a variety of open online seminars, media appearances, and tech advocacy initiatives to communicate information about bias in AI systems and promote industry and government action to mitigate against the creation and deployment of biased AI systems. In 2021, Fast Company named AJL as one of the 10 most innovative AI companies in the world.

References

  1. Mary L. Gray; Siddharth Suri (2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt. p. 7. ISBN   978-1-328-56628-7.
  2. Johnson, Leif. "What is the difference between labeled and unlabeled data?", Stack Overflow , 4 October 2013. Retrieved on 13 May 2017. Creative Commons by-sa small.svg  This article incorporates text by lmjohns3 available under the CC BY-SA 3.0 license.
  3. Xianhong Hu; Bhanu Neupane; Lucia Flores Echaiz; Prateek Sibal; Macarena Rivera Lam (2019). Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective. UNESCO Publishing. p. 64. ISBN   978-92-3-100363-9.
  4. Xianhong Hu; Bhanu Neupane; Lucia Flores Echaiz; Prateek Sibal; Macarena Rivera Lam (2019). Steering AI and advanced ICTs for knowledge societies: a Rights, Openness, Access, and Multi-stakeholder Perspective. UNESCO Publishing. p. 66. ISBN   978-92-3-100363-9.