Zero-shot learning

Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

Zero-shot methods generally work by associating observed and non-observed classes through some form of auxiliary information, which encodes observable distinguishing properties of objects. [1] For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an artificial intelligence model which has been trained to recognize horses, but has never been given a zebra, can still recognize a zebra when it also knows that zebras look like striped horses. This problem is widely studied in computer vision, natural language processing, and machine perception. [2]

Background and history

The first paper on zero-shot learning in natural language processing appeared in 2008 at AAAI’08, although the learning paradigm was called dataless classification there. [3] The first paper on zero-shot learning in computer vision appeared at the same conference, under the name zero-data learning. [4] The term zero-shot learning itself first appeared in the literature in a 2009 paper by Palatucci, Hinton, Pomerleau, and Mitchell at NIPS’09. [5] The terminology was repeated in a later computer vision paper, [6] and the term zero-shot learning caught on as a take-off on one-shot learning, which had been introduced in computer vision years earlier. [7]

In computer vision, zero-shot learning models learn parameters for seen classes along with their class representations, and rely on representational similarity among class labels so that, during inference, instances can be classified into new classes.
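The idea of classifying by similarity between class representations can be sketched with the zebra example from the introduction. This is a minimal illustration under stated assumptions, not a method from the cited papers: the attribute names, the `CLASS_ATTRIBUTES` table, and the attribute scores are all hypothetical, and the scores stand in for the output of a model that was trained only on seen classes.

```python
import math

# Hypothetical class descriptions over three attributes:
# (has_hooves, has_stripes, has_mane).
CLASS_ATTRIBUTES = {
    "horse": (1.0, 0.0, 1.0),  # seen during training
    "tiger": (0.0, 1.0, 0.0),  # seen during training
    "zebra": (1.0, 1.0, 1.0),  # unseen: known only through its description
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(attribute_scores, class_attributes):
    """Return the class whose attribute signature is most similar to the scores."""
    return max(
        class_attributes,
        key=lambda name: cosine(attribute_scores, class_attributes[name]),
    )


# Assumed attribute scores for an image of a striped, hoofed, maned animal.
prediction = classify((0.9, 0.8, 0.7), CLASS_ATTRIBUTES)
print(prediction)
```

Because the unseen class is represented in the same attribute space as the seen classes, it can be selected at inference time even though no zebra image was available during training.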

In natural language processing, the key technical direction builds on the ability to "understand the labels", that is, to represent the labels in the same semantic space as the documents to be classified. This supports the classification of a single example without observing any annotated data, the purest form of zero-shot classification. The original paper [3] used the Explicit Semantic Analysis (ESA) representation, but later papers used other representations, including dense representations. This approach was also extended to multilingual domains, [8] [9] fine-grained entity typing, [10] and other problems. Moreover, beyond relying solely on representations, the computational approach has been extended to depend on transfer from other tasks, such as textual entailment [11] and question answering. [12]
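Representing labels and documents in one semantic space can be sketched as follows. The word-to-concept table `WORD_VECTORS` is a hypothetical toy stand-in for ESA or a dense embedding model, and the label names and weights are invented for illustration; only the overall pattern (embed the label name, embed the document, pick the most similar label) reflects the approach described above.

```python
import math
from collections import Counter

# Hypothetical word-to-concept weights; a real system would use ESA
# or dense embeddings instead of this hand-written table.
WORD_VECTORS = {
    "sports": {"athletics": 1.0},
    "football": {"athletics": 0.9, "teams": 0.5},
    "goal": {"athletics": 0.6},
    "finance": {"money": 1.0},
    "bank": {"money": 0.9},
    "shares": {"money": 0.8, "markets": 0.6},
}


def embed(text):
    """Sum the concept vectors of all known words in the text."""
    vec = Counter()
    for word in text.lower().split():
        for concept, weight in WORD_VECTORS.get(word, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def classify(document, labels):
    """Pick the label whose own embedding is closest to the document's."""
    doc_vec = embed(document)
    return max(labels, key=lambda label: cosine(doc_vec, embed(label)))


prediction = classify("the bank sold its shares", ["sports", "finance"])
print(prediction)
```

No labeled documents are used anywhere in this sketch: the only supervision is the meaning of the label names themselves.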

The original paper [3] also points out that, beyond the ability to classify a single example, when a collection of examples is given and assumed to come from the same distribution, performance can be bootstrapped in a semi-supervised-like manner (or through transductive learning).
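One way to picture such bootstrapping is a simple self-training loop: label the unlabeled collection by similarity to the label vectors, then re-estimate each class prototype from the documents assigned to it and label again. The sketch below is an assumption-laden illustration, not the procedure of the original paper; the two-dimensional vectors, the label names `topic_a` and `topic_b`, and the choice of averaging the label vector with its assigned documents are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vec, prototypes):
    """Name of the prototype most similar to vec."""
    return max(prototypes, key=lambda name: cosine(vec, prototypes[name]))


def bootstrap(docs, label_vectors, rounds=2):
    """Return (initial, refined) assignments for an unlabeled collection."""
    prototypes = dict(label_vectors)
    assignments = [nearest(d, prototypes) for d in docs]
    initial = list(assignments)
    for _ in range(rounds):
        for name, label_vec in label_vectors.items():
            # Keep the label vector in the average so a class is never empty.
            members = [label_vec] + [
                d for d, a in zip(docs, assignments) if a == name
            ]
            prototypes[name] = tuple(sum(c) / len(members) for c in zip(*members))
        assignments = [nearest(d, prototypes) for d in docs]
    return initial, assignments


# Hypothetical label vectors and document vectors in a shared 2-D space.
labels = {"topic_a": (1.0, 0.0), "topic_b": (0.0, 1.0)}
docs = [
    (1.0, 0.1), (1.0, 0.2), (0.9, 0.3),  # clearly near topic_a
    (0.7, 1.0), (0.8, 1.0), (0.9, 1.0),  # clearly near topic_b
    (1.0, 0.9),                          # borderline document
]
initial, refined = bootstrap(docs, labels)
print(initial[-1], "->", refined[-1])
```

In this toy data the borderline document is first assigned by the label vectors alone, and its assignment changes once the prototypes are re-estimated from the collection, which is the effect that bootstrapping relies on.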

Unlike standard generalization in machine learning, where classifiers are expected to correctly assign new samples to classes they have already observed during training, in ZSL no samples from the target classes are provided while training the classifier. It can therefore be viewed as an extreme case of domain adaptation.

Prerequisite information for zero-shot classes

Naturally, some form of auxiliary information has to be given about these zero-shot classes, and this information can take several forms.

Generalized zero-shot learning

The above ZSL setup assumes that at test time only zero-shot samples are given, namely, samples from new, unseen classes. In generalized zero-shot learning, samples from both new and known classes may appear at test time. This poses new challenges for classifiers, because it is very difficult to estimate whether a given sample is new or known. Some approaches to handle this include:
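As a rough illustration of the seen-versus-unseen decision, a classifier can be placed behind a confidence gate: if the seen-class model is confident, its answer is used, and otherwise the sample is routed to a zero-shot model. The sketch below is an assumption made for illustration and does not reproduce any particular published method; the function name `gated_predict`, the threshold value, and the score dictionaries are hypothetical.

```python
def gated_predict(seen_probs, unseen_scores, threshold=0.7):
    """Route a sample to the seen-class or the zero-shot prediction.

    seen_probs: class -> probability from a model trained on seen classes.
    unseen_scores: class -> compatibility score from a zero-shot model.
    threshold: assumed confidence level above which the sample is
        treated as belonging to a seen class.
    """
    best_seen = max(seen_probs, key=seen_probs.get)
    if seen_probs[best_seen] >= threshold:
        return best_seen
    return max(unseen_scores, key=unseen_scores.get)


# Confident seen-class prediction: the gate keeps it.
print(gated_predict({"horse": 0.95, "tiger": 0.05}, {"zebra": 0.6, "okapi": 0.4}))

# Uncertain seen-class prediction: the gate defers to the zero-shot model.
print(gated_predict({"horse": 0.55, "tiger": 0.45}, {"zebra": 0.8, "okapi": 0.2}))
```

A hard threshold like this is the simplest possible gate; its weakness is exactly the difficulty noted above, since a model trained only on seen classes can be confidently wrong about samples from unseen ones.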

Domains of application

Zero-shot learning has been applied to the following fields:

References

  1. Xian, Yongqin; Lampert, Christoph H.; Schiele, Bernt; Akata, Zeynep (2020-09-23). "Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly". arXiv:1707.00600 [cs.CV].
  2. Xian, Yongqin; Schiele, Bernt; Akata, Zeynep (2017). "Zero-shot learning-the good, the bad and the ugly". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 4582–4591. arXiv:1703.04394. Bibcode:2017arXiv170304394X.
  3. Chang, M.W. (2008). "Importance of Semantic Representation: Dataless Classification". AAAI.
  4. Larochelle, Hugo (2008). "Zero-data Learning of New Tasks".
  5. Palatucci, Mark (2009). "Zero-Shot Learning with Semantic Output Codes". NIPS.
  6. Lampert, C.H. (2009). "Learning to detect unseen object classes by between-class attribute transfer". IEEE Conference on Computer Vision and Pattern Recognition: 951–958. CiteSeerX 10.1.1.165.9750.
  7. Miller, E. G. (2000). "Learning from One Example Through Shared Densities on Transforms". CVPR.
  8. Song, Yangqiu (2019). "Toward any-language zero-shot topic classification of textual documents". Artificial Intelligence. 274: 133–150. doi:10.1016/j.artint.2019.02.002.
  9. Song, Yangqiu (2016). "Cross-Lingual Dataless Classification for Many Languages". IJCAI.
  10. Zhou, Ben (2018). "Zero-Shot Open Entity Typing as Type-Compatible Grounding". EMNLP. arXiv:1907.03228.
  11. Yin, Wenpeng (2019). "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach". EMNLP. arXiv:1909.00161.
  12. Levy, Omer (2017). "Zero-Shot Relation Extraction via Reading Comprehension". CoNLL. arXiv:1706.04115.
  13. Romera-Paredes, Bernardino; Torr, Phillip (2015). "An embarrassingly simple approach to zero-shot learning". International Conference on Machine Learning: 2152–2161.
  14. Atzmon, Yuval; Chechik, Gal (2018). "Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning". Uncertainty in Artificial Intelligence. arXiv:1806.02664. Bibcode:2018arXiv180602664A.
  15. Roth, Dan (2009). "Aspect Guided Text Categorization with Unobserved Labels". ICDM. CiteSeerX 10.1.1.148.9946.
  16. Hu, R Lily; Xiong, Caiming; Socher, Richard (2018). "Zero-Shot Image Classification Guided by Natural Language Descriptions of Classes: A Meta-Learning Approach". NeurIPS.
  17. Srivastava, Shashank; Labutov, Igor; Mitchell, Tom (2018). "Zero-shot Learning of Classifiers from Natural Language Quantification". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 306–316. doi:10.18653/v1/P18-1029.
  18. Frome, Andrea; et al. (2013). "Devise: A deep visual-semantic embedding model". Advances in Neural Information Processing Systems: 2121–2129.
  19. Socher, R; Ganjoo, M; Manning, C.D.; Ng, A. (2013). "Zero-shot learning through cross-modal transfer". Neural Information Processing Systems. arXiv:1301.3666. Bibcode:2013arXiv1301.3666S.
  20. Atzmon, Yuval (2019). "Adaptive Confidence Smoothing for Generalized Zero-Shot Learning". The IEEE Conference on Computer Vision and Pattern Recognition: 11671–11680. arXiv:1812.09903. Bibcode:2018arXiv181209903A.
  21. Felix, R; et al. (2018). "Multi-modal cycle-consistent generalized zero-shot learning". Proceedings of the European Conference on Computer Vision: 21–37. arXiv:1808.00136. Bibcode:2018arXiv180800136F.
  22. Wittmann, Bruce J.; Yue, Yisong; Arnold, Frances H. (2020-12-04). "Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden". doi:10.1101/2020.12.04.408955. S2CID 227914824.