Zero-shot learning

Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

Zero-shot methods generally work by associating observed and non-observed classes through some form of auxiliary information, which encodes observable distinguishing properties of objects. [1] For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an artificial intelligence model which has been trained to recognize horses, but has never been given a zebra, can still recognize a zebra when it also knows that zebras look like striped horses. This problem is widely studied in computer vision, natural language processing, and machine perception. [2]
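
To make the zebra example concrete, here is a minimal sketch of attribute-based zero-shot classification. The attribute vectors and class signatures below are illustrative stand-ins, not values from any of the cited papers:

```python
import numpy as np

# Toy attribute signatures: each class is described by observable
# properties (hypothetical values, for illustration only).
# Attributes: [has_stripes, has_four_legs, has_mane]
class_signatures = {
    "horse": np.array([0.0, 1.0, 1.0]),
    "tiger": np.array([1.0, 1.0, 0.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),  # unseen during training
}

def predict_zero_shot(attribute_scores: np.ndarray) -> str:
    """Assign the class whose attribute signature is nearest to the
    attribute scores predicted for the input (e.g., by a model trained
    only on seen classes such as horses and tigers)."""
    return min(class_signatures,
               key=lambda c: np.linalg.norm(class_signatures[c] - attribute_scores))

# An input whose attribute predictor fires on stripes, four legs, and a
# mane is matched to "zebra" even though no zebra image was ever seen.
print(predict_zero_shot(np.array([0.9, 1.0, 0.8])))  # -> "zebra"
```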

Background and history

Zero-shot learning in natural language processing first appeared in a 2008 paper by Chang, Ratinov, Roth, and Srikumar at AAAI'08, where the paradigm was called dataless classification. [3] Zero-shot learning in computer vision first appeared at the same conference, under the name zero-data learning. [4] The term zero-shot learning itself first appeared in the literature in a 2009 paper by Palatucci, Hinton, Pomerleau, and Mitchell at NIPS'09. [5] This terminology was repeated in a later computer vision paper, [6] and the term zero-shot learning caught on, as a take-off on one-shot learning, which had been introduced in computer vision years earlier. [7]

In computer vision, zero-shot learning models learn parameters for seen classes along with their class representations, and rely on the representational similarity among class labels so that, during inference, instances can be classified into new classes.
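
A minimal sketch of this recipe, in the style of cross-modal embedding methods such as DeViSE [18]: image features are projected into a semantic space shared by seen and unseen class representations, and the nearest class representation wins. The projection matrix and embeddings below are random stand-ins; in practice the projection is learned on seen classes only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in class representations (e.g., word embeddings of the labels);
# in practice these come from a semantic space covering seen and unseen classes.
class_embeddings = {name: rng.normal(size=50)
                    for name in ["horse", "tiger", "zebra"]}

# A projection from image-feature space into the class-embedding space,
# learned on seen classes only (here: a random stand-in matrix W).
W = rng.normal(size=(50, 2048))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(image_features: np.ndarray, candidate_classes: list[str]) -> str:
    projected = W @ image_features
    return max(candidate_classes,
               key=lambda c: cosine(projected, class_embeddings[c]))

# At inference, the candidate set may include classes never seen in training.
print(classify(rng.normal(size=2048), ["horse", "tiger", "zebra"]))
```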

In natural language processing, the key technical direction developed builds on the ability to "understand the labels", that is, to represent the labels in the same semantic space as the documents to be classified. This supports the classification of a single example without observing any annotated data, the purest form of zero-shot classification. The original paper [3] made use of the Explicit Semantic Analysis (ESA) representation, but later papers made use of other representations, including dense representations. This approach was also extended to multilingual domains, [8] [9] fine-grained entity typing, [10] and other problems. Moreover, beyond relying solely on representations, the computational approach has been extended to depend on transfer from other tasks, such as textual entailment [11] and question answering. [12]
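
For example, the entailment-based formulation [11] underlies the zero-shot classification pipeline in the Hugging Face transformers library. A minimal usage sketch, assuming that library and a publicly released NLI checkpoint are available:

```python
from transformers import pipeline

# Entailment-based zero-shot classification in the spirit of Yin et al. [11]:
# each candidate label is turned into a hypothesis such as
# "This example is about sports." and a natural language inference model
# scores whether the input text entails it.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The striker scored twice in the final minutes of the match.",
    candidate_labels=["sports", "politics", "technology"],
)
print(result["labels"][0])  # labels come back sorted by score; expect "sports"
```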

The original paper [3] also points out that, beyond the ability to classify a single example, when a collection of examples is given with the assumption that they come from the same distribution, it is possible to bootstrap performance in a semi-supervised-like (transductive) manner.
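
A minimal sketch of this bootstrapping idea as simple self-training over the unlabeled collection. The `zero_shot_score` and `train_classifier` callables are assumed placeholders for any zero-shot scorer and supervised learner; the confidence threshold is illustrative:

```python
def bootstrap(collection, labels, zero_shot_score, train_classifier,
              confidence=0.9):
    """Transductive-style bootstrapping: pseudo-label the most confident
    examples with a zero-shot scorer, then train a supervised classifier
    on those pseudo-labels and apply it to the whole collection."""
    pseudo = []
    for doc in collection:
        scores = {lbl: zero_shot_score(doc, lbl) for lbl in labels}
        best = max(scores, key=scores.get)
        if scores[best] >= confidence:
            pseudo.append((doc, best))
    model = train_classifier(pseudo)           # any standard supervised learner
    return [model(doc) for doc in collection]  # relabel the full collection
```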

Unlike standard generalization in machine learning, where classifiers are expected to correctly classify new samples into classes they have already observed during training, in ZSL, no samples from the target classes have been observed while training the classifier. It can therefore be viewed as an extreme case of domain adaptation.

Prerequisite information for zero-shot classes

Naturally, some form of auxiliary information has to be given about the zero-shot classes, and this information can be of several types:

  1. Learning with attributes: classes are accompanied by pre-defined structured descriptions, for example "red head" and "long beak" for bird species. [6] [13] [14]
  2. Learning from textual description: classes are accompanied by free-form natural-language descriptions, the direction most pursued in natural language processing. [15] [16] [17]
  3. Class-class similarity: classes are embedded in a continuous space, and a zero-shot classifier predicts a position in that space, with the nearest embedded class taken as the predicted class. [18] [19]

Generalized zero-shot learning

The above ZSL setup assumes that only zero-shot samples, namely samples from new, unseen classes, are given at test time. In generalized zero-shot learning, samples from both new and known classes may appear at test time. This poses a new challenge for classifiers, because it is very hard to estimate whether a given sample comes from a new class or a known one. Some approaches to handle this include a gating module, which is trained to decide at inference time whether a given sample comes from a seen or an unseen class so that it can be routed to the appropriate classifier, [20] and a generative module, which is trained to generate feature representations for the unseen classes so that a standard classifier can then be trained over samples from all classes. [21]
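
A minimal sketch of the gating idea as a confidence-thresholded router between two classifiers; the threshold value and the classifier interfaces are illustrative assumptions, not a specific published method:

```python
def generalized_zero_shot_predict(x, seen_classifier, zero_shot_classifier,
                                  novelty_threshold=0.5):
    """Route a test sample either to the classifier over seen classes or
    to the zero-shot classifier over unseen classes, based on how
    confident the seen-class model is (a simple gating heuristic)."""
    label, confidence = seen_classifier(x)  # e.g., max softmax probability
    if confidence >= novelty_threshold:
        return label                         # likely a seen class
    return zero_shot_classifier(x)           # likely a new, unseen class
```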

Domains of application

Zero-shot learning has been applied to the following fields: image classification, semantic segmentation, image generation, object detection, natural language processing, and computational biology. [22]

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.

Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.

In machine learning, boosting is an ensemble meta-algorithm used in supervised learning, primarily to reduce bias and also variance. It refers to a family of machine learning algorithms that convert weak learners into strong ones.
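
A minimal boosting example using scikit-learn as an assumed dependency; AdaBoost fits a sequence of weak learners (decision stumps by default), reweighting the examples that earlier learners misclassified, and combines them into one strong classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data stands in for any labeled training set.
X, y = make_classification(n_samples=200, random_state=0)

# Fit 50 weak learners sequentially; each round upweights the examples
# the ensemble so far gets wrong.
model = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(model.score(X, y))
```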

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or image classification technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution on a given observable variable X and target variable Y. A generative model can be used to "generate" random instances (outcomes) of an observation x.
  2. A discriminative model is a model of the conditional probability of the target Y, given an observation x. It can be used to "discriminate" the value of the target variable Y, given an observation x.
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".
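
To make the distinction concrete, a small sketch contrasting a generative and a discriminative classifier on the same data, using scikit-learn as an assumed dependency:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB           # generative: models p(x, y)
from sklearn.linear_model import LogisticRegression  # discriminative: models p(y | x)

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

generative = GaussianNB().fit(X, y)
discriminative = LogisticRegression().fit(X, y)

# Both yield class probabilities for a new x, but only the generative model
# also defines a distribution over the inputs themselves.
print(generative.predict_proba(X[:1]), discriminative.predict_proba(X[:1]))
```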

Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, more difficult data domains can also be analyzed, e.g., news texts, where authors typically express their opinion or sentiment less explicitly.

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes. For example, deciding on whether an image is showing a banana, an orange, or an apple is a multiclass classification problem, with three possible classes, while deciding on whether an image contains an apple or not is a binary classification problem.

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. Convolution-based networks are the de-facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 learnable weights are required to process 5×5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
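
The parameter counts in that comparison work out as follows (a worked arithmetic example, not code from any CNN library):

```python
# Fully connected: one output neuron sees every pixel of a 100 x 100 image.
weights_per_dense_neuron = 100 * 100   # 10,000 weights for a single neuron
# Convolutional: one 5 x 5 kernel is shared across all image positions.
weights_per_conv_kernel = 5 * 5        # 25 shared, learnable weights
print(weights_per_dense_neuron, weights_per_conv_kernel)
```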

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
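
A toy illustration of the "closer in vector space means similar in meaning" property; the three-dimensional vectors below are made up for illustration (real embeddings are learned from corpora and have hundreds of dimensions):

```python
import numpy as np

# Purely illustrative 3-dimensional embeddings.
embedding = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.7, 0.4, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantically related words get higher similarity in the vector space.
print(cosine_similarity(embedding["king"], embedding["queen"]))  # high
print(cosine_similarity(embedding["king"], embedding["apple"]))  # lower
```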

Quantum machine learning is the integration of quantum algorithms within machine learning programs.

Dan Roth is the Eduardo D. Glandt Distinguished Professor of Computer and Information Science at the University of Pennsylvania and the Chief AI Scientist at Oracle. Until June 2024 he was a VP/Distinguished Scientist at AWS AI, where for three years he led the scientific effort behind the first-generation generative AI products from AWS, including Titan Models, Amazon Q, and Bedrock, from inception until they became generally available.

This glossary of artificial intelligence is a list of definitions of terms and concepts relevant to the study of artificial intelligence, its sub-disciplines, and related fields. Related glossaries include Glossary of computer science, Glossary of robotics, and Glossary of machine vision.

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations. Semantic parsing is one of the important tasks in computational linguistics and natural language processing.

Triplet loss is a loss function for machine learning algorithms where a reference input (the anchor) is compared to a matching (positive) input and a non-matching (negative) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. An early formulation equivalent to triplet loss was introduced for metric learning from relative comparisons by M. Schultz and T. Joachims in 2003.
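
A direct transcription of that definition as a hinge on the two distances (a minimal sketch; the Euclidean metric and the margin value are common choices, not requirements):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the matching (positive) input and push it
    away from the non-matching (negative) input until the two distances
    differ by at least `margin`; below that gap the loss is zero."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```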

An energy-based model (EBM) (also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE)) is an application of the canonical ensemble formulation of statistical physics to learning from data. The approach prominently appears in generative models (GMs).

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples. One sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.
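
A minimal sketch of how such a training pair can be formed from unlabeled data, here in the spirit of a denoising objective; the Gaussian noise model is an illustrative choice among the transformations mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ssl_pair(x, noise_scale=0.1):
    """Create a self-supervised training pair from an unlabeled sample:
    the corrupted view is the model input, the original is the target."""
    corrupted = x + rng.normal(scale=noise_scale, size=x.shape)
    return corrupted, x  # (model input, supervisory signal)

x = rng.normal(size=8)            # an unlabeled sample
inp, target = make_ssl_pair(x)    # no human label needed
```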

References

  1. Xian, Yongqin; Lampert, Christoph H.; Schiele, Bernt; Akata, Zeynep (2020-09-23). "Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly". arXiv: 1707.00600 [cs.CV].
  2. Xian, Yongqin; Schiele, Bernt; Akata, Zeynep (2017). "Zero-shot learning-the good, the bad and the ugly". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 4582–4591. arXiv: 1703.04394 . Bibcode:2017arXiv170304394X.
  3. Chang, M.W. (2008). "Importance of Semantic Representation: Dataless Classification". AAAI.
  4. Larochelle, Hugo (2008). "Zero-data Learning of New Tasks" (PDF).
  5. Palatucci, Mark (2009). "Zero-Shot Learning with Semantic Output Codes" (PDF). NIPS.
  6. Lampert, C.H. (2009). "Learning to detect unseen object classes by between-class attribute transfer". IEEE Conference on Computer Vision and Pattern Recognition: 951–958. CiteSeerX 10.1.1.165.9750.
  7. Miller, E. G. (2000). "Learning from One Example Through Shared Densities on Transforms" (PDF). CVPR.
  8. Song, Yangqiu (2019). "Toward any-language zero-shot topic classification of textual documents". Artificial Intelligence. 274: 133–150. doi: 10.1016/j.artint.2019.02.002 .
  9. Song, Yangqiu (2016). "Cross-Lingual Dataless Classification for Many Languages" (PDF). IJCAI.
  10. Zhou, Ben (2018). "Zero-Shot Open Entity Typing as Type-Compatible Grounding" (PDF). EMNLP. arXiv: 1907.03228.
  11. Yin, Wenpeng (2019). "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach" (PDF). EMNLP. arXiv: 1909.00161 .
  12. Levy, Omer (2017). "Zero-Shot Relation Extraction via Reading Comprehension" (PDF). CoNLL. arXiv: 1706.04115 .
  13. Romera-Paredes, Bernardino; Torr, Phillip (2015). "An embarrassingly simple approach to zero-shot learning" (PDF). International Conference on Machine Learning: 2152–2161.
  14. Atzmon, Yuval; Chechik, Gal (2018). "Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning" (PDF). Uncertainty in Artificial Intelligence. arXiv: 1806.02664 . Bibcode:2018arXiv180602664A.
  15. Roth, Dan (2009). "Aspect Guided Text Categorization with Unobserved Labels". ICDM. CiteSeerX   10.1.1.148.9946 .
  16. Hu, R Lily; Xiong, Caiming; Socher, Richard (2018). "Zero-Shot Image Classification Guided by Natural Language Descriptions of Classes: A Meta-Learning Approach" (PDF). NeurIPS.
  17. Srivastava, Shashank; Labutov, Igor; Mitchell, Tom (2018). "Zero-shot Learning of Classifiers from Natural Language Quantification". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 306–316. doi: 10.18653/v1/P18-1029.
  18. Frome, Andrea; et al. (2013). "DeViSE: A Deep Visual-Semantic Embedding Model" (PDF). Advances in Neural Information Processing Systems: 2121–2129.
  19. Socher, R; Ganjoo, M; Manning, C.D.; Ng, A. (2013). "Zero-shot learning through cross-modal transfer". Neural Information Processing Systems. arXiv: 1301.3666 . Bibcode:2013arXiv1301.3666S.
  20. Atzmon, Yuval (2019). "Adaptive Confidence Smoothing for Generalized Zero-Shot Learning". The IEEE Conference on Computer Vision and Pattern Recognition: 11671–11680. arXiv: 1812.09903 . Bibcode:2018arXiv181209903A.
  21. Felix, R.; et al. (2018). "Multi-modal cycle-consistent generalized zero-shot learning". Proceedings of the European Conference on Computer Vision: 21–37. arXiv: 1808.00136. Bibcode:2018arXiv180800136F.
  22. Wittmann, Bruce J.; Yue, Yisong; Arnold, Frances H. (2020-12-04). "Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden": 2020.12.04.408955. doi:10.1101/2020.12.04.408955. S2CID   227914824.{{cite journal}}: Cite journal requires |journal= (help)