Developer(s) | OpenAI |
---|---|
Initial release | January 5, 2021 |
Repository | https://github.com/OpenAI/CLIP |
Written in | Python |
License | MIT License |
Website | openai.com/research/clip |
Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. [1] This method has enabled broad applications across multiple domains, including cross-modal retrieval, [2] text-to-image generation, [3] aesthetic ranking, [4] and image captioning. [5]
It was first announced on OpenAI's official blog on January 5, 2021, [6] together with a report served directly through OpenAI's CDN [7] and a GitHub repository. [8] The paper was posted on arXiv on 26 February 2021. [9]
The report (with some details removed, and its appendix moved into a "Supplementary PDF") was published in the Proceedings of the 38th International Conference on Machine Learning (PMLR), [1] which had a submission deadline of February 2021. [10]
Concurrent with CLIP was ALIGN, published at the same conference by researchers at Google, using essentially the same algorithm. [11]
The CLIP method trains a pair of models contrastively. [1] One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart.
To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of $N$ image-caption pairs. Let the outputs from the text and image models be $v_1, \dots, v_N$ and $w_1, \dots, w_N$ respectively. Two vectors are considered "similar" if their dot product is large.
The loss incurred on this batch is the multi-class N-pair loss, [12] which is a symmetric cross-entropy loss over similarity scores:

$$-\frac{1}{N}\sum_{i=1}^{N}\ln\frac{e^{v_i\cdot w_i/T}}{\sum_{j=1}^{N} e^{v_i\cdot w_j/T}} \;-\; \frac{1}{N}\sum_{j=1}^{N}\ln\frac{e^{v_j\cdot w_j/T}}{\sum_{i=1}^{N} e^{v_i\cdot w_j/T}}$$

In essence, this loss function encourages the dot product between matching image and text vectors ($v_i \cdot w_i$) to be high, while discouraging high dot products between non-matching pairs. The parameter $T > 0$ is the temperature, which is parameterized in the original CLIP model as $T = e^{-\tau}$, where $\tau \in \mathbb{R}$ is a learned parameter.
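A minimal PyTorch sketch of this symmetric loss follows. The function and variable names are illustrative rather than taken from the official implementation, and the two directions are averaged (as in the paper's pseudocode) rather than summed; the difference is only a constant factor.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(v, w, tau):
    """Symmetric cross-entropy (multi-class N-pair) loss for a batch of N pairs.

    v, w: (N, d) tensors of L2-normalized text and image embeddings.
    tau:  learned scalar; similarities are divided by the temperature T = exp(-tau).
    """
    logits = (v @ w.t()) * torch.exp(tau)                # (N, N) scaled similarity scores
    labels = torch.arange(v.shape[0], device=v.device)   # matching pair of row i is column i
    loss_text = F.cross_entropy(logits, labels)          # text-to-image direction
    loss_image = F.cross_entropy(logits.t(), labels)     # image-to-text direction
    return (loss_text + loss_image) / 2                  # averaged over the two directions

# Toy usage with random unit vectors and the paper's initial temperature of 0.07.
v = F.normalize(torch.randn(8, 512), dim=-1)
w = F.normalize(torch.randn(8, 512), dim=-1)
tau = torch.tensor(1 / 0.07).log()                       # so exp(tau) = 1/0.07
print(clip_contrastive_loss(v, w, tau))
```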
Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) [13] proposes the following loss function:

$$L = \frac{1}{N}\sum_{i,j} f\!\left((2\delta_{i,j}-1)\left(e^{\tau}\, v_i\cdot w_j + b\right)\right)$$

where $f(x) = \ln(1+e^{-x})$ is the negative log sigmoid loss, $\tau$ and $b$ are learned parameters, and the Dirac delta symbol $\delta_{i,j}$ is 1 if $i=j$ else 0.
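A corresponding sketch of the sigmoid loss, under the same illustrative naming conventions (tau and b are learned scalars, not names from the SigLIP code):

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, w, tau, b):
    """Pairwise sigmoid loss for a batch of N pairs (illustrative sketch).

    v, w:   (N, d) tensors of L2-normalized text and image embeddings.
    tau, b: learned scalars (log-temperature and bias).
    """
    n = v.shape[0]
    logits = (v @ w.t()) * torch.exp(tau) + b        # (N, N)
    signs = 2 * torch.eye(n, device=v.device) - 1    # +1 on the diagonal, -1 elsewhere
    # f(x) = ln(1 + exp(-x)) equals softplus(-x); sum over all N*N pairs, divide by N.
    return F.softplus(-signs * logits).sum() / n
```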
While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well.
The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer; a 224x224 input thus yields a 16-by-16 grid of 256 patches. The size indicator ranges over B, L, H, and G (base, large, huge, giant), in that order.
Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI), or ConvNeXt [14] (in the OpenCLIP model series by LAION [15] ).
Since the output vectors of the image model and the text model must have exactly the same length, both the image model and the text model have fixed-length vector outputs, which in the original report is called "embedding dimension". [note 1]
For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024, [9] : Table 19 and for the ViTs, from 512 to 768. [9] : Table 20
Model name | Resolution (px) | Parameters, total (millions) | Parameters, vision (millions) | Parameters, text (millions) | Embedding dimension | Size (MB) | Release date |
---|---|---|---|---|---|---|---|
RN50 | 224 | 102 | 38.3 | 63.1 | 1024 | 244 | 2021-01 |
RN101 | 224 | 120 | 56.3 | 63.1 | 512 | 278 | 2021-03 |
RN50x4 | 288 | 178 | 87.1 | 90.7 | 640 | 402 | 2021-03 |
RN50x16 | 384 | 291 | 167.3 | 123.0 | 768 | 630 | 2021-07 |
RN50x64 | 448 | 623 | 420.4 | 201.8 | 1024 | 1260 | 2022-01 |
ViT-B/32 | 224 | 151 | 87.8 | 63.1 | 512 | 338 | 2021-01 |
ViT-B/16 | 224 | 150 | 86.2 | 63.1 | 512 | 335 | 2021-07 |
ViT-L/14 | 224 | 428 | 304.0 | 123.0 | 768 | 890 | 2022-01 |
ViT-L/14@336px | 336 | 428 | 304.3 | 123.0 | 768 | 891 | 2022-04 |
CLIP's implementation of ViT was the same as the original one, [17] with one modification: after the position embeddings are added to the initial patch embeddings, a LayerNorm is applied.
CLIP's implementation of ResNet was the same as the original one, [18] with three modifications:

- The ResNet-D improvements were applied.
- Antialiased rect-2 blur pooling was used.
- The final global average pooling layer was replaced by a multi-headed attention pooling layer.
ALIGN [11] used EfficientNet [22] of various sizes, a kind of convolutional neural network.
The text encoding models used in CLIP are typically Transformers.
In the original OpenAI report, the text encoder was a Transformer (63M parameters, 12 layers, 512-wide, 8 attention heads) using lower-cased byte pair encoding (BPE) with a 49,152-token vocabulary. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. [1] : 5 Its architecture is the same as GPT-2. [23]
Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). The activations of the highest layer of the transformer at the [EOS] token are taken, a LayerNorm is applied, then a final linear map. This is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49,408.
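As an illustration of these settings, the released clip package exposes the tokenizer and text encoder directly; a minimal sketch (the prompt string is arbitrary):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# clip.tokenize pads/truncates to the fixed context length of 77 and
# brackets the text with the [SOS] and [EOS] tokens.
tokens = clip.tokenize(["a diagram of a neural network"]).to(device)
print(tokens.shape)  # torch.Size([1, 77])

with torch.no_grad():
    # encode_text takes the activation at [EOS], applies LayerNorm and the final linear map.
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512]) for ViT-B/32
```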
ALIGN [11] used BERT of various sizes.
The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. [1]
The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets.
The dataset is private and has not been released to the public, and there is no further information on it. [note 3]
For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting the per-channel values [0.48145466, 0.4578275, 0.40821073] and dividing by [0.26862954, 0.26130258, 0.27577711].
The rationale was that these are the mean and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. [25]
If the input image does not have the same resolution as the native resolution (224x224 for all except ViT-L/14@336px, which has 336x336 resolution), then the input image is rescaled by bicubic interpolation so that its shorter side matches the native resolution, after which the central square of the image is cropped out.
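This preprocessing can be approximated with standard torchvision transforms; a sketch assuming a 224x224 model (clip.load already returns an equivalent preprocess callable):

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Approximation of CLIP's image preprocessing for 224x224 models:
# bicubic resize of the shorter side, central crop, scaling to [0, 1],
# then channel-wise normalization with the WebImageText statistics.
clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
```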
ALIGN [11] used over one billion image-text pairs, obtained by extracting images and their alt-text tags from web crawling. The method was described as similar to how the Conceptual Captions dataset [26] was constructed, but instead of complex filtering, only a frequency-based filtering was applied.
Later models trained by other organizations used publicly released datasets. For example, LAION trained OpenCLIP on the published datasets LAION-400M, LAION-2B, and DataComp-1B. [27] [15]
In the original OpenAI CLIP report, they reported training 5 ResNet models and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs. The largest ViT model took 12 days on 256 V100 GPUs.
All ViT models were trained on 224x224 image resolution. The ViT-L/14 was then boosted to 336x336 resolution by FixRes, [28] resulting in the ViT-L/14@336px model. [note 4] They found this was the best-performing model. [1] : Appendix F. Model Hyperparameters
In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. [29]
CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations. [30] In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content.
CLIP’s ability to connect visual and textual data has found applications in multimedia search, content discovery, and recommendation systems. [31] [32]
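A minimal text-to-image retrieval sketch over a small gallery (the file names and the query string below are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["cat.jpg", "dog.jpg", "car.jpg"]  # placeholder gallery of images

with torch.no_grad():
    # Embed the gallery once; a real system would store these vectors in an index.
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Embed the text query and rank the gallery by cosine similarity.
    query = clip.tokenize(["a photo of a dog"]).to(device)
    text_features = model.encode_text(query)
    text_features /= text_features.norm(dim=-1, keepdim=True)

scores = (image_features @ text_features.t()).squeeze(1)
print(paths[scores.argmax().item()])  # best-matching image for the query
```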
CLIP can perform zero-shot image classification tasks. This is achieved by prompting the text encoder with class names and selecting the class whose embedding is closest to the image embedding. For example, to classify an image, the embedding of the image is compared with the embedding of the text "A photo of a {class}." for each candidate class, and the {class} that results in the highest dot product is output.
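A sketch of this zero-shot procedure for an arbitrary label set (the class names are placeholders; the factor 100 approximates the learned temperature scaling of the released models):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog", "car"]  # placeholder class names
prompts = clip.tokenize([f"A photo of a {c}." for c in classes]).to(device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over the per-class similarity scores.
    probs = (100.0 * image_features @ text_features.t()).softmax(dim=-1)

print(classes[probs.argmax().item()])  # predicted class
```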
CLIP has been used as a component in multimodal learning. For example, during the training of Google DeepMind's Flamingo (2022), [33] the authors trained a CLIP pair, with BERT as the text encoder and NormalizerFree ResNet F6 [34] as the image encoder. The image encoder of the CLIP pair was taken with parameters frozen and the text encoder was discarded. The frozen image encoder was then combined with a frozen Chinchilla language model, by finetuning with some further parameters that connect the two frozen models.
CLIP has been used in various domains beyond its original purpose, including text-to-image generation, aesthetic ranking, and image captioning.
The notebook-style snippet below enumerates the released model checkpoints and prints their parameter counts, input resolutions, context lengths, vocabulary sizes, and embedding dimensions:

```python
# Install the reference implementation and download a test image (notebook shell commands).
!pip install git+https://github.com/openai/CLIP.git
!wget https://github.com/openai/CLIP/raw/main/CLIP.png -O CLIP.png

import torch
import clip
from PIL import Image
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"

# Iterate over all released model checkpoints.
for m in clip.available_models():
    model, preprocess = clip.load(m, device=device)

    input_resolution = model.visual.input_resolution
    context_length = model.context_length
    vocab_size = model.vocab_size

    print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
    print("Input resolution:", input_resolution)
    print("Context length:", context_length)
    print("Vocab size:", vocab_size)

    n_params_vision = sum(p.numel() for p in model.visual.parameters())
    n_params_text = sum(p.numel() for p in model.transformer.parameters())
    image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    print(f"Model: {m}, #vision parameters: {n_params_vision:,}, #text parameters: {n_params_text:,}, embedding dimension: {image_features.shape[1]}")
    del model, preprocess, image, image_features
```
[note 4] The report refers to this model inconsistently as both ViT-L/14-336px and ViT-L/14@336px.