Contrastive Language-Image Pre-training

CLIP
Developer(s): OpenAI
Initial release: January 5, 2021
Repository: https://github.com/OpenAI/CLIP
Written in: Python
License: MIT License
Website: openai.com/research/clip

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. [1]

Publication history

CLIP was first announced on OpenAI's official blog on January 5, 2021, [2] accompanied by a report hosted on OpenAI's CDN [3] and a GitHub repository. [4] The paper was posted on arXiv on 26 February 2021. [5]

The report (with its appendix split off into a "Supplementary PDF") was published in the Proceedings of the 38th International Conference on Machine Learning, PMLR, [1] which had a submission deadline of February 2021. [6]

ALIGN, a concurrent work by researchers at Google using essentially the same algorithm, was published at the same conference. [7]

Algorithm

Architecture overview of CLIP.

The CLIP method trains a pair of models contrastively. [1] One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart.
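
The shared embedding space can be used directly for image-text matching. The following is a minimal sketch using the openai/CLIP package released alongside the models (see [12]); the image file and candidate captions are illustrative:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Encode one image and two candidate captions into the shared embedding space
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)
    with torch.no_grad():
        image_vec = model.encode_image(image)   # shape (1, 512) for ViT-B/32
        text_vecs = model.encode_text(texts)    # shape (2, 512)

    # Cosine similarity: the caption that matches the image should score highest
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
    text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)
    print((image_vec @ text_vecs.T).squeeze(0))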

To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of $N$ image-caption pairs. Let the outputs from the text and image models on the $i$-th pair be $w_i$ and $v_i$ respectively. Two vectors are considered "similar" if their dot product is large.

The loss incurred on this batch is the multi-class N-pair loss, [8] which is a symmetric cross-entropy loss over similarity scores:

$$ -\frac{1}{N}\sum_{i=1}^{N}\ln\frac{e^{v_i\cdot w_i/T}}{\sum_{j=1}^{N}e^{v_i\cdot w_j/T}} \;-\; \frac{1}{N}\sum_{j=1}^{N}\ln\frac{e^{v_j\cdot w_j/T}}{\sum_{i=1}^{N}e^{v_i\cdot w_j/T}} $$

In essence, this loss function encourages the dot product between matching image and text vectors ($v_i \cdot w_i$) to be high, while discouraging high dot products between non-matching pairs. The parameter $T > 0$ is the temperature, which is parameterized in the original CLIP model as $T = e^{-\tau}$, where $\tau \in \mathbb{R}$ is a learned parameter.
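
A minimal PyTorch sketch of this symmetric loss, assuming the batch of image vectors v and text vectors w has already been computed and tau is the learned parameter described above:

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(v, w, tau):
        # v: (N, d) image vectors, w: (N, d) text vectors,
        # tau: learned scalar with temperature T = exp(-tau)
        logits = (v @ w.T) * torch.exp(tau)                  # pairwise dot products divided by T
        labels = torch.arange(v.shape[0], device=v.device)   # the i-th image matches the i-th caption
        loss_images = F.cross_entropy(logits, labels)        # image-to-text direction
        loss_texts = F.cross_entropy(logits.T, labels)       # text-to-image direction
        # The paper's pseudocode averages the two directions, which only rescales the loss.
        return loss_images + loss_texts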

Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) [9] proposes the following loss function:

$$ L = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} f\!\left((2\delta_{i,j}-1)\left(e^{t}\, v_i\cdot w_j + b\right)\right) $$

where $f(x) = \ln(1 + e^{-x})$ is the negative log sigmoid loss, $\delta_{i,j}$ is the Kronecker delta, and $t$ and $b$ are learned parameters.
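
A corresponding PyTorch sketch of this sigmoid loss, under the same assumptions about the batch vectors v and w, with learned scalars t and b:

    import torch
    import torch.nn.functional as F

    def sigmoid_loss(v, w, t, b):
        # Every (image i, text j) pair is treated as an independent binary problem:
        # diagonal pairs are positives (+1), all other pairs negatives (-1).
        logits = (v @ w.T) * torch.exp(t) + b
        labels = 2 * torch.eye(v.shape[0], device=v.device) - 1
        # f(x) = ln(1 + exp(-x)) is softplus(-x)
        return F.softplus(-labels * logits).sum() / v.shape[0]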

CLIP models

While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well.

Image model

Vision Transformer architecture. The Rep<CLS> output vector is used as the image encoding for CLIP.

The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer. The size indicators B, L, H, and G stand for base, large, huge, and giant, in increasing order of size.
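
The patch arithmetic implied by this naming convention can be checked directly; for example, at the 224x224 training resolution mentioned below:

    # ViT-L/14 at 224x224 input: 14x14-pixel patches give a 16x16 grid
    image_size, patch_size = 224, 14
    num_patches = (image_size // patch_size) ** 2
    print(num_patches)  # 256 patch tokens, plus one class token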

Besides ViT, the image model can also be a convolutional neural network, such as ResNet (in the original report) or ConvNeXt [10] (in the OpenCLIP model series [11] ).

The output vectors of the image model and the text model must have exactly the same length so that their similarity can be computed as a dot product; both models therefore produce fixed-length vector outputs, whose length is called the "embedding dimension" in the original report. [note 1]

For example, in the original OpenAI model series, the ResNet models have embedding dimensions ranging from 512 to 1024, [5]:Table 19 while the ViTs range from 512 to 768. [5]:Table 20

Models released by OpenAI [12] [note 2]

Model name     | Parameters, total (M) | Parameters, vision (M) | Parameters, text (M) | Embedding dimension | Size (MB)
RN50           | 102                   | 38.3                   | 63.1                 | 1024                | 244
RN101          | 120                   | 56.3                   | 63.1                 | 512                 | 278
RN50x4         | 178                   | 87.1                   | 90.7                 | 640                 | 402
RN50x16        | 291                   | 167.3                  | 123.0                | 768                 | 630
RN50x64        | 623                   | 420.4                  | 201.8                | 1024                | 1260
ViT-B/32       | 151                   | 87.8                   | 63.1                 | 512                 | 338
ViT-B/16       | 150                   | 86.2                   | 63.1                 | 512                 | 335
ViT-L/14       | 428                   | 304.0                  | 123.0                | 768                 | 890
ViT-L/14@336px | 428                   | 304.3                  | 123.0                | 768                 | 891

CLIP's implementation of ViT was the same as the original one, [13] with one modification: a LayerNorm is applied after the position embeddings are added to the initial patch embeddings.
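
A sketch of this modified input stem (module and dimension names are illustrative, not the exact identifiers used in the openai/CLIP code; the width and token count correspond to ViT-L/14 at 224x224):

    import torch
    import torch.nn as nn

    class PatchStem(nn.Module):
        # Patch embeddings + position embeddings, followed by the extra LayerNorm
        def __init__(self, num_tokens=257, width=1024):
            super().__init__()
            self.pos_embed = nn.Parameter(torch.zeros(num_tokens, width))
            self.ln_pre = nn.LayerNorm(width)    # the LayerNorm added by CLIP

        def forward(self, patch_embeds):         # patch_embeds: (batch, num_tokens, width)
            return self.ln_pre(patch_embeds + self.pos_embed)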

Its implementation of ResNet was the same as the original one, [14] with 3 modifications:

  - There are three "stem" convolutions instead of one, with an average pool instead of a max pool, following the ResNet-D modifications. [15]
  - Anti-aliased rect-2 blur pooling is used. [16] [17]
  - The final global average pooling layer is replaced with a multi-head attention pooling mechanism.

ALIGN [7] used EfficientNet [18] of various sizes, a kind of convolutional neural network.

Text model

One decoder layer. The Transformer used in the CLIP text encoder was made by removing the cross-attention module, then stacking the resulting module 12 times.

The text encoding models used in CLIP are typically Transformers.

In the original OpenAI report, the text encoder is a Transformer (63M parameters, 12 layers, width 512, 8 attention heads) using lower-cased byte pair encoding (BPE) with a vocabulary size of 49,152. The context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. [1]:5 Its architecture is the same as GPT-2. [19]

Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). The activations of the highest layer of the transformer at the [EOS] token are taken, a LayerNorm is applied, and a final linear map produces the text encoding of the input sequence. The final linear map has an output dimension equal to the embedding dimension of whichever image encoder it is paired with.
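
A sketch of this pooling step, assuming the top-layer activations, the [EOS] positions, and the learned LayerNorm and projection matrix are given (names are illustrative):

    import torch

    def pool_text_features(hidden, eos_positions, ln_final, text_projection):
        # hidden: (batch, seq_len, width) activations of the highest transformer layer
        # eos_positions: (batch,) index of the [EOS] token in each sequence
        eos_states = hidden[torch.arange(hidden.shape[0]), eos_positions]  # (batch, width)
        # LayerNorm, then the final linear map into the shared embedding space
        return ln_final(eos_states) @ text_projection                      # (batch, embed_dim)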

ALIGN [7] used BERT of various sizes.

Dataset

WebImageText

The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. [1]

The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets.

The dataset is private and has not been released to the public, and little further information about it is available. [note 3]

Others

ALIGN [7] used over one billion image-text pairs, obtained by extracting images and their alt-text tags from a web crawl. The method was described as similar to how the Conceptual Captions dataset [21] was constructed, but instead of complex filtering, they applied only frequency-based filtering.

Later models trained by other organizations used publicly released datasets. For example, LAION trained OpenCLIP on the published datasets LAION-400M, LAION-2B, and DataComp-1B. [22] [11]

Training

In the original OpenAI CLIP report, they reported training 5 ResNet models and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs, while the largest ViT model took 12 days on 256 V100 GPUs.

All ViT models were trained at 224x224 image resolution. The ViT-L/14 was then boosted to 336x336 resolution by FixRes, [23] resulting in the ViT-L/14@336px model. [note 4] They found this to be the best-performing model. [1]:Appendix F. Model Hyperparameters

In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. [24]

Applications

CLIP has found wide applications in various domains, for example as a component of text-to-image generation systems, [25] [26] [27] and for image retrieval [30] [31] and image captioning. [32]

Multimodality

CLIP has been used as a component in multimodal learning.

For example, during the training of Google DeepMind's Flamingo (2022), [33] the authors trained a CLIP pair, with BERT as the text encoder and Normalizer-Free ResNet F6 [34] as the image encoder. The image encoder was kept with its parameters frozen, while the text encoder was discarded. The frozen image encoder was then combined with a frozen Chinchilla language model by finetuning a small number of additional parameters that connect the two frozen models.
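
A minimal sketch of this frozen-encoder pattern (not Flamingo's actual architecture; the connecting module is reduced here to a single linear layer and the widths are illustrative):

    import torch
    import torch.nn as nn

    class FrozenVisionWithAdapter(nn.Module):
        def __init__(self, image_encoder, vision_width=1536, lm_width=2048):
            super().__init__()
            self.image_encoder = image_encoder
            for p in self.image_encoder.parameters():
                p.requires_grad = False                       # freeze the contrastively trained encoder
            self.adapter = nn.Linear(vision_width, lm_width)  # only these parameters are trained

        def forward(self, images):
            with torch.no_grad():
                features = self.image_encoder(images)         # (batch, vision_width)
            return self.adapter(features)                     # fed into the frozen language model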

Notes

  1. Similar to the "embedding dimension" of text embedding in Transformer models.
  2. The per-model parameter counts and embedding dimensions can be computed with the following script, using the openai/CLIP package:

         # pip install git+https://github.com/openai/CLIP.git
         import torch
         import clip
         from PIL import Image

         device = "cuda" if torch.cuda.is_available() else "cpu"
         for m in clip.available_models():
             model, preprocess = clip.load(m, device=device)
             # Parameters of the vision tower
             n_params_vision = sum(p.numel() for p in model.visual.parameters())
             # Parameters of the text tower (transformer, token embedding, final LayerNorm)
             n_params_text = sum(p.numel() for p in model.transformer.parameters())
             n_params_text += sum(p.numel() for p in model.token_embedding.parameters())
             n_params_text += sum(p.numel() for p in model.ln_final.parameters())
             # Encode one image to read off the embedding dimension
             image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
             image_features = model.encode_image(image)
             print(f"Model: {m}, #vision parameters: {n_params_vision:,}, "
                   f"#text parameters: {n_params_text:,}, "
                   f"embedding dimension: {image_features.shape[1]}")
             del model, preprocess, image, image_features
  3. It is not the same as the Wikipedia-based Image Text dataset, also called "WIT". [20]
  4. They referred to this as both ViT-L/14-336px and ViT-L/14@336px, inconsistently throughout the report.

References

  1. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021-07-01). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning. PMLR. pp. 8748–8763.
  2. "Clip: Connecting text and images". OpenAI. January 5, 2021.
  3. "Learning Transferable Visual Models From Natural Language Supervision" (PDF). cdn.openai.com. Archived 5 January 2021. https://web.archive.org/web/20210105204011/https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf
  4. "initial commit · openai/CLIP@b1c4b6b". GitHub. 5 January 2021. Archived from the original on 9 Feb 2021. Retrieved 2024-09-06.
  5. Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021). "Learning Transferable Visual Models From Natural Language Supervision". arXiv: 2103.00020 .
  6. "ICML 2021 Call for Papers". icml.cc. Retrieved 2024-09-06.
  7. Jia, Chao; Yang, Yinfei; Xia, Ye; Chen, Yi-Ting; Parekh, Zarana; Pham, Hieu; Le, Quoc; Sung, Yun-Hsuan; Li, Zhen; Duerig, Tom (2021-07-01). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Proceedings of the 38th International Conference on Machine Learning. PMLR: 4904–4916.
  8. Sohn, Kihyuk (2016). "Improved Deep Metric Learning with Multi-class N-pair Loss Objective". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
  9. Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas (2023). "Sigmoid Loss for Language Image Pre-Training". pp. 11975–11986.
  10. Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". pp. 11976–11986.
  11. Ilharco, Gabriel; Wortsman, Mitchell; Wightman, Ross; Gordon, Cade; Carlini, Nicholas; Taori, Rohan; Dave, Achal; Shankar, Vaishaal; Namkoong, Hongseok (July 2021), OpenCLIP , retrieved 2024-09-06
  12. openai/CLIP, OpenAI, 2024-09-06, retrieved 2024-09-06
  13. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv: 2010.11929 [cs.CV].
  14. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv: 1512.03385 .
  15. He, Tong; Zhang, Zhi; Zhang, Hang; Zhang, Zhongyue; Xie, Junyuan; Li, Mu (2018-12-05), Bag of Tricks for Image Classification with Convolutional Neural Networks, doi:10.48550/arXiv.1812.01187 , retrieved 2024-09-11
  16. Zhang, Richard (2018-09-27). "Making Convolutional Networks Shift-Invariant Again".
  17. Zhang, Richard (2019-06-08), Making Convolutional Networks Shift-Invariant Again, doi:10.48550/arXiv.1904.11486 , retrieved 2024-09-11
  18. Tan, Mingxing; Le, Quoc V. (2020-09-11), EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, doi:10.48550/arXiv.1905.11946 , retrieved 2024-09-06
  19. Radford, Alec; Wu, Jeff; Child, R.; Luan, D.; Amodei, Dario; Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners".
  20. Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc (2021-07-11). "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval: 2443–2449. arXiv: 2103.01913 . doi:10.1145/3404835.3463257.
  21. Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu (July 2018). Gurevych, Iryna; Miyao, Yusuke (eds.). "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 2556–2565. doi: 10.18653/v1/P18-1238 .
  22. Cherti, Mehdi; Beaumont, Romain; Wightman, Ross; Wortsman, Mitchell; Ilharco, Gabriel; Gordon, Cade; Schuhmann, Christoph; Schmidt, Ludwig; Jitsev, Jenia (June 2023). "Reproducible scaling laws for contrastive language-image learning". 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 2818–2829. arXiv: 2212.07143 . doi:10.1109/CVPR52729.2023.00276.
  23. Touvron, Hugo; Vedaldi, Andrea; Douze, Matthijs; Jegou, Herve (2019). "Fixing the train-test resolution discrepancy". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
  24. "laion/CLIP-ViT-L-14-laion2B-s32B-b82K · Hugging Face". huggingface.co. 2023-09-10. Retrieved 2024-09-06.
  25. "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.
  26. Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12), Hierarchical Text-Conditional Image Generation with CLIP Latents, doi:10.48550/arXiv.2204.06125 , retrieved 2024-09-08
  27. Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08), GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, doi:10.48550/arXiv.2112.10741 , retrieved 2024-09-08
  28. Whitaker, Jonathan (2022-05-22). "Fun With Neural Cellular Automata". W&B. Retrieved 2024-09-08.
  29. LAION-AI/aesthetic-predictor, LAION AI, 2024-09-06, retrieved 2024-09-08
  30. Haltakov, Vladimir (2024-09-03), haltakov/natural-language-image-search , retrieved 2024-09-06
  31. Beaumont, Romain (2024-09-07), rom1504/clip-retrieval , retrieved 2024-09-08
  32. Mokady, Ron; Hertz, Amir; Bermano, Amit H. (2021). "ClipCap: CLIP Prefix for Image Captioning". doi:10.48550/ARXIV.2111.09734.
  33. Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022-12-06). "Flamingo: a Visual Language Model for Few-Shot Learning". Advances in Neural Information Processing Systems. 35: 23716–23736.
  34. Brock, Andy; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021-07-01). "High-Performance Large-Scale Image Recognition Without Normalization". Proceedings of the 38th International Conference on Machine Learning. PMLR: 1059–1071.