Vision transformer

A vision transformer (ViT) is a transformer designed for computer vision. [1] A ViT breaks down an input image into a series of patches (rather than breaking up text into tokens), serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

ViT has found applications in image recognition, image segmentation, and autonomous driving. [2]

History

Transformers were introduced in 2017, in a paper "Attention Is All You Need", [3] and have found widespread use in natural language processing. In 2020, they were adapted for computer vision, yielding ViT. [1]

In 2021, a pure transformer model demonstrated better performance and greater efficiency than CNNs on image classification. [2]

A study in June 2021 added a transformer backend to ResNet, which dramatically reduced costs and increased accuracy. [4] [5] [6]

In the same year, several important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Among the most relevant is the Swin Transformer, [7] which, through modifications to the attention mechanism and a multi-stage approach, achieved state-of-the-art results on some object detection datasets such as COCO. Another notable variant is the TimeSformer, designed for video understanding tasks, which captures spatial and temporal information through the use of divided space-time attention. [8] [9]

Overview

The basic architecture, used by the original 2020 paper, [1] is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type $\mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are its height, width, and number of channels (3 for RGB). It is then split into square patches, each of type $\mathbb{R}^{P \times P \times C}$.

Each patch is passed through a linear operator to obtain a vector, the "patch embedding". The position of the patch is also transformed into a vector by "position encoding". The two vectors are added, then passed through several Transformer encoder layers.
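
The following is a minimal sketch of this patch-embedding step in PyTorch. The 224x224 input size, 16x16 patch size, and 768-dimensional embedding are assumptions taken from common ViT configurations, and all names are illustrative rather than from the original paper.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into patches and linearly project each one (illustrative sketch)."""

        def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
            super().__init__()
            self.patch_size = patch_size
            num_patches = (image_size // patch_size) ** 2                       # 14 * 14 = 196
            self.proj = nn.Linear(patch_size * patch_size * channels, dim)      # one matrix multiplication
            self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches, dim)) # learned position encoding

        def forward(self, images):                            # images: (batch, 3, 224, 224)
            b, c, _, _ = images.shape
            p = self.patch_size
            # Cut into non-overlapping p x p patches and flatten each one into a vector.
            patches = images.unfold(2, p, p).unfold(3, p, p)  # (b, c, 14, 14, p, p)
            patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
            return self.proj(patches) + self.pos_embedding    # token sequence for the encoder

The resulting tokens can then be fed to a stack of standard Transformer encoder layers (for example, torch.nn.TransformerEncoder).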

The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how in natural language processing, as representation vectors flow through a transformer, they incorporate more and more semantic relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network. [1]
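
A minimal sketch of such a head, assuming a 768-dimensional class-token representation and 1000 output classes (both sizes are assumptions, not values taken from the source):

    import torch.nn as nn

    # Illustrative linear-GELU-linear-softmax classification head; sizes are assumptions.
    classification_head = nn.Sequential(
        nn.Linear(768, 3072),   # hidden layer on top of the class-token representation
        nn.GELU(),
        nn.Linear(3072, 1000),  # one logit per class
        nn.Softmax(dim=-1),     # probability distribution over classes
    )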

Variants

Original ViT

Figure: Vision Transformer architecture for image classification.

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet, [10] DenseNet, [11] and Inception. [2]

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel, but computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels within small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. Each section is flattened into a linear sequence and multiplied by a learnable embedding matrix; the resulting patch embedding, together with a position embedding, is placed in the sequence that is fed to the transformer. [2]
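
To make the quadratic cost concrete, a quick back-of-the-envelope comparison (the 224x224 resolution is an assumed example; the 16x16 patch size is the one mentioned above):

    # Attention cost grows quadratically with the number of tokens (illustrative numbers).
    image_size, patch_size = 224, 16
    pixels = image_size ** 2                     # 50,176 pixels
    patches = (image_size // patch_size) ** 2    # 196 patches

    print(f"pixel-level attention pairs: {pixels ** 2:,}")    # ~2.5 billion
    print(f"patch-level attention pairs: {patches ** 2:,}")   # 38,416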

As in the case of BERT, a fundamental role in classification tasks is played by the class token: a special token whose final representation, having been influenced by all the other tokens, is the only input to the final MLP head.
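
A minimal sketch of the class-token mechanism, with all sizes and names chosen purely for illustration:

    import torch
    import torch.nn as nn

    # A learnable class token is prepended to the patch tokens; only its final
    # representation feeds the classification head (illustrative sketch).
    batch, num_patches, dim = 8, 196, 768
    cls_token = nn.Parameter(torch.zeros(1, 1, dim))
    patch_tokens = torch.randn(batch, num_patches, dim)       # output of the patch-embedding step

    tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)   # (8, 197, 768)
    encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    encoded = nn.TransformerEncoder(encoder_layer, num_layers=2)(tokens)

    class_representation = encoded[:, 0]                      # fed to the MLP head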

The architecture for image classification is the most common; it uses only the Transformer encoder to transform the various input tokens. However, there are other applications in which the decoder part of the traditional Transformer architecture is also used.

Masked Autoencoder

In the Masked Autoencoder, two ViTs are placed end-to-end. The first takes in image patches with positional encoding and outputs vectors representing each patch. The second takes in vectors with positional encoding and outputs image patches again. During training, both ViTs are used: an image is cut into patches, and only 25% of the patches are fed to the first ViT; the second ViT takes the encoded vectors and outputs a reconstruction of the full image. After training, only the first ViT is used. [12]
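
A minimal sketch of the random masking step, assuming the 25% keep-rate described above; shapes and names are illustrative, not taken from the paper's code.

    import torch

    # Randomly keep 25% of the patch tokens, as in masked-autoencoder pretraining.
    batch, num_patches, dim = 8, 196, 768
    tokens = torch.randn(batch, num_patches, dim)             # patch embeddings with position encoding

    keep = int(num_patches * 0.25)                            # 49 visible patches
    noise = torch.rand(batch, num_patches)                    # random score per patch
    ids_shuffle = noise.argsort(dim=1)                        # random permutation of patch indices
    visible_ids = ids_shuffle[:, :keep]                       # indices of the patches the encoder sees

    visible_tokens = torch.gather(
        tokens, 1, visible_ids.unsqueeze(-1).expand(-1, -1, dim)
    )                                                         # (8, 49, 768) -> first ViT (encoder)
    # The decoder later receives the encoded visible tokens plus learnable mask tokens
    # at the masked positions, and reconstructs the full image.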

Swin Transformer

The Swin Transformer ("Shifted windows") [7] takes inspiration from standard convolutional neural networks:
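
A minimal sketch of the window-partitioning idea; the window size, feature-map size, and dimensions below are illustrative assumptions.

    import torch

    # Partition a feature map into non-overlapping local windows so that
    # self-attention is computed only within each window (illustrative sketch).
    batch, height, width, dim, window = 1, 56, 56, 96, 7
    x = torch.randn(batch, height, width, dim)

    windows = (
        x.view(batch, height // window, window, width // window, window, dim)
         .permute(0, 1, 3, 2, 4, 5)
         .reshape(-1, window * window, dim)      # (64 windows, 49 tokens, 96 dims)
    )

    # "Shifted windows": in alternating layers the feature map is rolled by half a window
    # before partitioning, so information can flow across window boundaries.
    shifted = torch.roll(x, shifts=(-(window // 2), -(window // 2)), dims=(1, 2))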

It is improved upon by Swin Transformer V2, [13] which modifies the attention mechanism (using scaled cosine attention and a log-spaced continuous relative-position bias) in order to scale the model to larger capacities and higher input resolutions.

ViT-VQGAN

In ViT-VQGAN, [14] there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors are restricted to a discrete set, the "codebook", as in vector quantization. The other maps the quantized vectors back to image patches. The training objective encourages the reconstructed (output) image to be faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) attempts to decide whether an image is an original real image or a reconstruction produced by the ViT.

The idea is essentially the same as a vector-quantized variational autoencoder (VQVAE) combined with a generative adversarial network (GAN).

After such a ViT-VQGAN is trained, it can be used to encode an arbitrary image into a list of symbols, and to decode an arbitrary list of symbols into an image. The list of symbols can be used to train a standard autoregressive transformer (like GPT) to generate images autoregressively. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer on the result. At test time, one can then provide an image caption and have the model autoregressively generate the image. This is the structure of Google Parti. [15]
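
A minimal sketch of the vector-quantization step at the heart of this scheme; the codebook size and dimensions are illustrative assumptions.

    import torch

    # Nearest-neighbour codebook lookup, the core of vector quantization.
    num_codes, dim = 8192, 32
    codebook = torch.randn(num_codes, dim)               # learned during training in practice
    encoder_output = torch.randn(64, dim)                # one vector per 8x8 patch of the image

    distances = torch.cdist(encoder_output, codebook)    # (64, 8192) pairwise distances
    codes = distances.argmin(dim=1)                      # a discrete symbol per patch
    quantized = codebook[codes]                          # vectors passed to the decoding ViT

    # The integer `codes` form the symbol sequence on which a GPT-style transformer
    # can be trained to generate images autoregressively.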

Comparison with Convolutional Neural Networks

Because of the comparatively large patch size commonly used, ViT performance depends more heavily than that of convolutional networks on choices such as the optimizer, dataset-specific hyperparameters, and network depth. Preprocessing with a layer of small, overlapping (stride < size) convolutional filters helps with performance and stability. [6]
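
A sketch of such a convolutional stem with overlapping filters (stride smaller than the kernel size); the exact channel counts and layer structure below are assumptions for illustration.

    import torch.nn as nn

    # Convolutional stem with small, overlapping 3x3 filters used before the transformer,
    # instead of a single large-patch projection (illustrative sketch).
    conv_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, 768, kernel_size=1),   # project to the transformer width
    )
    # The resulting feature map is flattened into a token sequence and fed to the encoder.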

The CNN translates from the basic pixel level to a feature map. A tokenizer translates the feature map into a series of tokens that are then fed into the transformer, which applies the attention mechanism to produce a series of output tokens. Finally, a projector reconnects the output tokens to the feature map. The latter allows the analysis to exploit potentially significant pixel-level details. This drastically reduces the number of tokens that need to be analyzed, reducing costs accordingly. [4]

The differences between CNNs and Vision Transformers are many and lie mainly in their architectural choices.

In practice, CNNs achieve excellent results even when trained on data volumes far smaller than those required by Vision Transformers.

This different behaviour seems to derive from their different inductive biases. The filter-oriented architecture of CNNs lets these networks grasp the particularities of the analysed images more quickly, but it also limits them, making it harder to capture global relations. [16] [17]

Vision Transformers, on the other hand, are biased toward exploring relationships between patches across the whole image, which allows them to capture global and longer-range relations, but at the cost of more data-hungry training.

Vision Transformers also proved to be much more robust to input image distortions such as adversarial patches or permutations. [18]

However, committing to one architecture over the other is not always necessary: excellent results have been obtained on several computer vision tasks with hybrid architectures that combine convolutional layers with Vision Transformers. [19] [20] [21]

The Role of Self-Supervised Learning

The considerable need for data during the training phase has made it essential to find alternative training methods, [22] and a central role is now played by self-supervised methods. With these approaches, a neural network can be trained in an almost autonomous way, deducing the peculiarities of a specific problem without a large dataset or accurately assigned labels. Being able to train a Vision Transformer without a huge vision dataset could be the key to the widespread adoption of this promising architecture.

Applications

Vision Transformers have been used in many computer vision tasks with excellent results, in some cases even state-of-the-art.

Among the most relevant areas of application are image classification, object detection, image segmentation, video understanding, and autonomous driving.

Vision Transformer-based algorithms such as DINO (self-distillation with no labels) [23] also show promising properties on biological datasets such as images generated with the Cell Painting assay. DINO has been demonstrated to learn image representations which could be used to cluster images and explore morphological profiles in a feature space. [24]

Implementations

Many open-source implementations of Vision Transformers and their variants are available online. The main versions of this architecture have been implemented in PyTorch, [25] but implementations are also available for TensorFlow. [26]
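
As an example, recent versions of torchvision ship pretrained ViT models; the following is a minimal usage sketch, treating the exact model name and weights API as assumptions to verify against the installed library version.

    import torch
    from torchvision.models import ViT_B_16_Weights, vit_b_16

    # Load a pretrained ViT-B/16 and run it on a stand-in 224x224 image.
    model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()

    image = torch.randn(1, 3, 224, 224)       # placeholder for a preprocessed image
    with torch.no_grad():
        logits = model(image)                 # (1, 1000) ImageNet class scores
    print(logits.argmax().item())             # index of the predicted class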

See also

Transformer (deep learning architecture)
Attention (machine learning)
Convolutional neural network
Perceiver
Self-supervised learning

References

  1. 1 2 3 4 Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv: 2010.11929 [cs.CV].
  2. 1 2 3 4 Sarkar, Arjun (2021-05-20). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2021-07-11.
  3. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 30.
  4. 1 2 "Facebook and UC Berkeley Boost CV Performance and Lower Compute Cost With Visual Transformers". Medium. 2020-06-12. Retrieved 2021-07-11.
  5. Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Masayoshi, Tomizuka; Gonzalez, Joseph; Keutzer, Kurt; Vajda, Peter (2020). "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". arXiv: 2006.03677 [cs.CV].
  6. 1 2 Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross (2021-06-28). "Early Convolutions Help Transformers See Better". arXiv: 2106.14881 [cs.CV].
  7. 1 2 Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining (2021-03-25). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows". arXiv: 2103.14030 [cs.CV].
  8. Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo (2021-02-09). "Is Space-Time Attention All You Need for Video Understanding?". arXiv: 2102.05095 [cs.CV].
  9. Coccomini, Davide (2021-03-31). "On Transformers, TimeSformers, and Attention. An exciting revolution from text to videos" . Towards Data Science.
  10. Tan, Mingxing; Le, Quoc (23 June 2021). "EfficientNetV2: Smaller Models and Faster Training" (PDF). Proceedings of the 38th International Conference on Machine Learning (PMLR). 139: 10096–10106. arXiv: 2104.00298 . Retrieved 31 October 2023.
  11. Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Q. Weinberger, Kilian (28 Jan 2018). "Densely Connected Convolutional Networks". arXiv: 1608.06993 [cs.CV].
  12. He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, Piotr; Girshick, Ross (2021). "Masked Autoencoders Are Scalable Vision Learners". arXiv: 2111.06377 [cs.CV].
  13. Liu, Ze; Hu, Han; Lin, Yutong; Yao, Zhuliang; Xie, Zhenda; Wei, Yixuan; Ning, Jia; Cao, Yue; Zhang, Zheng; Dong, Li; Wei, Furu; Guo, Baining (2022). "Swin Transformer V2: Scaling Up Capacity and Resolution". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12009–12019.
  14. Yu, Jiahui; Li, Xin; Koh, Jing Yu; Zhang, Han; Pang, Ruoming; Qin, James; Ku, Alexander; Xu, Yuanzhong; Baldridge, Jason; Wu, Yonghui (2021). "Vector-quantized Image Modeling with Improved VQGAN". arXiv: 2110.04627 [cs.CV].
  15. "Parti: Pathways Autoregressive Text-to-Image Model". sites.research.google. Retrieved 2023-11-03.
  16. Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2021-08-19). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv: 2108.08810 [cs.CV].
  17. Coccomini, Davide (2021-07-24). "Vision Transformers or Convolutional Neural Networks? Both!" . Towards Data Science.
  18. Naseer, Muzammal; Ranasinghe, Kanchana; Khan, Salman; Hayat, Munawar; Khan, Fahad Shahbaz; Yang, Ming-Hsuan (2021-05-21). "Intriguing Properties of Vision Transformers". arXiv: 2105.10497 [cs.CV].
  19. Dai, Zihang; Liu, Hanxiao; Le, Quoc V.; Tan, Mingxing (2021-06-09). "CoAtNet: Marrying Convolution and Attention for All Data Sizes". arXiv: 2106.04803 [cs.CV].
  20. Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (2021-03-29). "CvT: Introducing Convolutions to Vision Transformers". arXiv: 2103.15808 [cs.CV].
  21. Coccomini, Davide; Messina, Nicola; Gennaro, Claudio; Falchi, Fabrizio (2022). "Combining Efficient Net and Vision Transformers for Video Deepfake Detection". Image Analysis and Processing – ICIAP 2022. Lecture Notes in Computer Science. Vol. 13233. pp. 219–229. arXiv: 2107.02612 . doi:10.1007/978-3-031-06433-3_19. ISBN   978-3-031-06432-6. S2CID   235742764.
  22. Coccomini, Davide (2021-07-24). "Self-Supervised Learning in Vision Transformers" . Towards Data Science.
  23. Caron, Mathilde; Touvron, Hugo; Misra, Ishan; Jegou, Herve; Mairal, Julien; Bojanowski, Piotr; Joulin, Armand (October 2021). "Emerging Properties in Self-Supervised Vision Transformers". 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 9630–9640. arXiv: 2104.14294 . doi:10.1109/iccv48922.2021.00951. ISBN   978-1-6654-2812-5.
  24. Doron, Michael; Moutakanni, Théo; Chen, Zitong S.; Moshkov, Nikita; Caron, Mathilde; Touvron, Hugo; Bojanowski, Piotr; Pernice, Wolfgang M.; Caicedo, Juan C. (2023-06-18). "Unbiased single-cell morphology with self-supervised vision transformers". bioRxiv. doi:10.1101/2023.06.16.545359. PMC 10312751. Retrieved 2024-02-12.
  25. vit-pytorch on GitHub
  26. Salama, Khalid (2021-01-18). "Image classification with Vision Transformer". keras.io.