Vision transformer

The architecture of Vision Transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer, before entering a standard Transformer encoder.

A vision transformer (ViT) is a transformer designed for computer vision. [1] A ViT breaks down an input image into a series of patches (rather than breaking up text into tokens), serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
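As a rough illustration of this patch-to-embedding step, the following sketch (assuming PyTorch; the 16x16 patch size matches the original paper, while the 768-dimensional embedding is merely illustrative) splits an image into patches and projects each flattened patch with a single learned matrix.

    # Minimal sketch: split an image into patches and project each one
    # with a single matrix multiplication (a linear layer).
    import torch

    image = torch.randn(3, 224, 224)    # RGB image, height = width = 224
    patch_size = 16                     # 16x16 patches, as in the original ViT
    d_model = 768                       # embedding dimension (illustrative)

    # Cut into non-overlapping 16x16 patches and flatten each into a vector.
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
    # patches now has shape (196, 768): 14*14 patches, each flattened to 768 numbers.

    # One learned matrix maps every flattened patch to the model dimension.
    projection = torch.nn.Linear(3 * patch_size * patch_size, d_model)
    patch_embeddings = projection(patches)   # shape (196, d_model)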

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency. [2] Compared to CNNs, ViTs are less data efficient but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22 billion parameters. [3] [4]

Following its publication, many variants were proposed, including hybrid architectures that combine features of both ViTs and CNNs. ViTs have found applications in image recognition, image segmentation, and autonomous driving. [5] [6]

History

Transformers were introduced in Attention Is All You Need (2017), [7] and have found widespread use in natural language processing. A 2019 paper [8] applied ideas from the Transformer to computer vision. Specifically, they started with a ResNet, a standard convolutional neural network used for computer vision, and replaced all convolutional kernels by the self-attention mechanism found in a Transformer. It resulted in superior performance. However, it is not a Vision Transformer.

In 2020, an encoder-only Transformer was adapted for computer vision, yielding the ViT, which reached state-of-the-art performance in image classification, overcoming the previous dominance of CNNs. [1] The masked autoencoder (2022) extended ViT to self-supervised training. The vision transformer and the masked autoencoder, in turn, stimulated new developments in convolutional neural networks. [9] [10]

Subsequently, there was cross-fertilization between the previous CNN approach and the ViT approach.

In 2021, several important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Two studies [11] [12] improved the efficiency and robustness of ViT by adding a CNN as a preprocessor. The Swin Transformer [13] achieved state-of-the-art results on some object detection datasets such as COCO by using a convolution-like sliding-window attention mechanism and the pyramid process of classical computer vision.

Overview

Vision Transformer architecture, showing the encoder-only Transformer blocks inside.

The basic architecture, used by the original 2020 paper, [1] is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type $\mathbb{R}^{H \times W \times C}$, where $H, W, C$ are the height, width, and number of channels (3 for RGB). It is then split into square-shaped patches of type $\mathbb{R}^{P \times P \times C}$.

Each patch is then pushed through a linear operator to obtain a vector (the "patch embedding"). The position of the patch is also transformed into a vector by a "position encoding". The two vectors are added, and the result is pushed through several Transformer encoder layers.
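A minimal sketch of these steps, assuming PyTorch; the layer count, head count, and widths below are the ViT-Base settings and are used only for illustration.

    # Add a learned position embedding to each patch embedding, then pass the
    # sequence through a stack of standard Transformer encoder layers.
    import torch
    import torch.nn as nn

    num_patches, d_model = 196, 768
    patch_embeddings = torch.randn(1, num_patches, d_model)   # (batch, tokens, dim)

    pos_embedding = nn.Parameter(torch.zeros(1, num_patches, d_model))  # learned
    tokens = patch_embeddings + pos_embedding                  # added, not concatenated

    encoder_layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=12, dim_feedforward=3072,
        activation="gelu", batch_first=True, norm_first=True)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

    output_tokens = encoder(tokens)                            # (1, 196, 768)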

The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how in natural language processing, as representation vectors flow through a transformer, they incorporate more and more semantic relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network. [1]
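A minimal sketch of such a head, assuming PyTorch; the hidden width and number of classes are placeholders.

    # Shallow linear-GELU-linear head followed by softmax, applied to one
    # pooled vector (for example, the output at the [CLS] token).
    import torch
    import torch.nn as nn

    d_model, hidden, num_classes = 768, 3072, 1000   # illustrative sizes

    head = nn.Sequential(
        nn.Linear(d_model, hidden),
        nn.GELU(),
        nn.Linear(hidden, num_classes),
    )

    pooled = torch.randn(1, d_model)                 # e.g. the [CLS] output vector
    probs = head(pooled).softmax(dim=-1)             # probability distribution over classes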

Variants

Original ViT

The original ViT was an encoder-only Transformer trained with supervision to predict the image label from the patches of the image. As in BERT, a special token <CLS> is added on the input side, and the corresponding output vector is used as the only input to the final MLP head. The special token is an architectural hack that lets the model compress all information relevant for predicting the image label into one vector.

Animation of ViT. The 0th token is the special <CLS>. The other 9 patches are projected by a linear layer before being fed into the Transformer encoder as input tokens 1 to 9.

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet, [14] DenseNet, [15] and Inception. [16]

Transformers measure the relationships between pairs of input tokens (words, in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel, but computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among small sections of the image (e.g., 16x16-pixel patches), at a drastically reduced cost. Each section is flattened into a vector and multiplied by a learnable embedding matrix; the resulting patch embeddings, combined with position embeddings, are fed to the transformer. [16]
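To make the cost argument concrete, the back-of-the-envelope calculation below (assuming a 224x224 image and 16x16 patches, as in the original paper) compares the number of pairwise attention scores at the pixel level and at the patch level.

    # Attention cost grows with the square of the number of tokens.
    height = width = 224
    patch_size = 16

    pixels = height * width                  # 50,176 tokens if every pixel were a token
    patches = (height // patch_size) ** 2    # 196 tokens with 16x16 patches

    print(pixels ** 2)    # ~2.5e9 pairwise scores per attention layer
    print(patches ** 2)   # 38,416 pairwise scores: roughly a 65,000x reduction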

Architectural improvements

Pooling

After the ViT processes an image, it produces some embedding vectors. These must be converted to a single class probability prediction by some kind of network. In the original ViT and Masked Autoencoder, they used a dummy [CLS] token, in emulation of the BERT language model. The output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution.

Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good. [17]

Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors $x_1, x_2, \ldots, x_n$, which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer $\mathrm{FF}$ on each vector, resulting in a matrix $V = [\mathrm{FF}(x_1), \ldots, \mathrm{FF}(x_n)]$. This is then sent to a multiheaded attention, resulting in $\mathrm{MultiheadedAttention}(Q, V, V)$, where $Q$ is a matrix of trainable parameters. [18] This was first proposed in the Set Transformer architecture. [19]
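A minimal sketch of MAP, assuming PyTorch and a single learned query vector; the dimensions are illustrative and real pooling heads may differ in detail.

    # Multihead attention pooling: a small set of trainable query vectors
    # attends over the ViT output tokens; the attended result is the pooled vector.
    import torch
    import torch.nn as nn

    d_model, num_heads, num_queries = 768, 12, 1

    tokens = torch.randn(1, 196, d_model)                       # ViT output tokens
    query = nn.Parameter(torch.zeros(1, num_queries, d_model))  # trainable query matrix Q
    feedforward = nn.Linear(d_model, d_model)                   # per-token feedforward
    attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    values = feedforward(tokens)                 # feedforward applied to each token
    pooled, _ = attention(query, values, values) # (1, 1, d_model) pooled representation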

Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling. [18] [20] A variant of MAP was proposed as class attention, which applies MAP, then feedforward, then MAP again. [21]

Re-attention was proposed to allow training deep ViT. It changes the multiheaded attention module. [22]

Masked Autoencoder

Masked Autoencoder architecture.

The Masked Autoencoder [23] took inspiration from denoising autoencoders. It has two ViTs put end-to-end. The first one ("encoder") takes in image patches with positional encoding, and outputs vectors representing each patch. The second one (called "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again. During training, both the encoder and the decoder ViTs are used. During inference, only the encoder ViT is used.

During training, each image is cut into patches, with their positional embeddings added. Of these, only 25% of the patches are kept; the rest are masked out. The encoder ViT processes only the kept patches; no mask tokens are used. Mask tokens are then added back in at the masked positions, positional embeddings are added again, and the result is processed by the decoder ViT, which outputs a reconstruction of the full image. The loss is the total mean-squared error in pixel space over all masked patches (reconstruction loss is not computed for the unmasked patches).
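The following is a hedged sketch of this training step, assuming PyTorch; the tiny two-layer encoder and decoder, the patch count, and the random pixel targets are placeholders standing in for the full ViTs and the actual image patches.

    # Masked-autoencoder style training step: encode only the visible patches,
    # insert mask tokens, decode, and compute MSE on the masked patches only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_patches, d_model, keep_ratio = 196, 768, 0.25
    patch_tokens = torch.randn(num_patches, d_model)     # patch embeddings + positions

    # Randomly keep 25% of the patches; the rest are masked out.
    perm = torch.randperm(num_patches)
    num_keep = int(num_patches * keep_ratio)
    keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
    decoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
    to_pixels = nn.Linear(d_model, 16 * 16 * 3)          # reconstruct each 16x16 RGB patch

    # The encoder sees only the visible patches (no mask tokens).
    encoded = encoder(patch_tokens[keep_idx].unsqueeze(0))

    # Mask tokens are inserted back at the masked positions before decoding.
    mask_token = nn.Parameter(torch.zeros(d_model))
    full = torch.zeros(1, num_patches, d_model)
    full[0, keep_idx] = encoded[0]
    full[0, mask_idx] = mask_token
    reconstruction = to_pixels(decoder(full))            # (1, 196, 768) predicted pixels

    # Loss is mean-squared error on the masked patches only.
    target = torch.randn(1, num_patches, 16 * 16 * 3)    # ground-truth patch pixels
    loss = F.mse_loss(reconstruction[:, mask_idx], target[:, mask_idx])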

A similar, concurrently published architecture is BEiT (BERT pre-training of image Transformers). [24]

DINO

Like the Masked Autoencoder, the DINO (self-distillation with no labels) method is a way to train a ViT by self-supervision. [25] DINO is a form of teacher-student self-distillation: the student is the model itself, and the teacher is an exponential average of the student's past states. The method is similar to previous works such as momentum contrast [26] and bootstrap your own latent (BYOL). [27]

The loss function used in DINO is the cross-entropy loss between the output of the teacher network ($f_{\theta'_t}$) and the output of the student network ($f_{\theta_t}$). The teacher network's parameters are an exponentially decaying average of the student network's past parameters: $\theta'_t = \lambda\,\theta'_{t-1} + (1-\lambda)\,\theta_t$. The inputs to the networks are two different crops of the same image, represented as $T(x)$ and $T'(x)$, where $x$ is the original image, and the loss is the cross-entropy $H\big(f_{\theta'_t}(T(x)),\, f_{\theta_t}(T'(x))\big)$. One issue is that the network can "collapse" by always outputting the same value regardless of the input. To prevent this collapse, DINO employs two strategies: sharpening the teacher's output distribution with a low softmax temperature, and centering it by subtracting a running mean of the teacher outputs.
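A minimal sketch of the teacher-student update, assuming PyTorch and omitting DINO's centering and sharpening details; the linear "networks", crop tensors, and momentum value are placeholders.

    # DINO-style update: the teacher is an exponential moving average of the
    # student, and the loss is the cross-entropy between their outputs on two crops.
    import copy
    import torch
    import torch.nn.functional as F

    student = torch.nn.Linear(768, 256)        # stands in for the student ViT + head
    teacher = copy.deepcopy(student)           # teacher starts as a copy of the student
    for p in teacher.parameters():
        p.requires_grad_(False)

    crop_a, crop_b = torch.randn(4, 768), torch.randn(4, 768)  # two augmented views

    teacher_probs = F.softmax(teacher(crop_a), dim=-1)          # teacher sees one crop
    student_logprobs = F.log_softmax(student(crop_b), dim=-1)   # student sees the other
    loss = -(teacher_probs * student_logprobs).sum(dim=-1).mean()  # cross-entropy

    # After each student update, the teacher tracks the student's past parameters.
    momentum = 0.996
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1 - momentum)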

Swin Transformer

The Swin Transformer ("Shifted windows") [13] took inspiration from standard CNNs: instead of attending globally over all patches, it computes self-attention within local windows that are shifted between layers, and it merges patches as depth increases to build a hierarchical, pyramid-like feature map.

It was improved by Swin Transformer V2, [28] which modifies the attention mechanism [13] to scale up model capacity and input resolution.

TimeSformer

The TimeSformer [29] was designed for video understanding tasks, and it applied a factorized self-attention, similar to the factorized convolution kernels found in the Inception CNN architecture. [30] Schematically, it divides a video into frames, and each frame into a square grid of patches (same as ViT). Let each patch coordinate be denoted by $(x, y, t)$, denoting horizontal position, vertical position, and time. A space attention layer restricts each query patch to attend only to key and value patches in the same frame ($t' = t$), while a time attention layer restricts attention to patches at the same spatial location in other frames ($x' = x$, $y' = y$).

The TimeSformer also considered other attention layer designs, such as the "height attention layer", where the requirement is $x' = x$ and $t' = t$. However, they found empirically that the best design interleaves one space attention layer and one time attention layer.
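A hedged sketch of this divided space-time attention, assuming PyTorch; the grid sizes and the use of two separate nn.MultiheadAttention modules are illustrative rather than a faithful reimplementation of TimeSformer.

    # Factorized attention over a (time, height, width) grid of patch tokens:
    # time attention mixes tokens across frames at a fixed spatial location,
    # space attention mixes tokens within a frame, instead of full T*H*W attention.
    import torch
    import torch.nn as nn

    T, H, W, d_model, heads = 8, 14, 14, 768, 12
    tokens = torch.randn(1, T, H, W, d_model)

    time_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
    space_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    # Time attention: each (x, y) location becomes a separate sequence of length T.
    x = tokens.permute(0, 2, 3, 1, 4).reshape(H * W, T, d_model)
    x, _ = time_attn(x, x, x)
    x = x.reshape(1, H, W, T, d_model).permute(0, 3, 1, 2, 4)

    # Space attention: each frame becomes a separate sequence of length H*W.
    y = x.reshape(T, H * W, d_model)
    y, _ = space_attn(y, y, y)
    out = y.reshape(1, T, H, W, d_model)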

ViT-VQGAN

In ViT-VQGAN, [31] there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors can only come from a discrete, learned set of vectors called the "codebook", as in vector quantization. The other maps the quantized vectors back to image patches. The training objective attempts to make the reconstructed image (the output image) faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) attempts to decide whether an image is an original real image or a reconstruction by the ViT.
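A minimal sketch of the quantization step, assuming PyTorch; the codebook size and vector dimension are placeholders.

    # Vector quantization: each encoder output vector is replaced by its nearest
    # codebook entry, so the image becomes a grid of discrete symbol indices.
    import torch

    codebook_size, d_latent = 8192, 32            # illustrative sizes
    codebook = torch.randn(codebook_size, d_latent)
    patch_vectors = torch.randn(196, d_latent)    # encoder output, one vector per patch

    distances = torch.cdist(patch_vectors, codebook)   # (196, codebook_size) distances
    indices = distances.argmin(dim=-1)                 # discrete symbols, one per patch
    quantized = codebook[indices]                      # vectors fed to the decoder

The resulting indices are the discrete symbols that the autoregressive transformer described below can be trained on.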

The idea is essentially the same as vector quantized variational autoencoder (VQVAE) plus generative adversarial network (GAN).

After such a ViT-VQGAN is trained, it can be used to code an arbitrary image into a list of symbols, and to decode an arbitrary list of symbols into an image. The list of symbols can be used to train a standard autoregressive transformer (like GPT) to generate images autoregressively. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer on the result. Then at test time, one can simply give an image caption, and have it autoregressively generate the image. This is the structure of Google Parti. [32]

Others

Other examples include the visual transformer, [33] CoAtNet, [34] CvT, [35] the data-efficient ViT (DeiT), [36] etc.

Comparison with CNNs

Typically, ViT uses patch sizes larger than standard CNN kernels (3x3 to 7x7). ViT is more sensitive to the choice of the optimizer, hyperparameters, and network depth. Preprocessing with a layer of smaller-size, overlapping (stride < size) convolutional filters helps with performance and stability. [12]
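A hedged sketch of such a convolutional preprocessor ("conv stem"), assuming PyTorch; the channel widths and three-layer depth are illustrative.

    # A few small, overlapping (stride < kernel size) convolutions replace the
    # single large-patch projection before the Transformer encoder.
    import torch
    import torch.nn as nn

    stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),   # overlapping 3x3 filters
        nn.GELU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        nn.GELU(),
        nn.Conv2d(128, 768, kernel_size=3, stride=2, padding=1),
    )

    image = torch.randn(1, 3, 224, 224)
    features = stem(image)                        # (1, 768, 28, 28) feature map
    tokens = features.flatten(2).transpose(1, 2)  # (1, 784, 768) tokens for the ViT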

This different behavior seems to derive from the different inductive biases they possess.

CNNs apply the same set of filters across the entire image, which makes them more data efficient and less sensitive to local perturbations. [2] ViTs apply self-attention, allowing them to easily capture long-range relationships between patches. They require more data to train, but they continue to benefit from additional training data, whereas CNN performance may saturate once the training dataset is large enough. ViTs also appear more robust to input image distortions such as adversarial patches or permutations. [37]

Applications

ViTs have been used in many computer vision tasks with excellent, and in some cases state-of-the-art, results, including image classification, object detection, video deepfake detection, [38] image segmentation, anomaly detection, image synthesis, cluster analysis, and autonomous driving. [5] [6]

ViTs have been used for image generation as backbones for GANs [39] and for diffusion models (diffusion transformer, or DiT). [40]

DINO [25] has been demonstrated to learn useful representations for clustering images and exploring morphological profiles on biological datasets, such as images generated with the Cell Painting assay. [41]

References

  1. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv: 2010.11929 [cs.CV].
  2. Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2021-08-19). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv: 2108.08810 [cs.CV].
  3. Dehghani, Mostafa; Djolonga, Josip; Mustafa, Basil; Padlewski, Piotr; Heek, Jonathan; Gilmer, Justin; Steiner, Andreas; Caron, Mathilde; Geirhos, Robert (2023-02-10), Scaling Vision Transformers to 22 Billion Parameters, arXiv: 2302.05442 , retrieved 2024-08-07
  4. "Scaling vision transformers to 22 billion parameters". research.google. Retrieved 2024-08-07.
  5. Han, Kai; Wang, Yunhe; Chen, Hanting; Chen, Xinghao; Guo, Jianyuan; Liu, Zhenhua; Tang, Yehui; Xiao, An; Xu, Chunjing; Xu, Yixing; Yang, Zhaohui; Zhang, Yiman; Tao, Dacheng (2023-01-01). "A Survey on Vision Transformer". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (1): 87–110. arXiv: 2012.12556. doi:10.1109/TPAMI.2022.3152247. ISSN 0162-8828. PMID 35180075.
  6. Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak (2022-09-13). "Transformers in Vision: A Survey". ACM Comput. Surv. 54 (10s): 200:1–200:41. arXiv: 2101.01169. doi:10.1145/3505244. ISSN 0360-0300.
  7. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  8. Ramachandran, Prajit; Parmar, Niki; Vaswani, Ashish; Bello, Irwan; Levskaya, Anselm; Shlens, Jon (2019). "Stand-Alone Self-Attention in Vision Models". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv: 1906.05909 .
  9. Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". pp. 11976–11986. arXiv: 2201.03545.
  10. Woo, Sanghyun; Debnath, Shoubhik; Hu, Ronghang; Chen, Xinlei; Liu, Zhuang; Kweon, In So; Xie, Saining (2023). "ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders". pp. 16133–16142. arXiv: 2301.00808.
  11. Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Masayoshi, Tomizuka; Gonzalez, Joseph; Keutzer, Kurt; Vajda, Peter (2020). "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". arXiv: 2006.03677 [cs.CV].
  12. Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross (2021-06-28). "Early Convolutions Help Transformers See Better". arXiv: 2106.14881 [cs.CV].
  13. Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining (2021-03-25). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows". arXiv: 2103.14030 [cs.CV].
  14. Tan, Mingxing; Le, Quoc (23 June 2021). "EfficientNetV2: Smaller Models and Faster Training" (PDF). Proceedings of the 38th International Conference on Machine Learning (PMLR). 139: 10096–10106. arXiv: 2104.00298 . Retrieved 31 October 2023.
  15. Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Q. Weinberger, Kilian (28 Jan 2018). "Densely Connected Convolutional Networks". arXiv: 1608.06993 [cs.CV].
  16. Sarkar, Arjun (2021-05-20). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2021-07-11.
  17. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv: 2010.11929 [cs.CV].
  18. Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (June 2022). "Scaling Vision Transformers". 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE: 1204–1213. arXiv: 2106.04560. doi:10.1109/cvpr52688.2022.01179. ISBN 978-1-6654-6946-3.
  19. Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam; Choi, Seungjin; Teh, Yee Whye (2019-05-24). "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". Proceedings of the 36th International Conference on Machine Learning. PMLR: 3744–3753. arXiv: 1810.00825 .
  20. Karamcheti, Siddharth; Nair, Suraj; Chen, Annie S.; Kollar, Thomas; Finn, Chelsea; Sadigh, Dorsa; Liang, Percy (2023-02-24), Language-Driven Representation Learning for Robotics, arXiv: 2302.12766 , retrieved 2024-09-09
  21. Touvron, Hugo; Cord, Matthieu; Sablayrolles, Alexandre; Synnaeve, Gabriel; Jégou, Hervé (2021). "Going Deeper With Image Transformers". pp. 32–42. arXiv: 2103.17239.
  22. Zhou, Daquan; Kang, Bingyi; Jin, Xiaojie; Yang, Linjie; Lian, Xiaochen; Jiang, Zihang; Hou, Qibin; Feng, Jiashi (2021-04-19), DeepViT: Towards Deeper Vision Transformer, arXiv: 2103.11886 , retrieved 2024-09-09
  23. He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, Piotr; Girshick, Ross (2021). "Masked Autoencoders Are Scalable Vision Learners". arXiv: 2111.06377 [cs.CV].
  24. Bao, Hangbo; Dong, Li; Piao, Songhao; Wei, Furu (2021-10-06). "BEiT: BERT Pre-Training of Image Transformers". International Conference on Learning Representations. arXiv: 2106.08254 .
  25. Caron, Mathilde; Touvron, Hugo; Misra, Ishan; Jegou, Herve; Mairal, Julien; Bojanowski, Piotr; Joulin, Armand (October 2021). "Emerging Properties in Self-Supervised Vision Transformers". 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 9630–9640. arXiv: 2104.14294. doi:10.1109/iccv48922.2021.00951. ISBN 978-1-6654-2812-5.
  26. He, Kaiming; Fan, Haoqi; Wu, Yuxin; Xie, Saining; Girshick, Ross (2020). "Momentum Contrast for Unsupervised Visual Representation Learning". pp. 9729–9738. arXiv: 1911.05722.
  27. Grill, Jean-Bastien; Strub, Florian; Altché, Florent; Tallec, Corentin; Richemond, Pierre; Buchatskaya, Elena; Doersch, Carl; Avila Pires, Bernardo; Guo, Zhaohan; Gheshlaghi Azar, Mohammad; Piot, Bilal; kavukcuoglu, koray; Munos, Remi; Valko, Michal (2020). "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 21271–21284.
  28. Liu, Ze; Hu, Han; Lin, Yutong; Yao, Zhuliang; Xie, Zhenda; Wei, Yixuan; Ning, Jia; Cao, Yue; Zhang, Zheng; Dong, Li; Wei, Furu; Guo, Baining (2022). "Swin Transformer V2: Scaling Up Capacity and Resolution". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12009–12019.
  29. Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo (2021-02-09). "Is Space-Time Attention All You Need for Video Understanding?". arXiv: 2102.05095 [cs.CV].
  30. Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Zbigniew (2016). "Rethinking the Inception Architecture for Computer Vision". pp. 2818–2826. arXiv: 1512.00567.
  31. Yu, Jiahui; Li, Xin; Koh, Jing Yu; Zhang, Han; Pang, Ruoming; Qin, James; Ku, Alexander; Xu, Yuanzhong; Baldridge, Jason; Wu, Yonghui (2021). "Vector-quantized Image Modeling with Improved VQGAN". arXiv: 2110.04627 [cs.CV].
  32. "Parti: Pathways Autoregressive Text-to-Image Model". sites.research.google. Retrieved 2023-11-03.
  33. Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Tomizuka, Masayoshi; Gonzalez, Joseph; Keutzer, Kurt (2020-11-19), Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv: 2006.03677 , retrieved 2024-08-07
  34. Dai, Zihang; Liu, Hanxiao; Le, Quoc V.; Tan, Mingxing (2021-06-09). "CoAtNet: Marrying Convolution and Attention for All Data Sizes". arXiv: 2106.04803 [cs.CV].
  35. Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (2021-03-29). "CvT: Introducing Convolutions to Vision Transformers". arXiv: 2103.15808 [cs.CV].
  36. Touvron, Hugo; Cord, Matthieu; Jégou, Hervé (2022). "DeiT III: Revenge of the ViT". In Avidan, Shai; Brostow, Gabriel; Cissé, Moustapha; Farinella, Giovanni Maria; Hassner, Tal (eds.). Computer Vision – ECCV 2022. Lecture Notes in Computer Science. Vol. 13684. Cham: Springer Nature Switzerland. pp. 516–533. doi:10.1007/978-3-031-20053-3_30. ISBN   978-3-031-20053-3.
  37. Naseer, Muzammal; Ranasinghe, Kanchana; Khan, Salman; Hayat, Munawar; Khan, Fahad Shahbaz; Yang, Ming-Hsuan (2021-05-21). "Intriguing Properties of Vision Transformers". arXiv: 2105.10497 [cs.CV].
  38. Coccomini, Davide; Messina, Nicola; Gennaro, Claudio; Falchi, Fabrizio (2022). "Combining Efficient Net and Vision Transformers for Video Deepfake Detection". Image Analysis and Processing – ICIAP 2022. Lecture Notes in Computer Science. Vol. 13233. pp. 219–229. arXiv: 2107.02612 . doi:10.1007/978-3-031-06433-3_19. ISBN   978-3-031-06432-6. S2CID   235742764.
  39. Jiang, Yifan; Chang, Shiyu; Wang, Zhangyang (2021). "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 14745–14758. arXiv: 2102.07074 .
  40. Peebles, William; Xie, Saining (March 2023). "Scalable Diffusion Models with Transformers". arXiv: 2212.09748v2 [cs.CV].
  41. Doron, Michael; Moutakanni, Théo; Chen, Zitong S.; Moshkov, Nikita; Caron, Mathilde; Touvron, Hugo; Bojanowski, Piotr; Pernice, Wolfgang M.; Caicedo, Juan C. (2023-06-18). "Unbiased single-cell morphology with self-supervised vision transformers". BioRxiv: The Preprint Server for Biology: 2023.06.16.545359. doi:10.1101/2023.06.16.545359. PMC   10312751 . PMID   37398158 . Retrieved 2024-02-12.
