StyleGAN

An image generated using StyleGAN that looks like a portrait of a young woman. This image was generated by an artificial neural network based on an analysis of a large number of photographs.

StyleGAN is a generative adversarial network (GAN) introduced by Nvidia researchers in December 2018, [1] and made source available in February 2019. [2] [3]

StyleGAN depends on Nvidia's CUDA software, GPUs, and Google's TensorFlow, [4] or Meta AI's PyTorch, which supersedes TensorFlow as the official implementation library in later StyleGAN versions. [5] The second version of StyleGAN, called StyleGAN2, was published on February 5, 2020. It removes some of the characteristic artifacts and improves the image quality. [6] [7] Nvidia introduced StyleGAN3, described as an "alias-free" version, on June 23, 2021, and made source available on October 12, 2021. [8]

History

A direct predecessor of the StyleGAN series is the Progressive GAN, published in 2017. [9]

In December 2018, Nvidia researchers distributed a preprint with accompanying software introducing StyleGAN, a GAN for producing an unlimited number of (often convincing) portraits of fake human faces. StyleGAN was able to run on Nvidia's commodity GPU processors.

In February 2019, Uber engineer Phillip Wang used the software to create the website This Person Does Not Exist, which displayed a new face on each web page reload. [10] [11] Wang himself has expressed amazement that, although humans have evolved specifically to understand human faces, StyleGAN can nevertheless competitively "pick apart all the relevant features (of human faces) and recompose them in a way that's coherent." [12]

In September 2019, a website called Generated Photos published 100,000 images as a collection of stock photos. [13] The collection was made using a private dataset shot in a controlled environment with similar light and angles. [14]

Similarly, two faculty at the University of Washington's Information School used StyleGAN to create Which Face is Real?, which challenged visitors to differentiate between a fake and a real face side by side. [11] The faculty stated the intention was to "educate the public" about the existence of this technology so they could be wary of it, "just like eventually most people were made aware that you can Photoshop an image". [15]

The second version of StyleGAN, called StyleGAN2, was published on February 5, 2020. It removes some of the characteristic artifacts and improves the image quality. [6] [7]

In 2021, a third version was released, improving consistency between fine and coarse details in the generator. Dubbed "alias-free", this version was implemented with PyTorch. [16]

Illicit use

In December 2019, Facebook took down a network of accounts with false identities, and mentioned that some of them had used profile pictures created with machine learning techniques. [17]

Architecture

Progressive GAN

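Progressive GAN [9] is a method for stably training a GAN for large-scale image generation, by growing the GAN generator from small to large scale in a pyramidal fashion. Like SinGAN, it decomposes the generator as $G = G_1 \circ G_2 \circ \cdots \circ G_N$, and the discriminator as $D = D_1 \circ D_2 \circ \cdots \circ D_N$.

During training, at first only $G_N, D_1$ are used in a GAN game to generate 4x4 images. Then $G_{N-1}, D_2$ are added to reach the second stage of the GAN game, to generate 8x8 images, and so on, until we reach a GAN game to generate 1024x1024 images.

To avoid discontinuity between stages of the GAN game, each new layer is "blended in" (Figure 2 of the paper [9]). For example, this is how the second stage GAN game starts: just before, the game consists of the pair $G_N, D_1$ generating and discriminating 4x4 images. Just after, the game consists of the pair $((1-\alpha) + \alpha \cdot G_{N-1}) \circ u \circ G_N$ and $D_1 \circ d \circ ((1-\alpha) + \alpha \cdot D_2)$ generating and discriminating 8x8 images. Here $u$ and $d$ are image up-sampling and down-sampling functions, and $\alpha$ is a blend-in factor that increases smoothly from 0 to 1.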
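
The blend-in step can be illustrated with a small NumPy sketch. It is only a schematic of the idea, not the paper's implementation: the function names, the nearest-neighbour upsampling, and the random arrays standing in for generator outputs are all assumptions made for illustration.

```python
import numpy as np

def upsample_nearest_2x(img: np.ndarray) -> np.ndarray:
    """Double the height and width of an (H, W, C) image by repeating pixels."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def blend_in(alpha: float, prev_stage_img: np.ndarray, new_layer_img: np.ndarray) -> np.ndarray:
    """Mix the upsampled previous-stage output with the new layer's output.

    alpha = 0 ignores the new layer; alpha = 1 uses only the new layer.
    """
    upsampled = upsample_nearest_2x(prev_stage_img)
    return (1.0 - alpha) * upsampled + alpha * new_layer_img

rng = np.random.default_rng(0)
img_4x4 = rng.standard_normal((4, 4, 3))  # stand-in for the 4x4 stage output
img_8x8 = rng.standard_normal((8, 8, 3))  # stand-in for the new 8x8 layer output

# During training alpha would be raised gradually from 0 to 1.
for alpha in (0.0, 0.5, 1.0):
    blended = blend_in(alpha, img_4x4, img_8x8)
    print(alpha, blended.shape)
```

Because the output at alpha = 0 is exactly the upsampled previous stage, the discriminator sees no sudden change when the new layer is attached.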

StyleGAN

The main architecture of StyleGAN-1 and StyleGAN-2

StyleGAN is designed as a combination of Progressive GAN with neural style transfer. [18]

The key architectural choice of StyleGAN-1 is a progressive growth mechanism, similar to Progressive GAN. Each generated image starts as a constant [note 1] 4x4x512 array and is repeatedly passed through style blocks. Each style block applies a "style latent vector" via an affine transform ("adaptive instance normalization"), similar to how neural style transfer uses the Gram matrix. It then adds noise and normalizes (subtracts the mean, then divides by the standard deviation).
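
As a rough illustration of adaptive instance normalization, the sketch below normalizes each feature channel and then applies a per-channel scale and bias derived from a style vector. The shapes, the random affine map `A`, and all variable names are hypothetical; a real style block learns this map during training.

```python
import numpy as np

def adaptive_instance_norm(x, style_scale, style_bias, eps=1e-8):
    """Normalize each channel of x (C, H, W), then apply per-channel scale and bias."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

rng = np.random.default_rng(0)
num_channels, latent_dim = 8, 32

features = rng.standard_normal((num_channels, 16, 16)) * 3.0 + 5.0
w = rng.standard_normal(latent_dim)  # style latent vector

# Hypothetical affine map from the style vector to (scale, bias) per channel.
A = rng.standard_normal((2 * num_channels, latent_dim)) * 0.1
b = np.concatenate([np.ones(num_channels), np.zeros(num_channels)])
scale, bias = np.split(A @ w + b, 2)

styled = adaptive_instance_norm(features, scale, bias)
print(styled.shape)
```

After this operation the per-channel statistics of the feature maps are determined by the style vector rather than by the incoming features, which is what lets each block impose its own style.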

At training time, usually only one style latent vector is used per image generated, but sometimes two ("mixing regularization") in order to encourage each style block to independently perform its stylization without expecting help from other style blocks (since they might receive an entirely different style latent vector).

After training, a different style latent vector can be fed into each style block. Those fed to the lower layers control the large-scale styles, and those fed to the higher layers control the fine-detail styles.

Style-mixing between two images can be performed as well. Given two images $x$ and $x'$, first run gradient descent to find style latent vectors $z, z'$ such that $G(z) \approx x$ and $G(z') \approx x'$. This is called "projecting an image back to style latent space". Then, $z$ can be fed to the lower style blocks, and $z'$ to the higher style blocks, to generate a composite image that has the large-scale style of $x$ and the fine-detail style of $x'$. Multiple images can also be composed this way.
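
The per-block assignment of style vectors can be sketched as follows. This only builds the array of per-block style inputs; the generator itself is omitted, and the function name, the block count of 18, the latent size of 512, and the crossover point are illustrative assumptions rather than values taken from this article.

```python
import numpy as np

def mix_styles(w_coarse, w_fine, num_blocks, crossover):
    """Return a (num_blocks, latent_dim) array of per-block style vectors.

    Blocks [0, crossover) receive w_coarse (large-scale style);
    blocks [crossover, num_blocks) receive w_fine (fine-detail style).
    """
    if not 0 <= crossover <= num_blocks:
        raise ValueError("crossover must lie between 0 and num_blocks")
    per_block = np.tile(w_fine, (num_blocks, 1))
    per_block[:crossover] = w_coarse
    return per_block

rng = np.random.default_rng(0)
latent_dim, num_blocks = 512, 18          # assumed sizes
w_a = rng.standard_normal(latent_dim)     # projected latent of the first image
w_b = rng.standard_normal(latent_dim)     # projected latent of the second image

styles = mix_styles(w_a, w_b, num_blocks, crossover=8)
print(styles.shape)
```

Moving the crossover point towards the higher blocks lets the first image determine more of the result, leaving only the finest details to the second.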

StyleGAN2

StyleGAN2 improves upon StyleGAN in two ways.

One, it applies the style latent vector to transform the convolution layer's weights instead, thus solving the "blob" problem. [19] Roughly speaking, the "blob" problem arises because using the style latent vector to normalize the generated image destroys useful information. Consequently, the generator learned to create a "distraction" in the form of a large blob, which absorbs most of the effect of normalization (somewhat similar to using flares to distract a heat-seeking missile).
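
A minimal sketch of applying the style to the convolution weights, following the modulation and demodulation idea described in the StyleGAN2 paper [19]: each input channel of the kernel is scaled by the style, and each output filter is then rescaled to unit norm. The tensor layout, function name, and sizes are assumptions for illustration.

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    """Scale conv weights by a per-input-channel style, then normalize each output filter.

    weight: (out_channels, in_channels, kernel_h, kernel_w)
    style:  (in_channels,)
    """
    modulated = weight * style[None, :, None, None]
    # L2 norm of each output filter over input channels and spatial positions.
    norm = np.sqrt((modulated ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return modulated / norm

rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 4, 3, 3))
style = rng.standard_normal(4) + 1.0

new_weight = modulate_demodulate(weight, style)
filter_norms = np.sqrt((new_weight ** 2).sum(axis=(1, 2, 3)))
print(filter_norms)
```

Since the normalization acts on the weights rather than on the feature maps, the activations themselves are never explicitly normalized, so the generator has no incentive to produce a blob that dominates the statistics.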

Two, it uses residual connections, which help it avoid the phenomenon where certain features are stuck at intervals of pixels. For example, the seam between two teeth may be stuck at pixels divisible by 32, because the generator learned to generate teeth during stage N-5, and consequently could only generate primitive teeth at that stage, before doubling in resolution 5 times (hence intervals of 2^5 = 32).

This was updated by StyleGAN2-ADA ("ADA" stands for "adaptive discriminator augmentation"), [20] which uses invertible data augmentation. It also tunes the amount of data augmentation applied by starting at zero and gradually increasing it until an "overfitting heuristic" reaches a target level, hence the name "adaptive".
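
The adaptive control loop can be sketched as a simple feedback rule. This is a schematic under stated assumptions: the heuristic used here (the mean sign of the discriminator's outputs on real training images) is one of the heuristics described in the ADA paper [20], while the function names and the `target` and `step` values are placeholders chosen for illustration.

```python
import numpy as np

def overfitting_heuristic(d_real_logits):
    """Mean sign of discriminator outputs on real images; near 1 suggests overfitting."""
    return float(np.mean(np.sign(d_real_logits)))

def update_augment_probability(p, r_t, target=0.6, step=0.01):
    """Nudge the augmentation probability p up or down by a fixed step, clipped to [0, 1]."""
    adjusted = p + step * np.sign(r_t - target)
    return float(np.clip(adjusted, 0.0, 1.0))

# Augmentation starts at zero and rises while the heuristic exceeds the target.
p = 0.0
for logits in ([2.0, 1.5, 0.7, 3.1], [1.2, 0.4, 2.2, 0.9], [0.5, -0.3, -1.0, 0.2]):
    r_t = overfitting_heuristic(np.array(logits))
    p = update_augment_probability(p, r_t)
    print(f"r_t={r_t:+.2f}  p={p:.2f}")
```

In this toy run the first two batches look overfit (all outputs positive), so `p` rises twice; the third batch falls below the target, so `p` drops back by one step.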

StyleGAN3

StyleGAN3 [21] improves upon StyleGAN2 by solving the "texture sticking" problem, which can be seen in the official videos. [22] The authors analyzed the problem using the Nyquist–Shannon sampling theorem and argued that the layers in the generator learned to exploit the high-frequency signal in the pixels they operate upon.

To solve this, they proposed imposing strict lowpass filters between the generator's layers, so that the generator is forced to operate on the pixels in a way faithful to the continuous signals they represent, rather than operating on them as merely discrete signals. They further imposed rotational and translational invariance by using more signal filters. The resulting StyleGAN-3 is able to generate images that rotate and translate smoothly, without texture sticking.
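
The role of the lowpass filter can be illustrated in one dimension. The sketch below is not StyleGAN3's implementation; it is a generic signal-processing example showing that a frequency above the Nyquist limit of the lower sampling rate reappears as an alias when a signal is subsampled naively, and is suppressed when a windowed-sinc lowpass filter is applied first. The filter length, cutoff, and Kaiser `beta` are assumed values.

```python
import numpy as np

def kaiser_lowpass(num_taps, cutoff, beta=8.0):
    """Windowed-sinc lowpass FIR filter; cutoff is in cycles per sample (0 < cutoff < 0.5)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = 2.0 * cutoff * np.sinc(2.0 * cutoff * n) * np.kaiser(num_taps, beta)
    return h / h.sum()  # unit gain at zero frequency

def downsample_2x(signal, lowpass=None):
    """Keep every second sample, optionally lowpass filtering first."""
    if lowpass is not None:
        signal = np.convolve(signal, lowpass, mode="same")
    return signal[::2]

n = np.arange(256)
# 0.4 cycles/sample is above 0.25, the highest frequency representable after 2x subsampling.
high_freq = np.cos(2 * np.pi * 0.4 * n)

naive = downsample_2x(high_freq)
filtered = downsample_2x(high_freq, kaiser_lowpass(33, cutoff=0.2))

# Ignore the edges, where the convolution only partially overlaps the signal.
naive_peak = np.abs(naive[10:-10]).max()
filtered_peak = np.abs(filtered[10:-10]).max()
print(naive_peak, filtered_peak)
```

The naive result keeps a full-amplitude oscillation at a spurious lower frequency, whereas the filtered result is close to zero: the content that cannot be represented at the lower rate is removed instead of being aliased.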

Notes

  1. It is learned during the training, but afterwards it is held constant, much like a bias vector.

References

  1. "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
  2. "NVIDIA Open-Sources Hyper-Realistic Face Generator StyleGAN". Medium.com. February 9, 2019. Retrieved October 3, 2019.
  3. Beschizza, Rob (February 15, 2019). "This Person Does Not Exist". Boing-Boing. Retrieved February 16, 2019.
  4. Larabel, Michael (February 10, 2019). "NVIDIA Opens Up The Code To StyleGAN - Create Your Own AI Family Portraits". Phoronix.com. Retrieved October 3, 2019.
  5. "Looking for the PyTorch version? - Stylegan2". github.com. October 28, 2021. Retrieved August 5, 2022.
  6. "Synthesizing High-Resolution Images with StyleGAN2 – NVIDIA Developer News Center". news.developer.nvidia.com. June 17, 2020. Retrieved August 11, 2020.
  7. NVlabs/stylegan2, NVIDIA Research Projects, August 11, 2020. Retrieved August 11, 2020.
  8. Kakkar, Shobha (October 13, 2021). "NVIDIA AI Releases StyleGAN3: Alias-Free Generative Adversarial Networks". MarkTechPost. Retrieved October 14, 2021.
  9. Karras, Tero; Aila, Timo; Laine, Samuli; Lehtinen, Jaakko (2018). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". International Conference on Learning Representations. arXiv:1710.10196.
  10. msmash (February 14, 2019). "'This Person Does Not Exist' Website Uses AI To Create Realistic Yet Horrifying Faces". Slashdot. Retrieved February 16, 2019.
  11. Fleishman, Glenn (April 30, 2019). "How to spot the realistic fake people creeping into your timelines". Fast Company. Retrieved June 7, 2020.
  12. Bishop, Katie (February 7, 2020). "AI in the adult industry: porn may soon feature people who don't exist". The Guardian. Retrieved June 8, 2020.
  13. Porter, Jon (September 20, 2019). "100,000 free AI-generated headshots put stock photo companies on notice". The Verge. Retrieved August 4, 2020.
  14. Wakefield, Jane; Timmins, Beth (February 29, 2020). "Could deepfakes be used to train office workers?". BBC News. Retrieved August 4, 2020.
  15. Vincent, James (March 3, 2019). "Can you tell the difference between a real face and an AI-generated fake?". The Verge. Retrieved June 8, 2020.
  16. NVlabs/stylegan3, NVIDIA Research Projects, October 11, 2021.
  17. "Facebook's latest takedown has a twist -- AI-generated profile pictures". ABC News. Retrieved August 4, 2020.
  18. Karras, Tero; Laine, Samuli; Aila, Timo (2019). "A Style-Based Generator Architecture for Generative Adversarial Networks" (PDF). 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 4396–4405. arXiv:1812.04948. doi:10.1109/CVPR.2019.00453. ISBN 978-1-7281-3293-8. S2CID 54482423.
  19. Karras, Tero; Laine, Samuli; Aittala, Miika; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo (2020). "Analyzing and Improving the Image Quality of StyleGAN" (PDF). 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 8107–8116. arXiv:1912.04958. doi:10.1109/CVPR42600.2020.00813. ISBN 978-1-7281-7168-5. S2CID 209202273.
  20. Karras, Tero; Aittala, Miika; Hellsten, Janne; Laine, Samuli; Lehtinen, Jaakko; Aila, Timo (2020). "Training Generative Adversarial Networks with Limited Data". Advances in Neural Information Processing Systems. 33.
  21. Karras, Tero; Aittala, Miika; Laine, Samuli; Härkönen, Erik; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo (2021). "Alias-Free Generative Adversarial Networks" (PDF). Advances in Neural Information Processing Systems.
  22. Karras, Tero; Aittala, Miika; Laine, Samuli; Härkönen, Erik; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo. "Alias-Free Generative Adversarial Networks (StyleGAN3)". nvlabs.github.io. Retrieved July 16, 2022.