Neural style transfer

Last updated September 06, 2024

[Figure: Neural style transfer applied to the Mona Lisa, using the styles of The Starry Night, Woman with a Hat, and The Great Wave off Kanagawa.]

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images or videos so that they adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for image transformation. A common use of NST is the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. Artists and designers around the globe have used the method to develop new artwork based on existing styles.

History

NST is an example of image stylization, a problem studied for over two decades within the field of non-photorealistic rendering. The first two example-based style transfer algorithms were image analogies^[1] and image quilting.^[2] Both of these methods were based on patch-based texture synthesis algorithms.

In image analogies, given a training pair of images (a photo and an artwork depicting that photo), a transformation could be learned and then applied to create new artwork from a new photo, by analogy. If no training photo was available, it would need to be produced by processing the input artwork; image quilting did not require this processing step, though it was demonstrated on only one style.

NST was first published in the paper "A Neural Algorithm of Artistic Style" by Leon Gatys et al., originally released on arXiv in 2015,^[3] and subsequently accepted by the peer-reviewed CVPR conference in 2016.^[4] The original paper used a VGG-19 architecture^[5] that had been pre-trained to perform object recognition using the ImageNet dataset.

In 2017, Google AI introduced a method^[6] that allows a single deep convolutional style transfer network to learn multiple styles at the same time. This algorithm permits style interpolation in real-time, even when done on video media.

Mathematics

This section closely follows the original paper.^[4]

Overview

The idea of Neural Style Transfer (NST) is to take two images—a content image ${\vec {p}}$ and a style image ${\vec {a}}$ —and generate a third image ${\vec {x}}$ that minimizes a weighted combination of two loss functions: a content loss ${\mathcal {L}}_{\text{content }}({\vec {p}},{\vec {x}})$ and a style loss ${\mathcal {L}}_{\text{style }}({\vec {a}},{\vec {x}})$ .

The total loss is a linear sum of the two: ${\mathcal {L}}_{\text{NST}}({\vec {p}},{\vec {a}},{\vec {x}})=\alpha {\mathcal {L}}_{\text{content}}({\vec {p}},{\vec {x}})+\beta {\mathcal {L}}_{\text{style}}({\vec {a}},{\vec {x}})$ . By jointly minimizing the content and style losses, NST generates an image that blends the content of the content image with the style of the style image.

Both the content loss and the style loss measure how much two images differ. The content loss is the weighted sum of squared differences between the neural activations of a single convolutional neural network (CNN) on the two images. The style loss is the weighted sum of squared differences between the Gram matrices of those activations within each layer (see below for details).

The original paper used a VGG-19 CNN, but the method works for any CNN.

Symbols

Let ${\textstyle {\vec {x}}}$ be an image input to a CNN.

Let ${\textstyle F^{l}\in \mathbb {R} ^{N_{l}\times M_{l}}}$ be the matrix of filter responses in layer ${\textstyle l}$ to the image ${\textstyle {\vec {x}}}$ , where:

${\textstyle N_{l}}$ is the number of filters in layer ${\textstyle l}$ ;
${\textstyle M_{l}}$ is the height times the width (i.e. number of pixels) of each feature map in layer ${\textstyle l}$ ;
${\textstyle F_{ij}^{l}({\vec {x}})}$ is the activation of the ${\textstyle i^{\text{th}}}$ filter at position ${\textstyle j}$ in layer ${\textstyle l}$ .

A given input image ${\textstyle {\vec {x}}}$ is encoded in each layer of the CNN by the filter responses to that image, with higher layers encoding more global features, but losing details on local features.
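To make the notation concrete, the following minimal sketch shows how a layer's output could be flattened into the matrix $F^{l}$. The function name `to_feature_matrix` and the toy array sizes are assumptions for illustration only; they do not come from the original paper, and the layer output is assumed to be stored as an array of shape (filters, height, width).

```python
import numpy as np

def to_feature_matrix(feature_maps):
    """Flatten a (N_l, H, W) stack of feature maps into F^l with shape (N_l, M_l)."""
    n_filters, height, width = feature_maps.shape
    # Each row holds one filter's responses; M_l = height * width positions.
    return feature_maps.reshape(n_filters, height * width)

# Hypothetical layer output: 4 filters on a 3x5 spatial grid.
feature_maps = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)
F = to_feature_matrix(feature_maps)
```

With this row-major layout, position $j$ corresponds to row `j // width` and column `j % width` of the feature map.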

Content loss

Let ${\textstyle {\vec {p}}}$ be an original image. Let ${\textstyle {\vec {x}}}$ be an image that is generated to match the content of ${\textstyle {\vec {p}}}$ . Let ${\textstyle P^{l}}$ be the matrix of filter responses in layer ${\textstyle l}$ to the image ${\textstyle {\vec {p}}}$ .

The content loss is defined as the squared-error loss between the feature representations of the generated image and the content image at a chosen layer $l$ of a CNN: ${\mathcal {L}}_{\text{content }}({\vec {p}},{\vec {x}},l)={\frac {1}{2}}\sum _{i,j}\left(F_{ij}^{l}({\vec {x}})-P_{ij}^{l}\right)^{2}$ where $F_{ij}^{l}({\vec {x}})$ and $P_{ij}^{l}=F_{ij}^{l}({\vec {p}})$ are the activations of the $i^{\text{th}}$ filter at position $j$ in layer $l$ for the generated and content images, respectively. Minimizing this loss encourages the generated image to have similar content to the content image, as captured by the feature activations in the chosen layer.

The total content loss is a linear sum of the content losses of each layer: ${\mathcal {L}}_{\text{content }}({\vec {p}},{\vec {x}})=\sum _{l}v_{l}{\mathcal {L}}_{\text{content }}({\vec {p}},{\vec {x}},l)$ , where the $v_{l}$ are positive real numbers chosen as hyperparameters.
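The two formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary-based layer lookup, and the small example matrices are all assumptions.

```python
import numpy as np

def content_loss_layer(F, P):
    """Half the sum of squared differences between two filter-response matrices."""
    return 0.5 * np.sum((F - P) ** 2)

def content_loss(F_by_layer, P_by_layer, v):
    """Weighted sum of per-layer content losses; v maps layer name -> weight v_l."""
    return sum(v[l] * content_loss_layer(F_by_layer[l], P_by_layer[l]) for l in v)

# Toy example with a single hypothetical layer and 2 filters x 2 positions.
F = np.array([[1.0, 2.0], [3.0, 4.0]])   # responses to the generated image
P = np.array([[0.0, 2.0], [3.0, 2.0]])   # responses to the content image
layer_value = content_loss_layer(F, P)    # 0.5 * (1^2 + 2^2) = 2.5
total_value = content_loss({"conv4_2": F}, {"conv4_2": P}, {"conv4_2": 1.0})
```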

Style loss

The style loss is based on the Gram matrices of the generated and style images, which capture the correlations between different filter responses at different layers of the CNN: ${\mathcal {L}}_{\text{style }}({\vec {a}},{\vec {x}})=\sum _{l=0}^{L}w_{l}E_{l},$ where $E_{l}={\frac {1}{4N_{l}^{2}M_{l}^{2}}}\sum _{i,j}\left(G_{ij}^{l}({\vec {x}})-G_{ij}^{l}({\vec {a}})\right)^{2}.$ Here, $G_{ij}^{l}({\vec {x}})$ and $G_{ij}^{l}({\vec {a}})$ are the entries of the Gram matrices for the generated and style images at layer $l$ . Explicitly, $G_{ij}^{l}({\vec {x}})=\sum _{k}F_{ik}^{l}({\vec {x}})F_{jk}^{l}({\vec {x}})$ .

Minimizing this loss encourages the generated image to have similar style characteristics to the style image, as captured by the correlations between feature responses in each layer. The idea is that activation pattern correlations between filters in a single layer capture the "style" on the order of the receptive fields at that layer.

Similarly to the previous case, the $w_{l}$ are positive real numbers chosen as hyperparameters.
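As a rough illustration of the Gram matrix and the per-layer term $E_{l}$, the sketch below computes both with NumPy. The function names and the tiny example matrices are hypothetical, and $N_{l}$ and $M_{l}$ are taken from the generated image's feature matrix, which is an assumption of this sketch.

```python
import numpy as np

def gram_matrix(F):
    """G_ij = sum_k F_ik F_jk, i.e. inner products between filter responses."""
    return F @ F.T

def style_loss_layer(F_x, F_a):
    """E_l: normalized squared difference between the two Gram matrices."""
    n_filters, n_positions = F_x.shape          # N_l and M_l
    diff = gram_matrix(F_x) - gram_matrix(F_a)
    return np.sum(diff ** 2) / (4 * n_filters**2 * n_positions**2)

def style_loss(Fx_by_layer, Fa_by_layer, w):
    """Weighted sum of E_l over layers; w maps layer name -> weight w_l."""
    return sum(w[l] * style_loss_layer(Fx_by_layer[l], Fa_by_layer[l]) for l in w)

# Toy example: 2 filters x 2 positions.
F_x = np.array([[1.0, 0.0], [0.0, 1.0]])   # generated image, Gram = identity
F_a = np.array([[1.0, 1.0], [0.0, 1.0]])   # style image, Gram = [[2, 1], [1, 1]]
E = style_loss_layer(F_x, F_a)              # 3 / (4 * 2^2 * 2^2) = 3/64
```

Because the Gram matrix sums over positions, it has shape $N_{l}\times N_{l}$ regardless of image size, so the spatial arrangement of features is discarded and only their co-occurrence statistics remain.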

Hyperparameters

The original paper used a particular choice of hyperparameters.

The style loss uses $w_{l}=0.2$ for the outputs of layers conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 in the VGG-19 network, and zero otherwise. The content loss uses $v_{l}=1$ for conv4_2, and zero otherwise.

The ratio $\alpha /\beta$ is chosen in the range $[5,50]\times 10^{-4}$ .
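These settings could be written down as a small configuration, as in the sketch below. The constant and function names are hypothetical, and the specific ratio `1e-3` is only an assumed example value inside the stated range, not a value prescribed by the text.

```python
# Layer weights as described above; layers not listed have weight zero.
STYLE_WEIGHTS = {
    name: 0.2
    for name in ("conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1")
}
CONTENT_WEIGHTS = {"conv4_2": 1.0}

# Assumed example ratio alpha / beta, chosen inside [5, 50] x 10^-4.
ALPHA_OVER_BETA = 1e-3
BETA = 1.0
ALPHA = ALPHA_OVER_BETA * BETA

def total_loss(content, style, alpha=ALPHA, beta=BETA):
    """Linear combination alpha * L_content + beta * L_style."""
    return alpha * content + beta * style
```

A smaller ratio puts more emphasis on matching the style image; a larger one preserves more of the content image.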

Training

Image ${\vec {x}}$ is initialized by adding a small amount of white noise to the input image ${\vec {p}}$ and feeding it through the CNN. The total loss is then repeatedly backpropagated through the network, with the CNN weights held fixed, in order to update the pixels of ${\vec {x}}$ . After several thousand iterations, an ${\vec {x}}$ (hopefully) emerges that matches the style of ${\vec {a}}$ and the content of ${\vec {p}}$ .
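The procedure can be illustrated with a deliberately simplified toy. In the sketch below, a single fixed linear map `W` stands in for the CNN (a strong assumption: the real method uses VGG-19 with nonlinearities and automatic differentiation), gradients are written out by hand, and a simple step-halving rule replaces whatever optimizer a real implementation would use. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, M = 3, 4, 16                  # channels, filters, pixels (toy sizes)
W = rng.normal(size=(N, C))         # fixed weights of the stand-in linear "layer"
p = rng.normal(size=(C, M))         # content image
a = rng.normal(size=(C, M))         # style image
alpha, beta = 1e-3, 1.0             # assumed loss weights

def features(img):
    """Stand-in for the CNN: one linear layer with frozen weights."""
    return W @ img

P = features(p)                     # content targets
A = features(a) @ features(a).T     # style Gram matrix target

def loss_and_grad(x):
    """Total loss and its gradient with respect to the pixels of x."""
    F = features(x)
    G = F @ F.T
    content = 0.5 * np.sum((F - P) ** 2)
    style = np.sum((G - A) ** 2) / (4 * N**2 * M**2)
    # Gradient w.r.t. F, then chain rule through the fixed weights W.
    dF = alpha * (F - P) + beta * ((G - A) @ F) / (N**2 * M**2)
    return alpha * content + beta * style, W.T @ dF

# Initialize from the content image plus a little white noise.
x = p + 0.01 * rng.normal(size=p.shape)
initial_loss, _ = loss_and_grad(x)

step, loss = 1.0, initial_loss
for _ in range(200):
    loss, grad = loss_and_grad(x)
    candidate = x - step * grad
    candidate_loss, _ = loss_and_grad(candidate)
    if candidate_loss < loss:       # accept only steps that reduce the loss
        x, loss = candidate, candidate_loss
    else:
        step *= 0.5                 # otherwise shrink the step size
final_loss = loss
```

Only the pixels of `x` change during the loop; `W` is never updated, mirroring the fixed CNN weights described above.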

As of 2017, when implemented on a GPU, it takes a few minutes to converge.^[8]

Extensions

Some practical implementations note that the resulting image contains too many high-frequency artifacts, which can be suppressed by adding a total variation term to the total loss.^[9]
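A total variation penalty can take several forms; the sketch below uses one common choice, the sum of absolute differences between neighbouring pixels (an anisotropic, L1 variant). This particular form and the function name are assumptions, since the text does not specify which variant is meant.

```python
import numpy as np

def total_variation(img):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    vertical = np.abs(np.diff(img, axis=0)).sum()     # differences between rows
    horizontal = np.abs(np.diff(img, axis=1)).sum()   # differences between columns
    return vertical + horizontal

# Toy 2x2 grayscale image: vertical diffs 2 and 3, horizontal diffs 1 and 2.
img = np.array([[0.0, 1.0], [2.0, 4.0]])
tv = total_variation(img)
```

A perfectly flat image has zero total variation, so adding this term with a small weight penalizes pixel-level noise while leaving smooth regions unaffected.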

NST has also been extended to videos.^[10]

Subsequent work improved the speed of NST for images by using special-purpose normalizations.^[11]^[8]

A paper by Fei-Fei Li et al. adopted a different regularized loss metric and an accelerated training method to produce results in real time (three orders of magnitude faster than Gatys).^[12] Their idea was to use not the pixel-based loss defined above but rather a 'perceptual loss' measuring the differences between higher-level layers within the CNN. They used a symmetric convolution-deconvolution CNN. Training uses a loss function similar to that of the basic NST method but also regularizes the output for smoothness using a total variation (TV) loss. Once trained, the network can transform an image into the style used during training in a single feed-forward pass. However, the network is restricted to the single style on which it has been trained.^[12]

Chen Dongdong et al. explored the fusion of optical flow information into feedforward networks in order to improve the temporal coherence of the output.^[13]

More recently, feature-transform-based NST methods have been explored for fast stylization that is not coupled to a single specific style and that enables user-controllable blending of styles, for example the whitening and coloring transform (WCT).^[14]
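The general idea of a whitening and coloring transform can be sketched on plain feature matrices. The code below is an illustrative assumption rather than the cited method's implementation: it whitens the content features so that their covariance becomes the identity, then recolors them to match the covariance and mean of the style features, using an eigendecomposition. The encoder and decoder networks that would surround this step, and any blending between styles, are omitted; both covariance matrices are assumed to be full rank.

```python
import numpy as np

def _center(features):
    """Return centered features, their per-channel mean, and their covariance."""
    mean = features.mean(axis=1, keepdims=True)
    centered = features - mean
    cov = centered @ centered.T / (features.shape[1] - 1)
    return centered, mean, cov

def whiten_color(content_feat, style_feat):
    """Map (C, M) content features onto the second-order statistics of style features."""
    fc, _, cov_c = _center(content_feat)
    _, mean_s, cov_s = _center(style_feat)
    dc, Ec = np.linalg.eigh(cov_c)   # eigenvalues and eigenvectors, content
    ds, Es = np.linalg.eigh(cov_s)   # eigenvalues and eigenvectors, style
    whitened = Ec @ np.diag(dc ** -0.5) @ Ec.T @ fc       # covariance -> identity
    colored = Es @ np.diag(ds ** 0.5) @ Es.T @ whitened   # covariance -> cov_s
    return colored + mean_s

# Toy features: 3 channels, different numbers of positions for each image.
rng = np.random.default_rng(0)
content = rng.normal(size=(3, 200))
style = 2.0 * rng.normal(size=(3, 150)) + 1.0
out = whiten_color(content, style)
```

Because the transform depends only on statistics computed from the two inputs, no per-style training is involved in this step.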

References

[1] Hertzmann, Aaron; Jacobs, Charles E.; Oliver, Nuria; Curless, Brian; Salesin, David H. (August 2001). "Image analogies". ACM: 327–340. doi:10.1145/383259.383295. ISBN 978-1-58113-374-5.
[2] Efros, Alexei A.; Freeman, William T. (August 2001). "Image quilting for texture synthesis and transfer". ACM: 341–346. doi:10.1145/383259.383296. ISBN 978-1-58113-374-5.
[3] Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (26 August 2015). "A Neural Algorithm of Artistic Style". arXiv:1508.06576 [cs.CV].
[4] Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (2016). "Image Style Transfer Using Convolutional Neural Networks". IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2414–2423.
[5] "Very Deep CNNS for Large-Scale Visual Recognition". Robots.ox.ac.uk. 2014. Retrieved 13 February 2019.
[6] Dumoulin, Vincent; Shlens, Jonathon S.; Kudlur, Manjunath (9 February 2017). "A Learned Representation for Artistic Style". arXiv:1610.07629 [cs.CV].
[7] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "14.12. Neural Style Transfer". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
[8] Huang, Xun; Belongie, Serge (2017). "Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization". pp. 1501–1510.
[9] Jing, Yongcheng; Yang, Yezhou; Feng, Zunlei; Ye, Jingwen; Yu, Yizhou; Song, Mingli (2020-11-01). "Neural Style Transfer: A Review". IEEE Transactions on Visualization and Computer Graphics. 26 (11): 3365–3385. doi:10.1109/TVCG.2019.2921336. ISSN 1077-2626.
[10] Ruder, Manuel; Dosovitskiy, Alexey; Brox, Thomas (2016). "Artistic Style Transfer for Videos". Pattern Recognition. Lecture Notes in Computer Science. Vol. 9796. pp. 26–36. arXiv:1604.08610. doi:10.1007/978-3-319-45886-1_3. ISBN 978-3-319-45885-4. S2CID 47476652.
[11] Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (2017-11-06). "Instance Normalization: The Missing Ingredient for Fast Stylization". doi:10.48550/arXiv.1607.08022. Retrieved 2024-08-08.
[12] Johnson, Justin; Alahi, Alexandre; Li, Fei-Fei (2016). "Perceptual Losses for Real-Time Style Transfer and Super-Resolution". arXiv:1603.08155 [cs.CV].
[13] Chen, Dongdong; Liao, Jing; Yuan, Lu; Yu, Nenghai; Hua, Gang (2017). "Coherent Online Video Style Transfer". arXiv:1703.09211 [cs.CV].
[14] Li, Yijun; Fang, Chen; Yang, Jimei; Wang, Zhaowen; Lu, Xin; Yang, Ming-Hsuan (2017). "Universal Style Transfer via Feature Transforms". arXiv:1705.08086 [cs.CV].

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
