
Neural Style Transfer - Art Generation using Neural networks

A really cool application of convolutional neural networks (CNNs) is Neural Style Transfer for art generation. It merges two images, a Content image and a Style image, into a new image that combines the content of the first with the style of the second.

Some of the generated images are posted on This Page



Nomenclature used:
Content Image (C)
Style Image (S)
Generated Image (G)

What are Deep ConvNets Learning?

Visualizing what a deep network is learning

Each unit (neuron) in a layer responds most strongly to a specific pattern in the image. Here, a unit being “activated” means its output is at or near its maximum value.

For example, a particular unit in a layer might be activated by a 45-degree line, a specific color, or a round shape. If the image contains such a feature, that unit’s activation will be high.

How to visualize this: pick a unit in layer 1 and find the nine image patches that maximize that unit’s activation. Repeat for other units.
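The selection step can be sketched in a few lines of Python. This is a minimal illustration, assuming the unit's activation on each candidate patch has already been computed by a forward pass through the network; the helper `top_k_patches` and the array `acts` are hypothetical names with made-up values.

```python
import numpy as np

def top_k_patches(activations, k=9):
    """Return the indices of the k patches with the largest activation, strongest first."""
    activations = np.asarray(activations)
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(activations)[-k:][::-1]

# Hypothetical activations of one unit on 12 candidate image patches
acts = np.array([0.1, 2.3, 0.0, 1.7, 0.9, 3.1, 0.4, 2.8, 1.2, 0.3, 2.0, 0.6])
best = top_k_patches(acts)
print(best)  # indices of the nine patches to display for this unit
```

Displaying the patches at those indices gives the nine-patch grid for one unit; repeating this for other units builds up the picture of a whole layer.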


The units in the first few layers learn basic features (a line, a color, etc.), while units in later layers learn more complex features (faces, people, animals, clouds, tyres, etc.). The images below, each showing a small part of a layer, illustrate this: layer 1 learns basic features, and units in deeper layers learn increasingly complex ones.

[Figures: visualizations of units in All Layers, Layer 1, Layer 2, Layer 3, Layer 4, and Layer 5]

Neural Style Transfer : Cost Function

The cost function $J(G)$ for a generated image is made up of two individual cost functions, the Content Cost Function $J_{Content}$ and the Style Cost Function $J_{Style}$, weighted by the hyperparameters $\alpha$ and $\beta$:

$J(G)$ = $\alpha J_{Content}(C,G)$ + $\beta J_{Style}(S,G)$

To create the generated image G:

  1. Initialize G randomly (for example, 100x100x3). This just creates a random noisy image.
  2. Use gradient descent to minimize $J(G)$. Starting from the noisy image, each update moves G toward the content of C and the style of S, in proportions set by the weights.
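The two steps above can be sketched as a plain gradient descent loop. Note the assumption: to keep the example self-contained, the real content and style costs (which are computed on CNN activations) are replaced by simple pixel-space quadratic stand-ins, so this only illustrates the update rule and the role of $\alpha$ and $\beta$, not actual style transfer. The names `gradient_descent` and `grad_J` are hypothetical.

```python
import numpy as np

def gradient_descent(G0, grad_fn, learning_rate=0.1, steps=200):
    """Repeatedly step G against the gradient of the cost."""
    G = G0.copy()
    for _ in range(steps):
        G = G - learning_rate * grad_fn(G)
    return G

rng = np.random.default_rng(0)
C = rng.random((100, 100, 3))  # stand-in content image
S = rng.random((100, 100, 3))  # stand-in style image
alpha, beta = 1.0, 3.0

# ASSUMPTION: pixel-space stand-in for the real cost,
# J(G) = alpha * 1/2 * ||G - C||^2 + beta * 1/2 * ||G - S||^2
def grad_J(G):
    return alpha * (G - C) + beta * (G - S)

G0 = rng.random((100, 100, 3))    # step 1: random noisy image
G = gradient_descent(G0, grad_J)  # step 2: minimize J(G)
```

For this stand-in cost the minimizer is the weighted average $(\alpha C + \beta S)/(\alpha + \beta)$, which makes the effect of the weights easy to see. With the real costs, the gradient with respect to G would instead come from backpropagation through the pre-trained network.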

Content Cost Function $J_{Content}(C,G)$

  • Say you use hidden layer $l$ to compute the content cost
  • Use a pre-trained ConvNet (e.g. the VGG-19 network)
  • Let $a^{l(C)}$ and $a^{l(G)}$ be the activations of layer $l$ on the images C and G
  • If $a^{l(C)}$ and $a^{l(G)}$ are similar, both images have similar content

$J_{Content}(C,G)$ = $\frac12 || a^{l(C)} - a^{l(G)}||^2$
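As a minimal sketch, the content cost can be written directly in NumPy, assuming the layer-$l$ activations of C and G are already available as arrays of the same shape. The function name `content_cost` and the toy activations are illustrative only.

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = 1/2 * ||a_C - a_G||^2, summed over all entries of the activations."""
    a_C = np.asarray(a_C, dtype=float)
    a_G = np.asarray(a_G, dtype=float)
    return 0.5 * np.sum((a_C - a_G) ** 2)

# Hypothetical activations of shape (n_H, n_W, n_C) = (2, 2, 3)
a_C = np.ones((2, 2, 3))
a_G = np.zeros((2, 2, 3))
print(content_cost(a_C, a_G))  # 12 entries, each differing by 1, so 0.5 * 12 = 6.0
```

The cost is zero exactly when the two activation volumes are identical, which matches the idea that similar activations mean similar content.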

Style Cost Function $J_{Style}(S,G)$

Meaning of the “Style” of an image

Say you are using layer $l$’s activation to measure “style”.
Define style as the correlation between activations across channels.

[Figure: Layer 1 visualization]

What does it mean for two channels to be correlated? It means that whichever part of the image has the vertical texture (1,2 grid) will also tend to have the orangish tint (2,1 grid). Uncorrelated means that parts of the image with the vertical texture in 1,2 will probably not have the orangish tint in 2,1.

Correlation tells you which of these high-level texture components tend to occur together in an image.


Style Matrix

Let $a_{i,j,k}^{[l]}$ be the activation at position $(i,j,k)$ in layer $l$, where $i$ indexes height, $j$ indexes width, and $k$ indexes the channel.

The style matrix $G^{l(S)}$ for layer $l$ has shape $n_c^{[l]} \times n_c^{[l]}$, that is, number of channels by number of channels.

Entry $G_{KK'}^{l(S)}$ measures how correlated channels $K$ and $K'$ are:

$G_{KK'}^{l(S)}$ = $\sum_{i=1}^{n_h^{[l]}}\sum_{j=1}^{n_w^{[l]}} a_{ijK}^{(S)[l]} a_{ijK'}^{(S)[l]}$

This is going to be the style matrix for the input Style image S.

Doing the same thing for the generated image G:

$G_{KK'}^{l(G)}$ = $\sum_{i=1}^{n_h^{[l]}}\sum_{j=1}^{n_w^{[l]}} a_{ijK}^{(G)[l]} a_{ijK'}^{(G)[l]}$

$K$ and $K'$ range from 1 to $n_c^{[l]}$.

In linear algebra, this matrix $G$ is called the Gram matrix.
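The double sum over positions is a matrix product once the spatial grid is unrolled. The sketch below assumes activations are stored as a NumPy array of shape $(n_h, n_w, n_c)$; `gram_matrix` is an illustrative name and the input values are made up.

```python
import numpy as np

def gram_matrix(a):
    """Style (Gram) matrix of activations a with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    # Unroll the spatial grid so each column holds one channel's activations
    flat = a.reshape(n_H * n_W, n_C)
    # Entry (K, K') is the sum over all positions of a[i, j, K] * a[i, j, K']
    return flat.T @ flat

# Toy activations with two channels on a 2x2 grid
a = np.stack([np.ones((2, 2)), np.array([[1.0, 2.0], [3.0, 4.0]])], axis=-1)
print(gram_matrix(a))
# [[ 4. 10.]
#  [10. 30.]]
```

The result is symmetric with shape $n_c \times n_c$. The diagonal entries measure how active each channel is overall, and the off-diagonal entries measure how strongly two channels fire together.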

$G_{KK'}^{l(S)}$ : Style of image S

$G_{KK'}^{l(G)}$ : Style of image G

Style Cost Function

$J_{Style}^{[l]}(S,G) = \frac1{(2 n_h^{[l]}n_w^{[l]}n_c^{[l]})^2} || G^{l(S)} - G^{l(G)} ||^2_F$

Writing out the Frobenius norm entry by entry:

$J_{Style}^{[l]}(S,G) = \frac1{(2 n_h^{[l]}n_w^{[l]}n_c^{[l]})^2}\sum_K \sum_{K'} (G_{KK'}^{l(S)} - G_{KK'}^{l(G)})^2$

You’ll get better results if you use the style cost function from multiple deeper layers.

Overall style cost function

$J_{Style}(S,G) = \sum_l \lambda^{[l]} J_{Style}^{[l]}(S,G)$
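A minimal NumPy sketch of the per-layer style cost and its weighted sum over layers follows. It assumes the activations of S and G at each chosen layer are given as arrays of shape $(n_h, n_w, n_c)$; the function names, the toy activations, and the $\lambda$ values are illustrative assumptions, not values from the course.

```python
import numpy as np

def layer_style_cost(a_S, a_G):
    """Style cost for one layer; a_S and a_G have shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_S.shape
    # Gram matrices: unroll spatial positions, then take channel-by-channel products
    flat_S = a_S.reshape(-1, n_C)
    flat_G = a_G.reshape(-1, n_C)
    gram_S = flat_S.T @ flat_S
    gram_G = flat_G.T @ flat_G
    # Squared Frobenius norm of the difference, with the normalization constant
    norm = (2.0 * n_H * n_W * n_C) ** 2
    return np.sum((gram_S - gram_G) ** 2) / norm

def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum of the per-layer style costs."""
    return sum(lam * layer_style_cost(a_S, a_G)
               for a_S, a_G, lam in zip(acts_S, acts_G, lambdas))

# Toy activations for two layers (made-up values)
acts_S = [np.ones((2, 2, 2)), 2.0 * np.ones((1, 1, 2))]
acts_G = [np.zeros((2, 2, 2)), np.zeros((1, 1, 2))]
total = style_cost(acts_S, acts_G, lambdas=[0.5, 0.5])
print(total)
```

Here the two layers contribute 0.25 and 4.0 before weighting, so the weights $\lambda^{[l]}$ control how much each layer's texture statistics influence the result.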

Overall Cost Function

$J(G)$ = $\alpha J_{Content}(C,G)$ + $\beta J_{Style}(S,G)$

Source material is from Andrew Ng’s awesome course on Coursera. The material in the videos has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.

Written on December 26, 2017