Style transfer using convolutional neural nets

Using deep neural nets, it is possible to change a photo or video to mimic the style of a piece of art. Three images are involved: the original image, the style source image, and the pastiche (the stylized result). Here’s an example of J at Jack Block Park using the style of Rain Princess by Leonid Afremov. You can see the pastiche appears to be made of colorful oil strokes.

Jack Block Park beach
Rain Princess by Leonid Afremov
Image with style from Rain Princess

We run the original image through a neural net. I initialized the neural net with weights from a VGG19 model trained on ImageNet, so we start with features that are already useful for object categorization and that are reasonably invariant to translation, rotation, etc.

The neural net’s loss function seeks to create a pastiche that minimizes both content loss (how far the pastiche is from the content of the original image) and style loss (how far it is from the style of the style source image). By adjusting the weights in the loss function, we can change the relative importance of minimizing content loss versus style loss.
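As a rough sketch (not necessarily the exact formulation used here), the combined loss might look like the following, assuming the relevant VGG19 feature maps have already been extracted; the Gram-matrix style term and the alpha/beta weights are illustrative.

```python
import numpy as np

def gram_matrix(features):
    # features: (height, width, channels) feature map from one VGG19 layer
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T @ f / (h * w * c)   # one common normalization

def content_loss(pastiche_feat, content_feat):
    # Mean squared difference between feature maps at a chosen content layer
    return np.mean((pastiche_feat - content_feat) ** 2)

def style_loss(pastiche_feats, style_feats):
    # Sum of mean squared differences between Gram matrices over style layers
    return sum(np.mean((gram_matrix(p) - gram_matrix(s)) ** 2)
               for p, s in zip(pastiche_feats, style_feats))

def total_loss(pastiche_content_feat, content_feat,
               pastiche_style_feats, style_feats,
               alpha=1.0, beta=1e3):
    # alpha and beta trade off content fidelity against style fidelity
    return (alpha * content_loss(pastiche_content_feat, content_feat)
            + beta * style_loss(pastiche_style_feats, style_feats))
```

Increasing alpha relative to beta favors the content of the original image; increasing beta favors the style of the style source image.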

Here’s a picture of Cherry Creek Falls in the style of Vincent van Gogh’s The Starry Night. In the first pastiche, minimizing content loss is more important. In the second pastiche, minimizing style loss is more important.

Cherry Creek Falls
Cherry Creek Falls weighted to minimize content loss
Cherry Creek Falls weighted to minimize style loss

And here’s a video of me floating in a pool. Even on a 10-second 240×134 video, the style transfer process was slow, because I trained a neural net on each frame. At 60 fps, a 10-second clip has hundreds of frames, so hundreds of corresponding neural nets had to be trained.

How to make a panorama

I hiked Mount Pilatus, near Lucerne, Switzerland. At the summit, I walked around the Dragon Trail, so-named because people used to believe dragons lived at the top of the mountain. I took a few photos of the absolutely gorgeous view of houses, lakes, greenery, and other mountains.

Here are the original images:

Mount Pilatus left image
Mount Pilatus right image

You can see that the two images overlap. I would like to merge the images. Here’s how that can be done:

  1. Find keypoints (points of interest) in each image.
  2. Encode the keypoints in each image. Compare the encodings to determine if a keypoint in the left image matches a keypoint in the right image.
  3. From the matching keypoints, calculate how the 2nd image needs to be altered in order to fuse the images together.
  4. Combine the images!

Let’s do this!

1. Find keypoints

For each image, we create a Gaussian pyramid: at each level, the image is blurred with a Gaussian filter of increasing width.

Next, we make a Difference of Gaussian pyramid: each level is the difference between adjacent levels of the Gaussian pyramid.
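As a rough sketch of the pyramid construction (the number of levels and the sigma values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramids(image, num_levels=5, sigma0=1.6, k=np.sqrt(2)):
    """Build a Gaussian pyramid and its Difference of Gaussian pyramid.

    image: 2-D grayscale array. Each Gaussian level uses a larger sigma.
    """
    gaussian = [gaussian_filter(image, sigma0 * k ** i)
                for i in range(num_levels)]
    # Each DoG level is the difference between adjacent Gaussian levels
    dog = [gaussian[i + 1] - gaussian[i] for i in range(num_levels - 1)]
    return gaussian, dog
```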

Here’s what those pyramids look like. Click on the thumbnails to get a better view.

Left image – Gaussian pyramid
Left image – Difference of Gaussian pyramid
Right image – Gaussian pyramid
Right image – Difference of Gaussian pyramid

The Difference of Gaussian pyramids allow us to find corners, edges, and blobs in the image.

Next, we find possible interest points. Interest points are local extrema in space: their pixel value should be a maximum or minimum compared to neighboring pixels. Interest points should also be extrema in scale: in the Difference of Gaussian pyramid, the point should be a maximum or minimum relative to its neighbors in the levels above and below.

We take all the points that are extrema and that have Difference of Gaussian magnitude above a certain threshold.

Finally, we filter out points that lie on edges. Edges make lousy keypoints, because it’s difficult to tell where a point is along an edge, and edges are not as distinctive as corners and blobs. To get rid of edges, we calculate the principal curvature ratio for each candidate point. Edges have principal curvature that is much larger in one direction than the other, so we filter out points whose principal curvature ratio is above a certain threshold.
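A sketch of that edge test, using a SIFT-style principal curvature check on the Hessian of the DoG level (the threshold r is illustrative):

```python
import numpy as np

def is_edge_like(dog, y, x, r=10.0):
    """Return True if the candidate at (y, x) should be rejected as edge-like.

    dog is one level of the DoG pyramid; r is the maximum allowed ratio
    of principal curvatures.
    """
    # 2x2 Hessian of the DoG level at (y, x) via finite differences
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    tr = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:
        return True  # curvatures have opposite signs: reject
    # For an edge, one principal curvature dominates, making tr^2/det large
    return tr ** 2 / det >= (r + 1) ** 2 / r
```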

Now we’ve got our keypoints for each image:

Left image keypoints
Right image keypoints

2. Encode the keypoints

We can use a BRIEF (Binary Robust Independent Elementary Features) descriptor to encode keypoints. At each keypoint, we take a 9×9 pixel patch centered at that keypoint. Then we generate a vector of bits by randomly sampling and comparing pixels in the patch.

We compare keypoints in the left image and the right image by taking the Hamming distance of the BRIEF descriptors. The Hamming distance is the number of bits that are different.
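Here’s a rough sketch of the descriptor and matching step (the number of bits, the sampling pattern, and the match threshold are illustrative; keypoints are assumed to be far enough from the image border for the patch to fit):

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH = 9        # 9x9 patch centered on the keypoint, as described above
N_BITS = 256     # descriptor length
# Fixed random sampling pattern: N_BITS pairs of pixel offsets in the patch
PAIRS = rng.integers(-(PATCH // 2), PATCH // 2 + 1, size=(N_BITS, 4))

def brief_descriptor(image, y, x):
    # Bit i is 1 if the first sampled pixel is darker than the second
    return np.array([image[y + dy1, x + dx1] < image[y + dy2, x + dx2]
                     for dy1, dx1, dy2, dx2 in PAIRS], dtype=np.uint8)

def hamming_distance(d1, d2):
    # Number of bits that differ between the two descriptors
    return int(np.count_nonzero(d1 != d2))

def match_keypoints(desc_left, desc_right, max_distance=60):
    # Match each left descriptor to its closest right descriptor
    matches = []
    for i, dl in enumerate(desc_left):
        dists = [hamming_distance(dl, dr) for dr in desc_right]
        j = int(np.argmin(dists))
        if dists[j] <= max_distance:
            matches.append((i, j))
    return matches
```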

Here’s what those matches look like. If a keypoint in the left image matches one in the right image, a red line connects them.

Keypoint matches between left image and right image

There are a ton of matches! Some of them are outliers: obviously erroneous matches.

3. Calculate planar homography

We need to figure out how the left image and right image are related to each other. So we need to calculate the 3×3 homography matrix that tells us how an image taken from one camera view relates to an image taken from a different camera view. We can do this by finding the matrix that minimizes the least-squares error over the matched keypoints.

But we run into a problem: a lot of the matches are outliers, so they will skew the homography matrix with disastrous results. We solve this problem with RANSAC (random sample consensus). First, we pick 4 random matching pairs (we need at least 4 pairs of points to solve for the homography matrix) and calculate a homography matrix from them. Next, we apply the matrix to all the matching points in one image and compare the results with the matching points in the other. If the difference is within a certain tolerance, the point pair is an inlier. We iterate this process, keeping the homography matrix that resulted in the most inliers.
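A sketch of that loop, using a direct linear transform (DLT) fit for each sample; the iteration count and pixel tolerance are illustrative:

```python
import numpy as np

def homography_from_pairs(src, dst):
    """Direct linear transform: solve for H mapping src points to dst points."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(A))
    return vt[-1].reshape(3, 3)   # null-space vector, reshaped to 3x3

def ransac_homography(src, dst, n_iters=1000, tol=3.0):
    """src, dst: (N, 2) float arrays of matched keypoints; tol is in pixels."""
    rng = np.random.default_rng(0)
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coords
    best_h, best_inliers = None, 0
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = homography_from_pairs(src[idx], dst[idx])
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]              # back to Cartesian
        inliers = np.count_nonzero(np.linalg.norm(proj - dst, axis=1) < tol)
        if inliers > best_inliers:
            best_h, best_inliers = H, inliers
    return best_h
```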

4. Combine the images

Now we can combine the two images. Using the homography matrix, we warp one image into the reference frame of the other. We make adjustments so that both images fit into the panorama without clipping. Where the images overlap, we blend them together.
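A sketch of the warp-and-blend step using OpenCV (the simple averaging blend stands in for whatever blending is actually applied; H is assumed to map right-image coordinates into the left image’s frame):

```python
import cv2
import numpy as np

def stitch(left, right, H):
    """Warp `right` into the frame of `left` with homography H, then blend."""
    h_l, w_l = left.shape[:2]
    h_r, w_r = right.shape[:2]
    # Where do the corners of each image land in the shared frame?
    corners_r = np.array([[0, 0], [w_r, 0], [w_r, h_r], [0, h_r]],
                         dtype=np.float64).reshape(-1, 1, 2)
    warped_r = cv2.perspectiveTransform(corners_r, H.astype(np.float64))
    corners_l = np.array([[0, 0], [w_l, 0], [w_l, h_l], [0, h_l]], dtype=np.float64)
    all_pts = np.vstack([warped_r.reshape(-1, 2), corners_l])
    x_min, y_min = np.floor(all_pts.min(axis=0)).astype(int)
    x_max, y_max = np.ceil(all_pts.max(axis=0)).astype(int)
    # Translate so the panorama has no negative coordinates (no clipping)
    T = np.array([[1, 0, -x_min], [0, 1, -y_min], [0, 0, 1]], dtype=np.float64)
    size = (int(x_max - x_min), int(y_max - y_min))
    pano_r = cv2.warpPerspective(right, T @ H, size)
    pano_l = cv2.warpPerspective(left, T, size)
    # Blend: average where the images overlap, otherwise take whichever is set
    overlap = (pano_r > 0) & (pano_l > 0)
    blended = np.where(overlap,
                       (pano_r.astype(np.float64) + pano_l) / 2,
                       np.maximum(pano_r, pano_l))
    return blended.astype(left.dtype)
```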

Panorama – left and right image combined

And here’s a panorama with three photos. The photo at the top of this blog post is used on the far left. Click to see it in its full glory:

Panorama with three images

In this panorama, there is clipping so that there is no blank space in the image.

Measures of similarity in the 20 newsgroups dataset

The 20 newsgroups dataset is a collection of posts on 20 topics, ranging from cryptology to guns to baseball. I looked at 3 measures of similarity: Jaccard, cosine, and L2. Comparing each article with every other article and averaging the similarities for each pair of newsgroups, we get the following heat maps:

Cosine similarity heat map
Jaccard similarity heat map
L2 similarity heat map

Cosine similarity seems the most reasonable, because it considers the relative frequency of words instead of the raw frequency. Take the case of two articles, A and B, where B is the same as A except that every word appears twice as many times in B. The similarity measure ought to indicate the articles are highly similar. The Jaccard similarity would be 0.5, the cosine similarity would be 1, and the L2 similarity would be some non-zero number. With Jaccard and L2 similarity, the number of words in each article influences the similarity measure, so when one article has a lot more words than another, they will appear more dissimilar.
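For reference, here is one way to compute the three measures on bag-of-words count vectors; the weighted Jaccard and negative-distance L2 conventions below match the example above but may not be the exact definitions used:

```python
import numpy as np

def jaccard_similarity(a, b):
    # Weighted Jaccard on count vectors: sum of minima over sum of maxima
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_similarity(a, b):
    # Larger is more similar, so negate the Euclidean distance
    return -np.linalg.norm(a - b)

a = np.array([2.0, 1.0, 0.0, 3.0])   # toy bag-of-words counts for article A
b = 2 * a                            # article B: same words, counts doubled
print(jaccard_similarity(a, b))      # 0.5
print(cosine_similarity(a, b))       # 1.0 (up to floating-point rounding)
print(l2_similarity(a, b))           # about -3.74
```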

Let’s look at the cosine similarity plot, but with values < 0.45 removed:

Cosine similarity > 0.45 heat map

Pairs of similar newsgroups include soc.religion.christian + soc.religion.christian and talk.politics.guns + talk.politics.guns (each newsgroup with itself), as well as soc.religion.christian + talk.politics.guns; perhaps these two newsgroups have similar demographics. Other similar pairs include soc.religion.christian + alt.atheism and soc.religion.christian + talk.religion.misc. This seems plausible: there is some overlap in discussing religion, or the lack of it.


Next, we look at nearest-neighbor counts. For each article in a newsgroup, we find the article in another newsgroup that has the largest similarity to it, and count these nearest neighbors for each pair of newsgroups.

Jaccard similarity nearest-neighbor heat map

The average similarity plots are symmetric because each similarity measure is symmetric: for any articles x and y, sim(x, y) = sim(y, x); nothing depends on the order of the bag-of-words vectors.

The nearest-neighbor plot is asymmetric. If an article A has the largest Jaccard similarity to an article B, that does not mean that B has the largest Jaccard similarity to A. For example, say there are three articles X, Y, and Z. X and Y are similar, but Z is very different from both. If Z is most similar to, say, X, that does not mean X is most similar to Z; in this case, X is most similar to Y. So, just because an article in newsgroup M has the largest similarity to an article in newsgroup N does not mean that an article in newsgroup N will have the largest similarity to an article in newsgroup M.

Looking at the Jaccard nearest-neighbor heat map, these groups are similar: talk.religion.misc + alt.atheism, soc.religion.christian + alt.atheism, rec.sport.hockey + rec.sport.baseball, comp.sys.ibm.pc.hardware + comp.os.ms-windows.misc,  comp.sys.mac.hardware + comp.sys.ibm.pc.hardware.

Comparing the Jaccard plots, there is some overlap in similar newsgroups, such as soc.religion.christian + alt.atheism. In the nearest-neighbor plot, there are some newsgroups that appear similar that do not seem similar in the average similarity plot, such as comp.sys.mac.hardware + comp.sys.ibm.pc.hardware and rec.sport.hockey + rec.sport.baseball. Average similarity plots appear to have a more even distribution of similarity measures, whereas the counts in the nearest-neighbor plot are mostly low with some high counts.

Average similarity is better suited to comparing newsgroups. With nearest neighbors, each article has some discrete influence on the counts, so disparate newsgroups could wrongly appear similar. It could be the case that the articles in a newsgroup are extremely dissimilar to articles in other newsgroups, such as the articles in misc.forsale. Looking at the Jaccard and cosine average similarity plots, it appears misc.forsale is dissimilar to the other newsgroups. In the nearest-neighbor plot, a noticeable number of articles in misc.forsale are nearest neighbors to comp.sys.ibm.pc.hardware, probably because there are a lot of PCs for sale, but not the other way around. Likewise, the articles in rec.sport.hockey and rec.sport.baseball might not be very similar to each other, but they are more similar to each other than to articles in other newsgroups.


Next, we look at how reducing the number of dimensions affects the quality of results for measures of similarity. Here’s the cosine similarity nearest-neighbor heat map:

Cosine similarity nearest-neighbor heat map

Now we reduce the dimensionality with a random projection: we multiply the bag-of-words vectors by a random matrix whose entries are drawn from a standard normal distribution, projecting them down to d dimensions.
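A sketch of the projection (the 1/sqrt(d) scaling is a common convention and is illustrative):

```python
import numpy as np

def random_projection(X, d, rng=None):
    """Project the n x V bag-of-words matrix X down to n x d dimensions."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Entries drawn i.i.d. from a standard normal, scaled by 1/sqrt(d)
    R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)
    return X @ R
```

Cosine similarities and nearest neighbors are then computed on the reduced vectors.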

Cosine similarity nearest-neighbor heat map, d=10
Cosine similarity nearest-neighbor heat map, d=25
Cosine similarity nearest-neighbor heat map, d=50
Cosine similarity nearest-neighbor heat map, d=100

 

Wall-clock times (seconds)

With no dimension reduction, calculating cosine similarities took 202.858168125 sec and finding nearest neighbors took 0.902053117752 sec.

d     dimension reduction   calculating cosine similarities   finding nearest neighbors
10    1.89096689224         34.9237360954                     0.762381076813
25    4.87242698669         81.7924189568                     0.696530103683
50    8.96683502197         158.616721869                     0.77707695961
100   18.9475579262         319.640784025                     0.732910871506

For dimension reduction and calculating cosine similarities, wall-clock time increased linearly with d.

Target dimension d=100 gave comparable results to the original embedding.


Now let’s look at a single article, and see how cosine similarities compare after dimension reduction.

Cosine similarities for a single article, d=10
Cosine similarities for a single article, d=25
Cosine similarities for a single article, d=50
Cosine similarities for a single article, d=100

The error is the vertical distance from a point on the scatterplot to the line y=x. As d increases, the sum of the errors and the standard deviation of the errors get smaller, because more of the information in the original full-dimensional word vectors is retained.
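The error computation is just the gap between each point and y=x (assuming the x-axis is the full-dimension cosine similarity and the y-axis is the reduced-dimension cosine similarity):

```python
import numpy as np

def projection_errors(full_sims, reduced_sims):
    """Per-article errors: vertical distance from (full, reduced) to y = x."""
    errors = np.abs(np.asarray(reduced_sims) - np.asarray(full_sims))
    return errors.sum(), errors.std()
```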

Looking at the target dimension vs. sum of errors:

d     sum of errors
10    300.6589293
25    113.3587640
50    74.9733475
100   67.0568351

It appears that the sum of errors asymptotically decreases as d increases.


Now we try dimension reduction with a random sign (±1) instead of a normal distribution.
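The only change from the earlier projection sketch is the distribution of the entries of the random matrix:

```python
import numpy as np

def random_sign_projection(X, d, rng=None):
    """Like random_projection above, but each entry of R is +1 or -1."""
    rng = np.random.default_rng(0) if rng is None else rng
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], d)) / np.sqrt(d)
    return X @ R
```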

Cosine similarity nearest-neighbor heat map, random sign, d=10
Cosine similarity nearest-neighbor heat map, random sign, d=25
Cosine similarity nearest-neighbor heat map, random sign, d=50
Cosine similarity nearest-neighbor heat map, random sign, d=100
Cosine similarities for a single article, random sign, d=10
Cosine similarities for a single article, random sign, d=25
Cosine similarities for a single article, random sign, d=50
Cosine similarities for a single article, random sign, d=100
d     sum of errors, random normal   sum of errors, random sign
10    300.6589293                    191.6908585
25    113.3587640                    111.0387588
50    74.9733475                     84.4925114
100   67.0568351                     66.6889494

The results of dimension reduction by random sign and random normal distribution were similar. For both dimensionally-reduced matrices, the plot for d=100 was comparable to the one with full dimensions.

PCA on data from the 1000 Genomes Project

I took a dataset of individuals from the 1000 Genomes Project, with a subsample of ~10,000 nucleobases for each individual, and gave each nucleobase a binary encoding based on the mode (the most common base) at that position.
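One plausible reading of that encoding, together with the PCA step, sketched with pandas and scikit-learn (the helper name and the exact encoding convention are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def encode_binary(bases: pd.DataFrame) -> np.ndarray:
    """bases: individuals x positions DataFrame of nucleobase characters.

    Encodes 0 where an individual's base matches the most common base
    (the mode) at that position, and 1 where it differs.
    """
    modes = bases.mode(axis=0).iloc[0]        # per-position modal base
    return (bases != modes).to_numpy(dtype=float)

# X = encode_binary(bases)
# pca = PCA(n_components=4)
# components = pca.fit_transform(X)           # columns v1..v4 plotted below
```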


The individuals were from 7 African populations:

YRI: Yoruba in Ibadan, Nigeria
LWK: Luhya in Webuye, Kenya
GWD: Gambian in Western Divisions in the Gambia
MSL: Mende in Sierra Leone
ESN: Esan in Nigeria
ASW: Americans of African Ancestry in SW USA
ACB: African Caribbeans in Barbados

A map of the populations in the data

Plotting the first and second principal components, we see the components capture geographic information.

v1 vs. v2 components, grouped by population

On the v1 axis, the populations appear genetically similar except for LWK, ACB, and ASW. The LWK population in east Africa is relatively dissimilar to the populations on the west coast of Africa. Populations ACB and ASW are even more dissimilar and have a wide spread. Perhaps there is greater genetic diversity for the ACB and ASW populations because they are more likely to have mixed ancestry. So the first principal component captures genetic similarity to west African coast populations.

On the v2 axis, we see GWD in one cluster, MSL in another, and ESN, YRI, and LWK in a third. ACB and ASW span both the MSL and the ESN/YRI/LWK clusters. So the second principal component separates the two populations on the western part of the coast (GWD and MSL) from the central and eastern populations (ESN/YRI/LWK), while suggesting individuals in the ACB and ASW populations could have ancestry from either region.


Plotting the first and third principal components, we see the third component captures gender.

v1 vs. v3 components, grouped by gender

Plotting the first and fourth principal components, we see the fourth component captures whether the individual belongs to the LWK population.

v1 vs. v4 components, grouped by population

Python libraries used: numpy, scikit-learn, pandas, matplotlib