Using deep neural nets, it is possible to change a photo or video to mimic the style of a piece of art. We have the original image, the style source image, and the pastiche. Here’s an example of J at Jack Block Park using the style of Rain Princess by Leonid Afremov. You can see the pastiche appears to be made of colorful oil strokes.
We run the original image through a neural net. I initialized the neural net with weights from a VGG19 model trained on ImageNet. By doing so, we already have a good basis for features important to object categorization and invariance to translation, rotation, etc.
The loss function balances two terms: content loss, which measures how far the pastiche is from the content of the original image, and style loss, which measures how far it is from the style of the style source image. By adjusting the weights on these two terms, we can change the importance of minimizing either content or style loss.
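To make the weighting concrete, here is a minimal NumPy sketch of a weighted content-plus-style loss. It assumes the common Gram-matrix formulation of style loss, which may not be exactly what my code used. The names gram_matrix, content_weight, and style_weight are illustrative, and the random arrays stand in for feature maps that would really come from a layer of the VGG19 network.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def content_loss(pastiche_feats, content_feats):
    # Mean squared difference between feature maps from the same layer.
    return float(np.mean((pastiche_feats - content_feats) ** 2))

def style_loss(pastiche_feats, style_feats):
    # Mean squared difference between Gram matrices (assumed formulation).
    diff = gram_matrix(pastiche_feats) - gram_matrix(style_feats)
    return float(np.mean(diff ** 2))

def total_loss(pastiche_feats, content_feats, style_feats,
               content_weight=1.0, style_weight=1.0):
    # The two weights set the relative importance of content and style.
    return (content_weight * content_loss(pastiche_feats, content_feats)
            + style_weight * style_loss(pastiche_feats, style_feats))

# Stand-in feature maps (random numbers, not real VGG19 activations).
rng = np.random.default_rng(0)
content_feats = rng.normal(size=(4, 8, 8))
style_feats = rng.normal(size=(4, 8, 8))
pastiche_feats = content_feats.copy()  # a pastiche identical to the original

loss_content_heavy = total_loss(pastiche_feats, content_feats, style_feats, 10.0, 1.0)
loss_style_heavy = total_loss(pastiche_feats, content_feats, style_feats, 1.0, 10.0)
print(loss_content_heavy, loss_style_heavy)
```

Because the stand-in pastiche equals the original, its content loss is zero, so the style-heavy weighting penalizes it ten times as much as the content-heavy weighting does.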
Here’s a picture of Cherry Creek Falls in the style of Vincent van Gogh’s The Starry Night. In the first pastiche, minimizing content loss is more important. In the second pastiche, minimizing style loss is more important.
And here’s a video of me floating in a pool. Even on a 10-second 240×134 video, the style transfer process was slow, because I trained a neural net on each frame. At 60fps, a 10-second video has about 600 frames, so about 600 neural nets had to be trained.
I hiked Mount Pilatus, near Lucerne, Switzerland. At the summit, I walked around the Dragon Trail, so-named because people used to believe dragons lived at the top of the mountain. I took a few photos of the absolutely gorgeous view of houses, lakes, greenery, and other mountains.
Here are the original images:
You can see that the two images overlap. I would like to merge the images. Here’s how that can be done:
Find keypoints (points of interest) in each image.
Encode the keypoints in each image. Compare the encodings to determine if a keypoint in the left image matches a keypoint in the right image.
From the matching keypoints, calculate how the 2nd image needs to be altered in order to fuse the images together.
Combine the images!
Let’s do this!
1. Find keypoints
For each image, we create a Gaussian pyramid. Each level of the pyramid is the image with a Gaussian filter applied, with more blur at each successive level.
Next, we make a Difference of Gaussian pyramid. Each of its levels is a level of the Gaussian pyramid minus the previous level.
Here’s what those pyramids look like. Click on the thumbnails to get a better view.
The Difference of Gaussian pyramids allow us to find corners, edges, and blobs in the image.
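Here is a rough sketch of building both pyramids with NumPy. It assumes the pyramid keeps the image at full size and only increases the blur, with sigma = sigma0 * k^level. The values of sigma0, k, and levels are illustrative choices, not necessarily the ones behind the images in this post.

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding (along rows, then along columns)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def gaussian_pyramid(image, sigma0=1.0, k=np.sqrt(2), levels=(-1, 0, 1, 2, 3, 4)):
    # One blurred copy of the image per level; blur grows with the level.
    return np.stack([gaussian_blur(image, sigma0 * k ** lvl) for lvl in levels])

def dog_pyramid(pyramid):
    # Each Difference of Gaussian level is a Gaussian level minus the previous one.
    return pyramid[1:] - pyramid[:-1]

# A toy image: a bright square on a dark background.
image = np.zeros((32, 32))
image[12:20, 12:20] = 1.0

gauss = gaussian_pyramid(image)
dog = dog_pyramid(gauss)
print(gauss.shape, dog.shape)
```

With six Gaussian levels we get five Difference of Gaussian levels, and a perfectly flat image produces an all-zero Difference of Gaussian pyramid.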
Next, we find possible interest points. Interest points are local extrema in space; they should have max or min pixel value compared to neighboring pixels. Interest points should also be extrema in scale; in the Difference of Gaussian pyramid, the point should be the max or min of the neighboring levels above and below.
We take all the points that are extrema and that have Difference of Gaussian magnitude above a certain threshold.
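A simple, unoptimized sketch of that search follows. It compares each point with its 8 spatial neighbors and with the same pixel one level above and below. For brevity it only checks interior levels and interior pixels, and the threshold value is an assumption.

```python
import numpy as np

def find_extrema(dog, threshold=0.03):
    """Return (level, row, col) for points that are extrema in space and scale
    and whose Difference of Gaussian magnitude is above the threshold."""
    levels, height, width = dog.shape
    keypoints = []
    for lvl in range(1, levels - 1):
        for r in range(1, height - 1):
            for c in range(1, width - 1):
                value = dog[lvl, r, c]
                if abs(value) <= threshold:
                    continue
                patch = dog[lvl, r - 1:r + 2, c - 1:c + 2].ravel()
                spatial = np.delete(patch, 4)  # the 8 neighbors, center removed
                scale = [dog[lvl - 1, r, c], dog[lvl + 1, r, c]]
                neighbors = np.concatenate([spatial, scale])
                if value > neighbors.max() or value < neighbors.min():
                    keypoints.append((lvl, r, c))
    return keypoints

# Toy pyramid: one strong maximum, one strong minimum, one weak bump.
toy_dog = np.zeros((3, 7, 7))
toy_dog[1, 3, 3] = 1.0
toy_dog[1, 5, 5] = -1.0
toy_dog[1, 2, 2] = 0.01  # an extremum, but below the threshold

candidates = find_extrema(toy_dog)
print(candidates)
```

The weak bump is dropped by the threshold even though it is a local maximum.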
Finally, we filter out points that are edges. Edges make lousy keypoints, because it’s difficult to tell where a point is on an edge, and edges are not as distinctive as corners and blobs. To get rid of edges, we calculate the principal curvature ratio for each of the candidate points. Edges have principal curvature that is much larger in one direction than the other direction. We filter out points whose principal curvature ratio exceeds a certain threshold.
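One common way to compute that ratio, sketched below, is from the 2×2 Hessian of a Difference of Gaussian level: trace² / determinant is 4 when the two curvatures are equal and grows as one dominates. The finite-difference Hessian and the max_ratio value of 12 are assumptions for illustration.

```python
import numpy as np

def curvature_ratio(dog_level, r, c):
    """trace(H)^2 / det(H) of the Hessian H at (r, c), via finite differences."""
    d = dog_level
    dxx = d[r, c + 1] - 2 * d[r, c] + d[r, c - 1]
    dyy = d[r + 1, c] - 2 * d[r, c] + d[r - 1, c]
    dxy = (d[r + 1, c + 1] - d[r + 1, c - 1]
           - d[r - 1, c + 1] + d[r - 1, c - 1]) / 4.0
    det = dxx * dyy - dxy ** 2
    if det <= 0:
        # One curvature is zero or the two have opposite signs: not a blob.
        return np.inf
    return (dxx + dyy) ** 2 / det

def is_edge(dog_level, r, c, max_ratio=12.0):
    return curvature_ratio(dog_level, r, c) > max_ratio

# Toy examples: a round blob and a horizontal ridge (an edge-like structure).
yy, xx = np.mgrid[0:11, 0:11]
blob = np.exp(-((yy - 5) ** 2 + (xx - 5) ** 2) / 4.0)
ridge = np.exp(-((yy - 5) ** 2) / 4.0)

print(curvature_ratio(blob, 5, 5), is_edge(blob, 5, 5))
print(curvature_ratio(ridge, 5, 5), is_edge(ridge, 5, 5))
```

The blob is kept because its curvature is the same in both directions, while the ridge is rejected because it has no curvature along its length.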
Now we’ve got our keypoints for each image:
2. Encode the keypoints
We can use a BRIEF (Binary Robust Independent Elementary Features) descriptor to encode keypoints. At each keypoint, we take a 9×9 pixel patch centered at that keypoint. Then we generate a vector of bits by randomly sampling and comparing pixels in the patch.
We compare keypoints in the left image and the right image by taking the Hamming distance of the BRIEF descriptors. The Hamming distance is the number of bits that are different.
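Here is a small sketch of the descriptor and the comparison. The important detail is that the random pixel pairs are drawn once and reused for every keypoint, so the bits are comparable. The 256-bit length, the uniform sampling, and the helper names are assumptions; the toy right image is just the left image shifted 3 pixels.

```python
import numpy as np

PATCH = 9      # patch width, as in the text
N_BITS = 256   # descriptor length (an assumed value)

def make_test_pattern(n_bits=N_BITS, patch=PATCH, seed=0):
    # For every bit, two random pixel positions (flat indices) inside the patch.
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch * patch, size=(n_bits, 2))

def brief_descriptor(image, r, c, pattern, patch=PATCH):
    half = patch // 2
    window = image[r - half:r + half + 1, c - half:c + half + 1].ravel()
    # Each bit records whether the first sampled pixel is darker than the second.
    return window[pattern[:, 0]] < window[pattern[:, 1]]

def hamming(a, b):
    # Number of bits that differ.
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
left = rng.random((20, 20))
right = np.roll(left, 3, axis=1)  # same scene, shifted 3 pixels to the right

pattern = make_test_pattern()
left_desc = brief_descriptor(left, 10, 10, pattern)
right_points = [(10, 13), (10, 5), (6, 8)]
distances = [hamming(left_desc, brief_descriptor(right, r, c, pattern))
             for r, c in right_points]
best_match = right_points[int(np.argmin(distances))]
print(distances, best_match)
```

The keypoint at (10, 10) in the left image matches (10, 13) in the right image with a Hamming distance of 0, since the patches are identical.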
Here’s what those matches look like. If a keypoint in the left image matches one in the right image, a red line connects them.
There are a ton of matches! Some of them are outliers and obviously erroneous matches.
3. Calculate planar homography
We need to figure out how the left image and right image are related to each other. So we need to calculate the 3×3 homography matrix that tells us how an image taken from one camera view relates to an image taken from a different camera view. We can do this by finding the matrix that minimizes the least squares error.
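A standard way to set up that least-squares problem is the direct linear transform: each matching pair contributes two linear equations in the 9 entries of the matrix, and the SVD gives the solution with the smallest error. The sketch below assumes that formulation; fit_homography and apply_homography are illustrative names, and the points are synthetic.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography H such that dst ~ H @ src.
    src and dst are (N, 2) arrays of matching points, N >= 4."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The right singular vector with the smallest singular value
    # minimizes ||A h|| subject to ||h|| = 1.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary scale (assumes H[2, 2] != 0)

def apply_homography(H, points):
    pts = np.hstack([points, np.ones((len(points), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Synthetic check: recover a known homography from exact matches.
H_true = np.array([[1.1, 0.02, 30.0],
                   [-0.03, 0.95, 5.0],
                   [1e-4, 2e-5, 1.0]])
src = np.array([[0, 0], [400, 0], [400, 300], [0, 300],
                [200, 150], [100, 250]], dtype=float)
dst = apply_homography(H_true, src)
H_est = fit_homography(src, dst)
print(np.round(H_est, 4))
```

With exact matches the estimate equals the true matrix up to rounding; with real matches it is only as good as the matches are.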
But we ran into a problem: a lot of the matches are outliers, so they will skew the homography matrix with disastrous results. We solve this problem with RANSAC, random sample consensus. First, we pick 4 random matching pairs (we need at least 4 pairs of points to solve for the homography matrix). We calculate a homography matrix from these random pairs. Next, we apply the matrix to all the matching points in one image, then compare them with the matching points in the other. If the difference is within a certain tolerance level, the point pair is an inlier. We iterate this process, keeping the homography matrix that resulted in the most inliers.
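The loop described above can be sketched like this. The iteration count, the 2-pixel tolerance, and the function names are assumptions, and the data is synthetic: 30 correct matches plus 10 deliberately wrong ones.

```python
import numpy as np

def fit_homography(src, dst):
    # Direct linear transform; the scale of H is left arbitrary.
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def project(H, points):
    pts = np.hstack([points, np.ones((len(points), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def ransac_homography(src, dst, n_iters=500, tol=2.0, seed=0):
    """Keep the homography, fit to 4 random pairs, that has the most inliers."""
    rng = np.random.default_rng(seed)
    best_H = None
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        with np.errstate(all="ignore"):  # degenerate samples can divide by zero
            errors = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = errors < tol  # NaN errors compare as False
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers

rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.05, 120.0],
                   [-0.04, 1.0, 15.0],
                   [1e-4, 0.0, 1.0]])
src = rng.uniform(0, 400, size=(40, 2))
dst = project(H_true, src)
dst[30:] = rng.uniform(0, 400, size=(10, 2))  # the last 10 matches are wrong

H_best, inliers = ransac_homography(src, dst)
print(int(inliers.sum()), "inliers out of", len(src))
```

A common follow-up, not shown here, is to refit the homography using all the inliers of the best model.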
4. Combine the images
Now we can combine the two images. Using the homography matrix, we warp one image into the reference frame of the other. We make adjustments so that both images fit into the panorama without clipping. Where the images overlap, we blend them together.
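The "fit without clipping" adjustment can be sketched as follows: warp the corners of the second image, take the bounding box of both images, and build a translation that shifts everything into positive coordinates. The function names are illustrative, and the actual pixel warping and blending are left out.

```python
import numpy as np

def warp_corners(H, width, height):
    corners = np.array([[0, 0, 1], [width, 0, 1],
                        [width, height, 1], [0, height, 1]], dtype=float)
    mapped = corners @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def panorama_canvas(H, ref_size, other_size):
    """Return (canvas_width, canvas_height, T), where T is the translation
    that moves both images fully inside the canvas."""
    ref_w, ref_h = ref_size
    ref_corners = np.array([[0, 0], [ref_w, 0], [ref_w, ref_h], [0, ref_h]],
                           dtype=float)
    all_corners = np.vstack([ref_corners, warp_corners(H, *other_size)])
    min_xy = np.floor(all_corners.min(axis=0))
    max_xy = np.ceil(all_corners.max(axis=0))
    T = np.array([[1.0, 0.0, -min_xy[0]],
                  [0.0, 1.0, -min_xy[1]],
                  [0.0, 0.0, 1.0]])
    canvas_w, canvas_h = (max_xy - min_xy).astype(int)
    return int(canvas_w), int(canvas_h), T

# Toy case: the second 400x300 image sits 300 px right and 20 px up.
H = np.array([[1.0, 0.0, 300.0],
              [0.0, 1.0, -20.0],
              [0.0, 0.0, 1.0]])
canvas_w, canvas_h, T = panorama_canvas(H, (400, 300), (400, 300))
print(canvas_w, canvas_h)
```

The reference image is then placed using T, and the other image is warped using T @ H, so neither one falls outside the canvas.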
And here’s a panorama with three photos. The photo at the top of this blog post is used on the far left. Click to see it in its full glory:
In this panorama, there is clipping so that there is no blank space in the image.
The 20 newsgroups dataset is a data set of posts on 20 topics, ranging from cryptology to guns to baseball. I looked at 3 measures of similarity: Jaccard, cosine, and L2. Comparing each article with every other article, and taking the average similarity for that newsgroup, we get the following heat maps:
Cosine similarity seems the most reasonable, because it considers the relative frequency of words instead of the actual frequency. Take the case where there are two articles, A and B, and article B is the same as article A, except each word appears twice as many times in B as in A. The similarity measure ought to indicate the articles are highly similar. The Jaccard similarity would be 0.5, cosine similarity would be 1, and L2 similarity would be some non-zero number. With Jaccard and L2 similarity, the number of words in each article has some influence on the similarity measure, so when one article has a lot more words than another, they will appear more dissimilar.
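The doubled-article case is easy to check numerically. This sketch assumes Jaccard is computed on word counts as the sum of minimum counts over the sum of maximum counts, and that L2 similarity is the negated Euclidean distance. The word counts are made up.

```python
import math

def jaccard(x, y):
    # Sum of per-word minimum counts over sum of per-word maximum counts.
    return sum(map(min, x, y)) / sum(map(max, x, y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def l2_similarity(x, y):
    # Negated distance, so that larger means more similar (assumed definition).
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

article_a = [3, 1, 0, 2]                 # word counts for article A
article_b = [2 * n for n in article_a]   # every word appears twice as often

print(jaccard(article_a, article_b))        # 0.5
print(cosine(article_a, article_b))         # 1, up to floating-point rounding
print(l2_similarity(article_a, article_b))  # -sqrt(14), about -3.74
```

Only cosine similarity treats the two articles as the same; the other two measures are affected by the difference in length.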
Let’s look at the cosine similarity plot, but with values < 0.45 removed:
Each of soc.religion.christian and talk.politics.guns is highly similar to itself, and the two newsgroups are also similar to each other. Perhaps these two newsgroups have similar demographics. Other similar pairs include soc.religion.christian + alt.atheism and soc.religion.christian + talk.religion.misc. This seems plausible, since there is some overlap in discussing religion or the lack of it.
Next, we look at nearest-neighbor counts. For each article in a newsgroup, we find the article in another newsgroup that has the largest similarity to it, and count which newsgroup that nearest neighbor belongs to.
The average similarity plots are symmetric because each similarity measure is symmetric: for any articles x and y, (x, y) and (y, x) return the same value. Nothing in the formulas depends on the order of the bag-of-words vectors.
The nearest-neighbor plot is asymmetric. If an article A has the largest Jaccard similarity to an article B, that does not mean that B has the largest Jaccard similarity to A. For example, say there are three articles X, Y, and Z. X and Y are similar, but Z is very different from both. If Z is most similar to, say, X, that does not mean X is most similar to Z. In this case, X is most similar to Y. So an article in newsgroup M having the largest similarity to an article in newsgroup N does not mean that an article in newsgroup N will have the largest similarity to an article in newsgroup M.
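The X, Y, Z example can be checked with a few lines of Python. The word counts are made up so that X and Y are close and Z is far from both, and Jaccard is assumed to be the sum of minimum counts over the sum of maximum counts.

```python
def jaccard(x, y):
    # Sum of per-word minimum counts over sum of per-word maximum counts.
    return sum(map(min, x, y)) / sum(map(max, x, y))

# Made-up bag-of-words counts over a 3-word vocabulary.
articles = {
    "X": [5, 4, 0],
    "Y": [5, 5, 0],
    "Z": [0, 1, 9],
}

def nearest_neighbor(name):
    others = [other for other in articles if other != name]
    return max(others, key=lambda other: jaccard(articles[name], articles[other]))

nearest = {name: nearest_neighbor(name) for name in articles}
print(nearest)
```

Z picks X as its nearest neighbor, but X picks Y, so the relationship only goes one way.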
Looking at the Jaccard nearest-neighbor heat map, these groups are similar: talk.religion.misc + alt.atheism, soc.religion.christian + alt.atheism, rec.sport.hockey + rec.sport.baseball, comp.sys.ibm.pc.hardware + comp.os.ms-windows.misc, comp.sys.mac.hardware + comp.sys.ibm.pc.hardware.
Comparing the Jaccard plots, there is some overlap in similar newsgroups, such as soc.religion.christian + alt.atheism. In the nearest-neighbor plot, there are some newsgroups that appear similar that do not seem similar in the average similarity plot, such as comp.sys.mac.hardware + comp.sys.ibm.pc.hardware and rec.sport.hockey + rec.sport.baseball. Average similarity plots appear to have a more even distribution of similarity measures, whereas the counts in the nearest-neighbor plot are mostly low with some high counts.
Using average similarity is more suited to comparing newsgroups. With nearest-neighbors, each article has some discrete influence on similarity, so disparate newsgroups could wrongly appear similar. It could be the case that the articles in a newsgroup are extremely dissimilar to articles in other newsgroups, such as the articles in misc.forsale. Looking at the Jaccard and cosine average similarity plots, it appears misc.forsale is dissimilar to the other newsgroups. In the nearest-neighbor plot, a noticeable number of articles in misc.forsale are nearest-neighbors to comp.sys.ibm.pc.hardware, probably because there are a lot of PCs for sale, but not the other way around. Likewise, the articles in rec.sport.hockey and rec.sport.baseball might not be similar to each other, but they are more similar to each other than to other newsgroups.
Next, we look at how reducing the number of dimensions affects the quality of results for measures of similarity. Here’s the cosine similarity nearest-neighbor heat map:
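Now we reduce the dimensions with a random projection: we multiply the bag-of-words vectors by a random matrix whose entries are drawn from a standard normal distribution, which maps each article to a target dimension d.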
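One reading of this step is a Gaussian random projection, sketched below with NumPy. The article matrix here is synthetic (random Poisson word counts), not the 20 newsgroups data, and the function names are illustrative.

```python
import numpy as np

def random_projection(X, d, seed=0):
    """Project the rows of X down to d dimensions with a random normal matrix."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((X.shape[1], d))
    return X @ M

def cosine_similarities(X):
    # Normalize each row, then take all pairwise dot products.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Synthetic stand-in: 50 "articles" over a 2000-word vocabulary.
rng = np.random.default_rng(42)
X = rng.poisson(0.5, size=(50, 2000)).astype(float)

original = cosine_similarities(X)
err_small = np.abs(cosine_similarities(random_projection(X, 10)) - original).mean()
err_large = np.abs(cosine_similarities(random_projection(X, 500)) - original).mean()
print(err_small, err_large)
```

The cosine similarities after projecting to 500 dimensions stay much closer to the originals than after projecting to 10 dimensions.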
Wall-clock times (seconds)
With no dimension reduction, calculating cosine similarities took 202.86 sec and finding nearest neighbors took 0.90 sec.
[Table: wall-clock times for calculating cosine similarities and finding nearest neighbors at each target dimension d]
For dimension reduction and calculating cosine similarities, wall-clock time increased linearly with d.
Target dimension d=100 gave comparable results to the original embedding.
Now let’s look at a single article, and see how cosine similarities compare after dimension reduction.
The error is the vertical distance from a point on the scatterplot to y=x. As d increases, the sum of the errors and the standard deviation of the errors get smaller, because more of the information from the original full-dimension word vectors has been retained.
Looking at the target dimension vs. sum of errors:
[Table: target dimension d vs. sum of errors]
It appears that the sum of errors asymptotically decreases as d increases.
Now we try dimension reduction with a random sign (±1) instead of a normal distribution.
[Table: target dimension d vs. sum of errors, for random normal distribution and for random sign]
The results of dimension reduction by random sign and random normal distribution were similar. For both dimensionally-reduced matrices, the plot for d=100 was comparable to the one with full dimensions.
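Here is a sketch of that comparison for a single article, taking the error to be the absolute difference between the reduced and the original cosine similarity. The data is synthetic, and sum_of_errors and cosine_to_first are illustrative names.

```python
import numpy as np

def cosine_to_first(X):
    """Cosine similarity of the first row to every other row."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit[1:] @ unit[0]

def sum_of_errors(X, d, kind, seed=0):
    rng = np.random.default_rng(seed)
    if kind == "normal":
        M = rng.standard_normal((X.shape[1], d))
    elif kind == "sign":
        M = rng.choice([-1.0, 1.0], size=(X.shape[1], d))
    else:
        raise ValueError(f"unknown kind: {kind}")
    # Distance of each point from y = x on the before/after scatterplot.
    errors = np.abs(cosine_to_first(X @ M) - cosine_to_first(X))
    return float(errors.sum())

# Synthetic stand-in: 100 "articles" over a 2000-word vocabulary.
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(100, 2000)).astype(float)

for d in (10, 100, 400):
    print(d, sum_of_errors(X, d, "normal"), sum_of_errors(X, d, "sign"))
```

On this toy data both kinds of random matrix behave alike: the sum of errors shrinks as d grows.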
The dataset is a set of individuals from the 1000 Genomes Project, with a subsample of ~10,000 nucleobases for each individual. Each nucleobase was given a binary encoding based on whether it matches the mode (the most common nucleobase) at that position.
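A tiny sketch of that encoding, assuming 0 means "matches the mode at that position" and 1 means "differs from it" (the direction of the encoding is my assumption here). The three short sequences are made up.

```python
from collections import Counter

def encode_by_mode(genomes):
    """genomes: equal-length strings of nucleobases, one per individual.
    Returns one list of 0/1 per individual, with 1 where the nucleobase
    differs from the most common nucleobase at that position."""
    modes = [Counter(column).most_common(1)[0][0] for column in zip(*genomes)]
    return [[int(base != mode) for base, mode in zip(genome, modes)]
            for genome in genomes]

genomes = ["ACGT", "ACGA", "TCGA"]
encoded = encode_by_mode(genomes)
print(encoded)  # [[0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
```

Each row of the result is the binary vector for one individual, ready to be stacked into a matrix.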
The individuals were from 7 African populations:
YRI: Yoruba in Ibadan, Nigeria
LWK: Luhya in Webuye, Kenya
GWD: Gambian in Western Divisions in the Gambia
MSL: Mende in Sierra Leone
ESN: Esan in Nigeria
ASW: Americans of African Ancestry in SW USA
ACB: African Caribbeans in Barbados
Plotting the first and second principal components, we see the components capture geographic information.
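For reference, here is a minimal way to get principal component scores from a binary matrix with NumPy: center the columns, then take the SVD. The data is synthetic, with two made-up groups that differ at the first 50 positions; it is not the 1000 Genomes data.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic binary data: 40 individuals, 200 positions.
rng = np.random.default_rng(0)
freq = np.full((40, 200), 0.1)
freq[20:, :50] = 0.9  # the second group differs at the first 50 positions
X = (rng.random((40, 200)) < freq).astype(float)

scores = pca_project(X, 2)
group_a, group_b = scores[:20, 0], scores[20:, 0]
print(group_a.mean(), group_b.mean())
```

On this toy data the first component separates the two groups, which is the same kind of structure the plot shows for the real populations.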
On the v1 axis, the populations appear genetically similar except for LWK, ACB, and ASW. The LWK population in east Africa is relatively dissimilar to the populations on the west coast of Africa. Populations ACB and ASW are even more dissimilar and have a wide spread. Perhaps there is greater genetic diversity for the ACB and ASW populations because they are more likely to have mixed ancestry. So the first principal component captures genetic similarity to west African coast populations.
On the v2 axis, we see GWD in a cluster and MSL in a cluster, and ESN, YRI, and LWK in a cluster. ACB and ASW span both the MSL and the ESN/YRI/LWK clusters. So the second principal component captures the split between the two populations on the western part of the coast (GWD + MSL) from the other central and eastern populations (ESN/YRI/LWK), while suggesting individuals in the ACB and ASW populations could have ancestry from either region.
Plotting the first and third principal components, we see the third component captures gender.
Plotting the first and fourth principal components, we see the fourth component captures whether the individual belongs to the LWK population.