I hiked Mount Pilatus, near Lucerne, Switzerland. At the summit, I walked around the Dragon Trail, so-named because people used to believe dragons lived at the top of the mountain. I took a few photos of the absolutely gorgeous view of houses, lakes, greenery, and other mountains.

Here are the original images:

You can see that the two images overlap. I would like to merge the images. Here’s how that can be done:

- Find keypoints (points of interest) in each image.
- Encode the keypoints in each image. Compare the encodings to determine if a keypoint in the left image matches a keypoint in the right image.
- From the matching keypoints, calculate how the second image needs to be warped in order to fuse the images together.
- Combine the images!

Let’s do this!

## 1. Find keypoints

For each image, we create a Gaussian pyramid. Each level of the pyramid is the image blurred with a Gaussian filter, and the blur gets stronger as we go up the levels.
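To make the pyramid concrete, here is a minimal NumPy sketch. It assumes a grayscale image stored as a 2-D array and keeps every level at full resolution (no downsampling), which this post does not specify. The names `gaussian_pyramid`, `sigma0`, `k`, and `levels`, and their default values, are illustrative choices rather than the settings used for the photos here.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Repeat the border pixels so the output keeps the input size.
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def gaussian_pyramid(image, sigma0=1.0, k=np.sqrt(2), levels=(-1, 0, 1, 2, 3, 4)):
    """Stack of blurred copies of `image`, one per level (assumed parameters)."""
    image = image.astype(np.float64)
    return np.stack([gaussian_blur(image, sigma0 * k ** level) for level in levels])
```

Each level uses a blur width of `sigma0 * k ** level`, so higher levels are blurrier.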

Next, we make a Difference of Gaussian pyramid. At each level of the Gaussian pyramid, we subtract the previous level.
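If the Gaussian pyramid is stored as a stack of same-sized arrays (an assumption on my part, with shape `(levels, height, width)`), the subtraction is a one-liner. The function name `dog_pyramid` is hypothetical.

```python
import numpy as np

def dog_pyramid(gaussian_levels):
    """Subtract each level of the Gaussian pyramid from the level after it.

    `gaussian_levels` has shape (levels, height, width); the result has one
    level fewer.
    """
    return gaussian_levels[1:] - gaussian_levels[:-1]
```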

Here’s what those pyramids look like.

The Difference of Gaussian pyramids allow us to find corners, edges, and blobs in the image.

Next, we find possible interest points. Interest points are local extrema in space: each one should have the maximum or minimum value among its neighboring pixels. Interest points should also be extrema in scale: in the Difference of Gaussian pyramid, the point should also be the maximum or minimum compared with the neighboring levels above and below.

We take all the points that are extrema and that have Difference of Gaussian magnitude above a certain threshold.
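Here is one way the extrema search could look in NumPy. It is a sketch under a few assumptions: the Difference of Gaussian pyramid is a `(levels, height, width)` array, each pixel is compared with its 8 neighbors in the same level plus the same pixel one level up and one level down, and ties count as extrema. The name `local_extrema` and the default `threshold` are made up for illustration.

```python
import numpy as np

def local_extrema(dog, threshold=0.03):
    """Return an array of (level, row, col) for extrema in space and scale."""
    n_levels, height, width = dog.shape
    # Edge padding gives border pixels and the first/last level a neighbor.
    padded = np.pad(dog, 1, mode="edge")

    neighbors = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) != (0, 0):
                # The 8 spatial neighbors within the same level.
                neighbors.append(
                    padded[1:-1, 1 + dy:1 + dy + height, 1 + dx:1 + dx + width])
    neighbors.append(padded[:-2, 1:-1, 1:-1])  # same pixel, level below
    neighbors.append(padded[2:, 1:-1, 1:-1])   # same pixel, level above
    neighbors = np.stack(neighbors)

    is_max = (dog >= neighbors).all(axis=0)
    is_min = (dog <= neighbors).all(axis=0)
    strong = np.abs(dog) > threshold  # drop weak responses
    return np.argwhere((is_max | is_min) & strong)
```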

Finally, we filter out points that are edges. Edges make lousy keypoints, because it’s difficult to tell where a point is along an edge, and edges are not as distinctive as corners and blobs. To get rid of edges, we calculate the principal curvature ratio for each of the candidate points. Edges have principal curvature that is much larger in one direction than in the other. We discard any point whose principal curvature ratio is above a certain threshold.
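One common way to measure this (it is the test used in SIFT) is the ratio Tr(H)² / Det(H) of the 2×2 Hessian of the Difference of Gaussian image. The ratio is 4 for a perfectly round blob and grows as one curvature dominates the other. The sketch below assumes that measure, estimates the second derivatives with `np.gradient`, and uses a hypothetical cutoff `max_ratio=12.0`. The function names are mine.

```python
import numpy as np

def curvature_ratio(dog_level):
    """Tr(H)^2 / Det(H) of the Hessian at every pixel of one DoG level."""
    dy, dx = np.gradient(dog_level)   # first derivatives along rows, columns
    dyy, dyx = np.gradient(dy)        # second derivatives
    dxy, dxx = np.gradient(dx)
    trace = dxx + dyy
    det = dxx * dyy - dxy * dyx
    with np.errstate(divide="ignore", invalid="ignore"):
        return trace ** 2 / det

def drop_edges(dog, points, max_ratio=12.0):
    """Keep only the (level, row, col) points that do not look like edges."""
    ratios = np.stack([curvature_ratio(level) for level in dog])
    r = ratios[points[:, 0], points[:, 1], points[:, 2]]
    # A non-positive determinant (or inf/nan) means the curvatures disagree
    # in sign or one of them vanishes, so those points are dropped too.
    keep = (r > 0) & (r < max_ratio)
    return points[keep]
```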

Now we’ve got our keypoints for each image:

## 2. Encode the keypoints

We can use a BRIEF (Binary Robust Independent Elementary Features) descriptor to encode keypoints. At each keypoint, we take a 9×9 pixel patch centered at that keypoint. Then we generate a vector of bits by randomly sampling and comparing pixels in the patch.
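A BRIEF-style descriptor can be sketched in a few lines. The assumptions here are mine: 256 bits per descriptor, test pixel pairs drawn uniformly at random inside the 9×9 patch, and a fixed seed so that both images are encoded with the same pairs. The names `make_test_pattern` and `brief_descriptor` are hypothetical.

```python
import numpy as np

PATCH = 9  # patch width in pixels, as in the text

def make_test_pattern(n_bits=256, seed=0):
    """Random pairs of flat indices into a PATCH x PATCH patch.

    The same pattern must be reused for every keypoint in both images,
    otherwise the bits are not comparable.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, PATCH * PATCH, size=(n_bits, 2))

def brief_descriptor(image, row, col, pattern):
    """Boolean descriptor for the keypoint at (row, col).

    The keypoint must be at least PATCH // 2 pixels away from the border.
    """
    r = PATCH // 2
    patch = image[row - r:row + r + 1, col - r:col + r + 1].ravel()
    # Bit i is set when the first pixel of pair i is darker than the second.
    return patch[pattern[:, 0]] < patch[pattern[:, 1]]
```

Because each bit only records which of two pixels is darker, the descriptor does not change when the whole patch gets uniformly brighter or darker.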

We compare keypoints in the left image and the right image by taking the Hamming distance of the BRIEF descriptors. The Hamming distance is the number of bits that are different.
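With descriptors stored as boolean arrays (one row per keypoint), the comparison could look like the sketch below. Matching each left keypoint to its nearest right keypoint and rejecting matches above `max_distance` is a simple strategy I am assuming for illustration. The function names and the cutoff value are hypothetical.

```python
import numpy as np

def hamming_distances(desc_left, desc_right):
    """Pairwise Hamming distances between two sets of boolean descriptors.

    Inputs have shapes (n_left, n_bits) and (n_right, n_bits); the result
    has shape (n_left, n_right).
    """
    return (desc_left[:, None, :] != desc_right[None, :, :]).sum(axis=2)

def match_descriptors(desc_left, desc_right, max_distance=40):
    """Pair each left keypoint with its nearest right keypoint."""
    distances = hamming_distances(desc_left, desc_right)
    nearest = distances.argmin(axis=1)
    best = distances[np.arange(len(desc_left)), nearest]
    return [(i, int(j)) for i, j in enumerate(nearest) if best[i] <= max_distance]
```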

Here’s what those matches look like. If a keypoint in the left image matches one in the right image, there is a red line between them.

There are a ton of matches! Some of them are outliers and obviously erroneous matches.

## 3. Calculate planar homography

We need to figure out how the left image and right image are related to each other. So we need to calculate the 3×3 homography matrix that tells us how an image taken from one camera view relates to an image taken from a different camera view. We can do this by finding the matrix that minimizes the squared error over the matching keypoints.
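A standard way to set up that least-squares problem is the direct linear transform: every matching pair contributes two linear equations in the nine entries of the matrix, and the best-fitting solution is the right singular vector with the smallest singular value. The sketch below assumes that formulation (it minimizes an algebraic error, not pixel distance) and assumes the bottom-right entry is nonzero so the matrix can be normalized. `fit_homography` and `apply_homography` are names I made up.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography H such that dst ~ H @ src.

    `src` and `dst` are (n, 2) arrays of matching (x, y) points, n >= 4.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the scale (assumes H[2, 2] is not zero)

def apply_homography(H, pts):
    """Map (n, 2) points through H using homogeneous coordinates."""
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]
```

With exactly 4 pairs the fit is exact, and with more pairs the error is spread across all of them.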

But we run into a problem: a lot of the matches are outliers, and they would skew the homography matrix with disastrous results. We solve this problem with RANSAC (random sample consensus). First, we pick 4 random matching pairs (we need at least 4 pairs of points to solve for the homography matrix). We calculate a homography matrix from these random pairs. Next, we apply the matrix to all the matching points in one image, then compare the results with the matching points in the other image. If the difference is within a certain tolerance level, the point pair is an inlier. We iterate this process, keeping the homography matrix that resulted in the most inliers.
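The loop described above can be sketched as follows. The homography fit inside it uses the direct linear transform with an SVD, which is my assumption about how to solve for the matrix. The iteration count `n_iters`, the pixel tolerance `tol`, and all function names are illustrative values, not the ones used for these photos.

```python
import numpy as np

def fit_homography(src, dst):
    """Homography (up to scale) mapping src points to dst points, via SVD."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)

def project(H, pts):
    """Map (n, 2) points through H using homogeneous coordinates."""
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]

def ransac_homography(src, dst, n_iters=1000, tol=2.0, seed=0):
    """Return the homography with the most inliers, and the inlier mask."""
    rng = np.random.default_rng(seed)
    best_H, best_inliers = None, np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        # 4 random matching pairs are the minimum needed for a homography.
        sample = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[sample], dst[sample])
        # Degenerate samples can produce inf/nan; those never count as inliers.
        with np.errstate(all="ignore"):
            error = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = error < tol
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    if best_H is not None:
        best_H = best_H / best_H[2, 2]
    return best_H, best_inliers
```

A common follow-up, not shown here, is to refit the matrix on all the inliers once the best sample has been found.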

## 4. Combine the images

Now we can combine the two images. Using the homography matrix, we warp one image into the reference frame of the other. We make adjustments so that both images fit into the panorama without clipping. Where the images overlap, we blend them together.
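Here is a small sketch of the combining step for grayscale images. It makes several simplifying assumptions that are mine, not necessarily what was done for these panoramas: the homography `H` maps pixel coordinates `(x, y)` of the second image into the frame of the reference image, warping uses nearest-neighbor lookup, and blending is a plain average wherever the two images overlap. The names `stitch` and `project` are hypothetical.

```python
import numpy as np

def project(H, pts):
    """Map (n, 2) points through H using homogeneous coordinates."""
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]

def stitch(ref, other, H):
    """Warp `other` into the frame of `ref` and average the overlap."""
    h_r, w_r = ref.shape
    h_o, w_o = other.shape

    # Size the canvas so that both images fit without clipping.
    corners = np.array([[0, 0], [w_o - 1, 0], [w_o - 1, h_o - 1], [0, h_o - 1]],
                       dtype=np.float64)
    extent = np.vstack([project(H, corners), [[0, 0], [w_r - 1, h_r - 1]]])
    x_min, y_min = np.floor(extent.min(axis=0)).astype(int)
    x_max, y_max = np.ceil(extent.max(axis=0)).astype(int)
    height, width = y_max - y_min + 1, x_max - x_min + 1

    # Inverse warp: for every canvas pixel, look up its source in `other`.
    ys, xs = np.mgrid[y_min:y_max + 1, x_min:x_max + 1]
    canvas = np.column_stack([xs.ravel(), ys.ravel()]).astype(np.float64)
    source = np.rint(project(np.linalg.inv(H), canvas)).astype(int)
    sx = source[:, 0].reshape(height, width)
    sy = source[:, 1].reshape(height, width)
    valid_o = (sx >= 0) & (sx < w_o) & (sy >= 0) & (sy < h_o)
    pano_o = np.zeros((height, width))
    pano_o[valid_o] = other[sy[valid_o], sx[valid_o]]

    # Paste the reference image at its offset inside the canvas.
    oy, ox = -y_min, -x_min
    pano_r = np.zeros((height, width))
    valid_r = np.zeros((height, width), dtype=bool)
    pano_r[oy:oy + h_r, ox:ox + w_r] = ref
    valid_r[oy:oy + h_r, ox:ox + w_r] = True

    # Average where both images contribute; blank pixels stay at 0.
    weight = valid_r.astype(np.float64) + valid_o.astype(np.float64)
    return (pano_r + pano_o) / np.maximum(weight, 1.0)
```

Averaging is the simplest possible blend. Weighting each pixel by its distance from the image border usually hides the seam better.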

And here’s a panorama with three photos. The photo at the top of this blog post is used on the far left:

This panorama is cropped so that there is no blank space in the image.