Summary of Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

This paper introduces Gradient-weighted Class Activation Mapping (Grad-CAM), a method for visualizing a model’s decisions that improves on Class Activation Mapping (CAM).

CAM requires that the final convolutional feature maps be connected directly to the outputs (through global average pooling), because it uses the weights between those feature maps and the outputs to build its visualization. The drawback is that the model cannot have fully-connected layers after the convolutional layers. For image classification tasks, having a fully-connected layer at the end typically improves accuracy: the convolutional layers extract useful features, then the fully-connected layer learns combinations of these features for classification. Removing the fully-connected layers to satisfy CAM’s requirement can therefore reduce model performance.

With Grad-CAM, the model architecture does not have to be altered in order to be interpretable. Grad-CAM works as follows. First, the input image is forward-propagated through the model. Then, given the class of interest (for example, the true label), the class score is backpropagated to the final convolutional feature maps. The gradients with respect to each feature map are global-average-pooled into a single weight per map, the feature maps are combined in a weighted sum using these weights, and the result is passed through a ReLU to keep only the regions that contributed positively to the classifier’s decision.
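To make the pooling and ReLU steps concrete, here is a minimal sketch in plain Python. It assumes the feature maps and their gradients have already been obtained from some framework; the function name `grad_cam` and the toy numbers are my own illustration, not code from the paper.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one coarse heatmap.

    feature_maps: activations of the final convolutional layer.
    gradients: gradients of the class score with respect to those
    activations, with the same shape as feature_maps.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Global-average-pool each gradient map into a single weight per map.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # ReLU: keep only locations with a positive influence on the class.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy example: two 2x2 feature maps, with made-up gradients.
feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # pooled weight: 0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # pooled weight: -1.0
]
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)
```

In this toy case the second feature map gets a negative weight, so the locations where only it is active are zeroed out by the ReLU. In practice the heatmap would then be upsampled to the input image size.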

CAM and Grad-CAM both produce a heatmap that shows what the model is looking at when it decides on a classification, but these heatmaps highlight fairly coarse regions. The paper also proposes Guided Grad-CAM, a high-resolution visualization that highlights particular features of interest in an image.

Grad-CAM can be used to identify bias in the dataset by showing what attributes the model is focusing on in making its decisions. The example used in the paper is classifying pictures of people as doctors or nurses. Grad-CAM showed that the model was looking at face and hair, revealing a gender bias, which the authors traced to the distribution of their training data. Identifying bias is especially important for a multimodal problem because each modality has its own dataset, so bias can be introduced by each modality. These biases can add up, so the resulting model can have more bias than a unimodal model.


Summary of Representation Learning: A Review and New Perspectives

In Representation Learning: A Review and New Perspectives, Bengio et al. discuss distributed and deep representations. The authors also discuss three lines of research in representation learning: probabilistic models, reconstruction-based algorithms, and manifold-learning approaches. Though the paper separates these methods into discrete buckets, there is actually a lot of overlap between them.

The motivation behind using distributed representations is that there are multiple underlying factors for an observed input. In a distributed representation, a model can represent an exponential number of input regions with subsets of features. Different directions in input space correspond to different factors being varied. The motivation behind deep representations is that there is a hierarchy of explanatory factors. In a deep representation model, as we go from the input layer to the output layer, each layer is an increasingly complex combination of lower-level features. Ideally, a model should disentangle factors of variation, and one of the beliefs motivating deep representations is that the abstract, complex factors can have great explanatory power. Both distributed and deep representations reuse features. Distributed representations reuse features because they represent subsets of features rather than individual features; likewise, deep representations reuse features in that more abstract features are built from lower-level features. A model can be both distributed and deep.
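The "exponential number of input regions" point is easy to check by counting. The sketch below is only an illustration under a simplifying assumption of binary features; the helper names are hypothetical. It compares how many distinct codes n features give when any subset can be active (distributed) versus when exactly one is active at a time (one-hot).

```python
from itertools import product


def distributed_codes(num_features):
    """All on/off combinations of the features: 2 ** num_features codes."""
    return list(product([0, 1], repeat=num_features))


def one_hot_codes(num_features):
    """Codes with exactly one active feature: num_features codes."""
    return [
        tuple(1 if i == j else 0 for j in range(num_features))
        for i in range(num_features)
    ]


for n in (3, 10):
    print(n, len(distributed_codes(n)), len(one_hot_codes(n)))
```

With 10 binary features, the distributed code can distinguish 1024 regions while the one-hot code distinguishes only 10, which is the feature reuse described above.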

A joint representation for multimodal data can be learned by having input layers and hidden layers for each modality, then a hidden layer that projects each modality into a joint space. A coordinated representation can be learned by projecting each modality into its own space, with the spaces coordinated through some metric or constraint. For example, in Andrew et al., ICML 2013, there is an input layer and hidden layers for each modality, with the objective that the resulting representations from each modality are highly correlated.
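As a rough sketch of what "highly correlated" means as a coordination metric, the code below projects two modalities with fixed, hypothetical linear weights and measures the Pearson correlation of the resulting one-dimensional representations. This is not the method of Andrew et al., which learns deep networks to maximize correlation; it only evaluates the quantity such a method would try to increase, and all names and numbers are made up.

```python
import math


def project(samples, weights):
    """Map each sample to a scalar with a fixed linear projection."""
    return [sum(w * x for w, x in zip(weights, sample)) for sample in samples]


def correlation(a, b):
    """Pearson correlation between two equal-length lists of scalars."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Paired samples: row i of each list describes the same underlying example.
audio_samples = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
video_samples = [[2.0], [4.0], [6.0]]

# Hypothetical projection weights, one set per modality.
audio_repr = project(audio_samples, [1.0, 0.0])
video_repr = project(video_samples, [1.0])

print(correlation(audio_repr, video_repr))
```

Each modality keeps its own projection and its own space; only the correlation term ties them together, which is what distinguishes a coordinated representation from a joint one.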

From a high level, distributed and deep representations are similar to joint and coordinated multimodal representations in that we can use multiple features at once to enhance the model’s explanatory capabilities; for example, we have multiple features from each modality, and these features improve the model by adding complementary or redundant information for inference. In addition, we can generate abstract factors by combining features from separate modalities.

Probabilistic models use latent random variables to explain the distribution of the observed data. In training model parameters, the goal is to maximize the likelihood of the training data. One example of how probabilistic models are used in joint representations is in Deep Learning for Robust Feature Generation in Audiovisual Emotion Recognition by Kim et al., where the audio and video modalities go through separate stacked restricted Boltzmann machines, then get mapped into a joint representation to recognize emotions. An example of a coordinated representation is in Deep Correspondence Restricted Boltzmann Machine for Cross-Modal Retrieval, in which Feng et al. model the image and text modalities with separate restricted Boltzmann machines and impose a correlation constraint as the coordination metric.

With reconstruction-based algorithms, the input is encoded into a feature vector representation, which can be decoded from feature space back into input space. One example of reconstruction-based algorithms in the multimodal domain is Ngiam et al.’s paper, Multimodal Deep Learning, where audio and video modalities go through separate autoencoders, followed by an autoencoder that combines them into a joint representation. In Cross-modal Retrieval with Correspondence Autoencoder, Feng et al. use a coordinated representation, training image and text modalities with autoencoders and imposing a correlation constraint.
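The encode/decode round trip can be sketched in a few lines. This is a toy linear autoencoder with hand-picked, hypothetical weights (nothing is trained here, and it is not taken from either paper); it just shows the input being compressed to a shorter code, decoded back, and scored by squared reconstruction error.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def reconstruct(x, encoder_weights, decoder_weights):
    """Encode x into feature space, then decode back into input space."""
    code = matvec(encoder_weights, x)
    x_hat = matvec(decoder_weights, code)
    return code, x_hat


def reconstruction_error(x, x_hat):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Hypothetical weights: 3-dimensional input, 1-dimensional code.
encoder_weights = [[0.5, 0.5, 0.0]]
decoder_weights = [[1.0], [1.0], [0.0]]

for x in ([2.0, 2.0, 0.0], [1.0, 3.0, 0.0]):
    code, x_hat = reconstruct(x, encoder_weights, decoder_weights)
    print(x, code, x_hat, reconstruction_error(x, x_hat))
```

Training an autoencoder amounts to adjusting the encoder and decoder weights so that this error is small across the training data.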

Manifold-learning approaches are based on the hypothesis that the distribution of real-world data in high dimensions is concentrated near low-dimensional manifolds. So with manifold-learning approaches, we can reduce the dimension of our input to a lower-dimensional space while retaining most of the explanatory information. Again, we can reduce the dimensionality of each modality individually, then map each modality into a joint representation, or we can keep the modalities separate but impose some constraint as in a coordinated representation.



I was in Seoul for a whirlwind week. In the daytime, I explored the city. In the evenings, I caught up on schoolwork. Seoul is a dynamic mix of old and new. The city is constantly evolving, whether due to Korea’s long history as a battlefield between China, Japan, and Russia, or due to gentrification and modernization.

As a tourist, I found the language barrier challenging. The public transportation had signs and announcements in English, but I found that most of the locals were unwilling to speak English or embarrassed by their command of the language. Most of the people who spoke English well had studied or worked abroad. Since I was in Korea, I tried to learn Korean. But it devolved into “hi,” “thank you,” and pointing at things. Once, it took 5 minutes to communicate our desired destination to a taxi driver.

There are a few coffeeshops on every block. How does the economy support so many coffeeshops? If there is so much supply, then there must be high demand. Someone told me that Chinese people drink tea because the water is dirty, so they need to boil the water. But Seoul’s water has historically been clean, so tea never really caught on. As a result, they primarily drink water and coffee.

Namdaemun Market

The market was the hub of our trip, as our buses tended to transfer near this stop. On weekends, the alleyways of the market are packed. We tried some lobster covered in cheese and noodles from the street vendors.

Bukchon Hanok Village

Bukchon Hanok Village is a residential neighborhood with traditional houses. The traditional style was once thought of as old and low-class, but the style has come back into vogue. Tourists in rented hanboks posed in front of the ornate doorways. People live in those houses, and there were signs posted requesting that people keep their voices down, as the residents are tired of the throngs of people hanging out right outside their homes.

I enjoyed seeing so many people dressed up in hanboks. Wearing a hanbok is not cultural appropriation; rather, it shows respect and appreciation for Korean traditions. The palaces even waive entrance fees for those in traditional garb.

Gyeongbokgung Palace

Gwanghwamun gate

Gyeongbokgung is the main palace in Seoul. Gyeongbokgung’s land area is only a tenth of its previous size, as buildings have been sold or destroyed throughout various battles, foreign occupations, and modernization.

Gyeongbokgung throne hall

The architecture and decor are rife with symbolism. The throne hall is visually centered. The hall is protected by the Chinese Four Symbols: the white tiger, black turtle, vermilion bird, and azure dragon. The steps that surround the throne room each have a symbol statue. One of the symbols is painted on each of the north, west, south, and east gates to the palace grounds. Finally, the mountains and river that surround Seoul each also represent one of the symbols.

Unlike traditional Chinese buildings, the Korean buildings have curved roofs. Also, the support pillars are of unequal heights, which makes them appear equal in height. If the pillars were actually the same height, an optical illusion would make them appear to differ.

Nearby is the Blue House, where the president lives. The building is painted white with a traditional roof in a vibrant blue color (most traditional buildings have green, black, or brown roofs). I would say that security is pretty lax in Seoul, but the Blue House was by far the most secure area. We weren’t even allowed to cross the street to stand outside the gate to the compound.

Also nearby is the Folk Museum, which documents the lifestyle of Koreans before modernization. Historically, there has been a lot of Chinese influence.

Changdeokgung Palace

Changdeokgung Secret Garden

Korea no longer has a state-sponsored royal family, but everyone I talked to spoke highly of their past royalty. Except for some familial infighting for succession, the emperors were described as hardworking. They studied history for hours each day, were advised by government officials organized into 18 levels of rankings, and worked themselves to early graves for the benefit of the people.

King Sejong is particularly well-regarded. He invented and promulgated the Korean alphabet, Hangul, which was a lot easier to learn than thousands of Chinese characters. Hangul is often cited as one of the few writing systems with a known creator and date of invention.

Historians recorded the history of the royal family. Unlike the Chinese emperor, who would read and edit the written history if desired, the Korean emperor did not change the recorded history. A member of the royal family fell off a horse, and the emperor requested that the historian not record this embarrassing incident. The historical record mentions both the horse incident and the emperor’s request.


Cheonggyecheon is a stream that cuts through the downtown core of the city, below street level. The stream is stocked with fish. Cranes and other birds wade in the shallows. The banks are covered in horsetail grass. It is an oasis of nature in the shadow of skyscrapers.

The stream has a long history. Originally, the stream was natural, flowing from the mountains. Later, it was paved over to make another road, but the road created noise and congestion. So the road was torn down and replaced with a straight, man-made stream fed by pipes. The flowing water cools the surrounding roads.

The bridges over the stream have interesting stories, too. One of the bridges has the gravestone of a previous queen embedded into it. The queen tried some political maneuvering in an attempt to leapfrog her own son onto the throne, so the firstborn and rightful heir hated her, and moved her gravestone to the bridge.

Gwangjang Market

Gwangjang Market was a burst of color. Stores lined the sides of the road, and food booths with seating lined the middle. Pennant flags of other countries were strung across the ceiling. All around me, I could see and smell delicious food. Live squid seemed like a popular dish.

Dongdaemun Design Plaza

The Dongdaemun Design Plaza, designed by Zaha Hadid, has a facade of curving metal plates. Inside, there are several design exhibitions. When we visited, Seoul Fashion Week was under way. Never before have I seen such a concentration of people taller than 6’4″. Waifish trendsetters in eye-catching clothes posed for photographers. Even the children were looking fly in sunglasses, fur jackets, and metal hardware.

I ate so much food in Seoul. I would say the food offerings and flavors match those of Korean restaurants in America (unlike many Chinese restaurants, which sweeten the food for American palates). They have the best fried chicken: the oil is light, the skin is crispy, and the meat is tender and juicy.


J and I hiked the Baegundae trail of Bukhansan, a mountain in north Seoul. At first the trail was made up of bricks, with the occasional car driving by. Then the trail narrowed into an endless staircase of white rock.

I described Bukhansan as a “non-trivial hike.” Someone in the hotel overheard me and said that phrase is an oxymoron. I packed light for our trip, so I did not have my hiking boots or poles. But all around me, the locals were fully decked out in sleek matching hiking sets, some with scarves tied around their necks. Basically, we shared the trail with a bunch of dignified, trendy, well-prepared, physically fit old people.

As I continued hiking, my knees became wobbly, and I wished I had purchased hiking poles from one of the numerous purveyors at the base of the mountain. The hike was tiring, but I took a simple pleasure in every step. The leaves were a beautiful red, the trees around me had the novelty of being distinctly Asian varieties, and the air was fresher than the smog of the city. Along steep sections, there was a thick metal rope for hikers to grab onto. The trees along the route had smooth bark where thousands of hikers had latched on for stability.

I was traveling minimally. I had a backpack full of water and snacks, and I’m a lightweight person. But J was having some trouble, so we took frequent breaks. I held his backpack, and it was unreasonably heavy. I wondered what he had packed, because I thought that I was carrying all the essentials for our hike.

“What’s in here?” I asked.

A laptop, an empty metal thermos, and a liter of Japanese lube.

Tough Mudder

I read some think pieces about obstacle course races. They ask, why are people willing to spend $100+ to put themselves through pain for a few hours? Some say that with modern, sterile, white-collar office jobs, people are out of tune with their bodies. They seek danger and excitement, and want to test their physical abilities. Anyway, I signed up because some computer science graduate students wanted to form a team, and I thought, “Sure, why not? I’ve never done one before.”

Before the race started, the MC gave a motivational speech. The gist of the speech was that Tough Mudder is not just a race; it’s a lifestyle and a commitment to fitness. We all took a knee, not just the people from my school, but everyone, because we were all on the same team. We would help each other, perfect strangers, get through the race. It felt rather cultish. They played the Star-Spangled Banner. A man covered in tattoos exuberantly signed during the anthem.

It was raining, so the muddy course was even muddier. In addition, our team had a late start time, so the course was beaten down and slippery from all the previous racers. Despite running most of the time, I was numb from cold.

For the first obstacle, we had to army crawl through mud under barbed wire. I thought this was a suitable choice for the first obstacle, because it made me have no qualms about getting dirty the rest of the race. On the downside, it became harder to recognize my teammates, since everyone was covered in mud from head to toe.

I would describe most obstacles as people pulling me up and over things. I was pulled over a vertical wall, a rope wall, a curved wall, a human ladder, and multiple elevated banks while wading in water. I would not have been able to complete the race on my own. Some obstacles were not really obstacles. For example, one “obstacle” was running up and down a hill several times. Another “obstacle” was swimming across part of a lake. My favorite obstacle was “Block Ness Monster,” which consisted of large rectangular prisms in water. We had to work together to rotate and go over the prisms.

After the race, I couldn’t wait to shower and never do that again. But I know that time will soften the memories of hypothermia with a nostalgic glow, and in a moment of idiocy I will sign up for another obstacle course race, because “Sure, why not? Obstacle course races are fun.”