This paper introduces Gradient-weighted Class Activation Mapping (Grad-CAM), a method for visualizing what a model bases its decisions on. It generalizes Class Activation Mapping (CAM) and removes CAM's architectural restrictions.
CAM requires that the final convolutional feature maps feed directly into the output layer (through global average pooling), because it reuses the weights between those feature maps and the outputs to build its visualization. The drawback is that the model cannot have additional fully-connected layers after the convolutional layers. For image classification tasks, having fully-connected layers at the end typically improves accuracy: the convolutional layers extract useful features, and the fully-connected layers learn combinations of these features for classification. Removing them to satisfy CAM's requirement would therefore tend to reduce model performance.
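To make the CAM requirement concrete, here is a minimal NumPy sketch. It assumes a model whose last convolutional layer is followed only by global average pooling and a single linear layer without a bias term. The names `cam_heatmap`, `feature_maps`, and `fc_weights` are illustrative, and the arrays are random stand-ins for a real network's activations and weights.

```python
import numpy as np

def cam_heatmap(feature_maps, fc_weights, class_idx):
    """Weighted sum of feature maps using the output-layer weights.

    feature_maps: (K, H, W) activations of the last convolutional layer.
    fc_weights:   (C, K) weights of the linear layer that follows
                  global average pooling (assumed architecture).
    """
    # Contract the K axis: sum_k w[class_idx, k] * feature_maps[k]
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

# Random stand-ins for a real model's activations and weights.
rng = np.random.default_rng(0)
feature_maps = rng.random((8, 7, 7))
fc_weights = rng.standard_normal((3, 8))

# Class scores of the assumed model: global average pooling, then linear.
pooled = feature_maps.mean(axis=(1, 2))
scores = fc_weights @ pooled

cam = cam_heatmap(feature_maps, fc_weights, class_idx=1)
```

In this setup the spatial mean of `cam` equals the score of the chosen class, which is why CAM can read the heatmap straight from the output weights. The same reading is not available once extra fully-connected layers sit between the feature maps and the outputs.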
With Grad-CAM, the model architecture does not have to be altered in order to be interpretable. Grad-CAM works as follows. First, the input image is forward-propagated through the model. Then, given the class of interest (for example, the true label or the predicted class), the score for that class is backpropagated to the final convolutional feature maps. The gradients are global-average-pooled over the spatial dimensions to give one weight per feature map. The feature maps are then combined in a weighted sum using these weights, and the result is passed through a ReLU to keep only the regions that contributed positively to the score for that class.
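The Grad-CAM steps (pool the gradients, weight the feature maps, apply a ReLU) can be sketched in NumPy as follows. This is an illustration, not the authors' implementation. In practice the gradients come from a framework's automatic differentiation; here a toy model (flattened feature maps followed by one fully-connected layer) is assumed so that the gradient can be written down by hand. The name `grad_cam` and all array names are hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Coarse Grad-CAM heatmap from activations and their gradients.

    feature_maps: (K, H, W) activations of the last convolutional layer.
    gradients:    (K, H, W) gradient of the class score with respect
                  to feature_maps.
    """
    # One weight per feature map: global average pool of the gradients.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the K axis.
    weighted = np.tensordot(alphas, feature_maps, axes=1)
    # ReLU keeps only regions with a positive influence on the class.
    return np.maximum(weighted, 0.0)

# Toy model (assumption): flatten the feature maps, then one linear layer.
rng = np.random.default_rng(0)
K, H, W, C = 4, 5, 5, 3
feature_maps = rng.random((K, H, W))
fc_weights = rng.standard_normal((C, K * H * W))

# For this toy model the gradient of score c with respect to the
# feature maps is simply row c of the weight matrix, reshaped.
target_class = 2
gradients = fc_weights[target_class].reshape(K, H, W)

heatmap = grad_cam(feature_maps, gradients)
```

The toy model has a fully-connected layer directly on the flattened feature maps, a layout CAM cannot handle, yet the heatmap is still well defined because only gradients are needed.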
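CAM and Grad-CAM can provide a heatmap to visualize what the model is looking at when it decides on a classification, but these heatmaps only highlight fairly coarse regions, since they have the spatial resolution of the final convolutional feature maps. The paper also proposes Guided Grad-CAM, which combines the Grad-CAM heatmap with Guided Backpropagation to produce a high-resolution visualization that highlights particular features of interest in an image.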
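As a rough sketch of how a coarse heatmap and a fine-grained map can be fused, the snippet below upsamples the heatmap to the input resolution and multiplies it element-wise with a Guided Backpropagation map. Two assumptions are made for brevity: nearest-neighbour upsampling is used in place of the smoother interpolation one would normally choose, and the Guided Backpropagation map is taken as a given input rather than computed. The function names are hypothetical.

```python
import numpy as np

def upsample_nearest(heatmap, factor):
    """Repeat each heatmap cell `factor` times along both axes."""
    return np.repeat(np.repeat(heatmap, factor, axis=0), factor, axis=1)

def guided_grad_cam(guided_backprop, heatmap):
    """Fuse a pixel-level map with a coarse class heatmap.

    guided_backprop: (H, W, 3) pixel-space map (assumed precomputed).
    heatmap:         (h, w) coarse Grad-CAM heatmap, with H and W
                     equal to h and w times the same integer factor.
    """
    factor = guided_backprop.shape[0] // heatmap.shape[0]
    upsampled = upsample_nearest(heatmap, factor)
    if upsampled.shape != guided_backprop.shape[:2]:
        raise ValueError("input size must be an integer multiple of the heatmap size")
    # Broadcast the (H, W) heatmap across the colour channels.
    return guided_backprop * upsampled[..., None]

# Tiny example: a 2x2 heatmap gating a 4x4 map of ones.
coarse = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
fine = np.ones((4, 4, 3))
fused = guided_grad_cam(fine, coarse)
```

Pixels under a zero heatmap cell are suppressed entirely, so the fused map keeps fine detail only where the class-specific heatmap is active.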
Grad-CAM can be used to identify bias in a dataset by showing which attributes the model focuses on when making its decisions. The example used in the paper is classifying pictures of people as doctors or nurses. Grad-CAM showed that the model was looking at the face and hair, revealing a gender bias, which the authors traced to the distribution of their training data. Identifying bias is especially important for a multi-modal problem because each modality has its own dataset, so bias can be introduced by each modality. These biases can add up, so the resulting model can be more biased than a unimodal model.