Summary of Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

This paper proposes the Pointing and Justification (PJ-X) architecture, which generates multimodal explanations.

The textual justifications are obtained by conditioning on the inputs (the image and question) and the answer. The answer is embedded and passed through an MLP. The question and image are likewise embedded and passed through an MLP. Then the answer feature and question-image feature are combined with element-wise multiplication, followed by signed square root and L2 normalization. This multimodal feature goes through another MLP, then is normalized to generate a Visual Pointing attention map. The Visual Pointing attention map is merged with an encoding of the question and answer, then fed into an LSTM decoder to generate the textual justification.
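The fusion step described above (element-wise multiplication, signed square root, L2 normalization) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, feature sizes, and random inputs are assumptions, and the spatial dimensions of the image feature are left out for brevity.

```python
import numpy as np

def fuse_features(answer_feat, qi_feat, eps=1e-12):
    """Fuse an answer feature with a question-image feature (hypothetical helper)."""
    fused = answer_feat * qi_feat                        # element-wise multiplication
    fused = np.sign(fused) * np.sqrt(np.abs(fused))      # signed square root
    norm = np.linalg.norm(fused, axis=-1, keepdims=True)
    return fused / (norm + eps)                          # L2 normalization per example

rng = np.random.default_rng(0)
answer_feat = rng.standard_normal((4, 8))  # assumed: batch of 4, feature size 8
qi_feat = rng.standard_normal((4, 8))      # assumed: same shape as the answer feature
fused = fuse_features(answer_feat, qi_feat)
```

Each row of `fused` ends up with unit L2 norm, which keeps the scale of the multimodal feature stable before it is passed to the next MLP.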

In the paper’s experiment, one model is trained on image descriptions, and another model is trained on answer explanations. The model using descriptions performed worse in all metrics than the model using explanations, because descriptions generically describe the scene in an image, whereas explanations focus on information that is task-specific, as shown in Figure 2. For visual question answering, the explanation focuses on evidence that is relevant to the question and answer. For activity recognition, the explanation focuses on evidence that a particular activity (such as juggling) is being performed.

Metrics such as BLEU and METEOR were used to show that the ground truth explanations and generated explanations were more similar for the model trained on explanations than the model trained on descriptions. It seems obvious that a model trained on explanations would be better at producing explanations that sound like explanations than a model trained on descriptions. It seems like the model trained on explanations could just be learning a language model to generate text that sounds like natural explanations, so it was not clear to me that the model was actually learning to focus on and be better at pointing to relevant evidence.

To prove that the model is indeed learning to explain, the model accuracy on the VQA task could be reported. Ignoring the justifications for the answers, if the model had higher accuracy than a common baseline, that would show that the model is learning something about finding relevant evidence for a question, and is not just overfitting on the explanations that it was trained on. The authors “freeze or finetune the weights of the answering model when training the multimodal explanation model.” To conduct this experiment, the model would have to be trained without freezing the weights.


Summary of Stacked Latent Attention for Multimodal Reasoning

This paper proposes Stacked Latent Attention, an improvement over standard attention.

The standard attention mechanism works as follows: the input is a set of vectors {v_i} and an input state h. The relative importance e_i of vector v_i to h is typically calculated with a two-layer perceptron, e_i = W_u σ(W_v v_i + W_h h). The attention weights are calculated by taking a softmax over the relative importances. The attention layer outputs a content vector z, the sum of the input vectors v_i weighted by the attention weights.
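The attention computation described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: σ is taken to be a sigmoid (as these notes describe it), W_u is a single weight vector `w_u` producing one score per input vector, and all shapes and random parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attention(v, h, W_v, W_h, w_u):
    """v: (n, d_v) input vectors, h: (d_h,) input state.
    W_v: (k, d_v), W_h: (k, d_h), w_u: (k,) are assumed parameter shapes."""
    s = sigmoid(v @ W_v.T + W_h @ h)  # (n, k) hidden representation per input vector
    e = s @ w_u                       # (n,) relative importance e_i
    alpha = softmax(e)                # (n,) attention weights
    z = alpha @ v                     # (d_v,) content vector: weighted sum of the v_i
    return z, alpha, s

rng = np.random.default_rng(0)
n, d_v, d_h, k = 6, 5, 4, 3
v = rng.standard_normal((n, d_v))
h = rng.standard_normal(d_h)
W_v = rng.standard_normal((k, d_v))
W_h = rng.standard_normal((k, d_h))
w_u = rng.standard_normal(k)
z, alpha, s = attention(v, h, W_v, W_h, w_u)
```

The function also returns the intermediate `s`, which a standard attention layer computes but does not pass on to later layers.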

A stacked attention model chains standard attention layers together, so that the input h of the current layer is the content vector z of the previous attention layer.

The standard attention mechanism suffers from 3 issues.

First, latent spatial information is discarded. When attention layers are stacked together, the input h of the current layer is the content vector z of the previous attention layer, but this content vector discards the spatial information s_i = σ(W_v v_i + W_h h) from the previous layer’s intermediate step, so the current layer must perform spatial reasoning from scratch.

Second, the content vector from the previous attention layer might be focused on the wrong position, so the current attention layer has to work with a bad input state.

Finally, each attention layer has a sigmoid activation function, so there can be a vanishing gradient problem.
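A quick numerical check shows why chained sigmoids can shrink gradients. The derivative of the sigmoid is at most 0.25 (attained at 0), so the product of activation derivatives along a path through n sigmoids is at most 0.25^n. The sketch below ignores the weight matrices on the path, which can scale the gradient up or down, so it illustrates the activation factor only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative peaks at x = 0.
max_grad = sigmoid_grad(0.0)

# Upper bound on the product of sigmoid derivatives through 1..5 stacked layers.
bounds = [max_grad ** depth for depth in range(1, 6)]
for depth, bound in enumerate(bounds, start=1):
    print(f"{depth} sigmoid(s) on the path: activation factor <= {bound:.6f}")
```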

The Stacked Latent Attention (SLA) model mitigates these issues by having a fully convolutional pathway for the spatial information {s_i} from input to final output. In each attention layer, once {s_i} is calculated, the model branches off. Attention is calculated in a separate path, then concatenated to the {s_i} to produce z. In this way, the spatial information is retained and there is no vanishing gradient problem.

Conventional stacked attention models did not perform better when more than two attention layers were stacked, because of the vanishing gradient issue. SLA performs better because at each attention layer, the attention is calculated in a separate path, so there is a pathway from input to output with no sigmoids. The authors provide two visualizations to support their claim. In Figure 5, for layers closer to the input, the SLA has greater variance in its gradient than the stacked attention model. The stacked attention model’s gradient distribution is more sharply peaked at zero. In Figure 6, the distribution of attention weights is plotted. The attention weights for SLA are larger, indicating a stronger supervisory signal. The difference is especially noticeable for the first attention layer, indicating that SLA is more effective at backpropagating the gradient.

If the hidden state size were increased, then I would expect more of the weights to be sparse, with values and gradients closer to zero. However, I would still expect the plots to show the same general trend of SLA mitigating the vanishing gradient issue.

Figure 4 shows only marginal improvement in accuracy when there are 3 stacked layers instead of 2. It would be interesting to see if the model accuracy continues to improve with 4 or more stacked layers.

In addition, the paper says that SLA “can be used as a direct replacement in any task utilizing attention,” so it would be interesting to see the effect of replacing attention with stacked latent attention in a popular unimodal model.


Summary of Learning to Reason: End-to-End Module Networks for Visual Question Answering

This paper proposes End-to-End Module Networks (N2NMNs). N2NMNs parse a question into subtasks and pick a relevant module for each subtask. The model learns both a suitable layout of modules to answer the question and the network parameters for each module.

The N2NMN is more interpretable than Multimodal Compact Bilinear Pooling (MCB) because visualizations are generated not just at the sentence level of the question, but also at the word level. For these word-level visualizations, one can see the module each word is paired with (such as find, filter, relocate). Thus, the module tells us what action the model is taking, and the word tells us which object the action is applied to, such as “find” (module) a “green matte ball” (word). The model picks a layout that applies these modules in some order, so we can visualize what the model is doing at each step and see at which step it went wrong. Alternatively, the visualizations might show that the model executed the subproblems flawlessly but chose a layout that was bad for the question. With MCB, we do not get these word-level, step-wise visualizations, and there is no interpretable information about either the layout or the modules/subproblems.

The authors use behavior cloning to train the model, which is arguably not necessary, but is used for practical reasons. Behavior cloning provides a good initialization of parameters. In all the paper’s experimental results (for CLEVR, VQA, and SHAPES), the model with policy search after behavior cloning outperformed the model with policy search from scratch. I would argue, though, that policy search after behavior cloning does not necessarily outperform a model trained from scratch. One famous example is AlphaGo Zero, which was trained purely by self-play reinforcement learning without human game data and outperforms the original AlphaGo, which was initialized with supervised learning on human expert games. When I think of the multidimensional space of model parameters with respect to the loss function, behavior cloning initializes the model parameters near a good local extremum. But theoretically, with enough random parameter initialization trials, the model trained from scratch could end up at a better local extremum than the model initialized with behavior cloning. However, the parameter search space is huge, so it could be impractical in time or cost to try out sufficiently many random initializations.

Fine-tuning with end-to-end training is needed because the expert policy might not be optimal. After using behavior cloning, further training is needed to make the model better at learning suitable layouts.


Summary of Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Previous approaches to combining two modalities were to concatenate them, perform some element-wise operation, or take an outer product. The advantage of an outer product over the other two is that it captures more of the interactions between the two modalities, because it multiplies each element in one vector with each element in the other. However, an outer product produces a very high-dimensional representation, so a model that uses it needs a large number of parameters and lots of memory and computation, and the parameter count can even become intractable.

Multimodal Compact Bilinear Pooling (MCB) approximates an outer product by randomly projecting each modality’s vector into a higher dimensional space, then convolving the vectors together. In this way, MCB is more computationally tractable than an outer product, while still capturing the multiplicative interactions between modalities.
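The projection-then-convolution idea can be sketched with Count Sketch projections and an FFT-based circular convolution. This is a minimal illustration under assumptions: the helper names, the tiny dimensions, and the random inputs are hypothetical, and real MCB layers use far larger sizes.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Random hash bucket and random sign for each of the n input indices."""
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s

def count_sketch(x, h, s, d):
    """Project x into d dimensions: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # accumulates correctly when several indices share a bucket
    return y

def mcb(x, q, params_x, params_q, d):
    """Circular convolution of the two sketches, computed in the Fourier domain."""
    sx = count_sketch(x, *params_x, d)
    sq = count_sketch(q, *params_q, d)
    return np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sq)))

rng = np.random.default_rng(0)
n, d = 6, 16                      # assumed toy sizes
x = rng.standard_normal(n)        # stand-in for an image feature
q = rng.standard_normal(n)        # stand-in for a question feature
params_x = make_sketch_params(n, d, rng)
params_q = make_sketch_params(n, d, rng)
fused = mcb(x, q, params_x, params_q, d)
```

The result has d entries rather than the n × n entries of the explicit outer product. It equals a Count Sketch of that outer product, which is why the multiplicative interactions are preserved in compressed form.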

One problem with MCB is its interpretability, but I can think of a way to better understand what the MCB model is doing. We could add a ReLU layer after each bounding box’s final CNN feature map. In the Gao et al. (2016) paper that this paper cites, it is shown that we can back-propagate through a compact bilinear pooling layer. That means for the MCB grounding architecture, we should be able to apply Grad-CAM. So, we can provide the question, image, and possible bounding boxes as input and forward propagate them through the model. Then, given the correct bounding box, we can backpropagate that through the model until we reach the final CNN feature maps. Now, unlike in the Grad-CAM paper, there is not one final CNN feature map, but many: one for each of the bounding box options. For each possible bounding box, we pool the gradients over each feature map to get per-channel weights, take the weighted sum of the feature maps, and pass the result through a ReLU. In this way, for each bounding box option, we can see which image regions contributed to the grounding model’s decision.
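The per-box Grad-CAM computation proposed above can be sketched as follows. This only illustrates the pooling and ReLU steps: the activations and gradients are random placeholders standing in for values that a real forward and backward pass through the grounding model would produce, and all shapes are assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (K, H, W) feature maps and the gradients of the
    chosen score with respect to them. Returns an (H, W) heatmap."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pool the gradients
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over the K channels
    return np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only

rng = np.random.default_rng(0)
num_boxes, K, H, W = 3, 4, 5, 5                        # assumed toy sizes
acts = rng.standard_normal((num_boxes, K, H, W))       # placeholder feature maps per box
grads = rng.standard_normal((num_boxes, K, H, W))      # placeholder gradients per box

# One heatmap per candidate bounding box.
heatmaps = np.stack([grad_cam(a, g) for a, g in zip(acts, grads)])
```

In practice each heatmap would be upsampled to the size of its bounding box before being overlaid on the image.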


Summary of On Deep Multi-View Representation Learning: Objectives and Optimization

This paper proposes deep canonically correlated autoencoders (DCCAE). The objective of DCCAE contains both a term for canonical correlation and a term for reconstruction errors of the autoencoders. The canonical correlation term in the objective encourages learning information shared across modalities, while the reconstruction term encourages learning unimodal information. The merits of combining the objectives of canonical correlation and reconstruction errors are two-fold. First, there can be a trade-off parameter that weighs learning shared information vs. learning unimodal information. As shown in the paper, for some problems, minimizing reconstruction error is important, such as in the adjective-noun word embedding task, while in other problems it is not important, such as in the noisy MNIST and speech recognition experiments. Second, the reconstruction error objective acts as a regularization term, which can increase model performance since Canonical Correlation Analysis (CCA) can be prone to overfitting.
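The two-term objective can be sketched numerically as a negative total canonical correlation between the two views' features plus a weighted reconstruction error. This is a simplified sketch in the spirit of the objective described above, not the authors' implementation: the function names, the small regularizer `reg`, the trade-off weight `lam`, and the random data are all assumptions, and the networks that would produce the features and reconstructions are omitted.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def total_correlation(F, G, reg=1e-6):
    """Sum of canonical correlations between features F and G, each (N, k)."""
    N = F.shape[0]
    F = F - F.mean(axis=0)
    G = G - G.mean(axis=0)
    S11 = F.T @ F / (N - 1) + reg * np.eye(F.shape[1])
    S22 = G.T @ G / (N - 1) + reg * np.eye(G.shape[1])
    S12 = F.T @ G / (N - 1)
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()

def dccae_objective(F, G, X, X_rec, Y, Y_rec, lam):
    """Negative correlation plus lam times the mean squared reconstruction error."""
    recon = (np.mean(np.sum((X - X_rec) ** 2, axis=1))
             + np.mean(np.sum((Y - Y_rec) ** 2, axis=1)))
    return -total_correlation(F, G) + lam * recon

rng = np.random.default_rng(0)
N, k = 200, 3
F = rng.standard_normal((N, k))        # placeholder features for view 1
G = rng.standard_normal((N, k))        # placeholder features for view 2
X = rng.standard_normal((N, 5))        # placeholder inputs for view 1
Y = rng.standard_normal((N, 5))        # placeholder inputs for view 2
X_rec, Y_rec = X.copy(), Y.copy()      # pretend the autoencoders reconstruct perfectly
same_view = total_correlation(F, F)    # identical views: close to k
loss = dccae_objective(F, G, X, X_rec, Y, Y_rec, lam=0.1)
```

Setting `lam` to zero leaves only the correlation term, while a large `lam` makes reconstruction dominate, which is the trade-off between shared and unimodal information.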

The authors claim that the uncorrelated feature constraint is necessary, and a comparison of two models supports this. The authors’ DCCAE model and correlated autoencoders (CorrAE) model have the same objective function, except that CorrAE drops DCCAE’s constraint that the features be uncorrelated. In an experiment of classifying noisy MNIST digits, DCCAE significantly outperformed CorrAE, with clustering accuracy, normalized mutual information of the clusters, and classification error rate of 97.5, 93.4, and 2.2 to CorrAE’s 65.5, 67.2, and 12.9, respectively. In addition, a t-SNE visualization showed that DCCAE gave a cleaner separation of different digit clusters than CorrAE. In the speech recognition experiment, DCCAE had a lower error rate than CorrAE (mean phone error rates of 24.5 vs. 30.6, respectively). Finally, in the multilingual word embedding experiment, DCCAE’s embedding had higher Spearman’s correlation than CorrAE’s.

Uncorrelatedness between different feature dimensions is important so that the features can contribute complementary information. From a statistics perspective, uncorrelated features carry less redundant (linearly predictable) information about one another, and in the stronger case where variables are independent of each other, the information they provide is additive. Also, having disentangled features makes them easier to interpret.