Summary of Learning to Reason: End-to-End Module Networks for Visual Question Answering

This paper proposes End-to-End Module Networks (N2NMNs). An N2NMN parses a question into subtasks and picks a relevant module for each subtask. The model jointly learns to pick a suitable layout of modules for answering the question and to train the network parameters of each module.
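As a toy illustration of the layout-as-program idea (this is not the paper's implementation; the scene encoding, module signatures, and layout format below are invented), a layout can be viewed as a small program that composes reusable modules:

```python
# Toy sketch of the module-network idea. A "scene" is a list of objects
# with attributes; in the real model, modules are small neural networks
# operating on image features and attention maps.
scene = [
    {"color": "green", "material": "matte", "shape": "ball"},
    {"color": "red", "material": "metal", "shape": "cube"},
    {"color": "green", "material": "metal", "shape": "cylinder"},
]

# Modules produce or transform an "attention" (boolean mask over objects),
# or turn an attention into an answer.
def find(attr):
    return [attr in obj.values() for obj in scene]

def filter_att(attr, att):
    return [a and attr in obj.values() for a, obj in zip(att, scene)]

def count(att):
    return sum(att)

# A layout is a nested tuple; executing it composes the modules.
def execute(layout):
    op, *args = layout
    if op == "find":
        return find(args[0])
    if op == "filter":
        return filter_att(args[0], execute(args[1]))
    if op == "count":
        return count(execute(args[0]))

# "How many green things are there?" -> count(find[green])
print(execute(("count", ("find", "green"))))  # -> 2
```

Because the layout is an explicit program, each intermediate attention can be inspected, which is the source of the interpretability discussed below.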

The N2NMN is more interpretable than Multimodal Compact Bilinear Pooling (MCB) because visualizations are generated not just at the sentence level of the question, but also at the word level. For these word-level visualizations, one can see the module each word is paired with (such as find, filter, or relocate). Thus we know, from the module, what action the model is taking, and from the words, what object that action is applied to: for example, "find" (module) a "green matte ball" (words). The model picks a layout that applies these modules in some order, so we can visualize what the model is doing at each step and see at which step it went wrong. Alternatively, the visualizations might show that the model executed the subproblems flawlessly but that the layout it chose was ill-suited to the question. With MCB, we get no word-level, step-wise visualizations, and there is no interpretable information about a layout or modules/subproblems.

The authors use behavioral cloning to train the model, which is arguably not necessary but is used for practical reasons: it provides a good initialization of the parameters. In all of the paper's experiments (CLEVR, VQA, and SHAPES), policy search after behavioral cloning outperformed policy search from scratch. I would argue, though, that policy search after behavioral cloning does not necessarily outperform a model trained from scratch. One famous example is AlphaGo Zero, trained without human expert data, which outperforms the supervised AlphaGo. Thinking of the high-dimensional parameter space with respect to the loss function, behavioral cloning initializes the parameters near a good local optimum. But in principle, with enough random initialization trials, a model trained from scratch could end up at a better local optimum than one initialized with behavioral cloning. The search space is huge, however, so trying sufficiently many random initializations could be impractical in time or cost.

Fine-tuning with end-to-end training is needed because the expert policy might not be optimal. After using behavior cloning, further training is needed to make the model better at learning suitable layouts.


Summary of Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Previous approaches to combining two modalities include concatenating them, applying an element-wise operation, or taking an outer product. The advantage of the outer product over the other two is that it captures more of the interactions between the modalities, since it multiplies every element of one vector with every element of the other. However, an outer product yields a huge number of features, so a model built on it requires a great deal of memory and computation, and the parameter count can even become intractable.
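These trade-offs can be sketched numerically (toy 4-dimensional vectors stand in for the roughly 2048-dimensional features used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)

concat = np.concatenate([x, y])       # n + m features, no cross terms
elementwise = x * y                   # n features, only aligned pairs interact
outer = np.outer(x, y).ravel()        # n * m features, every x_i * y_j pair

print(concat.shape, elementwise.shape, outer.shape)  # (8,) (4,) (16,)

# A linear layer on top needs one weight per feature per output unit, so
# with 2048-dim features the outer product costs 2048*2048 (about 4.2M)
# weights per output unit, versus 4096 for concatenation -- hence the
# memory and computation problem.
```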

Multimodal Compact Bilinear Pooling (MCB) approximates the outer product by randomly projecting each modality's vector into a higher-dimensional space (via Count Sketch) and then convolving the projected vectors together. In this way, MCB is far more computationally tractable than an explicit outer product while still capturing multiplicative interactions between the modalities.
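A minimal sketch of the approximation, assuming the standard Count Sketch construction with the convolution carried out in the Fourier domain (the dimensions and hash setup below are toy choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 32          # input dim and sketch dim (toy; MCB uses ~16000)

def count_sketch(v, h, s, d):
    # Project v into d dims: coordinate i goes to bucket h[i] with sign s[i].
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

# Independent hash functions and signs for each modality.
h1, h2 = rng.integers(0, d, n), rng.integers(0, d, n)
s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)

x, y = rng.standard_normal(n), rng.standard_normal(n)

# MCB: sketch each modality, then circularly convolve the sketches via FFT.
mcb = np.fft.ifft(np.fft.fft(count_sketch(x, h1, s1, d)) *
                  np.fft.fft(count_sketch(y, h2, s2, d))).real

# Sanity check: this equals the count sketch of the full outer product
# under the combined hash (h1[i]+h2[j]) mod d and sign s1[i]*s2[j].
outer = np.outer(x, y).ravel()
h12 = (h1[:, None] + h2[None, :]).ravel() % d
s12 = (s1[:, None] * s2[None, :]).ravel()
print(np.allclose(mcb, count_sketch(outer, h12, s12, d)))  # True
```

The key point is that the d-dimensional result is computed without ever materializing the n*n outer product.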

One problem with MCB is its interpretability, but I can think of a way to better understand what the MCB model is doing. In the Gao et al. (2016) paper that this paper cites, it is shown that we can back-propagate through a compact bilinear pooling layer. That means that for the MCB grounding architecture, we should be able to apply Grad-CAM. We provide the question, image, and candidate bounding boxes as input and forward-propagate them through the model. Then, given the correct bounding box, we back-propagate until we reach the final CNN feature maps. Unlike in the Grad-CAM paper, there is not one final CNN feature map but many, one for each bounding box option. For each candidate box, we pool the gradients of its feature maps, take a weighted sum of the maps, and pass the result through a ReLU. In this way, for each bounding box option, we can see which pixels contributed to the grounding model's decision.


Summary of On Deep Multi-View Representation Learning: Objectives and Optimization

This paper proposes deep canonically correlated autoencoders (DCCAE). The objective of DCCAE contains both a term for canonical correlation and a term for the reconstruction errors of the autoencoders. The canonical correlation term encourages learning information shared across modalities, while the reconstruction term encourages learning unimodal information. The merits of combining the two objectives are two-fold. First, a trade-off parameter can weigh learning shared information against learning unimodal information. As shown in the paper, minimizing reconstruction error matters for some problems, such as the multilingual word embedding task, but not for others, such as the noisy MNIST and speech recognition experiments. Second, the reconstruction error objective acts as a regularization term, which can increase model performance, since Canonical Correlation Analysis (CCA) can be prone to overfitting.
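As I understand the objective from the paper (notation mine: f and g are the two encoders, p and q the decoders, U and V the CCA projection matrices, and λ the trade-off parameter), it can be written as:

```latex
\min_{W_f, W_g, W_p, W_q, U, V}\;
  -\frac{1}{N}\,\operatorname{tr}\!\bigl(U^{\top} f(X)\, g(Y)^{\top} V\bigr)
  \;+\; \frac{\lambda}{N} \sum_{i=1}^{N}
      \Bigl( \lVert x_i - p(f(x_i)) \rVert^{2}
           + \lVert y_i - q(g(y_i)) \rVert^{2} \Bigr)
```

subject to whitening constraints of the form U^T f(X) f(X)^T U / N = I. The first term is the (negative) canonical correlation between the projected bottleneck features, the second is the sum of the two autoencoders' reconstruction errors, and λ trades the two off.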

The authors claim that the uncorrelated-feature constraint is necessary, and results from two models support this. The DCCAE and correlated autoencoders (CorrAE) models have the same objective function, except that CorrAE relaxes DCCAE's constraint that feature dimensions be uncorrelated. In classifying noisy MNIST digits, DCCAE significantly outperformed CorrAE, with clustering accuracy, normalized mutual information, and classification error rate of 97.5, 93.4, and 2.2, versus CorrAE's 65.5, 67.2, and 12.9, respectively. A t-SNE visualization also showed that DCCAE separates the digit clusters more cleanly than CorrAE. In the speech recognition experiment, DCCAE had a lower mean phone error rate than CorrAE (24.5 vs. 30.6). Finally, in the multilingual word embedding experiment, DCCAE's embeddings achieved a higher Spearman's correlation than CorrAE's.

Uncorrelatedness between feature dimensions is important so that each dimension contributes complementary information. From a statistics perspective, when variables are uncorrelated, each adds information that the others do not already carry, so their contributions add up rather than overlap. Disentangled features are also easier to interpret.


Summary of StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN synthesizes photo-realistic images from text descriptions in two stages. The first stage generates a low-resolution sketch with colors and basic shapes, then the second stage adds details and produces a higher resolution image.

The paper introduces a Conditioning Augmentation technique to prevent mode collapse. Mode collapse happens when the data distribution has several modes but the generator represents only one, so the generator's outputs are all the same or all share similar attributes. Mode collapse can arise when the generator fools the discriminator by producing output near one mode; the discriminator then learns to flag output near that mode as fake, so the generator moves to a different mode, and the two chase each other from mode to mode. Conditioning Augmentation alleviates this by encouraging smoothness in the latent conditioning manifold. Smoothness discourages peakiness and discrete clusters, so the generator cannot fool the discriminator simply by hopping between modes, because the conditioning distribution is not concentrated into clusters.
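A minimal sketch of what Conditioning Augmentation computes, assuming the usual Gaussian reparameterization with a KL penalty toward N(0, I) (the linear layers, dimensions, and random inputs below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Instead of feeding the text embedding t directly to the generator,
# sample a latent c from N(mu(t), diag(sigma(t)^2)). mu and logvar come
# from a made-up linear layer here.
t = rng.standard_normal(128)                       # text embedding
W_mu = rng.standard_normal((16, 128)) * 0.01
W_lv = rng.standard_normal((16, 128)) * 0.01
mu, logvar = W_mu @ t, W_lv @ t

# Reparameterization: c = mu + sigma * eps keeps sampling differentiable.
eps = rng.standard_normal(16)
c = mu + np.exp(0.5 * logvar) * eps

# KL(N(mu, sigma^2) || N(0, I)) -- the regularizer that smooths the
# conditioning manifold instead of letting it collapse onto a few points.
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
print(c.shape, kl >= 0)  # -> (16,) True
```

The sampling adds small perturbations around each text embedding, so nearby points on the conditioning manifold must all produce plausible images.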

The authors evaluate their model in comparison to other models by using inception score and human rankings. There are some flaws to using inception score. For one thing, inception score cannot detect mode collapse.

We want a variety in what the GAN produces, and we want what is produced to be high quality. Inception score supposedly captures these two attributes that we want in a GAN, and is calculated as below:

IS(G) = exp( E_{x ~ p_g} [ KL( p(y|x) ‖ p(y) ) ] )

KL divergence measures how much the p(y|x) probability distribution differs from the p(y) probability distribution. Here, y is the class label predicted by a pretrained Inception model (not the input text sentence): p(y|x) is the label distribution predicted for a generated image x, and p(y) is the marginal label distribution over all generated images. Ideally, the p(y|x) distribution should have low entropy, because high-quality images should be confidently recognizable. And ideally, the p(y) distribution should be uniform (high entropy), because we want the generated images to have a large variety; that is, we do not want the GAN to favor particular classes. A performant GAN therefore has a high expected KL divergence, because p(y|x) has low entropy and p(y) has high entropy, and hence a high inception score.

However, inception score only sees predicted labels, so it misses collapse within classes. If the generator memorizes and emits a single image per class, p(y|x) stays sharp and p(y) stays uniform, exactly as for a performant GAN, so the score remains high even though the model produces almost no variety.
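The score can be computed directly from the matrix of predicted label distributions. A small sketch (my own toy implementation, not an official one) shows it is maximized by any generator whose sharp predictions cover the classes, which is exactly why one memorized image per class goes unnoticed:

```python
import numpy as np

def inception_score(probs):
    # probs: (N, K) matrix, row i is p(y|x_i) over K classes.
    p_y = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return np.exp(kl.mean())

K = 10
# Sharp and class-covering: each image confidently a different class.
sharp_diverse = np.eye(K)
# Full collapse: every image gets the same confident prediction.
collapsed = np.tile(np.eye(K)[0], (K, 1))
print(inception_score(sharp_diverse))  # ~K, the maximum
print(inception_score(collapsed))      # ~1, the minimum
```

Note that `sharp_diverse` would score ~K whether each row came from a distinct image or from one memorized image per class; the score cannot tell the difference.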

Apart from metrics discussed in the paper, I can think of alternative evaluation methods to measuring the relevance of the generated image to the input text:

  1. For each model (i.e., StackGAN, StackGAN Stage-I only, and other models such as GAN-INT-CLS), run the generated image through an image captioning model. Measure the distance of the caption to the original sentence with a translation metric such as BLEU. A performant GAN's caption ought to have a lower distance to the original sentence.
  2. For each model, run the generated image through an object detector. Take the objects from the original sentence and see the confidence of each of those objects in the generated images. For each object in the original sentence, the object detector ought to have high confidence that the object is in the generated image.
  3. For each model, look at the k nearest neighbors of the generated image. Run those neighbors through an image captioning model or object detector, as in 1 and 2. A performant GAN's nearest neighbors ought to have captions with a lower distance to the original sentence, and high detector confidence that the sentence's objects appear in the neighbor images.
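A toy version of idea 1 can be sketched as follows; the caption is hard-coded where a real setup would call an off-the-shelf captioner, and plain unigram precision stands in for a proper BLEU implementation:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # Fraction of candidate tokens that also appear in the reference
    # (clipped counts) -- a crude stand-in for BLEU-1.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

original = "a small yellow bird with black wings"
caption = "a yellow bird with dark wings"   # pretend captioner output
print(round(unigram_precision(caption, original), 2))  # -> 0.83
```

A model whose generated images better match the input text should, on average, receive captions with higher overlap scores.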


Summary of Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

This paper discusses Gradient-weighted Class Activation Mapping (Grad-CAM), a method of visualizing a model that is superior to Class Activation Mapping (CAM).

CAM requires that the final convolutional feature maps be connected directly to the outputs (via global average pooling), because CAM reuses the weights between those feature maps and the outputs. So CAM has the drawback that the model cannot have fully-connected layers after the convolutions. For image classification tasks, fully-connected layers at the end typically improve accuracy: the convolutions extract useful features, and the fully-connected layers learn combinations of those features for classification. Restructuring a model to satisfy CAM's requirement would therefore cost performance.

With Grad-CAM, the model architecture does not have to be altered to be interpretable. Grad-CAM works as follows: first, the input image is forward-propagated through the model. Then, for the true label (or any desired output), we back-propagate until we reach the final CNN feature maps. The gradients are global-average-pooled into one weight per feature map, the feature maps are combined in a weighted sum, and the result is passed through a ReLU to keep only the regions that contributed positively to the classifier's decision.
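The pooling-and-ReLU step can be sketched as follows, assuming the activations and gradients of the final conv layer are already available from the forward and backward passes (random arrays stand in for them here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for what a real forward/backward pass would produce for the
# chosen class score:
#   acts:  final conv feature maps, shape (C, H, W)
#   grads: gradients of the class score w.r.t. those maps, same shape
C, H, W = 4, 7, 7
acts = rng.standard_normal((C, H, W))
grads = rng.standard_normal((C, H, W))

# 1. Global-average-pool the gradients: one importance weight per channel.
weights = grads.mean(axis=(1, 2))                 # shape (C,)

# 2. Weighted sum of the feature maps, then ReLU to keep only regions
#    with a positive influence on the class score.
cam = np.maximum((weights[:, None, None] * acts).sum(axis=0), 0.0)
print(cam.shape, (cam >= 0).all())  # -> (7, 7) True
```

The resulting (H, W) map is then upsampled to the input resolution and overlaid on the image as the familiar heatmap.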

CAM and Grad-CAM provide a heatmap visualizing what the model is looking at when it decides on a classification, but these heatmaps are fairly coarse regions. The paper also proposes Guided Grad-CAM, a high-resolution visualization that highlights particular features of interest in an image.

Grad-CAM can be used to identify bias in a dataset by showing which attributes the model focuses on when making its decisions. The example in the paper is classifying pictures of people as doctors or nurses. Grad-CAM showed that the model was looking at face and hair, revealing a gender bias, which the authors then confirmed in the distribution of their training data. Identifying bias is especially important in a multi-modal problem, because each modality comes with its own dataset and can introduce its own bias; these biases can compound, so the resulting model may be more biased than a unimodal model.