Summary of Learning to Reason: End-to-End Module Networks for Visual Question Answering

This paper proposes End-to-End Module Networks (N2NMNs). An N2NMN parses a question into subtasks and picks a relevant module for each subtask. The model learns both a suitable layout of modules to answer the question and the network parameters of each module.

The N2NMN is more interpretable than Multimodal Compact Bilinear Pooling (MCB) because visualizations are generated not just at the sentence level of the question, but also at the word level. For these word-level visualizations, one can see the module each word is paired with (such as find, filter, relocate). Thus, we know what action the model is taking from the module, and the object that this action is applied to from the words, such as “find” (module) a “green matte ball” (words). The model picks a layout that applies these modules in some order, so we can visualize what the model is doing at each step and see at which step it went wrong. Alternatively, the visualizations may show that the model executed every subproblem flawlessly, and that the real issue was a layout poorly suited to the question. With MCB, we do not get these word-level, step-wise visualizations, and there is no interpretable information about the layout or the modules/subproblems.

The authors use behavior cloning to train the model, which is arguably not strictly necessary but is used for practical reasons: it provides a good initialization of the parameters. In all of the paper’s experimental results (for CLEVR, VQA, and SHAPES), the model with policy search after behavior cloning outperformed the model with policy search from scratch. I would argue, though, that policy search after behavior cloning does not necessarily outperform a model trained from scratch. One famous example is AlphaGo Zero, which learned purely through self-play reinforcement learning without human game data, yet outperforms the original AlphaGo, whose policy was first trained with supervision on human expert games. When I picture the loss surface over the multidimensional space of model parameters, behavior cloning initializes the parameters near a good local optimum. But in theory, with enough random initialization trials, the model trained from scratch could end up at a better local optimum than the model initialized with behavior cloning. However, the parameter search space is huge, so trying sufficiently many random initializations could be impractical in time or cost.

Fine-tuning with end-to-end training is needed because the expert policy might not be optimal. After using behavior cloning, further training is needed to make the model better at learning suitable layouts.


Summary of Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Previous approaches to combining two modalities were to concatenate the modalities, perform some element-wise operation, or do an outer product. The advantage of an outer product over the other two is that an outer product captures more of the interactions between the two modalities, because outer product multiplies each element in one vector with each element in the other vector(s). However, an outer product requires a large number of parameters, so a model with outer product requires lots of memory and computation, and the number of parameters could even be intractable.
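To make the parameter blow-up concrete, here is a small arithmetic sketch. The feature sizes and the number of answer classes are hypothetical values chosen only for illustration, and the count covers a single linear classifier placed on top of each fused vector.

```python
# Hypothetical sizes, chosen only for illustration.
image_dim = 2048      # length of the image feature vector
text_dim = 2048       # length of the question feature vector
num_answers = 3000    # number of output classes

# Length of the fused vector under each combination strategy.
concat_dim = image_dim + text_dim      # concatenation
elementwise_dim = image_dim            # element-wise sum or product (needs equal lengths)
outer_dim = image_dim * text_dim       # flattened outer product

# Weights of one linear classifier on top of each fused vector (bias ignored).
concat_params = concat_dim * num_answers
elementwise_params = elementwise_dim * num_answers
outer_params = outer_dim * num_answers

print(f"concatenation: {concat_params:,} parameters")
print(f"element-wise:  {elementwise_params:,} parameters")
print(f"outer product: {outer_params:,} parameters")
```

With these assumed sizes, the outer product needs billions of weights where the other two need millions.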

Multimodal Compact Bilinear Pooling (MCB) approximates an outer product by randomly projecting each modality’s vector into a higher dimensional space, then convolving the vectors together. In this way, MCB is more computationally tractable than an outer product, while still capturing the multiplicative interactions between modalities.
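A minimal sketch of this idea, assuming the random projection is a Count Sketch and the convolution is carried out as an element-wise product in FFT space. The function names and dimensions are my own illustrative choices, not the authors’ code. The helper `sketch_of_outer` is there only to check the key property: convolving the two sketches gives the same result as sketching the full outer product, without ever building it.

```python
import numpy as np

def make_sketch(input_dim, sketch_dim, rng):
    """Draw the random bucket indices and signs that define one Count Sketch."""
    buckets = rng.integers(0, sketch_dim, size=input_dim)
    signs = rng.choice([-1.0, 1.0], size=input_dim)
    return buckets, signs

def count_sketch(x, buckets, signs, sketch_dim):
    """Project x to sketch_dim by adding each signed entry into its bucket."""
    out = np.zeros(sketch_dim)
    np.add.at(out, buckets, signs * x)  # unbuffered add handles repeated buckets
    return out

def compact_bilinear(x, y, sketch_x, sketch_y, sketch_dim):
    """Circularly convolve the two sketches via a product in FFT space."""
    fx = np.fft.rfft(count_sketch(x, *sketch_x, sketch_dim))
    fy = np.fft.rfft(count_sketch(y, *sketch_y, sketch_dim))
    return np.fft.irfft(fx * fy, n=sketch_dim)

def sketch_of_outer(x, y, sketch_x, sketch_y, sketch_dim):
    """Reference: Count Sketch of the explicit outer product (for checking only)."""
    (bx, sx), (by, sy) = sketch_x, sketch_y
    out = np.zeros(sketch_dim)
    buckets = (bx[:, None] + by[None, :]) % sketch_dim  # combined bucket per pair
    np.add.at(out, buckets, np.outer(sx * x, sy * y))   # combined sign per pair
    return out

rng = np.random.default_rng(0)
x, y = rng.standard_normal(6), rng.standard_normal(5)
sketch_x, sketch_y = make_sketch(6, 16, rng), make_sketch(5, 16, rng)
fused = compact_bilinear(x, y, sketch_x, sketch_y, 16)
print(np.allclose(fused, sketch_of_outer(x, y, sketch_x, sketch_y, 16)))
```

The fused vector has length `sketch_dim` regardless of the two input lengths, which is where the savings over the explicit outer product come from.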

One problem with MCB is its interpretability, but I can think of a way to better understand what the MCB model is doing, using Grad-CAM. Gao et al. (2016), which this paper cites, show that we can back-propagate through a compact bilinear pooling layer, so we should be able to apply Grad-CAM to the MCB grounding architecture. We provide the question, image, and candidate bounding boxes as input and forward propagate them through the model. Then, given the correct bounding box, we backpropagate its score through the model until we reach the final CNN feature maps. Unlike in the Grad-CAM paper, there is not one final set of CNN feature maps but many, one for each candidate bounding box. For each candidate bounding box, we pool the gradients of its feature maps, use the pooled gradients to weight those feature maps, and pass the weighted sum through a ReLU. In this way, for each bounding box option, we can see which image regions contributed to the grounding model’s decision.
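The pooling-and-ReLU step of this proposal could be sketched as follows. This is only the Grad-CAM arithmetic applied once per bounding box. The feature maps and their gradients are assumed to have been obtained already from an autodiff framework, and the function names are hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map for one bounding box.

    feature_maps, gradients: arrays of shape (channels, height, width), where
    gradients holds d(score of the correct box) / d(feature_maps).
    """
    weights = gradients.mean(axis=(1, 2))              # pool gradients per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels
    return np.maximum(cam, 0.0)                        # ReLU keeps positive evidence

def grad_cam_per_box(box_feature_maps, box_gradients):
    """Apply Grad-CAM separately to every candidate bounding box."""
    return [grad_cam(a, g) for a, g in zip(box_feature_maps, box_gradients)]

# Toy inputs: two boxes, each with 2 channels of 3x3 features.
maps = np.ones((2, 3, 3))
grads_a = np.stack([np.full((3, 3), 2.0), np.full((3, 3), 1.0)])
grads_b = np.stack([np.full((3, 3), 1.0), np.full((3, 3), -3.0)])
heat_maps = grad_cam_per_box([maps, maps], [grads_a, grads_b])
print(heat_maps[0].shape, heat_maps[1].max())
```

Each heat map has the spatial size of the feature maps, so it would still need to be upsampled to the bounding box size before overlaying it on the image.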


Summary of On Deep Multi-View Representation Learning: Objectives and Optimization

This paper proposes deep canonically correlated autoencoders (DCCAE). The objective of DCCAE contains both a term for canonical correlation and a term for reconstruction errors of the autoencoders. The canonical correlation term in the objective encourages learning information shared across modalities, while the reconstruction term encourages learning unimodal information. The merits of combining the objectives of canonical correlation and reconstruction errors are two-fold. First, there can be a trade-off parameter that weighs learning shared information vs. learning unimodal information. As shown in the paper, for some problems, minimizing reconstruction error is important, such as in the adjective-noun word embedding task, while in other problems it is not important, such as in the noisy MNIST and speech recognition experiments. Second, the reconstruction error objective acts as a regularization term, which can increase model performance since Canonical Correlation Analysis (CCA) can be prone to overfitting.
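As a rough sketch of how the two terms combine, the snippet below computes a correlation term and a reconstruction term from already-computed network outputs and joins them with a trade-off weight. Assumptions on my part: the correlation term is taken as the sum of canonical correlations (singular values of the whitened cross-covariance), the regularizer `reg` and all function names are illustrative, and no networks or training loop are shown.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def total_correlation(fx, gy, reg=1e-4):
    """Sum of canonical correlations between two (samples, dims) feature matrices."""
    n = fx.shape[0]
    fx = fx - fx.mean(axis=0)
    gy = gy - gy.mean(axis=0)
    cov_xx = fx.T @ fx / n + reg * np.eye(fx.shape[1])
    cov_yy = gy.T @ gy / n + reg * np.eye(gy.shape[1])
    cov_xy = fx.T @ gy / n
    # Whitening each view makes its projected dimensions uncorrelated.
    whitened = inv_sqrt(cov_xx) @ cov_xy @ inv_sqrt(cov_yy)
    return np.linalg.svd(whitened, compute_uv=False).sum()

def dccae_style_objective(fx, gy, x, x_rec, y, y_rec, trade_off):
    """Negative correlation plus weighted reconstruction error (to be minimized)."""
    recon = (np.mean(np.sum((x - x_rec) ** 2, axis=1))
             + np.mean(np.sum((y - y_rec) ** 2, axis=1)))
    return -total_correlation(fx, gy) + trade_off * recon

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 3))
inputs = rng.standard_normal((500, 4))
# Perfectly correlated views and perfect reconstructions: objective is about -3.
print(dccae_style_objective(features, 2.0 * features,
                            inputs, inputs, inputs, inputs, trade_off=0.5))
```

Setting `trade_off` to zero leaves only the correlation term, while a large value makes the reconstruction error dominate.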

The authors claim that the uncorrelated feature constraint is necessary, and a comparison of two models supports this. The authors’ DCCAE model and the correlated autoencoders (CorrAE) model have the same objective function, except that CorrAE relaxes DCCAE’s constraint that the features be uncorrelated. In an experiment on noisy MNIST digits, DCCAE significantly outperformed CorrAE: its clustering accuracy, normalized mutual information of the clusters, and classification error rate were 97.5, 93.4, and 2.2, against CorrAE’s 65.5, 67.2, and 12.9, respectively. In addition, a t-SNE visualization showed that DCCAE gave a cleaner separation of the digit clusters than CorrAE. In the speech recognition experiment, DCCAE had a lower error rate than CorrAE (mean phone error rates of 24.5 vs. 30.6, respectively). Finally, in the multilingual word embedding experiment, DCCAE’s embedding had a higher Spearman’s correlation than CorrAE’s.

Uncorrelatedness between different feature dimensions is important so that the features can contribute complementary information. From a statistics perspective, uncorrelated variables do not repeat each other’s linear information, and in the stronger case where variables are independent, the information they provide is additive. Also, having disentangled features makes them easier to interpret.


Summary of AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

In previous approaches to generating an image from a sentence using GANs, the entire sentence was encoded as a single vector, and the GAN was conditioned on this vector.  Attentional Generative Adversarial Network (AttnGAN) also conditions on the sentence vector, but improves on the previous approaches by refining the image in multiple stages using word vectors as well.

The authors propose two novel components in the AttnGAN: an attentional generative network, and a deep attentional multimodal similarity model (DAMSM).

The attentional generative network works as follows. First, the model calculates a hidden state as a function of the sentence vector and some random noise, and a low-resolution image is generated from this hidden state. Next, the image is improved iteratively in stages. Each stage has an attention model that takes in the hidden state of the previous stage as well as the word features, and calculates, for each subregion of the image, a word-context vector that emphasizes the words that subregion most needs to represent. The stage then has a model that takes in the word-context vectors and the previous hidden state to calculate the new hidden state, from which a higher resolution image is generated. For example, in the paper, at one stage the model prioritizes the word “red” for selected regions of the image, so when the bird image is refined, the bird is more red in the new image.
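The attention step can be sketched as a softmax over words for each image subregion. This assumes dot-product scores and word features that have already been projected into the same space as the hidden state. The names and toy values are illustrative, not taken from the paper’s code.

```python
import numpy as np

def softmax(scores, axis):
    """Numerically stable softmax along one axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def word_context(hidden, words):
    """Compute one word-context vector per image subregion.

    hidden: (regions, dim) hidden state from the previous stage.
    words:  (num_words, dim) word features in the same space as hidden.
    """
    scores = hidden @ words.T            # (regions, num_words) relevance scores
    attention = softmax(scores, axis=1)  # each region spreads weight over words
    context = attention @ words          # (regions, dim) weighted word mixture
    return context, attention

# Toy example: two regions, each strongly aligned with a different word.
hidden = np.array([[10.0, 0.0], [0.0, 10.0]])
words = np.array([[1.0, 0.0], [0.0, 1.0]])
context, attention = word_context(hidden, words)
print(attention.round(3))
```

In this toy case each region’s context vector is almost exactly the feature of the word it attends to, which is the mechanism by which a word like “red” can dominate the refinement of specific regions.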

The loss from the attentional generative network has two main parts: unconditional loss that reflects whether the image is real or fake (for a more realistic-looking image) and conditional loss that reflects whether the image and sentence match.

There is an additional loss term that comes from the DAMSM, and this loss term makes sure each word is represented in the image (not just looking at the entire sentence as in the attentional generative network loss). The DAMSM maps words and image regions into a common semantic space to measure how much the words and image regions match up. Sentence text is encoded with a bi-LSTM. The image is encoded with a CNN into the text feature space. For each word in the sentence, attention is applied to the image encoding, so for each word, we have region-context vectors of how the image represents that word. Then the DAMSM loss compares the attention vector of how the image represents that word to the text encoding. So visually, to minimize DAMSM loss for the word “red,” the bird should be a vibrant red versus a muddied-red color.

The paper reports impressive results and supports them by varying model hyperparameters. Two stages of refinement in the attentional generative network perform better (as measured by inception score and R-precision) than one stage, and one stage is better than none. This shows that iterative attention-based refinement improves image quality. Also, increasing the weight applied to the DAMSM loss improves performance, which shows that the DAMSM helps. In addition, the paper claims that training an AttnGAN is more stable than training other GANs, since mode collapse never occurred.

I think the paper does a good job detailing how the model works, so the results should be clear and reproducible. Qualitatively, AttnGAN did well on the Caltech-UCSD Birds dataset, but the images generated from MS-COCO do not look realistic, likely because MS-COCO has more objects and more complex scenes.

From the AttnGAN trained on MS-COCO, here’s an example of an image generated from the caption, “A man and women rest as their horse stands on the trail.”

[Image: AttnGAN output for the caption “A man and women rest as their horse stands on the trail.”]

Grass and horse parts are visible. Textures are sharp. But the objects do not have a reasonable shape. Interestingly, when this image is run through a Fast R-CNN object detector trained on VGG, the brown blob is labeled “horse” with 89% confidence.


Summary of StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN synthesizes photo-realistic images from text descriptions in two stages. The first stage generates a low-resolution sketch with colors and basic shapes, then the second stage adds details and produces a higher resolution image.

The paper introduces a Conditioning Augmentation technique to prevent mode collapse. Mode collapse happens when the distribution of data has several modes, but only one is represented by the generator. Then the generator’s output will all be the same, or all the samples will share similar attributes. Mode collapse can occur when the generator fools the discriminator by producing output near one mode. Then the discriminator learns to predict if the output is real or fake by checking whether the output is near that mode. To fool the discriminator, the generator will produce output near a different mode. This process is repeated, with the generator and discriminator moving from mode to mode. The conditioning augmentation technique alleviates the problem of mode collapse by encouraging smoothness in the latent conditioning manifold. Smoothness in the manifold will discourage peakiness and discrete clusters, so the generator cannot fool the discriminator by simply moving from mode to mode because the distribution is not concentrated into clusters.
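The summary above does not spell out the mechanism. As I understand the StackGAN paper, the conditioning vector is sampled from a Gaussian whose mean and diagonal variance are computed from the text embedding, with a KL term that pulls this Gaussian toward a standard normal. A minimal sketch under that assumption follows. The function names are hypothetical, and `mu` and `log_var` stand in for outputs of a learned layer that is not shown.

```python
import numpy as np

def conditioning_augmentation(mu, log_var, rng):
    """Sample a conditioning vector from N(mu, diag(exp(log_var)))."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps  # reparameterized sample

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), used as a regularizer."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0, 2.0])  # stand-in for a learned mean
log_var = np.zeros(4)                 # stand-in for a learned log-variance

# The same sentence yields a slightly different conditioning vector each time.
sample_a = conditioning_augmentation(mu, log_var, rng)
sample_b = conditioning_augmentation(mu, log_var, rng)
print(sample_a.round(2), sample_b.round(2))
print(kl_to_standard_normal(mu, log_var))
```

Because one sentence maps to a small cloud of conditioning vectors rather than a single point, nearby points in the conditioning space are all seen during training, which is one way to read the smoothness argument.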

The authors evaluate their model in comparison to other models by using inception score and human rankings. There are some flaws in using inception score. For one thing, inception score cannot detect every form of mode collapse.

We want a variety in what the GAN produces, and we want what is produced to be high quality. Inception score supposedly captures these two attributes that we want in a GAN, and is calculated as below:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big)$$

KL divergence measures how much the p(y|x) probability distribution differs from the p(y) probability distribution. Here x is a generated image, p(y|x) is the distribution over class labels y that a pretrained Inception classifier predicts for x, and p(y) is the marginal label distribution averaged over all generated images. Ideally, the p(y|x) distribution should be low entropy, because a high-quality image ought to be confidently recognizable as a single class. And ideally, the p(y) distribution should be uniform (so high entropy), because we want the generated images to have a large variety; that is, we do not want the GAN to favor a particular class y. A performant GAN should then have a high KL divergence, because p(y|x) has low entropy and p(y) has high entropy, and hence a high inception score.

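However, the score only looks at predicted labels, so it misses mode collapse within a class. If the generator collapses to a single image per class, every p(y|x) is still confident and p(y) is still uniform across classes. The GAN then gets a high inception score, as a performant GAN ought to, even though it is not actually producing a variety of samples. (Collapse onto a single class is different: it makes p(y) match p(y|x), which drives the KL divergence toward zero and the score toward its minimum of 1, so that case is detected.)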
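A small sketch of the score computed from classifier outputs makes this blind spot visible. The probability matrices below are made-up toy inputs standing in for classifier predictions on generated images, and `inception_score` is an illustrative helper rather than a reference implementation.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from predicted class probabilities.

    probs: (num_images, num_classes); each row is p(y|x) for one generated image.
    """
    marginal = probs.mean(axis=0, keepdims=True)  # p(y) over all images
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

num_classes = 10
# 1000 images spread evenly over 10 classes, each classified confidently.
many_per_class = np.repeat(np.eye(num_classes), 100, axis=0)
# Only 10 images in total: one per class, each classified confidently.
one_per_class = np.eye(num_classes)
# Every image falls in the same class.
single_class = np.tile(np.eye(num_classes)[0], (100, 1))

print(inception_score(many_per_class))
print(inception_score(one_per_class))
print(inception_score(single_class))
```

The first two cases receive the same score, because the metric sees only the predicted label distributions and not how many distinct images stand behind them. The third case scores the minimum.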

Apart from the metrics discussed in the paper, I can think of alternative evaluation methods for measuring the relevance of the generated image to the input text:

  1. For each model (i.e., StackGAN, StackGAN Stage-I only, and other models such as GAN-INT-CLS), run the generated image through an image captioning model. Compare the generated captions to the original sentence with a translation metric such as BLEU. A performant GAN’s captions ought to score higher, that is, be closer to the original sentence.
  2. For each model, run the generated image through an object detector. Take the objects from the original sentence and see the confidence of each of those objects in the generated images. For each object in the original sentence, the object detector ought to have high confidence that the object is in the generated image.
  3. For each model, look at the k nearest neighbors of the generated image. Run those k nearest neighbors through an image captioning model or object detector, as in 1 and 2. A performant GAN’s nearest neighbors ought to have captions that are closer to the original sentence, and high detector confidence that the sentence’s objects are in the neighbor images.