Summary of AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

In previous approaches to generating an image from a sentence using GANs, the entire sentence was encoded as a single vector, and the GAN was conditioned on this vector. The Attentional Generative Adversarial Network (AttnGAN) also conditions on the sentence vector, but improves on previous approaches by refining the image in multiple stages using word vectors as well.

The authors propose two novel components in the AttnGAN: an attentional generative network, and a deep attentional multimodal similarity model (DAMSM).

The attentional generative network works as follows. First, the model computes a hidden state as a function of the sentence vector and some random noise, and generates a low-resolution image from this hidden state. The image is then improved iteratively in stages. Each stage has an attention model that takes the hidden state of the previous stage and the word features, and computes a word-context vector for each subregion of the image, emphasizing the words that subregion most needs to represent. The stage then combines the word-context vectors with the previous hidden state to produce a new hidden state, from which a higher-resolution image is generated. For example, in the paper, at one stage and in select regions of the image, the word the model prioritizes is "red," so when the bird image is refined, the bird becomes more red in the new image.
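The attention step described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function and variable names (`word_context`, `h`, `e`) are hypothetical, and the real model operates on learned feature maps rather than random toy matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_context(h, e):
    """For each image subregion (row of h), attend over the word
    features e and return a word-context vector.

    h: (N, D) hidden features, one row per subregion
    e: (T, D) word features, one row per word
    returns: (N, D) word-context vectors
    """
    scores = h @ e.T                 # (N, T) subregion-word similarities
    attn = softmax(scores, axis=1)   # attention over words, per subregion
    return attn @ e                  # (N, D) weighted sums of word features

# Toy example: 4 subregions, 3 words, 8-dim common feature space.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
e = rng.normal(size=(3, 8))
c = word_context(h, e)
print(c.shape)  # (4, 8)
```

Each row of `c` is then concatenated with (or added to) that subregion's hidden features before the next stage upsamples to a higher resolution.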

The loss from the attentional generative network has two main parts: an unconditional loss that reflects whether the image is real or fake (encouraging a more realistic-looking image) and a conditional loss that reflects whether the image and sentence match.
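The two-part generator loss can be sketched as an equally weighted sum of the unconditional and conditional terms. This is a simplified numpy sketch, assuming the discriminator outputs probabilities in (0, 1); the weighting and the helper name `generator_loss` are illustrative, not taken from the paper's code.

```python
import numpy as np

def generator_loss(d_uncond, d_cond):
    """Generator loss at one stage.

    d_uncond: discriminator's probability that the image is real
    d_cond:   discriminator's probability that the image matches
              the sentence
    The unconditional term pushes the image toward looking real;
    the conditional term pushes it toward matching the text.
    """
    eps = 1e-8  # avoid log(0)
    uncond = -np.mean(np.log(d_uncond + eps))
    cond = -np.mean(np.log(d_cond + eps))
    return 0.5 * uncond + 0.5 * cond

# A batch the discriminator rates as fairly realistic and on-text
# yields a lower loss than one it rejects.
loss = generator_loss(np.array([0.9, 0.8]), np.array([0.7, 0.6]))
print(loss)
```

In the full model this loss is summed over every stage's generator, since each stage emits an image at its own resolution.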

There is an additional loss term that comes from the DAMSM, which encourages each word to be represented in the image, rather than matching only at the sentence level as in the attentional generative network's conditional loss. The DAMSM maps words and image regions into a common semantic space to measure how well they match. The sentence is encoded with a bi-LSTM, and the image is encoded with a CNN into the same feature space. For each word in the sentence, attention is applied to the image-region encodings, yielding a region-context vector that describes how the image represents that word. The DAMSM loss then compares each word's region-context vector to its text encoding. So, visually, to minimize the DAMSM loss for the word "red," the bird should be a vibrant red rather than a muddied red.
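The word-to-region matching can be sketched like this. Again a hedged numpy illustration: the names (`region_context`, `word_match_scores`) and the sharpening factor `gamma` are hypothetical stand-ins, and the actual DAMSM aggregates these per-word scores into a probabilistic image-sentence matching loss over a batch.

```python
import numpy as np

def region_context(e, v, gamma=5.0):
    """For each word (row of e), attend over image-region features v
    and return a region-context vector describing how the image
    represents that word. gamma sharpens the attention (an assumed
    default, not the paper's exact setting).

    e: (T, D) word features; v: (N, D) region features.
    """
    s = e @ v.T                           # (T, N) word-region similarities
    a = np.exp(gamma * s)
    a = a / a.sum(axis=1, keepdims=True)  # attention over regions, per word
    return a @ v                          # (T, D) region-context vectors

def word_match_scores(e, v):
    """Cosine similarity between each word and its region-context;
    a high score means the image depicts that word well."""
    c = region_context(e, v)
    num = (e * c).sum(axis=1)
    den = np.linalg.norm(e, axis=1) * np.linalg.norm(c, axis=1) + 1e-8
    return num / den

# Toy example: 3 words, 4 image regions, 8-dim common space.
rng = np.random.default_rng(1)
e = rng.normal(size=(3, 8))
v = rng.normal(size=(4, 8))
print(word_match_scores(e, v))
```

Note the direction of attention here is the reverse of the generator's: the DAMSM attends over image regions for each word, while the generator attends over words for each image subregion.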

The paper demonstrates impressive results through ablations over model hyperparameters. Two stages of refinement in the attentional generative network outperform one stage (performance is measured by inception score and R-precision), and one stage outperforms none, showing that iterative attention-based refinement improves image quality. Increasing the weight applied to the DAMSM loss also improves performance, which shows that the DAMSM term helps. In addition, the authors claim that training an AttnGAN is more stable than training other GANs, since they never observed mode collapse.

I thought the paper does a good job detailing how the model works, so the results should be clear and reproducible. Qualitatively, AttnGAN did well on the Caltech-UCSD Birds dataset, but the images generated from MS-COCO do not look realistic; MS-COCO has more objects and more complex scenes.

From the AttnGAN trained on MS-COCO, here’s an example of an image generated from the caption, “A man and women rest as their horse stands on the trail.”

[Generated image for the caption "A man and women rest as their horse stands on the trail"]

Grass and horse parts are visible, and the textures are sharp, but the objects do not have reasonable shapes. Interestingly, when this image is run through a Fast R-CNN object detector built on VGG features, the brown blob is labeled "horse" with 89% confidence.