StackGAN synthesizes photo-realistic images from text descriptions in two stages. The first stage generates a low-resolution sketch with colors and basic shapes; the second stage then adds details and produces a higher-resolution image.
The paper introduces a Conditioning Augmentation technique to mitigate mode collapse. Mode collapse happens when the data distribution has several modes but the generator represents only one (or a few) of them, so its outputs are all the same or all share similar attributes. It can arise when the generator fools the discriminator by producing output near one mode. The discriminator then learns to tell real from fake by checking whether the output is near that mode, so the generator moves to a different mode to fool it again. This cycle repeats, with the generator and discriminator chasing each other from mode to mode. Conditioning Augmentation alleviates the problem by encouraging smoothness in the latent conditioning manifold. A smooth manifold discourages peaky, discrete clusters, so the generator cannot fool the discriminator simply by hopping from mode to mode.
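To make the smoothing idea concrete, here is a minimal NumPy sketch. It assumes (this is my reading of the paper, not something stated above) that Conditioning Augmentation samples the conditioning vector from a Gaussian whose mean and diagonal variance are derived from the text embedding, and adds a KL penalty that pulls that Gaussian toward a standard normal. The names `mu`, `log_var`, `conditioning_augmentation` and `kl_to_standard_normal` are hypothetical, and the fixed arrays stand in for what a learned layer would output.

```python
import numpy as np


def conditioning_augmentation(mu, log_var, rng):
    """Sample c = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


rng = np.random.default_rng(0)

# Assumption: in a real model these come from a layer applied to the text
# embedding; here they are made-up values for one sentence.
mu = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([-1.0, 0.0, -2.0, 0.5])

# Each call gives a slightly different conditioning vector for the same text,
# so the generator sees a neighborhood of the embedding rather than one point.
c_hat = conditioning_augmentation(mu, log_var, rng)
penalty = kl_to_standard_normal(mu, log_var)
print(c_hat, penalty)
```

The penalty is zero only when the Gaussian is exactly a standard normal, so adding it to the generator loss discourages the conditioning distribution from shrinking to isolated points.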
The authors evaluate their model against other models using inception score and human rankings. Inception score has known flaws. For one thing, it can fail to detect mode collapse.
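We want variety in what the GAN produces, and we want what it produces to be high quality. Inception score is meant to capture both of these properties, and is calculated as below, where x is a generated image:
$$IS = \exp\left( \mathbb{E}_{x} \left[ D_{KL}\left( p(y|x) \,\|\, p(y) \right) \right] \right)$$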
KL divergence measures how much the distribution p(y|x) differs from the distribution p(y). Here x is a generated image and y is a class label: p(y|x) is the label distribution that a pretrained Inception classifier predicts for x, and p(y) is that prediction averaged over all generated images. Ideally, p(y|x) should have low entropy, because a high-quality image ought to be confidently recognizable as one class. And ideally, p(y) should be close to uniform (so high entropy), because we want the generated images to have a large variety; that is, we do not want the GAN to favor a particular class y. A performant GAN should therefore have a high KL divergence between the low-entropy p(y|x) and the high-entropy p(y), and hence a high inception score.
However, the score only looks at predicted labels, so it cannot see a lack of variety within a class. If the generator collapses to a single image (or a handful of near-identical images) per class, each image is still classified confidently and p(y) is still uniform across classes. The GAN then earns a high inception score, as a performant GAN ought to, but is actually not producing a variety of samples. Only a collapse onto a few classes, which makes p(y) peaked, would lower the score.
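The blind spot can be checked numerically. The sketch below assumes the standard definition of inception score, in which y is a class label predicted by a pretrained classifier, and it replaces the classifier with made-up one-hot predictions. The function name `inception_score` and the toy arrays are mine, not the paper's.

```python
import numpy as np


def inception_score(p_yx, eps=1e-12):
    """exp( mean_x KL( p(y|x) || p(y) ) ) for an (N, num_classes) array."""
    p_yx = np.clip(np.asarray(p_yx, dtype=float), eps, 1.0)
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal over generated images
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))


num_classes = 10

# Generator A (hypothetical): 1000 distinct images, 100 per class, each
# classified with full confidence.
labels = np.arange(1000) % num_classes
diverse = np.eye(num_classes)[labels]

# Generator B (hypothetical): only 10 memorized images, one per class,
# emitted over and over. The classifier outputs look exactly the same.
memorized = np.eye(num_classes)
collapsed = np.tile(memorized, (100, 1))

score_diverse = inception_score(diverse)
score_collapsed = inception_score(collapsed)
print(score_diverse, score_collapsed)
```

Both generators score about 10, the maximum for 10 classes, because the score never sees the images themselves, only the predicted label distributions.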
Apart from the metrics discussed in the paper, I can think of alternative ways to measure how relevant the generated image is to the input text:
- For each model (i.e. StackGAN, StackGAN Stage-I only, and other models such as GAN-INT-CLS), run the generated image through an image captioning model. Compare the caption to the original sentence with a translation metric such as BLEU. A performant GAN’s captions ought to score higher, that is, be closer to the original sentence.
- For each model, run the generated image through an object detector. Take the objects named in the original sentence and measure the detector’s confidence for each of them in the generated image. For each object in the original sentence, the object detector ought to have high confidence that the object is in the generated image.
- For each model, find the k nearest neighbors of the generated image. Run those neighbors through an image captioning model or object detector, as in the first two methods. A performant GAN’s nearest neighbors ought to have captions that are closer to the original sentence, and high confidence that the sentence’s objects are in the neighbor images.
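The scoring step of the first two methods could look roughly like the sketch below. It assumes the captions and detector confidences have already been produced by some captioning model and object detector, which are not shown. `simple_bleu` is a simplified, hand-rolled BLEU (unigrams and bigrams, one reference, no smoothing), not a standard implementation, and every name and example string here is hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize captions shorter than the reference sentence.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def mean_object_confidence(sentence_objects, detections):
    """Average detector confidence over the objects named in the sentence.

    `detections` maps a label to a confidence in [0, 1]; missing labels count as 0.
    """
    if not sentence_objects:
        return 0.0
    return sum(detections.get(o, 0.0) for o in sentence_objects) / len(sentence_objects)


sentence = "this bird is white with some black on its head"
caption_good = "this bird is white with a black head"  # made-up caption, model A
caption_bad = "a red flower with green leaves"         # made-up caption, model B

bleu_good = simple_bleu(sentence, caption_good)
bleu_bad = simple_bleu(sentence, caption_bad)
conf = mean_object_confidence(["bird", "head"], {"bird": 0.9, "head": 0.5})
print(bleu_good, bleu_bad, conf)
```

For the third method, the same two functions would be applied to each of the k nearest neighbors and the results averaged.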