Summary of Stacked Latent Attention for Multimodal Reasoning

This paper proposes Stacked Latent Attention, an improvement over standard attention.

The standard attention mechanism works as follows: the input is a set of vectors {v_i} and an input state h. The relative importance e_i of vector v_i to h is typically calculated with a two-layer perceptron, e_i = W_u σ(W_v v_i + W_h h). The attention weights are obtained by taking a softmax over the relative importances. The attention layer outputs a content vector z, the sum of the input vectors v_i weighted by the attention weights.

A stacked attention model chains standard attention layers together, so that the input state h of the current layer is the content vector z of the previous attention layer.
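The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the dimensions, weight shapes, and random initialization are all assumptions made for the example.

```python
# Minimal sketch of standard attention and its stacked form (NumPy).
# Shapes and weights are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(v, h, W_v, W_h, W_u):
    """v: (n, d) input vectors; h: (d,) input state. Returns content vector z: (d,)."""
    s = sigmoid(v @ W_v.T + W_h @ h)        # intermediate spatial features s_i, shape (n, m)
    e = s @ W_u                             # relative importance e_i, shape (n,)
    a = np.exp(e - e.max()); a /= a.sum()   # softmax -> attention weights
    return a @ v                            # z: weighted sum of the v_i

# Stacking: the content vector z of one layer becomes the state h of the next.
rng = np.random.default_rng(0)
n, d, m = 5, 8, 8
v, h = rng.normal(size=(n, d)), rng.normal(size=d)
params = [(rng.normal(size=(m, d)), rng.normal(size=(m, d)), rng.normal(size=m))
          for _ in range(2)]
for W_v, W_h, W_u in params:
    h = attention(v, h, W_v, W_h, W_u)      # h is now the refined content vector
```

Note that chaining only works here because z and h share the same dimensionality, which is the usual arrangement when layers are stacked.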

The standard attention mechanism suffers from three issues.

First, latent spatial information is discarded. When attention layers are stacked together, the input h of the current layer is the previous layer's content vector z, but z discards the spatial information s_i = σ(W_v v_i + W_h h) computed in the previous layer's intermediate step, so the current layer must perform its spatial reasoning from scratch.

Second, the content vector from the previous attention layer may be focused on the wrong positions, forcing the current attention layer to work from a poor input state.

Finally, each attention layer contains a sigmoid activation, so stacking several layers can cause a vanishing gradient problem.

The Stacked Latent Attention (SLA) model mitigates these issues by maintaining a fully convolutional pathway for the spatial information {s_i} from input to final output. In each attention layer, once {s_i} is calculated, the model branches: attention is computed in a separate path, and its output is then concatenated with {s_i} to produce z. In this way the spatial information is retained and the vanishing gradient problem is avoided.
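The branch-and-concatenate structure can be sketched as follows. This is a loose illustration based only on the summary's description: the exact form of the concatenation, the shapes, and the use of plain linear maps in place of the paper's convolutions are all assumptions.

```python
# Loose sketch of one Stacked Latent Attention layer (NumPy).
# Branch structure follows the summary; shapes and the concatenation
# details are assumptions, and linear maps stand in for convolutions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sla_layer(v, s_prev, W_v, W_s, W_u):
    """v: (n, d) input vectors; s_prev: (n, m) spatial features from the
    previous layer (zeros for the first layer).
    Returns new spatial features s: (n, m) and layer output z: (n, 2m)."""
    # Spatial features, carried forward layer to layer instead of a content vector.
    s = sigmoid(v @ W_v.T + s_prev @ W_s.T)               # (n, m)
    # Separate attention branch: importance -> softmax -> context.
    e = s @ W_u                                           # (n,)
    a = np.exp(e - e.max()); a /= a.sum()                 # attention weights
    c = a @ s                                             # context vector, (m,)
    # Concatenate the retained spatial features with the attention output,
    # so {s_i} reaches the output without passing through the softmax.
    z = np.concatenate([s, np.tile(c, (s.shape[0], 1))], axis=1)  # (n, 2m)
    return s, z

rng = np.random.default_rng(0)
n, d, m = 5, 8, 6
v = rng.normal(size=(n, d))
s0 = np.zeros((n, m))
W_v, W_s, W_u = (rng.normal(size=(m, d)), rng.normal(size=(m, m)),
                 rng.normal(size=m))
s, z = sla_layer(v, s0, W_v, W_s, W_u)
```

The key design point, per the summary, is that the attention computation is confined to a side branch, while the spatial features {s_i} flow directly into the output.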

Conventional stacked attention models did not perform better when the number of stacked attention layers exceeded two, because of the vanishing gradient issue. SLA performs better because at each attention layer the attention is computed in a separate path, so there is a pathway from input to output containing no sigmoids. The authors provide two visualizations to support this claim. In Figure 5, for layers closer to the input, SLA shows greater variance in its gradients than the stacked attention model, whose gradient distribution is more sharply peaked at zero. In Figure 6, the distribution of attention weights is plotted: the attention weights for SLA are larger, indicating a stronger supervisory signal. The difference is especially noticeable for the first attention layer, indicating that SLA backpropagates the gradient more effectively.

If the hidden state size were increased, I would expect more of the weights to be sparse, with values and gradients closer to zero. However, I would still expect the plots to show the same general trend of SLA mitigating the vanishing gradient issue.

Figure 4 shows only marginal improvement in accuracy with three stacked layers instead of two. It would be interesting to see whether model accuracy continues to improve with four or more stacked layers.

In addition, the paper says that SLA “can be used as a direct replacement in any task utilizing attention,” so it would be interesting to see the effect of replacing attention with stacked latent attention in a popular unimodal model.