This paper proposes the Pointing and Justification (PJ-X) architecture, which generates multimodal explanations.
The textual justifications are conditioned on the inputs (the image and question) as well as the answer. The answer is embedded and passed through an MLP; the question and image are likewise embedded and passed through an MLP. The answer feature and question-image feature are then combined by element-wise multiplication, followed by signed square root and L2 normalization. This multimodal feature goes through another MLP and is normalized over spatial locations to produce a Visual Pointing attention map. The attention map is merged with an encoding of the question and answer, then fed into an LSTM decoder to generate the textual justification.
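The fusion and pointing steps can be sketched in a few lines. This is a minimal NumPy illustration of the operations named above, not the authors' implementation; the random vectors stand in for the learned embeddings and MLP outputs, and the 7x7 grid size is an assumption for illustration:

```python
import numpy as np

def fuse(answer_feat, qimg_feat, eps=1e-8):
    """Combine answer and question-image features (PJ-X-style fusion)."""
    joint = answer_feat * qimg_feat                  # element-wise multiplication
    joint = np.sign(joint) * np.sqrt(np.abs(joint))  # signed square root
    return joint / (np.linalg.norm(joint) + eps)     # L2 normalization

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random vectors stand in for the learned embeddings / MLP outputs.
rng = np.random.default_rng(0)
fused = fuse(rng.standard_normal(8), rng.standard_normal(8))

# Per-cell scores for a (hypothetical) 7x7 spatial grid, flattened; softmax
# turns them into a probability map over image regions -- the attention map.
scores = rng.standard_normal(49)
attn = softmax(scores)
```

The softmax guarantees the attention weights form a distribution over image regions, which is what lets the map be read as "pointing" at evidence.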
In the paper’s experiments, one model is trained on image descriptions and another on answer explanations. The model using descriptions performed worse on all metrics than the model using explanations, because descriptions generically describe the scene in an image, whereas explanations focus on task-specific information, as shown in Figure 2. For visual question answering, the explanation focuses on evidence relevant to the question and answer. For activity recognition, the explanation focuses on evidence that a particular activity (such as juggling) is being performed.
Metrics such as BLEU and METEOR were used to show that the generated explanations were closer to the ground-truth explanations for the model trained on explanations than for the model trained on descriptions. This result is somewhat expected: a model trained on explanations should naturally produce text that sounds like explanations. The model could simply be learning a language model that generates natural-sounding explanation text, so it was not clear to me that it was actually learning to focus on, and point to, the relevant evidence.
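This concern is easy to see in the mechanics of the metrics themselves: BLEU rewards surface n-gram overlap with the reference. Here is a minimal sketch of the clipped n-gram precision at BLEU's core (a simplified, hypothetical implementation without BLEU's brevity penalty or smoothing; the example sentences are made up for illustration):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

# An explanation-style candidate overlaps far more with an explanation reference
# than a description-style candidate does, regardless of visual grounding.
reference = "because he is tossing several balls in the air"
explanation = "because he is throwing balls in the air"
description = "a man is standing in a park"
p_exp = ngram_precision(explanation, reference, 1)
p_desc = ngram_precision(description, reference, 1)
```

Since the score depends only on word overlap, a fluent language model that mimics explanation phrasing can score well without ever attending to the right evidence, which is exactly the ambiguity noted above.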
To test whether the model is indeed learning to explain, the model’s accuracy on the VQA task could be reported. Setting the justifications aside, if the model achieved higher accuracy than a common baseline, that would show it is learning something about finding relevant evidence for a question, and is not just overfitting to the explanations it was trained on. The authors “freeze or finetune the weights of the answering model when training the multimodal explanation model,” so conducting this experiment would require training the model without freezing those weights.