Previous approaches to combining two modalities were to concatenate them, apply an element-wise operation, or take an outer product. The advantage of the outer product over the other two is that it captures more of the interactions between the modalities, since it multiplies every element of one vector by every element of the other. However, the outer product is very high-dimensional, so a model built on it requires a lot of memory and computation, and the number of parameters in any layer that consumes it can become intractable.
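A tiny sketch of why this is, using toy vectors (the sizes here are made up for illustration):

```python
import numpy as np

# Toy feature vectors for two modalities (sizes are arbitrary).
v_text = np.array([1.0, -2.0, 3.0])   # e.g. a question embedding
v_image = np.array([0.5, 4.0])        # e.g. an image embedding

# The outer product contains every pairwise multiplicative interaction:
# entry (i, j) is v_text[i] * v_image[j].
bilinear = np.outer(v_text, v_image)  # shape (3, 2)

# With realistic sizes (e.g. 2048-d features per modality), the outer
# product already has 2048 * 2048 ≈ 4.2M entries, and a fully connected
# layer on top of it would need billions of parameters.
print(bilinear.shape)  # (3, 2)
print(bilinear.size)   # 6 pairwise interactions
```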
Multimodal Compact Bilinear Pooling (MCB) approximates the outer product by randomly projecting each modality’s vector into a higher-dimensional space with Count Sketch, then convolving the two sketches together (computed efficiently as an element-wise product in the FFT domain). In this way, MCB is far more computationally tractable than an explicit outer product, while still capturing the multiplicative interactions between the modalities.
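The mechanics can be sketched in a few lines of numpy. This is a minimal illustration of the Count Sketch / FFT trick, not the authors' implementation; all sizes and names are made up, and the final loop is only a brute-force check of the convolution identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(v, h, s, d):
    """Randomly project v into d dims: index i goes to bucket h[i]
    with random sign s[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

n1, n2, d = 128, 128, 512  # toy input dims and sketch dim
x = rng.standard_normal(n1)  # modality 1 feature vector
y = rng.standard_normal(n2)  # modality 2 feature vector

# Independent hash/sign functions per modality, fixed at model init.
h1, s1 = rng.integers(0, d, n1), rng.choice([-1.0, 1.0], n1)
h2, s2 = rng.integers(0, d, n2), rng.choice([-1.0, 1.0], n2)

sx = count_sketch(x, h1, s1, d)
sy = count_sketch(y, h2, s2, d)

# Circular convolution of the two sketches equals the count sketch of
# the outer product; via FFT this costs O(d log d) instead of
# materializing all n1 * n2 interaction terms.
mcb = np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sy)))

# Brute-force check: sketch the full outer product explicitly.
outer_sketch = np.zeros(d)
for i in range(n1):
    for j in range(n2):
        outer_sketch[(h1[i] + h2[j]) % d] += s1[i] * s2[j] * x[i] * y[j]
assert np.allclose(mcb, outer_sketch)
```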
One problem with MCB is its interpretability, but I can think of a way to better understand what the MCB model is doing: apply Grad-CAM to each bounding box's final CNN feature map. In the Gao et al. (2016) paper that this paper cites, it is shown that we can back-propagate through a compact bilinear pooling layer, so for the MCB grounding architecture we should be able to apply Grad-CAM. We provide the question, image, and candidate bounding boxes as input and forward-propagate them through the model. Then, given the score for the correct bounding box, we back-propagate until we reach the final CNN feature maps. Unlike in the Grad-CAM paper, there is not one final CNN feature map but many: one for each of the bounding box options. For each candidate box, we pool that box's gradients to get per-channel weights, take the weighted sum of its feature maps, and pass the result through a ReLU. In this way, for each bounding box option, we can see which pixels contributed to the grounding model's decision.
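The per-box procedure can be sketched as follows. This is a hypothetical numpy mock-up of the standard Grad-CAM recipe applied once per candidate box; the shapes, the random tensors, and the function name are all illustrative assumptions, and in practice the gradients would come from autograd, not from random sampling:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap for one candidate bounding box.

    feature_maps: (C, H, W) activations of that box's final conv layer.
    gradients:    (C, H, W) gradient of the chosen box's grounding score
                  with respect to those activations.
    """
    # Global-average-pool the gradients: one importance weight per channel.
    weights = gradients.mean(axis=(1, 2))              # (C,)
    # Weighted sum of the feature maps, then ReLU so only the pixels
    # that push this box's score *up* survive.
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    return np.maximum(cam, 0.0)

# Hypothetical example: 4 candidate boxes, each with its own final
# feature maps and (here, faked) gradients.
rng = np.random.default_rng(0)
per_box_maps = [rng.standard_normal((8, 7, 7)) for _ in range(4)]
per_box_grads = [rng.standard_normal((8, 7, 7)) for _ in range(4)]
heatmaps = [grad_cam(a, g) for a, g in zip(per_box_maps, per_box_grads)]
print(heatmaps[0].shape)  # one (7, 7) heatmap per candidate box
```

Each heatmap is non-negative by construction, so it can be upsampled and overlaid on the corresponding box crop to visualize which pixels drove the grounding decision.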