This paper proposes End-to-End Module Networks (N2NMNs). An N2NMN parses a question into subtasks and picks a relevant module for each subtask. The model jointly learns a suitable layout of modules for answering the question and the network parameters of each module.
The N2NMN is more interpretable than Multimodal Compact Bilinear Pooling (MCB) because its visualizations are generated not only at the sentence level of the question but also at the word level. For each word-level visualization, one can see the module it is paired with (such as find, filter, or relocate). The module tells us what action the model is taking, and the word tells us what object that action is applied to, such as “find” (module) a “green matte ball” (word). The model picks a layout that applies these modules in some order, so we can visualize what the model is doing at each step and see at which step it went wrong. Alternatively, the visualizations might show that the model executed every subproblem flawlessly, in which case the error lies in the layout it chose for the question. With MCB, we get neither the word-level, step-wise visualizations nor any interpretable information about a layout or about modules/subproblems.
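To make the step-wise inspection concrete, here is a minimal sketch in Python. It is not the paper's implementation: the scene, the hand-written modules (find, filter_, count) and the helper run_layout are hypothetical stand-ins for the learned neural modules, and attention is reduced to one weight per object. The point is only that a layout yields one inspectable intermediate output per (module, word) step.

```python
# Toy scene: each object is a set of attribute words (hypothetical data).
SCENE = [
    {"green", "matte", "ball"},
    {"red", "matte", "cube"},
    {"green", "shiny", "ball"},
]

def find(scene, word, att=None):
    """Attend to every object that has the attribute `word`."""
    return [1.0 if word in obj else 0.0 for obj in scene]

def filter_(scene, word, att):
    """Keep the incoming attention only where the object also has `word`."""
    return [a * (1.0 if word in obj else 0.0) for a, obj in zip(att, scene)]

def count(scene, word, att):
    """Turn the final attention into an answer; `word` is unused here."""
    return sum(att)

MODULES = {"find": find, "filter": filter_, "count": count}

def run_layout(scene, layout):
    """Apply (module, word) steps in order and record every intermediate output."""
    trace = []
    out = None
    for name, word in layout:
        out = MODULES[name](scene, word, out)
        trace.append((name, word, out))
    return out, trace

# "How many green matte balls are there?"
good_layout = [("find", "ball"), ("filter", "green"), ("filter", "matte"), ("count", None)]
# Same question, but the layout forgets the "matte" step.
bad_layout = [("find", "ball"), ("filter", "green"), ("count", None)]

good_answer, good_trace = run_layout(SCENE, good_layout)
bad_answer, bad_trace = run_layout(SCENE, bad_layout)

for name, word, out in good_trace:
    print(name, word, out)
```

In the second layout every module does exactly what it should, yet the answer is wrong because a step is missing. Reading the trace distinguishes that kind of layout error from an error inside a single module.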
The authors use behavior cloning to train the model, which is arguably not necessary but is done for practical reasons: behavior cloning provides a good initialization of the parameters. In all the paper’s experimental results (for CLEVR, VQA, and SHAPES), the model with policy search after behavior cloning outperformed the model with policy search from scratch. I would argue, though, that policy search after behavior cloning does not necessarily outperform a model trained from scratch. One famous example is AlphaGo Zero, which was trained purely by self-play reinforcement learning without human game data and outperforms the original AlphaGo, whose policy was initialized by supervised learning on human expert games. When I picture the loss surface over the multidimensional space of model parameters, behavior cloning initializes the parameters near a good local optimum. In theory, with enough random initialization trials, the model trained from scratch could end up at a better local optimum than the model initialized with behavior cloning. However, the parameter search space is huge, so trying sufficiently many random initializations could be impractical in time or cost.
Fine-tuning with end-to-end training is needed because the expert policy might not be optimal. Behavior cloning only teaches the model to imitate the expert's layouts, so further training is needed for the model to find layouts that suit the question better than the expert's.
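The two-stage recipe can be illustrated with a toy layout policy. This is a sketch under strong assumptions, not the paper's training code: the policy is a softmax over three candidate layouts for a single question, the rewards in REWARDS are made-up numbers, and the policy search step uses the exact gradient of the expected reward in place of sampled policy-gradient estimates so that the run is deterministic. The names behavior_cloning, policy_search, and expected_reward are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical accuracy obtained by each of three candidate layouts.
REWARDS = [0.1, 0.8, 1.0]
# The expert's layout is good but not the best one available.
EXPERT_CHOICE = 1

def behavior_cloning(logits, steps=30, lr=0.5):
    """Gradient descent on the cross-entropy against the expert's layout."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        for i in range(len(logits)):
            target = 1.0 if i == EXPERT_CHOICE else 0.0
            logits[i] -= lr * (probs[i] - target)
    return logits

def policy_search(logits, steps=2000, lr=1.0):
    """Gradient ascent on expected reward: dE[R]/dlogit_i = p_i * (R_i - E[R])."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, REWARDS))
        for i in range(len(logits)):
            logits[i] += lr * probs[i] * (REWARDS[i] - baseline)
    return logits

def expected_reward(logits):
    return sum(p * r for p, r in zip(softmax(logits), REWARDS))

def best_layout(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

cloned = behavior_cloning([0.0, 0.0, 0.0])
tuned = policy_search(cloned)

print("after cloning:    ", best_layout(cloned), expected_reward(cloned))
print("after fine-tuning:", best_layout(tuned), expected_reward(tuned))
```

After cloning, the policy prefers the expert's layout; after fine-tuning on the reward, it shifts to the layout that actually scores best. In this tiny problem, policy search from scratch would also find the best layout, which is consistent with the argument above that cloning is a practical aid rather than a necessity.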