In *Representation Learning: A Review and New Perspectives*, Bengio et al. discuss distributed and deep representations, along with three lines of research in representation learning: probabilistic models, reconstruction-based algorithms, and manifold-learning approaches. Though the paper separates these methods into discrete buckets, there is considerable overlap between them.

The motivation behind distributed representations is that multiple underlying factors explain an observed input. In a distributed representation, a model can distinguish an exponential number of input regions using combinations of a comparatively small number of features. Different directions in input space correspond to different factors being varied. The motivation behind deep representations is that these explanatory factors form a hierarchy. In a deep model, each layer from input to output computes an increasingly complex combination of lower-level features. Ideally, a model should disentangle the factors of variation, and one of the beliefs motivating deep representations is that abstract, complex factors can have great explanatory power. Both distributed and deep representations reuse features. Distributed representations reuse features because each feature takes part in the codes for many different inputs rather than being dedicated to a single one; likewise, deep representations reuse features because more abstract features are built from lower-level ones. A model can be both distributed and deep.
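
The counting argument behind distributed representations can be made concrete with a small sketch. It assumes binary features purely for illustration (the paper's argument is more general), and the function names are hypothetical: a local, one-hot code with `n` units can name only `n` regions, while a distributed code with `n` binary features can name up to `2**n`.

```python
import itertools

def local_codes(n):
    """One-hot codes: exactly one of n units is active, so there are n codes."""
    return [tuple(int(i == j) for j in range(n)) for i in range(n)]

def distributed_codes(n):
    """Distributed binary codes: any subset of the n features may be active."""
    return list(itertools.product((0, 1), repeat=n))

n_features = 4
print(len(local_codes(n_features)), len(distributed_codes(n_features)))  # prints "4 16"
```

Each of the four features appears in half of the sixteen distributed codes, which is the sense in which features are reused across inputs.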

A joint representation for multimodal data can be learned by giving each modality its own input and hidden layers, then adding a hidden layer that projects all modalities into a shared space. A coordinated representation instead projects each modality into its own space, with the spaces coordinated through some metric. For example, in Andrew et al. (ICML 2013), each modality has its own input and hidden layers, and the objective is that the resulting representations of the modalities are highly correlated.
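
The structural difference between the two can be sketched as an untrained forward pass in NumPy. Everything here is an assumption made for illustration: the layer sizes, the `tanh` nonlinearity, the audio/video naming, and the per-dimension Pearson correlation, which is a simplified stand-in for a correlation objective and not the canonical-correlation objective of Andrew et al.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """One fully connected layer with a tanh nonlinearity."""
    return np.tanh(x @ w + b)

def init(n_in, n_out):
    """Small random weights and zero biases (untrained, for illustration)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

# Hypothetical batch: 8 samples, 20-d audio features, 30-d video features.
audio = rng.normal(size=(8, 20))
video = rng.normal(size=(8, 30))

# Modality-specific hidden layers.
h_audio = dense(audio, *init(20, 16))
h_video = dense(video, *init(30, 16))

# Joint representation: one shared layer over the concatenated hidden layers.
joint = dense(np.concatenate([h_audio, h_video], axis=1), *init(32, 10))

# Coordinated representation: each modality keeps its own projection.
z_audio = dense(h_audio, *init(16, 10))
z_video = dense(h_video, *init(16, 10))

def mean_correlation(a, b):
    """Average per-dimension Pearson correlation of two (batch, dim) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-12
    return float(((a * b).sum(axis=0) / denom).mean())

# Training a coordinated model would push a score like this one upward.
coordination = mean_correlation(z_audio, z_video)
```

In the joint case there is a single output (`joint`) per sample; in the coordinated case there are two (`z_audio`, `z_video`), tied together only through the score that training would optimize.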

At a high level, distributed and deep representations resemble joint and coordinated multimodal representations in that both use multiple features at once to enhance the model’s explanatory power. For example, each modality contributes multiple features, and these features improve the model by adding complementary or redundant information for inference. In addition, we can generate abstract factors by combining features from separate modalities.

Probabilistic models use latent random variables to explain the distribution of the observed data. When fitting the model parameters, the goal is to maximize the likelihood of the training data. One example of a probabilistic model used for joint representations is *Deep Learning for Robust Feature Generation in Audiovisual Emotion Recognition* by Kim et al., where the audio and video modalities go through separate stacked restricted Boltzmann machines and are then mapped into a joint representation to recognize emotions. An example of a coordinated representation is *Deep Correspondence Restricted Boltzmann Machine for Cross-Modal Retrieval*, in which Feng et al. model the image and text modalities with separate restricted Boltzmann machines and impose a correlation constraint as the coordination metric.
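
As a rough illustration of what fitting such a latent-variable model involves, here is a minimal sketch of a Bernoulli restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), a common approximation to the likelihood gradient. The data, layer sizes, learning rate, and function names are assumptions made for this example; it is not the training procedure of either cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update for a Bernoulli RBM.

    v0: (batch, n_visible) binary data; W: (n_visible, n_hidden);
    b: visible biases; c: hidden biases. Returns updated (W, b, c).
    """
    # Positive phase: infer the latent units from the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles, then re-infer the latents.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Difference of data-driven and model-driven statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b = b + lr * (v0 - pv1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy binary data: 32 samples, 6 visible units, 3 latent units (all hypothetical).
data = (rng.random((32, 6)) < 0.5).astype(float)
W = rng.normal(scale=0.01, size=(6, 3))
b = np.zeros(6)
c = np.zeros(3)
for _ in range(100):
    W, b, c = cd1_step(data, W, b, c)

# Latent features that a joint layer or a correlation constraint could consume.
hidden_probs = sigmoid(data @ W + c)
```

In a multimodal setting, one such model per modality would produce `hidden_probs` for that modality, and the joint or coordinated step would operate on those latent features.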

With reconstruction-based algorithms, the input is encoded into a feature-vector representation and then decoded from feature space back into input space. One example in the multimodal domain is Ngiam et al.’s paper, *Multimodal Deep Learning*, where the audio and video modalities go through separate autoencoders, followed by an autoencoder that combines them into a joint representation. In *Cross-modal Retrieval with Correspondence Autoencoder*, Feng et al. use a coordinated representation, modeling the image and text modalities with separate autoencoders and imposing a correlation constraint.
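
The encode-then-decode loop can be sketched with a tiny single-modality autoencoder trained by full-batch gradient descent on the mean squared reconstruction error. The synthetic data, layer sizes, learning rate, and step count are all assumptions chosen for illustration, and the architecture is far simpler than those in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 64 samples of 12-d inputs driven by 3 hidden factors.
factors = rng.normal(size=(64, 3))
x = factors @ rng.normal(size=(3, 12))
x = (x - x.mean(axis=0)) / x.std(axis=0)

# Encoder (12 -> 3, tanh) and linear decoder (3 -> 12), small random init.
w_enc = rng.normal(scale=0.1, size=(12, 3))
b_enc = np.zeros(3)
w_dec = rng.normal(scale=0.1, size=(3, 12))
b_dec = np.zeros(12)

def encode(inputs):
    """Map inputs to the feature (code) space."""
    return np.tanh(inputs @ w_enc + b_enc)

def decode(codes):
    """Map codes back to input space."""
    return codes @ w_dec + b_dec

def reconstruction_loss(inputs):
    return float(np.mean((decode(encode(inputs)) - inputs) ** 2))

initial_loss = reconstruction_loss(x)
lr = 0.01
for _ in range(2000):
    h = encode(x)
    err = decode(h) - x
    d_out = 2.0 * err / err.size               # gradient of the mean squared error
    d_pre = (d_out @ w_dec.T) * (1.0 - h ** 2)  # backpropagate through tanh
    # Compute every gradient before updating any parameter.
    g_w_dec, g_b_dec = h.T @ d_out, d_out.sum(axis=0)
    g_w_enc, g_b_enc = x.T @ d_pre, d_pre.sum(axis=0)
    w_dec -= lr * g_w_dec
    b_dec -= lr * g_b_dec
    w_enc -= lr * g_w_enc
    b_enc -= lr * g_b_enc
final_loss = reconstruction_loss(x)
codes = encode(x)
```

After training, `final_loss` should be lower than `initial_loss`, and `codes` is the learned 3-d representation. For two modalities, one such encoder per modality would feed either a shared autoencoder (joint) or a correlation constraint between the two code spaces (coordinated).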

Manifold-learning approaches are based on the hypothesis that the distribution of real-world data in high dimensions is concentrated near regions of much lower dimensionality. With manifold-learning approaches, we can therefore reduce our input to a lower-dimensional representation while retaining most of the explanatory information. Again, we can reduce the dimensionality of each modality individually and then map the modalities into a joint representation, or we can keep the modalities separate but impose some constraint, as in a coordinated representation.
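
The simplest instance of this idea is a linear one: principal component analysis, which finds a low-dimensional subspace that retains most of the variance. The sketch below reduces two synthetic modalities separately and then concatenates the results as input for a joint layer. The data, dimensions, and the `pca_reduce` helper are hypothetical, and real data would generally call for nonlinear methods.

```python
import numpy as np

def pca_reduce(x, k):
    """Project the rows of x onto their top-k principal directions.

    Returns (projected data, fraction of total variance retained).
    """
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    retained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return centered @ vt[:k].T, retained

rng = np.random.default_rng(0)

# Hypothetical modalities that both lie near 2-d subspaces of their input spaces.
latent = rng.normal(size=(200, 2))
image = latent @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(200, 50))
text = latent @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(200, 30))

# Reduce each modality on its own ...
image_low, image_kept = pca_reduce(image, 2)
text_low, text_kept = pca_reduce(text, 2)

# ... then stack the reduced features as input to a joint representation.
joint_input = np.concatenate([image_low, text_low], axis=1)
```

Because the synthetic data is almost exactly two-dimensional, `image_kept` and `text_kept` come out very close to 1, meaning that 50 and 30 input dimensions are each summarized by 2 with little loss.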