This paper proposes deep canonically correlated autoencoders (DCCAE). The DCCAE objective combines a canonical correlation term with the reconstruction errors of the autoencoders. The canonical correlation term encourages the model to learn information shared across modalities, while the reconstruction term encourages it to retain unimodal information. Combining the two has two merits. First, a trade-off parameter weighs learning shared information against learning unimodal information. As the paper shows, minimizing reconstruction error is important for some problems, such as the adjective-noun word embedding task, and unimportant for others, such as the noisy MNIST and speech recognition experiments. Second, the reconstruction term acts as a regularizer, which can improve model performance because canonical correlation analysis (CCA) can be prone to overfitting.
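
The combined objective described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ridge term `reg`, the trade-off weight `lam`, and the exact normalization of the reconstruction term are hypothetical choices made for the example. The correlation term is computed in the standard CCA way, as the sum of singular values of the whitened cross-covariance matrix.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def total_correlation(h1, h2, reg=1e-4):
    """Sum of canonical correlations between two (n, d) feature matrices.

    `reg` is a small ridge term added for numerical stability (an assumption
    of this sketch, not a value taken from the paper).
    """
    n = h1.shape[0]
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    s11 = h1c.T @ h1c / (n - 1) + reg * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (n - 1) + reg * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (n - 1)
    # Whitened cross-covariance; its singular values are the canonical correlations.
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return float(np.linalg.svd(t, compute_uv=False).sum())

def dccae_style_loss(h1, h2, x1, x1_hat, x2, x2_hat, lam):
    """Negative correlation plus `lam` times the mean squared reconstruction error.

    h1, h2         : encoder outputs for the two views
    x1, x2         : original inputs for the two views
    x1_hat, x2_hat : autoencoder reconstructions of those inputs
    lam            : hypothetical trade-off weight (shared vs. unimodal information)
    """
    recon = (np.mean(np.sum((x1 - x1_hat) ** 2, axis=1))
             + np.mean(np.sum((x2 - x2_hat) ** 2, axis=1)))
    return -total_correlation(h1, h2) + lam * recon
```

Setting `lam` to zero in this sketch leaves only the correlation term, while a large `lam` makes the reconstruction errors dominate, which mirrors the trade-off described above.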

The authors claim that the uncorrelated feature constraint is necessary, and results comparing two models support this. The authors’ DCCAE model and the correlated autoencoders (CorrAE) model share the same objective function, except that CorrAE drops DCCAE’s constraint that the learned features be uncorrelated. In the noisy MNIST digits experiment, DCCAE substantially outperformed CorrAE: clustering accuracy, normalized mutual information of the clustering, and classification error rate were 97.5, 93.4, and 2.2 for DCCAE versus 65.5, 67.2, and 12.9 for CorrAE. In addition, a t-SNE visualization showed that DCCAE gave a cleaner separation of the digit clusters than CorrAE. In the speech recognition experiment, DCCAE had a lower mean phone error rate than CorrAE (24.5 vs. 30.6). Finally, in the multilingual word embedding experiment, DCCAE’s embedding achieved a higher Spearman’s correlation than CorrAE’s.
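
One way to see why dropping the constraint can hurt is a toy example. The sketch below assumes that, without the uncorrelatedness constraint, the correlation objective reduces to a sum of per-dimension correlations between the two views. The function names and the synthetic data are hypothetical and are not taken from the paper. In the example, every feature dimension copies the same signal, so the unconstrained sum is close to its maximum even though the dimensions are fully redundant.

```python
import numpy as np

def per_dim_correlation_sum(h1, h2):
    """Sum over dimensions i of corr(h1[:, i], h2[:, i]); nothing ties dimensions together."""
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    num = (h1c * h2c).sum(axis=0)
    den = np.sqrt((h1c ** 2).sum(axis=0) * (h2c ** 2).sum(axis=0))
    return float((num / den).sum())

def max_offdiag_correlation(h):
    """Largest absolute correlation between two different feature dimensions of one view."""
    corr = np.corrcoef(h, rowvar=False)
    off = corr - np.diag(np.diag(corr))
    return float(np.abs(off).max())

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 1))            # a single shared latent signal
noise = 0.1 * rng.standard_normal((1000, 3))  # small view-specific noise

# Degenerate solution: all three dimensions of each view copy the same signal.
h1 = np.repeat(z, 3, axis=1)
h2 = np.repeat(z, 3, axis=1) + noise

print("per-dimension correlation sum:", per_dim_correlation_sum(h1, h2))
print("max correlation between dimensions of view 1:", max_offdiag_correlation(h1))
```

An uncorrelatedness constraint rules out this degenerate solution, because the three dimensions of each view would have to carry linearly distinct signals.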

Uncorrelatedness between feature dimensions matters because it lets the features contribute complementary rather than redundant information. From a statistical perspective, when variables are independent of each other, the information they provide is additive; uncorrelatedness is a weaker condition than independence, but it still rules out linear redundancy between dimensions. Disentangled features are also easier to interpret.
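
The additivity argument can be made concrete with standard information-theoretic identities. Reading "information" as Shannon entropy is an assumption of this note, not something the paper specifies. Here $H$ denotes entropy, $I$ mutual information, and $\rho$ the correlation coefficient between two features $X$ and $Y$.

```latex
\begin{align}
H(X, Y) &= H(X) + H(Y) - I(X; Y), \\
I(X; Y) &= 0 \iff X \text{ and } Y \text{ are independent}, \\
I(X; Y) &= -\tfrac{1}{2}\log\bigl(1 - \rho^{2}\bigr)
  \quad \text{for jointly Gaussian } X, Y.
\end{align}
```

In the jointly Gaussian case, $\rho = 0$ gives $I(X; Y) = 0$, so uncorrelated features carry strictly additive information. Outside that case, uncorrelatedness removes only the linear part of the redundancy.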