i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?

1Peking University, 2Hong Kong University of Science and Technology, 3Mohamed bin Zayed University of Artificial Intelligence, 4Carnegie Mellon University
*Equal Contribution
i-MAE Framework

We introduce interpretable Masked Autoencoders (i-MAE), a simple yet effective pre-training framework for exploring the characteristics of MAE-learned representations.

Abstract

Masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training approach in the vision domain. However, the mechanism behind such a scheme and the properties of the representations it learns remain largely unexplored. In this work, through comprehensive experiments and empirical studies on Masked Autoencoders (MAE), we address two critical questions about the behaviors of the learned representations:

(i) Are the latent representations in Masked Autoencoders linearly separable if the input is a mixture of two images instead of one? Answering this provides concrete evidence for why MAE-learned representations perform so impressively on downstream tasks, as demonstrated throughout the literature.

(ii) What is the degree of semantics encoded in the latent feature space by Masked Autoencoders?

To explore these two questions, we propose a simple yet effective Interpretable MAE (i-MAE) framework that combines a two-way image reconstruction with a distillation-based latent feature reconstruction, helping us understand the behaviors inside MAE's structure. We conduct extensive experiments to verify our observations. Furthermore, we propose two novel metrics to quantify the degree of linear separability and of semantics encoded in the latent space. The surprising and consistent results across qualitative and quantitative experiments demonstrate that i-MAE is a well-suited framework for interpretability research on MAE, while also achieving stronger representational ability.
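For readers who want something concrete, the PyTorch-style sketch below illustrates how such a two-way objective could be wired up: two images are mixed, the masked mixture is encoded, a separation head splits the latent into two branches, and each branch is trained with an image-reconstruction loss plus a feature-distillation loss against a frozen vanilla MAE. All module names (encoder, split_head, decoder1, decoder2, teacher_encoder) and the unweighted loss sum are illustrative assumptions, not the paper's actual implementation; in practice the student and teacher would likely also need to share the same mask, a detail glossed over here.

import torch
import torch.nn.functional as F


def patchify(imgs, p=16):
    # Flatten images (B, C, H, W) into (B, num_patches, p*p*C) patch tokens,
    # the usual MAE reconstruction-target layout.
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)


def i_mae_step(x1, x2, alpha, encoder, split_head, decoder1, decoder2,
               teacher_encoder, mask_ratio=0.75):
    # Linearly mix the two inputs (mixup-style interpolation); x1 is the
    # dominant image when alpha > 0.5, and x2 is then the subordinate one.
    x_mix = alpha * x1 + (1 - alpha) * x2

    # Encode the masked mixture. We assume the encoder masks randomly
    # inside and returns latent tokens plus the indices the decoder needs,
    # mirroring the reference MAE code.
    z_mix, mask, ids_restore = encoder(x_mix, mask_ratio)

    # Disentangle the mixed latent into two branch-specific representations
    # (a linear separation module; a stand-in callable here).
    z1, z2 = split_head(z_mix)

    # Two-way image reconstruction: each branch targets one unmixed image.
    loss_rec = F.mse_loss(decoder1(z1, ids_restore), patchify(x1)) \
             + F.mse_loss(decoder2(z2, ids_restore), patchify(x2))

    # Latent reconstruction via distillation: align each branch's features
    # with those of a frozen, vanilla pre-trained MAE on the unmixed inputs.
    with torch.no_grad():
        t1, _, _ = teacher_encoder(x1, mask_ratio)
        t2, _, _ = teacher_encoder(x2, mask_ratio)
    loss_distill = F.mse_loss(z1, t1) + F.mse_loss(z2, t2)

    return loss_rec + loss_distill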

Interactive Demo

Subordinate Reconstruction With Mix & Mask Ratios

i-MAE explores the linear separability of the two input images under varying mix ratios and mask ratios. Use the sliders here to view reconstructions at different mask and mixing ratios. Note that the reconstruction target changes at mix ratio = 0.5, since the target is always the subordinate image, i.e., the one with the smaller mixing coefficient; a minimal sketch of this logic follows the list below.
- a higher mix/mask ratio (1.0)
- a lower mix/mask ratio (0.0)
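For concreteness, here is a tiny sketch of how the mixture input and the subordinate target behind these sliders could be formed. This is hypothetical, not the demo's actual code, and it assumes mix_ratio is the coefficient applied to the first image.

import torch

def make_demo_pair(x1, x2, mix_ratio):
    # Mixture shown in the "Input Mixture" panel.
    x_mix = mix_ratio * x1 + (1 - mix_ratio) * x2
    # The subordinate image (the smaller mixing coefficient) is the
    # reconstruction target, which is why it flips at mix_ratio = 0.5.
    target = x2 if mix_ratio >= 0.5 else x1
    return x_mix, target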


Input Mixture


Subordinate Reconstruction


Reconstruction + Visible


Mask Ratio


Mixture Ratio (target changes at 0.5)


Drag the sliders and play around!



Target Image


BibTeX

@article{zhang2022i-mae,
    title={i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?},
    author={Zhang, Kevin and Shen, Zhiqiang},
    journal={arXiv preprint arXiv:2210.11470},
    year={2022}
}