MLP-Mixer: An all-MLP Architecture for Vision

This recent paper challenges the need for complicated transformer-based models on huge datasets and questions the inductive biases currently built into image-recognition architectures.

The paper argues that, given a huge dataset (100M+ images), the performance of traditional CNN-based architectures and the newer transformer-based architectures is only marginally better than that of a classic MLP-based architecture, calling into question whether the inductive biases of CNNs and Transformers are necessary for images.
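To make "all-MLP" concrete, here is a rough sketch of a single Mixer layer. This is not the paper's code; the shapes, hidden widths, and weight initialization below are illustrative. The layer alternates a token-mixing MLP (shared across channels, applied along the patch dimension) with a channel-mixing MLP (shared across patches, applied along the channel dimension), each with layer normalization and a skip connection.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel (last) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_block(x, w1, w2, w3, w4):
    """One Mixer layer on a (num_patches, channels) token table.

    Token mixing: an MLP applied across patches (columns of x).
    Channel mixing: an MLP applied across channels (rows of x).
    Both sub-blocks use a residual (skip) connection.
    """
    # Token-mixing MLP: mixes information between image patches.
    y = x + w2 @ gelu(w1 @ layer_norm(x))
    # Channel-mixing MLP: mixes information between channels per patch.
    return y + gelu(layer_norm(y) @ w3) @ w4

# Toy sizes (illustrative, not from the paper):
# S=4 patches, C=8 channels, hidden widths Ds=16 and Dc=32.
rng = np.random.default_rng(0)
S, C, Ds, Dc = 4, 8, 16, 32
x = rng.normal(size=(S, C))
w1 = rng.normal(size=(Ds, S)) * 0.1   # token-mixing weights
w2 = rng.normal(size=(S, Ds)) * 0.1
w3 = rng.normal(size=(C, Dc)) * 0.1   # channel-mixing weights
w4 = rng.normal(size=(Dc, C)) * 0.1
out = mixer_block(x, w1, w2, w3, w4)
print(out.shape)  # (4, 8): same token table shape in and out
```

The full model just stacks such blocks on a patch embedding, followed by global average pooling and a linear classifier; no convolutions or self-attention anywhere.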

Please feel free to read along with the paper with my notes and highlights.

| Color  | Meaning                                         |
| ------ | ----------------------------------------------- |
| Green  | Topics about the current paper                  |
| Yellow | Topics about other relevant references          |
| Blue   | Implementation details / maths                  |
| Red    | My thoughts, questions, and understandings      |

Citation

@misc{tolstikhin2021mlpmixer,
      title={MLP-Mixer: An all-MLP Architecture for Vision}, 
      author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
      year={2021},
      eprint={2105.01601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}