MLP-Mixer: An all-MLP Architecture for Vision

This recent paper challenges the need for complicated transformer-based models on huge datasets and questions the inductive biases currently built into image-recognition architectures.

The paper argues that, given a huge dataset (100M+ images), the performance of traditional CNN-based architectures and the newer transformer-based architectures is only marginally better than that of a classic MLP-based architecture, calling into question whether the inductive biases of CNNs and Transformers are necessary for images.
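To make "all-MLP" concrete, here is a rough sketch of a single Mixer layer. This is not the paper's code; the shapes, hidden widths, and weight initialization below are illustrative. The layer alternates a token-mixing MLP (shared across channels, applied along the patch dimension) with a channel-mixing MLP (shared across patches, applied along the channel dimension), each with layer normalization and a skip connection.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel (last) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_block(x, w1, w2, w3, w4):
    """One Mixer layer on a (num_patches, channels) token table.

    Token mixing: an MLP applied across patches (columns of x).
    Channel mixing: an MLP applied across channels (rows of x).
    Both sub-blocks use a residual (skip) connection.
    """
    # Token-mixing MLP: mixes information between image patches.
    y = x + w2 @ gelu(w1 @ layer_norm(x))
    # Channel-mixing MLP: mixes information between channels per patch.
    return y + gelu(layer_norm(y) @ w3) @ w4

# Toy sizes (illustrative, not from the paper):
# S=4 patches, C=8 channels, hidden widths Ds=16 and Dc=32.
rng = np.random.default_rng(0)
S, C, Ds, Dc = 4, 8, 16, 32
x = rng.normal(size=(S, C))
w1 = rng.normal(size=(Ds, S)) * 0.1   # token-mixing weights
w2 = rng.normal(size=(S, Ds)) * 0.1
w3 = rng.normal(size=(C, Dc)) * 0.1   # channel-mixing weights
w4 = rng.normal(size=(Dc, C)) * 0.1
out = mixer_block(x, w1, w2, w3, w4)
print(out.shape)  # (4, 8): same token table shape in and out
```

The full model just stacks such blocks on a patch embedding, followed by global average pooling and a linear classifier; no convolutions or self-attention anywhere.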

Please feel free to read along with the paper with my notes and highlights.

| Color  | Meaning                                         |
| ------ | ----------------------------------------------- |
| Green  | Topics about the current paper                  |
| Yellow | Topics about other relevant references          |
| Blue   | Implementation details / maths                  |
| Red    | My thoughts, questions, and understandings      |

Citation

@misc{tolstikhin2021mlpmixer,
      title={MLP-Mixer: An all-MLP Architecture for Vision}, 
      author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
      year={2021},
      eprint={2105.01601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}