MLP-Mixer Annotated Paper

less than 1 minute read

MLP-Mixer: An all-MLP Architecture for Vision

This recent paper challenges the need for complicated transformer-based models on huge datasets and questions the inductive biases built into current image recognition architectures.

This paper argues that given a huge dataset (100M+ images), the performance of traditional CNN-based architectures and the newer transformer-based architectures is only marginally better than that of a classic MLP-based architecture, thus questioning the inductive biases of both CNNs and Transformers for images.
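
To make concrete what "an all-MLP architecture" means here: the paper's Mixer layer alternates a token-mixing MLP (mixing information across image patches) and a channel-mixing MLP (mixing information within each patch), with layer normalization and skip connections around each. Below is a minimal PyTorch sketch of one such layer; the module names and example dimensions are my own illustrative choices (roughly the Mixer-B/16 settings), not code from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, used for both mixing steps."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerBlock(nn.Module):
    """One Mixer layer: token-mixing MLP across patches, then channel-mixing MLP across channels."""
    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_patches, tokens_hidden)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channels_hidden)

    def forward(self, x):                              # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)              # (batch, channels, patches)
        x = x + self.token_mlp(y).transpose(1, 2)      # mix across patches (token mixing)
        x = x + self.channel_mlp(self.norm2(x))        # mix across channels (channel mixing)
        return x

# Illustrative sizes: a 224x224 image split into 16x16 patches gives 196 tokens.
block = MixerBlock(num_patches=196, channels=768, tokens_hidden=384, channels_hidden=3072)
out = block(torch.randn(2, 196, 768))                  # -> (2, 196, 768)
```

The whole network is just a patch-embedding layer, a stack of such blocks, global average pooling, and a linear classifier, with no convolutions (beyond the patch embedding) and no self-attention.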

Please feel free to read along with the paper with my notes and highlights.

Color meaning:
Green: Topics about the current paper
Yellow: Topics about other relevant references
Blue: Implementation details / maths
Red: My thoughts, questions, and understandings


CITATION

@misc{tolstikhin2021mlpmixer,
      title={MLP-Mixer: An all-MLP Architecture for Vision}, 
      author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
      year={2021},
      eprint={2105.01601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}