MLP-Mixer Annotated Paper
MLP-Mixer: An all-MLP Architecture for Vision
This 2021 paper challenges the need for complicated transformer-based models on huge datasets and questions the inductive biases currently built into image recognition models.
The paper argues that, given a huge dataset (100M+ images), traditional CNN-based architectures and the newer transformer-based architectures perform only marginally better than a classic MLP-based architecture, which calls into question the inductive biases of both CNNs and Transformers for images.
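To make "all-MLP" concrete, here is a minimal sketch of a single Mixer layer as I understand it from the paper: one MLP mixes information across patches (token mixing), a second MLP mixes across channels, and each has a skip connection, with no convolution or attention anywhere. The function names, the plain-Python lists, the omitted biases and learned LayerNorm parameters, the tanh approximation of GELU, and all sizes are my own simplifying assumptions, not the authors' implementation.

```python
import math
import random


def gelu(x):
    # tanh approximation of GELU (an assumption; other variants exist)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(a, b):
    # (n x k) times (k x m) -> (n x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-6):
    # Normalize each row (one patch) over its channels; no learned scale/shift here.
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def mlp(x, w1, w2):
    # Two dense layers with a GELU in between, applied to every row of x.
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def mixer_layer(x, params):
    # x has shape (patches x channels).
    # Token mixing: transpose so each row is a channel, then mix across patches.
    mixed = mlp(transpose(layer_norm(x)), params["tok_w1"], params["tok_w2"])
    u = add(x, transpose(mixed))  # skip connection
    # Channel mixing: each row is a patch, mix across its channels.
    return add(u, mlp(layer_norm(u), params["ch_w1"], params["ch_w2"]))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(num_patches, channels, tok_hidden, ch_hidden, seed=0):
    # Hypothetical initializer with arbitrary hidden widths, for illustration only.
    rng = random.Random(seed)
    return {
        "tok_w1": random_matrix(num_patches, tok_hidden, rng),
        "tok_w2": random_matrix(tok_hidden, num_patches, rng),
        "ch_w1": random_matrix(channels, ch_hidden, rng),
        "ch_w2": random_matrix(ch_hidden, channels, rng),
    }
```

Calling `mixer_layer(x, init_params(num_patches, channels, tok_hidden, ch_hidden))` on a `patches x channels` list of lists returns a list of the same shape, so layers like this can be stacked. The only image-specific assumption left is the initial split into patches, which is the point the paper makes about inductive biases.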
Feel free to read the paper alongside my notes and highlights.
Color | Meaning |
---|---|
Green | Topics about the current paper |
Yellow | Topics about other relevant references |
Blue | Implementation details / maths |
Red | Text including my thoughts, questions, and understandings |
CITATION
@misc{tolstikhin2021mlpmixer,
title={MLP-Mixer: An all-MLP Architecture for Vision},
author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
year={2021},
eprint={2105.01601},
archivePrefix={arXiv},
primaryClass={cs.CV}
}