MLP-Mixer: An all-MLP Architecture for Vision
This recent paper challenges the need for complex transformer-based models when very large datasets are available, and questions the inductive biases currently built into image recognition architectures.
The paper argues that, given a huge dataset (100M+ images), traditional CNN-based architectures and the newer transformer-based architectures perform only marginally better than a classic MLP-based architecture, which calls into question the inductive biases of both CNNs and Transformers for images.
Feel free to read the paper alongside my notes and highlights.
| Color | Meaning |
|---|---|
| Green | Topics about the current paper |
| Yellow | Topics about other relevant references |
| Blue | Implementation details / maths |
| Red | Text including my thoughts, questions, and understandings |