Fastformer: Additive Attention Can Be All You Need

Of late, this paper has been all the rage for its claim of an attention mechanism whose time complexity is linear in the sequence length. Why is this such a big deal, you ask? Well, if you are familiar with transformers, one of their biggest downsides is the quadratic complexity of self-attention, which creates a huge bottleneck for longer sequences. So if additive attention works out, we will no longer need the strict cap of around 512 tokens used in BERT and many subsequent transformer-based architectures. The paper compares Fastformer with other well-known efficient transformer techniques and conducts experiments on five well-known datasets.
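To make the linear-time claim a bit more concrete, here is a minimal NumPy sketch of the additive attention the paper describes, for a single head and ignoring batching. The function and variable names (additive_attention, w_q, w_k, and so on) are my own, and the full Fastformer layer also adds per-head projections, an output transformation, and a residual connection with the query; treat this as an illustration rather than the authors' implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(Q, K, V, w_q, w_k):
    """Fastformer-style additive attention for one head.

    Q, K, V: (seq_len, d) projected query/key/value matrices.
    w_q, w_k: (d,) learnable scoring vectors.
    Every step is a single pass over the sequence, so the cost is
    O(seq_len * d) instead of the O(seq_len^2 * d) of full self-attention.
    """
    d = Q.shape[-1]

    # Summarize all query vectors into one global query.
    alpha = softmax(Q @ w_q / np.sqrt(d))   # (seq_len,)
    q_global = alpha @ Q                     # (d,)

    # Mix the global query into every key (element-wise product),
    # then summarize the result into one global key.
    P = K * q_global                         # (seq_len, d)
    beta = softmax(P @ w_k / np.sqrt(d))     # (seq_len,)
    k_global = beta @ P                      # (d,)

    # Mix the global key into every value; the paper follows this with a
    # linear transform and a residual connection back to the query.
    U = V * k_global                         # (seq_len, d)
    return U

# Quick check with random inputs: the output keeps the input shape.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
w_q, w_k = rng.standard_normal(64), rng.standard_normal(64)
print(additive_attention(Q, K, V, w_q, w_k).shape)  # (128, 64)

Because each stage only reduces the sequence to a single global vector and then broadcasts it back, there is never an N-by-N interaction matrix, which is where the linear scaling comes from.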

Please feel free to read along with the paper using my notes and highlights.

Color legend:
Green: topics about the current paper
Yellow: topics about other relevant references
Blue: implementation details, maths, and experiments
Red: text with my thoughts, questions, and understandings

Follow me on GitHub and star this repo for regular updates.

Also, follow me on Twitter.

PS: For now, the PDF above does not render properly on mobile devices, so please download it using the button above or get it from my GitHub.

Citation

@misc{wu2021fastformer,
      title={Fastformer: Additive Attention Can Be All You Need}, 
      author={Chuhan Wu and Fangzhao Wu and Tao Qi and Yongfeng Huang and Xing Xie},
      year={2021},
      eprint={2108.09084},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}