Fastformer Annotated Paper


Fastformer: Additive Attention Can Be All You Need

Of late, this paper has been all the rage with its claim of introducing an attention mechanism whose time complexity is linear in the sequence length. Why is this such a big deal, you ask? Well, if you are familiar with transformers, one of their biggest downsides is the quadratic complexity of self-attention, which creates a huge bottleneck for longer sequences. Instead of modeling the interaction between every pair of tokens, Fastformer uses additive attention to summarize all the queries into a single global query vector (and, analogously, the keys into a global key vector), so each step only touches every token once. If additive attention works out, we will no longer have the strict cap of 512 tokens introduced in the original and subsequent transformer-based architectures. The paper compares itself with other well-known efficient transformer techniques and conducts experiments on five well-known datasets.
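To make the idea concrete, here is a minimal single-head sketch of the additive attention described in the paper, written in PyTorch. The module and parameter names (`FastSelfAttention`, `to_q`, `w_q`, and so on) are my own, not the authors'; treat it as an illustration of the linear-time trick under my reading of the paper, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastSelfAttention(nn.Module):
    """Single-head additive attention, linear in sequence length."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** 0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)   # scores each query for the global summary
        self.w_k = nn.Linear(dim, 1)   # scores each (query-aware) key
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):              # x: (batch, seq_len, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Summarize all queries into one global query vector: O(N * d).
        alpha = F.softmax(self.w_q(q).squeeze(-1) / self.scale, dim=-1)  # (B, N)
        global_q = torch.einsum('bn,bnd->bd', alpha, q)                  # (B, d)
        # Inject the global query into every key element-wise, then summarize.
        p = k * global_q.unsqueeze(1)                                    # (B, N, d)
        beta = F.softmax(self.w_k(p).squeeze(-1) / self.scale, dim=-1)   # (B, N)
        global_k = torch.einsum('bn,bnd->bd', beta, p)                   # (B, d)
        # Inject the global key into every value and project back.
        u = v * global_k.unsqueeze(1)                                    # (B, N, d)
        return self.to_out(u) + q      # residual connection to the queries

x = torch.randn(2, 128, 64)            # batch of 2, 128 tokens, dim 64
out = FastSelfAttention(64)(x)         # same shape: (2, 128, 64)
```

Note that no step ever forms an N x N attention matrix: every operation touches each token once, so the cost grows linearly with sequence length instead of quadratically.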

Please feel free to read the paper along with my notes and highlights.

| Color  | Meaning |
| ------ | ------- |
| Green  | Topics about the current paper |
| Yellow | Topics about other relevant references |
| Blue   | Implementation details / maths / experiments |
| Red    | Text including my thoughts, questions, and understandings |

Follow me on GitHub and star this repo for regular updates.

Also, follow me on Twitter.

PS: For now, the PDF above does not render properly on mobile devices, so please download the PDF using the button above or get it from my GitHub.


CITATION

@misc{wu2021fastformer,
      title={Fastformer: Additive Attention Can Be All You Need}, 
      author={Chuhan Wu and Fangzhao Wu and Tao Qi and Yongfeng Huang and Xing Xie},
      year={2021},
      eprint={2108.09084},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}