BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

The revolutionary paper by Google that raised state-of-the-art performance on various NLP tasks and set the stepping stone for many subsequent architectures.
This paper set a direction for the entire field: it shows clear benefits of pre-trained models (trained on huge datasets) and transfer learning, independent of the downstream task.
Feel free to read along with the paper using my notes and highlights.
| Color | Meaning |
|---|---|
| Green | Topics about the current paper |
| Yellow | Topics about other relevant references |
| Blue | Implementation details / maths / experiments |
| Red | Text including my thoughts, questions, and understandings |
I have added the architectural details, my insights on the Transformer architecture, and some ideas about positional embeddings at the end.