DiT Annotated Paper

DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER DocumentAI with images has a new leader in town, and it's DiT! Yet another stellar paper from the folks at Microsoft advancing the field of DocumentAI. This paper draws inspiration from several prior works to come up with a clean, end-to-end pre-trained network for image tasks like document image classification, document layout analysis, and table detection. It also lays a foundation for the upcoming multimodal networks for document understanding and plays an important role in the upcoming LayoutLMv3. Read along to explore this easy-to-read paper, which could have a significant impact on the field. ...

April 21, 2022 · 2 min · Akshay Uppal

WebFormer Annotated Paper

WebFormer: The Web-page Transformer for Structure Information Extraction Understanding tokens from unstructured web pages is challenging in practice due to the variety of web layout patterns; this is where WebFormer comes into play. In this paper, the authors propose a novel architecture, WebFormer, a Web-page transFormer model for structure information extraction from web documents. The paper also introduces rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. This could prove to be a big leap in web page understanding, as it delivers strong incremental results and a way forward for the domain. ...

March 7, 2022 · 2 min · Akshay Uppal

LayoutLMv2 Annotated Paper

LayoutLMv2: Multi-Modal Pre-Training For Visually-Rich Document Understanding Microsoft delivers again with LayoutLMv2, further maturing the field of document understanding. The new pre-training tasks, the spatial-aware self-attention mechanism, and the fact that image information is integrated into the pre-training stage itself distinguish this paper from its predecessor LayoutLM and establish new state-of-the-art performance on six widely used datasets across different tasks. It takes a step further in understanding documents through visual cues alongside textual content and layout information via a multi-modal approach, carefully integrating image, text, and layout information in the new self-attention mechanism. ...

December 16, 2021 · 2 min · Akshay Uppal

Fastformer Annotated Paper

Fastformer: Additive Attention Can Be All You Need Of late, this paper is all the rage with its claim of an attention mechanism that has linear time complexity with respect to sequence length. Why is this such a big deal, you ask? Well, if you are familiar with transformers, one of their biggest downsides is the quadratic complexity of self-attention, which creates a huge bottleneck for longer sequences. So if additive attention works out, we will no longer have a strict cap of 512 tokens as introduced in the original and subsequent transformer-based architectures. The paper compares itself with other well-known efficient transformer techniques and conducts experiments on five well-known datasets. ...
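The linear-complexity idea above can be sketched roughly as follows: instead of computing all N×N query-key scores, the queries are pooled into a single global query vector, mixed into the keys element-wise, pooled again into a global key, and mixed into the values. This is a minimal single-head NumPy sketch under my reading of the approach; the function and parameter names are illustrative, and the paper's final linear transform and residual connection are omitted:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(Q, K, V, w_q, w_k):
    """Additive-attention sketch: O(N * d) in sequence length N.

    Q, K, V: (N, d) query/key/value matrices.
    w_q, w_k: (d,) learned projection vectors (illustrative names).
    """
    d = Q.shape[1]

    # Pool all queries into one global query via learned attention weights.
    alpha = softmax(Q @ w_q / np.sqrt(d))      # (N,) weights
    q_global = alpha @ Q                       # (d,) global query

    # Mix the global query into each key by element-wise product,
    # then pool the result into a single global key the same way.
    P = K * q_global                           # (N, d)
    beta = softmax(P @ w_k / np.sqrt(d))       # (N,)
    k_global = beta @ P                        # (d,) global key

    # Mix the global key into each value; output is one vector per token.
    return V * k_global                        # (N, d)
```

Every step is a single pass over the sequence, so the cost grows linearly with N rather than quadratically as in standard dot-product self-attention.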

October 4, 2021 · 2 min · Akshay Uppal

LayoutLM Annotated Paper

LayoutLM: Pre-training of Text and Layout for Document Image Understanding Diving deeper into the domain of document understanding, today we have a brilliant paper by the folks at Microsoft. The main idea of this paper is to jointly model the text and layout information of documents. The authors discuss the importance of layout features in the form of 2D positional embeddings and visual features in the form of token-wise image embeddings, alongside textual features, for state-of-the-art document understanding. This paper is a solid milestone in the domain and is now actively used as a benchmark for the latest research in the area. ...
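The joint text-and-layout idea above boils down to summing a token's word embedding with embeddings of its bounding-box coordinates. A minimal sketch, assuming coordinates normalised to a 0–1000 grid and randomly initialised lookup tables purely for illustration (table sizes and names are my own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, max_coord = 100, 8, 1000

# Hypothetical embedding tables (randomly initialised for illustration).
tok_emb = rng.standard_normal((vocab_size, hidden))      # word embeddings
x_emb = rng.standard_normal((max_coord + 1, hidden))     # x-coordinate embeddings
y_emb = rng.standard_normal((max_coord + 1, hidden))     # y-coordinate embeddings

def layoutlm_embedding(token_ids, boxes):
    """Sum word embeddings with 2D positional embeddings.

    token_ids: (N,) integer token ids.
    boxes: (N, 4) integer (x0, y0, x1, y1) coords, normalised to 0..1000.
    """
    e = tok_emb[token_ids]                    # (N, hidden) textual features
    x0, y0, x1, y1 = boxes.T
    # Each box corner contributes its own positional embedding.
    return e + x_emb[x0] + y_emb[y0] + x_emb[x1] + y_emb[y1]
```

Token-wise image embeddings, cropped from the page image per word, would be added to this sum in the same way.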

August 26, 2021 · 2 min · Akshay Uppal