LayoutLMv2 Annotated Paper
LayoutLMv2: Multi-Modal Pre-Training For Visually-Rich Document Understanding
Microsoft delivers again with LayoutLMv2, further maturing the field of document understanding. The new pre-training tasks, the spatially-aware self-attention mechanism, and the integration of image information directly into the pre-training stage distinguish this paper from its predecessor, LayoutLM, and establish new state-of-the-art results on six widely used datasets across different tasks. The model takes document understanding a step further by combining visual cues with textual content and layout information in a multi-modal approach, carefully integrating image, text, and layout in the new self-attention mechanism.
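To make the spatially-aware self-attention idea concrete, here is a minimal, single-head PyTorch sketch: the usual scaled dot-product attention scores are augmented with learned biases indexed by bucketed 1D token-distance and 2D bounding-box offsets. The names (`SpatialAwareSelfAttention`, `relative_bucket`), the bucketing heuristic, and the choice of box corner are my own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relative_bucket(rel_pos, num_buckets=32, max_dist=128):
    """Map signed relative offsets to a small set of buckets: exact buckets for
    small offsets, log-scaled buckets for larger ones (assumed heuristic)."""
    sign_offset = (rel_pos < 0).long() * (num_buckets // 2)   # separate buckets for negative offsets
    rel = rel_pos.abs().clamp(min=1)
    exact = num_buckets // 4                                   # offsets below this keep their own bucket
    log_bucket = exact + (
        torch.log(rel.float() / exact)
        / torch.log(torch.tensor(max_dist / exact))
        * (num_buckets // 2 - exact)
    ).long()
    log_bucket = log_bucket.clamp(max=num_buckets // 2 - 1)
    return sign_offset + torch.where(rel < exact, rel, log_bucket)


class SpatialAwareSelfAttention(nn.Module):
    """Single-head sketch: scaled dot-product attention plus learned biases for
    1D token distance and 2D (x, y) bounding-box offsets."""

    def __init__(self, dim, num_buckets=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.bias_1d = nn.Embedding(num_buckets, 1)            # bias for token-index offsets
        self.bias_x = nn.Embedding(num_buckets, 1)             # bias for horizontal box offsets
        self.bias_y = nn.Embedding(num_buckets, 1)             # bias for vertical box offsets
        self.scale = dim ** -0.5

    def forward(self, hidden, boxes):
        # hidden: (seq_len, dim) token embeddings; boxes: (seq_len, 4) as (x0, y0, x1, y1)
        q, k, v = self.q(hidden), self.k(hidden), self.v(hidden)
        scores = (q @ k.t()) * self.scale                       # (seq_len, seq_len)

        pos = torch.arange(hidden.size(0))
        rel_1d = pos[None, :] - pos[:, None]                    # token-index offsets
        rel_x = (boxes[None, :, 0] - boxes[:, None, 0]).long()  # top-left x offsets
        rel_y = (boxes[None, :, 1] - boxes[:, None, 1]).long()  # top-left y offsets

        scores = (
            scores
            + self.bias_1d(relative_bucket(rel_1d)).squeeze(-1)
            + self.bias_x(relative_bucket(rel_x)).squeeze(-1)
            + self.bias_y(relative_bucket(rel_y)).squeeze(-1)
        )
        return F.softmax(scores, dim=-1) @ v
```

A full implementation would add multiple heads, batching, and attention masks; the sketch only illustrates how the layout-derived biases enter the attention scores.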
Please feel free to read along with the paper using my notes and highlights.
| Color  | Meaning |
|--------|---------|
| Green  | Topics about the current paper |
| Yellow | Topics about other relevant references |
| Blue   | Implementation details / maths / experiments |
| Red    | My thoughts, questions, and understandings |
| !!!    | No supporting evidence for the claim, so take it with a grain of salt |
Follow me on GitHub and star this repo for regular updates.
Also, follow me on Twitter.
PS: For now, the PDF above does not render properly on mobile devices, so please download it using the button above or get it from my GitHub.
CITATION
@misc{xu2021layoutlmv2,
title={LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding},
author={Yang Xu and Yiheng Xu and Tengchao Lv and Lei Cui and Furu Wei and Guoxin Wang and Yijuan Lu and Dinei Florencio and Cha Zhang and Wanxiang Che and Min Zhang and Lidong Zhou},
year={2021},
eprint={2012.14740},
archivePrefix={arXiv},
primaryClass={cs.CL}
}