WebFormer Annotated Paper

1 minute read

WebFormer: The Web-page Transformer for Structure Information Extraction

Understanding tokens from unstructured web pages is challenging in practice due to the wide variety of web layout patterns; this is where WebFormer comes into play. In this paper, the authors propose WebFormer, a novel Web-page transFormer model for structure information extraction from web documents. The paper also introduces rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. This can prove to be a big leap in web page understanding, as it delivers notable gains and offers a way forward for the domain.
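To give a feel for what attention between HTML tokens and text tokens means, here is a minimal, illustrative sketch of text-to-HTML cross-attention. It is not the authors' implementation: the projection matrices, dimensions, and function name are hypothetical, and it uses standard scaled dot-product attention rather than the paper's full set of attention patterns.

```python
# Minimal sketch (assumption, not the paper's code): text tokens attend over
# HTML/layout tokens via standard scaled dot-product attention.

import torch
import torch.nn.functional as F

def text_to_html_attention(text_emb, html_emb, d_k=64):
    """text_emb: (num_text, d_model), html_emb: (num_html, d_model)."""
    d_model = text_emb.size(-1)
    # Hypothetical projections; in a real model these would be learned layers.
    Wq = torch.randn(d_model, d_k) / d_model ** 0.5
    Wk = torch.randn(d_model, d_k) / d_model ** 0.5
    Wv = torch.randn(d_model, d_k) / d_model ** 0.5

    q = text_emb @ Wq                    # queries from text tokens
    k = html_emb @ Wk                    # keys from HTML (layout) tokens
    v = html_emb @ Wv                    # values from HTML tokens

    scores = q @ k.T / d_k ** 0.5        # (num_text, num_html) attention logits
    weights = F.softmax(scores, dim=-1)  # each text token attends over HTML tokens
    return weights @ v                   # layout-aware text representations

# Toy usage: 5 text tokens, 3 HTML tokens, hidden size 32.
out = text_to_html_attention(torch.randn(5, 32), torch.randn(3, 32))
print(out.shape)  # torch.Size([5, 64])
```

The actual WebFormer model combines several such patterns (HTML-to-HTML, HTML-to-text, text-to-HTML, and local text-to-text attention); see the annotated paper for the details.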

Please feel free to read along with the paper with my notes and highlights.

| Color  | Meaning                                           |
|--------|---------------------------------------------------|
| Green  | Topics about the current paper                    |
| Yellow | Topics about other relevant references            |
| Blue   | Implementation details, maths, and experiments    |
| Red    | Text including my thoughts, questions, and understandings |

Follow me on GitHub and star this repo for regular updates.

Also, follow me on Twitter.

PS: For now, the embedded PDF above does not render properly on mobile devices, so please download the PDF using the button above or get it from my GitHub.


CITATION

@misc{wang2022webformer,
      title={WebFormer: The Web-page Transformer for Structure Information Extraction}, 
      author={Qifan Wang and Yi Fang and Anirudh Ravula and Fuli Feng and Xiaojun Quan and Dongfang Liu},
      year={2022},
      eprint={2202.00217},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}