Volume 14 / Issue 11

DOI:   10.3217/jucs-014-11-1893


Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction

Jinbeom Kang (Hanyang University, Korea)

Joongmin Choi (Hanyang University, Korea)

Abstract: As web sites grow more complicated, building web information extraction systems becomes more troublesome and time-consuming. A recurring difficulty is locating the segments of a page that contain the target information, which we call the informative blocks. This article reports on the Recognising Informative Page Blocks algorithm (RIPB), which identifies the informative block in a web page so that information extraction algorithms can work on it more efficiently. RIPB relies on an existing algorithm for vision-based page block segmentation to analyse and partition a web page into a set of visual blocks, and then groups related blocks with similar content structures into block clusters by using a tree edit distance method. RIPB then recognises the informative block cluster by using tree alignment and tree matching. In a series of experiments, RIPB was more than 95% accurate in recognising informative block clusters and improved the efficiency of information extraction by 17%.

Keywords: information extraction, informative block, visual block

Categories: H.3.7, H.5.4
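
The grouping step described in the abstract (clustering visual blocks whose content structures are similar under a tree edit distance) can be illustrated with a minimal sketch. This is not the RIPB implementation: the abstract does not say which tree edit distance is used, so the code below assumes a simple top-down (Selkow-style) distance in which only root relabelling and whole-subtree insertion or deletion are allowed. The `Node` class, the `similarity` normalisation, the greedy `cluster_blocks` procedure, the 0.8 threshold, and the sample blocks are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tag tree node standing in for the DOM subtree of one visual block."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance (assumed variant, unit costs).

    Roots are always matched (cost 1 if their tags differ). Children are
    aligned like a string edit distance, where deleting or inserting a child
    costs the size of its subtree and substituting costs the recursive distance.
    """
    root_cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),  # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),  # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + d[m][n]


def similarity(a: Node, b: Node) -> float:
    """Distance normalised to (0, 1]; 1.0 means structurally identical."""
    return 1 - tree_distance(a, b) / (size(a) + size(b))


def cluster_blocks(blocks: list[Node], threshold: float = 0.8) -> list[list[Node]]:
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters: list[list[Node]] = []
    for block in blocks:
        for cluster in clusters:
            if similarity(block, cluster[0]) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


# Hypothetical sample: three product-like blocks and one navigation block.
item1 = Node("div", [Node("img"), Node("a"), Node("span")])
item2 = Node("div", [Node("img"), Node("a"), Node("span")])
item3 = Node("div", [Node("img"), Node("a")])
nav = Node("ul", [Node("li", [Node("a")]), Node("li", [Node("a")])])

clusters = cluster_blocks([item1, item2, item3, nav])
```

In this sample the three `div` blocks end up in one cluster and the navigation list in another. How RIPB then selects the informative cluster (by tree alignment and tree matching, according to the abstract) is not shown here, since the abstract gives no further detail.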