Unsupervised Structured Data Extraction from Template-generated Web Pages
            
            
               Tomas Grigalis (Vilnius Gediminas Technical University, Lithuania)  
              
             
            
            
               Antanas Čenys (Vilnius Gediminas Technical University, Lithuania)  
              
             
                    
            
              Abstract: This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This potential encourages both industry and academia to seek automatic solutions. To tackle the problem of automatic structured Web data extraction, we present a new approach: structured data extraction based on clustering visually similar Web page elements. Our method, called ClustVX, combines visual and pure HTML features of a Web page to cluster visually similar Web page elements and then extract structured Web data. ClustVX can extract structured data from Web pages where more than one data record is present. Through extensive experimental evaluation on three benchmark datasets, we demonstrate that ClustVX achieves better results than other state-of-the-art automatic structured Web data extraction methods.
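
The general idea of grouping Web page elements by combined HTML and visual features can be illustrated with a minimal sketch. This is not the ClustVX algorithm itself, whose details the abstract does not give. The `Element` type, the choice of features (tag path, font size, color), and the function names are all assumptions made for illustration. Elements that share the same key are treated as visually similar, and clusters with at least two members hint at repeated data records.

```python
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical, simplified view of a rendered Web page element."""
    tag_path: str   # HTML feature, e.g. "html/body/ul/li/span"
    font_size: str  # visual feature (assumed to come from a rendering engine)
    color: str      # visual feature
    text: str


def cluster_key(el: Element) -> tuple:
    # Assumption: identical tag path plus identical style means "visually similar".
    return (el.tag_path, el.font_size, el.color)


def cluster_elements(elements: list) -> dict:
    """Group element texts by key; keep only clusters with repeated members."""
    clusters = defaultdict(list)
    for el in elements:
        clusters[cluster_key(el)].append(el.text)
    # A cluster with one member cannot come from a repeated data record.
    return {key: texts for key, texts in clusters.items() if len(texts) >= 2}


elements = [
    Element("html/body/h1", "24px", "#000", "Shop"),
    Element("html/body/ul/li/h3", "16px", "#000", "Laptop A"),
    Element("html/body/ul/li/span", "14px", "#c00", "$999"),
    Element("html/body/ul/li/h3", "16px", "#000", "Laptop B"),
    Element("html/body/ul/li/span", "14px", "#c00", "$1299"),
]
clusters = cluster_elements(elements)
```

In this toy input, the product names and prices each form a cluster, while the page heading is discarded because it occurs only once.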
             
            
              Keywords: Deep Web, data extraction, structured web data, wrapper induction 
             
            Categories: H.0, H.2.8, H.3.3, H.3.5  