Volume 14 / Issue 11

DOI:   10.3217/jucs-014-11-1857

 

Structure-Based Crawling in the Hidden Web

Marcio Vidal (Federal University of Amazonas, Brazil)

Altigran S. da Silva (Federal University of Amazonas, Brazil)

Edleno S. de Moura (Federal University of Amazonas, Brazil)

João M.B. Cavalcanti (Federal University of Amazonas, Brazil)

Abstract: The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion for deciding which pages to include in a collection is based on their contents; in others, a structure-based criterion is more appropriate. In this article, we present a proposal for building structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, and does not require heavy user interaction. Our experiments show a precision of 100% on seventeen real-world web sites, with both static and dynamic content, and a recall of 95% on the eleven static web sites examined.

Keywords: Web crawling, hidden web, tree-edit distance, web wrappers

Categories: H.3.3, H.3.4, H.3.5, H.3.7