Volume 14 / Issue 11


Wrapping Web Data Islands

J.UCS Special Issue

Rafael Corchuelo
(Universidad de Sevilla, Sevilla, Spain)

José L. Arjona
(Universidad de Huelva, Huelva, Spain)

David Ruiz
(Universidad de Sevilla, Sevilla, Spain)

EAI, EII, and ETL are the three key abbreviations regarding integration. EAI stands for Enterprise Application Integration and refers to a number of technologies and best practices that help engineers integrate business applications, either to keep their data synchronised or to devise new emerging functionality [Hohpe and Woolf, 2003]. EII stands for Enterprise Information Integration; in this case, the focus is on creating live views of the data a number of applications manipulate [Kambhampati and Knoblock, 2003]. The focus of ETL, which stands for Extract, Transform, and Load, is on off-line data views that are typically used to feed business intelligence processes [Silvers, 2008]. According to a recent report [Weiss, 2005], companies spend $5-20 on integration per dollar spent on devising and implementing new applications. This is the reason why EAI, EII, and ETL are a common hobbyhorse for chief information and technology officers.

Our focus regarding integration is on web sites that do not provide a programmatic interface, which are very common nowadays. Such sites are difficult to integrate into automated business processes, which is the reason why we refer to them as web data islands. The Web Services and Semantic Web initiatives [Papazoglou, 2007] [Antoniou and van Harmelen, 2008] provide excellent technologies by means of which web sites can provide a programmatic interface or, at least, provide data that is structured according to an ontology. However, re-engineering a web data island to endow it with a web service or with semantic annotations is not generally feasible. This has motivated many researchers to work on wrappers, which implement programmatic interfaces to web data islands by emulating the interaction of a human user, i.e., they fill forms in, navigate through the resulting pages, select the most appropriate data pages, extract data from them, structure the data according to an ontology, and verify that the results are valid. Thanks to wrappers, integrating web data islands into automated business processes has become a common practice [Chidlovskii et al., 2006].
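To make the wrapper steps concrete, the following minimal Python sketch emulates them against a tiny in-memory site. It is an illustration under stated assumptions, not a system described in this issue: the `SITE` dictionary, the `book` class name, the pipe-separated row format, and every function name are hypothetical, and a flat record schema stands in for an ontology.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": maps a form query to the HTML of its result page.
SITE = {
    "wrappers": "<ul><li class='book'>Web Wrappers|2008|35.50</li>"
                "<li class='ad'>Buy now!</li>"
                "<li class='book'>Data Islands|2007|n/a</li></ul>",
}


class BookExtractor(HTMLParser):
    """Extraction step: collects the text of every <li class="book"> element."""

    def __init__(self):
        super().__init__()
        self.in_book = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Only list items marked as books are data regions; ads are skipped.
        self.in_book = tag == "li" and ("class", "book") in attrs

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_book = False

    def handle_data(self, data):
        if self.in_book:
            self.rows.append(data)


def submit_form(query):
    """Form filling and navigation: returns the result page for a query."""
    return SITE.get(query, "")


def structure(row):
    """Structuring step: maps a raw row onto a small record schema."""
    title, year, price = row.split("|")
    return {"title": title, "year": year, "price": price}


def is_valid(record):
    """Verification step: the year must be digits and the price a number."""
    try:
        float(record["price"])
    except ValueError:
        return False
    return record["year"].isdigit()


def wrap(query):
    """Programmatic interface to the site: query in, verified records out."""
    extractor = BookExtractor()
    extractor.feed(submit_form(query))
    extractor.close()
    records = [structure(row) for row in extractor.rows]
    return [record for record in records if is_valid(record)]


print(wrap("wrappers"))
```

In this sketch the advertisement is dropped at extraction time and the record with a non-numeric price is rejected at verification time, so a caller only sees records that passed every step.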


The goal of this special issue was to report on the state of the art regarding wrappers. We think we have succeeded in this endeavour since we have selected six papers that report on novel systems and techniques that help engineers build wrappers, namely:

  • Jim Blythe, Dipsy Kapoor, Craig A. Knoblock, Kristina Lerman, and Steven Minton are the authors of the first article. They report on a system to help users query web sites, extract and ontologise the data they provide, and create complex procedures to exploit these data by means of a simple natural language interface.
  • The article by Paula Montoto, Alberto Pan, Juan Raposo, José Losada, Fernando Bellas, and Víctor Carneiro reports on a language that helps engineers devise and implement wrappers. They criticise the common query wrapper model whereby a wrapper gets a query as input and outputs a result set, and support the implementation of wrappers in which navigation depends on the results that are being retrieved or wrappers that can insert, delete or update information.
  • The third article was written by Marcio Vidal, Altigran S. da Silva, Edleno S. de Moura, and João M.B. Cavalcanti. It reports on a novel technique that uses a structural criterion to crawl a web site for pages about a given topic. The experiments indicate that the technique is effective and can distinguish between closely-related topics, e.g., films and actors.
  • Lorenzo Blanco, Valter Crescenzi, and Paolo Merialdo also focus on structural crawling, and present a technique that helps classify the template from which a given page is generated. It has also proved to be very effective with closely-related topics.
  • The fifth article reports on a technique that helps identify the areas of interest of a web page. It relies on a visual segmentation algorithm and helps information extractors work more efficiently and effectively. The article was contributed by Jinbeom Kang and Joongmin Choi.
  • Dawn G. Gregg reports on the results of a series of experiments she has conducted to explore how resilient information extractors are.


We would like to thank our reviewers for their hard work and enthusiasm in producing this special issue, and the J.UCS staff for their help and understanding. Without a shadow of a doubt, they were fundamental to producing this issue, and we would not like to miss this opportunity to express our gratitude. The work of the guest editors was supported by the IntegraWeb project (grants CICYT TIN2007-64119 and JA TIC-2602). Part of the budget comes from FEDER funds.



[Antoniou and van Harmelen, 2008] Antoniou, G. and van Harmelen, F. (2008). A Semantic Web Primer. The MIT Press, 2nd edition.

[Chidlovskii et al., 2006] Chidlovskii, B., Roustant, B., and Brette, M. (2006). Documentum ECI self-repairing wrappers: performance analysis. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 708-717.

[Hohpe and Woolf, 2003] Hohpe, G. and Woolf, B. (2003). Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley.

[Kambhampati and Knoblock, 2003] Kambhampati, S. and Knoblock, C. (2003). Information integration on the web. IEEE Intelligent Systems, 18(5):14-15.

[Papazoglou, 2007] Papazoglou, M. (2007). Web Services: Principles and Technology. Prentice Hall.

[Silvers, 2008] Silvers, F. (2008). Building and Maintaining a Data Warehouse. Auerbach.

[Weiss, 2005] Weiss, J. (2005). Aligning relationships: Optimizing the value of strategic outsourcing. Technical report, IBM.
