Fingerprinting Lexical Contexts over the Web
Vincenzo Di Lecce (Polytechnic of Bari, Italy)
Marco Calabrese (Polytechnic of Bari, Italy)
Domenico Soldo (Polytechnic of Bari, Italy)
Abstract: This paper presents a novel technique for identifying lexical contexts in web resources. The basic idea is to consider web site anchor texts as lexicalized descriptions of an individual ontology organized in the form of a graph of concept words. In the search for peculiar semantic patterns, the concept of web minutia (transposed from the forensic domain) is introduced. The proposed technique consists of searching for web minutiae in the analyzed web sites by means of a golden ontology. Web minutiae act as fingerprints for context-specific web resources; in this sense they are a powerful computational tool for identifying and categorizing web resources. The WordNet database was used as the golden ontology for our experiments on English web documents. WordNet allows for indexing and retrieving word senses and inter-word taxonomic relations such as hyponymy and hypernymy. It has proven to be an efficient mediator between web ontologies and context-dependent taxonomies. Our experiments were carried out on a preliminary data set of several tens of thousands of links taken from the web sites of thirteen UK universities. Preliminary results suggest that web minutiae can identify lexical contexts across the Web.
Keywords: Semantic Web, Web Mining, WordNet, golden ontology, knowledge discovery, minutia
Categories: I.2.4, L.1.4