DOI: 10.3217/jucs-002-07-0515

On Combining the Knowledge of Heterogeneous Information Repositories

Uwe M. Borghoff
Rank Xerox Research Centre, Grenoble Laboratory
6, chemin de Maupertuis, F-38240 Meylan, France.
Email: borghoff@grenoble.rxrc.xerox.com

Johann H. Schlichter
Institut für Informatik, Technische Universität München
D-80290 München, Germany.
Email: schlicht@informatik.tu-muenchen.de

Abstract: The Internet facilitates access to a large amount of electronic information. However, in order to exploit the flood of information, sophisticated search facilities are needed which convert the inundation of electronic data coming from numerous sources into real knowledge. From this knowledge a whole range of users will benefit, from business people to casual surfers and shoppers on the Internet.

Intelligent agents or knowledge brokers play a vital role in realizing this vision. This paper presents a framework of knowledge brokers which search for information that is potentially available but stored without anticipation of how it will be exploited. More strikingly, the paper presents an architectural framework in which the user can retrieve and combine knowledge uniformly, irrespective of where or how the knowledge-representing information is stored.

Lessons learned from a prototype implementation allow a discussion of shortcomings arising from the current emphasis of information repositories and their interfaces, above all their poor support for knowledge combination and the difficulty of localizing the appropriate information repositories.

Key Words: distributed agent system, knowledge brokers, information gathering, data retrieval, Internet.

Category: H.2.5, H.3

1 Introduction

Since their origins, database systems have provided data storage and data retrieval in a uniform way. These systems have emphasized data consistency and data coherence. Data have been stored in a way which allows appropriate searches with well-defined, formatted retrieval languages. Roughly speaking, the same still holds true within the context of federated databases [Sheth and Larson, 1990]. Federated databases hide their different data models and query languages. Through schema integration and a common query language (e. g. SQL), the federation appears as a logically integrated database of existing heterogeneous distributed databases.

New electronic information sources, such as distributed file systems, Internet-based information sources, or other on-line information repositories, proliferate. These distributed "collections" of information have a slightly different orientation: they are used to present data, sometimes tailored to graphical presentation, rather than to store data for retrieval. On the Internet, many people now make their private collections of data publicly available. Large data collections of bibliographical information, e. g. in BibTeX format, are accessible electronically. In general, a variety of miscellaneous information is privately updated, gathered for special needs, and provided to the large user community. In fact, if a user is looking for a particular piece of information, there is a good chance that this information is stored somewhere in publishing tools like Gopher or the World-Wide Web. However, it is difficult to discover the location and the way in which the information is provided. Since search forms and other discovery tools are developed independently, interoperability is difficult to achieve. In many cases, moreover, a user must contact more than one source to reach the final destination where the answer to his/her question can be found.

Combining search results, extracting knowledge from one search in order to trigger a subsequent search, and, finally, evaluating the search hits are major challenges when searching the heterogeneous information repositories of the Internet. This paper presents an architectural framework to tackle these challenges.

The paper is organized as follows. After a brief discussion of related work in Sect. 2, we present in Sect. 3 the architectural framework which we used for knowledge combination. The prototype architecture consists of a user interface, knowledge brokers which specify electronic information through constraints, and wrappers which provide the interfaces to existing external data sources. Sect. 4 discusses shortcomings of the current approach. Sect. 5 concludes the paper.

2 Related Work

Well-established publishing systems, like Gopher and the World-Wide Web, provide a seamless information space in the Internet, at least as far as graphical browsing is concerned. Index and search subsystems appeared hand in hand with the rapid growth in the amount of information and in the number of users having specific needs. [Obraczka et al., 1993] and [Schwartz et al., 1992] give an overview of resource discovery approaches.

Some of the earliest Internet indexing approaches were the Wide-Area Information Servers (WAIS) [Kahle and Medlar, 1991], providing a Z39.50-based search and retrieval interface, and Archie [Emtage and Deutsch, 1992]. Archie periodically contacts a set of registered servers to gather a file index. Similar to that is Aliweb [Aliweb, URL] which contains user-written summaries of server contents that are displayed on request.

More recently, with Glimpse (GLobal IMPlicit SEarch) [Manber and Wu, 1994], [Glimpse, URL] an index/search subsystem has been installed that allows sophisticated searches over entire file systems. Among other things, it allows misspelling and regular expression searches over non-uniform information including many types of documents. At the University of Karlsruhe, a prominent application has been realized on top of Glimpse, namely the sophisticated search facility for a large collection of computer science bibliographies [CS biblio, URL].

Although multi-source index/search subsystems have already been built for Gopher, with Veronica [Veronica, URL], and for the Web, with Alta Vista [Alta Vista, URL], Lycos [Mauldin and Leavitt, 1994], [Lycos, URL], and the World-Wide Web Worm [McBryan, 1994], [WWWW, URL], retrieval engines and retrieval support systems for heterogeneous information are still open research fields [Barbara, 1993]. However, early prototypes have already appeared. The system Inquery [Callan et al., 1992], [Callan et al., 1995], currently being developed at the University of Massachusetts at Amherst, calculates the appropriateness of heterogeneous information sources with respect to a given query. It chooses the best-fitting sources and conducts the search processes. At Stanford University, gGLOSS (generalized Glossary-Of-Servers Server) [Gravano and Garcia-Molina, 1995] addresses a similar idea. It keeps sophisticated statistics on the available databases to estimate which databases are most appropriate for a given query. The search process is performed through a ranked list of databases. In contrast to Archie, which gathers an index without having a particular query in mind, Inquery and gGLOSS provide their indexes dynamically, tailored to individual needs via a single query. The indexes then guide individual searches across the set of servers.

As soon as appropriate index/search prototypes were implemented, intelligent agents [CACM, 1994], [Wooldridge and Jennings, 1995] or knowledge brokers [Barbara and Clifton, 1992] started to exploit these subsystems. Harvest [Bowman et al., 1994a], [Harvest, URL], for example, exploits both Glimpse and Nebula [Bowman et al., 1994b] as index/search subsystems. Knowledge brokers are autonomous entities that may collaborate, negotiate, and coordinate, but which by no means can be coerced into activities such as searching for information or answering a query whose scope does not conform to the broker's ability in query handling (for more details see [Andreoli et al., 1995]). Thus, knowledge brokers are generally used in combination with index/search subsystems.

In the Constraint-Based Knowledge Broker model (CBKB), constraints have been introduced to flexibly manage the search space of broker agents as well as to flexibly adapt user requests and answers from information providers. [Andreoli et al., 1996] presents the theoretical background of CBKB. Protocol issues within CBKB are addressed in [Arcelli et al., 1995], [Borghoff et al., 1996b]. [Fikes et al., 1995] also use logic-based models to capture the domain of expertise of information brokers. Rather than using constraints, their modeling language is based on a predicate logic with contexts. The Tsimmis project [Chawathe et al., 1994] takes a different approach, using a self-describing object model for the internal representation of information and queries.

3 Knowledge Combination

We have developed a framework of knowledge brokers which search for information in a wide variety of heterogeneous information repositories and combine the search results into user-customizable knowledge. During individual searches, a variety of index/search subsystems are contacted, a variety of protocols are used, and different information formats are homogenized. The implemented prototype system is based on a distributed architecture following the client/server model.

The developed framework differs from other approaches to agent-based information gathering on the Web not only in its technology, but also in some fundamental assumptions. We differ from those approaches which view the Web as a kind of global market where agents -- roaming over open electronic domains -- will meet and gather information, possibly leading to business transactions. Instead, we see the Web as evolving into a galaxy of intranets, linking together information providers and users around common interests. On the basis of these social and economic considerations, technological choices can be specialized to optimally fit the requirements of specific intranets and user communities. One such specialization of our framework, namely the adaptation of an agent infrastructure for constraint-based information gathering to the requirements of a network publication system for research and education, has been presented in [Borghoff et al., 1996c].

3.1 Architecture

As emphasized in Fig. 1, the architecture consists of four major components: user interface, knowledge brokers, wrappers and external databases. Note that the notion of database is used in a loose sense: it covers "real" database management systems as well as distributed file systems, Internet-based information sources, and other on-line information repositories. The important part is the existence of some sort of index/search subsystem.

For the discussion of the architecture and its features, only the first three components (user interface, knowledge brokers and wrappers) are of interest. For the backends (external databases), we rely on already existing databases, e. g. databases accessible through the World-Wide Web, or locally available databases such as the Xerox phonebook. The external databases integrated into the prototype system cover a wide range of information types, from collections of bibliographies and books, to databases of telephone numbers of Xerox employees, to listings of opera performances. In this heterogeneous environment, some databases incorporate highly structured information while the information of other databases is completely unstructured.

The goal of the prototype system was the integration of heterogeneous databases, which store different types of information, use different data representations, and support different query languages and interface protocols, into one environment. Among the different interface options and corresponding access methods, we focus especially on SQL databases, WAIS, and, in the case of databases accessible through the World-Wide Web, FTP, Gopher or HTTP servers. Users retrieve and combine knowledge uniformly, no matter where and how the knowledge-representing information is stored. The main emphasis was placed on constraint-based knowledge combination; hence the federated-database approach [Sheth and Larson, 1990] is not sufficient for our goal. Similarly, the Teamwork approach proposed by Denzinger [Denzinger, 1995] applies knowledge combination to distributed search problems whose descriptions offer no natural way of dividing them a priori into subproblems.

3.2 Implementation

The graphical user interface has been implemented as a Java applet, which makes the system queryable from standard Web browsers with applet support. The format of the queries/subqueries sent to the brokers is based on constraints over feature structures. Adapters are provided to convert this format into human-readable graphical widgets on the user's side. The core of the system is given by the brokers, which provide various important services such as intelligent caching and knowledge combination. The current implementation uses ForumTalk. Brokers are implemented as ForumTalk coordinators programmed in LO [Andreoli, 1995], while the user interface (programmed in Java, cf. Fig. 2) and the wrappers (programmed in Python) are external processes accessed through ForumTalk plugs.

Figure 1: Architecture sketched with an example configuration

Figure 2: Graphical user interface. The screenshot includes (from front to back) the query window where users formulate sophisticated queries, the session control where users organize parallel query sessions, and the report window where answers are converted to HTML for further processing.

The system is up and running. For more details on how to run the system, see http://www.xerox.fr/grenoble/ct/prototypes/cbkb/home.html.

3.3 Knowledge Brokers with Constraint Solvers

In the following we concentrate on a specific implementation of knowledge brokers, the so-called Constraint-Based Knowledge Brokers (CBKB) [Andreoli et al., 1996]. CBKB search for heterogeneous information in various databases organized along different formats and with different requirements in mind. The basic model of retrieval exploits the standard request/subrequest model [Vieille, 1986], where subrequest generation depends on results obtained from preceding requests.

CBKB specify, in a single formalism, knowledge search requests, the answers to these requests, and the state of the knowledge brokers. This formalism extends the notion of standard feature constraints. The full formalism, including the state specification of the knowledge brokers, is not relevant for the remainder of the paper. The interested reader is referred to [Andreoli et al., 1995] for a detailed discussion.

In the most primitive reading, feature constraints are simple multisets of attribute/value pairs. Feature constraints have been extensively studied [Ait-Kaci et al., 1994] and, among other things, can be naturally exploited to provide structured representations of electronic information.

Assume we are interested in some recent publications about knowledge brokers written by the colleagues of a given person. First of all, we have to find out who is a colleague. The Internet already provides this information for many company employees. For example, the electronic Xerox phonebook allows searches of the form "who is in the same group as <person>". For all or some of the corresponding colleagues of the person, we subsequently search for their recent publications about the given topic (in the example below, the given topic is "knowledge brokers"). Again, there is a way to search for this kind of information. Among the variety of possible index/search subsystems, the Library of Congress WAIS gateway or an installation of the Glimpse subsystem for bibliographical data fulfills our needs. Let us therefore assume the following query of a broker broker_phone_glimpse constraining a variable R.

<Id> R
R: mainquery,
R: subquery1 -> Pos1,
Pos1: Person-Search,
Pos1: pattern -> X1,
X1: "who is in the same group as Steve Freeman",
Pos1: surname -> X2,
R: subquery2 -> Pos2,
Pos2: Publication-Search,
Pos2: pattern1 -> X2,
Pos2: pattern2 -> X3,
X3: "knowledge brokers",
Pos2: year -> X4,
X4 > X5,
X5: 1992

Within a constraint, interdependencies or precedence conditions may exist between feature entries: in our example, the variables for surname and pattern1 must coincide, and the variable for year must hold a value greater than 1992.

The query is tagged with a unique identifier Id. The constraint variable R knows two request positions, one of type Person-Search (position Pos1) and one of type Publication-Search (position Pos2). The mediator of the wrapper component maps each type to the appropriate information source. In the case of our example, Person-Search will be mapped to the Xerox phonebook while the Glimpse subsystem is used for the publication search. Thus, the query of broker broker_phone_glimpse can be broken into two subqueries. The feature pattern specifies the search pattern for Person-Search, while the combination of the features pattern1 and pattern2 defines the pattern of the subsequent search within the information repository selected for Publication-Search.

It is worth noting that a threshold mechanism [Andreoli et al., 1994] allows the submission of a subquery to be withheld until sufficient information is specified. In our case, assume that the subquery to Publication-Search needs at least the surname, i. e. a value for X2.
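To make the threshold mechanism concrete, the following minimal Python sketch (illustrative only; the actual brokers are LO coordinators, and all names below are our own) models a constraint as a dictionary of attribute/value pairs, in the spirit of the multiset reading above, and withholds a subquery until its threshold features are bound.

def is_bound(constraint, feature):
    # A feature counts as bound once it carries a concrete value.
    return constraint.get(feature) is not None

def ready_to_submit(constraint, threshold):
    # A subquery may only be submitted once every threshold feature is bound.
    return all(is_bound(constraint, f) for f in threshold)

# The Publication-Search subquery: the surname (X2) is still unknown,
# so the threshold is not reached and the subquery is withheld.
publication_search = {
    "type": "Publication-Search",
    "pattern1": None,                     # to be filled with the surname
    "pattern2": "knowledge brokers",
    "year": ("greater_than", 1992),
}
assert not ready_to_submit(publication_search, threshold={"pattern1"})

# Once the Person-Search answer binds the surname, submission is enabled.
publication_search["pattern1"] = "Andreoli"
assert ready_to_submit(publication_search, threshold={"pattern1"})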

Therefore, the first subquery

<Id> Pos1
Pos1: Person-Search,
Pos1: pattern -> X1,
X1: "who is in the same group as Steve Freeman",
Pos1: surname -> X2

is submitted to broker_phone to search in the people information repository.

Upon reception of an answer, broker_phone checks whether the answer is plausible. More specifically, a constraint solver checks whether the answer satisfies the property expressed in the query's constraint, i. e. whether the answer entails the request. A satisfactory answer is sent to the requester broker_phone_glimpse, e. g.

<Id> Pos1
Pos1: Person-Search,
Pos1: pattern -> X1,
X1: "who is in the same group as Steve Freeman",
Pos1: surname -> X2,
X2: "Andreoli",
Pos1: given_name -> X6,
X6: "Jean-Marc",
Pos1: nickname -> X7,
X7: "JM"

Since the threshold is now reached, the second subquery to broker_glimpse is immediately launched as

<Id> Pos2
Pos2: Publication-Search,
Pos2: pattern1 -> X2,
X2: "Andreoli",
Pos2: pattern2 -> X3,
X3: "knowledge brokers",
Pos2: year -> X4,
X4 > X5,
X5: 1992

The corresponding constraint specifies a query for the constrained variable Pos2 of type Publication-Search. The constraint variable Pos2 includes the form attributes pattern1 and pattern2, which are mapped by the mediator to appropriate search attributes of the selected bibliographical database. A year attribute is required for each answer.

Again, upon reception of an answer, broker_glimpse checks whether the answer entails the query. If so, the answer is sent to broker_phone_glimpse. In turn, the knowledge broker broker_phone_glimpse checks whether the full answer entails the original query. If so, we have obtained the first relevant publication of interest.
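The entailment tests in this walkthrough can be pictured with a deliberately simplified sketch. The actual CBKB constraint solver operates on full feature constraints [Andreoli et al., 1996]; the Python version below, whose names are ours, only handles ground values and the single numeric bound of the example.

def entails(answer, query):
    # True iff the answer binds every feature of the query compatibly.
    for feature, required in query.items():
        value = answer.get(feature)
        if isinstance(required, tuple) and required[0] == "greater_than":
            if value is None or not int(value) > required[1]:
                return False              # year bound violated or missing
        elif required is None:
            if value is None:
                return False              # requested binding is absent
        elif value != required:
            return False                  # conflicting binding
    return True

query = {"pattern1": "Andreoli", "pattern2": "knowledge brokers",
         "year": ("greater_than", 1992)}
answer = dict(query, year="1994", title="Constraint-Based Knowledge Brokers")
assert entails(answer, query)             # satisfactory: forward to requester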

3.4 Wrappers

A wrapper consists of a mediator and a set of database interfaces which define the interfaces to those external databases that may be accessed for knowledge retrieval. The mediator and the database interfaces communicate through a constraint exchange protocol. They may reside on different machines connected by local or wide area networks.

The mediator acts as a client while the database interfaces play the role of servers. Thus, a database interface is continuously waiting for queries from the mediator, i. e. requests from knowledge brokers (in our example, from the knowledge brokers broker_phone and broker_glimpse). The protocol between the knowledge brokers and the mediator simply exchanges structured messages in the form of constraints.

There may exist multiple wrappers in the environment, providing broker access to a wide variety of external databases. The scope of different wrappers may even overlap, because duplicate and redundant information will be filtered out by the knowledge brokers [Andreoli et al., 1996].

3.4.1 Mediator

The mediator handles the communication between the knowledge brokers and the database interfaces. It accepts queries from knowledge brokers and returns to them the answers received from the external databases via the database interfaces. The mediator uses the same internal format as the knowledge brokers (i. e. constraints) to represent queries and answers. The mediator's main functionality is the selection of the information source to which a search request should be sent in order to satisfy the query. The initial constraint sent by a knowledge broker specifies the information source only by the type of the requested information rather than by its name. The mediator determines the name of the appropriate information source and inserts it into the original query. In the subquery submitted by broker_phone, the form attribute Person-Search is replaced by the database name XeroxPhone:

<Id> Pos1
Pos1: XeroxPhone,
Pos1: pattern -> X1,
X1: "who is in the same group as Steve Freeman",
Pos1: surname -> X2

The database name also incorporates the network address, e. g. the URL in the case of a Web interface. For simplicity, that information is not explicitly represented in our example. The form attribute pattern need not be replaced because the Xerox phonebook uses the same attribute name to specify search patterns. The modified constraint is then propagated to the database interface which handles queries to the selected information source.

Currently, the system manages the mapping between the type of an information source and its name and address by simple table lookups. In the next phase, sophisticated mediators [Wiederhold, 1992] will provide a more flexible mapping, similar to the mechanisms used in the UMDL project [Birmingham, 1995], [Birmingham et al., 1995]. There, so-called registry agents capture the address and content information of each database. Query planning agents negotiate with the registry agents about the particular addresses of databases (see Fig. 3). During the negotiation phase, facilitator agents might be involved to resolve different protocols. [Borghoff et al., 1996a] discuss other ways of multi-agent coordination.
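A sketch of the current table-lookup behaviour, in the style of the Python wrappers (the table contents and helper names are illustrative; only XeroxPhone, Glimpse, and the Glimpse address from the references are taken from the text):

DIRECTORY = {
    # type of information source -> (database name, network address)
    "Person-Search":      ("XeroxPhone", "http://..."),   # address elided
    "Publication-Search": ("Glimpse", "http://glimpse.cs.arizona.edu:1994/bib/"),
}

def select_source(subquery):
    # Replace the information-source type by the concrete database name;
    # the address travels with the query but stays implicit in our examples.
    name, address = DIRECTORY[subquery["type"]]
    return dict(subquery, type=name, _address=address)

person = {"type": "Person-Search",
          "pattern": "who is in the same group as Steve Freeman"}
assert select_source(person)["type"] == "XeroxPhone"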

The mediator also handles the communication back to the knowledge brokers. Answers received from the database interfaces are propagated to the correct knowledge broker, as determined by the identifier included in the answer.

Figure 3: Negotiation between mediator, registry, and query planning agents

3.4.2 Database Interface

Our distinction between different database interfaces is based on a categorization of external databases according to their supported interfaces for submitting queries and returning answers. Examples of database categories are HTTP, Z39.50 and SQL. As illustrated in Fig. 1, each database interface consists of a database-independent part (the interface core) and several database-dependent parts.

The database-independent part handles the receipt of queries, the construction of answers which are sent to the appropriate knowledge brokers via the mediator, and, in the case of differing internal formats used by knowledge brokers and the database interface, the conversion of queries and answers.
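This split can be pictured as a small class hierarchy: one reusable interface core per database category, delegating query construction and answer parsing to pluggable database-dependent parts. The following Python sketch uses names of our own invention, not those of the prototype.

class DatabaseDependentPart:
    # Encapsulates one external database's query and answer formats.
    def build_query(self, fields):
        raise NotImplementedError

    def parse_answers(self, raw):
        raise NotImplementedError

class HttpInterfaceCore:
    # Database-independent part for the HTTP category: receives queries,
    # delegates to the database-dependent part, and returns one answer
    # message per search hit (as described below).
    def __init__(self, dependent, transport):
        self.dependent = dependent        # a DatabaseDependentPart
        self.transport = transport        # e. g. an HTTP GET function

    def handle(self, identifier, fields):
        raw = self.transport(self.dependent.build_query(fields))
        return [(identifier, hit) for hit in self.dependent.parse_answers(raw)]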

The following discussion is based on our implementation of the HTTP database interface. Other database interfaces are quite similar even if they might differ in their internal data formats.

Within the HTTP database interface, queries and answers are represented as structured information consisting of an identifier, a database name, and multiple attribute/value pairs. In principle, the specification of the queries and answers managed by the database interface could use the same formalism as the knowledge brokers and the mediator. However, the database interface does not enforce constraint solving: resolving interdependency relationships would be cumbersome and, in most cases, knowledge of precedence conditions is irrelevant. Attribute/value pairs are easier to handle and convenient to implement. They control the query sent to external databases supporting the HTTP interface.

For example, the specification

<Id> Glimpse
query1 = "Andreoli"
query2 = "knowledge brokers"
errors = 0
case = "on"

results in a query sent to the Glimpse subsystem. Note that the form attributes pattern1 and pattern2 of the original request are mapped to the form attributes query1 and query2 used by the Glimpse interface. The values of query1 and query2 are combined and used as search patterns, disallowing any error during case-insensitive pattern matching.
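For the HTTP category, constructing the query then amounts to URL-encoding these attribute/value pairs into a GET request against the form's action URL. A sketch using Python's standard library; the endpoint URL is an assumption, while the form attributes follow the Glimpse example above.

from urllib.parse import urlencode
from urllib.request import urlopen

def submit_http_query(base_url, fields):
    # URL-encode the attribute/value pairs and issue the GET request;
    # the raw HTML is parsed later by the database-dependent part.
    with urlopen(base_url + "?" + urlencode(fields)) as response:
        return response.read().decode()

html = submit_http_query(
    "http://glimpse.cs.arizona.edu:1994/bib/search",   # assumed endpoint
    {"query1": "Andreoli", "query2": "knowledge brokers",
     "errors": 0, "case": "on"})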

From the answers received from the external database and parsed by the database-dependent part, the interface core constructs answer messages represented in the internal format of the database interface. In the case of the HTTP interface, each answer message consists of an identifier for matching the query with its answer messages, and a set of attribute/value pairs representing the information of a single search hit. In general, the attribute names are derived from the answers received from the external database.

For example, the query above may result in the following answer message.

<Id> bibtype = "inproceedings"
key = "PASCO-1994"
author = "J.-M. Andreoli and U. M. Borghoff and R. Pareschi"
title = "Constraint-Based Knowledge Brokers"
booktitle = "Proc. 1st Int. Symp. on Parallel Symbolic Computation (PASCO'94)"
address = "Hagenberg/Linz, Austria"
editor = "H. Hong"
publisher = "Lect. Notes Series in Comp. Vol. 5. Singapore, New Jersey, London, Hong Kong: World Scientific"
year = "1994"
month = "sep"
pages = "1--11"
url = "http://www.xerox.fr/grenoble/ct/articles/94-pasco.ps.Z"

The interface core generates a separate answer message for every search hit received from the external database. These answer messages are then converted into constraints and sent to the mediator for propagation to the requesting knowledge broker.

The special characteristics of the individual external databases are encapsulated in the database-dependent parts which deal with the structure of the submitted queries as well as the structure of the answers received from the database.

For databases accessed over the Web, the structure of the queries is quite similar. Besides the database address, the only differences are the number and names of the form attributes to be provided by the query. Typically, the knowledge brokers do not specify all attributes possible in the form; some of the attributes are optional, or indeed unknown. If unspecified attributes are needed or useful for the query, the database-dependent part appends these attributes with default values. This explains the entries

  errors = 0 
  case = "on" 

in the example above.

On the other hand, the knowledge brokers may have specified attributes that have no corresponding form attribute. Attributes that cannot be converted and assigned to an appropriate form entry, or attributes that have no specified value, are simply ignored for the query. This explains why the entry

year > 1992

is missing in the example above. In general, it is necessary to examine the HTML form page manually in order to extract the required information for both query composition and answer decomposition. However, the more complex problem is answer decomposition, that is, how to automatically extract the relevant information representing answers to previously submitted queries. The specification and implementation of parsers interpreting database-specific answers require a lot more effort because the structure of database entries is not standardized.

Especially on the Web, HTML pages are created by individual designers according to their own taste and perspective. In most cases these pages are prepared for visual presentation to the user, not to support automatic interpretation by agents. Structuring methods used to distinguish between different search hits include definition lists, preformatted text, sets of anchors where each anchor represents one search hit, and occasionally lists of entries separated by paragraph markers. For example, the Glimpse subsystem returns the entries of the bibliography databases as they were entered by the database administrators. Currently, it is not possible to fully automate the generation of parsers which extract the relevant information from answers and propagate it to the knowledge brokers. However, due to similarities among the answers, a common base may be used for all database-dependent parts. Each database-dependent part must be customized manually; the effort depends on the complexity of the answer information provided by the external database.
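As an illustration of such a database-dependent part, a parser for a Glimpse answer page that happens to contain BibTeX-like entries might scan for field lines and emit one attribute/value dictionary per hit. This is a hypothetical sketch; as argued above, real pages vary per source and the patterns need manual customization.

import re

# Matches BibTeX-like field lines such as:  author = "J.-M. Andreoli ...",
FIELD = re.compile(r'^\s*(\w+)\s*=\s*[{"](.*?)[}"],?\s*$', re.MULTILINE)

def parse_answers(raw_text):
    # Split the page into entries at '@' records and extract the
    # attribute/value pairs of each entry; one dictionary per search hit.
    hits = []
    for entry in re.split(r'^@', raw_text, flags=re.MULTILINE)[1:]:
        fields = dict(FIELD.findall(entry))
        if fields:
            hits.append(fields)
    return hits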

Adding new external database interfaces to the prototype system is quite straightforward. First, a new database-dependent part must be provided; in many cases it can be derived from an existing one by making the appropriate modifications. Second, the new database-dependent part must be integrated into the interface core which handles the type of interface supported by the external database. Third, the dictionary of the mediator must be extended to include the new external database.

3.4.3 Conversion between Constraints and Database Interface Formats

The knowledge brokers as well as the interface cores use specific internal formats to represent queries and answers. As we have seen, knowledge brokers use a format suitable for their constraint solvers, whereas an interface core uses a data format adapted to the functionality of its database interface protocol. For example, the HTTP interface core handles multiple attribute/value pairs (without interdependencies between attributes), which is convenient for implementation. Thus, every interface core whose internal data format differs from the constraints used by the knowledge brokers incorporates a converter between the two data formats. As already demonstrated above, we use the HTTP interface core as an example.

From a constraint specification to a specification using attribute/value pairs, the converter extracts, first, the identifier; second, using the constrained variable and its type, the database name; and then the feature entries with their possibly referenced values.

From a specification using attribute/value pairs to a constraint specification, the obvious task of the converter is to instantiate the free variables in the query's constraint with received values (in the current prototype, a first soundness check is done within the converter itself).
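A sketch of both conversion directions for the HTTP interface core, reusing the simplified dictionary representations of the earlier sketches (again, all names are ours):

def constraint_to_fields(identifier, constraint):
    # Extract the identifier, the database name (from the variable's type),
    # and the ground feature entries; unresolved or relational entries
    # such as year > 1992 are dropped (see Sect. 3.4.2).
    database = constraint["type"]
    fields = {f: v for f, v in constraint.items()
              if f != "type" and v is not None and not isinstance(v, tuple)}
    return identifier, database, fields

def fields_to_constraint(query_constraint, answer_fields):
    # Instantiate the query's free variables with the received values and
    # keep additional entries (e. g. given_name, url) as new features.
    # The prototype performs a first soundness check at this point.
    result = dict(query_constraint)
    result.update(answer_fields)
    return result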

When the interface core provides additional information within the answer, the converter generates new feature entries extending the original constraint of the query. In the example above, additional feature entries for the given name and the nickname were added. Subsequent searches may use the given name to further prune the search specification, or to extract the publications of the wanted person if the surname is ambiguous. The same holds true for the additional entry url. As Fig. 1 illustrates (see the dashed arrows), a knowledge broker broker_document may use this anchor in a subsequent search to retrieve the PostScript version of the paper in question.

4 Shortcomings of the Current Approach

Knowledge brokers must cope with an open, heterogeneous world in which information repositories provide a wide variety of query and answer formats. As already mentioned in Sect. 3.4.2, database-dependent parts must be customized manually because of the heterogeneity in the structuring of answers and the lack of any information describing the semantics of answer components. The integration of new database-dependent parts would be facilitated if external databases supported empty queries which return as results the structure of accepted queries as well as the structure of returned answers. That information could be used to automatically generate the parser for interpreting answers and the code for constructing queries (a hypothetical sketch follows below). Within the broker environment, conventions would still be necessary to match the information received from the mediator to the appropriate attributes derived from the empty query. However, these conventions would be local to the broker environment and would not require external information repositories to follow such standards.

Tsimmis [Chawathe et al., 1994] also attempts to apply a generator approach to integrate new information sources into its environment. There, a human might study samples of the data and develop an appropriate specification for the interface generator. Harvest pursues an object-oriented approach to enable the representation, manipulation and display of arbitrarily complex data in application-specific ways [Chhabra et al., 1994]: accessing a Web datum results in the invocation of the appropriate method, which is stored in a type/method registry. Methods are executed either locally or remotely.
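To make the empty-query idea concrete, an external database could answer an empty query with a description of its accepted form attributes and answer fields, from which the wrapper derives its field mapping. Both the schema format and the mapping convention below are entirely hypothetical.

# Hypothetical self-description returned by a source for an empty query.
EMPTY_QUERY_SCHEMA = {
    "query_fields":  ["query1", "query2", "errors", "case"],
    "answer_fields": ["author", "title", "year", "url"],
}

# Convention local to the broker environment (cf. the discussion above).
CONVENTION = {"pattern1": "query1", "pattern2": "query2"}

def derive_mapping(schema, broker_features):
    # Map broker feature names to the source's form attributes; features
    # without a counterpart are simply ignored, as in the year > 1992 case.
    return {f: CONVENTION[f] for f in broker_features
            if f in CONVENTION and CONVENTION[f] in schema["query_fields"]}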

Another shortcoming is caused by the type and format of the information provided by the external databases. Often they do not provide the information in a form suitable for efficient interoperation with other databases in subsequent searches. For example, queries to the Xerox phonebook return fully specified names of people, including nicknames, given names, initials, surnames, and generation qualifiers such as "jr" or "III". Using, for instance, the name "Jimmy Jim R. Smith jr" without any modification as the pattern for a subsequent search in a bibliographical database will most likely result in no or very few answers. The initial query to the Xerox phonebook leads to overspecified search patterns for subsequent searches in bibliographical databases. Thus, it is necessary to extract the relevant information, which in most cases is specific to the external database and often even to the individual answer.

As is well known from traditional ways of accessing large information repositories, there is often a thin line between over- and underspecified queries. Generally, however, the user can refine the constraints in subsequent queries to the same database in order to control the amount of information returned. In the case of complex broker systems this is rather difficult because the answers of multiple databases are combined by the brokers to construct new queries to other databases. Currently the answers do not include enough semantics to automatically extract the information that is relevant for subsequent queries; that information must be provided manually by the designer of the database-dependent part. Ideally, all backends would encapsulate a query-customizable extraction method within the answers. For a first step towards this vision, see [Schwartz and Hardy, 1996].

A related problem exists for complex queries involving multiple sources of information. The decomposition into individual queries must be implemented manually to cater for subsequent individual searches. The constraint-based approach, however, already paves the way for sophisticated expressions of data dependencies.

An issue only briefly mentioned so far is the mechanism for localizing external databases. For the current prototype system, all external databases were localized manually and incorporated into the mediator dictionary individually. This approach is not feasible for the rapidly growing number of information repositories. Thus, the networked community must provide directory services which associate information types or subject areas with database names, location information, and methods to access the database entries. White pages are not sufficient; we need yellow pages equipped with semantic meta-information. The directory information could be constructed passively, using indexing tools such as Archie or the World-Wide Web Worm, or actively, by requiring information repositories to register their database services. In the former case, only a limited amount of information can be extracted automatically by these tools; in the latter case, information repositories could provide much more semantic information to the directory services. This meta-information could then be used by mediators and knowledge brokers to select the appropriate external databases for knowledge retrieval and to adjust the query decompositions. Again, mediators can support the query process effectively. For instance, mediators may select, with respect to the requester site, the best replica of an external database out of a set of mirror sites. Alternatively, they may choose, with respect to some ranking process à la gGLOSS [Gravano and Garcia-Molina, 1995], among different databases providing related information.

5 Conclusion

Due to the proliferation of heterogeneous information sources and their availability over the Internet, we have aimed at searching these sources within a uniform framework of knowledge brokers. The framework allows the combination of search results, the extraction of knowledge from one search in order to trigger a subsequent search, and, finally, the evaluation of the hits according to constraints.

We introduced the overall architecture and discussed its main constituents: user interface, knowledge brokers, wrappers, and external databases. Knowledge brokers specify electronic information through constraints, whereas wrappers, which provide the interfaces to existing external databases, specify the information in a database-convenient format, e. g. attribute/value pairs. A converter translates between the two representations. We explained the current state of a prototype system that distinguishes between a mediator, a database-independent part (the interface core), and specific database interfaces.

Finally, we discussed shortcomings of the current approach. These are mostly caused by the current emphasis of information repositories and their interfaces, above all their poor support for knowledge combination and the difficulty of localizing the appropriate information repositories.

Despite the pragmatic decision to stick to essentials while implementing a prototype for an early proof of concept, the current approach (with one wrapper containing a dozen database interfaces) performs well. The adaptation of additional database interfaces to new data sources is straightforward. Compared to the time needed to implement the database-independent part as well as the knowledge brokers, the time needed to implement an additional database interface is negligible. For instance, the database-dependent part of the Xerox phonebook interface was written in three hours, and that of the Glimpse interface in less than a day.

References

[Ait-Kaci et al., 1994] H. Ait-Kaci, A. Podelski, and G. Smolka. A feature-based constraint system for logic programming with entailment. Theoretical Computer Science, 122, 263--283, 1994

[Aliweb, URL] http://web.nexor.co.uk/public/aliweb/aliweb.html

[Alta Vista, URL] http://altavista.digital.com/

[Andreoli et al., 1994] J.-M. Andreoli, U. M. Borghoff, and R. Pareschi. Constraint-based knowledge brokers. In H. Hong, editor, Proc. 1st Int. Symp. on Parallel Symbolic Computation (PASCO '94), pp. 1--11, Hagenberg/Linz, Austria, September 1994. Lecture Notes Series in Computing 5, Singapore, New Jersey, London, Hong Kong: World Scientific

[Andreoli et al., 1995] J.-M. Andreoli, U. M. Borghoff, R. Pareschi, and J. H. Schlichter. Constraint agents for the information age. J. Universal Computer Science, 1, 12, 762--789, December 1995. Electronic version available at http://hyperg.iicm.tu-graz.ac.at/Cjucs_root

[Andreoli et al., 1996] J.-M. Andreoli, U. M. Borghoff, and R. Pareschi. The constraint-based knowledge broker model: Semantics, implementation and analysis. J. Symbolic Computation, 1996 (in press)

[Andreoli, 1995] J.-M. Andreoli. Programming in ForumTalk. Technical Report CT--003, Rank Xerox Research Centre, Grenoble Lab., France, 1995

[Arcelli et al., 1995] F. Arcelli, U. M. Borghoff, F. Formato, and R. Pareschi. Tuning constraint-based communication in distributed problem solving. In Proc. 1st Int. Workshop on Concurrent Constraint Programming (CCP '95), Venice, Italy, May 1995

[Barbara and Clifton, 1992] D. Barbara and C. Clifton. Information brokers: Sharing knowledge in a heterogeneous distributed system. Technical Report MITL--TR--31--92, Matsushita Information Technology Lab., Princeton, NJ, October 1992

[Barbara, 1993] D. Barbara. Extending the scope of database systems. Technical Report MITL--TR--44--93, Matsushita Information Technology Lab., Princeton, NJ, 1993

[Birmingham et al., 1995] W. P. Birmingham, E. H. Durfee, T. Mullen, and M. P. Wellman. The distributed agent architecture of the University of Michigan digital library. In Proc. AAAI Spring Symp. Series on Information Gathering from Distributed Heterogeneous Environments, Stanford, CA, March 1995

[Birmingham, 1995] W. P. Birmingham. An agent-based architecture for digital libraries. D-Lib Magazine, July 1995. Available at http://www.cnri.reston.va.us/home/dlib/July95/07contents.html

[Borghoff et al., 1996a] U. M. Borghoff, P. Bottoni, P. Mussio, and R. Pareschi. A systemic metaphor of multi-agent coordination in living systems. In A. Javor, A. Lehmann, and I. Molnar, editors, Proc. Europ. Simulation Multiconf. (ESM '96), pp. 245--253, Budapest, Hungary, June 1996. San Diego, CA: The Society of Computer Simulation

[Borghoff et al., 1996b] U. M. Borghoff, R. Pareschi, F. Arcelli, and F. Formato. Constraint-based protocols for distributed problem solving. Science of Computer Programming, 1996 (in press)

[Borghoff et al., 1996c] U. M. Borghoff, R. Pareschi, H. Karch, M. Nöhmeier, and J. H. Schlichter. Constraint-based information gathering for a network publication system. In Proc. 1st Int. Conf. on the Practical Application of Intelligent Agents and Multi-Agent Technology (PAAM '96), pp. 45--59, London, U. K., April 1996. Blackpool, U. K.: The Practical Application Company Ltd

[Bowman et al., 1994a] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. In Proc. 2nd Int. World-Wide Web Conf., pp. 763--771, Chicago, IL, October 1994

[Bowman et al., 1994b] C. M. Bowman, C. Dharap, M. Baruah, B. Camargo, and S. Potti. A file system for information management. In Proc. Conf. on Intelligent Information Management Systems, Washington, DC, June 1994

[CACM, 1994] Special issue on intelligent agents. Communications of the ACM, July 1994

[Callan et al., 1992] J. P. Callan, W. B. Croft, and S. M. Harding. The Inquery retrieval system. In Proc. 3rd Int. Conf. on Database and Expert Systems Applications, pp. 78--83, September 1992

[Callan et al., 1995] J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proc. 18th ACM SIGIR Conf. on Research and Development in Information Retrieval, Seattle, WA, July 1995

[Chawathe et al., 1994] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The Tsimmis project: Integration of heterogeneous information sources. In Proc. IPSJ Conf., Tokyo, Japan, October 1994. Los Alamitos, CA: IEEE Comp. Soc. Press

[Chhabra et al., 1994] B. Chhabra, D. Hardy, A. Hundhausen, D. Merkel, J. Noble, and M. F. Schwartz. Integrating complex data access methods into the Mosaic/WWW environment. In Proc. 2nd Int. World-Wide Web Conf., pp. 909--919, Chicago, IL, October 1994

[CS biblio, URL] http://liinwww.ira.uka.de/bibliography/index.html

[Denzinger, 1995] J. Denzinger. Knowledge-based distributed search using Teamwork. In V. Lesser, editor, Proc. 1st Int. Conf. on Multi-Agent Systems (ICMAS '95), pp. 81--88, San Francisco, CA, June 1995. Menlo Park, Cambridge, London: AAAI Press, MIT Press

[Emtage and Deutsch, 1992] A. Emtage and P. Deutsch. Archie: An electronic directory service for the internet. In Proc. Usenix Conf. Winter '92, pp. 93--110, Sunset Beach, CA, January 1992. Berkeley, CA: Usenix Association

[Fikes et al., 1995] R. Fikes, R. Engelmore, A. Farquhar, and W. Pratt. Network-based information brokers. In Proc. AAAI Spring Symp. Series on Information Gathering from Distributed Heterogeneous Environments, Stanford, CA, March 1995

[Glimpse, URL] http://glimpse.cs.arizona.edu:1994/bib/

[Gravano and Garcia-Molina, 1995] L. Gravano and H. Garcia-Molina. Generalizing GlOSS to vector-space databases and broker hierarchies. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, Proc. 21st Int. Conf. on Very Large Data Bases, pp. 78--89, Zurich, Switzerland, September 1995. San Francisco, CA: Morgan Kaufmann

[Harvest, URL] http://harvest.cs.colorado.edu/

[Kahle and Medlar, 1991] B. Kahle and A. Medlar. An information system for corporate users: Wide area information servers. Connexions: The Interoperability Report, 5, 11, 2--9, November 1991

[Lycos, URL] http://lycos.cs.cmu.edu/

[Manber and Wu, 1994] U. Manber and S. Wu. Glimpse: A tool to search through entire file systems. In Proc. Usenix Conf. Winter '94, pp. 23--32, San Francisco, CA, January 1994. Berkeley, CA: Usenix Association

[Mauldin and Leavitt, 1994] M. L. Mauldin and J. R. R. Leavitt. Web agent related research at the center for machine translation. In Proc. ACM Special Interest Group on Networked Information Discovery and Retrieval (SIGNIDR '94), McLean, VA, August 1994

[McBryan, 1994] O. A. McBryan. GENVL and WWWW: Tools for taming the web. In O. Nierstrasz, editor, Proc. 1st Int. World-Wide Web Conf., Geneva, Switzerland, May 1994

[Obraczka et al., 1993] K. Obraczka, P. B. Danzig, and S.-H. Li. Internet resource discovery services. IEEE Computer, 26, 9, 8--22, September 1993

[Schwartz and Hardy, 1996] M. F. Schwartz and D. R. Hardy. Customized information extraction as a basis for resource discovery. ACM Transactions on Computer Systems, 14, 2, May 1996

[Schwartz et al., 1992] M. F. Schwartz, A. Emtage, B. Kahle, and B. C. Neuman. A comparison of internet resource discovery approaches. Computing Systems, 5, 4, 461--493, Fall 1992

[Sheth and Larson, 1990] A. P. Sheth and J. A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22, 3, 183--236, September 1990

[Veronica, URL] gopher://gopher.unr.edu/11/veronica

[Vieille, 1986] L. Vieille. Recursive axioms in deductive databases: The query-subquery approach. In L. Kerschberg, editor, Proc. 1st Conf. on Expert Database Systems, Menlo Park, CA, April 1986. Benjamin/Cummings Publ. Company

[Wiederhold, 1992] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25, 3, 38--49, March 1992

[Wooldridge and Jennings, 1995] M. Wooldridge and N. R. Jennings. Intelligent agents: Theory and practice. Knowledge Engineering Review, 10, 2, 115--152, 1995

[WWWW, URL] http://wwww.cs.colorado.edu/WWWW/

Acknowledgement

The system described in this paper is an effort of many people. We would like to especially thank Jean-Marc Andreoli, Pierre-Yves Chevalier, Martina Nöhmeier (visiting from TU München, Germany), Remo Pareschi, and Jutta Willamowski. We thank Francesca Arcelli and Ferrante Formato (both Univ. Salerno, Italy) for their work on theoretical aspects of information reuse during the brokerage sessions. For the integration of the system into the so-called Physicists Network Publishing System thanks are due to Bernd Diekmann, Eberhard Hilf, Thomas Severiens, Heinrich Stamerjohanns (all Univ. of Oldenburg, Germany), and Harald Karch (Rank Xerox Germany). Last but not least we would like to thank the anonymous referees for their thorough comments which greatly helped to improve the paper.
