Updated 20 SEP 1999

Framework for Advanced Search

I. Background

Organizations are increasingly providing network access to information resources, including services as well as documents and data. A significant issue facing organizations is how to integrate and optimize information management so searchers can quickly and easily locate what they seek amidst resources numbering in the millions and beyond.

Searchers, whether the public or internal organization employees, need mechanisms to discover and use diverse resources and services across many management domains, both centralized and decentralized. These mechanisms need to be standards-based for long-term sustainability and integration with legacy systems. Depending on the application, organizations may also need to provide secure, private, reliable, and verifiable results for searchers and intermediaries on public or private networks.

Proprietary mechanisms for organizing and searching centralized catalogs, directories, inventories, databases, and file systems have been developed over many decades. More recently, mechanisms for searching decentralized networks such as the World Wide Web have also been introduced. Unfortunately, virtually all such mechanisms are single-purpose facilities without support for interoperable search. For example, searchers cannot issue a single query to a bibliographic catalog of publications and an organizational directory, nor can they uniformly query inventory systems and Web page collections.

From a searcher’s perspective, finding relevant resources and services involves looking for particular characteristics across a wide scope of content and heterogeneous information resources. For example, news media documents have descriptive information such as the date of publication, while maps have descriptive geographic information. This descriptive information is called "metadata" and is typically separated into one or more elements, each describing a different characteristic of the resource.

For the purpose of interoperable search, a metadata element describing one information resource may be deemed semantically equivalent to some other metadata element describing another type of resource. For example, a metadata element named "headline" describing a news article may be deemed semantically equivalent to another named "title" describing a book. Similarly, the "from" metadata element in an e-mail message may be semantically equivalent to an "author" element.

This Framework does not assert that any two concepts are semantically equivalent in an absolute sense. It asserts only that a particular information provider, intermediary, or searcher may regard two concepts as semantically equivalent for a particular purpose. Such a semantic equivalence may be quite subjective and might well be asserted only relative to the particular situation at hand. This position addresses the practical reality that it is impossible to build and sustain over the long term a broad consensus on correspondence between separately maintained systems of characterization.

Mapping a metadata element to its equivalent based on known semantics is relatively straightforward, and it is one prerequisite to semantic interoperability. (A one-to-one correspondence is the simplest case, but other mapping functions can be achieved using full or partial automation.) The other prerequisite is to make use of the common semantic references in open standards supporting information search and retrieval. With this semantic approach, any metadata scheme with defined elements can be made interoperable for searching.
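The one-to-one case can be illustrated with a minimal sketch in Python. It is not part of the Framework: the scheme names ("news", "book", "email"), the crosswalk table, the sample records, and the function names are all assumptions chosen to mirror the headline/title and from/author examples above.

```python
# Hypothetical crosswalk: (scheme, local element name) -> registered semantic.
# A real deployment would reference a semantic registry instead of this table.
CROSSWALK = {
    ("news", "headline"): "title",
    ("book", "title"): "title",
    ("email", "from"): "author",
    ("book", "author"): "author",
}


def normalize(scheme, record):
    """Re-key a record by registered semantics; unmapped elements are dropped."""
    out = {}
    for element, value in record.items():
        semantic = CROSSWALK.get((scheme, element))
        if semantic is not None:
            out[semantic] = value
    return out


def search(collections, semantic, term):
    """Return (scheme, record) pairs whose mapped element contains the term."""
    hits = []
    for scheme, records in collections.items():
        for record in records:
            value = normalize(scheme, record).get(semantic, "")
            if term.lower() in value.lower():
                hits.append((scheme, record))
    return hits


# Illustrative sample data only.
collections = {
    "news": [{"headline": "River Flood Maps Released", "date": "1999-09-20"}],
    "book": [{"title": "Reading Flood Maps", "author": "A. Writer"}],
    "email": [{"from": "A. Writer", "subject": "meeting"}],
}

title_hits = search(collections, "title", "flood")
author_hits = search(collections, "author", "writer")
```

A single query on the registered semantic "title" reaches both the news article and the book, even though the two schemes name the element differently. Elements with no mapping are simply not searchable through the shared semantic.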

One key success factor in applying interoperable search is that the necessary tools must be widely available. Standards-based tools are needed for content owners, intermediaries, and other information providers to describe information resources and services using metadata elements referenced to registered semantics. Standards-based tools are also needed for searchers and intermediaries to discover information resources and services using metadata elements with registered semantics.

Interoperable search can be implemented piecemeal. Libraries have been deploying catalog interoperability using a common semantic base represented in Machine Readable Cataloging (MARC) and the ISO 23950 search protocol. Many other communities are deploying search interoperability using profiles of ISO 23950 such as the Global Information Locator Service (GILS) or the Content Standard for Geospatial Metadata (ISO 15046-15). The ISO Basic Semantic Registry follows ISO 11179 to enable interoperability among Electronic Data Interchange systems. Many other semantic registries are also finding use, including Dublin Core, Encoded Archival Description, and WebDAV.

II. An Open Framework for Advanced Search

This document describes a Framework for Advanced Search designed to enhance networked information discovery. The framework allows solution providers to develop interoperable solutions meeting the two prerequisites for interoperable search:

(1) Standards-based tools for content owners, intermediaries, and other information providers to describe information resources and services using metadata elements referenced to registered semantics.

(2) Standards-based tools for searchers and intermediaries to discover information using open standards for information search and retrieval that make use of metadata elements with registered semantics.

Developers should be able to discern where their components fit in the framework and how to implement interfaces so that their components interoperate with other components. Information management professionals should refer to this document to understand how it may influence information management strategy as well as software selection.

The Framework for Advanced Search recognizes six processes in the management of information: Create, Capture, Organize, Access, Discover, and Use. These processes are here regarded as types of activities rather than a specific sequence of operations. Information that has been created, captured, and organized may next be prepared for access, but it can also be re-captured or re-organized by another processing chain. For example, an archivist might acquire an information product and handle it as a new product in an archival processing chain with a Capture process involving archival quality media and an Organize process emphasizing characteristics needed for long-term access. Each of the process interfaces is a potential departure point for derived products, re-purposing operations, and ancillary services.

These processes are not always addressed separately. For instance, many Internet search services combine the Capture, Organize, and Access processes. Intermediaries may concentrate on the Organize and Access processes, leaving the Create and Capture processes to information providers and the Discover and Use processes to end-users.

The Create process consists of activities that produce information, typically using tools such as word processors, spreadsheets, e-mail software, CAD systems, database management systems, etc.

The Capture process is composed of activities that enable the capture and representation of information in a structured form, including associated metadata that may be embedded, linked by reference, or provided contextually by the retrieval protocol. Methods range from manual collecting to Web crawling. For example, the metadata for a document could be manually created by an organization's librarian, or partially generated by a program analyzing the content to discern characteristics such as Title and Subject.
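As an illustration of partially automated capture, the following Python sketch reads embedded metadata from an HTML page using the standard library's html.parser module. The class name, the capture() helper, and the sample page are assumptions made for this example; a production tool would also need to handle malformed markup, character encodings, and metadata that is linked rather than embedded.

```python
from html.parser import HTMLParser


class MetadataCapture(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Element names are lower-cased so "Subject" and "subject" match.
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def result(self):
        found = dict(self.metadata)
        title = "".join(self._title_parts).strip()
        if title:
            found["title"] = title
        return found


def capture(html_text):
    """Return a dict of metadata elements discerned from the page."""
    parser = MetadataCapture()
    parser.feed(html_text)
    parser.close()
    return parser.result()


# Illustrative sample page only.
SAMPLE = """<html><head>
<title>Flood Map Index</title>
<meta name="Subject" content="hydrology">
<meta name="keywords" content="flood, maps">
</head><body><p>Body text.</p></body></html>"""

captured = capture(SAMPLE)
```

The resulting dictionary could then be reviewed or extended by a librarian, which corresponds to the mix of manual and automated capture described above.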

Within the Organize process, tools for managing content and metadata may be as basic as file system management, or more specialized to support collaboration, workflow control, rights management, authentication, knowledge management, version maintenance, archiving, etc.

The Access process consists of the activities that enable organized information to be queried, requested, and disseminated to searchers. In this framework, the Access process presumes peer computer network technology, although local file system, CD-ROM, and non-electronic publishing are also possible.

The Discover process consists of the activity of locating information without specific foreknowledge of its particular characteristics. It includes tools by which people or software agents search or browse across variously organized collections of resources, from Web pages to libraries, museums, and data centers. Popular Internet-wide search services support discovery using full-text search against a sampled subset of Web page textual content, sometimes also including machine-aided subject clustering, and increasingly providing additional precision through embedded metadata or contextual metadata such as usage statistics and reference frequency. Example mechanisms to facilitate referrals and query routing are the Whois++ protocol and the Common Indexing Protocol (CIP) that grew out of it.

The Use process concerns the application of information to work activities such as deepening understanding, making decisions and exploiting opportunities. Associated metadata is especially crucial when information is re-purposed or included in derivative products and services.

III. Interfaces in the Framework for Advanced Search

The Framework for Advanced Search focuses on the interfaces between these information management processes:

  1. Create - Capture Interface
  2. Capture - Organize Interface
  3. Organize - Access Interface
  4. Access - Discover Interface
  5. Discover - Use Interface

These interfaces are crucial to interoperability. For example, a document management system used in the Capture process would need to interface with a word processor used in the Create process. The captured content and metadata would have to be interfaced to an indexer employed in the Organize process. The indexed content and metadata would need to be interfaced to an external Access process such as a network. At the Access - Discover interface, the GILS Profile of the ISO 23950 international standard for information search would be supported in addition to customized interfaces using mechanisms such as HTTP and other query mechanisms and protocols. Distributed searching, through search referrals or other means, could also be supported at the Discover - Use interface.
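The chain of hand-offs described above can be sketched as a toy pipeline in Python. This is only an illustration of where each interface sits: every function name, the sample documents, and the naive title heuristic are assumptions, and the in-memory word index stands in for real components such as a document management system, an indexer, and a network search service.

```python
def create():
    """Create: produce raw content (a stand-in for a word processor)."""
    return [
        ("memo-1", "Flood maps for the river basin"),
        ("memo-2", "Budget notes for the survey office"),
    ]


def capture(documents):
    """Create - Capture: represent each item in structured form with metadata."""
    return [
        # The title heuristic (text before " for ") is purely illustrative.
        {"id": doc_id, "title": text.split(" for ")[0], "content": text}
        for doc_id, text in documents
    ]


def organize(records):
    """Capture - Organize: build a simple inverted index over the content."""
    index = {}
    for record in records:
        for word in record["content"].lower().split():
            index.setdefault(word, set()).add(record["id"])
    store = {record["id"]: record for record in records}
    return index, store


def access(index, store):
    """Organize - Access: expose the organized records through a query function."""
    def query(term):
        return [store[doc_id] for doc_id in sorted(index.get(term.lower(), ()))]
    return query


def discover(query, term):
    """Access - Discover: search without foreknowledge of record identifiers."""
    return [record["title"] for record in query(term)]


query = access(*organize(capture(create())))
map_titles = discover(query, "maps")
```

Each function boundary corresponds to one interface, so a component on either side could be replaced independently as long as the data passed across the boundary keeps the same shape.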

In an ideal framework, open standard interfaces specifying the syntax and semantics of required data and signals would exist for each of the five interfaces. However, networked information discovery is still in a period of rapid evolution and the available open standards for these interfaces are yet fragmentary and incomplete. The following interface descriptions describe currently available interface components that would be supported by tools designed for optimal interoperability.

  1. Create - Capture Interface - The Framework for Advanced Search does not specify a single standard interface between information creation and capture. If creation and capture are to be handled separately, the format of content and associated properties or metadata must be openly accessible, typically through an application program interface.
  2. Capture - Organize Interface - A specification of the Capture - Organize interface is not available as an open standard, but some aspects are addressed in open standards. For example, raw content and metadata can be in the form of HTML (including the W3C META tag convention), ASCII text files, USMARC records, electronic mail (RFC 822) folders, Usenet news archives, IAFA templates, BibTeX, UNIX filenames, SGML, and Whois++ templates. Metadata can be encoded using the W3C XML or RDF syntax with GILS, Dublin Core, or other registered semantics. Navigation of files can be accomplished on the Internet with or without SSL (Secure Sockets Layer) or SET (Secure Electronic Transaction), via the HTTP, FTP, Gopher, Whois++, and LDAP protocols.
  3. Organize - Access Interface - Although some aspects of the Organize - Access interface are addressed in standards such as ODBC and JDBC, a wide range of strategies and algorithms are being used and the situation does not seem ripe for broad standardization. It may be possible to standardize a mechanism to communicate metadata semantics, however (see Semantic Mapping discussion below).
  4. Access - Discover Interface - Specific profiles of Internet standards such as HTTP, HTML, CGI, XML, FTP, Gopher, LDAP, and Whois++ can be used at this interface. X.500 is an extensive specification for this interface, as is the GILS Profile of ISO 23950.
  5. Discover - Use Interface - Various kinds of tools are used to discover and use information, including Web servers and desktop applications. Browsing using HTTP can be quite effective, particularly if constrained to a small collection or when a user values serendipity. For cross-domain searching, various approaches can be used at this interface, such as Whois++ and CIP.
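The Capture - Organize description above mentions encoding metadata in XML with registered semantics such as Dublin Core. A minimal sketch of that idea in Python follows, using the standard library's xml.etree.ElementTree. The "record" wrapper element and the function names are assumptions for illustration, as is the choice of the Dublin Core element set namespace URI; the Framework itself does not prescribe a record layout.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set (an assumption for this sketch).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def encode_record(fields):
    """Encode a dict of Dublin Core element names and values as XML text."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, "{%s}%s" % (DC_NS, name))
        child.text = value
    return ET.tostring(record, encoding="unicode")


def decode_record(xml_text):
    """Recover the Dublin Core elements from XML text, ignoring other elements."""
    prefix = "{%s}" % DC_NS
    record = ET.fromstring(xml_text)
    return {
        child.tag[len(prefix):]: child.text
        for child in record
        if child.tag.startswith(prefix)
    }


# Illustrative sample values only.
fields = {"title": "Flood Map Index", "creator": "A. Writer", "date": "1999-09-20"}
xml_text = encode_record(fields)
round_trip = decode_record(xml_text)
```

Because each element is qualified by a namespace, a receiving tool can tell which registry an element name refers to without prior agreement on the wrapper format.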

IV. Semantic Mapping in the Framework for Advanced Search

An overarching requirement among the Framework interfaces is that there exist a standard way to communicate metadata semantics. In essence, the way information resources are characterized in the Create, Capture, and Organize processes must be matched to the skill sets of searchers and the capabilities of the tools provided in the Access, Discover, and Use processes.

Metadata standards in use today typically address multiple aspects of information management, so they apply to several of the five interfaces. Object properties defined within popular integrated desktop office packages are employed in Organize and Access, but they also interact with the Create and Capture processes. GILS metadata elements are defined for use in the Access process but are also found in "usage guidelines" (analogous to bibliographic cataloging rules) and are involved in the Capture and Organize processes when encoded directly into locator records. When used to create references to less tangible information resources (e.g., a person who is a subject matter expert), GILS semantics also find application in the Create process. Such multiple application is common to many metadata schemes, although often implemented in an ad hoc manner.

A variety of current initiatives focus on enhanced sharing of metadata, including proposals in the World Wide Web Consortium and the International Organization for Standardization (ISO), as well as other standards bodies and special interest consortia. Until an overall metadata architecture emerges, the ASF strategy is to focus on the simpler matter of a common mechanism to communicate about semantic mappings across the five interfaces defined in this Framework.

Practical semantic interoperability is no stronger than the weakest link along any given chain of provider, intermediary, and searcher. Typically, the smallest set of registered semantics is no more than a dozen or so elements. An example of the semantic map function arises when an Organize process operates on a set of files captured into an FTP directory in anticipation of making the contents searchable through a GILS-compliant Access process. If the particular files being organized only have Internet Anonymous FTP Archive (IAFA) attributes, the semantic map for those files may only address a few available attributes. The semantic map could be enhanced with other registered semantics the provider may have cross-walked, perhaps some Dublin Core elements. An intermediary picking up this semantic map then has the choice to use IAFA semantics directly or use any of the cross-walked semantics the intermediary may understand. Of course, the intermediary may ignore the map altogether, but at least the provider will have made a best effort to be understandable.
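The intermediary's choice can be sketched in Python as follows. Nothing here reflects the actual draft semantic map: the map structure, the registry labels ("iafa", "dublin-core"), the attribute names, and the sample record are all hypothetical, chosen only to show how a partially cross-walked map limits what each party can understand.

```python
# Hypothetical semantic map published by a provider:
# local element name -> {registry label: element name in that registry}.
SEMANTIC_MAP = {
    "Title": {"iafa": "Title", "dublin-core": "title"},
    "Description": {"iafa": "Description", "dublin-core": "description"},
    "Admin-Name": {"iafa": "Admin-Name"},  # not cross-walked by the provider
}


def interpret(record, semantic_map, understood):
    """Re-label elements using the first registry the intermediary understands.

    Returns the re-labelled values and the list of elements that could not be
    interpreted under any understood registry.
    """
    labelled = {}
    skipped = []
    for element, value in record.items():
        registrations = semantic_map.get(element, {})
        for registry in understood:
            if registry in registrations:
                labelled[(registry, registrations[registry])] = value
                break
        else:
            # No understood registry covers this element.
            skipped.append(element)
    return labelled, skipped


# Illustrative sample record only.
record = {
    "Title": "Flood Map Index",
    "Description": "Index of flood maps",
    "Admin-Name": "A. Keeper",
}

dc_view, dc_skipped = interpret(record, SEMANTIC_MAP, ["dublin-core"])
full_view, full_skipped = interpret(record, SEMANTIC_MAP, ["dublin-core", "iafa"])
```

An intermediary that understands only Dublin Core loses the element the provider did not cross-walk, while one that also understands the provider's native attributes loses nothing. This is the weakest-link effect described above.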

A draft of an Advanced Search Facility semantic map </semantic-map.html> has been suggested as a fairly generic and extensible mechanism to handle many different metadata tagging conventions and semantic registries. This semantic mapping approach simply employs a file transfer mechanism with XML encoding. The draft semantic map includes 90 metadata elements from the ISO Basic Semantic Registry (following ISO 11179) and the GILS Profile (following ISO 23950). This useful set of registered semantics is cross-walked to Dublin Core, MARC bibliographic cataloging, and EDI (Electronic Data Interchange) used in electronic commerce.


Please send comments on this document to Eliot Christian at the U.S. Geological Survey <echristi@usgs.gov>.