Resource Indexing and Discovery
In a Globally Distributed Digital Library
Position Paper for an NSF-EU Digital Library Collaboratory Working Group
Budapest, Hungary, November 1997
Luis Gravano
Computer Science Department
Columbia University
The Internet has grown dramatically over the past few years. Information
sources are available everywhere. Unfortunately, these information sources
vary widely in the types of information and access interfaces that they
provide. Therefore, using this wealth of resources effectively presents
interesting and challenging problems. After all, users have information
needs, and should not be concerned with the format of the available data,
or with the interface and access capabilities of the data sources.
Increasingly, users want to issue complex queries across Internet sources
to obtain the data they require. Because of the size of the Internet, it
is no longer possible to process such queries in naive ways, e.g., by
accessing all the available sources. Thus, we must process queries in a
way that scales with the number of sources. Also, sources vary in the type
of information objects they contain and in the interface they present to
their users. Some sources contain text documents and support simple query
models where a query is just a list of keywords. Other sources contain
more structured data and provide query interfaces in the style of relational
databases. User queries might require accessing sources supporting
radically different interfaces and query models. Thus, we must process
queries in a way that deals with heterogeneous sources.
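One way to make query processing scale is to keep a compact content summary of each source and route a query only to the most promising sources, rather than contacting all of them. The following sketch is purely illustrative, not a description of any deployed system: the summaries, their word statistics, and the scoring rule are all assumptions.

```python
# Illustrative sketch: route a keyword query to the most promising sources
# instead of contacting every source. The content summaries (word ->
# document frequency) and the scoring rule below are simplifying assumptions.

# Hypothetical content summaries: for each source, how many of its
# documents contain each word, plus the source's total document count.
summaries = {
    "movie-db":   {"size": 500_000, "df": {"film": 400_000, "actor": 350_000}},
    "news-wire":  {"size": 900_000, "df": {"film": 30_000, "election": 500_000}},
    "cs-reports": {"size": 20_000,  "df": {"database": 9_000, "film": 50}},
}

def score_source(summary, query_words):
    """Estimate how many documents in the source match ANY query word,
    assuming word occurrences are independent (a rough approximation)."""
    size = summary["size"]
    p_no_match = 1.0
    for w in query_words:
        df = summary["df"].get(w, 0)
        p_no_match *= 1.0 - df / size
    return size * (1.0 - p_no_match)

def select_sources(query_words, k=2):
    """Return the k sources with the highest estimated number of matches."""
    ranked = sorted(summaries,
                    key=lambda s: score_source(summaries[s], query_words),
                    reverse=True)
    return ranked[:k]

print(select_sources(["film", "actor"]))
```

Only the selected sources would then receive the actual query, so the work per query grows with the number of relevant sources rather than with the total number of sources.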
Users should be able to express their information needs and receive
the relevant data even when finding this data requires accessing sources
of textual and non-textual documents, or sources that do not cooperate
by exporting content summaries, for example. Furthermore, users should
receive this data ranked so that the potentially most useful objects come
first, because the number of objects that match a query might be
very large. Many problems need to be solved before we can provide
users with sophisticated, seamless, and transparent access to the large
number and variety of Internet sources. Below is a description of some
of these problems, which range from improving systems that already exist
(e.g., WWW search engines for HTML documents), to dealing with sources
that are currently largely ignored by WWW search engines (e.g., ``uncooperative,''
non-HTML text sources, relational databases, image repositories).
Query Specification/User Interface
There is much more to a description of an information need than a simple
list of words. Indeed, when users look for information, they often have
many other requirements in mind. The following are just a few examples
of these requirements:
- The right ``register'': for example, a scientific research report,
a gossip column from a tabloid, or a university's academic calendar
- The right ``geographic relevance'': for example, a ``locally relevant''
resource, or a ``globally relevant'' resource, but most likely not a resource
that is relevant only to some county far away from the user's residence
- The right ``popularity level'': for example, a really popular resource,
accessed and referred to massively, or an obscure resource that nobody
knows about
- ...
A challenge is to provide user interfaces and systems that can gather
these user requirements without overwhelming unsophisticated users
with too much complexity.
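One way such requirements might be captured internally, whatever interface collects them, is as a structured query object carried alongside the keyword list. The sketch below is hypothetical: the field names, the vocabularies for register and scope, and the popularity scale are all assumptions, and how a system would obtain the corresponding page metadata is itself an open problem.

```python
# Illustrative sketch of a query object that carries the extra user
# requirements discussed above alongside the keywords. All field names
# and value vocabularies here are hypothetical.
from dataclasses import dataclass

@dataclass
class UserQuery:
    keywords: list                  # the traditional keyword list
    register: str = "any"           # e.g., "scientific", "tabloid", "any"
    geographic_scope: str = "any"   # e.g., "local", "global", "any"
    min_popularity: float = 0.0     # 0.0 (obscure) .. 1.0 (very popular)

def matches_requirements(query, page_register, page_scope, page_popularity):
    """Check a candidate page's metadata against the query's requirements.
    Obtaining such metadata for real pages is itself an open problem."""
    return ((query.register == "any" or query.register == page_register)
            and (query.geographic_scope == "any"
                 or query.geographic_scope == page_scope)
            and page_popularity >= query.min_popularity)

q = UserQuery(keywords=["digital", "library"],
              register="scientific", min_popularity=0.5)
print(matches_requirements(q, "scientific", "global", 0.8))   # True
```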
Smart Query Processing over Text Documents
Current WWW search engines generally do a poor job at ranking pages for
a given user query. Typically, these engines rank the available WWW pages
for the query based on the pages' contents. These page ranks are computed
by following variants of the vector-space and probabilistic retrieval models
developed over the years by the information retrieval community. The number
of WWW pages and the wide difference in their quality and scope make this
approach inappropriate in many cases: users are overwhelmed with large
numbers of highly ranked, low quality pages that happen to include the
query words many times.
An interesting problem is to use all available information on the WWW
to do a better job at ranking documents for queries, taking into consideration
the special user requirements that we discussed above. A key challenge
in mining all this information for query processing is efficiency, since
the volume of the information at hand is extremely large, and growing fast.
Promising sources of information to employ include available citation information
(e.g., as in Stanford's BackRub
system), query logs, response times, user feedback, and quality reports.
For example, initial work on mining query logs tries to predict what pages
are likely to be useful to users based on their browsing behavior and that
of previous users.
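The content-based, vector-space ranking described above can be sketched as follows. This is a minimal textbook tf-idf variant over a toy collection, not the exact weighting any particular search engine used.

```python
# Minimal vector-space ranking sketch: documents and the query become
# tf-idf weighted vectors, and documents are ranked by cosine similarity
# to the query. A textbook variant, not any engine's exact formula.
import math
from collections import Counter

docs = {
    "d1": "digital library digital collections",
    "d2": "movie reviews and gossip",
    "d3": "library catalog search",
}

def tfidf_vectors(docs):
    """Build a tf-idf vector for each document; also return df and n."""
    tokenized = {d: text.split() for d, text in docs.items()}
    n = len(docs)
    df = Counter(w for words in tokenized.values() for w in set(words))
    vectors = {}
    for d, words in tokenized.items():
        tf = Counter(words)
        vectors[d] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vectors, df, n

def rank(query, docs):
    """Rank document ids by cosine similarity to the query vector."""
    vectors, df, n = tfidf_vectors(docs)
    q = {w: math.log(n / df[w]) for w in query.split() if w in df}
    def cosine(v):
        num = sum(v.get(w, 0.0) * q[w] for w in q)
        den = (math.sqrt(sum(x * x for x in v.values()))
               * math.sqrt(sum(x * x for x in q.values()))) or 1.0
        return num / den
    return sorted(docs, key=lambda d: cosine(vectors[d]), reverse=True)

print(rank("digital library", docs))
```

The failure mode described in the text is visible even here: a page that merely repeats the query words many times (as "digital" does in d1) gets a high tf component regardless of its quality, which is why purely content-based ranks are often unsatisfying.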
Resource Discovery over Search-Only Text Sources
Search engines currently ignore the non-HTML contents of sources that are
``hidden'' behind search interfaces. Indeed, search engines cannot ``crawl''
inside such sources and follow links to extract all of their documents.
Therefore, we have to resort to other mechanisms to reason about the
sources' contents and determine whether they are relevant to users' information
needs. A possible solution to this problem is to have sources cooperate
by exporting content summaries and metadata following a known protocol
(e.g., Z39.50's
Explain facility, or the STARTS
protocol proposal). However, if sources do not cooperate, then we have
to devise alternative mechanisms for extracting meaningful content summaries
automatically. Dealing with uncooperative sources is a crucial problem,
since many high-quality information sources (e.g., the
Internet Movie Database) fall into this category.
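One alternative mechanism that has been considered is query probing: issue single-word queries through the source's own search interface and record the number of matches each probe reports, yielding a crude content summary without any cooperation from the source. The sketch below is an assumption-laden illustration; the `count_matches` interface stands in for whatever a real wrapper would scrape from an HTTP search form.

```python
# Illustrative sketch: build a crude content summary of an uncooperative,
# search-only source by probing it with single-word queries and recording
# the number of matches each probe reports. The `source` interface here
# is hypothetical; a real implementation would wrap an HTTP search form.

def build_summary(source, probe_words):
    """Return a word -> match-count summary obtained purely by querying."""
    summary = {}
    for word in probe_words:
        count = source.count_matches(word)   # assumed search-only interface
        if count > 0:
            summary[word] = count
    return summary

class FakeSource:
    """Stand-in for a remote search-only source, for demonstration only."""
    def __init__(self, counts):
        self._counts = counts
    def count_matches(self, word):
        return self._counts.get(word, 0)

movie_source = FakeSource({"film": 120_000, "actor": 95_000, "database": 12})
print(build_summary(movie_source, ["film", "actor", "protein", "database"]))
```

Choosing which words to probe with, and how many probes a remote source will tolerate, are exactly the kinds of efficiency questions such a mechanism would have to answer.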
Resource Discovery over Arbitrary Sources
So far we have discussed resource discovery over sources of text documents.
However, many sources on the Internet host other kinds of information,
such as ``relational-like'' data, images, and video. A particularly challenging
open issue is how to summarize the contents of such sources in an automatic
and scalable way so that we can reason about the sources when processing
user queries.
Ultimately, our goal is to allow transparent query processing over sources
with varying data types. For example, users should be able to issue queries
whose processing involves accessing text, relational, and image sources.
Before we can process queries that span several source and data types,
we need to address the following issues:
- Defining the meaningful combinations of data types and operations.
To extract the information that users need, we might have to perform
join-like operations involving, say, two repositories of text documents.
- Defining expressive query languages. As we mentioned above, users
should express their requests using simple interfaces. We should then
translate user requests into queries written in a query language that models
the wide variety of sources and data types available on the Internet.
- Defining efficient execution plans for queries spanning several source
and data types. Finally, once we have produced a complex query expressed
in the query language discussed above and reflecting the user's information
need, we have to design efficient, incremental query plans to execute it.
Producing these plans involves deciding which sources are relevant for
evaluating the different query pieces, evaluating these pieces at the
sources using the available interfaces and query models, and finally
combining the answers produced by the sources into a coherent query
result for the user who issued the query.
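A tiny example of the last step, executing one plan that spans a text source and a ``relational-like'' source, might look as follows. Everything here is hypothetical: the sources, their schemas, and the join attribute are invented for illustration, and a real plan would evaluate each piece remotely through whatever interface the source supports.

```python
# Illustrative sketch of executing one query plan that spans a text source
# and a ``relational-like'' source: first retrieve matching documents, then
# join each document's author against a relational-style table. Sources,
# schemas, and the join attribute are all hypothetical.

# Hypothetical text source: keyword search over (doc_id, author, text).
text_source = [
    ("t1", "smith", "digital library architectures"),
    ("t2", "jones", "relational query optimization"),
    ("t3", "smith", "image retrieval methods"),
]

# Hypothetical relational-like source: author -> affiliation.
relational_source = {"smith": "Columbia", "jones": "Stanford"}

def search_text(keyword):
    """Evaluate the text piece of the query at the text source."""
    return [(doc_id, author) for doc_id, author, text in text_source
            if keyword in text]

def plan(keyword):
    """Join-like, incremental plan: run the text search first, then probe
    the relational source per result (an index-nested-loops flavor).
    Results are yielded as they are produced rather than all at once."""
    for doc_id, author in search_text(keyword):
        affiliation = relational_source.get(author)
        if affiliation is not None:
            yield (doc_id, author, affiliation)

print(list(plan("library")))
```

The generator makes the plan incremental in the sense used above: the first combined answers can reach the user before all the remote pieces have finished.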
Luis Gravano
gravano@cs.columbia.edu