Selecting Task-Relevant Sources for Just-in-Time Retrieval

David B. Leake and Ryan Scherle
Computer Science Department
Indiana University
150 S. Woodlawn Avenue
Bloomington, IN 47405

Jay Budzik and Kristian Hammond
The Institute for the Learning Sciences
Northwestern University
1890 Maple Avenue
Evanston, IL 60201, U.S.A.
Abstract

"Just-in-time" information systems monitor their users' tasks, anticipate task-based information needs, and proactively provide their users with relevant information. The effectiveness of such systems depends both on their capability to track user tasks and on their ability to retrieve information that satisfies task-based needs. The Watson system (Budzik et al. 1998; Budzik & Hammond 1999) provides a framework for monitoring user tasks and identifying relevant content areas, and uses this information to generate focused queries for general-purpose search engines and for specialized search engines integrated into the system. The proliferation of specialized search engines and information repositories on the Web provides a rich source of additional information pre-focused for a wide range of information needs, potentially enabling just-in-time systems to exploit that focus by querying the most relevant sources. However, putting this into practice depends on having general, scalable methods for selecting the best sources to satisfy the user's needs. This paper describes early research on augmenting Watson with a general-purpose capability for automatic information source selection. It presents a source selection method that has been integrated into Watson and discusses general issues and research directions for task-relevant source selection.

Introduction

As the volume of available information grows, the burden of information access grows as well. "Just-in-time" (JIT) information systems address this problem by shielding the user from the information access task. Instead of requiring a user to recognize the need for information and initiate queries to satisfy it, these systems observe the user's actions in a task context, anticipate the user's information needs, gather the needed information, and present it to the user before the user requests it. Such systems require methods for (1) determining the type of information the user requires, and (2) focusing retrieval on information that satisfies the user's needs.

There are now large numbers of focused information sources on the web, providing a rich range of specialized information aimed at satisfying particular information needs.[1] Finding the right sources itself requires expertise, limiting the usefulness of these sources for non-expert users. However, if their information can be provided automatically, this drawback is nullified.

We are investigating this problem with SourceSelect, a source selection system integrated with the Watson system (Budzik et al. 1998; Budzik & Hammond 1999). Watson automatically fulfills users' information needs by monitoring their interactions with everyday applications, anticipating their information needs, and querying Internet information sources for that information. The initial version of Watson focuses on identifying task-relevant content areas and automatically generating content-relevant queries for general-purpose search engines and a small set of specific search engines associated with particular query types by hand-made strategies. SourceSelect provides an initial approach to adding a general-purpose capability for identifying and accessing content-relevant search engines. Given a query from Watson, the system does a two-step retrieval, first using vector-space retrieval methods to associate queries with relevant sources, and then using automatically generated queries to guide search within those sources. In the combined system, Watson monitors user activities, identifies relevant content areas, and provides SourceSelect with context information. The SourceSelect system determines appropriate information sources, formulates queries to those sources, sends off those queries, and collates their results for Watson to pass them on to the user. No user intervention is required to target candidate sources.

---
David Leake gratefully acknowledges the support of the InfoLab and Computer Science Department of Northwestern University during his sabbatical leave. His research is supported in part by NASA under award No. NCC 2-1035. Ryan Scherle's research is supported in part by the Department of Education under award P200A80301-98. The research of Kristian Hammond and Jay Budzik is funded by a grant from the National Science Foundation, McKinsey and Company, and gifts from Microsoft Research.

[1] For a sampling of some of these, see The Scout Report (http://wwwscout.cs.wisc.edu/scout/report).
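As an orienting sketch, the combined Watson/SourceSelect loop described above might look like the following. Every name and structure here is a hypothetical illustration, not drawn from the actual implementation; the real system is event-driven and far richer than a single function.

```python
# A hypothetical, simplified sketch of the Watson + SourceSelect pipeline:
# analyze content, select sources, query them, and collate the results.

def watson_with_sourceselect(document_text, select_sources, query_engine):
    """Monitor a document, build a context query, and retrieve from chosen sources.

    select_sources(terms) -> list of engine names relevant to the query terms.
    query_engine(engine, terms) -> list of result titles from that engine.
    """
    # 1. Content analysis: a crude stand-in for Watson's term-vector generation.
    terms = [w.lower().strip(".,") for w in document_text.split()]

    # 2. Source selection: associate the query with relevant specialized engines,
    #    in addition to the general-purpose engines Watson always queries.
    engines = ["GeneralEngine"] + select_sources(terms)

    # 3. Query each engine and collate the results for presentation to the user.
    results = []
    for engine in engines:
        results.extend(query_engine(engine, terms))
    return results

# Toy stand-ins to exercise the pipeline.
fake_select = lambda terms: ["CNNfn"] if "dow" in terms else []
fake_query = lambda engine, terms: [f"{engine} result"]
print(watson_with_sourceselect("Dow crosses 10,000.", fake_select, fake_query))
# -> ['GeneralEngine result', 'CNNfn result']
```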
The paper begins by sketching the Watson framework and discussing the value of specialized information source selection. It next describes the system and the source selection methods it implements. It then discusses central issues for intelligent source selection and how the approach relates to other current approaches.

Just-in-Time Information Access: The Watson Framework

The Intelligent Information Laboratory (InfoLab) at Northwestern University is developing a class of systems called Information Management Assistants (IMAs). These systems observe users as they go about completing tasks in everyday software applications and use their observations to anticipate the user's information needs. They then automatically fulfill these needs by querying traditional information sources such as Internet search engines, filtering the results, and presenting them to the user. IMAs embody a just-in-time information infrastructure in which information is brought to users as they need it, without requiring explicit requests. Essentially, they allow these applications to serve as interfaces for information systems, paving the way for removing the notion of query from information systems altogether.

The first IMA developed at the InfoLab is Watson, an IMA that observes user interaction with applications such as Netscape Navigator, Microsoft Internet Explorer, and Microsoft Word. From its observations and a basic knowledge of information scripts (standard information-seeking behaviors in routine situations), Watson anticipates a user's information needs. It then attempts to automatically fulfill them using common Internet information resources.

The conceptual architecture for IMAs has four components (Budzik et al. 1998):

- The ANTICIPATOR uses an explicit task model to interpret user actions and anticipate a user's information needs.

- The CONTENT ANALYZER employs a model of the content of a document in a given application in order to produce a content representation of the document the user is currently manipulating.

- The RESOURCE SELECTOR receives the representation produced by the CONTENT ANALYZER and selects information sources on the basis of the perceived information need and the content of the document at hand, using a description of the available information sources. In most cases, this results in an information request being sent to external sources. A result list is returned in the form of an HTML page.

- The RESULT PROCESSOR interprets and filters the result list. Results are gathered and clustered using several heuristic result similarity metrics, effectively eliminating redundant results (due to mirrors, multiple equivalent DNS host names, etc.). The resulting list is presented to the user in a separate window.

The above mechanism allows Watson to suggest related information to a user as she writes or browses the Web. Watson observes user interaction with Microsoft Word and Internet Explorer, and uses information sources ranging from general-purpose information repositories, such as newspaper archives or AltaVista, to special-purpose information sources, such as image search engines and automatic map generators.

When a user navigates to a new Web page, Watson suggests pages related to the topic of the page at hand. Similarly, as a user composes a document in Microsoft Word, Watson suggests Web pages on the topic of the document she is composing. This is illustrated in Figure 1.

Motivations for Automatic Source Selection

A well-known problem in generating Internet searches is that queries usually return a wide range of information that may not be relevant to user tasks. For the query "home sales," for example, the first page of results for a recent query to AltaVista contained pointers to information on real estate, realtors, and mortgages. This is useful information if the user is interested in the mechanics of selling a home. However, if the user is an economist interested in economic indicators, these references are of little use.

If the context for the "home sales" query is known to be that the user is working on a document on economics, it is possible to anticipate the type of result that will be useful. One way to do this is to add additional search terms. This can be useful, but it is sometimes difficult even for an expert to select the right query terms for the desired subset of information to be retrieved.

Sending queries to specialized search engines makes it possible to delineate context in advance of the query itself. A search engine such as CNN financial, for example, provides a focus towards financial news, and sending the "home sales" query there yields the information an economist might want: information on changes in aggregate sales trends.

The number of specialized search engines and repositories is large and rapidly increasing, providing the opportunity to select task-relevant sources to improve search results, if the right sources can be found. Unfortunately, finding the right sources can itself require considerable expertise. However, if a system such as Watson could automatically provide information from the right sources, the usefulness of its results could potentially be increased without burden for the user.
Figure 1: Watson suggesting information sources to assist in a research paper.
The goal of the SourceSelect project is to develop methods for automatically identifying relevant information sources and satisfying the information needs.

SourceSelect

SourceSelect bridges the gap between a representation of the type of information relevant to the user's task, as generated by Watson, and information sources on the Internet. The aim is a scalable approach that can improve focus while requiring minimal knowledge to be coded. Consequently, we have begun by investigating the use of IR methods to form the association between queries and sources. The choice of sources is based, whenever possible, on easily accessible information that does not require representing the focuses of the search engines by hand.

Our method divides the search engines used by Watson into two groups, general and specific. Every focused search engine has a list of keywords associated with it, gathered from the META tags on the search engine's main page. A small percentage of search engines do not have keywords in META tags; their keyword lists are constructed manually. The system can currently access six specialized search engines for various topics: CNN, CNNfn, Indiana University, the India engine Khoj, HumorSearch, and ESPN.

When a query is generated by the Watson engine, SourceSelect uses a vector-space retrieval algorithm (Salton & McGill 1983) to match the query against the keywords for each search engine, to find specialized search engines relevant to the query (recall that Watson's queries are processed to include terms associated with the task context). This identifies a set of search engines whose focuses are believed relevant to the query, based on a pre-set threshold for sufficient relevance. This threshold has been set arbitrarily, but we plan to investigate the effects of tuning. The query is then sent to the selected specialized search engines in addition to the general search engines. For some search engines, the length of the query is reduced to the first few terms to improve retrieval performance. When results are returned from these search engines, they are sent back to the Watson engine for clustering and display.

When the selected search engines are especially appropriate, this method can markedly improve the quality of the results generated for a query. For example, while browsing a page on www.cbs.com concerning the Dow Jones industrial average crossing the 10,000 mark, the suggestions in Table 1 were generated by standard Watson and Watson with SourceSelect (page titles are shown). The original version of Watson found some sites that relate to financial news, but the results were not very useful for someone with an interest in the Dow. With SourceSelect, the keywords Watson generated for the page matched with keywords for the CNN financial search engine, and better results were produced.
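The matching step can be sketched as follows. This is a minimal illustration of vector-space association between a query and engine keyword lists; the keyword lists and threshold value here are invented for the example, not the system's actual data.

```python
from collections import Counter
from math import sqrt

# Hypothetical keyword lists, as might be gathered from each engine's META tags.
ENGINE_KEYWORDS = {
    "CNNfn": ["finance", "stocks", "dow", "market", "business", "economy"],
    "ESPN": ["sports", "scores", "football", "basketball", "baseball"],
    "HumorSearch": ["humor", "jokes", "comedy", "funny"],
}

# Pre-set relevance cutoff (the paper notes the real threshold was set arbitrarily).
THRESHOLD = 0.2

def cosine(a, b):
    """Cosine similarity between two bags of terms."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def select_sources(query_terms):
    """Return engines whose keyword vectors are sufficiently similar to the query."""
    scored = ((name, cosine(query_terms, kws)) for name, kws in ENGINE_KEYWORDS.items())
    return [name for name, score in scored if score >= THRESHOLD]

print(select_sources(["dow", "stocks", "trading", "market"]))  # -> ['CNNfn']
```

A query sharing no terms with any keyword list selects no specialized engines, in which case only the general-purpose engines would be queried.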
Standard Watson:
- WDBJ 7 news at 6 for 08/11/96
- The 6 O'Clock Report, Wednesday, 8/5/98
- 85 Documents about 'Dog bites & Stats'
- http://www.io.com/nuka/Text/kpnuka981029.txt
- Log from the hatching of . . . Kereneth's Clutch Ista...
- Factors Influencing Media Coverage of Business Crises

Watson with SourceSelect:
- Technology Stocks Slip In Lackluster Trading
- When a fund company is publicly traded - Mar. 19, 1999
- Dow manages slight gain in early morning trading
- Toon Inn
- Dow slides 31.13 in jittery trading
- CNNfn - Dow Squeezes out gain - Nov. 4, 1997
- CNNfn - Dow breaks . . . losing streak - June 17, 1996
- Dow closes up 337.17 in record gain on busiest . . . day ever
- Tech Stocks Solid As Dow And Nasdaq Gain

Table 1: Example of Watson results with and without source selection.
Two key questions for this approach are whether the selection of specialized search engines will improve results for queries in their context area, and whether possible erroneous selection of specialized search engines will degrade performance for queries that are not in their content area. Informal trials are encouraging, and we are now designing experiments to test these two questions.

Issues

Issues for automatic source selection include how to identify the user's information needs, how to select sources relevant to those needs, and how to access and exploit the information they provide. We discuss each of these in turn.

Identifying needed information

A key goal of IMAs is to automatically provide users with the right information, rather than forcing them to interrupt their tasks as they notice needs for information and try to satisfy those needs through manual searching. Achieving this goal depends on the system being able to determine what information is relevant to the current goals, without directly querying the user. In principle, abductive plan recognition could be used to explain the user's actions and anticipate information needs. In practice, however, there are many reasons this is not feasible: it is too difficult to generate high-level explanations for user behavior, processing cost is too high, too much background knowledge is required, and too many explanations are possible for the observed behaviors.

The Watson approach is to use limited task knowledge, at the level of how particular applications are used and how to infer content information likely to be relevant, to guide its description of relevant content. For example, Watson's knowledge includes the fact that headings in documents are likely to be important. Based on this knowledge, it describes the important content of a document by generating a term vector that gives greater weight to terms in headings. Thus content-relevance is used as an easier-to-compute proxy for task-relevance.

An issue to explore is whether it is worthwhile to preserve the context independently of a query describing information needs within that context. In this approach, the context alone would be used to select specialized search engines, which would then be presented with the query that assumes that context.

Source characterization

Our initial method for describing the focuses of specific search engines relies on the keywords selected by search engine developers to describe them. These tags provide a reasonable first pass at characterization, but there is no guarantee that they will be accurate. In some cases the inaccuracies are intentional, as search engines add popular tags merely to increase the chance that the tags for their search engines will match queries presented to other search engines, in order to increase their traffic. We plan to explore other methods for characterizing information sources, such as generating term vectors directly by crawling site contents for accessible repositories. We also plan to investigate methods for more flexible matching of page descriptions, such as using a hierarchy to provide more flexible matching for related terms.

Engine-Specific Query Generation

Being able to select specialized information sources raises interesting questions about how to transform general queries into queries that exploit the contextual focus provided by a specialized search engine. When generating a query for a general-purpose search engine such as AltaVista, much of the query content is needed to disambiguate the required context. Once a context is established by the specialized source, that information is no longer necessary. Some search engines automatically AND the terms in queries as their default processing mode (e.g., ESPN), making it possible that the additional terms included for disambiguation will prevent useful information from being retrieved. For www.humorsearch.com, which has a very small database, queries with more than two terms appear to seldom retrieve any results. In general, being able to access specialized information sources raises interesting questions of how to tailor queries to those sources, in light of both the information needed and the characteristics of the sources themselves.
Parser Selection and Wrapper Generation

Accessing specialized search engines requires having mechanisms for extracting the information that they return and making it available in a useful form. This corresponds to the well-known problem of wrapper generation. SourceSelect relies on hand-coded wrappers to access its information sources, but ideally would exploit either a standard set of wrappers to allow semi-automatic selection or wrapper learning methods (e.g., Kushmerick, Doorenbos, & Weld 1997) to facilitate the addition of new sources. Effective methods for wrapper generation are one precondition for automatic addition of new information sources.

Collating Results

A final issue is how to merge the results of multiple specialized sources. SourceSelect currently relies on heuristic clustering algorithms in Watson to group results. These algorithms use information such as the titles of pages and the structure of URLs to decide when two pages are similar. For specialized information sources, these heuristics could be augmented with heuristics that also consider the implicit context provided by the sources of the information themselves.

Perspective

The basic Watson system addresses task-relevant focusing by automatically generating queries relevant to content areas associated with the task. The addition of SourceSelect adds task-based focusing for selecting where the query is sent. The premises of this approach contrast dramatically with those of a search engine such as Google (http://www.google.com), in which the goal of a search is to find a "consensus" answer. In our approach, the goal of a search is to find the answer most relevant to a specific information-seeking context, and the use of specialized resources helps assure the relevance of the result to that context.

Surprisingly little work has been done on source selection. The most notable example, a previous version of SavvySearch (Dreilinger & Howe 1997), kept track of how well search engines handled past queries, and used vector-space retrieval to match the current query to a search engine that had previously done well with similar queries. ProFusion (Gauch & Wang 1996) used a hand-built knowledge hierarchy to categorize queries and select relevant search engines. More recently, an agent-based learning system was added to ProFusion to manipulate each engine's place in the hierarchy based on past searches (Fan & Gauch 1999).

Older systems, like Metacrawler (Selberg & Etzioni 1995), use only general search engines and send the query to all of them. Bandwidth constraints limit the number of search engines that can be queried. The current incarnation of SavvySearch (http://www.savvysearch.com) now appears to use this approach as well.

The Internet Sleuth (http://www.isleuth.com) is a search engine that indexes other specialized search engines. It allows the user to effectively perform a source-selection algorithm by hand.

Apple's Sherlock (http://www.apple.com/sherlock) allows the user to select the search engines that will be queried. This approach puts the burden of source selection entirely on the user: the user is forced to remember which search engines give the most relevant results for each type of query he may want to use.

The GlOSS (Gravano, Garcia-Molina, & Tomasic 1994) system obtains the index from each of its information sources and combines these indices to form a meta-index, which is used for source selection. The drawback of this approach is that all of the information sources must cooperate by providing their indices in order for the meta-index to be built.

EMIR (Kulyukin 1999) maintains positive and negative keyword vectors for each of its information sources. Like GlOSS, it needs the cooperation of the information sources to maintain an accurate representation of their contents.

Most of these systems use general-purpose search engines for their information sources. While general-purpose search engines provide the broadest coverage, focused search engines can have a much greater concentration of relevant links within their subject area. When Watson's contextual information is added to basic source selection, focused search engines appear to provide better results than general search engines.

Conclusion

SourceSelect augments Watson's just-in-time retrieval framework with the capability to choose specialized information sources related to the current context. The goal is to leverage off existing information resources to automatically provide the user with task-relevant information. The current version of SourceSelect matches term-vector descriptions of the content area of interest to descriptions from the tags of specialized search engines to select sources expected to be relevant to those content areas, queries those sources, and forwards the results to Watson for presentation to the user. Initial tests have been encouraging; next steps include the addition of other specialized search engines, formal evaluation, and exploration of alternative methods for describing task-relevant content and selecting information sources.

References

Budzik, J., and Hammond, K. 1999. Watson: a just-in-time information environment. In AAAI Workshop on Intelligent Information Systems. In press.

Budzik, J.; Hammond, K.; Marlow, C.; and Scheinkman, A. 1998. Anticipating information needs: Everyday applications as interfaces to internet information resources. In Proceedings of the 1998 World Conference on the WWW, Internet, and Intranet.
Dreilinger, D., and Howe, A. 1997. Experiences with selecting search engines using meta-search. ACM Transactions on Information Systems 15(3).

Fan, Y., and Gauch, S. 1999. Adaptive agents for information gathering from multiple distributed information sources. In Proceedings of the 1999 AAAI Spring Symposium on Intelligent Agents in Cyberspace. AAAI Press.

Gauch, S., and Wang, G. 1996. Information fusion with ProFusion. In WebNet '96: The First World Conference of the Web Society, 174-179.

Gravano, L.; Garcia-Molina, H.; and Tomasic, A. 1994. Precision and recall of GlOSS estimators for database discovery. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (PDIS '94).

Kulyukin, V. 1999. Application-embedded retrieval from distributed free-text collections. In Proceedings of the Sixteenth National Conference on Artificial Intelligence. AAAI Press. In press.

Kushmerick, N.; Doorenbos, R.; and Weld, D. 1997. Wrapper induction for information extraction. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann.

Salton, G., and McGill, M. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Selberg, E., and Etzioni, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the Fourth World Wide Web Conference, 195-208.