Selecting Task-Relevant Sources for Just-in-Time Retrieval

David B. Leake and Ryan Scherle
Computer Science Department
Indiana University
150 S. Woodlawn Avenue
Bloomington, IN 47405

Jay Budzik and Kristian Hammond
The Institute for the Learning Sciences
Northwestern University
1890 Maple Avenue
Evanston, IL 60201, U.S.A.
Abstract

"Just-in-time" information systems monitor their users' tasks, anticipate task-based information needs, and proactively provide their users with relevant information. The effectiveness of such systems depends both on their capability to track user tasks and on their ability to retrieve information that satisfies task-based needs. The Watson system (Budzik et al. 1998; Budzik & Hammond 1999) provides a framework for monitoring user tasks and identifying relevant content areas, and uses this information to generate focused queries for general-purpose search engines and for specialized search engines integrated into the system. The proliferation of specialized search engines and information repositories on the Web provides a rich source of additional information pre-focused for a wide range of information needs, potentially enabling just-in-time systems to exploit that focus by querying the most relevant sources. However, putting this into practice depends on having general, scalable methods for selecting the best sources to satisfy the user's needs. This paper describes early research on augmenting Watson with a general-purpose capability for automatic information source selection. It presents a source selection method that has been integrated into Watson and discusses general issues and research directions for task-relevant source selection.

Introduction

As the volume of available information grows, the burden of information access grows as well. "Just-in-time" (JIT) information systems address this problem by shielding the user from the information access task. Instead of requiring a user to recognize the need for information and initiate queries to satisfy it, these systems observe the user's actions in a task context, anticipate the user's information needs, gather the needed information, and present it to the user before the user requests it. Such systems require methods for (1) determining the type of information the user requires, and (2) focusing retrieval on information that satisfies the user's needs.

There are now large numbers of focused information sources on the web, providing a rich range of specialized information aimed at satisfying particular information needs.[1] Finding the right sources itself requires expertise, limiting the usefulness of these sources for non-expert users. However, if their information can be provided automatically, this drawback is nullified.

We are investigating this problem with SourceSelect, a source selection system integrated with the Watson system (Budzik et al. 1998; Budzik & Hammond 1999). Watson automatically fulfills users' information needs by monitoring their interactions with everyday applications, anticipating their information needs, and querying Internet information sources for that information. The initial version of Watson focuses on identifying task-relevant content areas and automatically generating content-relevant queries for general-purpose search engines and a small set of specific search engines associated with particular query types by hand-made strategies. SourceSelect provides an initial approach to adding a general-purpose capability for identifying and accessing content-relevant search engines. Given a query from Watson, the system does a two-step retrieval, first using vector-space retrieval methods to associate queries with relevant sources, and then using automatically generated queries to guide search within those sources. In the combined system, Watson monitors user activities, identifies relevant content areas, and provides SourceSelect with context information. The SourceSelect system determines appropriate information sources, formulates queries to those sources, sends off those queries, and collates their results for Watson to pass them on to the user. No user intervention is required to target candidate sources.

---
David Leake gratefully acknowledges the support of the InfoLab and Computer Science Department of Northwestern University during his sabbatical leave. His research is supported in part by NASA under award No. NCC 2-1035. Ryan Scherle's research is supported in part by the Department of Education under award P200A80301-98. The research of Kristian Hammond and Jay Budzik is funded by a grant from the National Science Foundation, McKinsey and Company, and gifts from Microsoft Research.

[1] For a sampling of some of these, see The Scout Report (http://wwwscout.cs.wisc.edu/scout/report).
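As an orienting sketch, the combined Watson/SourceSelect loop described above might look like the following. Every name and structure here is a hypothetical illustration, not drawn from the actual implementation; the real system is event-driven and far richer than a single function.

```python
# A hypothetical, simplified sketch of the Watson + SourceSelect pipeline:
# analyze content, select sources, query them, and collate the results.

def watson_with_sourceselect(document_text, select_sources, query_engine):
    """Monitor a document, build a context query, and retrieve from chosen sources.

    select_sources(terms) -> list of engine names relevant to the query terms.
    query_engine(engine, terms) -> list of result titles from that engine.
    """
    # 1. Content analysis: a crude stand-in for Watson's term-vector generation.
    terms = [w.lower().strip(".,") for w in document_text.split()]

    # 2. Source selection: associate the query with relevant specialized engines,
    #    in addition to the general-purpose engines Watson always queries.
    engines = ["GeneralEngine"] + select_sources(terms)

    # 3. Query each engine and collate the results for presentation to the user.
    results = []
    for engine in engines:
        results.extend(query_engine(engine, terms))
    return results

# Toy stand-ins to exercise the pipeline.
fake_select = lambda terms: ["CNNfn"] if "dow" in terms else []
fake_query = lambda engine, terms: [f"{engine} result"]
print(watson_with_sourceselect("Dow crosses 10,000.", fake_select, fake_query))
# -> ['GeneralEngine result', 'CNNfn result']
```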
The paper begins by sketching the Watson framework and discussing the value of specialized information source selection. It next describes the system and the source selection methods it implements. It then discusses central issues for intelligent source selection and how the approach relates to other current approaches.

Just-in-Time Information Access: The Watson Framework

The Intelligent Information Laboratory (InfoLab) at Northwestern University is developing a class of systems called Information Management Assistants (IMAs). These systems observe users as they go about completing tasks in everyday software applications and use their observations to anticipate the user's information needs. They then automatically fulfill these needs by querying traditional information sources such as Internet search engines, filtering the results, and presenting them to the user. IMAs embody a just-in-time information infrastructure in which information is brought to users as they need it, without requiring explicit requests. Essentially, they allow these applications to serve as interfaces for information systems, paving the way for removing the notion of query from information systems altogether.

The first IMA developed at the InfoLab is Watson, an IMA that observes user interaction with applications such as Netscape Navigator, Microsoft Internet Explorer, and Microsoft Word. From its observations and a basic knowledge of information scripts (standard information-seeking behaviors in routine situations), Watson anticipates a user's information needs. It then attempts to automatically fulfill them using common Internet information resources.

The conceptual architecture for IMAs has four components (Budzik et al. 1998):

- The ANTICIPATOR uses an explicit task model to interpret user actions and anticipate a user's information needs.

- The CONTENT ANALYZER employs a model of the content of a document in a given application in order to produce a content representation of the document the user is currently manipulating.

- The RESOURCE SELECTOR receives the representation produced by the CONTENT ANALYZER and selects information sources on the basis of the perceived information need and the content of the document at hand, using a description of the available information sources. In most cases, this results in an information request being sent to external sources. A result list is returned in the form of an HTML page.

- The RESULT PROCESSOR interprets and filters the result list. Results are gathered and clustered using several heuristic result similarity metrics, effectively eliminating redundant results (due to mirrors, multiple equivalent DNS host names, etc.). The resulting list is presented to the user in a separate window.

The above mechanism allows Watson to suggest related information to a user as she writes or browses the Web. Watson observes user interaction with Microsoft Word and Internet Explorer, and uses information sources ranging from general-purpose information repositories, such as newspaper archives or AltaVista, to special-purpose information sources, such as image search engines and automatic map generators.

When a user navigates to a new Web page, Watson suggests pages related to the topic of the page at hand. Similarly, as a user composes a document in Microsoft Word, Watson suggests Web pages on the topic of the document she is composing. This is illustrated in Figure 1.

Motivations for Automatic Source Selection

A well-known problem in generating Internet searches is that queries usually return a wide range of information that may not be relevant to user tasks. For the query "home sales," for example, the first page of results for a recent query to AltaVista contained pointers to information on real estate, realtors, and mortgages. This is useful information if the user is interested in the mechanics of selling a home. However, if the user is an economist interested in economic indicators, these references are of little use.

If the context for the "home sales" query is known to be that the user is working on a document on economics, it is possible to anticipate the type of result that will be useful. One way to do this is to add additional search terms. This can be useful, but it is sometimes difficult even for an expert to select the right query terms for the desired subset of information to be retrieved.

Sending queries to specialized search engines makes it possible to delineate context in advance of the query itself. A search engine such as CNN financial, for example, provides a focus towards financial news, and sending the "home sales" query there yields the information an economist might want: information on changes in aggregate sales trends.

The number of specialized search engines and repositories is large and rapidly increasing, providing the opportunity to select task-relevant sources to improve search results, if the right sources can be found. Unfortunately, finding the right sources can itself require considerable expertise. However, if a system such as Watson could automatically provide information from the right sources, the usefulness of its results could potentially be increased without burden for the user.
Figure 1: Watson suggesting information sources to assist in a research paper.
The goal of the SourceSelect project is to develop methods for automatically identifying relevant information sources and satisfying the information needs.

SourceSelect

SourceSelect bridges the gap between a representation of the type of information relevant to the user's task, as generated by Watson, and information sources on the Internet. The aim is a scalable approach that can improve focus while requiring minimal knowledge to be coded. Consequently, we have begun by investigating the use of IR methods to form the association between queries and sources. The choice of sources is based, whenever possible, on easily accessible information that does not require representing the focuses of the search engines by hand.

Our method divides the search engines used by Watson into two groups, general and specific. Every focused search engine has a list of keywords associated with it, gathered from the META tags on the search engine's main page. A small percentage of search engines do not have keywords in META tags; their keyword lists are constructed manually. The system can currently access six specialized search engines for various topics: CNN, CNNfn, Indiana University, the India engine Khoj, HumorSearch, and ESPN.

When a query is generated by the Watson engine, SourceSelect uses a vector-space retrieval algorithm (Salton & McGill 1983) to match the query against the keywords for each search engine, to find specialized search engines relevant to the query (recall that Watson's queries are processed to include terms associated with the task context). This identifies a set of search engines whose focuses are believed relevant to the query, based on a pre-set threshold for sufficient relevance. This threshold has been set arbitrarily, but we plan to investigate the effects of tuning. The query is then sent to the selected specialized search engines in addition to the general search engines. For some search engines, the length of the query is reduced to the first few terms to improve retrieval performance. When results are returned from these search engines, they are sent back to the Watson engine for clustering and display.

When the selected search engines are especially appropriate, this method can markedly improve the quality of the results generated for a query. For example, while browsing a page on www.cbs.com concerning the Dow Jones industrial average crossing the 10,000 mark, the suggestions in Table 1 were generated by standard Watson and Watson with SourceSelect (page titles are shown). The original version of Watson found some sites that relate to financial news, but the results were not very useful for someone with an interest in the Dow. With SourceSelect, the keywords Watson generated for the page matched with keywords for the CNN financial search engine, and better results were produced.
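The matching step can be sketched as follows. This is a minimal illustration of vector-space association between a query and engine keyword lists; the keyword lists and threshold value here are invented for the example, not the system's actual data.

```python
from collections import Counter
from math import sqrt

# Hypothetical keyword lists, as might be gathered from each engine's META tags.
ENGINE_KEYWORDS = {
    "CNNfn": ["finance", "stocks", "dow", "market", "business", "economy"],
    "ESPN": ["sports", "scores", "football", "basketball", "baseball"],
    "HumorSearch": ["humor", "jokes", "comedy", "funny"],
}

# Pre-set relevance cutoff (the paper notes the real threshold was set arbitrarily).
THRESHOLD = 0.2

def cosine(a, b):
    """Cosine similarity between two bags of terms."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def select_sources(query_terms):
    """Return engines whose keyword vectors are sufficiently similar to the query."""
    scored = ((name, cosine(query_terms, kws)) for name, kws in ENGINE_KEYWORDS.items())
    return [name for name, score in scored if score >= THRESHOLD]

print(select_sources(["dow", "stocks", "trading", "market"]))  # -> ['CNNfn']
```

A query sharing no terms with any keyword list selects no specialized engines, in which case only the general-purpose engines would be queried.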
Standard Watson:
- WDBJ 7 news at 6 for 08/11/96
- The 6 O'Clock Report, Wednesday, 8/5/98
- 85 Documents about 'Dog bites & Stats'
- http://www.io.com/nuka/Text/kpnuka981029.txt
- Log from the hatching of . . . Kereneth's Clutch Ista...
- Factors Influencing Media Coverage of Business Crises

Watson with SourceSelect:
- Technology Stocks Slip In Lackluster Trading
- When a fund company is publicly traded - Mar. 19, 1999
- Dow manages slight gain in early morning trading
- Toon Inn
- Dow slides 31.13 in jittery trading
- CNNfn - Dow Squeezes out gain - Nov. 4, 1997
- CNNfn - Dow breaks . . . losing streak - June 17, 1996
- Dow closes up 337.17 in record gain on busiest . . . day ever
- Tech Stocks Solid As Dow And Nasdaq Gain

Table 1: Example of Watson results with and without source selection.
Two key questions for this approach are whether the selection of specialized search engines will improve results for queries in their context area, and whether possible erroneous selection of specialized search engines will degrade performance for queries that are not in their content area. Informal trials are encouraging, and we are now designing experiments to test these two questions.

Issues

Issues for automatic source selection include how to identify the user's information needs, how to select sources relevant to those needs, and how to access and exploit the information they provide. We discuss each of these in turn.

Identifying needed information

A key goal of IMAs is to automatically provide users with the right information, rather than forcing them to interrupt their tasks as they notice needs for information and try to satisfy those needs through manual searching. Achieving this goal depends on the system being able to determine what information is relevant to the current goals, without directly querying the user. In principle, abductive plan recognition could be used to explain the user's actions and anticipate information needs. In practice, however, there are many reasons this is not feasible: it is too difficult to generate high-level explanations for user behavior, processing cost is too high, too much background knowledge is required, and too many explanations are possible for the observed behaviors.

The Watson approach is to use limited task knowledge, at the level of how particular applications are used and how to infer content information likely to be relevant, to guide its description of relevant content. For example, Watson's knowledge includes the fact that headings in documents are likely to be important. Based on this knowledge, it describes the important content of a document by generating a term vector that gives greater weight to terms in headings. Thus content-relevance is used as an easier-to-compute proxy for task-relevance.

An issue to explore is whether it is worthwhile to preserve the context independently of a query describing information needs within that context. In this approach, the context alone would be used to select specialized search engines, which would then be presented with the query that assumes that context.

Source characterization

Our initial method for describing the focuses of specific search engines relies on the keywords selected by search engine developers to describe them. These tags provide a reasonable first pass at characterization, but there is no guarantee that they will be accurate. In some cases the inaccuracies are intentional, as search engines add popular tags merely to increase the chance that the tags for their search engines will match queries presented to other search engines, in order to increase their traffic. We plan to explore other methods for characterizing information sources, such as generating term vectors directly by crawling site contents for accessible repositories. We also plan to investigate methods for more flexible matching of page descriptions, such as using a hierarchy to provide more flexible matching for related terms.

Engine-Specific Query Generation

Being able to select specialized information sources raises interesting questions about how to transform general queries into queries that exploit the contextual focus provided by a specialized search engine. When generating a query for a general-purpose search engine such as AltaVista, much of the query content is needed to disambiguate the required context. Once a context is established by the specialized source, that information is no longer necessary. Some search engines automatically AND the terms in queries as their default processing mode (e.g., ESPN), making it possible that the additional terms included for disambiguation will prevent useful information from being retrieved. For www.humorsearch.com, which has a very small database, queries with more than two terms appear to seldom retrieve any results. In general, being able to access specialized information sources raises interesting questions of how to tailor queries to those sources, in light of both the information needed and the characteristics of the sources themselves.
Parser Selection and Wrapper Generation

Accessing specialized search engines requires having mechanisms for extracting the information that they return and making it available in a useful form. This corresponds to the well-known problem of wrapper generation. SourceSelect relies on hand-coded wrappers to access its information sources, but ideally would exploit either a standard set of wrappers to allow semi-automatic selection or wrapper learning methods (e.g., Kushmerick, Doorenbos, & Weld 1997) to facilitate the addition of new sources. Effective methods for wrapper generation are one precondition for automatic addition of new information sources.

Collating Results

A final issue is how to merge the results of multiple specialized sources. SourceSelect currently relies on heuristic clustering algorithms in Watson to group results. These algorithms use information such as the titles of pages and the structure of URLs to decide when two pages are similar. For specialized information sources, these heuristics could be augmented with heuristics that also consider the implicit context provided by the sources of the information themselves.

Perspective

The basic Watson system addresses task-relevant focusing by automatically generating queries relevant to content areas associated with the task. The addition of SourceSelect adds task-based focusing for selecting where the query is sent. The premises of this approach contrast dramatically with those of a search engine such as Google (http://www.google.com), in which the goal of a search is to find a "consensus" answer. In our approach, the goal of a search is to find the answer most relevant to a specific information-seeking context, and the use of specialized resources helps assure the relevance of the result to that context.

Surprisingly little work has been done on source selection. The most notable example, a previous version of SavvySearch (Dreilinger & Howe 1997), kept track of how well search engines handled past queries, and used vector-space retrieval to match the current query to a search engine that had previously done well with similar queries. ProFusion (Gauch & Wang 1996) used a hand-built knowledge hierarchy to categorize queries and select relevant search engines. More recently, an agent-based learning system was added to ProFusion to manipulate each engine's place in the hierarchy based on past searches (Fan & Gauch 1999).

Older systems, like Metacrawler (Selberg & Etzioni 1995), use only general search engines and send the query to all of them. Bandwidth constraints limit the number of search engines that can be queried. The current incarnation of SavvySearch (http://www.savvysearch.com) now appears to use this approach as well.

The Internet Sleuth (http://www.isleuth.com) is a search engine that indexes other specialized search engines. It allows the user to effectively perform a source-selection algorithm by hand.

Apple's Sherlock (http://www.apple.com/sherlock) allows the user to select the search engines that will be queried. This approach puts the burden of source selection entirely on the user: the user is forced to remember which search engines give the most relevant results for each type of query he may want to use.

The GlOSS (Gravano, Garcia-Molina, & Tomasic 1994) system obtains the index from each of its information sources and combines these indices to form a meta-index, which is used for source selection. The drawback of this approach is that all of the information sources must cooperate by providing their indices in order for the meta-index to be built.

EMIR (Kulyukin 1999) maintains positive and negative keyword vectors for each of its information sources. Like GlOSS, it needs the cooperation of the information sources to maintain an accurate representation of their contents.

Most of these systems use general-purpose search engines for their information sources. While general-purpose search engines provide the broadest coverage, focused search engines can have a much greater concentration of relevant links within their subject area. When Watson's contextual information is added to basic source selection, focused search engines appear to provide better results than general search engines.

Conclusion

SourceSelect augments Watson's just-in-time retrieval framework with the capability to choose specialized information sources related to the current context. The goal is to leverage off existing information resources to automatically provide the user with task-relevant information. The current version of SourceSelect matches term-vector descriptions of the content area of interest to descriptions from the tags of specialized search engines to select sources expected to be relevant to those content areas, queries those sources, and forwards the results to Watson for presentation to the user. Initial tests have been encouraging; next steps include the addition of other specialized search engines, formal evaluation, and exploration of alternative methods for describing task-relevant content and selecting information sources.

References

Budzik, J., and Hammond, K. 1999. Watson: a just-in-time information environment. In AAAI Workshop on Intelligent Information Systems. In press.

Budzik, J.; Hammond, K.; Marlow, C.; and Scheinkman, A. 1998. Anticipating information needs: Everyday applications as interfaces to internet information resources. In Proceedings of the 1998 World Conference on the WWW, Internet, and Intranet.
Dreilinger, D., and Howe, A. 1997. Experiences with selecting search engines using meta-search. ACM Transactions on Information Systems 15(3).

Fan, Y., and Gauch, S. 1999. Adaptive agents for information gathering from multiple distributed information sources. In Proceedings of the 1999 AAAI Spring Symposium on Intelligent Agents in Cyberspace. AAAI Press.

Gauch, S., and Wang, G. 1996. Information fusion with ProFusion. In WebNet '96: The First World Conference of the Web Society, 174-179.

Gravano, L.; Garcia-Molina, H.; and Tomasic, A. 1994. Precision and recall of GlOSS estimators for database discovery. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (PDIS '94).

Kulyukin, V. 1999. Application-embedded retrieval from distributed free-text collections. In Proceedings of the Sixteenth National Conference on Artificial Intelligence. AAAI Press. In press.

Kushmerick, N.; Doorenbos, R.; and Weld, D. 1997. Wrapper induction for information extraction. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann.

Salton, G., and McGill, M. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Selberg, E., and Etzioni, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the Fourth World Wide Web Conference, 195-208.