Quality It

Location:

Ithaca, NY

Posted:

January 27, 2013

Contact this candidate

Resume:

Static Score Bucketing in Inverted Indexes

Chavdar Botev Nadav Eiron, Marcus Fontoura, Ning Li,

Cornell University Eugene Shekita

**** ***** **** *** ******* Research Center

Ithaca, NY, USA 650 Harry Rd

San Jose, CA, USA

******@**.*******.***

*****@*******.***.***

ABSTRACT mentally show that the degradation in quality is controllable

and can be made negligible.

Maintaining strict static score order of inverted lists is a

heuristic used by search engines to improve the quality of

2. INDEXING ALGORITHM

query results when the entire inverted lists cannot be pro-

cessed. This heuristic, however, increases the cost of index Our indexing algorithm is based on a re-merge index-

generation and requires complex index build algorithms. In update strategy, which Lester et al. [5] found to have good

this paper, we study a new index organization based on performance. In the re-merge strategy, new documents are

static score bucketing. We show that this new technique sig- added to a delta index, which can be an in-memory index.

ni cantly improves in index build performance while having When the delta index gets to a certain size it is merged with

minimal impact on the quality of search results. the main index to produce the a new main index and the

delta index is reset. The challenge is that using static rank

Categories and Subject Descriptors H.3[Information

to dictate inverted list order presents problems when merg-

Systems]: Information Storage and Retrieval; E.1[Data]:

ing indexes, especially when the docid order is the same as

Data Structures

the static score order [3]. Since docids need to be changed

General Terms: Algorithms, Performance

to re ect new static scores, a full sort of the output posting

Keywords: Indexing, Search engines, Static scoring lists is required to re ect the new scores in new main index.

For instance, when a single document with a high score is

1. INDEXING WITH STATIC SCORE added to the system, most of the docids become invalid.

We now present the re-merge algorithm based on static

BUCKETING

score bucketing. The algorithm has a time complexity that

In previous work, Long and Suel [6] have proposed an in- is linear in the combined size of the main and delta indexes.

verted lists organization that is based on a static rank order Our algorithm uses the following in-memory structures: (i)

of the postings. Such an organization improves the result remList : the list of documents that need to be removed,

quality when the entire index cannot be scanned. However, and (ii) changeTable : a mapping from the docid of docu-

it degrades index build performance and increases the com- ments that changed their bucket to their new bucket num-

plexity of index build algorithms [3]. This issue stems from ber. These structures can be readily stored in memory.

the fact that a change in the score of a single document may Function IndexMerge

translate to many updates to postings lists. 1 for each Term t {

In this work, we propose relaxing the total static score 2 bucketPages = new Page[numberOfbuckets];

order to a partial order. The partial order groups postings 3 main[] = mainIndex.getPostings(t);

with similar static scores into buckets. Thus, we only need 4 delta[] = deltaIndex.getPostings(t);

to maintain order across buckets and we allow the docu- 5 while (exists main[] and exists delta[]) {

ment order inside of a bucket to be arbitrary. For example, 6 curDocid = min docid from main[] and delta[];

postings inside a bucket can be stored in an increasing do- 7 if (remList.contains(curDocid)) continue;

cid (document identi er) order, which need not be the same 8 if (changeTable.contains(curDocid))

as the static score order. This allows for an e cient query 9 curBucket = changeTable.lookup(curDocid);

evaluation, since posting lists can be joined, and for e cient 10 else curBucket = current bucket for curDocid;

indexing algorithms, since a full sort on the static scores 11 addToBucket(curBucket, curDocid, bucketPages);

can be avoided. Although this approach can lead to slight 12 } //end while (main merge loop);

degradation in the quality of the returned results, we experi- 13 } //end for

The algorithm merges, for each term t, the corresponding

postings lists from the delta and the main indexes: delta[]

and main[] arrays respectively. The size of these arrays is

the number of buckets and delta[] and main[] contain one

posting list per bucket for the given term t. These posting

lists are sorted by docid. Each iteration of the main loop

CIKM 05, October 31 November 5, 2005, Bremen, Germany.

processes the posting with the minimum docid (line 6). If its

ACM 1-59593-140-6/05/0010.

311

document has been deleted, that posting entry is not added

to the merged index (line 7). The output postings are parti-

tioned into several output streams (the bucketPages array)

based on the static score bucket of the posting document.

These output stream are represented by bu er pages, which

are ushed to disk whenever the they ll up. If the docu-

ment in question is in the changeTable, it is redirected to the

output stream corresponding to the new static score bucket

(line 8). Otherwise, it remains on the output stream that

corresponds to its old bucket (line 11).

3. EXPERIMENTAL RESULTS

We performed two main sets of experiments. The rst set Figure 1: Static score = in-degree, Number of buck-

tested the index build performance. The second set of exper- ets = 64

iments tested the change in quality of the results when using

Kendal Tau similarity measure [2]) varies with the number

inverted lists bucketing. For both sets of experiments, re-

of scanned documents. The ground truth for our experi-

sults were compared to the baseline system with documents

ments were the results returned by the search engine using

completely ordered by their static scores.

its standard total ordering of t he inverted lists based on the

Experimental setup. Our scheme introduces several

static score, and scanning the entire inverted lists (i.e. no

tunable parameters that potentially in uence performance.

early-termination ).

The bucket type parameter speci es the function used for

Figure 1 presents the experiments using in-degree static

bucketing. We follow Haveliwala [4] who describes static

score and 64 buckets. The experiment compares the di er-

score quantization methods in the related problem of nd-

ent bucketing types when the number of scanned postings

ing e cient encodings of static scores. For our experiments,

increases. As a reference, we have also included the results

we used the following non-linear quantizations: logarith-

with early termination when no bucketing is applied (the

mic G(x) logx, square root G(x) (x), exponential

no-buckets series).

G(x) xb and equi-depth where each bucket contains ap-

As it can be expected, the di erence to the ground truth

proximately the same number of documents. We also experi-

quickly decreases when the engine scans larger potions. The

mented with an additional bucketing scheme, which we refer

comparison to the reference no-buckets series shows that

to as adaptive. It groups documents so that the maximum

most of this di erence can be attributed to the early ter-

di erence of the static scores within a bucket does not exceed

mination and not to the bucketing type. Furthermore, the

a xed value. A second parameter that we tested is the num-

gure shows that the di erences between the various bucket-

ber of buckets per inverted list. Finally, we experimented

ing types are small. Experiments with larger values for the

with two types of static score methods: in-degree (the num-

number of documents scanned (not presented due to lack

ber of hosts referring to the document) [3] and PageRank (a

of space) show that the di erences get even smaller when

measure of the popularity based on the random walk model

larger potions of the posting lists are scanned.

[1]).

The experiments with PageRank show similar trends and

We performed our experiments with a real-world query

we do not present them due to lack of space. The di erences

load from the IBM intranet: a sample of 277 unique queries

between the tested bucketing schemes were even smaller due

out of 116,313 total queries. All the experiments were exe-

to the higher precision of the PageRank scores, which allows

cuted using the Trevi intranet search engine [3].

better clustering of the postings into buckets.

Experiments on indexing performance. For these

experiments, we implemented the index re-merge algorithm

described in Section 2. We compared the proposed algo- 4. REFERENCES

rithm with a re-merge algorithm based on strict static score [1] S. Brin and L. Page. The anatomy of a large-scale hypertextual

(web) search engine. Computer Networks and ISDN Systems,

ordering. We built an index with four buckets using all the

30(1 7):107 117, 1998.

bucket types previously described. The results for all the [2] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists.

bucketing methods we tested were similar. SIAM Journal on Discrete Mathematics, 17(1):134 160, 2003.

We xed a main index for 500K documents and varied [3] M. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and

J. Zien. High performance index build algorithms for intranet

the delta index size, from 64,000 documents to 128,000 doc-

search engines. In VLDB 2004.

uments. The results indicated that using static score buck- [4] T. Haveliwala. E cient encoding for document ranking vectors.

eting we can achieve more than 20% increase in index build In Proc. of 4th Int. Conference on Internet Computing, 2003.

performance to an already highly-optimized indexing algo- [5] N. Lester, J. Zobel, and H. E. Williams. In-place versus re-build

versus re-merge: index maintenance strategies for text retrieval

rithm. Furthermore, the results are independent of the size

systems. In CRPIT 2004.

of the delta index being merged. Thus, a search engine us- [6] X.Long and T. Suel. Optimized query execution in large search

ing static score bucketing can have an increased indexing engines with global page ordering. In Proc. of the 29th Int.

Conf. on Very Large Databases, 2003.

throughput. Therefore, the search engine can update its in-

dex more frequently, resulting in more up-to-date content

provided to the users.

Experiments on query-results quality. These experi-

ments illustrate how the query results change when we apply

static score bucketing to the inverted lists. We tested how

the number of inversion between the result rankings (the

312

Contact this candidate