Static Score Bucketing in Inverted Indexes
Chavdar Botev Nadav Eiron, Marcus Fontoura, Ning Li,
Cornell University Eugene Shekita
**** ***** **** *** ******* Research Center
Ithaca, NY, USA 650 Harry Rd
San Jose, CA, USA
******@**.*******.***
*****@*******.***.***
ABSTRACT mentally show that the degradation in quality is controllable
and can be made negligible.
Maintaining strict static score order of inverted lists is a
heuristic used by search engines to improve the quality of
2. INDEXING ALGORITHM
query results when the entire inverted lists cannot be pro-
cessed. This heuristic, however, increases the cost of index Our indexing algorithm is based on a re-merge index-
generation and requires complex index build algorithms. In update strategy, which Lester et al. [5] found to have good
this paper, we study a new index organization based on performance. In the re-merge strategy, new documents are
static score bucketing. We show that this new technique sig- added to a delta index, which can be an in-memory index.
ni cantly improves in index build performance while having When the delta index gets to a certain size it is merged with
minimal impact on the quality of search results. the main index to produce the a new main index and the
delta index is reset. The challenge is that using static rank
Categories and Subject Descriptors H.3[Information
to dictate inverted list order presents problems when merg-
Systems]: Information Storage and Retrieval; E.1[Data]:
ing indexes, especially when the docid order is the same as
Data Structures
the static score order [3]. Since docids need to be changed
General Terms: Algorithms, Performance
to re ect new static scores, a full sort of the output posting
Keywords: Indexing, Search engines, Static scoring lists is required to re ect the new scores in new main index.
For instance, when a single document with a high score is
1. INDEXING WITH STATIC SCORE added to the system, most of the docids become invalid.
We now present the re-merge algorithm based on static
BUCKETING
score bucketing. The algorithm has a time complexity that
In previous work, Long and Suel [6] have proposed an in- is linear in the combined size of the main and delta indexes.
verted lists organization that is based on a static rank order Our algorithm uses the following in-memory structures: (i)
of the postings. Such an organization improves the result remList : the list of documents that need to be removed,
quality when the entire index cannot be scanned. However, and (ii) changeTable : a mapping from the docid of docu-
it degrades index build performance and increases the com- ments that changed their bucket to their new bucket num-
plexity of index build algorithms [3]. This issue stems from ber. These structures can be readily stored in memory.
the fact that a change in the score of a single document may Function IndexMerge
translate to many updates to postings lists. 1 for each Term t {
In this work, we propose relaxing the total static score 2 bucketPages = new Page[numberOfbuckets];
order to a partial order. The partial order groups postings 3 main[] = mainIndex.getPostings(t);
with similar static scores into buckets. Thus, we only need 4 delta[] = deltaIndex.getPostings(t);
to maintain order across buckets and we allow the docu- 5 while (exists main[] and exists delta[]) {
ment order inside of a bucket to be arbitrary. For example, 6 curDocid = min docid from main[] and delta[];
postings inside a bucket can be stored in an increasing do- 7 if (remList.contains(curDocid)) continue;
cid (document identi er) order, which need not be the same 8 if (changeTable.contains(curDocid))
as the static score order. This allows for an e cient query 9 curBucket = changeTable.lookup(curDocid);
evaluation, since posting lists can be joined, and for e cient 10 else curBucket = current bucket for curDocid;
indexing algorithms, since a full sort on the static scores 11 addToBucket(curBucket, curDocid, bucketPages);
can be avoided. Although this approach can lead to slight 12 } //end while (main merge loop);
degradation in the quality of the returned results, we experi- 13 } //end for
The algorithm merges, for each term t, the corresponding
postings lists from the delta and the main indexes: delta[]
and main[] arrays respectively. The size of these arrays is
the number of buckets and delta[] and main[] contain one
posting list per bucket for the given term t. These posting
Copyright is held by the author/owner.
lists are sorted by docid. Each iteration of the main loop
CIKM 05, October 31 November 5, 2005, Bremen, Germany.
processes the posting with the minimum docid (line 6). If its
ACM 1-59593-140-6/05/0010.
311
document has been deleted, that posting entry is not added
to the merged index (line 7). The output postings are parti-
tioned into several output streams (the bucketPages array)
based on the static score bucket of the posting document.
These output stream are represented by bu er pages, which
are ushed to disk whenever the they ll up. If the docu-
ment in question is in the changeTable, it is redirected to the
output stream corresponding to the new static score bucket
(line 8). Otherwise, it remains on the output stream that
corresponds to its old bucket (line 11).
3. EXPERIMENTAL RESULTS
We performed two main sets of experiments. The rst set Figure 1: Static score = in-degree, Number of buck-
tested the index build performance. The second set of exper- ets = 64
iments tested the change in quality of the results when using
Kendal Tau similarity measure [2]) varies with the number
inverted lists bucketing. For both sets of experiments, re-
of scanned documents. The ground truth for our experi-
sults were compared to the baseline system with documents
ments were the results returned by the search engine using
completely ordered by their static scores.
its standard total ordering of t he inverted lists based on the
Experimental setup. Our scheme introduces several
static score, and scanning the entire inverted lists (i.e. no
tunable parameters that potentially in uence performance.
early-termination ).
The bucket type parameter speci es the function used for
Figure 1 presents the experiments using in-degree static
bucketing. We follow Haveliwala [4] who describes static
score and 64 buckets. The experiment compares the di er-
score quantization methods in the related problem of nd-
ent bucketing types when the number of scanned postings
ing e cient encodings of static scores. For our experiments,
increases. As a reference, we have also included the results
we used the following non-linear quantizations: logarith-
p
with early termination when no bucketing is applied (the
mic G(x) logx, square root G(x) (x), exponential
no-buckets series).
G(x) xb and equi-depth where each bucket contains ap-
As it can be expected, the di erence to the ground truth
proximately the same number of documents. We also experi-
quickly decreases when the engine scans larger potions. The
mented with an additional bucketing scheme, which we refer
comparison to the reference no-buckets series shows that
to as adaptive. It groups documents so that the maximum
most of this di erence can be attributed to the early ter-
di erence of the static scores within a bucket does not exceed
mination and not to the bucketing type. Furthermore, the
a xed value. A second parameter that we tested is the num-
gure shows that the di erences between the various bucket-
ber of buckets per inverted list. Finally, we experimented
ing types are small. Experiments with larger values for the
with two types of static score methods: in-degree (the num-
number of documents scanned (not presented due to lack
ber of hosts referring to the document) [3] and PageRank (a
of space) show that the di erences get even smaller when
measure of the popularity based on the random walk model
larger potions of the posting lists are scanned.
[1]).
The experiments with PageRank show similar trends and
We performed our experiments with a real-world query
we do not present them due to lack of space. The di erences
load from the IBM intranet: a sample of 277 unique queries
between the tested bucketing schemes were even smaller due
out of 116,313 total queries. All the experiments were exe-
to the higher precision of the PageRank scores, which allows
cuted using the Trevi intranet search engine [3].
better clustering of the postings into buckets.
Experiments on indexing performance. For these
experiments, we implemented the index re-merge algorithm
described in Section 2. We compared the proposed algo- 4. REFERENCES
rithm with a re-merge algorithm based on strict static score [1] S. Brin and L. Page. The anatomy of a large-scale hypertextual
(web) search engine. Computer Networks and ISDN Systems,
ordering. We built an index with four buckets using all the
30(1 7):107 117, 1998.
bucket types previously described. The results for all the [2] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists.
bucketing methods we tested were similar. SIAM Journal on Discrete Mathematics, 17(1):134 160, 2003.
We xed a main index for 500K documents and varied [3] M. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and
J. Zien. High performance index build algorithms for intranet
the delta index size, from 64,000 documents to 128,000 doc-
search engines. In VLDB 2004.
uments. The results indicated that using static score buck- [4] T. Haveliwala. E cient encoding for document ranking vectors.
eting we can achieve more than 20% increase in index build In Proc. of 4th Int. Conference on Internet Computing, 2003.
performance to an already highly-optimized indexing algo- [5] N. Lester, J. Zobel, and H. E. Williams. In-place versus re-build
versus re-merge: index maintenance strategies for text retrieval
rithm. Furthermore, the results are independent of the size
systems. In CRPIT 2004.
of the delta index being merged. Thus, a search engine us- [6] X.Long and T. Suel. Optimized query execution in large search
ing static score bucketing can have an increased indexing engines with global page ordering. In Proc. of the 29th Int.
Conf. on Very Large Databases, 2003.
throughput. Therefore, the search engine can update its in-
dex more frequently, resulting in more up-to-date content
provided to the users.
Experiments on query-results quality. These experi-
ments illustrate how the query results change when we apply
static score bucketing to the inverted lists. We tested how
the number of inversion between the result rankings (the
312