
EFFICIENT AND SECURE DOCUMENT SIMILARITY

SEARCH OVER CLOUD UTILIZING MAPREDUCE

by Mahmoud Alewiwi

Submitted to the Graduate School of Engineering and Natural Sciences

in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Sabanci University

December, 2015


APPROVED BY:

Prof. Dr. Erkay SAVAŞ

(Thesis Supervisor)

Prof. Dr. Yücel SAYGIN

(Internal Examiner)

Assoc. Prof. Dr. Kemal KILIÇ

(Internal Examiner)

Asst. Prof. Dr. Selçuk BAKTIR

(External Examiner)

Asst. Prof. Dr. Ahmet Onur DURAHIM

(External Examiner)

DATE OF APPROVAL:

© Mahmoud Alewiwi 2015

All Rights Reserved

Acknowledgments

I wish to express my gratitude to my supervisor Erkay Savaş for his invaluable guidance, support and patience throughout my thesis. I am also grateful to Cengiz Örencik for his guidance and valuable contributions to this thesis.

Special thanks to my colleague Ayşe Selçuk for her collaboration in administering the Hadoop framework and her kind suggestions. I am grateful to all my friends from the Cryptography and Information Security Lab (i.e., FENS 2001) at Sabanci University and the Data Security and Privacy Lab for being very supportive.

I am indebted to the members of my thesis committee for reviewing my thesis and providing very useful feedback.

I am grateful to TÜBİTAK (The Scientific and Technological Research Council of Turkey) for its support under Grant Number 113E537. Especially, I would like to thank my family, wife, and sons for being patient during my study. I owe acknowledgment to them for their encouragement and love throughout difficult times in my graduate years.


Mahmoud Alewiwi

Computer Science and Engineering

Ph.D. Thesis, 2015

Thesis Supervisor: Prof. Dr. Erkay Savaş

Keywords: Similarity, Privacy, Cloud Computing, MapReduce, Hadoop, Cryptography, Encryption

Abstract

Document similarity has important real-life applications such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms that find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents sharing a common feature necessitates prohibitively high storage and computation power. The widespread availability of cloud computing provides users easy access to high storage and processing power. Furthermore, outsourcing their data to the cloud guarantees reliability and availability of their data, while privacy and security concerns are not always properly addressed. This leads to the problem of protecting the privacy of sensitive data against adversaries, including the cloud operator.

Generally, traditional document similarity algorithms tend to compare the query document with all the documents in a data set that share terms (words) with it. In our work, we propose a new filtering technique that works on plaintext data and decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The technique, referred to as the ZOLIP algorithm, is efficient and scalable, but does not provide security. We also design and implement three secure similarity search algorithms for text documents, namely Secure Sketch Search, Secure Minhash Search and Secure ZOLIP. The first algorithm utilizes locality sensitive hashing techniques and cosine similarity. While the second algorithm uses the Minhash Algorithm, the last one uses the encrypted ZOLIP Signature, which is the secure version of the ZOLIP algorithm.

We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to the big data setting. Our experimental results on real data show that some of the proposed methods perform better than previous work in the literature in terms of the number of joins and, therefore, speed.


MAPREDUCE İLE BULUT ÜZERİNDE DOKÜMANLAR İÇİN VERİMLİ VE GÜVENLİ BENZERLİK HESAPLAMA

Mahmoud Alewiwi

Computer Science and Engineering

Ph.D. Thesis, 2015

Thesis Supervisor: Prof. Dr. Erkay Savaş

Özet

Similarity search between documents has important real-life applications, such as finding duplicate web pages or detecting plagiarism. Although basic techniques such as the k-similarity algorithm have long been available in the literature, such simple techniques are slow and inadequate for big data applications, where working with very large volumes of data is required. In particular, comparing documents pairwise to check whether they contain a common term creates very high storage and computational power requirements. The rapid spread of cloud computing answers these needs. Although sharing data through such cloud service providers guarantees its accessibility, the privacy and confidentiality of the data cannot be guaranteed. This situation has raised the problem of protecting the privacy of sensitive data in particular.

Traditional algorithms for finding similarity between documents usually require comparing the query document with all other documents in the database. In the system we propose, a new filtering technique is used that significantly reduces the number of comparisons required over plaintext (unencrypted) data. This system works efficiently for similarity comparisons over plaintext data and is scalable, but provides no security.

In addition to this system, three secure similarity search algorithms that also preserve privacy (Secure Sketch Search, Secure Minhash Search and Secure ZOLIP) have been designed. The first computes the cosine similarity between documents using locality sensitive hashing techniques. The second method uses MinHash algorithms, while the third computes similarity using the encrypted versions of the ZOLIP signatures that we previously designed for plaintext data. To make the proposed methods scalable for big data, we utilize the Hadoop distributed file system and the MapReduce parallel programming model. Our experiments on real data show that some of the proposed methods require fewer join/comparison operations than other systems in the literature, and are therefore faster.

Contents

Acknowledgments
Abstract
Özet

1 INTRODUCTION
  1.1 Motivations
  1.2 Contributions
  1.3 Outline

2 RELATED WORKS
  2.1 Related Work on Similarity Search

3 PRELIMINARIES
  3.1 Term Relevancy Score
  3.2 Cosine Similarity
  3.3 Z-Order Mapping
  3.4 Locality Sensitive Hashing (LSH)
  3.5 Hadoop and MapReduce Framework
  3.6 Hash-based Message Authentication Code (HMAC)

4 EFFICIENT DOCUMENT SIMILARITY SEARCH UTILIZING Z-ORDER PREFIX FILTERING
  4.1 Introduction
  4.2 The Proposed Filtering Method
    4.2.1 Phase 1: Near-Duplicate Detection (NDD)
    4.2.2 Phase 2: Common Important Terms (CIT)
    4.2.3 Phase 3: Join Phase (JP)
    4.2.4 R-S Join
  4.3 Experiments
    4.3.1 Setup and Data Description
    4.3.2 Performance Analysis
    4.3.3 Accuracy Analysis
  4.4 Conclusion

5 SECURE DOCUMENT SIMILARITY SEARCH UTILIZING SECURE SKETCHES
  5.1 Problem Definition
  5.2 Secure Similarity Search
  5.3 Secure Sketch Construction
  5.4 Enhanced Security
  5.5 Security Analysis
  5.6 Implementation
  5.7 Similarity Evaluation
  5.8 Conclusion

6 SECURE DOCUMENT SIMILARITY SEARCH UTILIZING MINHASH
  6.1 The Framework
  6.2 Security Model
  6.3 Proposed Method
    6.3.1 Secure Index Generation
    6.3.2 Secure Query Generation
    6.3.3 Secure Search
  6.4 Security Analysis
  6.5 Experiments
  6.6 Conclusion

7 EFFICIENT, SECURE DOCUMENT SIMILARITY SEARCH UTILIZING Z-ORDER SPACE FILLING CURVES
  7.1 Introduction
  7.2 Problem Definition
  7.3 Secure ZOLIP Similarity Search
    7.3.1 Secure Index and Query Generation
    7.3.2 Secure Search
  7.4 Security Analysis
  7.5 Experimental Results
  7.6 Conclusion

8 CONCLUSION AND FUTURE WORK

List of Figures

3.1 Z-Order Space Filling
3.2 Data Points on Z-Order Curve
3.3 MapReduce Job Execution
4.1 An Example Execution of ZOLIP Phase 1
4.2 An Example Execution of ZOLIP Phase 2
4.3 An Example Execution of ZOLIP Phase 3
4.4 Performance Comparison between the Proposed Algorithm (ZOLIP) and the Method by Vernica et al. [1] for k = 10
4.5 Effect of Increase in λ on Efficiency for k = 10
4.6 Running Time of Each Phase for λ = 8
4.7 Running Time of Each Phase for Different k, where Query Size is 10,000 and λ = 8
4.8 Running Times for the Reuters Data Set for λ = 8
5.1 Average Accuracy Rate
5.2 Time Complexity for Sketch Similarity Search, D = 510,000
5.3 Time Complexity for Encrypted Sketch Similarity Search
6.1 The Framework
6.2 Flowchart of Secure Index Generation
6.3 Average Precision Rates for k-NN Search with Different λ and k
6.4 Average Search Time for k-NN Search with Different λ
7.1 Flowchart of Secure Index and Query Generation
7.2 Time Complexity
7.3 Average Accuracy Rate

List of Tables

4.1 Average of Missed Queries of ZOLIP Filtering Algorithm with k = 2
4.2 Accuracy of the Top-k Documents for ZOLIP Filtering Algorithm with k = 2
4.3 Accuracy of the Top-k Documents for ZOLIP Filtering Algorithm with k = 2
4.4 Accuracy of ZOLIP Filtering Algorithm with Different Values of k when λ = 8
4.5 Relative Error on the Sum (RES) for Different Values of k when λ = 8
5.1 Common Notations
7.1 Common Notations

List of Algorithms

1 Near-Duplicate Detection (NDD)
2 Common Important Terms (CIT)
3 Join Phase
4 Secure Multiplication (E(ab))
5 Enhanced Secure Similarity Search
6 Secure Index Generation
7 Secure Query Generation

Chapter 1

INTRODUCTION

Big data, which refers not only to the huge amount of data being collected but also to the associated opportunities, has great potential for major improvements in many fields, from health care to business management. Therefore, there is an ever-increasing demand for efficient and scalable tools that can analyze and process immense amounts of possibly unstructured data, which keep increasing in size and complexity.

Finding similarities (or duplicates) among multiple data items, or documents, is one of the fundamental operations, which can be extremely challenging due to the nature of big data. In particular, similarity search on a huge data set, where the documents are represented as multi-dimensional feature vectors, necessitates pair-wise comparisons, which require the computation of a distance metric and can therefore be very time and resource consuming, if not infeasible.
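To make the cost concrete, a brute-force search over such feature vectors can be sketched as follows. This is an illustrative Python sketch, not the thesis's implementation; the function names and toy vectors are our own. Each query forces a cosine computation against every document in the set:

```python
import math

def cosine(u, v):
    # u, v: sparse term-weight vectors as {term: weight} dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def brute_force_knn(query, docs, k):
    # one distance computation per document: O(|docs|) per query,
    # O(n^2) for an all-pairs self join -- the scalability bottleneck
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]
```

For n documents and q queries this performs q·n full vector comparisons, which is what motivates the filtering and MapReduce techniques developed later in the thesis.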

However, establishing such a powerful infrastructure may be too costly or simply not feasible, especially for small and medium-sized enterprises (SMEs). Cloud computing offers an ideal solution to this problem. The currently available cloud services can provide both storage and computation capability for massive volumes of data. This motivates us to find new and efficient document similarity search algorithms that can work in the big data setting utilizing cloud computing.

While data outsourcing to the cloud is a feasible solution for many organizations, the fact that the outsourced data may contain sensitive information leads to privacy breaches [2, 3]. Secure processing of outsourced data operations requires protecting the confidentiality of both the outsourced data and the submitted queries. Moreover, it also requires maintaining the confidentiality of patterns, such as different accesses/queries aiming to retrieve the same data. Encryption of data prior to outsourcing may provide confidentiality of the content of the data. However, classical encryption methods do not allow even simple operations over the ciphertext. In our work, we first concentrate on finding similar documents over plaintext data sets using an efficient algorithm, which utilizes a new filtering technique and the cosine similarity between two documents. Then, we propose secure search algorithms that aim to find similar documents without revealing sensitive data.

1.1 Motivations

While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms to find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents' features, a key operation in calculating document similarity, necessitates prohibitively high storage and computation power.

Finding similarities (or duplicates) among multiple data items, or documents, is one of the fundamental operations, which can be extremely challenging due to the nature of big data. In particular, similarity search on a huge data set, where the documents are represented as multi-dimensional feature vectors, necessitates pair-wise comparisons, which require the computation of a distance metric and can therefore be very time and resource consuming, if not infeasible.

A commonly used technique, known as filtering, decreases the number of pairwise comparisons by skipping the comparison of two documents if they are not potentially similar, e.g., if they do not share any common feature. The representation, storage, management and processing of documents also play an important role in the performance of a similarity search method. A distributed file system and a parallel programming model such as MapReduce [4] are necessary components of a scalable and efficient solution in big data applications.
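The filtering idea can be illustrated with a minimal inverted-index sketch (illustrative Python under the assumption of a simple term-based index, not the thesis's actual data structures): only documents sharing at least one term with the query survive as candidates, and everything else is never compared.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: set of terms}; returns {term: set of doc_ids}
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def candidates(query_terms, index):
    # union of posting lists: documents sharing no term with the
    # query are filtered out before any similarity computation
    result = set()
    for term in query_terms:
        result |= index.get(term, set())
    return result
```

Prefix-filtering variants restrict the index to each document's most significant terms, which shrinks the candidate sets further at the cost of some bookkeeping.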

From a security perspective, secure data mining operations require protecting the confidentiality of the outsourced data, of the index that provides search capability over it, and of the submitted queries. Moreover, they also require maintaining the confidentiality of the search and access patterns, such as different accesses/queries aiming at the same data. Data encryption before outsourcing may provide confidentiality of the content of the data. However, classical encryption methods do not allow even simple operations over the ciphertext. In the past few years, several solutions have been proposed for efficient search operations over encrypted data utilizing a searchable index structure that accurately represents the actual data without revealing sensitive information.


1.2 Contributions

This thesis focuses on the general problem of detecting the k most similar documents for a given (set of) document(s). It presents four novel algorithms: i) one algorithm for unprotected document sets aiming at a fast and scalable search operation based on filtering, and ii) three algorithms for secure search operations utilizing various encryption techniques. In the first algorithm, where the search is performed over plaintext data, two cases are considered: i) finding the k most similar documents for each document within a given data set (self join), and ii) finding the k most similar documents for each document in one set from another set (R-S join), for instance a query set and a data set. In the secure search algorithms, only the R-S join is considered, as self join is not feasible due to the large sizes of the data sets used in the experiments. The contributions of this thesis, as well as the techniques employed, are summarized as follows:

• We propose an efficient document similarity algorithm that searches for similar documents over plaintext data sets.

• We utilize the Z-order and propose a Z-order prefix filtering technique to enhance the efficiency of the algorithm.

• We utilize term frequency-inverse document frequency (tf-idf) as a term relevancy score for the weight, or importance, of a term/word in a document.

• We use the cosine similarity metric to find the similarity between documents whenever possible.

• We also propose several approaches that enable enhanced security properties such as search and access pattern privacy and document and query confidentiality.

• We propose three secure document similarity search schemes. The first one is based on secure sketches. The second one is based on locality sensitive hashing (LSH), i.e., MinHash. The last one uses the Z-order prefix encrypted using the HMAC algorithm. The security properties of the proposed algorithms differ: while some of them provide access and search pattern privacy in addition to document and query confidentiality, the others provide only basic security for data and query privacy.

• For all the above algorithms, we use the MapReduce parallel processing framework, which is a popular computing model for big data applications.
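The keyed-hash idea mentioned in the bullet points (Z-order prefixes encrypted with HMAC) can be sketched as follows: tokens are replaced by HMAC digests so that the server can match equal tokens without learning their plaintext. The key and the prefix string below are hypothetical placeholders; the thesis's exact token format is not reproduced here.

```python
import hmac
import hashlib

def trapdoor(key: bytes, token: str) -> str:
    # deterministic keyed digest: the same (key, token) pair always
    # yields the same value, so an untrusted server can test equality
    # of index and query tokens without recovering the token itself
    return hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"owner-secret"                        # hypothetical shared key
index_token = trapdoor(key, "zprefix:0110")  # stored in the secure index
query_token = trapdoor(key, "zprefix:0110")  # generated at query time
# the two tokens are equal, so the server can match them blindly
```

Determinism is what makes server-side matching possible, but it also means equal tokens are linkable, which is why search and access pattern privacy is treated as a separate, stronger security property.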

1.3 Outline

The thesis is organized as follows: the next chapter (Chapter 2) presents a literature review of prior work related to document similarity over plain and encrypted data and indexes. In Chapter 3, we provide the preliminaries that will be used throughout the thesis. In Chapter 4, we introduce a novel document similarity search algorithm that is based on Z-order prefix filtering. Chapters 5, 6 and 7 give the details of three different secure document similarity search algorithms, respectively. In Chapter 5, we explain the Secure Sketch algorithm. In Chapter 6, a secure search algorithm based on a locality sensitive hash function known as MinHash is explained. In Chapter 7, we explain the Secure ZOLIP algorithm, which is the secure version of the algorithm given in Chapter 4. Finally, Chapter 8 concludes the thesis.


Chapter 2

RELATED WORKS

This chapter presents a short survey of previous works in the literature related to document similarity over plaintext and encrypted documents. The efficiency and accuracy of different algorithms are discussed, and their advantages and disadvantages are pointed out.

2.1 Related Work on Similarity Search

In the literature, the problem of set-similarity on a single machine has been considered in several works [5–8]. These works mainly focus on reducing the complexity of the vector similarity join. Angiulli et al. [9] used the Z-order space filling curve to find the similarity between two high-dimensional spatial data sets using Minkowski metrics. This method performs well for finding close pairs in high-dimensional data, but it is not suitable for text-based similarity detection. For text-based similarity, as in the case of the document similarity problem, the cosine similarity metric is more suitable than the Minkowski metric.
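For reference, the Z-order (Morton) mapping these works build on interleaves the bits of the coordinates, so that points close in space tend to be close on the resulting one-dimensional curve. A two-dimensional sketch follows (documents in the thesis live in far higher dimensions; this is only an illustration of the bit interleaving):

```python
def z_order(x: int, y: int, bits: int = 4) -> int:
    # interleave the bits of x and y into a single Morton code:
    # bit i of x lands at position 2i, bit i of y at position 2i+1
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

Sorting points by their Z-value then lets nearby points be found by scanning a small window of the sorted order, which is the property Z-order-based filtering methods exploit.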

Connor and Kumar [10] suggested another technique for the similar document detection problem. They used a binary search technique to find the k-nearest neighbors (k-NN) within a selected Z hypercube. A popular approach in other works is adopting filtering techniques that filter out pairs that cannot surpass a given similarity threshold. Filtering decreases the number of candidates for the computation of the similarity metric and, therefore, the number of similarity join operations, by eliminating the documents that do not share a common important term with the query.

There are various filtering techniques used in the literature. A prefix filtering method is suggested by Chaudhuri et al. [7]. The length filtering method is utilized in the works [5] and [8]. Positional and suffix filters are proposed by Xiao et al. [11]. Sarawagi and Kirpal [6] proposed a method called PPJoin+ that utilizes an inverted index and uses a Pair-Count algorithm, which generates pairs that share a certain number of tokens. Arasu et al. [5] proposed a signature-based method, in which the features of documents are represented by signatures and the similarity among the documents is calculated using the similarity of the underlying signatures. Zhu et al. [12] suggested a searching technique based on cosine similarity. They proposed an algorithm that utilizes a diagonal traversal strategy to filter out unrelated documents. In this algorithm, the elements in the data set are represented by binary vectors, meaning that only the existence of terms is considered, ignoring their frequencies or importance in the data set.

The MapReduce computing model has also been considered for the similarity search problem, leading to parallel join algorithms for large data sets that are stored on cloud servers. Elsayed et al. [13] suggested a MapReduce model with a full-filtering technique. They used a simple filter that finds only the pairs that share common tokens. The proposed method is composed of two phases. While the first phase parses and creates the indexes for the terms in each document, the second phase finds the similar pairs that share these terms. Vernica et al. [1] used the PPJoin+ method [6] to perform the self-join operation. Yang et al. [14] proposed a method that uses prefix and suffix filters with two phases of MapReduce. An inverted index is used in [15], combined with prefix filtering. A modified double-pass MapReduce prefix filtering method was proposed by Baraglia et al. [16]. Phan et al. [17] used Bloom filtering for building similarity pairs, in which each pair should intersect in at least one position with the arrays generated by the Bloom filters.

To the best of our knowledge, the previous works in the similarity search literature do not take the importance of the terms in documents into consideration (at least to the extent done in this work). This affects the semantic similarity between documents (i.e., some documents may have the same terms but in different contexts). In our algorithm, in order to address this issue, we utilize a cosine similarity based filtering technique that uses the relative importance of terms in documents for finding similar documents.

Over the years, several secure similar document detection methods have been proposed in the literature. There are two main settings in this area: similar document detection between two parties, and similar document detection over encrypted cloud data. The core of search over cloud data depends on searchable encryption methods; therefore, several different searchable encryption methods have been proposed over recent years [18, 19]. The majority of the works aim at similar document detection between two parties. The parties A and B want to compute the similarity between their documents a and b, respectively, without disclosing a or b. In this approach, the parties know their own data in plaintext form, but do not know the documents of the other party [20–22]. Jiang et al. [20] proposed a cosine similarity based similar document detection method between two parties. They propose two approaches, one with random matrix multiplication and one with component-wise homomorphic encryption. An efficient similarity search method between two parties is proposed by Murugesan et al. [21]. They explore clustering-based solutions that are significantly efficient while providing high accuracy. Buyukbilen and Bakiras [22] proposed another similar document detection method between two parties. They generate document fingerprints using simhash and reduce the problem to a secure XOR operation between two bit vectors. The secure XOR operation is formulated as a secure two-party computation protocol.

The other important line of research is similar document detection over encrypted cloud data. This approach is more challenging than the former one, since the cloud cannot access the plaintext version of the data it stores. Wong et al. [23] propose a SCONEDB (Secure Computation ON Encrypted DataBase) model, which captures execution and security requirements. They developed an asymmetric scalar product preserving encryption (ASPE). In this method, the query points and database points are encrypted differently, which avoids distance recoverability using only the encrypted values. Yao et al. [24] investigate the secure nearest neighbor problem and rephrase its definition: instead of finding the encrypted exact nearest neighbor, the server finds a relevant encrypted partition such that the exact nearest neighbor is guaranteed to be in that partition. Elmehdwi et al. [25] also consider the k-NN problem over an encrypted database outsourced to a cloud. This method can protect the confidentiality of users' search patterns and access patterns. The method uses the Euclidean distance for finding the similarity and utilizes several subroutines such as secure minimum, secure multiplication and secure OR operations. Overall, the method provides the same security guarantees as the method proposed in Chapter 5, but considers the Euclidean distance, whereas cosine similarity is considered in our work. Cosine similarity is especially useful for high-dimensional spaces. In document similarity, each term is assigned to a dimension and each document is characterized by a vector where the value of each dimension is the corresponding tf-idf score. Therefore, cosine similarity captures the similarity between two documents better than the Euclidean distance.
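The claim in the last paragraph is easy to verify numerically: two documents with the same term distribution but different lengths have cosine similarity 1, while their Euclidean distance grows with the length difference. A small illustrative check (dense toy vectors, not real tf-idf data):

```python
import math

def cos_sim(u, v):
    # direction-only comparison: magnitude cancels out
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclid(u, v):
    # magnitude-sensitive: penalizes documents of different lengths
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1.0, 2.0, 0.0]  # tf-idf-like weights
long_doc = [3.0, 6.0, 0.0]   # same term distribution, three times the mass
```

Here `cos_sim(short_doc, long_doc)` is exactly 1 while `euclid(short_doc, long_doc)` is large, so cosine similarity treats the two as identical in topic, which is the behavior argued for above.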


Chapter 3

PRELIMINARIES

To understand the proposed schemes and follow the pertinent discussions in this thesis, this chapter provides explanations of the following preliminaries: “Term Relevancy Score”, “Cosine Similarity”, “Z-Order Mapping”, “Locality Sensitive Hashing”, “Hadoop and MapReduce Framework” and “Hash-based Message Authentication Code (HMAC)”.

3.1 Term Relevancy Score

We can represent a data object (e.g., a document, an image, a video file, etc.) as a vector of features that identifies that data object. In this thesis, we use documents that are represented by a set of terms (i.e., keywords, words from human language). More formally, each document d_i in the data set D contains a certain number of terms from a global dictionary T, where |T| = δ is the total number of terms in the dictionary. Each document in the data set is represented as a vector of term weights derived from the dictionary T. In our scheme, a component of the term vector for the document d_i is, in fact, the relevance score of the corresponding term t_j
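A widely used instantiation of this relevance score is tf-idf, already named in the contributions. A minimal sketch of the standard tf·idf product follows; this assumes the common formulation, which may differ in detail from the thesis's exact formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: {doc_id: list of terms}; returns {doc_id: {term: tf-idf weight}}
    n = len(docs)
    # document frequency: in how many documents does each term appear
    df = Counter(t for terms in docs.values() for t in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vectors[doc_id] = {
            # normalized term frequency times inverse document frequency
            t: (count / len(terms)) * math.log(n / df[t])
            for t, count in tf.items()
        }
    return vectors
```

A term that appears in every document gets idf = 0 and carries no weight, while terms that are frequent within a document but rare in the collection receive the highest weights; these weighted vectors are what the cosine similarity of Section 3.2 is computed over.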


