Post Job Free

Resume

Sign in

Project Manager Email

Location:
East Ellijay, GA
Posted:
November 17, 2012

Contact this candidate

Resume:

MailRank: Using Ranking for Spam Detection

Paul - Alexandru Chirita Jorg Diederich Wolfgang Nejdl

L3S Research Center / L3S Research Center / L3S Research Center /

University of Hannover University of Hannover University of Hannover

Deutscher Pavillon, Expo Plaza 1 Deutscher Pavillon, Expo Plaza 1 Deutscher Pavillon, Expo Plaza 1

30539 Hannover, Germany 30539 Hannover, Germany 30539 Hannover, Germany

abph3i@r.postjobfree.com abph3i@r.postjobfree.com abph3i@r.postjobfree.com

ABSTRACT ltering almost impossible. Currently, spam emails already out-

number non-spam ones, so-called ham emails . Existing spam l-

Can we use social networks to combat spam? This paper investi-

ters such as the SpamAssassin System1, SpamBouncer2, or Mozilla

gates the feasibility of MailRank, a new email ranking and classi-

Junk Mail Control3 still exhibit some problems, which can be clas-

cation scheme exploiting the social communication network cre-

si ed in two main categories:

ated via email interactions. The underlying email network data is

1. Maintenance, for both the initialization and the adaptation

collected from the email contacts of all MailRank users and up-

of the lter during operation, since all spam lters rely on

dated automatically based on their email activities to achieve an

a certain amount of input data to be maintained: Content-

easy maintenance. MailRank is used to rate the sender address of

based lters require keywords and rules for spam recogni-

arriving emails such that emails from trustworthy senders can be

tion, blacklists have to be populated with IP addresses from

ranked and classi ed as spam or non-spam. The paper presents two

known spammers, and Bayesian lters need a training set

variants: Basic MailRank computes a global reputation score for

of spam / ham messages. This input data has to be created

each email address, whereas in Personalized MailRank the score of

when the lter is used rst (the cold-start problem), and

each email address is different for each MailRank user. The eval-

it also has to be adapted continuously to counter attacks of

uation shows that MailRank is highly resistant against spammer

spammers [7, 19].

attacks, which obviously have to be considered right from the be-

2. Residual error rates, since current spam lters cannot elim-

ginning in such an application scenario. MailRank also performs

inate the spam problem completely. First, a non-negligible

well even for rather sparse networks, i.e., where only a small set of

number of spam emails still reaches the end user, so-called

peers actually take part in the ranking of email addresses.

false negatives. Second, some ham messages are discarded

because the anti-spam system considers them as spam. Such

Categories and Subject Descriptors false positives are especially annoying if the sender of the

email is from the recipient s community and thus already

G.2.2 [Discrete Mathematics]: Graph Theory; H.3.4 [Information

known to the user, or at least known by somebody else the

Systems]: Information Storage and Retrieval Systems and Soft-

user knows directly. Therefore, there is a high probability

ware; H.2.7 [Information Systems]: Database Management Se-

that an email received from somebody within the social net-

curity, Integrity and Protection

work of the receiver is a ham message. This implies that a

social network formed by email communication can be used

General Terms as a strong foundation for spam detection.

Algorithms, Experimentation, Measurements Even if there existed a perfect anti-spam system, an additional

problem would arise for high-volume email users, some of which

simply get too many ham emails. In these cases, an automated

Keywords

support for email ranking would be highly desirable. Reputation

Email Reputation, SPAM, MailRank, Personalization algorithms are useful in this scenario, because they provide a rat-

ing for each email address, which can subsequently be used to sort

1. INTRODUCTION incoming emails. Such ratings can be gained in two ways, glob-

ally or personally. The main idea of a global scheme is that people

While scienti c collaboration without email is almost unthink-

share their personal ratings such that a single global rating (called

able, the tremendous increase of unsolicited email (spam) over the

reputation) can be inferred for each email address. The implemen-

past years [5] has rendered email communication without spam

tation of such a scheme can, for example, be based on network rep-

utation algorithms [6] or on collaborative ltering techniques [17].

In case of a personalized scheme, the ratings (called trust in this

case) are typically different for each email user and depend on her

Permission to make digital or hard copies of all or part of this work for

personal social network. Such a scheme is reasonable since some

personal or classroom use is granted without fee provided that copies are

people with a presumably high global reputation (e.g., Linus Tor-

not made or distributed for pro t or commercial advantage and that copies

bear this notice and the full citation on the rst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior speci c 1

http://spamassassin.apache.org

permission and/or a fee. 2

http://www.spambouncer.org

CIKM 05 Bremen, Germany 3

http://www.mozilla.org/start/1.5/extra/using-junk-control.html

Copyright 200X ACM X-XXXXX-XX-X/XX/XX 5.00.

valds) might not be very important in the personal context of a user, protocol. Challenge-response schemes [16] require a manual effort

compared to other persons (e.g., the project manager). to send the rst email to a particular recipient. For example, the

In this paper we propose MailRank, a new approach to ranking sender has to go to a certain web page and activate the email man-

and classifying emails according to the address of email senders. ually, which might involve answering a simple question (such as

The central procedure is to collect data about trusted email ad- solving a simple mathematical equation). Afterwards, the sender

dresses from different sources and to create a graph for the social will be added to the recipient s whitelist such that further emails

network, derived from each user s communication circle [1]. There can be sent without the activation procedure. The activation task

are two MailRank variants, which both apply a power-iteration al- is considered too complex for spammers, who usually try to send

gorithm on the email network graph: Basic MailRank results in a millions of spam emails at once. An automatic scheme is used in

the greylisting approach4, where the receiving email server requires

global reputation for each known email address, and Personalized

MailRank computes a personalized trust value. MailRank allows to each unknown sending email server to resend the email again later.

classify email addresses into spammer address and non-spammer To prevent spammers from forging their identity (and allow for

address and additionally to determine the relative rank of an email tracking them), several approaches for sender authentication [5]

address with respect to other email addresses. This paper analyzes have been proposed. They basically add another entry to the DNS

the performance of MailRank under several scenarios, including server, which announces the designated email servers for a partic-

sparse networks, and shows its resilience against spammer attacks. ular domain. A server can use a reverse lookup to verify if a re-

The paper is organized as follows: Section 2 provides informa- ceived email actually came from one of these email servers. Sender

tion about existing anti-spam approaches, trust and reputation algo- authentication is a requirement for whitelist approaches since oth-

rithms, as well as a description of PageRank and some approaches erwise spammers can just use well-known email addresses in the

to personalizing it. In Sect. 3, we describe our proposed variants of From: line. Though it is already implemented by large email

MailRank, which we then evaluate in Sect. 4. Finally, our results providers (e.g., AOL, Yahoo), it also requires further mechanisms,

are summarized in Sect. 5. such as a blacklist or a whitelist, for an effective spam ltering since

spammers can easily set up their own domains and DNS servers.

Recent approaches have started to exploit information from so-

2. BACKGROUND AND RELATED WORK cial networks for spam detection. Such social network based ap-

proaches construct a graph, whose vertices represent email ad-

2.1 Anti-Spam Approaches dresses. A directed edge is added between two nodes A and B,

Because of the high relevance of the spam problem, many at- if A has sent an email to B . Boykin and Roychowdhury [1] ini-

tempts to counter spam have been started in the past, including tially classify email addresses based on the clustering coef cient of

some law initiatives. Technical anti-spam approaches comprise one the graph subcomponent: For spammers, this coef cient is very low

or several of the following basic approaches [16]: because they typically do not exchange emails with each other. In

Content-based approaches contrast, the clustering coef cient of the subgraph representing the

Header-based approaches actual social network of a non-spammer (colleagues, friends, etc.)

is rather high. The scheme can classify 53% of the emails correctly

Protocol-based approaches

as ham or spam, leaving the remaining emails for further exami-

Approaches based on sender authentication

nation by other approaches. Spammers can attack the scheme by

Approaches based on social networks

cooperating and building their own social networks. Golbeck and

Content-based approaches [7] analyze the subject of an email

Hendler propose another scheme to rank email addresses, based on

or the email body for certain keywords (statically provided or dy-

exchange of reputation values [6]. The main problem of this ap-

namically learned using a Bayesian lter) or patterns that are typ-

proach is that its attack resilience has not been veri ed.

ical for spam emails (e.g., URLs with numeric IP addresses in the

email body). The advantage of content-based schemes is their abil- 2.2 Trust and Reputation Algorithms

ity to lter quite a high number of spam messages. For exam-

Trust and reputation algorithms have become increasingly pop-

ple, SpamAssassin can recognize 97% of the spam if an appro-

ular to rank a set of items, such as web pages (web reputation) or

priately trained Bayesian lter is used together with the available

people (social reputation), for example, when selling products in

static rules [10]. The main drawback is that they (e.g., the set of

online auctions. Their main advantage is that most of them are de-

static keywords) have to be adapted continuously since otherwise

signed for high attack resilience.

the high spam recognition rate will decrease [10].

Web reputation schemes result in a single score for each Web

Header-based approaches examine the headers of email mes-

page. PageRank [15] computes these scores by means of link anal-

sages to detect spam. Whitelist schemes collect all email addresses

ysis, i.e., based on the graph inferred from the link structure of the

of known non-spammers in a whitelist to decrease the number of

Web. The main idea is that a page has a high rank if the sum of the

false positives from content-based schemes. In contrast, black-

ranks of its backlinks is high . Given a page p, the set of its input

list schemes store the IP addresses (email addresses can be forged

links I (p) and output links O(p), the PageRank score is computed

easily) of all known spammers and refuse to accept emails from

according to the formula:

them. A manual creation of such lists is typically highly accurate

but puts quite a high burden on the user to maintain it. PGP key P R(q )

P R(p) = c + (1 c) E (p) (1)

servers could be considered a manually created global whitelist. O(q )

q I (p)

An automatic creation can be realized, for instance based on pre-

vious results of a content-based lter as is done with so-called au- The damping factor c

towhitelists in SpamAssassin. Both blacklists and whitelists are antee convergence and to limit the effect of rank sinks, one very

rather dif cult to maintain, especially when faced with attacks from simple attack on PageRank. Intuitively, a random surfer will fol-

spammers who want to get their email addresses on the list (whitelist) low an outgoing link from the current page with probability c or

or off the list (blacklist).

4

Protocol-based approaches propose changes to the utilized email http://projects.puremagic.com/greylisting/

will get bored and select a random page with probability (1 c) speci ed in the votes. Therefore, it is not necessary that all email

(i.e., the E vector has all entries equal to 1/N, where N is the users participate in MailRank to bene t from it: For example, U3

does not specify any vote but still receives a vote from U1 and will,

number of pages in the Web graph). To achieve personalization,

thus, achieve some score (if U1 is not a spammer itself).

the random surfer must be redirected towards the preferred pages

by modifying the entries of E. Several distributions for this vector MailRank has the following advantages:

Shorter individual cold-start phase. If a MailRank user

have been proposed since: TrustRank [8] biases towards a set of

does not know an email address X, MailRank can provide a

ham pages in order to identify Web spam, HubRank [2] gives an

rank for X as long as at least another MailRank user has pro-

additional importance to hubs, pages collecting links to many other

important pages on the Web, etc. vided information about it. Thus, the so-called cold-start

Personalized PageRank [11] uses a new approach: it focuses on phase, i.e., the time a system has to learn until it becomes

user pro les. One Personalized PageRank Vector (PPV) is com- functional, is reduced: While most successful anti-spam ap-

puted for each user. The personalization aspect stems from a set proaches (e.g., Bayesian lters) have to be trained for each

of hubs (H), each user having to select her preferred pages from it. single user (in case of an individual lter) or a group of users

For each page of H, an auxiliary PPV called basis vector is precom- (for example, in case of a company-wide lter), MailRank

puted. Then, PPVs for any preference set P are expressed as a lin- requires only a single global cold start phase when the sys-

ear combination of basis vectors. To avoid the massive storage re- tem is bootstrapped. In this sense it is similar to globally

sources the basis hub vectors would use, they are decomposed into managed whitelists, but it requires less administrative efforts

partial vectors (encoding the part unique to each page, computed to manage the list and it can additionally provide informa-

at run-time) and the hubs skeleton (capturing the interrelationships tion about how good an email address is, and not only a

among hub vectors, stored off-line). Section 3.3 discusses how this classi cation into good or bad .

can be adapted to our email ranking and classi cation scenario. High attack resilience. MailRank is based on a power it-

Social reputation schemes are usually designed for use within eration algorithm, which is typically highly resistant against

P2P networks. However, they provide an useful insight into uti- attacks. This will be discussed for MailRank in particular in

lizing link analysis to construct reputation systems, as well as into Section 4.3.

identifying different attack scenarios. [21] presents a categoriza- Partial participation. Building on the power-law nature of

tion of trust metrics, as well as a xed-point personalized trust al- email networks, MailRank can compute a rank for a high

gorithm inspired by spreading activation models. It can be viewed number of email addresses even if only a subset of email

as an application of PageRank on a sub-graph of the social net- users actively participates in MailRank.

work. [18] builds a Web of trust asking each user to maintain trust

Stable results. Social networks are typically very stable, so

values on a small number of other users. The algorithm presented

the computed ratings of the email addresses will also change

is also based on a power iteration, but designed for an application

only slowly over time. Hence, spammers need to behave well

within the context of the Semantic Web, composed of logical asser-

for quite some time to achieve a high rank. Though this can-

tions. Finally, EigenTrust [12] is a pure xed-point PageRank-like

not resolve the spam problem entirely (in the worst case, a

distributed computation of reputation values for P2P environments.

spammer could, for example, buy email addresses from peo-

This algorithm is also used in the MailTrust approach [13].

ple who have behaved well for some time), it will increase

the cost for using new email addresses.

3. MAILRANK Can reduce load on email servers. Email servers don t have

In order to compute a rank for each email address, MailRank to process the email body to detect spam. This signi cantly

collects data about the social networks derived from email commu- reduces the computational power for spam detection com-

nication of all MailRank users and aggregates them into a single pared to, for example, content-based approaches or collabo-

email network. Figure 1 depicts an example email network graph. rative lters [13].

Node U1 represents the email address of U1, node U2 the email ad- Personalization. In contrast to spam classi cation approaches

that distinguish only between spam and non-spam, rank-

ing approaches more easily enable personalization features.

This is important since there are certain email addresses (e.g.,

newsletters), which some people consider to be spammers

while others don t. To deal with such cases, a MailRank user

can herself decide about the score threshold, below which

all email addresses are considered spammers. Moreover, she

could use two thresholds to determine spammers, don t

Figure 1: Sample email network know, and non-spammers . Furthermore, she might want

to give more importance to her relatives or to her manager,

dress of U2, and so on. U1 has sent emails to U2, U4, and U3 ; U2 than to other unrelated persons with a globally high reputa-

has sent emails to U1 and U4, etc. These communication acts are tion (e.g., Linus Torvalds).

then interpreted as trust votes, e.g., from U1 towards U2, U4 and Scalable computation. Power iteration algorithms have been

U3, and depicted in the gure using arrows. shown to be computationally feasible even for very large

Building upon the email network graph, we can use a power iter- graphs even in the presence of personalization [11].

ation algorithm to compute a score for each email address. This can

Can also counter other forms of spam. When receiving

subsequently be used for at least two purposes, namely: (1) Clas-

spam phone calls (SPIT5 ), for example, it is not possible to

si cation into spam and ham emails, and (2) build up a ranking

among the remaining ham emails.

5

The computation includes the email addresses of all voters (i.e. Spam over Internet Telephony,

the actively participating MailRank users) and the email addresses http://www.infoworld.com/article/04/09/07/HNspamspit 1.html

analyze the content of the call before accepting / rejecting it. will be in the biasing set and can in this way counter malicious

At best, only the caller identi er is available, which is similar collectives entirely. An automatic selection can avoid the (possi-

to the sender email address. MailRank can be used to analyze bly costly) manual selection of the biasing set. A semi-automatic

the caller identi er to decide whether a caller is a spammer selection of the biasing set can use the above described automatic

or not. selection to propose a biasing set for being veri ed manually to be

The following sections provide more information about each cen- free of spammers. We propose the following heuristics to deter-

tral aspect of MailRank: what data are used by the algorithm, where mine the biasing set automatically:

We rst determine the size p of the biasing set by adding the

these data are stored, how the ranks are generated and how we can

ranks of the R nodes with the highest rank such that the sum of the

nally use them for computing global or personalized reputation

ranks of these R nodes is equal to 20% of the total rank in the sys-

scores.

tem. Also, we additionally limit p to the minimum of R and 0.25%

3.1 Bootstrapping the email network of the total number of email addresses in the graph6 . In this manner

we limit the biasing set to the few most reputable members of the

As for all trust / reputation algorithms, it is necessary to collect

social network, because of the power-law distribution of email ad-

as many personal votes as possible in order to compute relevant

dresses [4, 9]. Thus, we can exclude spammers effectively even if

ratings. Collecting the personal ratings should require few or no

the spammer email addresses constitute the majority in the graph.

manual user interactions in order to achieve a high acceptance of

The result of the overall MailRank algorithm, the nal vector

the system. Similarly, the system should be maintained with lit-

of MailRank scores, can be used to tag an incoming email on the

tle or no effort at all, thus having the rating of each email address

email proxy as (1) non-spammer, if the nal score of the sender

computed automatically.

email address is larger than a threshold T, (2) spammer, if the nal

To achieve these goals, we use already existing data inferred

score of the sender email address is smaller than T, or (3) unknown,

from the communication dynamics, i.e., who has exchanged emails

if the email address is not yet known to the system7 .

with whom. This results in a global email social network. We dis-

Each user can adjust T according to her preferred ltering level.

tinguish three information sources as best serving our purposes:

If T = 0, the algorithm is effectively used to compute the transitive

1. Email Address Books. If A has the addresses B1, B2, Bn

closure of the email network graph starting from the biasing set.

in its Address Book, then A can be considered to trust them

This is suf cient to detect all those spammers for which no user

all, or to vote for them.

reachable from the biasing set has issued a vote. With T > 0, it

2. The To: Fields of outgoing emails (i.e., To:, Cc: and

becomes possible to detect spammers even if some non-spammers

Bcc: ). If A sends emails to B, then it can be regarded as

vote for spammers (e.g., because the computer of a non-spammer

trusting B, or voting for B . This input data is typically very

is infected by a virus). However, in this case some non-spammers

accurate since it is manually selected (i.e. it does not con-

with a very low rank are at risk of being counted as spammers.

tain spammer addresses), and it is more accurate than data

The Basic MailRank algorithm is summarized in Alg. 3.1.

from address books, since address books can comprise old

or outdated information and there is normally no informa- Algorithm 3.1. The Basic MailRank Algorithm.

tion available about when the address book entry was cre-

Client Side:

ated / modi ed last. Furthermore, address books are private

Each vote sent to the MailRank server comprises:

and would have to be released manually by the owner to be Addr(u) : The hashed version of the email address of the voter u.

accessible for the MailRank system. In contrast, data based TrustVotes(u) : Hashed version of all email addresses

on the To: elds can also be extracted automatically via a u votes for (i.e., she has sent an email to)

light-weight email proxy deployable on any machine. Server Side:

3. Autowhitelists created by anti-spam tools (e.g., SpamAssassin) 1: Combine all received data into a global email network graph. Let

T be the Markov chain transition probability matrix, computed as:

contain a list of all email addresses from which emails have

ForEach known email address i

been received recently, plus one score for each email address If i is a registered address, i.e., user i has submitted her votes

which determines if mainly spam or ham emails have been ForEach trust vote from i to j

Tji = 1/NumOfVotes(i)

received from the associated email address. All email ad-

Else ForEach known address j

dresses with a high score can be regarded as being trusted. Tji = 1/N, where N is the number of known addresses.

3: Determine the biasing set B (i.e., the most popular email addr.)

3.2 Basic MailRank 3a: Manual selection or

3b: Automatic selection or

The main goal of MailRank is to assign a rank to each email ad- 3c: Semi-automatic selection

dress known to the system and to use this rank (1) to decide whether 4: Let T = c T + (1 c) E, with c = 0.85 and

1

each email is coming from a spammer or not, and (2) to build up E [i] = [ B ]N 1, if i B, or E [i] = [0]N 1, otherwise

5: Initialize the vector of scores x = [1/N ]N 1, and the error =

a ranking among the ltered non-spam emails. Its basic version

6: While 0, in the formulas below:

munication partners will achieve much higher ranks, even though Dk+1 [u] = Dk [u] + q Q (u) c Ek [u](q )xq

k

Ek+1 [u] = Ek [u] q Q (u) Ek [u](q )xq +

initially they were not among the highest ones. Moreover, due to k

O (q )

1 c

the rank propagation, their votes will have a high in uence as well. Ek [u](q )xOi (q)

q Qk (u) O (q ) i=1

Under this choice, Dk [u] + c Ek [u] will converge to ru rH,

Now that we have captured the user requirements mentioned, we u

the partial vector corresponding to u.

should also focus our attention on a nal design issue of our system:

scalability. Simply biasing MailRank on user s acquaintances will 2: (Repeated squaring) Having the results from the rst step as input, one can now

not scale well, because it must be computed for each preference set, compute the hubs skeleton (ru (H )). This is represented by the nal D [u] vectors

calculated using Qk (u) = H into:

that is for every registered user.

D2k [u] = Dk [u] + q Q (u) Ek [u](q ) Dk [q ]

Jeh and Widom [11] have proposed an approach to calculate k

E2k [u] = Ek [u] q Q (u) Ek [u](q )xq + q Q (u) Ek [u](q )Ek [q ]

Personalized PageRank vectors, which can also be adapted to our k k

As this step refers to hub-users only, the computation of D2k [u] and E2k [u]

scenario, and which can be used with millions of subscribers. To should consider only the components regarding users from H,

achieve scalability, the resulting personalized vectors are divided in as it signi cantly decreases the computation time.

two parts: one common to all users, precomputed and stored off-

3: Let p = 1 u1 + + z uz be a preferenced vector,

line (called partial vectors ), and one which captures the speci cs where ui are from H and i is between 1 and z, and let:

z

of each preference set, generated at run-time (called hubs skele- i=1 i (rui (h) c xpi (h)), h H

rp (h) =

which can be computed from the hubs skeleton.

ton ). We will have to de ne a restricted set of users on which

The PPV v for p can then be constructed as:

rankings can be biased (we shall call this set hub set, and note z H

i=1 i (rui ru )+

v=

i

it with H ). There is one partial vector and one hub skeleton for

rp (h) (ru rH ) c xh

1

h H rp (h)>0 u

c

each user from H . Once an additional regular user registers, her

personalized ranking vector will be generated by reading the al-

ready precomputed partial vectors corresponding to her preference

3.4 MailRank System Architecture

set (step 1), by calculating their hubs skeleton (step 2), and -

MailRank is composed of a server, which collects all user votes

nally by tying these two parts together (step 3). Both the algorithm

and delivers a score for any known email address, and an email

from step 1 (called Selective Expansion ) and the one from step

proxy on the client side, which interacts with the MailRank server.

2 (named Repeated Squaring ) can be mathematically reduced to

The MailRank Server collects the input data (i.e., the votes)

biased PageRank. The latter decreases the computation error much

from all MailRank users to run the MailRank algorithm. The votes

faster along the iterations and is thus more ef cient, but works only

are assigned with a lifetime for (1) Identifying and deleting email

with the output of the former one as input. In the nal phase, the

addresses which haven t been used for a long time, and (2) Detect-

two sub-vectors resulted from the previous steps are combined into

ing spammers which behave good for some time to get a high rank

a global one. The algorithm is depicted in the following lines. To

and start to send spam emails afterwards.

make it clearer, we have also collected the most important de ni-

The MailRank Proxy resides between user s email client and

tions it relies on in table 1.

her regular local email server. It performs two tasks: When re-

ceiving an outgoing email, it rst extracts the user s votes from the

Term Description

available input data (e.g., by listening to ongoing email activities or

Set V The set of all users.

Hub Set H A subset of users. by analyzing existing sent-mail folders). Then, it sends the votes

Preference Set P Set of users on which to personalize.

to the MailRank server and forwards the email to the local email

Preference Vector p Preference set with weights.

server. To increase ef ciency, only those votes that have not been

Personalized PageRank Importance distribution induced by a preference vector.

submitted yet (or that would expire otherwise) are sent. Also, for

Vector (PPV)

Basis Vector ru PPV for a preference vector with a single nonzero entry privacy reasons, votes are encoded using hashed versions of email

at u.

addresses. Upon receiving an email, the proxy queries the Mail-

Basis vector for a hub user u H .

Hub Vector ru

Rank server about the ranking of the sender address (if not cached

H

Partial Vector ru ru Used with the hubs skeleton to construct a hub vector.

locally) and classi es / ranks the email accordingly.

Hubs Skeleton ru (H ) Used with partial vectors to construct a hub vector.

Further extensions of our prototype will make use of secure sign-

Table 1: Terms speci c to Personalized MailRank. ing schemes to enable us to analyze both outgoing and incoming

emails for extracting the votes and submitting them to the Mail-

Rank server.8 This helps not only to bootstrap the system initially, would move that address from spammer to non-spammer . How-

but also introduces the votes of spammers into MailRank. Such ever, spammers could still pay non-spammers to send spam on their

votes have a very positive aspect, since they increase the score for behalf. Such an attack can be successful initially, but the rank of

the spam recipients (i.e., non-spammers). Thus, spammers face the non-spammer addresses will decrease after some time to those

more dif culties to attack the system and increase their own rank. of spammers due to the limited life time of votes. We will discuss

simulations based on such attack scenarios in the next section.

3.5 MailRank Under Spammer Attacks

By de nition, spammers send the same / very similar message

4. EXPERIMENTAL RESULTS

to very many (typically millions of) recipients. However, they can

Real-world data about email networks is almost unavailable be-

run two different strategies to choose the sender address: First, they

cause of privacy reasons. Yet some small studies do exist, using

use a new (random) email address for each spam message even if

data gathered from the log les of a student email server [4], or of

they send the same message to millions of recipients (from an anal-

a comany wide server [9], etc. In all cases, the analyzed email net-

ysis we performed on the autowhitelists of several large university

work graph exhibits a power-law distribution of in-going (exponent

institutions in Germany, we found that 95% of the spammer ad-

1.49) and out-going (exponent 1.81) links.

dresses were used only once). In this manner, they are trying to

To be able to vary certain parameters such as the number of

circumvent blacklists of email addresses. Furthermore, they use

spammers, we evaluated MailRank using an extensive set of simu-

these addresses only for sending spam emails to non-spammers.

lations, based on a power-law model of an email network, follow-

Second, they use email addresses from well-known non-spammers

ing the characteristics presented in the above mentioned literature

(forging of sender address) assuming that these addresses are in

studies. Additionally, we used an exponential cut-off at both tails

the whitelists of many spam detection tools. Sender authentica-

to ensure that a node has at least ve and at most 1500 links to other

tion schemes as those listed in Sect. 2 already prevent forging the

nodes, which re ects the nature of true social contacts [9]. If not

sender address (when installed on the email server) and are actually

noted otherwise, the graph consisted of 100,000 non-spammers10

required for any whitelist-based scheme. However, sender authen-

and the threshold T was set to 0. In a scenario without virus infec-

tication cannot counteract the much more common former spam-

tions, this is suf cient to detect spammers and to ensure that non-

ming strategy.

spammers are not falsely classi ed. Furthermore, we repeated all

As soon as the MailRank service becomes widespread, spam-

simulations for at least three times with different randomly gen-

mers will surely try to attack it in order to increase the rank of their

erated email networks to determine average values. Finally, as

own address(es). We identi ed and simulated several ways of at-

personalization brought a signi cant improvement only in creating

tacking MailRank9 . For example, spammers could issue votes from

user-speci c rankings of email addresses (i.e., it resulted only in

one or several spammer addresses to one or several non-spammer

minor improvements for spam detection), we omitted it here due to

addresses. However, the algorithm ensures that it is not possible to

space limitations. Therefore, our analysis is focused around three

change your own score by the votes you are issuing towards others.

issues: Effectiveness in case of very sparse MailRank networks

Therefore, such attacks are only reasonable if the spammers vote

(i.e., only few nodes submit votes, the others only receive votes),

for another spammer address to increase its rank, forming a mali-

Copyright 200X ACM X-XXXXX-XX-X/XX/XX 5.00.



Contact this candidate