Data Network

Location:

El Segundo, CA

Posted:

February 12, 2013

Contact this candidate

Resume:

A Brief Survey of Machine Learning Methods for

Classi cation in Networked Data and an

Application to Suspicion Scoring

Sofus Attila Macskassy1 and Foster Provost2

Fetch Technologies,

**** ****crans Ave, Suite 245, El Segundo, CA 90245

******@*****.***

New York University,

Stern School of Business, 44 W. 4th Street, New York, NY 10012

********@*****.***.***

Abstract. This paper surveys work from the eld of machine learning

on the problem of within-network learning and inference. To give mo-

tivation and context to the rest of the survey, we start by presenting

some (published) applications of within-network inference. After a brief

formulation of this problem and a discussion of probabilistic inference

in arbitrary networks, we survey machine learning work applied to net-

worked data, along with some important predecessors mostly from the

statistics and pattern recognition literature. We then describe an appli-

cation of within-network inference in the domain of suspicion scoring in

social networks. We close the paper with pointers to toolkits and bench-

mark data sets used in machine learning research on classi cation in

network data. We hope that such a survey will be a useful resource to

workshop participants, and perhaps will be complemented by others.

1 Introduction

This paper brie y surveys work from the eld of machine learning, summarizes

work in a trio of research papers [1,2,3]. This extended abstract consists of the

abstracts for those papers in which we concentrate on methods published in the

machine learning literature, as well as methods from other elds that have had

considerable impact on the machine learning literature.

Networked data are the special case of relational data where entities are inter-

connected, such as web-pages or research papers (connected through citations).

We focus on within-network inference, for which training entities are connected

directly to entities whose classi cations (labels ) are to be estimated. This is in

contrast to across-network inference: learning from one network and applying

the learned models to a separate, presumably similar network [4,5]. For within-

network inference, networked data have several unique characteristics that both

complicate and provide leverage to learning and inference.

E.M. Airoldi et al. (Eds.): ICML 2006 Ws, LNCS 4503, pp. 172 175, 2007.

c Springer-Verlag Berlin Heidelberg 2007

A Brief Survey of Machine Learning Methods for Classi cation 173

Although the network may contain disconnected components, generally there

is not a clean separation between the entities for which class membership is

known and the entities for which estimations of class membership are to be

made. The data are patently not i.i.d., which introduces bias to learning and in-

ference procedures [6]. The usual careful separation of data into training and test

sets is di cult, and more importantly, thinking in terms of separating training

and test sets obscures an important facet of the data. Entities with known clas-

si cations can serve two roles. They act rst as training data and subsequently

as background knowledge during inference. Relatedly, within-network inference

allows models to use speci c node identi ers to aid inference [7].

Network data generally allow collective inference, meaning that various inter-

related values can be inferred simultaneously. For example, inference in Markov

random elds [8] uses estimates of a node s neighbor s labels to in uence the esti-

mation of the nodes labels and vice versa. Within-network inference

complicates such procedures by pinning certain values, but again also o ers

opportunities such as the application of network- ow algorithms to inference.

More generally, network data allow the use of the features of a node s neighbors,

although that must be done with care to avoid greatly increasing estimation

variance (and thereby error) [9].

2 Network Learning

Abstract from [1]:

This paper presents NetKit, a modular toolkit for classi cation in networked

data, and a case-study of its application to networked data used in prior ma-

chine learning research. We consider within-network classi cation : entities whose

classes are to be estimated are linked to entities for which the class is known.

NetKit is based on a node-centric framework in which classi ers comprise a lo-

cal classi er, a relational classi er, and a collective inference procedure. Various

existing node-centric relational learning algorithms can be instantiated with ap-

propriate choices for these components, and new combinations of components

realize new algorithms. The case study focuses on univariate network classi-

cation, for which the only information used is the structure of class linkage

in the network (i.e., only links and some class labels). To our knowledge, no

work previously has evaluated systematically the power of class-linkage alone

for classi cation in machine learning benchmark data sets. The results demon-

strate that very simple network-classi cation models perform quite well well

enough that they should be used regularly as baseline classi ers for studies of

learning with networked data. The simplest method (which performs remarkably

well) highlights the close correspondence between several existing methods intro-

duced for di erent purposes i.e., Gaussian- eld classi ers, Hop eld networks,

and relational-neighbor classi ers. The results also show that a small number of

component combinations excel. In particular, there are two sets of techniques

that are preferable in di erent situations, namely when few versus many labels

174 S.A. Macskassy and F. Provost

are known initially. We also demonstrate that link selection plays an important

role similar to traditional feature selection.

3 Suspicion Scoring

Abstract from [2]:

We describe a guilt-by-association system that can be used to rank entities by

their suspiciousness. We demonstrate the algorithm on a suite of data sets gen-

erated by a terrorist-world simulator developed under a DoD program. The data

sets consist of thousands of people and some known links between them. We

show that the system ranks truly mali-cious individuals highly, even if only rela-

tively few are known to be malicious ex ante. When used as a tool for identifying

promising data-gathering opportunities, the sys-tem focuses on gathering more

information about the most suspicious people and thereby increases the den-

sity of link-age in appropriate parts of the network. We assess per-formance

under conditions of noisy prior knowledge (score quality varies by data set un-

der moderate noise), and whether augmenting the network with prior scores

based on pro ling information improves the scoring (it doesn t). Although the

level of performance reported here would not support direct action on all data

sets, it does recommend the consideration of network-scoring techniques as a

new source of evidence in decision making. For example, the system can op-

erate on networks far larger and more com-plex than could be processed by a

human analyst.

Abstract from [3]:

We describe a guilt-by-association system that can be used to rank networked

entities by their suspiciousness. We demonstrate the algorithm on a suite of

data sets generated by a terrorist-world simulator developed to support a DoD

program. Each data set consists of thousands of entities and some known links

between them. The system ranks truly malicious entities highly, even if only

relatively few are known to be malicious ex ante. When used as a tool for iden-

tifying promising data-gathering opportunities, the system focuses on gathering

more information about the most suspicious entities and thereby increases the

density of linkage in appropriate parts of the network. We assess performance

under conditions of noisy prior knowledge of maliciousness. Although the levels

of performance reported here would not support direct action on all data sets,

the results do recommend the consideration of network-scoring techniques as a

new source of evidence for decision making. For example, the system can op-

erate on networks far larger and more complex than could be processed by a

human analyst. This is a follow-up study to a prior paper; although there is a

considerable amount of overlap, here we focus on more data sets and improve

the evaluation by identifying entities with high scores simply as an artifact of

the data acquisition process.

A Brief Survey of Machine Learning Methods for Classi cation 175

References

1. Macskassy, S.A., Provost, F.: Classi cation in Networked Data: A toolkit and a

univariate case study. Technical Report CeDER Working Paper 04-08, Stern School

of Business, New York University (2004). [June 2006 revision]

2. Macskassy, S.A., Provost, F.: Suspicion scoring based on guilt-by-association, collec-

tive inference, and focused data access. In: International Conference on Intelligence

Analysis. (2005)

3. Macskassy, S.A., Provost, F.: Suspicion scoring of entities based on guilt-by-

association, collective inference, and focused data access. In: Annual Conference

of the North American Association for Computational Social and Organizational

Science (NAACSOS). (2005)

4. Craven, M., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Quek, C.Y.: Learn-

ing to Extract Symbolic Knowledge from the World Wide Web. In: 15th Conference

of the American Association for Arti cial Intelligence. (1998)

5. Lu, Q., Getoor, L.: Link-Based Classi cation. In: Proceedings of the 20th Interna-

tional Conference on Machine Learning (ICML). (2003)

6. Jensen, D., Neville, J.: Linkage and Autocorrelation Cause Feature Selection Bias

in Relational Learning. In: Proceedings of the 19th International Conference on

Machine Learning (ICML). (2002)

7. Perlich, C., Provost, F.: Distribution-based aggregation for relational learning with

identi er attributes. Machine Learning 62(1/2) (2006) 65 105

8. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. Journal

of the Royal Statistical Society 36(2) (1974) 192 236

9. Jensen, D., Neville, J., Gallagher, B.: Why Collective Inference Improves Relational

Classi cation. In: Proceedings of the 10th ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining. (2004)

Contact this candidate