Neural Comput & Applic (2007) 16:481-490

DOI **.*007/s00521-006-0076-4

ORIGINAL ARTICLE

Predicting the distance between antibody's interface residue and antigen to recognize antigen types by support vector machine

Yong Shi, Xinyang Zhang, Jia Wan, Yong Wang, Wei Yin, Zhiwei Cao, Yajun Guo

Received: 14 November 2005 / Accepted: 3 November 2006 / Published online: 25 November 2006

Springer-Verlag London Limited 2006

Y. Shi, X. Zhang, W. Yin
Research Center on Data Technology and Knowledge Economy, Graduate University of Chinese Academy of Sciences, Beijing 100080, China
e-mail: ****@*****.**.**

J. Wan
Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

Y. Wang
Department of Electrical Engineering and Electronics, Osaka Sangyo University, Daito, Osaka 574-8530, Japan

Z. Cao
Shanghai Center for Bioinformation Technology, Shanghai 200235, China

Y. Guo
Shanghai International Cancer Institute of China, Shanghai 200433, China

Abstract In this paper, a machine learning approach known as the support vector machine (SVM) is employed to predict the distance between an antibody's interface residue and the antigen in an antigen-antibody complex. The heavy chains, light chains and corresponding antigens of 37 antibodies are extracted from the antibody-antigen complexes in the Protein Data Bank. According to different distance ranges, sequence patch sizes and antigen classes, a number of computational experiments are conducted to describe the distance between the antibody's interface residue and the antigen using antibody sequence information. The high prediction accuracy of both the self-consistent and the cross-validation tests indicates that the sequential information discovered from antibody structure contributes much to predicting the distance between the antibody's interface residue and the antigen. Furthermore, the antigen class is predicted by SVM from the composition of the residues that belong to different distance ranges, which shows some potential significance.

Keywords Bioinformatics, Protein, Protein data bank, Antibody-antigen complexes, Support vector machine, Cross-validation

1 Introduction

The explosive growth in biotechnology, combined with major advances in information technology, has the potential to radically transform immunology in the post-genomics era [1]. Indeed, computational immunology aims to provide tools for the extraction, comparison, analysis and interpretation of not only the vast quantities of existing data, but also newly accumulated data relevant to immunology.

Antibodies are immunoglobulins that are similar in sequence and structure. According to the ratio of different amino acids at a given position among different antibodies, each antibody can be divided into constant and variable regions. An antibody binds its corresponding antigen through complementarity determining regions (CDRs), which are also known as hypervariable (HV) loops because of their dramatic variability relative to the rest of the variable regions. Much effort has focused on the characteristics of antibody structure, the antibody-antigen binding site, and the affinity and specificity of the antibody [2-8]. A good fit of the two binding surfaces in an antibody-antigen complex is important for high affinity [9]. The details of the antibody-antigen interaction


surface can be observed through the distance between the antibody's interface residues and the antigen. This distance helps us understand the position of each interface residue relative to the antigen in 3D, which is connected to the affinity of the antibody-antigen interaction.

In the bioinformatics field, there are many studies of protein-protein interaction that use different machine learning approaches [10-13]. Fariselli and Casadio [14] and Fariselli et al. [15] used neural network methods to study protein-protein contact sites based on chemico-physical and evolutionary information. The reported precision is about 73%. Similarly, Ofran and Rost [16, 17] predicted protein-protein interaction sites from sequence information using a neural network. Besides neural networks, the support vector machine (SVM) has been applied in the field [18, 19]. In these studies, the protein-protein complexes were divided into six classes in order to improve the accuracy. Each class was analyzed and predicted using sequence information. Among these six classes, the antibody-antigen complex was studied as a specific one. The sensitivity is 82.3%, the specificity is 81.0% and the correlation coefficient is 0.43. However, characteristics of the residues, such as the distance range between antibody and antigen, need to be investigated further.

To promote this direction of research, this paper proposes a method that uses machine learning to predict the distance between an antibody's interface residue and the antigen from the characteristics of the residue sequence in antibody structures, as a pilot study to discover the correlation between the antibody-antigen interaction residues and the antibody-antigen interaction surface.

In this research, the proposed framework for predicting the distance between an antibody's interface residue and the antigen includes three stages:

Stage 1 The surface residues and the interior residues are distinguished using protein structure information. To find surface residues, relative solvent accessibility is employed. Relative solvent accessibility is an important characteristic of protein structures and has been studied extensively [20-24]. The well-known program DSSP [25] is employed to compute the accessible surface area (ASA). Threshold values are used to determine the binary categories.

In our study, one kind of complex structure is designated as the basic standard. It is extracted from the antibody-antigen complexes selected from the Protein Data Bank (PDB) [26]. The complex structure is composed of heavy chains, light chains and the corresponding antigens. It is then modified by filling in the missing residues in the heavy chain or light chain sequences. After removing the redundant information, 37 basic structures are extracted. A total of 5,253 surface residues is then selected from these 37 complex structures to form the data set.

Stage 2 To distinguish interface residues. Jones and Thornton [27] reported that if the ASA calculated for a surface residue in the complex is 1 Å² less than the value calculated for the monomer, then this surface residue is regarded as an interface residue. Fariselli et al. [15] also indicated that if the Cα distance of two neighboring surface residues is less than 12 Å, these two surface residues are regarded as contacting residues. The latter standard is adopted by other researchers for its simplicity and ease of implementation. Surface residues are defined as interface residues if they contact residues in a different structure. In this paper, if the calculated distance is less than 12 Å, the residue is defined as contacting the antigen. A total of 668 interface residues is then selected from the 5,253 surface residues.

Stage 3 To predict the distance range between the antibody's interface residue and the antigen. To study the distance range, the distance from the Cα atom of each interface residue to the nearest atom in the antigen is calculated. If the calculated distance is less than the threshold value, the distance between the interface residue and the antigen is defined as belonging to this range. The distances 8, 10 and 12 Å are selected as the threshold values. The prediction accuracies derived from these distance ranges are compared and analyzed. There are 329, 508 and 668 interface residues that belong to the distance ranges 8, 10 and 12 Å, respectively.

With the data set prepared, the machine learning framework in this paper can be considered as a two-step process:

Step 1 To extract the feature vector from the sequence information. In this paper, two coding methods are proposed for handling the sequence feature, amino acid composition, etc. A scheme is also proposed to extract the feature vector from the information of sequence neighbors.

Step 2 To use an experimental scheme to mine the useful information with the support vector machine (SVM). The details of the dataset, methodology and evaluation measures are reported in Sect. 2.

In this study, two factors, the sequence patch size and the antigen class, are mainly explored to examine how they affect the prediction accuracy.

In our experiments, five sequence patch sizes are chosen: 1, 2, 3, 4 and 5. The prediction accuracies for these patch sizes are investigated below.

We are also concerned with how antigen classes affect the antibody-antigen interaction surface. Different types of antibody combining sites have been studied


[2]. These are cavity or pocket (hapten), groove (peptide, DNA, carbohydrate) and planar (protein). In our research, antigens are divided into two classes (protein, non-protein) for study. Meanwhile, the influence of antigen classification is studied by comparing the accuracy with classification to the accuracy without classification. Another antigen class setting (protein and non-protein) pools the antigens and is not subject to classification.

In total, 45 samples are produced according to the different distance ranges, sequence patch sizes and antigen classes in the data set (3 distance ranges, 5 patch sizes and 3 antigen class settings). The samples are trained and tested using SVM to predict the distance range between the antibody's interface residue and the antigen.

Based on the prediction results for interface residues that belong to different distance ranges, another experiment is designed to predict the class of the antigen, because different types of antibody-antigen interface have different surface characteristics [2]. In our study, the number of interface residues of each kind that belong to the different distance ranges is calculated as the basis for composing the attributes of the interface residues.

Three samples are produced according to the three distance ranges in our data set. The SVM is used again to train and test, so that the functional class can be identified from the antibody structure data.

These numerical results are shown and discussed in Sect. 3, while the conclusion is given in Sect. 4.

2 Materials and methods

2.1 Collection of complex structure

The structural data of antibody-antigen complexes exist in the relevant biological literature and databases. In this paper, all the antigen-antibody complex files are selected from the PDB file library. The heavy chain, light chain and corresponding antigen from the complex are collectively defined as the basic complex structure, which is the starting point of our research.

After testing the selected basic structures, some missing residues in the heavy chains and light chains of these structures are detected. We fill in these missing residues using HyperChem 5.1 for Windows (Hypercube, FL, USA).

If the heavy chains and the light chains of the antibody in two complex structures are exactly the same, one of the structures is defined as redundant. After examination of all the basic structures, 37 non-redundant complex structures are extracted.

In the 37 complex structures obtained, 3 of the antigens are nucleic acid, 4 of the antigens are heterocomplex, and the rest are proteins. To simplify the problem, the antibody structures are further divided into two classes according to whether the antigen is a protein or not. Meanwhile, the influence of antigen classification is studied by comparing the accuracy with classification to the accuracy without classification. Another antigen class setting (protein and non-protein) pools the antigens and is not subject to classification.

2.2 Antibody surface residue and antibody interface residues

The residues in the antibody structure can be divided into two groups: (1) residues embedded in the structure; (2) residues on the surface of the structure. It is commonly accepted that the surface residues in the antibody structure play a vital role in the interaction between antibody and antigen.

The accessible surface area method is used to recognize surface residues. The ASA is calculated using the DSSP program. If the ASA of a residue in the chain structure exceeds 25% of the value calculated for the residue alone, the residue is regarded as a surface residue of the antibody structure.

After computation, 5,253 surface residues are recognized from the 37 non-redundant complex structures. Among them, 4,139 surface residues come from complexes whose antigens are of the protein class, and the other 1,114 from complexes whose antigens are of the non-protein class.

Antibody function is accomplished through the antibody-antigen interaction surface, and interface residues in the antibody play an important role in the antibody-antigen interaction. There are many methods to identify interface residues in a protein-protein complex. Fariselli et al. [15] indicated that two surface residues are regarded as interface residues if the distance between the Cα atom of one surface residue in one protein and the Cα atom of one surface residue in another protein is less than 12 Å.

In this paper, the coordinate of the Cα atom of a residue is taken as the coordinate of this residue. The distances between a surface residue in the antibody structure of a complex and every atom in the antigen of the same complex are calculated to identify whether this surface residue is an interface residue or not. If the distance is less than 12 Å, the surface residue is considered to contact the antigen, and it is identified as an interface residue.

After computation, 668 interface residues are recognized from the 5,253 surface residues. Among them, 532 interface residues come from complexes whose antigens are of the protein class, and the other 136 from complexes whose antigens are of the non-protein class.
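The distance-based interface criterion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes that the Cα coordinate of each antibody surface residue and the coordinates of all antigen atoms have already been parsed from a PDB file, and the function names, residue labels and toy coordinates are hypothetical.

```python
import math

# Distance thresholds (in angstroms) used in the paper for the three ranges.
THRESHOLDS = (8.0, 10.0, 12.0)


def nearest_antigen_distance(ca_coord, antigen_atoms):
    """Distance from a residue's C-alpha atom to the nearest antigen atom."""
    return min(math.dist(ca_coord, atom) for atom in antigen_atoms)


def is_interface(ca_coord, antigen_atoms, cutoff=12.0):
    """A surface residue counts as an interface residue below the cutoff."""
    return nearest_antigen_distance(ca_coord, antigen_atoms) < cutoff


def distance_ranges(ca_coord, antigen_atoms, thresholds=THRESHOLDS):
    """Return every threshold whose range contains the residue.

    A residue belongs to a range when its nearest-atom distance is below the
    threshold, so a residue within 8 A also belongs to the 10 and 12 A ranges.
    """
    d = nearest_antigen_distance(ca_coord, antigen_atoms)
    return [t for t in thresholds if d < t]


# Toy coordinates (hypothetical, for illustration only).
antigen = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
surface_residues = {"H:TYR33": (3.0, 9.0, 0.0), "L:SER52": (3.0, 0.0, 20.0)}

for name, ca in surface_residues.items():
    print(name, is_interface(ca, antigen), distance_ranges(ca, antigen))
```

Because the ranges are nested, the residue counts can only grow with the threshold, which matches the 329, 508 and 668 interface residues reported for 8, 10 and 12 Å.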


To study the distance between an antibody's interface residues and the antigen, different ranges are used here. The interaction surface between antibody and antigen varies, and this variation can be described by different distance ranges between the interface residues and the antigen. In this paper, for the complex 1ktr, when the range is chosen as 8, 10 and 12 Å, the number of interface residues obtained is 2, 5 and 10, respectively. Figure 1 graphically compares the three groups derived from the different distance ranges on 1ktr. It clearly shows the position of the interface residues in the interaction surface relative to the antigen in 3D, which helps us understand the influence of interface residues that belong to different ranges.

In the experiments, three groups of interface residue data are generated from the complex structures. When the distance range is set to 8 Å, we get 329 interface residues; for 253 of them the antigen is of the protein class and for the other 76 it is of the non-protein class. When the distance is set to 10 Å, we have 508 interface residues, of which 400 have antigens of the protein class and the other 108 have antigens of the non-protein class. When the distance is set to 12 Å, we obtain 668 interface residues, of which 532 have antigens of the protein class and the remaining 136 have antigens of the non-protein class.

Fig. 1 Interface residues at different distances in complex 1ktr. a The position of the antigen in the antibody-antigen complex. b-d The distance range is set to 8, 10 and 12 Å, respectively

2.3 Feature for inferring distance range

Recent studies of interface residues show that, compared with the physicochemical properties of the structure, sequence features can better describe the structural information. This is because sequence features are strongly connected to the physicochemical properties of residues. Therefore, the sequence feature of the target residue is used in this paper to identify the distance range between interface residue and antigen.

There are many studies on composing the sequence feature of the target residue [28, 29]. One method is the surface patch. The target residue is defined as a central surface residue and some of its nearest surface neighbors are selected. The number of nearest surface neighbors is decided by the surface patch size. A surface patch concerns the neighbor relationship in space. In this paper, the sequence patch is used for composing the sequence feature. When the sequence patch size is set to 5, the joint sequence feature is composed of the target residue, its 5 preceding neighbors and its 5 following neighbors in the sequence (11 residues in total). A sequence patch thus concerns the neighbor relationship in sequence.

The sequence feature for inferring the distance range is coded as follows: each residue is represented by a 20D


basic vector, i.e., each kind of residue corresponds to 1 of the 20 dimensions of the basic vector. An element with value one indicates that the residue is of that kind; only one position has value one and the others are zero. The sequence feature is therefore coded as a 220D vector (11 residues, 20 dimensions each) if the sequence patch size is set to 5.

2.4 Feature for inferring antigen class

The feature for inferring the antigen class is constructed from the interface residues that belong to the different ranges in the antibody structure. It is a 21D vector, which denotes the composition of the 20 kinds of residue in the interface residues plus the number of interface residues that belong to the distance range of 8, 10 or 12 Å, respectively.

2.5 Support vector machines

To predict the distance range between the antibody's interface residue and the antigen in an antibody/antigen complex, we adopted the algorithm called support vector machine (SVM) [30, 31]. SVM is a kind of learning machine based on statistical learning theory [32, 33], a theory of machine learning that focuses on small sample data and is based on the structural risk minimization principle from computational learning theory [34].

Here is a brief description of the SVM algorithm. Consider the problem of separating a set of training vectors belonging to two separate classes, $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^m$, $y_i \in \{-1, 1\}$, with a hyperplane $(\omega \cdot x) + b = 0$. Figure 2 shows a simple linearly separable case. Solid points and circle points represent the two kinds of sample. H is the separating line. H1 and H2 are the lines parallel to the separating line that pass through the sample vectors of each class closest to it. The distance between H1 and H2 is called the margin. The separating hyperplane is said to be optimal if it classifies the samples into two classes without error (the training error is zero) and the margin is maximal. The sample vectors on H1 and H2 are called support vectors. The separating hyperplane equation is $(\omega \cdot x) + b = 0$, where the sample vectors $(x_i, y_i)$, $i = 1, \ldots, n$, should satisfy

$$y_i[(\omega \cdot x_i) + b] \geq 1, \quad i = 1, \ldots, n. \tag{1}$$

The distance of a point $x$ to the hyperplane $(\omega, b)$ is $d(\omega, b; x) = \frac{|(\omega \cdot x) + b|}{\|\omega\|}$. The optimal hyperplane is given by maximizing the margin, subject to (1). The margin is given by $\rho(\omega, b) = \frac{2}{\|\omega\|}$. Hence the hyperplane that optimally separates the data is the one that minimizes

$$\Phi(\omega) = \frac{\|\omega\|^2}{2}. \tag{2}$$

The Lagrange function of (2) under constraints (1) is

$$\Phi(\omega, b, \alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^{n} \alpha_i \{ y_i[(\omega \cdot x_i) + b] - 1 \}. \tag{3}$$

The optimal classification function, once solved, is $f(x) = \mathrm{sgn}\{(\omega^* \cdot x) + b^*\}$, where the asterisk marks the optimal values.

For the nonlinear case, we map the original space into a high-dimensional space by a nonlinear mapping, in which an optimal hyperplane can be sought. The inner product function $K$ enables the classification in the new space without increasing the computational complexity. The corresponding program is to maximize

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \tag{4}$$

The corresponding separating function is

$$f(x) = \mathrm{sgn}\left\{ \sum_{i=1}^{n} \alpha_i^* y_i K(x_i, x) + b^* \right\}. \tag{5}$$

This is the so-called SVM.
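As an illustration only, the following Python sketch ties together two pieces described above: the one-hot sequence-patch encoding of Sect. 2.3 and the kernel decision function of Eq. (5). It is not the LIBSVM implementation used in the paper. The RBF kernel, the zero-padding at chain ends, the example sequences and the hand-picked multipliers are all assumptions made for this sketch; the paper does not state which kernel was used or how chain ends were handled.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each


def encode_patch(sequence, index, patch_size):
    """One-hot encode the residue at `index` plus `patch_size` neighbors per side.

    Each residue becomes a 20-D block containing a single 1. Positions that
    fall outside the chain are encoded as all zeros (an assumption of this
    sketch, not something stated in the paper).
    """
    vector = []
    for pos in range(index - patch_size, index + patch_size + 1):
        block = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            block[AMINO_ACIDS.index(sequence[pos])] = 1
        vector.extend(block)
    return vector


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel K(u, v) = exp(-gamma * ||u - v||^2) (assumed kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Eq. (5): sign of sum_i alpha_i * y_i * K(x_i, x) + b.

    A score of exactly zero is mapped to +1 here by convention.
    """
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1


# Arbitrary example sequences (hypothetical, not taken from the data set).
features = encode_patch("EVQLVESGGGLVQ", index=6, patch_size=5)
print(len(features), sum(features))  # 220 dimensions, 11 ones

# Hand-picked multipliers for two support vectors (not a trained model).
svs = [encode_patch("EVQLVESGGGLVQ", 6, 1), encode_patch("DIQMTQSPSSLSA", 6, 1)]
print(svm_decision(svs[0], svs, labels=[1, -1], alphas=[1.0, 1.0], b=0.0))
```

In practice the multipliers and the bias come from solving the dual problem (4) with a package such as LIBSVM; the sketch only shows how a solved model is evaluated on an encoded patch.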

Fig. 2 Optimal separating hyperplane

The support vector machine provides a way to avoid the possible curse of dimensionality in the algorithm: when constructing a discriminant function, SVM does not solve the problem in the feature space after mapping the original sample space into a high-dimensional space by a nonlinear mapping. Instead, it compares the sample vectors in the input space and performs the nonlinear mapping after the comparison. The function K is called the kernel function of the dot product. In [30], it is defined as a distance between sample vectors. The method above can ensure that all training samples are classified accurately. That is, on condition that the empirical risk is zero, SVM can get the best generalization ability by


maximizing the margin. SVMs have been used to handle many problems in bioinformatics [10, 18, 35-37].

In this paper, an integrated software package for support vector classification named LIBSVM (Version 2.71) [38] is employed to predict antibody interaction sites and the antigen class.

2.6 Evaluation measure

The tests of classification accuracy are divided into two parts, the self-consistent test and the cross-validation test, which are common ways of testing prediction power. The self-consistent test checks self-consistency by using the training dataset itself as the testing data. Cross-validation tests the generalization ability of the method.

In the self-consistent test, the dataset acts as both the training set and the testing set. The LIBSVM parameters are adjusted until the result of the self-consistent test is satisfactory. Indicators such as prediction accuracy and correlation coefficient are obtained from the self-consistent test to analyze the effectiveness of the method. The rationale is that a good model should show good prediction accuracy when the original data are used as the testing data. The correlation coefficient falls in the range [-1, 1]. It is introduced to avoid the negative impact of imbalance between the classes of data. For example, if two types of data occur in a 4:1 ratio in a single dataset, then always predicting the larger type is already 80% accurate, so accuracy alone says little about the prediction when the same dataset is used for testing. For a model applied to the data, the correlation coefficient is -1 if the prediction is completely contrary to the true value, 1 if the prediction is correct, and 0 if the prediction is random. The correlation coefficient is calculated as follows.

$$\text{Correlation coefficient} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

where

TP (true positive): the number of records in the first class that have been classified correctly.

FP (false positive): the number of records in the second class that have been classified into the first class.

TN (true negative): the number of records in the second class that have been classified correctly.

FN (false negative): the number of records in the first class that have been classified into the second class.

After passing the re-substitution test, a fivefold cross validation is applied to the dataset, as follows. The dataset is split into five parts. One of the five acts as the testing set and the other four as the training set used to construct the mathematical model. The process rotates five times, with each part serving as the testing set in one round. In the generalization test, the mean accuracy over the fivefold cross-validation tests is used as the accuracy measure of the experiment. If the method is sound, then the extracted features explain well the correlation between the antibody-antigen interaction residues and the antibody-antigen interaction surface. In this situation, the cross-validation test should perform well, and the mean accuracy of the fivefold cross validation should also be comparatively high.

3 Results and discussions

3.1 Predict the distance range

3.1.1 Self-consistent test

When the distance range is set to 8 Å and the antigen is of the protein class, the accuracy of the self-consistent test increases from 97.53 to 98.79%, and the correlation coefficient from 0.587861 to 0.79216, as the sequence patch size increases from 1 to 5. When the sequence patch size is four, both indicators, accuracy and correlation coefficient, reach their best values, 99.01% and 0.829109, respectively.

The self-consistent test results for predicting the distance ranges when the antigen is of the protein class, the non-protein class and the class of protein and non-protein can be read similarly from Tables 1, 2 and 3, respectively.

3.1.2 Cross-validation test

When the distance range is 8 Å and the antigen is of the protein class, the accuracy of the cross validation increases from 94.66 to 95.79% as the sequence patch size increases from 1 to 5. When the sequence patch size is four, the accuracy reaches its best value of 95.86%. When the antigen is of the non-protein class, the accuracy of the cross validation increases from 93.72 to 94.52% as the sequence patch size changes from 1 to 5. When the sequence patch size is four or five, the best accuracy value is 94.52%. When the


antigen is of the protein and non-protein class, the accuracy of the cross validation increases from 94.34 to 95.83% as the sequence patch size increases from 1 to 5. When the sequence patch size is five, the accuracy reaches its best value of 95.83%.

The accuracy of the cross validation for the distance ranges 10 and 12 Å when the antigen is of the protein class, the non-protein class, and the class of protein and non-protein can be read similarly from Table 4.

Table 1 The self-consistent test results with different sequence patch sizes when the antigen is of the protein class

Distance range (Å)  Sequence patch size  Accuracy (%)  Correlation coefficient
8                   1                    97.53         0.587861
8                   2                    98.96         0.820909
8                   3                    98.72         0.77995
8                   4                    99.01         0.829109
8                   5                    98.79         0.79216
10                  1                    97.03         0.672668
10                  2                    98.57         0.839235
10                  3                    98.96         0.882228
10                  4                    98.67         0.849841
10                  5                    98.33         0.812528
12                  1                    95.43         0.613591
12                  2                    98.12         0.834975
12                  3                    97.44         0.777317
12                  4                    98.09         0.832873
12                  5                    97.68         0.797929

Table 2 The self-consistent test results with different sequence patch sizes when the antigen is of the non-protein class

Distance range (Å)  Sequence patch size  Accuracy (%)  Correlation coefficient
8                   1                    96.77         0.515651
8                   2                    99.01         0.846579
8                   3                    98.65         0.79309
8                   4                    98.65         0.796243
8                   5                    99.46         0.918223
10                  1                    96.41         0.60911
10                  2                    94.88         0.450205
10                  3                    95.51         0.521592
10                  4                    98.56         0.840512
10                  5                    98.11         0.800671
12                  1                    97.22         0.751736
12                  2                    98.65         0.878701
12                  3                    99.10         0.917999
12                  4                    98.38         0.853383
12                  5                    98.56         0.869748

Table 3 The self-consistent test results with different sequence patch sizes when the antigen is of the protein and non-protein class

Distance range (Å)  Sequence patch size  Accuracy (%)  Correlation coefficient
8                   1                    97.39         0.573704
8                   2                    98.88         0.811367
8                   3                    98.50         0.74865
8                   4                    98.91         0.817575
8                   5                    98.69         0.780672
10                  1                    96.80         0.649831
10                  2                    98.40         0.820558
10                  3                    98.08         0.785489
10                  4                    98.88         0.873085
10                  5                    98.17         0.796534
12                  1                    95.51         0.616922
12                  2                    98.00         0.823716
12                  3                    97.93         0.817578
12                  4                    97.51         0.781856
12                  5                    97.70         0.797733

Table 4 The cross validation test results with different sequence patch sizes and antigen classes

Distance range (Å)  Sequence patch size  Antigen: protein  Antigen: non-protein  Antigen: protein and non-protein
8                   1                    94.66             93.72                 94.34
8                   2                    95.43             93.72                 94.95
8                   3                    95.65             94.25                 95.35
8                   4                    95.86             94.52                 95.60
8                   5                    95.79             94.52                 95.83
10                  1                    93.14             93.72                 93.30
10                  2                    94.63             93.72                 93.83
10                  3                    94.85             94.25                 94.67
10                  4                    95.50             94.52                 95.31
10                  5                    95.65             94.52                 95.41
12                  1                    91.76             92.01                 92.00
12                  2                    93.57             92.10                 93.95
12                  3                    93.94             93.36                 94.46
12                  4                    94.59             94.08                 95.03
12                  5                    95.24             94.70                 95.45

The accuracy in the table denotes the average accuracy (%) of the fivefold experiment

3.2 Identification of antigen class

The experiment of identifying the antigen class from the antibody interface residues that belong to different ranges is designed in two parts. First, the interface residues of the antibody-antigen interaction that belong to the different distance ranges are computed directly from the structure information. Then, we use the SVM to train and test the input features, which consist of the residue composition of the antibody interface residues and the total number of interface residues. The prediction results for the different distance ranges are listed in Table 5, and they show that high accuracy is attained when the distance range is set to 8 and 12 Å.

Another test is performed by constructing a two-layer classifier. Using the predicted distance ranges of the interface residues as the starting point for predicting the antigen class, we can examine the


antibody sequence and surface residue information and predict the antigen class directly. This scheme is feasible because the high prediction accuracy of the interface residues provides a solid guarantee. As evidence, the prediction accuracy listed in Table 6 is satisfactory.

Table 5 The self-consistent and cross validation test results with different distance thresholds

Distance range (Å)  Self-consistent test accuracy (%)  Self-consistent test correlation coefficient  Cross validation test average accuracy of fivefold experiment (%)
8                   100                                1                                             86.49
10                  94.59                              0.669643                                      83.78
12                  100                                1                                             86.49

Table 6 The cross validation test results with different distance ranges and sequence patch sizes, combining the antibody interface residue prediction results

Sequence patch size  Distance range 8 Å  Distance range 10 Å  Distance range 12 Å
1                    81.59               78.17                79.57
2                    82.12               78.61                81.26
3                    82.47               79.31                81.70
4                    82.68               79.85                82.19
5                    82.88               79.93                82.55

The accuracy listed in the table is the average accuracy (%) of the fivefold experiment

4 Discussions

The vast quantities of existing immunological data and advanced information technology have boosted research on computational immunology. Understanding the details of the antibody-antigen interaction surface via information technology has become a new research direction in immunoreaction. Experimental data analysis using machine learning methods may help explain, and provide significant insight into, the complex phenomenon of antibody-antigen interaction. The details of the antibody-antigen interaction surface can be observed through the distance between the antibody's interface residues and the antigen. This distance helps us understand the position of each interface residue relative to the antigen in 3D, which is connected to the affinity of the antibody-antigen interaction.

As a pilot study of the correlation between the antibody-antigen interaction residues and the antibody-antigen interaction surface, a total of 668 interface residues was extracted from the 5,253 surface residues and divided into three classes by distance range. Based on the different distance ranges, sequence patch sizes and antigen classes, 45 samples were created to predict the distance range between the antibody's interface residue and the antigen from the sequence feature of the antibody's surface residues, and 3 samples were chosen to identify the antigen class from the characteristics of the antibody's interface residues at the different distance ranges. The results show that the different settings affect the prediction accuracy in complicated ways.

Through these experiments with the support vector machine, we can observe the following: the antigen class greatly influences the accuracy of predicting the distance range of an interface residue, and high accuracy is achieved with antigen classification. Within the same distance range, increasing the sequence patch size helps increase the prediction accuracy. The larger the sequence patch size is, the closer the accuracies of the different distance ranges become. When the sequence patch size is small, the distance range has a great influence on the accuracy. The results of this paper show that it is possible to infer the distance range between the antibody's interface residue and the antigen, and the compatible antigen class, from antibody structure data.

The results obtained in this paper can be explored further in the following respects. The manually selected data set is still small, and a balanced data set is needed. The prediction accuracy could be lower than that obtained in this paper if the data set were larger and balanced. The interface through which an antibody binds its antigen differs between antigen classes, so we will focus our study on one kind of antigen. The position of every interface residue relative to the antigen in 3D is tied to the affinity of the antibody-antigen interaction. More precise identification of the distance ranges of interface residues may provide a chance to derive a better understanding of the interaction between antibody and antigen. If other known computational methods, such as neural networks, are used, we may investigate which method brings better results on the relationship between antibody structure and antibody function.

Note that only the sequence information of the antibody has been used in this paper. Following this paper, our


future interest will be to use different machine learning techniques, such as multiple criteria mathematical programming [39, 40], on large-scale structure information to pursue a higher prediction accuracy.

5 Conclusions

For high affinity in an antibody-antigen interaction, it is important to know whether the two surfaces match each other. The details of the antibody-antigen interaction are represented by the distance between the antibody's interface residues and the antigen. In the past, much effort has focused on the characteristics of antibody structure, the antibody-antigen binding site, and the affinity and specificity of the antibody. This paper has proposed a machine learning approach to explore the interaction between the antibody's interface residues and the antigen. That is to say, the interaction information can be predicted from the structural characteristics. Based on

References

4. Bath TN, Bentley GA, Fischmann TO, Boulot G, Poljak RJ (1990) Small rearrangements in structures of Fv and Fab fragments of antibody D1.3 on antigen binding. Nature 347:483-485

5. Colman PM, Laver WG, Varghese JN, Baker AT, Tulloch PA, Air GM, Webster RG (1987) Three-dimensional structure of a complex of antibody with influenza virus neuraminidase. Nature 326:358-363

6. Xiang J, Sha Y, Prasad L, Delbaere LTJ (1996) Complementarity determining region residues aspartic acid at H55 serine at tyrosines at H97 and L96 play important roles in the B72.3 antibody-TAG72 antigen interaction. Protein Eng 9:539-543

7. Chothia C, Lesk AM, Gherardi E, Tomlinson IM, Walter G, Marks JD, Lewelyn MB, Winter G (1992) Structural repertoire of the human Vh segments. J Mol Biol 227:799-817

8. Iba Y, Hayshi N, Sawada I, Titani K, Kurosawa Y (1998) Changes in the specificity of antibodies against steroid antigens by introduction of mutations into complementarity-determining regions of Vh domain. Protein Eng 11:361-370

9. Rees AR, Staunton D, Webster DM (1994) Antibody design: beyond the natural limits. Trends Biotechnol 12:199-207

10. Minakuchi1 Y, Konagaya A (2004) Prediction of protein-protein interaction sites using support vector machines. Protein Eng Des Sel 17:165-173

11. Chakrabarti P, Janin J (2002) Dissecting protein-protein
