Neural Comput & Applic (****) **:*** ***
ORIGINAL ARTICLE
Predicting the distance between antibody's interface residue and antigen to recognize antigen types by support vector machine
Yong Shi · Xinyang Zhang · Jia Wan · Yong Wang · Wei Yin · Zhiwei Cao · Yajun Guo
Received: 14 November 2005 / Accepted: 3 November 2006 / Published online: 25 November 2006
© Springer-Verlag London Limited 2006
Abstract In this paper, a machine learning approach known as the support vector machine (SVM) is employed to predict the distance between an antibody's interface residues and the antigen in an antigen-antibody complex. The heavy chains, light chains and corresponding antigens of 37 antibodies are extracted from the antibody-antigen complexes in the Protein Data Bank. According to different distance ranges, sequence patch sizes and antigen classes, a number of computational experiments are conducted to describe the distance between the antibody's interface residues and the antigen using antibody sequence information. The high prediction accuracy of both the self-consistent and the cross-validation tests indicates that the sequence information extracted from the antibody structure contributes much to predicting the distance between the antibody's interface residues and the antigen. Furthermore, the antigen class is predicted by SVM from the residue composition of the interface residues that belong to different distance ranges, which shows some potential significance.

Keywords Bioinformatics · Protein · Protein data bank · Antibody-antigen complexes · Support vector machine · Cross-validation

Y. Shi · X. Zhang · W. Yin
Research Center on Data Technology and Knowledge Economy, Graduate University of Chinese Academy of Sciences, Beijing 100080, China
e-mail: ****@*****.**.**

J. Wan
Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

Y. Wang
Department of Electrical Engineering and Electronics, Osaka Sangyo University, Daito, Osaka 574-8530, Japan

Z. Cao
Shanghai Center for Bioinformation Technology, Shanghai 200235, China

Y. Guo
Shanghai International Cancer Institute of China, Shanghai 200433, China

1 Introduction

The explosive growth in biotechnology combined with major advances in information technology has the potential to radically transform immunology in the post-genomics era [1]. Indeed, computational immunology aims to provide tools for the extraction, comparison, analysis and interpretation of not only the vast quantities of existing data, but also the newly accumulated data relevant to immunology.

Antibodies are immunoglobulins that are similar in sequence and structure. According to the ratio of different amino acids at a given position among different antibodies, each antibody can be divided into constant and variable regions. An antibody binds its corresponding antigen through the complementarity determining regions (CDRs), which are also known as hypervariable (HV) loops because of their dramatic variability relative to the rest of the variable regions. Much effort has focused on the characteristics of antibody structure, the antibody-antigen binding site, and the affinity and specificity of the antibody [2-8]. A good fit of the two binding surfaces in an antibody-antigen complex is important for high affinity [9]. The details of the antibody-antigen interaction
Neural Comput & Applic (2007) 16:481-490
surface can be observed through the distance between the antibody's interface residues and the antigen. This helps us understand the position of each interface residue relative to the antigen in 3D, which is connected to the affinity of the antibody-antigen interaction.

In the bioinformatics field, there are many studies of protein-protein interaction using different machine learning approaches [10-13]. Fariselli and Casadio [14] and Fariselli et al. [15] used neural network methods to study protein-protein contact sites based on chemico-physical and evolutionary information. The reported precision is about 73%. Similarly, Ofran and Rost [16, 17] predicted protein-protein interaction sites from sequence information using a neural network. Besides neural networks, the support vector machine (SVM) has been applied in the field [18, 19]. In these studies, the protein-protein complexes were divided into six classes in order to improve the accuracy. Each class was analyzed and predicted using sequence information. Among these six classes, the antibody-antigen complex was studied as a specific one. The sensitivity is 82.3%, the specificity is 81.0% and the correlation coefficient is 0.43. However, the characteristics of the residues, such as the distance range between antibody and antigen, need to be investigated further.

To advance this direction of research, this paper proposes a method that uses machine learning to predict the distance between the antibody's interface residues and the antigen from the residue sequence characteristics of antibody structures, as a pilot study to discover the correlation between antibody-antigen interaction residues and the antibody-antigen interaction surface.

In this research, the proposed framework for predicting the distance between the antibody's interface residues and the antigen includes three stages:

Stage 1 The surface residues and the interior residues are distinguished using protein structure information. To find surface residues, relative solvent accessibility is employed. Relative solvent accessibility is an important characteristic of protein structures that has been extensively studied [20-24]. The well-known program DSSP [25] is employed to compute the accessible surface area (ASA). Threshold values are used to determine the binary categories.

In our study, one kind of complex structure is designated as the basic standard. It is extracted from the antibody-antigen complexes selected from the Protein Data Bank (PDB) [26]. The complex structure is composed of heavy chains, light chains and the corresponding antigens. It is then modified by filling in the missing residues in the heavy chain or light chain sequences. After removing redundant information, 37 basic structures are extracted. A total of 5,253 surface residues is then selected from these 37 complex structures to form the data set.

Stage 2 The interface residues are distinguished. Jones and Thornton [27] reported that if the ASA calculated for a surface residue in the complex is at least 1 Å² less than the value calculated in the monomer, this surface residue is regarded as an interface residue. Fariselli et al. [15] also indicated that if the Cα distance between two neighboring surface residues is less than 12 Å, these two surface residues are regarded as contacting residues. The latter criterion has been adopted by other researchers for its simplicity and ease of implementation. Surface residues are defined as interface residues if they contact residues in a different structure. In this paper, if the calculated distance is less than 12 Å, the residue is defined as contacting the antigen. A total of 668 interface residues is then selected from the 5,253 surface residues.

Stage 3 The distance range between the antibody's interface residues and the antigen is predicted. To study the distance range, the distance from the Cα atom of each interface residue to the nearest atom in the antigen is calculated. If the calculated distance is less than a threshold value, the distance between the interface residue and the antigen is defined as belonging to that range. Distances of 8, 10 and 12 Å are selected as the threshold values. The prediction accuracies derived from these distance ranges are compared and analyzed. There are 329, 508 and 668 interface residues belonging to the distance ranges 8, 10 and 12 Å, respectively.

With the data set prepared, the machine learning framework in this paper can be considered as a two-step process:

Step 1 Extract the feature vector from the sequence information. In this paper, two coding methods are proposed for handling the sequence feature, amino acid composition, etc. A scheme is also proposed to extract the feature vector from the information of sequence neighbors.

Step 2 Use an experimental scheme to mine the useful information with the support vector machine (SVM). The details of the dataset, methodology and evaluation measures are reported in Sect. 2.

In this study, two factors, the sequence patch size and the antigen class, are mainly explored to examine how they affect the prediction accuracy.

In our experiments, five sequence patch sizes are chosen: 1, 2, 3, 4 and 5. The prediction accuracies for these sequence patch sizes are investigated below.

We are also concerned with how antigen classes affect the antibody-antigen interaction surface. Different types of antibody combining sites have been studied
[2]. These are cavity or pocket (hapten), groove (peptide, DNA, carbohydrate) and planar (protein). In our research, antigens are divided into two classes (protein, non-protein) for study. The influence of antigen classification is studied by comparing the accuracy with classification to the accuracy without classification. The combined antigen class (protein and non-protein) corresponds to the case without classification.

In total, 45 samples are produced according to the different distance ranges, sequence patch sizes and antigen classes in the data set (3 distance ranges × 5 sequence patch sizes × 3 antigen classes). The samples are trained and tested using SVM to predict the distance range between the antibody's interface residues and the antigen.

Based on the prediction results for interface residues that belong to different distance ranges, another experiment is designed to predict the class of the antigen, because different types of antibody-antigen interface have different surface characteristics [2]. In our study, the number of each kind of interface residue belonging to a given distance range is calculated as the basis for composing the attributes of the interface residues.

Three samples are produced according to the three distance ranges in our data set. The SVM is used again for training and testing so that the functional class can be identified from the antibody structure data.

These numerical results are shown and discussed in Sect. 3, while further discussion and the conclusions are given in Sects. 4 and 5.

2 Materials and methods

2.1 Collection of complex structure

The structural data of antibody-antigen complexes exist in the relevant biological literature and databases. In this paper, all the antigen-antibody complex files are selected from the PDB file library. The heavy chain, light chain and corresponding antigen from the complex are collectively defined as the basic complex structure, which is the starting point of our research.

After testing the selected basic structures, some missing residues in the heavy chain and light chain of these structures are detected. We fill in these missing residues using HyperChem 5.1 for Windows (Hypercube, FL, USA).

If the heavy chains and the light chains of the antibody in two complex structures are exactly the same, one of the structures is defined as redundant. After examination of all the basic structures, 37 non-redundant complex structures are extracted.

In the 37 complex structures obtained, 3 of the antigens are nucleic acid, 4 of the antigens are heterocomplex, and the rest are proteins. To simplify the problem, the antibody structures are further divided into two classes according to whether the antigen is a protein or not. The influence of antigen classification is studied by comparing the accuracy with classification to the accuracy without classification. The combined antigen class (protein and non-protein) corresponds to the case without classification.

2.2 Antibody surface residue and antibody interface residues

The residues in the antibody structure can be divided into two groups: (1) residues embedded in the structure; (2) residues on the surface of the structure. It is commonly accepted that the surface residues in the antibody structure play a vital role in the interaction between antibody and antigen.

The accessible surface area method is used to recognize surface residues. The ASA is calculated using the DSSP program. If the ASA of a residue in the chain structure exceeds 25% of the value calculated for the residue alone (i.e., its relative solvent accessibility is above 25%), this residue is regarded as a surface residue of the antibody structure.

After computation, 5,253 surface residues are recognized from the 37 non-redundant complex structures. Among them, 4,139 surface residues belong to complexes whose antigens are of the protein class, and the other 1,114 surface residues belong to complexes whose antigens are of the non-protein class.

Antibody function is accomplished through the antibody-antigen interaction surface. Interface residues in the antibody play an important role in the antibody-antigen interaction. There are many methods to identify interface residues in a protein-protein complex. Fariselli et al. [15] indicated that two surface residues are regarded as interface residues if the distance between the Cα atom of one surface residue in one protein and the Cα atom of one surface residue in another protein is less than 12 Å.

In this paper, the coordinate of the Cα atom of a residue is taken as the coordinate of this residue. The distances between a surface residue in the antibody structure of a complex and every atom in the antigen of the same complex are calculated to identify whether this surface residue is an interface residue or not. If the distance is less than 12 Å, the surface residue is considered to contact the antigen, and it is identified as an interface residue.

After computation, 668 interface residues are recognized from the 5,253 surface residues. Among them, 532 interface residues belong to complexes whose antigens are of the protein class, and the other 136 interface residues belong to complexes whose antigens are of the non-protein class.
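
The interface criterion described in this section can be illustrated with a minimal Python sketch. It assumes that the Cα coordinates of the surface residues and the coordinates of the antigen atoms have already been parsed from a PDB file; the function names, residue labels and toy coordinates are hypothetical and are not taken from the paper's data.

```python
import math

def nearest_antigen_distance(ca_xyz, antigen_atoms):
    """Distance (in angstroms) from one C-alpha atom to the closest antigen atom."""
    return min(math.dist(ca_xyz, atom) for atom in antigen_atoms)

def interface_residues(surface_ca, antigen_atoms, threshold=12.0):
    """Return {residue_id: distance} for surface residues closer than `threshold`."""
    result = {}
    for res_id, ca_xyz in surface_ca.items():
        d = nearest_antigen_distance(ca_xyz, antigen_atoms)
        if d < threshold:
            result[res_id] = d
    return result

# Toy coordinates (hypothetical, not taken from any PDB entry).
surface_ca = {
    "H52": (0.0, 0.0, 0.0),
    "H31": (0.0, 14.0, 0.0),
    "H99": (9.0, 0.0, 0.0),
    "L30": (35.0, 0.0, 0.0),
}
antigen_atoms = [(0.0, 5.0, 0.0), (20.0, 0.0, 0.0)]

# The same residues are re-examined under the three thresholds used in the paper.
for cutoff in (8.0, 10.0, 12.0):
    print(cutoff, sorted(interface_residues(surface_ca, antigen_atoms, cutoff)))
```

Because the ranges are nested, every residue within 8 Å is also within 10 and 12 Å, which is consistent with the reported counts of 329, 508 and 668 interface residues.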
To study the distance between the antibody's interface residues and the antigen, different ranges are used here. The interaction surface between antibody and antigen differs from complex to complex, and this can be described by different distance ranges between the interface residues and the antigen. In this paper, for the complex 1ktr, when the range is chosen as 8, 10 and 12 Å, the numbers of interface residues obtained are 2, 5 and 10, respectively. Figure 1 graphically compares the three groups derived from the different distance ranges on 1ktr. It clearly shows the position of the interface residues in the interaction surface relative to the antigen in 3D, which helps us understand the influence of interface residues belonging to different ranges.

In the experiments, three groups of interface residue data are generated from the complex structures. When the distance range is set to 8 Å, we get 329 interface residues; for 253 of them the antigen is of the protein class and for the other 76 the antigen is of the non-protein class. When the distance is set to 10 Å, we have 508 interface residues, of which 400 have antigens of the protein class and the other 108 have antigens of the non-protein class. When the distance is set to 12 Å, we obtain 668 interface residues, of which 532 have antigens of the protein class and the remaining 136 have antigens of the non-protein class.

Fig. 1 Interface residues at different distance ranges for complex 1ktr. a The position of the antigen in the antibody-antigen complex. b-d The distance range is set to 8, 10 and 12 Å, respectively

2.3 Feature for inferring distance range

Recent studies of interface residues show that, compared with the physicochemical properties of the structure, the sequence feature can better describe the structural information. This is because the sequence feature is strongly connected with the physicochemical properties of the residues. Therefore, the sequence feature of the target residue is used in this paper to identify the distance range between an interface residue and the antigen.

There are many studies on composing the sequence feature of the target residue [28, 29]. One method is the surface patch. The target residue is defined as a central surface residue and some of its nearest surface neighbors are selected. The number of nearest surface neighbors is decided by the surface patch size. The surface patch concerns the neighbor relationship in space. In this paper, the sequence patch is used to compose the sequence feature. When the sequence patch size is set to 5, the joint sequence feature is composed of the target residue, its 5 preceding neighbors and its 5 following neighbors in the sequence (11 residues in total). The sequence patch concerns the neighbor relationship in sequence.

The sequence feature for inferring the distance range is coded as follows: each residue is represented by a 20D
basic vector, i.e., each kind of residue corresponds to 1 of the 20 dimensions in the basic vector. An element of the vector with value one means that the residue belongs to that kind. Only one position has value one and the others are zero. The sequence feature is therefore coded as a 220D vector (11 residues × 20 dimensions) if the sequence patch size is set to 5.

2.4 Feature for inferring antigen class

The feature for inferring the antigen class is constructed from the interface residues that belong to a given distance range in the antibody structure. It is a 21D vector, which denotes the composition of the 20 kinds of residue among the interface residues plus the number of interface residues belonging to the distance range (8, 10 or 12 Å, respectively).

2.5 Support vector machines

To predict the distance range between the antibody's interface residues and the antigen in an antibody-antigen complex, we adopted an algorithm called the support vector machine (SVM) [30, 31]. SVM is a kind of learning machine based on statistical learning theory [32, 33], a theory of machine learning that focuses on small samples and is based on the structural risk minimization principle from computational learning theory [34].

Here is a brief description of the SVM algorithm. Consider the problem of separating a set of training vectors belonging to two separate classes, $D = \{(x_1, y_1), \ldots, (x_l, y_l)\}$, $x \in R^n$, $y \in \{-1, 1\}$, with a hyperplane $(\omega \cdot x) + b = 0$. Figure 2 shows a simple linearly separable case. Solid points and circle points represent the two kinds of sample. H is the separating line. H1 and H2 are the lines parallel to the separating line that pass through the sample vectors of each class closest to it. The distance between H1 and H2 is called the margin. The separating hyperplane is said to be optimal if it classifies the samples into two classes without error (the training error is zero) and the margin is maximal. The sample vectors on H1 and H2 are called support vectors.

Fig. 2 Optimal separating hyperplane

The separating hyperplane equation is $(\omega \cdot x) + b = 0$, where the sample vectors $(x_i, y_i)$, $i = 1, \ldots, l$, should satisfy

$$y_i[(\omega \cdot x_i) + b] \ge 1, \quad i = 1, \ldots, l. \qquad (1)$$

The distance of a point $x$ to the hyperplane $(\omega, b)$ is

$$d(\omega, b; x) = \frac{|(\omega \cdot x) + b|}{\|\omega\|}.$$

The optimal hyperplane is given by maximizing the margin, subject to (1). The margin is given by $\rho(\omega, b) = 2 / \|\omega\|$. Hence the hyperplane that optimally separates the data is the one that minimizes

$$\Phi(\omega) = \frac{\|\omega\|^2}{2}. \qquad (2)$$

The Lagrange function of (2) under constraints (1) is

$$\Phi(\omega, b, \alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^{l} \alpha_i \{ y_i[(\omega \cdot x_i) + b] - 1 \}. \qquad (3)$$

The optimal classification function, once solved, is $f(x) = \mathrm{sgn}\{(\omega \cdot x) + b\}$.

For the nonlinear case, we map the original space into a high-dimensional space by a nonlinear mapping, in which an optimal hyperplane can be sought. The inner product function enables classification in the new space without increasing the computational complexity. The corresponding program is to maximize

$$Q(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \qquad (4)$$

The corresponding separating function is

$$f(x) = \mathrm{sgn}\left\{ \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right\}. \qquad (5)$$

This is the so-called SVM.

The support vector machine provides a way to avoid the possible curse of dimensionality in the algorithm: when constructing a discriminant function, SVM does not solve the problem in the feature space after mapping the original sample space into a high-dimensional space by a nonlinear mapping. Instead, it compares the sample vectors in the input space and performs the nonlinear mapping after the comparison. The function K is called the kernel function of the dot product. In [30], it is defined as a distance between sample vectors. The method above can ensure that all training samples are accurately classified. That is, on condition that the empirical risk is zero, SVM can get the best generalization ability by
maximizing the margin. SVMs have been used to handle many problems in bioinformatics [10, 18, 35-37].

In this paper, an integrated software package for support vector classification named LIBSVM (Version 2.71) [38] is employed to predict antibody interaction sites and antigen class.

2.6 Evaluation measure

The tests of classification accuracy can be divided into two parts, the self-consistent test and the cross-validation test, which are the common ways of testing prediction power. The self-consistent test examines self-consistency by using the training dataset itself as the testing data. Cross-validation tests the generalization ability of the method.

In the self-consistent test, the dataset acts as both the training set and the testing set. The parameters in LIBSVM are adjusted until the result of the self-consistent test is satisfactory. Indicators such as the prediction accuracy and the correlation coefficient can be obtained from the self-consistent test to analyze the effectiveness of the method. The rationale is that a sound model should reproduce its own training data with good prediction accuracy.

The correlation coefficient falls into the range [-1, 1]. It is introduced to avoid the negative impact of the imbalance between different classes of data. For example, if two types of data occur in a 4:1 ratio in a single dataset, then always predicting the larger type is 80% accurate, although such a prediction is meaningless. The correlation coefficient is -1 if the prediction is completely contrary to the exact value, 1 if the prediction is correct, and 0 if the prediction is randomly produced. The correlation coefficient is calculated as follows:

$$\text{Correlation coefficient} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

where
TP (true positive): the number of records in the first class that have been classified correctly.
FP (false positive): the number of records in the second class that have been classified into the first class.
TN (true negative): the number of records in the second class that have been classified correctly.
FN (false negative): the number of records in the first class that have been classified into the second class.

After passing the re-substitution test, a fivefold cross validation is applied to the dataset. The fivefold cross validation works as follows. The dataset is split into five parts. One of the five acts as the testing set and the other four as the training set used to construct the mathematical model. The process rotates five times, with each part acting as the testing set in one round. In the generalization test, the mean accuracy over the fivefold cross-validation tests is used as the accuracy measure of the experiment. If the method is sound, the extracted features can explain well the correlation between antibody-antigen interaction residues and the antibody-antigen interaction surface. In this situation, the cross-validation test should show a high performance level, and the mean accuracy of the fivefold cross validation should also be comparatively high.

3 Results and discussions

3.1 Predict the distance range

3.1.1 Self-consistent test

When the distance range is set to 8 Å and the antigen is of the protein class, the accuracy of the self-consistent test increases from 97.53 to 98.79% and the correlation coefficient increases from 0.587861 to 0.79216 as the sequence patch size increases from 1 to 5. When the sequence patch size is four, both indicators, accuracy and correlation coefficient, reach their best values, which are 99.01% and 0.829109, respectively.

The self-consistent test results for predicting the distance ranges when the antigen is of the protein class, the non-protein class and the combined class of protein and non-protein can be explained similarly, as shown in Tables 1, 2 and 3, respectively.

3.1.2 Cross-validation test

When the distance range is 8 Å and the antigen is of the protein class, the accuracy of the cross validation increases from 94.66 to 95.79% as the sequence patch size increases from 1 to 5. When the sequence patch size is four, the accuracy reaches its best value of 95.86%. When the antigen is of the non-protein class, the accuracy of the cross validation increases from 93.72 to 94.52% as the sequence patch size changes from 1 to 5. When the sequence patch size is four or five, the best accuracy value is 94.52%. When the
Table 1 The self-consistent test results with different sequence patch size when the antigen is protein class
Distance range (Å)   Sequence patch size   Accuracy (%)   Correlation coefficient
8                    1                     97.53          0.587861
8                    2                     98.96          0.820909
8                    3                     98.72          0.77995
8                    4                     99.01          0.829109
8                    5                     98.79          0.79216
10                   1                     97.03          0.672668
10                   2                     98.57          0.839235
10                   3                     98.96          0.882228
10                   4                     98.67          0.849841
10                   5                     98.33          0.812528
12                   1                     95.43          0.613591
12                   2                     98.12          0.834975
12                   3                     97.44          0.777317
12                   4                     98.09          0.832873
12                   5                     97.68          0.797929

Table 3 The self-consistent test results with different sequence patch size when the antigen is protein and non-protein class
Distance range (Å)   Sequence patch size   Accuracy (%)   Correlation coefficient
8                    1                     97.39          0.573704
8                    2                     98.88          0.811367
8                    3                     98.50          0.74865
8                    4                     98.91          0.817575
8                    5                     98.69          0.780672
10                   1                     96.80          0.649831
10                   2                     98.40          0.820558
10                   3                     98.08          0.785489
10                   4                     98.88          0.873085
10                   5                     98.17          0.796534
12                   1                     95.51          0.616922
12                   2                     98.00          0.823716
12                   3                     97.93          0.817578
12                   4                     97.51          0.781856
12                   5                     97.70          0.797733
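
The sequence patch sizes in the tables above refer to the window encoding of Sect. 2.3. A minimal Python sketch of that encoding is given here for illustration. The ordering of the 20 residue types, the zero-vector padding at chain ends and all function names are assumptions of this sketch; the paper does not specify them.

```python
# 20 standard residues in one-letter code; the column order is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """20-D basic vector; all zeros for padding or non-standard residues."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec

def encode_patch(sequence, center, patch_size):
    """Concatenate the basic vectors of the target residue and its
    `patch_size` neighbours on each side along the sequence."""
    features = []
    for pos in range(center - patch_size, center + patch_size + 1):
        # Positions beyond the chain ends are padded with a zero vector
        # (an assumption; the paper does not describe end handling).
        residue = sequence[pos] if 0 <= pos < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features

# Hypothetical sequence fragment, used only to show the vector lengths.
fragment = "EVQLVESGGGLVQ"
for size in (1, 2, 3, 4, 5):
    print(size, len(encode_patch(fragment, 6, size)))
```

With a patch size of 5 the vector has (2 × 5 + 1) × 20 = 220 dimensions, matching the 220D vector stated in Sect. 2.3.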
Table 2 The self-consistent test results with different sequence patch size when the antigen is non-protein class
Distance range (Å)   Sequence patch size   Accuracy (%)   Correlation coefficient
8                    1                     96.77          0.515651
8                    2                     99.01          0.846579
8                    3                     98.65          0.79309
8                    4                     98.65          0.796243
8                    5                     99.46          0.918223
10                   1                     96.41          0.60911
10                   2                     94.88          0.450205
10                   3                     95.51          0.521592
10                   4                     98.56          0.840512
10                   5                     98.11          0.800671
12                   1                     97.22          0.751736
12                   2                     98.65          0.878701
12                   3                     99.10          0.917999
12                   4                     98.38          0.853383
12                   5                     98.56          0.869748

Table 4 The cross validation test results with different sequence patch size and antigen class
Distance range (Å)   Sequence patch size   Antigen: protein (%)   Antigen: non-protein (%)   Antigen: protein and non-protein (%)
8                    1                     94.66                  93.72                      94.34
8                    2                     95.43                  93.72                      94.95
8                    3                     95.65                  94.25                      95.35
8                    4                     95.86                  94.52                      95.60
8                    5                     95.79                  94.52                      95.83
10                   1                     93.14                  93.72                      93.30
10                   2                     94.63                  93.72                      93.83
10                   3                     94.85                  94.25                      94.67
10                   4                     95.50                  94.52                      95.31
10                   5                     95.65                  94.52                      95.41
12                   1                     91.76                  92.01                      92.00
12                   2                     93.57                  92.10                      93.95
12                   3                     93.94                  93.36                      94.46
12                   4                     94.59                  94.08                      95.03
12                   5                     95.24                  94.70                      95.45
The accuracy in the table denotes the average accuracy of the fivefold experiment

antigen is of the protein and non-protein class, the accuracy of the cross validation increases from 94.34 to 95.83% as the sequence patch size increases from 1 to 5. When the sequence patch size is five, the accuracy reaches its best value of 95.83%.

The accuracy of the cross validation for the distance ranges 10 and 12 Å when the antigen is of the protein class, the non-protein class, and the combined class of protein and non-protein can be explained similarly, as shown in Table 4.

3.2 Identification of antigen class

The experiment of identifying the antigen class from the composition of antibody interface residues belonging to different ranges is designed in two parts. First, the interface residues of the antibody-antigen interaction that belong to different distance ranges are computed directly from the structure information. Then, we use the SVM to train and test on input features that consist of the residue composition of the antibody interface residues and the total number of interface residues. The prediction results are listed in Table 5 for the different distance ranges; they show that high accuracy is attained when the distance range is set to 8 and 12 Å.

Another test is performed by constructing a two-layer classifier. Using the predicted distance ranges of the interface residues as the starting point for predicting the antigen class, we can examine the
Table 5 The self-consistent and cross validation test results with different distance threshold
Distance range (Å)   Self-consistent test accuracy (%)   Self-consistent test correlation coefficient   Cross validation test average accuracy of fivefold experiment (%)
8                    100                                 1                                              86.49
10                   94.59                               0.669643                                       83.78
12                   100                                 1                                              86.49
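
The correlation coefficient reported in the table above follows the definition in Sect. 2.6. A minimal Python sketch of that formula is given here, together with the 4:1 imbalance example from the text. The function names and the convention of returning 0 when the denominator vanishes are assumptions of this sketch, not part of the paper.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient of Sect. 2.6 computed from confusion-matrix counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined when a row or column of the confusion matrix is empty;
        # returning 0 here is a convention assumed by this sketch.
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of records classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# 4:1 imbalance: 20 records in the first class, 80 in the second class.
always_majority = dict(tp=0, tn=80, fp=0, fn=20)   # predict the larger class only
perfect = dict(tp=20, tn=80, fp=0, fn=0)
inverted = dict(tp=0, tn=0, fp=80, fn=20)

for name, counts in [("always majority", always_majority),
                     ("perfect", perfect),
                     ("inverted", inverted)]:
    print(name, accuracy(**counts), correlation_coefficient(**counts))
```

The first case reaches 80% accuracy while its correlation coefficient is 0, which is why the coefficient is reported alongside accuracy for imbalanced data.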
antibody sequence and surface residue information, and predict the antigen class directly. This scheme is feasible because the high prediction accuracy for interface residues provides a solid basis. As evidence, the prediction accuracy listed in Table 6 is satisfactory.

Table 6 The cross validation test results with different distance range and sequence patch size combining the antibody interface residue prediction results
Sequence patch size   Distance range 8 Å (%)   Distance range 10 Å (%)   Distance range 12 Å (%)
1                     81.59                    78.17                     79.57
2                     82.12                    78.61                     81.26
3                     82.47                    79.31                     81.70
4                     82.68                    79.85                     82.19
5                     82.88                    79.93                     82.55
The accuracy listed in the table is the average accuracy of the fivefold experiment

4 Discussions

The vast quantities of existing immunological data and advanced information technology have boosted research on computational immunology. Understanding the details of the antibody-antigen interaction surface via information technology has become a new research direction in immunoreaction. Experimental data analysis using machine learning methods may help explain, and provide significant insight into, the complex phenomenon of antibody-antigen interaction. The details of the antibody-antigen interaction surface can be observed through the distance between the antibody's interface residues and the antigen. This helps us understand the position of each interface residue relative to the antigen in 3D, which is connected to the affinity of the antibody-antigen interaction.

As a pilot study of the correlation between antibody-antigen interaction residues and the antibody-antigen interaction surface, a total of 668 interface residues was extracted from the 5,253 surface residues and divided into three classes by distance range. Based on the different distance ranges, sequence patch sizes and antigen classes, 45 samples have been created for predicting the distance range between the antibody's interface residues and the antigen from the sequence feature of the antibody's surface residues, and 3 samples have been chosen for identifying the antigen class from the characteristics of the antibody's interface residues at different distance ranges. The results show that the different settings affect the prediction accuracy in complex ways.

From these experiments using the support vector machine, we can observe the following. The antigen class greatly influences the accuracy of predicting the distance range of interface residues, and high accuracy is achieved with antigen classification. Within the same distance range, increasing the sequence patch size helps increase the prediction accuracy. The larger the sequence patch size is, the closer the effects of the different distance ranges on the accuracy become. When the sequence patch size is small, the different distance ranges have a great influence on the accuracy. The results of this paper show that it is possible to infer the distance range between the antibody's interface residues and the antigen, as well as the compatible antigen class, from antibody structure data.

The results obtained in this paper can be further explored in the following respects. The manually selected data set is still small, and a balanced data set is needed. The prediction accuracy could be lower than that obtained in this paper if the data set were larger and balanced. The interface where the antibody binds the antigen differs between antigen classes, so we will focus our study on one kind of antigen. The position of every interface residue relative to the antigen in 3D is associated with the affinity of the antibody-antigen interaction. More precise identification of the distance ranges of interface residues may provide a chance to derive a better understanding of the interaction between antibody and antigen. If other known computational methods, such as neural networks, are used, we may investigate which method brings better results on the relationship between antibody structure and antibody function.

Note that only the sequence information of the antibody has been used in this paper. Following this paper, our
future interest will be in using different machine learning techniques, such as multiple criteria mathematical programming [39, 40], on large-scale structure information for a possibly higher degree of prediction accuracy.

5 Conclusions

For high affinity of the antibody-antigen interaction, it is important to know whether the two surfaces match each other. The details of the antibody-antigen interaction are represented by the distance between the antibody's interface residues and the antigen. In the past, much effort has focused on the characteristics of antibody structure, the antibody-antigen binding site, and the affinity and specificity of the antibody. This paper has proposed a machine learning approach to explore the interaction between the antibody's interface residues and the antigen. That is to say, the interaction information can be predicted from the structure characteristics. Based on

4. Bhat TN, Bentley GA, Fischmann TO, Boulot G, Poljak RJ (1990) Small rearrangements in structures of Fv and Fab fragments of antibody D1.3 on antigen binding. Nature 347:483-485
5. Colman PM, Laver WG, Varghese JN, Baker AT, Tulloch PA, Air GM, Webster RG (1987) Three-dimensional structure of a complex of antibody with influenza virus neuraminidase. Nature 326:358-363
6. Xiang J, Sha Y, Prasad L, Delbaere LTJ (1996) Complementarity determining region residues aspartic acid at H55 serine at tyrosines at H97 and L96 play important roles in the B72.3 antibody-TAG72 antigen interaction. Protein Eng 9:539-543
7. Chothia C, Lesk AM, Gherardi E, Tomlinson IM, Walter G, Marks JD, Lewelyn MB, Winter G (1992) Structural repertoire of the human Vh segments. J Mol Biol 227:799-817
8. Iba Y, Hayshi N, Sawada I, Titani K, Kurosawa Y (1998) Changes in the specificity of antibodies against steroid antigens by introduction of mutations into complementarity-determining regions of Vh domain. Protein Eng 11:361-370
9. Rees AR, Staunton D, Webster DM (1994) Antibody design: beyond the natural limits. Trends Biotechnol 12:199-207
10. Minakuchi Y, Konagaya A (2004) Prediction of protein-protein interaction sites using support vector machines. Protein Eng Des Sel 17:165-173
11. Chakrabarti P, Janin J (2002) Dissecting protein-protein