Amino Acids (****) **: *** ***
Printed in The Netherlands
AAIndexLoc: predicting subcellular localization of proteins based
on a new representation of sequences using amino acid indices
E. Tantoso¹ and Kuo-Bin Li²
¹ Bioinformatics Institute, Singapore
² Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei, Taiwan
Received September 28, 2007
Accepted October 4, 2007
Published online December 28, 2007; © Springer-Verlag 2007
Summary. Identifying a protein's subcellular localization is an important step to understand its function. However, the involved experimental work is usually laborious, time consuming and costly. Computational prediction hence becomes valuable to reduce the inefficiency. Here we provide a method to predict protein subcellular localization by using amino acid composition and physicochemical properties. The method concatenates the information extracted from a protein's N-terminal, middle and full sequence. Each part is represented by amino acid composition, weighted amino acid composition, five-level grouping composition and five-level dipeptide composition. We divided our dataset into training and testing sets. The training set is used to determine the best performing amino acid index by using five-fold cross validation, whereas the testing set acts as the independent dataset to evaluate the performance of our model. With the novel representation method, we achieve an accuracy of approximately 75% on the independent dataset. We conclude that this new representation indeed performs well and is able to extract the protein sequence information. We have developed a web server for predicting protein subcellular localization. The web server is available at http://aaindexloc.bii.a-star.edu.sg.

Keywords: Subcellular localization; Support vector machine; Amino acid indices

Introduction

Many novel bioinformatics applications rely on accurate prediction of a protein's subcellular localization. The filtering of putative protein-protein interactions is one of the examples where false predictions are to be removed provided the two interacting partners are involved in different subcellular compartments (Mahdavi and Lin, 2007). Another example is the identification of serum biomarkers. Here selecting genes and gene products that possess identical protein localization may serve as one of the criteria (Klee et al., 2006). However, experimentally identifying the localization of a protein is usually a laborious, time-consuming and costly process. Therefore, computational predictions are important to minimize the time and cost in experimental work. Many efforts have been made in this regard (Nakai and Kanehisa, 1992; Nakashima and Nishikawa, 1994; Cedano et al., 1997; Chou and Elrod, 1998; 1999a, b; Nakai and Horton, 1999; Yuan, 1999; Chou, 2000a, b, 2001; Emanuelsson et al., 2000; Murphy et al., 2000; Nakai, 2000; Feng, 2001, 2002; Feng and Zhang, 2001; Hua and Sun, 2001; Chou and Cai, 2002; Gardy et al., 2003; Pan et al., 2003; Park and Kanehisa, 2003; Zhou and Doctor, 2003; Huang and Li, 2004; Gao et al., 2005a, b; Garg et al., 2005; Lei and Dai, 2005; Matsuda et al., 2005; Xiao et al., 2005, 2006a; Chou and Shen, 2006c, 2007a; Guo et al., 2006a; Hoglund et al., 2006; Lee et al., 2006; Xiao et al., 2006a; Chou and Shen, 2007a, b, d; Shi et al., 2007; Zhang and Ding, 2007). A summary in this area was given in a recent review paper (Chou and Shen, 2007d).

The approach for predicting protein subcellular localization can be divided into two steps, i.e., the representation step and the classification step. The representation step is the most challenging part to obtain high prediction accuracy. The step can be seen as a data mining process where, for a protein in a given localization, information embedded in the primary sequence is extracted so that a computer program can discriminate the protein from proteins in other localizations. There have been several ways to extract the information from protein sequences, such as using amino acid composition (Cedano et al., 1997), signal sequence (Nakai and Horton, 1999) or N-terminal sequence (Emanuelsson et al., 2000), n-peptide composition (Yu et al., 2004), pseudo-amino acid composition
(Chou, 2001), functional domain composition (Cai et al., 2003), gene ontology (Chou and Cai, 2003), amino acid property (Feng and Zhang, 2001; Sarda et al., 2005) and homology (Bhasin and Raghava, 2004; Xie et al., 2005).

Amino acid composition was originally used to represent protein samples for predicting protein structural class (Chou and Zhang, 1994, 1995; Zhou, 1998), indicating that there is some correlation between AA composition of a protein and its attributes (Chou, 2000c, 2002). Since then, such a descriptor has been widely used to predict protein subcellular localization (see, e.g., Cedano et al., 1997; Chou and Elrod, 1999b; Hua and Sun, 2001; Zhou and Doctor, 2003; Jin et al., 2005). The AA composition does not contain any sequence order information. To avoid completely losing the sequence order information, the pseudo amino acid (PseAA) composition was introduced (Chou, 2001, 2005). Since the introduction of PseAA composition, it has been adopted to improve the prediction quality of various protein attributes by many investigators (Pan et al., 2003; Wang et al., 2004; Chou and Cai, 2005; Gao et al., 2005b; Liu et al., 2005a, b; Shen and Chou, 2005a, b; Du and Li, 2006; Mondal et al., 2006; Shen and Chou, 2006; Shen et al., 2006; Wang et al., 2006; Xiao et al., 2006a, b; Zhang et al., 2006a, b; Chen and Li, 2007; Ding et al., 2007; Kurgan et al., 2007; Lin and Li, 2007a, b; Mundra et al., 2007; Pu et al., 2007; Shen and Chou, 2007c; Shen et al., 2007; Shi et al., 2007; Zhang and Ding, 2007; Zhou et al., 2007). Because PseAA composition has been widely used, recently a web-server called PseAA was established at http://chou.med.harvard.edu/bioinf/PseAA/, by which users can generate various different kinds of PseAA compositions for a given protein sequence. ESLpred (Bhasin and Raghava, 2004) has used amino acid composition, dipeptide composition, physico-chemical properties and PSI-BLAST profiles to predict protein subcellular localization. An alternative method to extract protein localization information is to use signal sequences. TargetP (Emanuelsson et al., 2000) used the N-terminal sequence information only and was shown to be able to discriminate the protein in four locations, i.e., mitochondrion, chloroplast, secretory pathway and others. However, in the case where the signal region is located at regions other than the N-terminus, there is a risk of information loss if only the N-terminal sequence is used. As a result, Matsuda et al. (2005) introduced a representation method that uses different parts of a protein's sequence to predict its subcellular localization, i.e., N-terminus, middle and C-terminus. Recently, a novel software, MultiLoc (Hoglund et al., 2006), also incorporated amino acid composition, N-terminus sequence and sequence motifs to represent a protein. MultiLoc has been shown to be an accurate protein localization predictor.

The second step for the localization prediction problem is the classification step. Once the protein is represented with an appropriate encoding scheme, the remaining work is to propose a robust classifier to predict the subcellular localization. Many classification approaches have been proposed lately, such as neural network (Reinhardt and Hubbard, 1998), support vector machine (Hua and Sun, 2001; Chou and Cai, 2002; Park and Kanehisa, 2003; Hoglund et al., 2006), Markov chain model (Yuan, 1999), covariant discriminant algorithm (Chou and Elrod, 1998, 1999a, b), fuzzy k-NN method (Huang and Li, 2004) and FDOD function (Jin et al., 2005).

AAIndexLoc is a new representation method for prediction of protein subcellular localization. We hypothesize that the physicochemical properties of amino acids play an important role in determining a protein's function and therefore might be used to predict the protein's localization. Given the 55 amino acid indices collected by ProtScale (http://www.expasy.org/cgi-bin/protscale.pl), we attempted to determine the optimum amino acid index to characterize a specific subcellular localization. We introduced the weighted amino acid (AA) composition, five-group-AA composition and five-group dipeptide composition as the new encoding scheme to represent the protein sequences. The rationale of the weighted AA composition is that some amino acids may be more important in terms of protein translocation even if their frequency is relatively low. Therefore, the weighted AA composition provides a way to increase the contribution from the rare but critical amino acid residues. In addition to the weighted AA composition, we also categorize amino acids into five groups by using k-means clustering and calculate the group composition of a protein. Proteins in a common cellular location may share amino acids with similar physicochemical properties. Group composition is meant to extract such information. We have also considered the five-level dipeptide composition which might detect some features about the appearance of consecutive amino acids with certain properties. On top of that, to avoid losing global information, we divided protein sequences into three parts, i.e., the N-terminus, middle and C-terminus. Information from the N-terminus, middle and full-length protein is used to create input features for the support vector machine (SVM) classifier. To test our approach, the localization data are divided into the training and the independent testing set. The training set is used to find the best performing AA index for individual localization by using the five-fold cross validation method; then the best performing model is used to predict the protein's
localization on the independent testing set. Our results show that accurate prediction of a protein's subcellular localization can be obtained using both the local and global information of a protein sequence.

Materials and methods

Datasets

Datasets created by MultiLoc (Hoglund et al., 2006) were used in our experiments. The MultiLoc datasets are categorized into animal (nine locations), fungal (nine locations) and plant (ten locations). Table 2 shows the number of sequences in each location. Note that MultiLoc data are not formed by three separate sets of animal, plant and fungal sequences. Instead, there is only one set of cytoplasmic sequences containing 1411 sequences. In Table 2, for example, one may find that all three versions of the datasets, the animal, the plant and the fungal, share the 1411 cytoplasmic sequences, but only the plant version has the 449 chloroplast sequences.

Five-level grouping composition

Five-level grouping composition means that the amino acids are classified into five groups based on their amino acid index values, i.e., the highest (T), high (H), medium (M), low (L), and lowest (B) properties. After that, the composition of each group is calculated. The method used for grouping is k-means clustering ($k = 5$).
Let $G_m$ denote the set of amino acids in group $m$, $G_m = \{g_1, g_2, \ldots, g_{N_m}\}$, where $N_m$ is the number of amino acids in $G_m$. The composition of $G_m$ is

$$C_M(G_m) = \sum_{j=1}^{N_m} C(g_j) \qquad (3)$$

Five-level dipeptide composition

As explained in the five-level grouping method, the 20 amino acids are classified into five groups, the highest (T), high (H), medium (M), low (L) and lowest (B) groups. The five-level dipeptide composition is defined as the composition of the occurrence of two consecutive groups, for example: TT, TH, TM, TL, TB, HT, HH, etc. There are 25 combinations of two consecutive groups altogether.
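The five-level grouping and dipeptide compositions described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy scale stands in for whichever amino acid index is selected, the small one-dimensional k-means routine (with evenly spread starting centres) is a simplification, and the function names and the example sequence are hypothetical.

```python
from itertools import product

# Assumption: Kyte-Doolittle hydropathy is used here only as an example index;
# the paper selects the best-performing index per localization.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
LEVELS = "THMLB"  # highest, high, medium, low, lowest


def kmeans_1d(values, k, max_iter=100):
    """Plain 1-D k-means; starting centres are spread over the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        updated = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return centres


def five_level_groups(index):
    """Map each amino acid to T/H/M/L/B by clustering its index value (k = 5)."""
    k = len(LEVELS)
    centres = kmeans_1d(list(index.values()), k)
    by_value = sorted(range(k), key=lambda c: -centres[c])  # highest centre first
    rank = {c: r for r, c in enumerate(by_value)}
    groups = {}
    for aa, v in index.items():
        nearest = min(range(k), key=lambda c: abs(v - centres[c]))
        groups[aa] = LEVELS[rank[nearest]]
    return groups


def group_composition(seq, groups):
    """Composition of each group as the sum of its members' AA compositions."""
    residues = [a for a in seq if a in groups]  # skip non-standard residues
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    aa_comp = {a: residues.count(a) / len(residues) for a in groups}
    return {lvl: sum(aa_comp[a] for a in groups if groups[a] == lvl)
            for lvl in LEVELS}


def dipeptide_composition(seq, groups):
    """Frequency of each of the 25 consecutive group pairs (TT, TH, ..., BB)."""
    levels = [groups[a] for a in seq if a in groups]
    counts = {x + y: 0 for x, y in product(LEVELS, repeat=2)}
    for x, y in zip(levels, levels[1:]):
        counts[x + y] += 1
    total = max(len(levels) - 1, 1)
    return {pair: n / total for pair, n in counts.items()}


groups = five_level_groups(KD)
example = "MKTLLVAGIREDQNSW"  # hypothetical sequence
print(group_composition(example, groups))
print(dipeptide_composition(example, groups)["TT"])
```

Concatenating the five group compositions with the 25 dipeptide frequencies gives 30 features per sequence part. Ranking the cluster centres before assigning labels keeps the T to B ordering independent of the order in which k-means happens to number its clusters.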
Support vector machine

Support vector machine (SVM), first introduced by Vapnik in 1995 (Vapnik, 1995), is a learning algorithm for pattern recognition and regression. SVM has recently gained a lot of attention in biology (Brown et al., 2000; Hua and Sun, 2001; Lee and Lee, 2003; Ward et al., 2003), particularly for classification purposes. In those applications, an SVM classifier is trained with a set of positively and negatively labelled samples. Once trained, the classifier can be used to classify an unlabeled sample into the positive or the negative class. In principle, SVM maps the input vector into a high dimensional space and constructs an optimal hyperplane with the maximum margin of separation between the hyperplane and the nearest data points of each class in the space.
In building the SVM classifiers for protein subcellular localization, each localization site corresponds to a class. We used the scheme called One-vs-All SVM. For example, to predict the mitochondria protein, the set

Features vector

A protein sequence is divided into three parts, i.e., the N-terminus, the middle and the C-terminus. The feature vectors consist of the information from the N-terminus, middle and the full-length protein. We ignore the C-terminus because it does not give significant improvement to the prediction of protein localizations.
Let $L$ = length of the protein $P$, $L_N$ = length of the N-terminus, $L_M$ = length of the middle part of the protein, and $L_C$ = length of the C-terminus. In this work, the lengths of the N-terminus and C-terminus are fixed while the length of the middle part varies depending on the length of the protein. To determine the length of the middle sequence, we have three conditions, i.e.:

1. If $L > L_N + L_C$, then $L_M = L - L_N - L_C$. Check $L_M$: if LM LN but L