
Amino Acids (****) **: *** ***

DOI **.****/s*****-007-0616-y

Printed in The Netherlands

AAIndexLoc: predicting subcellular localization of proteins based on a new representation of sequences using amino acid indices

E. Tantoso¹ and Kuo-Bin Li²

¹ Bioinformatics Institute, Singapore

² Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei, Taiwan

Received September 28, 2007

Accepted October 4, 2007

Published online December 28, 2007; © Springer-Verlag 2007

Summary. Identifying a protein's subcellular localization is an important step toward understanding its function. However, the experimental work involved is usually laborious, time-consuming and costly. Computational prediction is therefore valuable for reducing this inefficiency. Here we provide a method to predict protein subcellular localization by using amino acid composition and physicochemical properties. The method concatenates the information extracted from a protein's N-terminal, middle and full sequence. Each part is represented by amino acid composition, weighted amino acid composition, five-level grouping composition and five-level dipeptide composition. We divided our dataset into a training set and a testing set. The training set is used to determine the best performing amino acid index by using five-fold cross validation, whereas the testing set acts as the independent dataset to evaluate the performance of our model. With the novel representation method, we achieve an accuracy of approximately 75% on the independent dataset. We conclude that this new representation indeed performs well and is able to extract the protein sequence information. We have developed a web server for predicting protein subcellular localization. The web server is available at http://aaindexloc.bii.a-star.edu.sg.

Keywords: Subcellular localization; Support vector machine; Amino acid indices

Introduction

Many novel bioinformatics applications rely on accurate prediction of a protein's subcellular localization. The filtering of putative protein-protein interactions is one example, where false predictions are removed when the two interacting partners reside in different subcellular compartments (Mahdavi and Lin, 2007). Another example is the identification of serum biomarkers. Here selecting genes and gene products that possess identical protein localization may serve as one of the criteria (Klee et al., 2006). However, experimentally identifying the localization of a protein is usually a laborious, time-consuming and costly process. Therefore, computational predictions are important to minimize the time and cost in experimental work. Many efforts have been made in this regard (Nakai and Kanehisa, 1992; Nakashima and Nishikawa, 1994; Cedano et al., 1997; Chou and Elrod, 1998, 1999a, b; Nakai and Horton, 1999; Yuan, 1999; Chou, 2000a, b, 2001; Emanuelsson et al., 2000; Murphy et al., 2000; Nakai, 2000; Feng, 2001, 2002; Feng and Zhang, 2001; Hua and Sun, 2001; Chou and Cai, 2002; Gardy et al., 2003; Pan et al., 2003; Park and Kanehisa, 2003; Zhou and Doctor, 2003; Huang and Li, 2004; Gao et al., 2005a, b; Garg et al., 2005; Lei and Dai, 2005; Matsuda et al., 2005; Xiao et al., 2005, 2006a; Chou and Shen, 2006c, 2007a; Guo et al., 2006a; Hoglund et al., 2006; Lee et al., 2006; Xiao et al., 2006a; Chou and Shen, 2007a, b, d; Shi et al., 2007; Zhang and Ding, 2007). A summary in this area was given in a recent review paper (Chou and Shen, 2007d).

The approach for predicting protein subcellular localization can be divided into two steps, i.e., the representation step and the classification step. The representation step is the most challenging part of obtaining high prediction accuracy. The step can be seen as a data mining process where, for a protein in a given localization, information embedded in the primary sequence is extracted so that a computer program can discriminate the protein from proteins in other localizations. There have been several ways to extract the information from protein sequences, such as using amino acid composition (Cedano et al., 1997), signal sequence (Nakai and Horton, 1999) or N-terminal sequence (Emanuelsson et al., 2000), n-peptide composition (Yu et al., 2004), pseudo-amino acid composition (Chou, 2001), functional domain composition (Cai et al., 2003), gene ontology (Chou and Cai, 2003), amino acid property (Feng and Zhang, 2001; Sarda et al., 2005) and homology (Bhasin and Raghava, 2004; Xie et al., 2005). Amino acid composition was originally used to represent protein samples for predicting protein structural class (Chou and Zhang, 1994, 1995; Zhou, 1998), indicating that there is some correlation between the AA composition of a protein and its attributes (Chou, 2000c, 2002). Since then, such a descriptor has been widely used to predict protein subcellular localization (see, e.g., Cedano et al., 1997; Chou and Elrod, 1999b; Hua and Sun, 2001; Zhou and Doctor, 2003; Jin et al., 2005). The AA composition does not contain any sequence order information. To avoid completely losing the sequence order information, the pseudo amino acid (PseAA) composition was introduced (Chou, 2001, 2005). Since its introduction, PseAA composition has been adopted by many investigators to improve the prediction quality of various protein attributes (Pan et al., 2003; Wang et al., 2004; Chou and Cai, 2005; Gao et al., 2005b; Liu et al., 2005a, b; Shen and Chou, 2005a, b; Du and Li, 2006; Mondal et al., 2006; Shen and Chou, 2006; Shen et al., 2006; Wang et al., 2006; Xiao et al., 2006a, b; Zhang et al., 2006a, b; Chen and Li, 2007; Ding et al., 2007; Kurgan et al., 2007; Lin and Li, 2007a, b; Mundra et al., 2007; Pu et al., 2007; Shen and Chou, 2007c; Shen et al., 2007; Shi et al., 2007; Zhang and Ding, 2007; Zhou et al., 2007). Because PseAA composition has been widely used, a web server called PseAA was recently established at http://chou.med.harvard.edu/bioinf/PseAA/, by which users can generate various kinds of PseAA compositions for a given protein sequence. ESLpred (Bhasin and Raghava, 2004) has used amino acid composition, dipeptide composition, physico-chemical properties and PSI-BLAST profiles to predict protein subcellular localization. An alternative method to extract protein localization information is to use signal sequences. TargetP (Emanuelsson et al., 2000) used the N-terminal sequence information only and was shown to be able to discriminate proteins in four locations, i.e., mitochondrion, chloroplast, secretory pathway and others. However, in the case where the signal region is located at regions other than the N-terminus, there is a risk of information loss if only the N-terminal sequence is used. As a result, Matsuda et al. (2005) introduced a representation method that uses different parts of a protein's sequence to predict its subcellular localization, i.e., N-terminus, middle and C-terminus. Recently, a novel software, MultiLoc (Hoglund et al., 2006), also incorporated amino acid composition, N-terminus sequence and sequence motifs to represent a protein. MultiLoc has been shown to be an accurate protein localization predictor.

The second step for the localization prediction problem is the classification step. Once the protein is represented with an appropriate encoding scheme, the remaining work is to propose a robust classifier to predict the subcellular localization. Many classification approaches have been proposed lately, such as neural network (Reinhardt and Hubbard, 1998), support vector machine (Hua and Sun, 2001; Chou and Cai, 2002; Park and Kanehisa, 2003; Hoglund et al., 2006), Markov chain model (Yuan, 1999), covariant discriminant algorithm (Chou and Elrod, 1998, 1999a, b), fuzzy k-NN method (Huang and Li, 2004) and FDOD function (Jin et al., 2005).

AAIndexLoc is a new representation method for prediction of protein subcellular localization. We hypothesize that the physicochemical properties of amino acids play an important role in determining a protein's function and therefore might be used to predict the protein's localization. Given the 55 amino acid indices collected by ProtScale (http://www.expasy.org/cgi-bin/protscale.pl), we attempted to determine the optimum amino acid index to characterize a specific subcellular localization. We introduced the weighted amino acid (AA) composition, five-group-AA composition and five-group dipeptide composition as the new encoding scheme to represent the protein sequences. The rationale of the weighted AA composition is that some amino acids may be more important in terms of protein translocation even if their frequency is relatively low. Therefore, the weighted AA composition provides a way to increase the contribution from the rare but critical amino acid residues. In addition to the weighted AA composition, we also categorize amino acids into five groups by using k-means clustering and calculate the group composition of a protein. Proteins in a common cellular location may share amino acids with similar physicochemical properties. Group composition is meant to extract such information. We have also considered the five-level dipeptide composition, which might detect some features about the appearance of consecutive amino acids with certain properties. On top of that, to avoid losing global information, we divided protein sequences into three parts, i.e., the N-terminus, middle and C-terminus. Information from the N-terminus, middle and full-length protein is used to create input features for a support vector machine (SVM) classifier. To test our approach, the localization data are divided into a training set and an independent testing set. The training set is used to find the best performing AA index for each individual localization by using the five-fold cross validation method; then the best performing model is used to predict the protein's localization on the independent testing set. Our results show that accurate prediction of a protein's subcellular localization can be obtained using both the local and global information of a protein sequence.

Materials and methods

Datasets

The datasets created by MultiLoc (Hoglund et al., 2006) were used in our experiments. The MultiLoc datasets are categorized into animal (nine locations), fungal (nine locations) and plant (ten locations). Table 2 shows the number of sequences in each location. Note that the MultiLoc data are not formed by three separate sets of animal, plant and fungal sequences. Instead, there is only one set of cytoplasmic sequences, containing 1411 sequences. In Table 2, for example, one may find that all three versions of the datasets, the animal, the plant and the fungal, share the 1411 cytoplasmic sequences, but only the plant version has the 449 chloroplast sequences.

Five-level grouping composition

Five-level grouping composition means that the amino acids are classified into five groups based on their amino acid index values, i.e., the highest (T), high (H), medium (M), low (L), and lowest (B) properties. After that, the composition of each group is calculated. The method used for grouping is k-means clustering (k = 5).

Let $G_m$ denote the set of amino acids in group $m$, $G_m = \{g_1, g_2, \ldots, g_{N_m}\}$, where $N_m$ is the number of amino acids in $G_m$. The composition of $G_m$ is

$$C_M(G_m) = \sum_{j=1}^{N_m} C(g_j) \qquad (3)$$

Five-level dipeptide composition

As explained in the five-level grouping method, the 20 amino acids are classified into five groups, the highest (T), high (H), medium (M), low (L) and lowest (B) groups. The five-level dipeptide composition is defined as the composition of the occurrence of two consecutive groups, for example: TT, TH, TM, TL, TB, HT, HH, etc. There are 25 combinations of two consecutive groups altogether.
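The five-level grouping and dipeptide compositions can be illustrated with a short Python sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the Kyte-Doolittle hydropathy scale stands in for whichever amino acid index performs best, the k-means initialization (evenly spaced points of the sorted index values) is arbitrary, $C(g_j)$ is taken to be the fraction of residues equal to $g_j$, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Kyte-Doolittle hydropathy values, used only as an example amino acid index.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
LEVELS = "BLMHT"  # lowest, low, medium, high, highest


def kmeans_1d(values, k=5, max_iter=100):
    """Lloyd's k-means on scalars; returns the k centers in ascending order.

    Assumption: centers start at evenly spaced positions of the sorted values.
    """
    pts = sorted(values)
    centers = [pts[round(i * (len(pts) - 1) / (k - 1))] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in pts:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def five_level_groups(index):
    """Map each amino acid to B/L/M/H/T by its nearest k-means center."""
    centers = kmeans_1d(index.values())
    return {aa: LEVELS[min(range(5), key=lambda c: abs(v - centers[c]))]
            for aa, v in index.items()}


def group_composition(seq, groups):
    """Fraction of residues per group, i.e. the sum of C(g_j) over each group."""
    counts = Counter(groups[aa] for aa in seq)
    return {lv: counts[lv] / len(seq) for lv in LEVELS}


def dipeptide_composition(seq, groups):
    """Fraction of each of the 25 ordered pairs of consecutive group labels."""
    levels = [groups[aa] for aa in seq]
    counts = Counter(a + b for a, b in zip(levels, levels[1:]))
    n_pairs = len(levels) - 1
    return {a + b: counts[a + b] / n_pairs
            for a, b in product(LEVELS, repeat=2)}


groups = five_level_groups(KD)
print(group_composition("MKRLLA", groups))
print(dipeptide_composition("MKRLLA", groups))
```

Each sequence part therefore contributes 5 group features and 25 dipeptide features for a given index. A different index from the 55 collected by ProtScale would generally produce a different grouping and hence different feature values.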

Support vector machine

Support vector machine (SVM), first introduced by Vapnik in 1995 (Vapnik, 1995), is a learning algorithm for pattern recognition and regression. SVM has recently gained a lot of attention in biology (Brown et al., 2000; Hua and Sun, 2001; Lee and Lee, 2003; Ward et al., 2003), particularly for classification purposes. In those applications, an SVM classifier is trained with a set of positively and negatively labelled samples. Once trained, the classifier can be used to classify an unlabeled sample into the positive or the negative class. In principle, SVM maps the input vector into a high dimensional space and constructs an optimal hyperplane with the maximum margin of separation between the hyperplane and the nearest data points of each class in the space.

In building the SVM classifiers for protein subcellular localization, each localization site corresponds to a class. We used the scheme called One-vs-All SVM. For example, to predict the mitochondria protein, the set

Feature vector

A protein sequence is divided into three parts, i.e., the N-terminus, the middle and the C-terminus. The feature vectors consist of the information from the N-terminus, middle and the full-length protein. We ignore the C-terminus because it does not give significant improvement to the prediction of protein localizations.

Let $L$ = length of the protein $P$, $L_N$ = length of the N-terminus, $L_M$ = length of the middle part of the protein, and $L_C$ = length of the C-terminus. In this work, the lengths of the N-terminus and C-terminus are fixed, while the length of the middle part varies depending on the length of the protein. To determine the length of the middle sequence, we have three conditions, i.e.:

1. If $L > L_N + L_C$, then $L_M = L - L_N - L_C$. Check $L_M$: if $L_M$ […] $L_N$ but $L$
