Li et al. BMC Bioinformatics 2010, 11:325
http://www.biomedcentral.com/1471-2105/11/325
RESEARCH ARTICLE    Open Access
Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm
Zhanchao Li, Xuan Zhou, Zong Dai and Xiaoyong Zou*
Abstract
Background: Because a priori knowledge about the function of G protein-coupled receptors (GPCRs) can provide useful information for pharmaceutical research, the determination of their function is a quite meaningful topic in protein science. However, with the rapid increase of GPCR sequences entering the databanks, the gap between the number of known sequences and the number of known functions is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally significant to develop a computational method for quick and accurate classification of GPCRs.
Results: In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) method is applied to pre-evaluate features with discriminative information, while a genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracies of the three-layer predictor at the levels of superfamily, family and subfamily are obtained by cross-validation tests on two non-redundant datasets. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred.
Conclusion: The high success rates indicate that the proposed predictor is a useful automated tool for predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCR prediction and classification, can be acquired freely on request from the authors.
* Correspondence: ******@****.****.***.**
1 School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China
Full list of author information is available at the end of the article
© 2010 Li et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
G protein-coupled receptors (GPCRs), also known as 7-helices transmembrane receptors due to their characteristic configuration of an anticlockwise bundle of 7 transmembrane helices [1], are one of the largest superfamilies of membrane proteins and play an extremely important role in transducing extracellular signals across the cell membrane via guanine-binding proteins (G-proteins) with high specificity and sensitivity [2]. GPCRs regulate many basic physicochemical processes contained in a cellular signaling network, such as smell, taste, vision, secretion, neurotransmission, metabolism, cellular differentiation and growth, and inflammatory and immune response [3-9]. For these reasons, GPCRs have been the most important and common targets for pharmacological intervention. At present, about 30% of drugs available on the market act through GPCRs. However, detailed information about the structure and function of GPCRs is deficient for structure-based drug design, because the determination of their structure and function using experimental approaches is both time-consuming and expensive.
As membrane proteins, GPCRs are very difficult to crystallize and most of them will not dissolve in normal solvents [10]. Accordingly, the 3D structures of only squid rhodopsin, the β1 and β2 adrenergic receptors and the A2A adenosine receptor have been solved to date. In contrast, the amino acid sequences of more than 1000 GPCRs are
known with the rapid accumulation of new protein sequence data produced by high-throughput sequencing technology. In view of this extremely unbalanced state, it is vitally important to develop a computational method that can quickly and accurately predict the structure and function of GPCRs from sequence information.
Actually, many predictive methods have been developed, which, in general, can be roughly divided into three categories. The first one is the proteochemometric approach developed by Lapinsh [11]. However, this method needs structural information of organic compounds. The second one is based on similarity searches using primary database search tools (e.g. BLAST, FASTA) and such database searches coupled with searches of pattern databases (PRINTS) [12]. However, they do not seem to be sufficiently successful for comprehensive functional identification of GPCRs, since GPCRs make up a highly divergent family, and even when they are grouped according to similarity of function, their sequences share strikingly little homology or similarity to each other [13]. The third one is based on statistical and machine learning methods, including support vector machines (SVM) [8,14-17], hidden Markov models (HMMs) [1,3,6,18], covariant discriminant (CD) [7,11,19,20], nearest neighbor (NN) [2,21] and other techniques [13,22-24].
Among them, SVM, which is based on statistical learning theory, has been extensively used to solve various biological problems, such as protein secondary structure [25,26], subcellular localization [27,28] and membrane protein types [29], due to its attractive features including scalability, absence of local minima and the ability to condense information contained in the training set. In SVM, an initial step transforming the protein sequence into a fixed-length feature vector is essential, because SVM cannot be directly applied to amino acid sequences of different lengths. Two feature vectors commonly used to predict GPCR functional classes are amino acid composition (AAC) and dipeptide composition (DipC) [2,7,10,16,19,20,22], where every protein is represented by 20 or 400 discrete numbers. Obviously, if one uses AAC or DipC to represent a protein, much important information associated with the sequence order will be lost. To take this information into account, the so-called pseudo amino acid composition (PseAA) was proposed [30] and has been widely applied to GPCRs and other attributes of protein studies [10,31-36]. However, the existing methods were established based only on a single feature set. And few works have tried to study the relationship between features and the functional classes of proteins [37-39], or to find the informative features which contribute most to discriminating functional types. Karchin et al [8] also indicated that the performance of SVM could be further improved by using a feature vector that possesses the most discriminative information. Therefore, feature selection should be used for accurate SVM classification.
Feature selection, also known as variable selection or attribute selection, is a technique commonly used in machine learning and has played an important role in bioinformatics studies. It can be employed along with classifier construction to avoid over-fitting, to generate a more reliable classifier and to provide more insight into the underlying causal relationships [40]. The technique has been widely applied to the fields of microarray and mass spectra (MS) analysis [41-50], which pose a great challenge for computational techniques due to their high dimensionality. However, there are still few works utilizing feature selection in GPCR prediction to obtain the most informative features or to improve the prediction accuracy.
So, a new predictor combining feature selection and support vector machine is proposed for the identification and classification of GPCRs at the three levels of superfamily, family and subfamily. At every level, minimum redundancy maximum relevance (mRMR) [51] is utilized to pre-evaluate features with discriminative information. After that, to further improve the prediction accuracy and to obtain the most important features, a genetic algorithm (GA) [52] is applied to feature selection. Finally, three models based on SVM are constructed and used to identify whether a query protein is a GPCR and which family or subfamily the protein belongs to. The prediction quality, evaluated on a non-redundant dataset by the jackknife cross-validation test, exhibited significant improvement compared with published results.

Methods
Dataset
As is well known, sequence similarity in a dataset has an important effect on the prediction accuracy, i.e. accuracy will be overestimated when using high-similarity protein sequences. Thus, in order to test the current method disinterestedly and to facilitate comparison with other existing approaches, the dataset constructed by Xiao [10] is used as the working dataset. The similarity in the dataset is less than 40%. The dataset contains 730 protein sequences that can be classified into two parts: 365 non-GPCRs and 365 GPCRs. The 365 GPCRs can be divided into 6 families: 232 rhodopsin-like, 44 metabotropic glutamate/pheromone, 39 secretin-like, 23 fungal pheromone, 17 frizzled/smoothened and 10 cAMP receptor. The rhodopsin-like family of GPCRs is further partitioned into 15 subfamilies based on GPCRDB (release 10.0) [53], including 46 amine, 72 peptide, 2 hormone, 17 rhodopsin, 19 olfactory, 7 prostanoid, 13 nucleotide, 2 cannabinoid, 1 platelet activating factor, 2 gonadotropin-releasing hormone, 3 thyrotropin-releasing hormone & secretagogue, 2 melatonin, 9 viral, 4 lysosphingolipid, 2 leukotriene B4 receptor and 31 orphan. Those subfamilies in which the number of proteins is lower than 10 are combined into one class, because they contain too few sequences to have any statistical significance. So, 6 classes (46 amine, 72 peptide, 17 rhodopsin, 19 olfactory, 13 nucleotide and 34 other) are obtained at the subfamily level.

Protein representation
In order to fully characterize the protein primary structure, 10 feature vectors are employed to represent the protein sample, including AAC, DipC, normalized Moreau-Broto autocorrelation (NMBAuto), Moran autocorrelation (MAuto), Geary autocorrelation (GAuto), composition (C), transition (T), distribution (D) [54], and composition and distribution of hydrophobicity patterns (CHP, DHP). Here 8 and 7 amino acid properties extracted from the AAIndex database [55] are selected to compute the autocorrelation and the C, T and D features, respectively. The properties and definitions of amino acids attributed to each group are shown in Additional files 1 and 2.
According to the theory of Lim [56], the 6 kinds of hydrophobicity patterns include: (i, i+2), (i, i+2, i+4), (i, i+3), (i, i+1, i+4), (i, i+3, i+4) and (i, i+5). The patterns (i, i+2) and (i, i+2, i+4) often appear in β-sheets while the patterns (i, i+3), (i, i+1, i+4) and (i, i+3, i+4) occur more often in α-helices. The pattern (i, i+5) is an extension of the concept of the "helical wheel" or amphipathic α-helix [57]. Seven kinds of amino acids, including Cys (C), Phe (F), Ile (I), Leu (L), Met (M), Val (V) and Trp (W), may occur in the 6 patterns based on the observation of Rose et al [58]. Because transmembrane regions of membrane proteins are usually composed of β-sheet and α-helix, CHP and DHP are used to represent the protein sequence. For the pattern (i, i+2), the CHP is computed by Eq. (1):

CHP(i, i+2) = N(i, i+2) / (L - 2),  i = 1, 2, ..., L - 2    (1)

Where N(i, i+2) is the number of patterns whose positions i and i+2 simultaneously belong to any of the 7 kinds of amino acids, and L is the sequence length. The other CHP values are calculated by using the rule mentioned above. The DHP of pattern (i, i+2), which describes the distribution of the pattern in the protein sequence, can be calculated according to Eq. (2):

DHP(i, i+2) = S(i, i+2) / (L - 2),  i = 1, 2, ..., L - 2    (2)

Where S(i, i+2) is a feature vector composed of 5 values that are the positions in the whole sequence of the first pattern (i, i+2), the 25% pattern (i, i+2), the 50% pattern (i, i+2), the 75% pattern (i, i+2) and the 100% pattern (i, i+2), respectively. How to calculate these values is explained below by using a random sequence with 40 amino acids, as shown in Figure 1, which consists of 10 patterns (i, i+2). The 10 patterns (i, i+2) included are CAL (1), IQF (2), FKM (3), MDV (4), CTF (5), FYL (6), CFM (7), FMI (8), IRI (9) and CAW (10). The first pattern (i, i+2) is at pattern position 1 (CAL). The pattern (i, i+2) of (10 × 25% = 2.5 ≈ 3) is at pattern position 3 (FKM). The pattern (i, i+2) of (10 × 50% = 5) is at pattern position 5 (CTF). The pattern (i, i+2) of (10 × 75% = 7.5 ≈ 8) is at pattern position 8 (FMI). The pattern (i, i+2) of (10 × 100% = 10) is at pattern position 10 (CAW). The first letters of these 5 patterns (i, i+2) are C, F, C, F and C, corresponding to residue positions 1, 10, 17, 28 and 36 in the sequence, respectively. Thus, S(i, i+2) = [1 10 17 28 36].
Figure 1 Random sequence consisting of 40 residues as an example to illustrate derivation of feature vector.
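The CHP and DHP descriptors for the (i, i+2) pattern described above can be sketched as follows. This is a minimal illustration of Eqs. (1)-(2), not the authors' program; the short test sequence is an invented toy, and the ceiling-based choice of the 25%/50%/75% occurrence positions is an assumption inferred from the worked example (10 × 25% = 2.5 ≈ 3, etc.).

```python
import math

# The 7 amino acids (Rose et al.) that may occur in hydrophobicity patterns.
HYDRO = set("CFILMVW")

def chp_dhp_i2(seq):
    """CHP and DHP of the (i, i+2) pattern, per Eqs. (1)-(2).

    A pattern occurs at (1-based) position i when residues i and i+2
    are both in HYDRO. DHP uses the sequence positions of the first,
    25%, 50%, 75% and 100% occurrences, each divided by L - 2.
    """
    L = len(seq)
    hits = [i + 1 for i in range(L - 2)
            if seq[i] in HYDRO and seq[i + 2] in HYDRO]
    n = len(hits)
    chp = n / (L - 2)                                   # Eq. (1)
    if n == 0:
        return chp, []
    # first occurrence, then the ceil(n * q)-th occurrence for q = 25%..100%
    picks = [1] + [math.ceil(n * q) for q in (0.25, 0.50, 0.75, 1.00)]
    dhp = [hits[k - 1] / (L - 2) for k in picks]        # Eq. (2)
    return chp, dhp
```

For a 10-residue toy sequence "CACACAAAAA", the (i, i+2) pattern occurs at positions 1 and 3, giving CHP = 2/8 = 0.25; for the Figure 1 example (n = 10) the picks reduce to patterns 1, 3, 5, 8 and 10, matching the text.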
Similarly, the DHP for patterns other than (i, i+2) is also calculated by using this rule.

The optimized feature subset selection
SVM is one of the most powerful machine learning methods, but it cannot perform automatic feature selection. To overcome this limitation, various feature selection methods have been introduced [59,60]. Feature selection methods are typically divided into two categories: filter and wrapper methods. Although filter methods are computationally simple and easily scale to high-dimensional datasets, they ignore the interaction between the selected features and the classifier. In contrast, wrapper approaches include this interaction and can also take into account the correlation between features, but they have a higher risk of over-fitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost [61]. Considering the characteristics of the two methods, the mRMR, belonging to the filter methods, is used to preselect a feature subset, and then GA, belonging to the wrapper methods, is utilized to obtain the optimized feature subset.
Minimum redundancy maximum relevance (mRMR)
The mRMR method tries to select a feature subset in which each feature has the maximal relevance with the target class and the minimal redundancy with the other features. The feature subset can be obtained by calculating the mutual information between the features themselves and between the features and the class variables. In the current study, a feature is a vector containing the 10 types of descriptor values of proteins (AAC, DipC, NMBAuto, MAuto, GAuto, C, T, D, CHP and DHP). For the binary classification problem, the classification variable l_k = 1 or 2. The mutual information MI(x, y) between two features x and y is computed by Eq. (3):

MI(x, y) = Σ_{i,j ∈ N} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]    (3)

Where p(x_i, y_j) is the joint probability density, and p(x_i) and p(y_j) are the marginal probability densities.
Similarly, the mutual information MI(x, l) between the classification variable l and a feature x is also calculated by Eq. (4):

MI(x, l) = Σ_{i,k ∈ N} p(x_i, l_k) log [ p(x_i, l_k) / (p(x_i) p(l_k)) ]    (4)

The minimum redundancy condition is to minimize the total redundancy of all selected features, Eq. (5):

min(mR),  mR = (1 / |S|²) Σ_{x,y ∈ S} MI(x, y)    (5)

Where S denotes the feature subset, and |S| is the number of features in S.
The maximum relevance condition is to maximize the total relevance between all features in S and the classification variable. The condition can be obtained by Eq. (6):

max(MR),  MR = (1 / |S|) Σ_{x ∈ S} MI(x, l)    (6)

To achieve the feature subset, the two conditions should be optimized simultaneously according to Eq. (7):

max(MI),  MI = MR - mR    (7)

If continuous features exist in the feature set, the features must be discretized by using "mean ± standard deviation/2" as the boundaries of 3 states. The value of a feature larger than "mean + standard deviation/2" is transformed to state 1; the value of a feature between "mean - standard deviation/2" and "mean + standard deviation/2" is transformed to state 2; the value of a feature smaller than "mean - standard deviation/2" is transformed to state 3. In this case, computing mutual information is straightforward, because both joint and marginal probability tables can be estimated by tallying the samples of categorical variables in the data [51]. More explanation about the calculation of probability can be found in Additional file 3. A detailed depiction of the mRMR method can be found in reference [51], and the mRMR program can be obtained from http://penglab.janelia.org/proj/mRMR/index.htm
Genetic algorithm (GA)
GA can effectively search the interesting space and easily solve complex problems without requiring a priori knowledge about the space and the problem. These advantages of GA make it possible to simultaneously optimize the feature subset and the SVM parameters. The chromosome representation, fitness function, and selection, crossover and mutation operators of the GA are described in the following sections.
Chromosome representation
The chromosome is composed of decimal and binary coding systems, where the binary genes are applied to the selection of features and the decimal genes are utilized for the optimization of the SVM parameters.
Fitness function
In this study, two objectives must be simultaneously considered when designing the fitness function. One is to maximize the classification accuracy, and the other is to minimize the number of selected features. The performances of these two objectives can be evaluated by Eq. (8):

fitness = SVM_accuracy + (1 - n/N)    (8)

Where SVM_accuracy is the SVM classification accuracy, n is the number of selected features, and N is the number of overall features.
Selection, crossover and mutation operators
An elitist strategy, which guarantees that the chromosome with the highest fitness value is always replicated into the next generation, is used for the selection operation. Once a pair of chromosomes is selected for crossover, five randomly selected positions are assigned to the crossover operator of the binary coding part. The crossover operator is determined according to Eqs. (9) and (10) for the decimal coding part, where p is a random number in (0, 1).

child1 = p*parent1 + (1-p)*parent2    (9)

child2 = p*parent2 + (1-p)*parent1    (10)

The method based on chaos [62] is applied to the mutation operator of the decimal coding. Mutation of the binary coding part is the same as in traditional GA. The population size of the GA is 30, and the termination condition is that the generation number reaches 10000. A detailed depiction of the GA can be found in our previous works [63].

Model construction and assessment of performance
For the present SVM, the publicly available LIBSVM software [64] is used to construct the classifier with the radial basis function as the kernel. The ten-fold cross-validation test is used to examine a predictor for its effectiveness. In the 10-fold cross-validation, the dataset is divided randomly into 10 equally sized subsets. The training and testing are carried out 10 times, each time using one distinct subset for testing and the remaining 9 subsets for training.
Classifying GPCRs at the superfamily level can be formulated as a binary classification problem, namely each protein can be classified as either GPCR or non-GPCR. So, the performance of the classifier is measured in terms of sensitivity (Sen), specificity (Spe), accuracy (Acc) and Matthew's correlation coefficient (MCC) [65], given by Eqs. (11)-(14).

Sen = TP / (TP + FN)    (11)

Spe = TN / (TN + FP)    (12)

Acc = (TP + TN) / (TP + TN + FP + FN)    (13)

MCC = (TP × TN - FP × FN) / sqrt[(TP + FN)(TP + FP)(TN + FN)(TN + FP)]    (14)

Here, TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
The classification of GPCRs into families and subfamilies is a multi-class classification problem, namely a given protein can be classified into a specific family or subfamily. The simple solution is to reduce the multi-classification to a series of binary classifications. We adopted the one-versus-one strategy to transfer it into a series of two-class problems. The overall accuracy (Q) and the accuracy (Qi) for each family or subfamily, calculated for assessment of the prediction system, are given by Eqs. (15)-(16).

Q = [ Σ_{i=1}^{k} p(i) ] / N    (15)

Qi = p(i) / obs(i)    (16)

Where N is the total number of sequences, obs(i) is the number of sequences observed in class i, and p(i) is the number of correctly predicted sequences of class i.
The whole procedure for recognizing GPCRs from protein sequences and further classifying GPCRs into family and subfamily is illustrated in Figure 2, and the steps are as follows:
Step 1. Produce various feature vectors that represent a query protein sequence.
Step 2. Preselect a feature subset by running mRMR. Select an optimized feature subset from the preselected subset by GA and SVM. Predict whether the query protein belongs to the GPCRs or not. If the protein is classified into non-GPCRs, stop the process and output the results; otherwise, go to the next step.
Step 3. Preselect again a feature subset and further select an optimized feature subset. Predict which family the protein belongs to. If the protein is classified as non-Rhodopsin-like, stop the process with the output of results; otherwise, go to the next step.
Figure 2 Flowchart of the current method.
Step 4. Preselect a feature subset again and select an optimized feature subset. Predict which subfamily the protein belongs to.

Results and discussion
Identification of a GPCR from non-GPCRs
At the first step of feature selection, only 600 different feature subsets are selected based on mRMR due to our limited computational power, and the feature subsets contain 1, 2, 3, ..., 598, 599 and 600 features, respectively. The performance of the various feature subsets for discriminating between GPCRs and other proteins is investigated based on a grid search for the maximal 10-fold cross-validation accuracy, with γ ranging among 2^-5, 2^-4, ..., 2^15 and C ranging among 2^-15, 2^-14, ..., 2^5 (γ and C are the SVM parameters that need to be optimized), and the results are shown in Figure 3. The accuracy for a single feature is 85.89%. The accuracy dramatically increased when the number of features increased from 2 to 150, and achieved the highest value (98.22%) when the feature subset consisted of 543 features. However, the accuracy did not change dramatically when the number of features increased to 600.
Although the highest accuracy can be obtained by using the feature subset with 543 features, so many features impede the discovery of the physicochemical properties that affect the prediction of GPCRs. So, we further perform GA on the preselected feature subset that consists of 600 features. Figure 4 and Figure 5 illustrate the convergence processes of GA feature subset selection. Initially, approximately 275 features are selected by GA and a predictive accuracy of about 94.93% is achieved based on the 10-fold cross-validation test. Along with the implementation of GA, the number of selected features gradually decreased while the fitness improved. Finally, good fitness, high classification accuracy (98.77% based on the 10-fold cross-validation test) and an optimized feature subset (containing only 38 features) can be obtained after about 6600 generations. Consequently, the optimal classifier at the superfamily level is constructed with the optimal feature subset.
The results for the optimized feature subset are shown in Figure 6. The optimized feature subset contains 38 features, including 1 feature of cysteine composition; 7 features of DipC based on Phe-Phe, Gly-Glu, His-Asp, Ile-Glu, Asn-Ala, Asn-Met and Ser-Glu; 1 feature of C based on polarity grouping; 2 features of T based on hydrophobicity and buried volume grouping; 7 features of D based on charge, hydrophobicity, Van der Waals volume, polarizability and solvent accessibility grouping; 5 features of NMBAuto based on hydrophobicity, flexibility, residue accessible surface area and relative mutability; 11 features of MAuto based on hydrophobicity, flexibility, residue volume, steric parameter and relative mutability; 2 features of GAuto based on hydrophobicity and free energy; and 2 features of DHP based on pattern (i, i+3, i+4). The results suggest that the order of these feature groups in their contribution to discriminating GPCRs from non-GPCRs is: MAuto > DipC and D > NMBAuto > T, GAuto and DHP > AAC and C.
Figure 3 The relationship between the accuracy and the number of features.
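The mRMR pre-selection used in these experiments scores features by relevance minus redundancy (Eqs. 3-7 of the Methods). The sketch below is a toy stand-in for the mRMR program cited there, showing the three-state discretization and the tally-based mutual information on hypothetical data; it is not the authors' implementation.

```python
import math
from collections import Counter

def discretize(values):
    """Three-state discretization with mean +/- std/2 as boundaries."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    lo, hi = m - sd / 2, m + sd / 2
    return [1 if v > hi else 3 if v < lo else 2 for v in values]

def mutual_info(xs, ys):
    """Mutual information of two discrete variables, tallied from samples (Eq. 3)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr_score(features, labels):
    """MR - mR (Eq. 7): mean relevance (Eq. 6) minus mean pairwise redundancy (Eq. 5)."""
    feats = [discretize(f) for f in features]
    mr = sum(mutual_info(f, labels) for f in feats) / len(feats)
    red = sum(mutual_info(f, g) for f in feats for g in feats) / len(feats) ** 2
    return mr - red
```

Note that Eq. (5) sums over all pairs in S, including a feature with itself, so even a single perfectly relevant feature is charged its self-redundancy; the design choice follows the formulas as printed.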
Figure 4 The relationship between the number of features and the number of generations.
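The shrinking feature count tracked in Figure 4 is driven by the fitness of Eq. (8), with new decimal (SVM-parameter) genes produced by the crossover of Eqs. (9)-(10). A minimal sketch of these two operators follows; the SVM accuracy is assumed to be computed elsewhere, and the chaos-based mutation of [62] is omitted.

```python
import random

def fitness(svm_accuracy, n_selected, n_total):
    """Eq. (8): reward accuracy and penalize large feature subsets."""
    return svm_accuracy + (1 - n_selected / n_total)

def decimal_crossover(parent1, parent2, rng=random):
    """Eqs. (9)-(10): blend the decimal genes with a random p in (0, 1)."""
    p = rng.random()
    child1 = [p * a + (1 - p) * b for a, b in zip(parent1, parent2)]
    child2 = [p * b + (1 - p) * a for a, b in zip(parent1, parent2)]
    return child1, child2
```

For any p, the two children preserve the per-gene sum of the parents, so the crossover explores the segment between the two parameter vectors rather than jumping outside it.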
Figure 5 Fitness values and overall accuracy based on the most fitted member of each generation.
Recognition of GPCR family
Following the same steps described above, the quality of the various feature subsets is investigated at the family level based on grid search and 10-fold cross-validation. The relationship between the number of features and the overall accuracy is shown in Figure 3. A significant increase in overall accuracy can be observed when the number of features increased from 1 to 301, and the highest overall accuracy of 96.99% can be achieved.
We also further perform GA on the preselected feature subset with 600 features to acquire an optimized feature subset. The processes of optimization are displayed in Figure 4 and Figure 5. It can be observed that the number of features dramatically decreased from 250 to 57 when
the number of generations increased from 1 to 2300, and the best fitness and the highest overall accuracy of 99.73% can be achieved. So, the optimal classifier with 57 features is used at the family level.
The results for the optimized feature subset are also shown in Figure 6. The optimized feature subset contains 2 AAC, 14 DipC, 8 D, 7 NMBAuto, 21 MAuto, 2 GAuto and 3 DHP features. The results reveal that the order of these feature groups in their contribution to classifying GPCRs into the 6 families is: MAuto > DipC > D > NMBAuto > DHP > AAC and GAuto.
Figure 6 Composition of the optimized features subset.

Classification of GPCR subfamily
Because knowledge of GPCR subfamilies can provide useful information to pharmaceutical companies and biologists, the identification of subfamilies is a quite meaningful topic in assigning a function to GPCRs. Therefore, we constructed a classifier at the subfamily level to predict the subfamilies belonging to the rhodopsin-like family. The rhodopsin-like family is considered because it covers more than 80% of the sequences in the GPCRDB database [53], and the numbers in the other families in the current dataset are too few to have any statistical significance. Similarly, we also study the quality of the various feature subsets from mRMR based on grid search and 10-fold cross-validation. The correlation between the number of features and the overall accuracy is also illustrated in Figure 3. The overall accuracy was enhanced when the number of features increased from 1 to 300, and the highest overall accuracy of 87.56% can be obtained by using the feature subset with 418 features.
In order to get an optimized feature subset, GA is further applied to feature selection from the preselected feature subset with 600 features. The processes of convergence are shown in Figure 4 and Figure 5. The number of features in the optimized feature subset significantly decreased from 278 to 115 when the number of generations increased from 1 to 1400, and the corresponding fitness value significantly increased. Subsequently, the number of features and the fitness value remained invariable, which clearly shows a premature convergence. However, the number of features decreased from 113 to 92 when the number of generations increased from 1800 to 3100, indicating that the GA has the ability to escape from local optima. The final optimized feature subset with 91 features can be obtained within 3200 generations. Therefore, we developed a classifier with the features from the optimized feature subset for classifying the subfamilies of the rhodopsin-like family.
The composition of the optimized feature subset is shown in Figure 6. The optimized feature subset contains 3 AAC, 17 DipC, 3 C, 6 D, 18 NMBAuto, 31 MAuto, 6 GAuto, 2 CHP and 5 DHP features. The results suggest that the order of these feature groups in their contribution to predicting the subfamilies belonging to the rhodopsin-like family is: MAuto > NMBAuto > DipC > D and GAuto > DHP > AAC and C > CHP.

Comparison with GPCR-CA
To facilitate a comparison with the GPCR-CA method developed by Xiao [10], we perform the jackknife cross-validation test based on the current predictor. GPCR-CA is a two-layer classifier that is used to classify at the levels of superfamily and family, respectively, and each protein is characterized by PseAA, which is based on "cellular automation" and gray-level co-occurrence matrix factors. In the jackknife test, each protein in the dataset is in turn singled out as an independent test sample and all the rule parameters are calculated without using this protein.
The results of the jackknife test obtained with the proposed method, in comparison with GPCR-CA, are listed in Table 1 and Figure 7. The performances of the proposed predictor (GPCR-SVMFS) in predicting the subfamilies are summarized in Table 2.
It can be seen from Table 1 that the accuracy, sensitivity, specificity and MCC of GPCR-SVMFS are 97.81%, 97.04%, 98.61% and 0.9563, respectively, which is a 4.7% to 7.6% improvement over the GPCR-CA method [10]. The results indicate that GPCR-SVMFS can identify GPCRs from non-GPCRs with high accuracy using the optimized feature subset as the sequence feature.
As can be seen from Figure 7, the overall accuracy of GPCR-SVMFS is 99.18%, which is almost 15% higher than that of GPCR-CA. Furthermore, the accuracies for the fungal pheromone, cAMP and frizzled/smoothened families are dramatically improved. The accuracy of GPCR-SVMFS for the fungal pheromone family is 100%, approximately 93% higher than the accuracy of GPCR-CA. The accuracies for cAMP and frizzled/smoothened are 100% and 94.12% based on GPCR-SVMFS, approximately 40% and 47% higher than the accuracies of GPCR-CA, respectively. In addition, as for the secretin and metabotropic glutamate/pheromone families, the predictive accuracies are 97.44% and 97.73% by GPCR-SVMFS, approximately 23% and 16% higher than those of GPCR-CA, indicating GPCR-SVMFS is effective and helpful for the prediction of GPCRs at the family level.
As shown in Table 2, the accuracies for amine, peptide, rhodopsin, olfactory and other are 93.48%, 98.61%, 88.24% and 94.12%, respectively. Meanwhile, we also notice that the accuracy for nucleotide is lower than those for amine, peptide, rhodopsin and olfactory, which may be caused by the smaller number of protein samples contained in the nucleotide class. Although the accuracy for nucleotide is only 76.92%, the overall accuracy is 94.53% for identifying sub-
lies: 692 class A-rhodopsin and adrenergic, 56 class B-calcitonin and parathyroid hormone, 16 class C-metabotropic, 11 class D-pheromone and 3 class E-cAMP. Class A at the subfamily level is composed of 14 major classes, and the sequences are from the work of Karchin [8].
The success rates are listed in Tables 3, 4 and 5, and the results of GPCR-SVMFS are compared with those of GPCRPred for the same dataset. From Table 3 we can see that the accuracy of GPCR-SVMFS is 0.5% higher than that of GPCRPred based on DipC at the superfamily level. As can be seen from Table 4, the accuracies for class A, class B and class C are 100%, which is almost 2%, 15% and 19% higher than those of GPCRPred, respectively. Especially for class D, the predictive accuracy is improved to 81.82% by GPCR-SVMFS, which is almost 45% higher than that of GPCRPred. As can be seen in Table 5, the accuracies for the nucleotide, viral and lysosphingolipid classes are improved to 93.75%, 76.47% and 100.0%, about 8%, 43% and 42% higher than GPCRPred. Although the accuracy for cannabis decreased from 100% to 90.91%, the overall accuracy is improved from 97.30% to 98.77%. All the results show that GPCR-SVMFS is superior to GPCRPred, which may be explained by the fact that the optimized feature subset contains more information than DipC alone and can therefore enhance predictive performance significantly.

Predictive power of GPCR-SVMFS
In order to test the performance of GPCR-SVMFS in identifying orphan GPCRs, a dataset (we call it "deorphan") containing 274 orphan proteins was collected from the GPCRDB database (released in 2006). We further verified the 274 orphan proteins by searching their accession numbers in the latest version of GPCRDB (released in 2009). The results indicated that 8 proteins, 19 proteins and 2 proteins belong to amine, peptide and nucleotide, respectively. Finally, a dataset of 29 proteins is constructed (the dataset can be obtained from Additional file 4).
GPCR-SVMFS is able to accurately identify 13 peptides from the 19 proteins, and the 2 nucleotides are completely recognized. However, none of the 8 amines is correctly identified. So, the overall success rate is 19/29 = 51.72%. The
result is higher than that of completely randomized pre-
familiy, indicating the current method can yield quite
diction, because the rate of correct identification by ran-
reliable results at subfamily level.
domly assignment is 1/6 = 16.67% if the protein samples
Comparison with GPCRPred are completely randomly distributed among the 6 possi-
Furthermore, in order to roundly evaluate our method we ble subfamilies (i.e. amine, peptide, rhodopsin, olfactory,
also performed it on another dataset used in GPCRPred nucleotide and other). The results imply that GPCR-
[14], which is a three-layer classifier based on SVM. In SVMFS is indeed powerful to identify orphan GPCRs.
the classifier, DipC is used for characterizing GPCRs at In addition, the prediction power of GPCR-SVMFS is
the levels of superfamily, family and subfamily. The data- also evaluated at family level and subfamily level by using
set obtained from GPCRPred contains 778 GPCRs and 99 8 independent dataset, which are collected based on the
non-GPCRs. The 778 GPCRs can be divided into 5 fami- GPCRDB (released on 2009). Three of the 8 dataset at
Li et al. BMC Bioinformatics 2010, 11:325 Page 11 of 15
http://www.biomedcentral.com/1471-2105/11/325
family level are rhodopsin-like, metabotropic and secretin-like, which contain 20290, 1194 and 1484 proteins, respectively. The other 5 datasets, at the subfamily level, are amine, peptide, rhodopsin, olfactory and nucleotide, and are composed of 1840, 4169, 1376, 9977 and 576 proteins, respectively (the 8 datasets are given in Additional files 5, 6, 7, 8, 9, 10, 11 and 12).

Table 1: Comparison of different methods by the jackknife test at superfamily level

Method          Acc     Sen     Spe     MCC
GPCR-CA [10]    91.46   92.33   90.96   N/A
GPCR-SVMFS      97.81   97.04   98.61   0.9563

Table 2: Success rates obtained with the GPCR-SVMFS predictor by the jackknife test at subfamily level

GPCR subfamily   Number of proteins   Number of correct predictions   Qi/Q
Amine            46                   43                              93.48
Peptide          72                   71                              98.61
Rhodopsin        17                   15                              88.24
Olfactory        19                   19                              100.0
Nucleotide       13                   10                              76.92
Other            34                   32                              94.12
Overall          201                  190                             94.53
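The jackknife protocol reported in Tables 1 and 2 (each protein held out in turn, with the model rebuilt from the remaining proteins) is equivalent to leave-one-out cross-validation. The following is a minimal illustrative sketch using scikit-learn; the random feature matrix is a placeholder, not the paper's actual mRMR/GA-optimized features:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

# Toy stand-in for the real feature matrix: 40 "proteins" x 10 features,
# two classes (e.g. GPCR vs. non-GPCR). Real features would be the
# optimized subset selected by mRMR and the genetic algorithm.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 10)),
               rng.normal(1.5, 1.0, (20, 10))])
y = np.array([0] * 20 + [1] * 20)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # The classifier is retrained from scratch without the held-out protein,
    # so the test sample never influences the decision rule.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

accuracy = correct / len(y)
print(f"jackknife accuracy: {accuracy:.2%}")
```

Because every sample serves exactly once as the test case, the jackknife estimate is deterministic for a fixed dataset, which is why it is commonly preferred for benchmarking small protein datasets.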
The results at the family level are shown in Table 6. The proposed method achieves accuracies of 96.16% for rhodopsin-like, 85.76% for metabotropic and 68.53% for secretin-like, and an overall accuracy of 93.81%. These results indicate that the performance of GPCR-SVMFS at the family level is satisfactory.

The results for the 5 subfamilies are listed in Table 7. The prediction accuracies for rhodopsin, amine and peptide reach 87.79%, 80.22% and 74.12%, respectively. For the largest subfamily (olfactory), which contains 9977 proteins, the accuracy reaches the highest value, 90.96%. Although the accuracy for nucleotide is only 54.69%, the overall prediction accuracy for classifying subfamilies reaches 84.54%, indicating that the GPCR-SVMFS method can yield good results at the subfamily level.

Conclusion

With the rapid growth of protein sequence data, it is indispensable to develop an automated and reliable method for the classification of GPCRs. In this paper, a three-layer classifier for GPCRs is proposed by coupling SVM with a feature selection method. Compared with existing methods, the proposed method provides better predictive performance and high accuracies for the superfamily, family and subfamily of GPCRs in the jackknife cross-validation test, indicating that the investigation of optimized feature subsets is quite promising and might also hold potential as a useful technique for predicting other attributes of proteins.
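The three-layer design described above can be pictured as a cascade: a sequence is only passed to the family-level model if the superfamily level accepts it as a GPCR, and only to the subfamily model trained for its predicted family. The skeleton below is an illustrative sketch, not the authors' implementation; the toy 2-D features and labels are hypothetical:

```python
from sklearn.svm import SVC

class ThreeLayerPredictor:
    """Hierarchical predictor: superfamily -> family -> subfamily."""

    def __init__(self, superfamily_clf, family_clf, subfamily_models):
        self.superfamily_clf = superfamily_clf    # GPCR vs. non-GPCR
        self.family_clf = family_clf              # e.g. class A..E
        self.subfamily_models = subfamily_models  # one model per family

    def predict(self, x):
        # Layer 1: reject non-GPCRs immediately.
        if self.superfamily_clf.predict([x])[0] == 0:
            return ("non-GPCR", None, None)
        # Layer 2: assign the family.
        family = self.family_clf.predict([x])[0]
        # Layer 3: subfamily, via the model trained for that family only.
        sub = self.subfamily_models.get(family)
        subfamily = sub.predict([x])[0] if sub is not None else None
        return ("GPCR", family, subfamily)

# Toy demo with synthetic 2-D "features" (hypothetical, for illustration).
X1, y1 = [[0, 0], [5, 5], [0, 1], [5, 4]], [0, 1, 0, 1]
X2, y2 = [[5, 5], [9, 9], [5, 4], [9, 8]], ["A", "B", "A", "B"]
X3, y3 = [[5, 5], [5, 9], [5, 4], [5, 8]], ["amine", "peptide", "amine", "peptide"]
predictor = ThreeLayerPredictor(
    SVC().fit(X1, y1),
    SVC().fit(X2, y2),
    {"A": SVC().fit(X3, y3)},
)
print(predictor.predict([5, 5]))
```

One consequence of the cascade design is that errors propagate downward: a sequence misassigned at the family level is handed to the wrong subfamily model, so subfamily accuracy is bounded by family accuracy.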
Figure 7: Comparison of different methods by the jackknife test at family level.
Table 3: The performance of GPCR-SVMFS and GPCRPred at superfamily level

Method          Acc     Sen     Spe     MCC
GPCRPred [14]   99.50   98.60   99.80   0.9900
GPCR-SVMFSa     100.0   100.0   100.0   1.0000

a To be consistent with the evaluation method of GPCRPred, 5-fold cross-validation is utilized.
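The Acc, Sen, Spe and MCC columns in Tables 1 and 3 follow the standard binary-classification definitions over the four confusion-matrix counts. A small helper using the textbook formulas (not taken from the authors' code):

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation
    coefficient from true/false positive and negative counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)            # true positive rate
    spe = tn / (tn + fp)            # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spe, mcc

# A perfect split, as in Table 3's GPCR-SVMFS row, gives
# Acc = Sen = Spe = 1.0 and MCC = 1.0.
print(binary_metrics(99, 99, 0, 0))  # -> (1.0, 1.0, 1.0, 1.0)
```

Unlike plain accuracy, MCC stays informative when the positive and negative classes are of very different sizes (here 778 GPCRs vs. 99 non-GPCRs), which is why it is reported alongside Acc.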
Table 4: The performance of GPCR-SVMFS and GPCRPred at family level

Method (Qi/Q)   Class A   Class B   Class C   Class D   Class E   Overall
GPCRPred [14]   98.10     85.70     81.30     36.40     100.0     97.30
GPCR-SVMFSa     100.0     100.0     100.0     81.82     100.0     99.74

a To be consistent with the evaluation method of GPCRPred, 2-fold cross-validation is utilized.
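The Qi/Q notation used in Tables 2, 4, 5 and 6 denotes the per-class success rate (correct predictions within class i divided by the size of class i) and the overall rate Q (all correct predictions divided by all proteins). A minimal sketch, with illustrative label names:

```python
from collections import Counter

def success_rates(true_labels, pred_labels):
    """Per-class rate Qi = correct_i / n_i and overall rate Q."""
    totals, correct = Counter(true_labels), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            correct[t] += 1
    qi = {c: correct[c] / n for c, n in totals.items()}
    q = sum(correct.values()) / len(true_labels)
    return qi, q

true = ["A", "A", "B", "B", "B", "C"]
pred = ["A", "B", "B", "B", "B", "C"]
qi, q = success_rates(true, pred)
print(qi, q)  # Qi: A=0.5, B=1.0, C=1.0; overall Q = 5/6
```

Note that Q is weighted by class size, so a large, well-predicted class (such as olfactory in Table 7) can keep the overall rate high even when a small class such as nucleotide scores poorly.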
Table 5: The performance of GPCR-SVMFS and GPCRPred at subfamily level

Class A subfamilies               Number of proteins   Qi/Q GPCRPred [14]   Qi/Q GPCR-SVMFSa
Amine                             221                  99.10                100.0
Peptide                           381                  99.70                99.21
Hormone                           25                   100.0                100.0
Rhodopsin                         183                  98.90                99.45
Olfactory                         87                   100.0                100.0
Prostanoid                        38                   100.0                100.0
Nucleotide                        48                   85.40                93.75
Cannabis                          11                   100.0                90.91
Platelet activating factor        4                    100.0                100.0
Gonadotrophin releasing hormone   10                   100.0                100.0
Thyrotropin releasing hormone     7                    85.70                85.71
Melatonin                         13                   100.0                100.0
Viral                             17                   33.30                76.47
Lysospingolipids                  9                    58.80                100.0
Overall                           1054                 97.30                98.77

a To be consistent with the evaluation method of GPCRPred, 2-fold cross-validation is utilized.
Table 6: The prediction power of GPCR-SVMFS on independent datasets at family level

GPCR family      Number of proteins   Number of correct predictions   Qi/Q
Rhodopsin-like   20290