Li et al. BMC Bioinformatics 2010, 11:325
http://www.biomedcentral.com/1471-2105/11/325
RESEARCH ARTICLE    Open Access
Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm
Zhanchao Li, Xuan Zhou, Zong Dai and Xiaoyong Zou*
Abstract
Background: Because a priori knowledge about the function of G protein-coupled receptors (GPCRs) can provide useful information for pharmaceutical research, the determination of their function is a quite meaningful topic in protein science. However, with the rapid increase of GPCR sequences entering the databanks, the gap between the number of known sequences and the number of known functions is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally significant to develop a computational method for quick and accurate classification of GPCRs.
Results: In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) method is applied to pre-evaluate features with discriminative information, while a genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracies of the three-layer predictor at the levels of superfamily, family and subfamily are obtained by cross-validation tests on two non-redundant datasets. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred.
Conclusion: The high success rates indicate that the proposed predictor is a useful automated tool for predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCR prediction and classification, can be acquired freely on request from the authors.
* Correspondence: ******@****.****.***.**
1 School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China
Full list of author information is available at the end of the article
© 2010 Li et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
G protein-coupled receptors (GPCRs), also known as 7-helices transmembrane receptors due to their characteristic configuration of an anticlockwise bundle of 7 transmembrane helices [1], are one of the largest superfamilies of membrane proteins and play an extremely important role in transducing extracellular signals across the cell membrane via guanine-binding proteins (G-proteins) with high specificity and sensitivity [2]. GPCRs regulate many basic physicochemical processes contained in a cellular signaling network, such as smell, taste, vision, secretion, neurotransmission, metabolism, cellular differentiation and growth, and inflammatory and immune response [3-9]. For these reasons, GPCRs have been the most important and common targets for pharmacological intervention. At present, about 30% of drugs available on the market act through GPCRs. However, detailed information about the structure and function of GPCRs is deficient for structure-based drug design, because the determination of their structure and function using experimental approaches is both time-consuming and expensive.
As membrane proteins, GPCRs are very difficult to crystallize and most of them will not dissolve in normal solvents [10]. Accordingly, the 3D structures of only squid rhodopsin, the β1 and β2 adrenergic receptors and the A2A adenosine receptor have been solved to date. In contrast, the amino acid sequences of more than 1000 GPCRs are
known with the rapid accumulation of new protein sequence data produced by high-throughput sequencing technology. In view of this extremely unbalanced state, it is vitally important to develop a computational method that can quickly and accurately predict the structure and function of GPCRs from sequence information.
Actually, many predictive methods have been developed, which, in general, can be roughly divided into three categories. The first one is the proteochemometric approach developed by Lapinsh [11]. However, this method needs structural information of organic compounds. The second one is based on similarity searches using primary database search tools (e.g. BLAST, FASTA) and such database searches coupled with searches of pattern databases (PRINTS) [12]. However, they do not seem to be sufficiently successful for comprehensive functional identification of GPCRs, since GPCRs make up a highly divergent family, and even when they are grouped according to similarity of function, their sequences share strikingly little homology or similarity to each other [13]. The third one is based on statistical and machine learning methods, including support vector machines (SVM) [8,14-17], hidden Markov models (HMMs) [1,3,6,18], covariant discriminant (CD) [7,11,19,20], nearest neighbor (NN) [2,21] and other techniques [13,22-24].
Among them, SVM, which is based on statistical learning theory, has been extensively used to solve various biological problems, such as protein secondary structure [25,26], subcellular localization [27,28] and membrane protein types [29], due to its attractive features including scalability, absence of local minima and the ability to condense information contained in the training set. In SVM, an initial step transforming the protein sequence into a fixed-length feature vector is essential, because SVM cannot be directly applied to amino acid sequences of different lengths. Two feature vectors commonly used to predict GPCR functional classes are amino acid composition (AAC) and dipeptide composition (DipC) [2,7,10,16,19,20,22], where every protein is represented by 20 or 400 discrete numbers. Obviously, if one uses AAC or DipC to represent a protein, much important information associated with the sequence order will be lost. To take this information into account, the so-called pseudo amino acid composition (PseAA) was proposed [30] and has been widely applied to GPCRs and other attributes of protein studies [10,31-36]. However, the existing methods were established based only on a single feature set. And few works have tried to study the relationship between features and the functional classes of proteins [37-39], or to find the informative features which contribute most to discriminating functional types. Karchin et al [8] also indicated that the performance of SVM could be further improved by using a feature vector that possesses the most discriminative information. Therefore, feature selection should be used for accurate SVM classification.
Feature selection, also known as variable selection or attribute selection, is a technique commonly used in machine learning and has played an important role in bioinformatics studies. It can be employed along with classifier construction to avoid over-fitting, to generate a more reliable classifier and to provide more insight into the underlying causal relationships [40]. The technique has been widely applied to the fields of microarray and mass spectra (MS) analysis [41-50], which pose a great challenge for computational techniques due to their high dimensionality. However, there are still few works utilizing feature selection in GPCR prediction to obtain the most informative features or to improve the prediction accuracy.
So, a new predictor combining feature selection and support vector machine is proposed for the identification and classification of GPCRs at the three levels of superfamily, family and subfamily. At every level, minimum redundancy maximum relevance (mRMR) [51] is utilized to pre-evaluate features with discriminative information. After that, to further improve the prediction accuracy and to obtain the most important features, a genetic algorithm (GA) [52] is applied to feature selection. Finally, three models based on SVM are constructed and used to identify whether a query protein is a GPCR and which family or subfamily the protein belongs to. The prediction quality, evaluated on a non-redundant dataset by the jackknife cross-validation test, exhibited significant improvement compared with published results.

Methods
Dataset
As is well known, sequence similarity in a dataset has an important effect on the prediction accuracy, i.e. accuracy will be overestimated when using high-similarity protein sequences. Thus, in order to test the current method disinterestedly and to facilitate comparison with other existing approaches, the dataset constructed by Xiao [10] is used as the working dataset. The similarity in the dataset is less than 40%. The dataset contains 730 protein sequences that can be classified into two parts: 365 non-GPCRs and 365 GPCRs. The 365 GPCRs can be divided into 6 families: 232 rhodopsin-like, 44 metabotropic glutamate/pheromone, 39 secretin-like, 23 fungal pheromone, 17 frizzled/smoothened and 10 cAMP receptor. The rhodopsin-like family of GPCRs is further partitioned into 15 subfamilies based on GPCRDB (release 10.0) [53], including 46 amine, 72 peptide, 2 hormone, 17 rhodopsin, 19 olfactory, 7 prostanoid, 13 nucleotide, 2 cannabinoid, 1 platelet activating factor, 2 gonadotropin-releasing hormone, 3 thyrotropin-releasing hormone & secretagogue, 2 melatonin, 9 viral, 4 lysosphingolipid, 2 leukotriene B4 receptor and 31 orphan. Those subfamilies in which the number of proteins is lower than 10 are combined into one class, because they contain too few sequences to have any statistical significance. So, 6 classes (46 amine, 72 peptide, 17 rhodopsin, 19 olfactory, 13 nucleotide and 34 other) are obtained at the subfamily level.

Protein representation
In order to fully characterize the protein primary structure, 10 feature vectors are employed to represent the protein sample, including AAC, DipC, normalized Moreau-Broto autocorrelation (NMBAuto), Moran autocorrelation (MAuto), Geary autocorrelation (GAuto), composition (C), transition (T), distribution (D) [54], and composition and distribution of hydrophobicity patterns (CHP, DHP). Here 8 and 7 amino acid properties extracted from the AAIndex database [55] are selected to compute the autocorrelation and the C, T and D features, respectively. The properties and definitions of amino acids attributed to each group are shown in Additional files 1 and 2.
According to the theory of Lim [56], the 6 kinds of hydrophobicity patterns include: (i, i+2), (i, i+2, i+4), (i, i+3), (i, i+1, i+4), (i, i+3, i+4) and (i, i+5). The patterns (i, i+2) and (i, i+2, i+4) often appear in β-sheets while the patterns (i, i+3), (i, i+1, i+4) and (i, i+3, i+4) occur more often in α-helices. The pattern (i, i+5) is an extension of the concept of the "helical wheel" or amphipathic α-helix [57]. Seven kinds of amino acids, including Cys (C), Phe (F), Ile (I), Leu (L), Met (M), Val (V) and Trp (W), may occur in the 6 patterns based on the observation of Rose et al [58]. Because transmembrane regions of membrane proteins are usually composed of β-sheet and α-helix, CHP and DHP are used to represent the protein sequence. For the pattern (i, i+2), the CHP is computed by Eq. (1):

CHP(i, i+2) = N(i, i+2) / (L - 2),  i = 1, 2, ..., L - 2    (1)

Where N(i, i+2) is the number of patterns whose positions i and i+2 simultaneously belong to any of the 7 kinds of amino acids, and L is the sequence length. The other CHP values are calculated by using the rule mentioned above. The DHP of pattern (i, i+2), which describes the distribution of the pattern in the protein sequence, can be calculated according to Eq. (2):

DHP(i, i+2) = S(i, i+2) / (L - 2),  i = 1, 2, ..., L - 2    (2)

Where S(i, i+2) is a feature vector composed of 5 values that are the positions in the whole sequence of the first pattern (i, i+2), the 25% pattern (i, i+2), the 50% pattern (i, i+2), the 75% pattern (i, i+2) and the 100% pattern (i, i+2), respectively. How to calculate these values is explained below by using a random sequence with 40 amino acids, as shown in Figure 1, which consists of 10 patterns (i, i+2). The 10 patterns (i, i+2) included are CAL (1), IQF (2), FKM (3), MDV (4), CTF (5), FYL (6), CFM (7), FMI (8), IRI (9) and CAW (10). The first pattern (i, i+2) is at pattern position 1 (CAL). The pattern (i, i+2) of (10 × 25% = 2.5 ≈ 3) is at pattern position 3 (FKM). The pattern (i, i+2) of (10 × 50% = 5) is at pattern position 5 (CTF). The pattern (i, i+2) of (10 × 75% = 7.5 ≈ 8) is at pattern position 8 (FMI). The pattern (i, i+2) of (10 × 100% = 10) is at pattern position 10 (CAW). The first letters of these 5 patterns (i, i+2) are C, F, C, F and C, corresponding to residue positions 1, 10, 17, 28 and 36 in the sequence, respectively. Thus, S(i, i+2) = [1 10 17 28 36].
Figure 1 Random sequence consisting of 40 residues as an example to illustrate derivation of feature vector.
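The CHP and DHP descriptors for the (i, i+2) pattern described above can be sketched as follows. This is a minimal illustration of Eqs. (1)-(2), not the authors' program; the short test sequence is an invented toy, and the ceiling-based choice of the 25%/50%/75% occurrence positions is an assumption inferred from the worked example (10 × 25% = 2.5 ≈ 3, etc.).

```python
import math

# The 7 amino acids (Rose et al.) that may occur in hydrophobicity patterns.
HYDRO = set("CFILMVW")

def chp_dhp_i2(seq):
    """CHP and DHP of the (i, i+2) pattern, per Eqs. (1)-(2).

    A pattern occurs at (1-based) position i when residues i and i+2
    are both in HYDRO. DHP uses the sequence positions of the first,
    25%, 50%, 75% and 100% occurrences, each divided by L - 2.
    """
    L = len(seq)
    hits = [i + 1 for i in range(L - 2)
            if seq[i] in HYDRO and seq[i + 2] in HYDRO]
    n = len(hits)
    chp = n / (L - 2)                                   # Eq. (1)
    if n == 0:
        return chp, []
    # first occurrence, then the ceil(n * q)-th occurrence for q = 25%..100%
    picks = [1] + [math.ceil(n * q) for q in (0.25, 0.50, 0.75, 1.00)]
    dhp = [hits[k - 1] / (L - 2) for k in picks]        # Eq. (2)
    return chp, dhp
```

For a 10-residue toy sequence "CACACAAAAA", the (i, i+2) pattern occurs at positions 1 and 3, giving CHP = 2/8 = 0.25; for the Figure 1 example (n = 10) the picks reduce to patterns 1, 3, 5, 8 and 10, matching the text.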
Similarly, the DHP for patterns other than (i, i+2) is also calculated by using this rule.

The optimized feature subset selection
SVM is one of the most powerful machine learning methods, but it cannot perform automatic feature selection. To overcome this limitation, various feature selection methods have been introduced [59,60]. Feature selection methods are typically divided into two categories: filter and wrapper methods. Although filter methods are computationally simple and easily scale to high-dimensional datasets, they ignore the interaction between the selected features and the classifier. In contrast, wrapper approaches include this interaction and can also take into account the correlation between features, but they have a higher risk of over-fitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost [61]. Considering the characteristics of the two methods, the mRMR, belonging to the filter methods, is used to preselect a feature subset, and then GA, belonging to the wrapper methods, is utilized to obtain the optimized feature subset.
Minimum redundancy maximum relevance (mRMR)
The mRMR method tries to select a feature subset in which each feature has the maximal relevance with the target class and the minimal redundancy with the other features. The feature subset can be obtained by calculating the mutual information between the features themselves and between the features and the class variables. In the current study, a feature is a vector containing the 10 types of descriptor values of proteins (AAC, DipC, NMBAuto, MAuto, GAuto, C, T, D, CHP and DHP). For the binary classification problem, the classification variable l_k = 1 or 2. The mutual information MI(x, y) between two features x and y is computed by Eq. (3):

MI(x, y) = Σ_{i,j ∈ N} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]    (3)

Where p(x_i, y_j) is the joint probability density, and p(x_i) and p(y_j) are the marginal probability densities.
Similarly, the mutual information MI(x, l) between the classification variable l and a feature x is also calculated by Eq. (4):

MI(x, l) = Σ_{i,k ∈ N} p(x_i, l_k) log [ p(x_i, l_k) / (p(x_i) p(l_k)) ]    (4)

The minimum redundancy condition is to minimize the total redundancy of all selected features, Eq. (5):

min(mR),  mR = (1 / |S|²) Σ_{x,y ∈ S} MI(x, y)    (5)

Where S denotes the feature subset, and |S| is the number of features in S.
The maximum relevance condition is to maximize the total relevance between all features in S and the classification variable. The condition can be obtained by Eq. (6):

max(MR),  MR = (1 / |S|) Σ_{x ∈ S} MI(x, l)    (6)

To achieve the feature subset, the two conditions should be optimized simultaneously according to Eq. (7):

max(MI),  MI = MR - mR    (7)

If continuous features exist in the feature set, the features must be discretized by using "mean ± standard deviation/2" as the boundaries of 3 states. The value of a feature larger than "mean + standard deviation/2" is transformed to state 1; the value of a feature between "mean - standard deviation/2" and "mean + standard deviation/2" is transformed to state 2; the value of a feature smaller than "mean - standard deviation/2" is transformed to state 3. In this case, computing mutual information is straightforward, because both joint and marginal probability tables can be estimated by tallying the samples of categorical variables in the data [51]. More explanation about the calculation of probability can be found in Additional file 3. A detailed depiction of the mRMR method can be found in reference [51], and the mRMR program can be obtained from http://penglab.janelia.org/proj/mRMR/index.htm
Genetic algorithm (GA)
GA can effectively search the interesting space and easily solve complex problems without requiring a priori knowledge about the space and the problem. These advantages of GA make it possible to simultaneously optimize the feature subset and the SVM parameters. The chromosome representation, fitness function, and selection, crossover and mutation operators of the GA are described in the following sections.
Chromosome representation
The chromosome is composed of decimal and binary coding systems, where the binary genes are applied to the selection of features and the decimal genes are utilized for the optimization of the SVM parameters.
Fitness function
In this study, two objectives must be simultaneously considered when designing the fitness function. One is to maximize the classification accuracy, and the other is to minimize the number of selected features. The performances of these two objectives can be evaluated by Eq. (8):

fitness = SVM_accuracy + (1 - n/N)    (8)

Where SVM_accuracy is the SVM classification accuracy, n is the number of selected features, and N is the number of overall features.
Selection, crossover and mutation operators
An elitist strategy, which guarantees that the chromosome with the highest fitness value is always replicated into the next generation, is used for the selection operation. Once a pair of chromosomes is selected for crossover, five randomly selected positions are assigned to the crossover operator of the binary coding part. The crossover operator is determined according to Eqs. (9) and (10) for the decimal coding part, where p is a random number in (0, 1).

child1 = p*parent1 + (1-p)*parent2    (9)

child2 = p*parent2 + (1-p)*parent1    (10)

The method based on chaos [62] is applied to the mutation operator of the decimal coding. Mutation of the binary coding part is the same as in traditional GA. The population size of the GA is 30, and the termination condition is that the generation number reaches 10000. A detailed depiction of the GA can be found in our previous works [63].

Model construction and assessment of performance
For the present SVM, the publicly available LIBSVM software [64] is used to construct the classifier with the radial basis function as the kernel. The ten-fold cross-validation test is used to examine a predictor for its effectiveness. In the 10-fold cross-validation, the dataset is divided randomly into 10 equally sized subsets. The training and testing are carried out 10 times, each time using one distinct subset for testing and the remaining 9 subsets for training.
Classifying GPCRs at the superfamily level can be formulated as a binary classification problem, namely each protein can be classified as either GPCR or non-GPCR. So, the performance of the classifier is measured in terms of sensitivity (Sen), specificity (Spe), accuracy (Acc) and Matthew's correlation coefficient (MCC) [65], given by Eqs. (11)-(14).

Sen = TP / (TP + FN)    (11)

Spe = TN / (TN + FP)    (12)

Acc = (TP + TN) / (TP + TN + FP + FN)    (13)

MCC = (TP × TN - FP × FN) / sqrt[(TP + FN)(TP + FP)(TN + FN)(TN + FP)]    (14)

Here, TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
The classification of GPCRs into families and subfamilies is a multi-class classification problem, namely a given protein can be classified into a specific family or subfamily. The simple solution is to reduce the multi-classification to a series of binary classifications. We adopted the one-versus-one strategy to transfer it into a series of two-class problems. The overall accuracy (Q) and the accuracy (Qi) for each family or subfamily, calculated for assessment of the prediction system, are given by Eqs. (15)-(16).

Q = [ Σ_{i=1}^{k} p(i) ] / N    (15)

Qi = p(i) / obs(i)    (16)

Where N is the total number of sequences, obs(i) is the number of sequences observed in class i, and p(i) is the number of correctly predicted sequences of class i.
The whole procedure for recognizing GPCRs from protein sequences and further classifying GPCRs into family and subfamily is illustrated in Figure 2, and the steps are as follows:
Step 1. Produce various feature vectors that represent a query protein sequence.
Step 2. Preselect a feature subset by running mRMR. Select an optimized feature subset from the preselected subset by GA and SVM. Predict whether the query protein belongs to the GPCRs or not. If the protein is classified into non-GPCRs, stop the process and output the results; otherwise, go to the next step.
Step 3. Preselect again a feature subset and further select an optimized feature subset. Predict which family the protein belongs to. If the protein is classified as non-Rhodopsin-like, stop the process with the output of results; otherwise, go to the next step.
Figure 2 Flowchart of the current method.
Step 4. Preselect a feature subset again and select an optimized feature subset. Predict which subfamily the protein belongs to.

Results and discussion
Identification of a GPCR from non-GPCRs
At the first step of feature selection, only 600 different feature subsets are selected based on mRMR due to our limited computational power, and the feature subsets contain 1, 2, 3, ..., 598, 599 and 600 features, respectively. The performance of the various feature subsets for discriminating between GPCRs and other proteins is investigated based on a grid search for the maximal 10-fold cross-validation accuracy, with γ ranging among 2^-5, 2^-4, ..., 2^15 and C ranging among 2^-15, 2^-14, ..., 2^5 (γ and C are the SVM parameters that need to be optimized), and the results are shown in Figure 3. The accuracy for a single feature is 85.89%. The accuracy dramatically increased when the number of features increased from 2 to 150, and achieved the highest value (98.22%) when the feature subset consisted of 543 features. However, the accuracy did not change dramatically when the number of features increased to 600.
Although the highest accuracy can be obtained by using the feature subset with 543 features, so many features impede the discovery of the physicochemical properties that affect the prediction of GPCRs. So, we further perform GA on the preselected feature subset that consists of 600 features. Figure 4 and Figure 5 illustrate the convergence processes of GA feature subset selection. Initially, approximately 275 features are selected by GA and a predictive accuracy of about 94.93% is achieved based on the 10-fold cross-validation test. Along with the implementation of GA, the number of selected features gradually decreased while the fitness improved. Finally, good fitness, high classification accuracy (98.77% based on the 10-fold cross-validation test) and an optimized feature subset (containing only 38 features) can be obtained after about 6600 generations. Consequently, the optimal classifier at the superfamily level is constructed with the optimal feature subset.
The results for the optimized feature subset are shown in Figure 6. The optimized feature subset contains 38 features, including 1 feature of cysteine composition; 7 features of DipC based on Phe-Phe, Gly-Glu, His-Asp, Ile-Glu, Asn-Ala, Asn-Met and Ser-Glu; 1 feature of C based on polarity grouping; 2 features of T based on hydrophobicity and buried volume grouping; 7 features of D based on charge, hydrophobicity, Van der Waals volume, polarizability and solvent accessibility grouping; 5 features of NMBAuto based on hydrophobicity, flexibility, residue accessible surface area and relative mutability; 11 features of MAuto based on hydrophobicity, flexibility, residue volume, steric parameter and relative mutability; 2 features of GAuto based on hydrophobicity and free energy; and 2 features of DHP based on pattern (i, i+3, i+4). The results suggest that the order of these feature groups in their contribution to discriminating GPCRs from non-GPCRs is: MAuto > DipC and D > NMBAuto > T, GAuto and DHP > AAC and C.
Figure 3 The relationship between the accuracy and the number of features.
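The mRMR pre-selection used in these experiments scores features by relevance minus redundancy (Eqs. 3-7 of the Methods). The sketch below is a toy stand-in for the mRMR program cited there, showing the three-state discretization and the tally-based mutual information on hypothetical data; it is not the authors' implementation.

```python
import math
from collections import Counter

def discretize(values):
    """Three-state discretization with mean +/- std/2 as boundaries."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    lo, hi = m - sd / 2, m + sd / 2
    return [1 if v > hi else 3 if v < lo else 2 for v in values]

def mutual_info(xs, ys):
    """Mutual information of two discrete variables, tallied from samples (Eq. 3)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr_score(features, labels):
    """MR - mR (Eq. 7): mean relevance (Eq. 6) minus mean pairwise redundancy (Eq. 5)."""
    feats = [discretize(f) for f in features]
    mr = sum(mutual_info(f, labels) for f in feats) / len(feats)
    red = sum(mutual_info(f, g) for f in feats for g in feats) / len(feats) ** 2
    return mr - red
```

Note that Eq. (5) sums over all pairs in S, including a feature with itself, so even a single perfectly relevant feature is charged its self-redundancy; the design choice follows the formulas as printed.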
Figure 4 The relationship between the number of features and the number of generations.
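The shrinking feature count tracked in Figure 4 is driven by the fitness of Eq. (8), with new decimal (SVM-parameter) genes produced by the crossover of Eqs. (9)-(10). A minimal sketch of these two operators follows; the SVM accuracy is assumed to be computed elsewhere, and the chaos-based mutation of [62] is omitted.

```python
import random

def fitness(svm_accuracy, n_selected, n_total):
    """Eq. (8): reward accuracy and penalize large feature subsets."""
    return svm_accuracy + (1 - n_selected / n_total)

def decimal_crossover(parent1, parent2, rng=random):
    """Eqs. (9)-(10): blend the decimal genes with a random p in (0, 1)."""
    p = rng.random()
    child1 = [p * a + (1 - p) * b for a, b in zip(parent1, parent2)]
    child2 = [p * b + (1 - p) * a for a, b in zip(parent1, parent2)]
    return child1, child2
```

For any p, the two children preserve the per-gene sum of the parents, so the crossover explores the segment between the two parameter vectors rather than jumping outside it.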
Figure 5 Fitness values and overall accuracy based on the most fitted member of each generation.
Recognition of GPCR family
Following the same steps described above, the quality of the various feature subsets is investigated at the family level based on grid search and 10-fold cross-validation. The relationship between the number of features and the overall accuracy is shown in Figure 3. A significant increase in overall accuracy can be observed when the number of features increased from 1 to 301, and the highest overall accuracy of 96.99% can be achieved.
We also further perform GA on the preselected feature subset with 600 features to acquire an optimized feature subset. The processes of optimization are displayed in Figure 4 and Figure 5. It can be observed that the number of features dramatically decreased from 250 to 57 when
the number of generations increased from 1 to 2300, and the best fitness and the highest overall accuracy of 99.73% can be achieved. So, the optimal classifier with 57 features is used at the family level.
The results for the optimized feature subset are also shown in Figure 6. The optimized feature subset contains 2 AAC, 14 DipC, 8 D, 7 NMBAuto, 21 MAuto, 2 GAuto and 3 DHP features. The results reveal that the order of these feature groups in their contribution to classifying GPCRs into the 6 families is: MAuto > DipC > D > NMBAuto > DHP > AAC and GAuto.
Figure 6 Composition of the optimized features subset.

Classification of GPCR subfamily
Because knowledge of GPCR subfamilies can provide useful information to pharmaceutical companies and biologists, the identification of subfamilies is a quite meaningful topic in assigning a function to GPCRs. Therefore, we constructed a classifier at the subfamily level to predict the subfamilies belonging to the rhodopsin-like family. The rhodopsin-like family is considered because it covers more than 80% of the sequences in the GPCRDB database [53], and the numbers in the other families in the current dataset are too few to have any statistical significance. Similarly, we also study the quality of the various feature subsets from mRMR based on grid search and 10-fold cross-validation. The correlation between the number of features and the overall accuracy is also illustrated in Figure 3. The overall accuracy was enhanced when the number of features increased from 1 to 300, and the highest overall accuracy of 87.56% can be obtained by using the feature subset with 418 features.
In order to get an optimized feature subset, GA is further applied to feature selection from the preselected feature subset with 600 features. The processes of convergence are shown in Figure 4 and Figure 5. The number of features in the optimized feature subset significantly decreased from 278 to 115 when the number of generations increased from 1 to 1400, and the corresponding fitness value significantly increased. Subsequently, the number of features and the fitness value remained invariable, which clearly shows a premature convergence. However, the number of features decreased from 113 to 92 when the number of generations increased from 1800 to 3100, indicating that the GA has the ability to escape from local optima. The final optimized feature subset with 91 features can be obtained within 3200 generations. Therefore, we developed a classifier with the features from the optimized feature subset for classifying the subfamilies of the rhodopsin-like family.
The composition of the optimized feature subset is shown in Figure 6. The optimized feature subset contains 3 AAC, 17 DipC, 3 C, 6 D, 18 NMBAuto, 31 MAuto, 6 GAuto, 2 CHP and 5 DHP features. The results suggest that the order of these feature groups in their contribution to predicting the subfamilies belonging to the rhodopsin-like family is: MAuto > NMBAuto > DipC > D and GAuto > DHP > AAC and C > CHP.

Comparison with GPCR-CA
To facilitate a comparison with the GPCR-CA method developed by Xiao [10], we perform the jackknife cross-validation test based on the current predictor. GPCR-CA is a two-layer classifier that is used to classify at the levels of superfamily and family, respectively, and each protein is characterized by PseAA, which is based on "cellular automation" and gray-level co-occurrence matrix factors. In the jackknife test, each protein in the dataset is in turn singled out as an independent test sample and all the rule parameters are calculated without using this protein.
The results of the jackknife test obtained with the proposed method, in comparison with GPCR-CA, are listed in Table 1 and Figure 7. The performances of the proposed predictor (GPCR-SVMFS) in predicting the subfamilies are summarized in Table 2.
It can be seen from Table 1 that the accuracy, sensitivity, specificity and MCC of GPCR-SVMFS are 97.81%, 97.04%, 98.61% and 0.9563, respectively, which is a 4.7% to 7.6% improvement over the GPCR-CA method [10]. The results indicate that GPCR-SVMFS can identify GPCRs from non-GPCRs with high accuracy using the optimized feature subset as the sequence feature.
As can be seen from Figure 7, the overall accuracy of GPCR-SVMFS is 99.18%, which is almost 15% higher than that of GPCR-CA. Furthermore, the accuracies for the fungal pheromone, cAMP and frizzled/smoothened families are dramatically improved. The accuracy of GPCR-SVMFS for the fungal pheromone family is 100%, approximately 93% higher than the accuracy of GPCR-CA. The accuracies for cAMP and frizzled/smoothened are 100% and 94.12% based on GPCR-SVMFS, approximately 40% and 47% higher than the accuracies of GPCR-CA, respectively. In addition, as for the secretin and metabotropic glutamate/pheromone families, the predictive accuracies are 97.44% and 97.73% by GPCR-SVMFS, approximately 23% and 16% higher than those of GPCR-CA, indicating GPCR-SVMFS is effective and helpful for the prediction of GPCRs at the family level.
As shown in Table 2, the accuracies for amine, peptide, rhodopsin, olfactory and other are 93.48%, 98.61%, 88.24% and 94.12%, respectively. Meanwhile, we also notice that the accuracy for nucleotide is lower than those for amine, peptide, rhodopsin and olfactory, which may be caused by the smaller number of protein samples contained in the nucleotide class. Although the accuracy for nucleotide is only 76.92%, the overall accuracy is 94.53% for identifying sub-
lies: 692 class A-rhodopsin and adrenergic, 56 class B-calcitonin and parathyroid hormone, 16 class C-metabotropic, 11 class D-pheromone and 3 class E-cAMP. Class A at the subfamily level is composed of 14 major classes, and the sequences are from the work of Karchin [8].
The success rates are listed in Tables 3, 4 and 5, and the results of GPCR-SVMFS are compared with those of GPCRPred for the same dataset. From Table 3 we can see that the accuracy of GPCR-SVMFS is 0.5% higher than that of GPCRPred based on DipC at the superfamily level. As can be seen from Table 4, the accuracies for class A, class B and class C are 100%, which is almost 2%, 15% and 19% higher than those of GPCRPred, respectively. Especially for class D, the predictive accuracy is improved to 81.82% by GPCR-SVMFS, which is almost 45% higher than that of GPCRPred. As can be seen in Table 5, the accuracies for the nucleotide, viral and lysosphingolipid classes are improved to 93.75%, 76.47% and 100.0%, about 8%, 43% and 42% higher than GPCRPred. Although the accuracy for cannabis decreased from 100% to 90.91%, the overall accuracy is improved from 97.30% to 98.77%. All the results show that GPCR-SVMFS is superior to GPCRPred, which may be explained by the fact that the optimized feature subset contains more information than DipC alone and can therefore enhance predictive performance significantly.

Predictive power of GPCR-SVMFS
In order to test the performance of GPCR-SVMFS in identifying orphan GPCRs, a dataset (we call it "deorphan") containing 274 orphan proteins was collected from the GPCRDB database (released in 2006). We further verified the 274 orphan proteins by searching their accession numbers in the latest version of GPCRDB (released in 2009). The results indicated that 8 proteins, 19 proteins and 2 proteins belong to amine, peptide and nucleotide, respectively. Finally, a dataset of 29 proteins is constructed (the dataset can be obtained from Additional file 4).
GPCR-SVMFS is able to accurately identify 13 peptides from the 19 proteins, and the 2 nucleotides are completely recognized. However, none of the 8 amines is correctly identified. So, the overall success rate is 19/29 = 51.72%. The
result is higher than that of completely randomized pre-
familiy, indicating the current method can yield quite
diction, because the rate of correct identification by ran-
reliable results at subfamily level.
domly assignment is 1/6 = 16.67% if the protein samples
Comparison with GPCRPred are completely randomly distributed among the 6 possi-
Furthermore, in order to roundly evaluate our method we ble subfamilies (i.e. amine, peptide, rhodopsin, olfactory,
also performed it on another dataset used in GPCRPred nucleotide and other). The results imply that GPCR-
[14], which is a three-layer classifier based on SVM. In SVMFS is indeed powerful to identify orphan GPCRs.
the classifier, DipC is used for characterizing GPCRs at In addition, the prediction power of GPCR-SVMFS is
the levels of superfamily, family and subfamily. The data- also evaluated at family level and subfamily level by using
set obtained from GPCRPred contains 778 GPCRs and 99 8 independent dataset, which are collected based on the
non-GPCRs. The 778 GPCRs can be divided into 5 fami- GPCRDB (released on 2009). Three of the 8 dataset at
Li et al. BMC Bioinformatics 2010, 11:325 Page 11 of 15
http://www.biomedcentral.com/1471-2105/11/325
family level are rhodopsin-like, metabotropic and secretin-like, which contain 20290, 1194 and 1484 proteins, respectively. The other 5 datasets, at the subfamily level, are amine, peptide, rhodopsin, olfactory and nucleotide, and are composed of 1840, 4169, 1376, 9977 and 576 proteins, respectively (the 8 datasets are given in Additional files 5, 6, 7, 8, 9, 10, 11 and 12).

Table 1: Comparison of different methods by the jackknife test at superfamily level

Method          Acc     Sen     Spe     MCC
GPCR-CA [10]    91.46   92.33   90.96   N/A
GPCR-SVMFS      97.81   97.04   98.61   0.9563

Table 2: Success rates obtained with the GPCR-SVMFS predictor by the jackknife test at subfamily level

GPCR subfamily   Number of proteins   Number of correct predictions   Qi/Q
Amine            46                   43                              93.48
Peptide          72                   71                              98.61
Rhodopsin        17                   15                              88.24
Olfactory        19                   19                              100.0
Nucleotide       13                   10                              76.92
Other            34                   32                              94.12
Overall          201                  190                             94.53
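The jackknife protocol reported in Tables 1 and 2 (each protein held out in turn, with the model rebuilt from the remaining proteins) is equivalent to leave-one-out cross-validation. The following is a minimal illustrative sketch using scikit-learn; the random feature matrix is a placeholder, not the paper's actual mRMR/GA-optimized features:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

# Toy stand-in for the real feature matrix: 40 "proteins" x 10 features,
# two classes (e.g. GPCR vs. non-GPCR). Real features would be the
# optimized subset selected by mRMR and the genetic algorithm.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 10)),
               rng.normal(1.5, 1.0, (20, 10))])
y = np.array([0] * 20 + [1] * 20)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # The classifier is retrained from scratch without the held-out protein,
    # so the test sample never influences the decision rule.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

accuracy = correct / len(y)
print(f"jackknife accuracy: {accuracy:.2%}")
```

Because every sample serves exactly once as the test case, the jackknife estimate is deterministic for a fixed dataset, which is why it is commonly preferred for benchmarking small protein datasets.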
The results at the family level are shown in Table 6. The proposed method achieves accuracies of 96.16% for rhodopsin-like, 85.76% for metabotropic and 68.53% for secretin-like, and an overall accuracy of 93.81%. These results indicate that the performance of GPCR-SVMFS at the family level is satisfactory.

The results for the 5 subfamilies are listed in Table 7. The prediction accuracies for rhodopsin, amine and peptide reach 87.79%, 80.22% and 74.12%, respectively. For the largest subfamily (olfactory), which contains 9977 proteins, the accuracy reaches the highest value, 90.96%. Although the accuracy for nucleotide is only 54.69%, the overall prediction accuracy for classifying subfamilies reaches 84.54%, indicating that the GPCR-SVMFS method can yield good results at the subfamily level.

Conclusion

With the rapid growth of protein sequence data, it is indispensable to develop an automated and reliable method for the classification of GPCRs. In this paper, a three-layer classifier for GPCRs is proposed by coupling SVM with a feature selection method. Compared with existing methods, the proposed method provides better predictive performance and high accuracies for the superfamily, family and subfamily of GPCRs in the jackknife cross-validation test, indicating that the investigation of optimized feature subsets is quite promising and might also hold potential as a useful technique for predicting other attributes of proteins.
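The three-layer design described above can be pictured as a cascade: a sequence is only passed to the family-level model if the superfamily level accepts it as a GPCR, and only to the subfamily model trained for its predicted family. The skeleton below is an illustrative sketch, not the authors' implementation; the toy 2-D features and labels are hypothetical:

```python
from sklearn.svm import SVC

class ThreeLayerPredictor:
    """Hierarchical predictor: superfamily -> family -> subfamily."""

    def __init__(self, superfamily_clf, family_clf, subfamily_models):
        self.superfamily_clf = superfamily_clf    # GPCR vs. non-GPCR
        self.family_clf = family_clf              # e.g. class A..E
        self.subfamily_models = subfamily_models  # one model per family

    def predict(self, x):
        # Layer 1: reject non-GPCRs immediately.
        if self.superfamily_clf.predict([x])[0] == 0:
            return ("non-GPCR", None, None)
        # Layer 2: assign the family.
        family = self.family_clf.predict([x])[0]
        # Layer 3: subfamily, via the model trained for that family only.
        sub = self.subfamily_models.get(family)
        subfamily = sub.predict([x])[0] if sub is not None else None
        return ("GPCR", family, subfamily)

# Toy demo with synthetic 2-D "features" (hypothetical, for illustration).
X1, y1 = [[0, 0], [5, 5], [0, 1], [5, 4]], [0, 1, 0, 1]
X2, y2 = [[5, 5], [9, 9], [5, 4], [9, 8]], ["A", "B", "A", "B"]
X3, y3 = [[5, 5], [5, 9], [5, 4], [5, 8]], ["amine", "peptide", "amine", "peptide"]
predictor = ThreeLayerPredictor(
    SVC().fit(X1, y1),
    SVC().fit(X2, y2),
    {"A": SVC().fit(X3, y3)},
)
print(predictor.predict([5, 5]))
```

One consequence of the cascade design is that errors propagate downward: a sequence misassigned at the family level is handed to the wrong subfamily model, so subfamily accuracy is bounded by family accuracy.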
Figure 7: Comparison of different methods by the jackknife test at family level.
Table 3: The performance of GPCR-SVMFS and GPCRPred at superfamily level

Method          Acc     Sen     Spe     MCC
GPCRPred [14]   99.50   98.60   99.80   0.9900
GPCR-SVMFSa     100.0   100.0   100.0   1.0000

a To be consistent with the evaluation method of GPCRPred, 5-fold cross-validation is utilized.
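The Acc, Sen, Spe and MCC columns in Tables 1 and 3 follow the standard binary-classification definitions over the four confusion-matrix counts. A small helper using the textbook formulas (not taken from the authors' code):

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation
    coefficient from true/false positive and negative counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)            # true positive rate
    spe = tn / (tn + fp)            # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spe, mcc

# A perfect split, as in Table 3's GPCR-SVMFS row, gives
# Acc = Sen = Spe = 1.0 and MCC = 1.0.
print(binary_metrics(99, 99, 0, 0))  # -> (1.0, 1.0, 1.0, 1.0)
```

Unlike plain accuracy, MCC stays informative when the positive and negative classes are of very different sizes (here 778 GPCRs vs. 99 non-GPCRs), which is why it is reported alongside Acc.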
Table 4: The performance of GPCR-SVMFS and GPCRPred at family level

Method (Qi/Q)   Class A   Class B   Class C   Class D   Class E   Overall
GPCRPred [14]   98.10     85.70     81.30     36.40     100.0     97.30
GPCR-SVMFSa     100.0     100.0     100.0     81.82     100.0     99.74

a To be consistent with the evaluation method of GPCRPred, 2-fold cross-validation is utilized.
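The Qi/Q notation used in Tables 2, 4, 5 and 6 denotes the per-class success rate (correct predictions within class i divided by the size of class i) and the overall rate Q (all correct predictions divided by all proteins). A minimal sketch, with illustrative label names:

```python
from collections import Counter

def success_rates(true_labels, pred_labels):
    """Per-class rate Qi = correct_i / n_i and overall rate Q."""
    totals, correct = Counter(true_labels), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            correct[t] += 1
    qi = {c: correct[c] / n for c, n in totals.items()}
    q = sum(correct.values()) / len(true_labels)
    return qi, q

true = ["A", "A", "B", "B", "B", "C"]
pred = ["A", "B", "B", "B", "B", "C"]
qi, q = success_rates(true, pred)
print(qi, q)  # Qi: A=0.5, B=1.0, C=1.0; overall Q = 5/6
```

Note that Q is weighted by class size, so a large, well-predicted class (such as olfactory in Table 7) can keep the overall rate high even when a small class such as nucleotide scores poorly.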
Table 5: The performance of GPCR-SVMFS and GPCRPred at subfamily level

Class A subfamilies               Number of proteins   Qi/Q GPCRPred [14]   Qi/Q GPCR-SVMFSa
Amine                             221                  99.10                100.0
Peptide                           381                  99.70                99.21
Hormone                           25                   100.0                100.0
Rhodopsin                         183                  98.90                99.45
Olfactory                         87                   100.0                100.0
Prostanoid                        38                   100.0                100.0
Nucleotide                        48                   85.40                93.75
Cannabis                          11                   100.0                90.91
Platelet activating factor        4                    100.0                100.0
Gonadotrophin releasing hormone   10                   100.0                100.0
Thyrotropin releasing hormone     7                    85.70                85.71
Melatonin                         13                   100.0                100.0
Viral                             17                   33.30                76.47
Lysospingolipids                  9                    58.80                100.0
Overall                           1054                 97.30                98.77

a To be consistent with the evaluation method of GPCRPred, 2-fold cross-validation is utilized.
Table 6: The prediction power of GPCR-SVMFS on independent datasets at family level

GPCR family      Number of proteins   Number of correct predictions   Qi/Q
Rhodopsin-like   20290