Amino Acids (****) **: ** **

DOI **.****/s*****-*07-0628-7

Printed in The Netherlands

Discrimination of outer membrane proteins using a K-nearest neighbor method

C. Yan, J. Hu, and Y. Wang

Department of Computer Science, Utah State University, Logan, UT, USA

Received September 17, 2007

Accepted October 28, 2007

Published online January 25, 2008; © Springer-Verlag 2008

Summary. Identification of outer membrane proteins (OMPs) from genome is an important task. This paper presents a k-nearest neighbor (K-NN) method for discriminating outer membrane proteins (OMPs). The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by including homologous information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.

Keywords: Prediction; Transmembrane proteins; Machine learning; Gram-negative bacteria

1. Introduction

Membrane proteins are important targets of protein science and cell biology research (Chou and Shen, 2007c; Douglas et al., 2007). Two of the hot topics related to membrane proteins are to identify the type of membrane proteins (Chou and Elrod, 1999; Wang et al., 2004, 2005b, 2006; Chou and Cai, 2005a, b; Liu et al., 2005a, b; Shen and Chou, 2005b, 2007e; Shen et al., 2006; Chou and Shen, 2007c; Pu et al., 2007) and to identify transmembrane regions (e.g. Diao et al., 2007). Outer membrane proteins (OMPs) perform diverse functional roles, including bacterial adhesion, structural integrity of the cell wall, and material transport (Koebnik et al., 2000; Schulz, 2000; Wimley, 2003). These proteins consist of β-barrel transmembrane regions and are found in the outer membranes of gram-negative bacteria and the outer membranes of mitochondria and chloroplasts (Schulz, 2000; Waldispuhl et al., 2006; Wimley, 2003). Unlike α-helical membrane proteins, which can be easily identified based on long stretches of hydrophobic residues, OMPs are more difficult to predict, mainly due to shorter membrane-spanning regions with higher variations in properties (Koebnik et al., 2000).

A few methods have been developed to identify OMPs. Gnanasekaran et al. (2000) used profiles developed from structure-based alignments of porins to identify OMPs. Wimley et al. (2002) analyzed the structures of 15 non-redundant OMPs and developed a method to identify OMPs based on residue composition and structural features. Martelli et al. (2002) and Bagos et al. (2004a, b) used hidden Markov models (HMMs) to discriminate OMPs from globular proteins. Liu et al. (2003) developed a method that combines the residue composition of membrane spanning regions and predicted secondary structure to identify OMPs. Garrow et al. (2005a, b) developed a method for discrimination of OMPs in genomes. Berven et al. (2004) developed the BOMP method that predicts OMPs by combining pattern search, β-barrel score, and a filter that explores the abundance of asparagine and isoleucine in the protein. Gromiha and Suwa (2005) developed a simple method to identify OMPs using a deviation distance based on amino acid composition. Later, they evaluated 11 machine-learning methods for the discrimination of OMPs using residue composition as input (Gromiha and Suwa, 2006). In another study, researchers from the same group used a backward-and-forward approach to select discriminative features from residue composition and dipeptide composition and used a SVM method to identify OMPs (Park et al., 2005).

K-nearest neighbor (K-NN) method has been successfully adopted to predict various protein attributes (Chou, 2002; Shen et al., 2007b), such as protein subcellular

66 C. Yan et al.

localization (see, e.g., Chou and Shen, 2006a, b, c; Chou and Shen, 2007a, b; Shen and Chou, 2007b, c, f; Shen et al., 2007a), subnuclear protein localization (Shen and Chou, 2005a), protein structural classification (Shen et al., 2005), protein fold pattern (Shen and Chou, 2006), membrane protein type (Shen and Chou, 2005b, 2007e; Shen et al., 2006; Chou and Shen, 2007c), enzyme main and sub functional classification (Shen and Chou, 2007a), and signal peptide (Chou and Shen, 2007e; Shen and Chou, 2007d). In this study, we propose a K-NN method for the discrimination of OMPs from non-OMPs. The method achieves 96.1% accuracy, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the proposed method outperforms the others.

2. Materials and methods

2.1 Datasets

Three datasets were obtained from a previous study by Park et al. (2005): SetA contains 208 well-annotated outer membrane proteins (OMPs); SetB has 673 globular proteins (which includes 155 all-α, 156 all-β, 184 α+β, and 178 α/β proteins) from various structure families; and SetC has 206 α-helical membrane proteins (AMPs). In these datasets, the sequence identity between any two proteins is less than 40%. We first used these datasets to evaluate the proposed method. Then, we filtered these datasets so that the similarity between any two proteins is less than 25%. After the filtering, 112 OMPs, 673 globular proteins, and 178 AMPs are left. We also used the filtered datasets to evaluate the proposed method.

2.1.1. Residue composition

The residue composition of a protein was calculated as $x_i = n_i / \sum_j n_j$, where $n_i$ and $n_j$ are the numbers of residues of types i and j. The average residue composition of OMPs was given by $\bar{x}_i^{\mathrm{omp}} = n_i^{\mathrm{omp}} / \sum_j n_j^{\mathrm{omp}}$, where $n_i^{\mathrm{omp}}$ and $n_j^{\mathrm{omp}}$ are the total numbers of residues of types i and j in OMPs. The average residue compositions of globular proteins ($\bar{x}_i^{\mathrm{glo}}$) and AMPs ($\bar{x}_i^{\mathrm{amp}}$) were also calculated in a similar way.

2.2 Weighted Euclidean distance

For a protein (test protein) in the test set, its distance to an OMP protein in the training set (OMP train protein) was calculated using

$D_{\mathrm{omp}} = \sqrt{\sum_i \frac{(x_i^{\mathrm{test}} - x_i^{\mathrm{omp\_train}})^2}{\bar{x}_i^{\mathrm{omp}}}}$,

where $x_i^{\mathrm{test}}$ is the composition of residue type i in the test protein, $x_i^{\mathrm{omp\_train}}$ is the composition of residue type i in the OMP train protein, and $\bar{x}_i^{\mathrm{omp}}$ is the average composition of residue type i for all OMPs in the training set. Notice that $\sqrt{\sum_i (x_i^{\mathrm{test}} - x_i^{\mathrm{omp\_train}})^2}$ gives the Euclidean distance between the test protein and the OMP train protein. Here, in the calculation of $D_{\mathrm{omp}}$, each item within the summation is weighted by a factor of $1/\bar{x}_i^{\mathrm{omp}}$. Therefore, $D_{\mathrm{omp}}$ is referred to as weighted Euclidean distance in this study. The composition of all 20 amino acid residues was used to calculate the distances for all experiments in this study.

Similarly, the weighted Euclidean distance between the test protein and a globular protein in the training set (globular train protein) was calculated using

$D_{\mathrm{glo}} = \sqrt{\sum_i \frac{(x_i^{\mathrm{test}} - x_i^{\mathrm{glo\_train}})^2}{\bar{x}_i^{\mathrm{glo}}}}$,

where $x_i^{\mathrm{test}}$ is the composition of residue type i in the test protein, $x_i^{\mathrm{glo\_train}}$ is the composition of residue type i in the globular train protein, and $\bar{x}_i^{\mathrm{glo}}$ is the average composition of residue type i for all globular proteins in the training set. The weighted Euclidean distance between the test protein and an AMP protein in the training set (AMP train protein) was calculated using

$D_{\mathrm{amp}} = \sqrt{\sum_i \frac{(x_i^{\mathrm{test}} - x_i^{\mathrm{amp\_train}})^2}{\bar{x}_i^{\mathrm{amp}}}}$,

where $x_i^{\mathrm{test}}$ is the composition of residue type i in the test protein, $x_i^{\mathrm{amp\_train}}$ is the composition of residue type i in the AMP train protein, and $\bar{x}_i^{\mathrm{amp}}$ is the average composition of residue type i for all AMPs in the training set.

2.3 K-Nearest neighbor (K-NN) algorithm

For a test protein, its weighted Euclidean distances to every OMP in the training set were calculated separately. K smallest distances were chosen. Let them be $D_{\mathrm{omp},1}, D_{\mathrm{omp},2}, \ldots, D_{\mathrm{omp},k}$. The distance between the test protein and the OMP class was given by

$D_{\mathrm{omp}} = \frac{1}{k} (D_{\mathrm{omp},1} + D_{\mathrm{omp},2} + \cdots + D_{\mathrm{omp},k})$.

The distance between the test protein and the globular protein class ($D_{\mathrm{glo}}$) and the distance between the test protein and the AMP class ($D_{\mathrm{amp}}$) were computed in a similar way. In this study, the value of k was determined empirically. Various values were tried, and the best performance was achieved when k = 4. The test protein was classified into a class to which it has a shorter distance.

2.4 Five-fold cross-validations

Five-fold cross-validation was used to evaluate the proposed method. The overall dataset was divided into five subsets, such that the identity between any two proteins from different datasets was less than 25%. This threshold has been used in many studies for removing redundancy (Rost and Sander, 1993; Ahmad and Sarai, 2004; Deng et al., 2004; Prasad Bahadur et al., 2004; Wang and Brown, 2006). OMPs, globular proteins and AMPs were distributed into the subsets evenly. In each round of experiment, four subsets were used as the training set and the remaining subset was used as the test set. This procedure was repeated five times with each subset being used as test set once. The average performance was reported. It is instructive to point out that independent dataset test, sub-sampling (e.g., five or ten-fold cross-validation) test, and jackknife test are often used for examining the accuracy of a statistical prediction method. Among them, the jackknife test is deemed the most objective and being able to yield a unique result (Chou and Zhang, 1995), as demonstrated by an incisive analysis in a recent comprehensive review (Chou and Shen, 2007d) as well as has been increasingly and widely adopted by investigators to test the power of various prediction methods (Zhou, 1998; Zhou and Doctor, 2003; Huang and Li, 2004; Gao et al., 2005a, b; Wang et al., 2005a; Xiao et al., 2005, 2006; Cao et al., 2006; Chen et al., 2006a, b, 2007; Du and Li, 2006; Gao and Wang, 2006; Guo et al., 2006a, b; Kedarisetti et al., 2006; Mondal et al., 2006; Niu et al., 2006; Sun and Huang, 2006; Wen et al., 2006; Yan et al., 2006; Zhang et al., 2006; Chen and Li, 2007; Diao et al., 2007; Ding et al., 2007; Fang et al., 2007; Jahandideh et al., 2007; Lin and Li, 2007a, b; Liu et al., 2007; Shen and Chou, 2007e; Shen et al., 2007a; Shi et al., 2007; Tan et al., 2007; Xiao and Chou, 2007; Zhang and Ding, 2007; Zhou et al., 2007). However, in the current study, we choose to use five-fold cross-validation because it also has been widely used in previous studies and, more importantly, it is less time-consuming than jackknife test.

2.5 Including homologous sequences into the calculation of residue composition

For each protein, the BLAST program (Altschul et al., 1997) was used to search for homologous sequences from the National Center for Biotechnology Information (NCBI) non-redundant database using an E-value of 0.0001. 50 best hits (not including the query sequence itself) were chosen from the returned result. If less than 50 hits were returned, then all of the hits were chosen. These homologous proteins plus the query protein were used to compute the residue composition for the query protein.
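As an illustration of Sections 2.1.1 to 2.3 and 2.5, the following is a minimal Python sketch of the residue composition, the weighted Euclidean distance, and the K-NN class distance. It is not the authors' implementation. The function names and the toy sequences are hypothetical, and two choices are assumptions that the text does not specify: residue types whose class average is zero are skipped to avoid division by zero, and homologous sequences are combined with the query by pooling residue counts.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def composition(sequences):
    """Residue composition x_i = n_i / sum_j n_j, pooled over one or more sequences.

    Passing [query] + homologs gives a pooled composition (an assumed reading of Section 2.5).
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def weighted_distance(x_test, x_train, x_mean):
    """sqrt(sum_i (x_test_i - x_train_i)^2 / x_mean_i).

    Assumption: residue types with a zero class average are skipped.
    """
    return sqrt(sum((a - b) ** 2 / m
                    for a, b, m in zip(x_test, x_train, x_mean) if m > 0))


def class_distance(x_test, train_comps, x_mean, k=4):
    """Mean of the k smallest weighted distances from the test protein to one class."""
    dists = sorted(weighted_distance(x_test, x, x_mean) for x in train_comps)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def classify(test_seq, training, k=4):
    """training maps a class label to a list of sequences.

    Returns the label of the class with the smallest class distance.
    """
    x_test = composition([test_seq])
    best_label, best_dist = None, float("inf")
    for label, seqs in training.items():
        comps = [composition([s]) for s in seqs]
        x_mean = composition(seqs)  # class average from total residue counts
        d = class_distance(x_test, comps, x_mean, k)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


if __name__ == "__main__":
    # Toy sequences, for illustration only; they are not real proteins.
    toy = {"omp": ["GGGGYYYY", "GGGYYYYY"], "glo": ["AAAALLLL", "AALLAALL"]}
    print(classify("GGYYGGYY", toy, k=2))
```

With three classes (OMP, globular, AMP) in `training`, predicting OMP only when the OMP class distance is the smallest corresponds to the rule described for OMPs vs. non-OMPs.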

Prediction of outer membrane proteins 67

2.6 Performance measures

Three types of two-class classifications were performed in this study: OMPs vs. globular, OMPs vs. AMPs, and OMPs vs. non-OMPs (i.e., globular + AMPs). In each of these experiments, OMP class was defined as the positive class, and the other was the negative class. Let TP be the number of true positives (i.e., the number of OMPs predicted as OMPs); TN be the number of true negatives (i.e., the number of proteins from the negative class that are predicted to belong to the negative class); FN be the number of false negatives (i.e., the number of OMPs incorrectly predicted as negative) and FP be the number of false positives (i.e., the number of negative proteins incorrectly predicted as OMPs). Several measures were used to evaluate the method:

$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$

$\mathrm{Specificity} = \frac{TN}{TN + FP}$

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$

3. Results

3.1 The proposed method can distinguish between OMPs and globular proteins

First, we evaluated the proposed method's ability to discriminate OMPs from globular proteins. Five-fold cross-validations were performed as described in Materials and Methods. The results (Table 1, column 2) show that the proposed method achieves 88.8% overall accuracy with 0.708 MCC. 84.1% (sensitivity) of the OMPs and 90.2% (specificity) of the globular proteins are correctly identified.

3.2 Using homologous sequences to calculate residue composition can improve the performance

For each protein, we included homologous sequences into the calculation of residue composition as described in Materials and Methods. The comparison of residue compositions calculated using single sequence and homologous sequences is available at http://www.cs.usu.edu/~cyan/OMP_KNN/supplement.htm. We repeated the five-fold cross-validations using the same dataset partition used in Section 3.1. Comparisons (Table 1, columns 2 and 3) show that using homologous information can improve the performance remarkably: the accuracy is increased to 96.0%; MCC is increased to 0.888; 87.5% of the OMPs (sensitivity) and 98.7% of the globular proteins (specificity) are correctly identified.

3.3 The proposed method can distinguish OMPs from α-helical membrane proteins (AMPs)

We then evaluated the proposed method's ability to discriminate OMPs from AMPs. Five-fold cross-validations were performed such that the identity between any protein from the training set and any protein from the test set is less than 25%. The results (Table 1, columns 4 and 5) show that when using single protein sequence as input, the proposed method can discriminate between OMPs and AMPs with 90.1% overall accuracy and 0.802 MCC. 88.9% (sensitivity) of OMPs and 91.3% (specificity) of AMPs are correctly predicted. When homologous sequences are used, the performance is improved remarkably, reaching 94.7% accuracy with 0.894 MCC, 95.7% sensitivity, and 93.7% specificity.

3.4 Discrimination between OMPs and non-OMPs

Based on the results obtained in the previous sections, we designed a method for discriminating OMPs from non-OMPs (AMPs + globular proteins). In the method, the composition of a test protein was calculated. The distances from the protein to OMPs, AMPs and globular proteins were calculated based on the K-NN method. If the distance to OMPs was the smallest among them, then the protein was

Table 1. Performance of the K-NN method evaluated using five-fold cross-validations

Classification  OMPs vs Globular          OMPs vs AMPs              OMPs vs Non-OMPs
Mode            Single (a)  Homologous (b)  Single    Homologous    Single    Homologous
Accuracy        88.8%       96.0%           90.1%     94.7%         89.1%     96.1%
MCC             0.708       0.888           0.802     0.894         0.668     0.873
Sensitivity     84.1%       87.5%           88.9%     95.7%         78.8%     87.5%
Specificity     90.2%       98.7%           91.3%     93.7%         91.5%     98.2%

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition


predicted to be OMP. Otherwise, it was predicted to be non-OMP. The method was evaluated using five-fold cross-validations. The results (Table 1, columns 6 and 7) show that the proposed method achieves 96.1% accuracy with 0.873 MCC when homologous sequences are used. The distances to OMPs, AMPs and globular proteins for each protein are available at http://www.cs.usu.edu/~cyan/OMP_KNN/supplement.htm.

3.5 Evaluating the proposed method using datasets with lower sequence similarity

In the datasets used so far, sequence similarity between two proteins can be as high as 40%. It will be important to know how the proposed method performs if the mutual similarities between sequences in datasets are lower. To investigate this, we filtered the datasets so that the sequence similarity between any two sequences was less than 25%. After the filtering, 112 OMPs, 673 globular proteins, and 178 AMPs were left. We then used these filtered datasets to evaluate the proposed method using five-fold cross-validation. When homologous sequences were used to calculate residue composition, we also required that the homologous sequences of any two sequences did not overlap. Thus, if one protein from the NCBI non-redundant database was homologous to both proteins A and B in the datasets, it was only used to calculate the residue composition of protein A (or B). We applied this requirement to prevent the case that one test protein was correctly classified only because its homologous sequences overlapped with the homologous sequences of some proteins in the training set. The results (Table 2) show that reducing the sequence similarity in the dataset only reduces the performance slightly. The proposed method can still discriminate OMPs and non-OMPs with very high performance: 95.3% accuracy with 0.760 MCC. The improvement of using homologous sequences is obvious: accuracy is improved from 91.8% to 95.3% and MCC is improved from 0.606 to 0.760.

Table 2. Discrimination of outer membrane proteins (OMPs) from non-OMPs on datasets with sequence similarity less than 25%

              Single sequence (a)   Homologous sequences (b)
Accuracy      91.8% (89.1%)         95.3% (96.1%)
MCC           0.606 (0.668)         0.760 (0.873)
Sensitivity   66.1% (78.8%)         72.3% (87.5%)
Specificity   95.2% (91.5%)         98.4% (98.2%)

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition
Values in parentheses are the performance on the datasets with mutual sequence similarity less than 40%

3.6 Comparison with predictions solely based on similarity search

The results have shown that combining homologous information with the K-NN method can improve the performance dramatically. How well, then, does prediction perform if it is based solely on homologous search? To explore this, for each test protein, we performed a homologous search on the training set using the BLAST program. The test protein was classified into the class of the protein from the training set that shares the highest similarity with the test protein. We evaluated this approach using the same five-fold cross-validations that we used to evaluate our K-NN method. The results (Table 3) show that the BLAST search approach only achieves 77.6% accuracy with 0.446 MCC. In comparison, the K-NN method achieves as high as 95.3% accuracy and 0.760 MCC on the same datasets.

Table 3. Comparison of the proposed method (K-NN) with BLAST search in the discrimination of OMPs and non-OMPs

              K-NN                                            BLAST search
              Single sequence (a)   Homologous sequences (b)
Accuracy      91.8%                 95.3%                     77.6%
MCC           0.606                 0.760                     0.446
Sensitivity   66.1%                 72.3%                     88.4%
Specificity   95.2%                 98.4%                     76.1%

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition

3.7 Comparisons with other methods

We also compare the proposed method with multiple previously published methods. As discussed in Baldi et al. (2000), in a two-class classification, if the numbers of examples of the two classes are not equal, MCC is a better measure for evaluating the classification performance. In many studies, MCC has been used as the standard for comparing different predicting methods (Bao and Cui, 2005; Dobson et al., 2006; Ye et al., 2007). In the discrimination of OMPs and non-OMPs, the numbers of examples in the two classes are not equal. Therefore, we will


use MCC as the primary measure in the comparison of different methods. At the same time, we also report accuracy, specificity and sensitivity.

Gromiha and Suwa (2006) tried a set of 11 machine learning methods for the discrimination of OMPs using residue composition as input. One of the 11 methods was a k-nearest neighbor method based on Euclidean distance. Neural network method was reported to achieve the best performance in their study. In another study, researchers from the same group (Park et al., 2005) developed a support vector machine (SVM) method to discriminate OMPs. Both studies used the same datasets that we use in this study to evaluate their methods based on five-fold cross-validations. So, we compared the results we obtained in the current study with what they reported in their publications. To assess the statistical significance of the comparison, we evaluated our K-NN method by repeating the five-fold cross-validations five times using different data splits. The average and the standard deviation are shown in Table 4 (row 2). The results (Table 4) show that our K-NN outperforms all of the 11 methods used in Gromiha and Suwa's study (2006) and the SVM method developed by Park et al. (2005). Note that not all of the 11 methods from Gromiha and Suwa's study (2006) are shown in Table 4. Since neural network was reported to achieve the best performance among the 11 methods, we only show the results of the neural network method in the table. Table 4 shows that the MCC of K-NN is remarkably higher than those of neural network and SVM. The differences are larger than three times the standard deviation in both cases. This confirms the statistical significance of the improvement. It is also worth pointing out that in the five-fold cross-validations performed in the current study, we made sure that the sequence similarity between any protein from the training set and any protein from the test sets is less than 25%. Meanwhile, in the studies of Gromiha and Suwa (2006) and Park et al. (2005), the sequence similarity between training and test sets can be as high as 40%. Although we use a stricter criterion to evaluate our K-NN, the performance of our method is still better than what was reported for the other two methods using a looser criterion.

Table 4. Comparisons of different methods in the discrimination of OMPs and non-OMPs

                                                  Accuracy     MCC             Sensitivity   Specificity
K-NN method (a)                                   95.7 ± 0.4   0.858 ± 0.011   86.3 ± 1.0    97.9 ± 0.3
Neural Network (b, c) (Gromiha and Suwa, 2006)    91.0         0.716           79.3          93.8
Support Vector Machine (b) (Park et al., 2005)    93.9         0.816           90.9          94.7

(a) The method proposed in this study. The five-fold cross-validations performed in this study require that the sequence similarity between any protein from the training set and any protein from the test sets is less than 25%. The five-fold cross-validations are repeated five times. The average and standard deviation are reported
(b) The statistics are obtained from the original publications (Gromiha and Suwa, 2006; Park et al., 2005). Note that the original studies used the same datasets and the same type of cross-validation (five-fold cross-validations) as the current study. But the sequence similarity between training and test sets can be as high as 40%. In the original publication, only Accuracy, Sensitivity and Specificity were reported. Here, we calculate the MCC based on their published statistics
(c) In their study, Gromiha and Suwa (2006) evaluated 11 different methods. Neural network was reported to be the best

Table 5. Comparison of the proposed method (K-NN) and the BOMP method (Berven et al., 2004)

                                               Accuracy     MCC             Sensitivity   Specificity
Datasets used in this study (a)       K-NN     95.6 ± 0.2   0.774 ± 0.011   74.3 ± 1.5%   98.4 ± 0.2%
                                      BOMP     93.1         0.623           52.7          98.5
Datasets used in Berven et al. (2004) K-NN     98.8         0.870           83.1          99.6
                                      BOMP     98.3         0.831           88.1          98.8

(a) The dataset used in this study was submitted to the BOMP server. The dataset submitted to the BOMP server is likely to overlap with the dataset that BOMP was trained on. On the contrary, in the evaluation of our K-NN method, we make sure that the sequence similarity between training and test sets is less than 25%

Berven et al. (2004) developed a method, BOMP, for the prediction of OMPs. It is one of the best scoring methods in identifying OMPs from genome. The BOMP method combines a pattern search, a β-barrel score based on amino acid distribution, and a filter that explores the abundance of Asparagine and Isoleucine in the protein. Here,


we also compared our K-NN method with BOMP. First, we submitted the filtered datasets (in which similarity between any two proteins is less than 25%) used in this study to the BOMP server. The results (Table 5) show that our K-NN method outperforms BOMP. The MCC of K-NN exceeds that of BOMP by more than 10 times the standard deviation. This confirms the statistical significance of the improvement. It is worth pointing out that in this comparison, the dataset submitted to the BOMP server is likely to overlap with the dataset that BOMP was trained on. On the contrary, in the evaluation of our K-NN method, we make sure that the sequence similarity between training and test sets is less than 25%. We also evaluated our K-NN method using the same datasets that Berven et al. (2004) used to evaluate their BOMP method. The results (Table 5) show that our K-NN method still outperforms BOMP on their datasets. When BOMP datasets were used, leave-one-out cross validations were used to evaluate both methods as described in Berven et al. (2004). We notice that when compared using the BOMP dataset, the improvement of K-NN method over BOMP is not as large as when our dataset is used. The possible reason is that the BOMP dataset contains only a small number of positive examples (59 in total). Since leave-one-out cross-validations were performed, we could not calculate the standard deviation as we did in five-fold cross-validation (because there is only one possible way to split data for leave-one-out cross validation). However, the improvement of the K-NN method in MCC is still clear.

3.8 Receiver operating characteristic (ROC) curve

In the K-NN method, a protein is classified as OMP or non-OMP based on the comparison of Domp (its distance to the OMP group), Dglo (its distance to the globular protein group), and Damp (its distance to the AMP group). A protein is predicted to be OMP if Domp

Fig. 1. ROC curve of the K-NN method

The advantage of introducing this parameter a to the K-NN method is that users can choose a threshold based on their need. When a is set to a lower value, the K-NN can achieve higher specificity. On the other hand, when a high value of a is chosen, the K-NN can achieve higher sensitivity.

3.9 Identification of OMPs in the proteome of E. coli

We applied the K-NN method to search for OMPs in the proteome of E. coli using a = 0.11, which corresponds to 99% specificity in the ROC curve. The E. coli proteome consists of 4319 proteins. 123 of them were predicted to be OMP proteins. That accounts for 2.8% of the whole proteome. This ratio is consistent with the previous estimation that 2-3% of the genes in Gram-negative bacteria encode OMPs (Wimley, 2003). Among these 123 hits, 61 proteins are annotated as OMP proteins in Swiss-Prot (Bairoch et al., 2004) or ePSORTdb (Rey et al., 2005), a database of protein subcellular locations that have been determined through laboratory experiments. In addition, 20 proteins are annotated with Membrane, Cell membrane and
