Printed in The Netherlands
Discrimination of outer membrane proteins using a K-nearest
neighbor method
C. Yan, J. Hu, and Y. Wang
Department of Computer Science, Utah State University, Logan, UT, USA
Received September 17, 2007
Accepted October 28, 2007
Published online January 25, 2008; © Springer-Verlag 2008
Summary. Identification of outer membrane proteins (OMPs) from genomes is an important task. This paper presents a k-nearest neighbor (K-NN) method for discriminating outer membrane proteins (OMPs). The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by including homologous information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.

Keywords: Prediction; Transmembrane proteins; Machine learning; Gram-negative bacteria

1. Introduction

Membrane proteins are important targets of protein science and cell biology research (Chou and Shen, 2007c; Douglas et al., 2007). Two active topics related to membrane proteins are identifying the type of membrane proteins (Chou and Elrod, 1999; Wang et al., 2004, 2005b, 2006; Chou and Cai, 2005a, b; Liu et al., 2005a, b; Shen and Chou, 2005b, 2007e; Shen et al., 2006; Chou and Shen, 2007c; Pu et al., 2007) and identifying transmembrane regions (e.g. Diao et al., 2007). Outer membrane proteins (OMPs) perform diverse functional roles, including bacterial adhesion, structural integrity of the cell wall, and material transport (Koebnik et al., 2000; Schulz, 2000; Wimley, 2003). These proteins consist of β-barrel transmembrane regions and are found in the outer membranes of gram-negative bacteria and the outer membranes of mitochondria and chloroplasts (Schulz, 2000; Waldispuhl et al., 2006; Wimley, 2003). Unlike α-helical membrane proteins, which can be easily identified based on long stretches of hydrophobic residues, OMPs are more difficult to predict, mainly due to shorter membrane-spanning regions with higher variations in properties (Koebnik et al., 2000).

A few methods have been developed to identify OMPs. Gnanasekaran et al. (2000) used profiles developed from structure-based alignments of porins to identify OMPs. Wimley et al. (2002) analyzed the structures of 15 non-redundant OMPs and developed a method to identify OMPs based on residue composition and structural features. Martelli et al. (2002) and Bagos et al. (2004a, b) used hidden Markov models (HMMs) to discriminate OMPs from globular proteins. Liu et al. (2003) developed a method that combines the residue composition of membrane-spanning regions and predicted secondary structure to identify OMPs. Garrow et al. (2005a, b) developed a method for discrimination of OMPs in genomes. Berven et al. (2004) developed the BOMP method that predicts OMPs by combining pattern search, β-barrel score, and a filter that explores the abundance of asparagine and isoleucine in the protein. Gromiha and Suwa (2005) developed a simple method to identify OMPs using a deviation distance based on amino acid composition. Later, they evaluated 11 machine-learning methods for the discrimination of OMPs using residue composition as input (Gromiha and Suwa, 2006). In another study, researchers from the same group used a backward-and-forward approach to select discriminative features from residue composition and dipeptide composition and used an SVM method to identify OMPs (Park et al., 2005).

The K-nearest neighbor (K-NN) method has been successfully adopted to predict various protein attributes (Chou, 2002; Shen et al., 2007b), such as protein subcellular
localization (see, e.g., Chou and Shen, 2006a, b, c; Chou and Shen, 2007a, b; Shen and Chou, 2007b, c, f; Shen et al., 2007a), subnuclear protein localization (Shen and Chou, 2005a), protein structural classification (Shen et al., 2005), protein fold pattern (Shen and Chou, 2006), membrane protein type (Shen and Chou, 2005b, 2007e; Shen et al., 2006; Chou and Shen, 2007c), enzyme main and sub functional classification (Shen and Chou, 2007a), and signal peptide (Chou and Shen, 2007e; Shen and Chou, 2007d). In this study, we propose a K-NN method for the discrimination of OMPs from non-OMPs. The method achieves 96.1% accuracy, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the proposed method outperforms the others.

2. Materials and methods

2.1 Datasets

Three datasets were obtained from a previous study by Park et al. (2005): SetA contains 208 well-annotated outer membrane proteins (OMPs); SetB has 673 globular proteins (which includes 155 all-α, 156 all-β, 184 α+β, and 178 α/β proteins) from various structure families; and SetC has 206 α-helical membrane proteins (AMPs). In these datasets, the sequence identity between any two proteins is less than 40%. We first used these datasets to evaluate the proposed method. Then, we filtered these datasets so that the similarity between any two proteins is less than 25%. After the filtering, 112 OMPs, 673 globular proteins, and 178 AMPs are left. We also used the filtered datasets to evaluate the proposed method.

2.1.1. Residue composition

The residue composition of a protein was calculated as $x_i = n_i / \sum_j n_j$, where $n_i$ and $n_j$ are the numbers of residues of types i and j. The average residue composition of OMPs was given by $\bar{x}_i^{omp} = n_i^{omp} / \sum_j n_j^{omp}$, where $n_i^{omp}$ and $n_j^{omp}$ are the total numbers of residues of types i and j in OMPs. The average residue compositions of globular proteins $\bar{x}_i^{glo}$ and AMPs $\bar{x}_i^{amp}$ were also calculated in a similar way.

2.2 Weighted Euclidean distance

For a protein (test protein) in the test set, its distance to an OMP protein in the training set (OMP train protein) was calculated using

$D_{omp} = \sqrt{\sum_i \frac{(x_i^{test} - x_i^{omp\,train})^2}{\bar{x}_i^{omp}}}$,

where $x_i^{test}$ is the composition of residue type i in the test protein, $x_i^{omp\,train}$ is the composition of residue type i in the OMP train protein, and $\bar{x}_i^{omp}$ is the average composition of residue type i for all OMPs in the training set. Notice that $\sqrt{\sum_i (x_i^{test} - x_i^{omp\,train})^2}$ gives the Euclidean distance between the test protein and the OMP train protein. Here, in the calculation of $D_{omp}$, each item within the summation is weighted by a factor of $1/\bar{x}_i^{omp}$. Therefore, $D_{omp}$ is referred to as weighted Euclidean distance in this study. The composition of all 20 amino acid residues was used to calculate the distances for all experiments in this study.

Similarly, the weighted Euclidean distance between the test protein and a globular protein in the training set (globular train protein) was calculated using

$D_{glo} = \sqrt{\sum_i \frac{(x_i^{test} - x_i^{glo\,train})^2}{\bar{x}_i^{glo}}}$,

where $x_i^{test}$ is the composition of residue type i in the test protein, $x_i^{glo\,train}$ is the composition of residue type i in the globular train protein, and $\bar{x}_i^{glo}$ is the average composition of residue type i for all globular proteins in the training set. The weighted Euclidean distance between the test protein and an AMP protein in the training set (AMP train protein) was calculated using

$D_{amp} = \sqrt{\sum_i \frac{(x_i^{test} - x_i^{amp\,train})^2}{\bar{x}_i^{amp}}}$,

where $x_i^{test}$ is the composition of residue type i in the test protein, $x_i^{amp\,train}$ is the composition of residue type i in the AMP train protein, and $\bar{x}_i^{amp}$ is the average composition of residue type i for all AMPs in the training set.

2.3 K-Nearest neighbor (K-NN) algorithm

For a test protein, its weighted Euclidean distances to every OMP in the training set were calculated separately. The k smallest distances were chosen. Let them be $D_{omp,1}, D_{omp,2}, \ldots, D_{omp,k}$. The distance between the test protein and the OMP class was given by

$D_{omp} = \frac{1}{k}\left(D_{omp,1} + D_{omp,2} + \cdots + D_{omp,k}\right)$.

The distance between the test protein and the globular protein class ($D_{glo}$) and the distance between the test protein and the AMP class ($D_{amp}$) were computed in a similar way. In this study, the value of k was determined empirically. Various values were tried, and the best performance was achieved when k = 4. The test protein was classified into the class to which it had the shortest distance.

2.4 Five-fold cross-validations

Five-fold cross-validation was used to evaluate the proposed method. The overall dataset was divided into five subsets, such that the identity between any two proteins from different subsets was less than 25%. This threshold has been used in many studies for removing redundancy (Rost and Sander, 1993; Ahmad and Sarai, 2004; Deng et al., 2004; Prasad Bahadur et al., 2004; Wang and Brown, 2006). OMPs, globular proteins and AMPs were distributed into the subsets evenly. In each round of experiment, four subsets were used as the training set and the remaining subset was used as the test set. This procedure was repeated five times with each subset being used as the test set once. The average performance was reported. It is instructive to point out that the independent dataset test, the sub-sampling (e.g., five- or ten-fold cross-validation) test, and the jackknife test are often used for examining the accuracy of a statistical prediction method. Among them, the jackknife test is deemed the most objective and able to yield a unique result (Chou and Zhang, 1995), as demonstrated by an incisive analysis in a recent comprehensive review (Chou and Shen, 2007d), and it has been increasingly and widely adopted by investigators to test the power of various prediction methods (Zhou, 1998; Zhou and Doctor, 2003; Huang and Li, 2004; Gao et al., 2005a, b; Wang et al., 2005a; Xiao et al., 2005, 2006; Cao et al., 2006; Chen et al., 2006a, b, 2007; Du and Li, 2006; Gao and Wang, 2006; Guo et al., 2006a, b; Kedarisetti et al., 2006; Mondal et al., 2006; Niu et al., 2006; Sun and Huang, 2006; Wen et al., 2006; Yan et al., 2006; Zhang et al., 2006; Chen and Li, 2007; Diao et al., 2007; Ding et al., 2007; Fang et al., 2007; Jahandideh et al., 2007; Lin and Li, 2007a, b; Liu et al., 2007; Shen and Chou, 2007e; Shen et al., 2007a; Shi et al., 2007; Tan et al., 2007; Xiao and Chou, 2007; Zhang and Ding, 2007; Zhou et al., 2007). However, in the current study, we chose to use five-fold cross-validation because it also has been widely used in previous studies and, more importantly, it is less time-consuming than the jackknife test.

2.5 Including homologous sequences into the calculation of residue composition

For each protein, the BLAST program (Altschul et al., 1997) was used to search for homologous sequences from the National Center for Biotechnology Information (NCBI) non-redundant database using an E-value of 0.0001. The 50 best hits (not including the query sequence itself) were chosen from the returned result. If fewer than 50 hits were returned, then all of the hits were chosen. These homologous proteins plus the query protein were used to compute the residue composition for the query protein.
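The calculations of Sections 2.1.1 to 2.3, together with the homolog pooling of Section 2.5, can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, each protein is represented as a list of sequences (the query plus any homologs) whose residue counts are pooled, the class average is taken over all pooled sequences of the class, and residue types with a zero class average are skipped to avoid division by zero. The sequences in the usage example are made up.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def composition(sequences):
    """x_i = n_i / sum_j n_j, with counts pooled over all given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def weighted_distance(x_test, x_train, x_bar):
    """sqrt(sum_i (x_test_i - x_train_i)^2 / x_bar_i)."""
    # Assumption: residue types with a zero class average are skipped;
    # the paper does not discuss this case.
    return sqrt(sum((t - r) ** 2 / m
                    for t, r, m in zip(x_test, x_train, x_bar) if m > 0))


def class_distance(x_test, proteins, k=4):
    """Mean of the k smallest weighted distances to the proteins of one class."""
    # Class-average composition from residue counts pooled over the class.
    x_bar = composition([s for protein in proteins for s in protein])
    dists = sorted(weighted_distance(x_test, composition(p), x_bar)
                   for p in proteins)
    nearest = dists[:k]  # fewer than k proteins: use all of them
    return sum(nearest) / len(nearest)


def classify(test_protein, training, k=4):
    """Return (label, distances) for the class with the smallest distance."""
    x_test = composition(test_protein)
    distances = {label: class_distance(x_test, proteins, k)
                 for label, proteins in training.items()}
    return min(distances, key=distances.get), distances


# Toy usage with made-up sequences (not real proteins).
training = {
    "OMP": [["GGGGYYYYNN"], ["GGGYYYNNNA"]],
    "globular": [["LLLLEEEEKK"], ["LLLEEEKKKA"]],
}
label, distances = classify(["GGGYYYYNNN"], training, k=1)
print(label, distances)
```

With a third "AMP" entry in `training`, the same `classify` call covers the three-class comparison; a protein is then called OMP only when the OMP distance is the smallest of the three.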
2.6 Performance measures

Three types of two-class classifications were performed in this study: OMPs vs. globular, OMPs vs. AMPs, and OMPs vs. non-OMPs (i.e., globular + AMPs). In each of these experiments, the OMP class was defined as the positive class, and the other was the negative class. Let TP be the number of true positives (i.e., the number of OMPs predicted as OMPs); TN be the number of true negatives (i.e., the number of proteins from the negative class that are predicted to belong to the negative class); FN be the number of false negatives (i.e., the number of OMPs incorrectly predicted as negative) and FP be the number of false positives (i.e., the number of negative proteins incorrectly predicted as OMPs). Several measures were used to evaluate the method:

$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$

$\mathrm{Specificity} = \frac{TN}{TN + FP}$

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$

3. Results

3.1 The proposed method can distinguish between OMPs and globular proteins

First, we evaluated the proposed method's ability to discriminate OMPs from globular proteins. Five-fold cross-validations were performed as described in Materials and Methods. The results (Table 1, column 2) show that the proposed method achieves 88.8% overall accuracy with 0.708 MCC. 84.1% (sensitivity) of the OMPs and 90.2% (specificity) of the globular proteins are correctly identified.

3.2 Using homologous sequences to calculate residue composition can improve the performance

For each protein, we included homologous sequences into the calculation of residue composition as described in Materials and Methods. The comparison of residue compositions calculated using single sequence and homologous sequences is available at http://www.cs.usu.edu/~cyan/OMP_KNN/supplement.htm. We repeated the five-fold cross-validations using the same dataset partition used in Section 3.1. Comparisons (Table 1, columns 2 and 3) show that using homologous information can improve the performance remarkably: the accuracy is increased to 96.0%; MCC is increased to 0.888; 87.5% of the OMPs (sensitivity) and 98.7% of the globular proteins (specificity) are correctly identified.

3.3 The proposed method can distinguish OMPs from α-helical membrane proteins (AMPs)

We then evaluated the proposed method's ability to discriminate OMPs from AMPs. Five-fold cross-validations were performed such that the identity between any protein from the training set and any protein from the test set is less than 25%. The results (Table 1, columns 4 and 5) show that when using single protein sequence as input, the proposed method can discriminate between OMPs and AMPs with 90.1% overall accuracy and 0.802 MCC. 88.9% (sensitivity) of OMPs and 91.3% (specificity) of AMPs are correctly predicted. When homologous sequences are used, the performance is improved remarkably, reaching 94.7% accuracy with 0.894 MCC, 95.7% sensitivity, and 93.7% specificity.

3.4 Discrimination between OMPs and non-OMPs

Based on the results obtained in the previous sections, we designed a method for discriminating OMPs from non-OMPs (AMPs + globular proteins). In the method, the composition of a test protein was calculated. The distances from the protein to OMPs, AMPs and globular proteins were calculated based on the K-NN method. If the distance to OMPs was the smallest among them, then the protein was
Table 1. Performance of the K-NN method evaluated using five-fold cross-validations

Classification   OMPs vs Globular              OMPs vs AMPs                  OMPs vs Non-OMPs
Mode             Single (a)     Homologous (b) Single         Homologous     Single         Homologous
Accuracy         88.8%          96.0%          90.1%          94.7%          89.1%          96.1%
MCC              0.708          0.888          0.802          0.894          0.668          0.873
Sensitivity      84.1%          87.5%          88.9%          95.7%          78.8%          87.5%
Specificity      90.2%          98.7%          91.3%          93.7%          91.5%          98.2%

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition
predicted to be OMP. Otherwise, it was predicted to be non-OMP. The method was evaluated using five-fold cross-validations. The results (Table 1, columns 6 and 7) show that the proposed method achieves 96.1% accuracy with 0.873 MCC when homologous sequences are used. The distances to OMPs, AMPs and globular proteins for each protein are available at http://www.cs.usu.edu/~cyan/OMP_KNN/supplement.htm.

3.5 Evaluating the proposed method using datasets with lower sequence similarity

In the datasets used so far, sequence similarity between two proteins can be as high as 40%. It is important to know how the proposed method performs if the mutual similarities between sequences in the datasets are lower. To investigate this, we filtered the datasets so that the sequence similarity between any two sequences was less than 25%. After the filtering, 112 OMPs, 673 globular proteins, and 178 AMPs were left. We then used these filtered datasets to evaluate the proposed method using five-fold cross-validation. When homologous sequences were used to calculate residue composition, we also required that the homologous sequences of any two sequences did not overlap. Thus, if one protein from the NCBI non-redundant database was homologous to both proteins A and B in the datasets, it was only used to calculate the residue composition of protein A (or B). We applied this requirement to prevent the case that one test protein was correctly classified only because its homologous sequences overlapped with the homologous sequences of some proteins in the training set. The results (Table 2) show that reducing the sequence similarity in the dataset only reduces the performance slightly. The proposed method can still discriminate OMPs and non-OMPs with very high performance: 95.3% accuracy with 0.760 MCC. The improvement from using homologous sequences is obvious: accuracy is improved from 91.8% to 95.3% and MCC is improved from 0.606 to 0.760.

Table 2. Discrimination of outer membrane proteins (OMPs) from non-OMPs on datasets with sequence similarity less than 25%

              Single sequence (a)   Homologous sequences (b)
Accuracy      91.8% (89.1%) (c)     95.3% (96.1%)
MCC           0.606 (0.668)         0.760 (0.873)
Sensitivity   66.1% (78.8%)         72.3% (87.5%)
Specificity   95.2% (91.5%)         98.4% (98.2%)

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition
(c) Values in parentheses are the performance on the datasets with mutual sequence similarity less than 40%

3.6 Comparison with predictions solely based on similarity search

The results have shown that combining homologous information with the K-NN method can improve the performance dramatically. How well, then, does prediction perform if it is based solely on homology search? To explore this, for each test protein, we performed a homology search on the training set using the BLAST program. The test protein was classified into the class of the protein from the training set that shares the highest similarity with the test protein. We evaluated this approach using the same five-fold cross-validations that we used to evaluate our K-NN method. The results (Table 3) show that the BLAST search approach only achieves 77.6% accuracy with 0.446 MCC. In comparison, the K-NN method achieves as high as 95.3% accuracy and 0.760 MCC on the same datasets.

Table 3. Comparison of the proposed method (K-NN) with BLAST search in the discrimination of OMPs and non-OMPs

              K-NN                                             BLAST search
              Single sequence (a)   Homologous sequences (b)
Accuracy      91.8%                 95.3%                      77.6%
MCC           0.606                 0.760                      0.446
Sensitivity   66.1%                 72.3%                      88.4%
Specificity   95.2%                 98.4%                      76.1%

(a) For each protein, only the protein itself was used to calculate residue composition
(b) For each protein, 50 homologous proteins were included in the calculation of residue composition

3.7 Comparisons with other methods

We also compare the proposed method with multiple previously published methods. As discussed in Baldi et al. (2000), in a two-class classification, if the numbers of examples of the two classes are not equal, MCC is a better measure for evaluating the classification performance. In many studies, MCC has been used as the standard for comparing different prediction methods (Bao and Cui, 2005; Dobson et al., 2006; Ye et al., 2007). In the discrimination of OMPs and non-OMPs, the numbers of examples in the two classes are not equal. Therefore, we will
Table 4. Comparisons of different methods in the discrimination of OMPs and non-OMPs

Method                                            Accuracy (%)   MCC             Sensitivity (%)   Specificity (%)
K-NN method (a)                                   95.7 ± 0.4     0.858 ± 0.011   86.3 ± 1.0        97.9 ± 0.3
Neural Network (b, c) (Gromiha and Suwa, 2006)    91.0           0.716           79.3              93.8
Support Vector Machine (b) (Park et al., 2005)    93.9           0.816           90.9              94.7

(a) The method proposed in this study. The five-fold cross-validations performed in this study require that the sequence similarity between any protein from the training set and any protein from the test sets is less than 25%. The five-fold cross-validations are repeated five times. The average and standard deviation are reported
(b) The statistics are obtained from the original publications (Gromiha and Suwa, 2006; Park et al., 2005). Note that the original studies used the same datasets and the same type of cross-validation (five-fold cross-validations) as the current study. But the sequence similarity between training and test sets can be as high as 40%. In the original publications, only Accuracy, Sensitivity and Specificity were reported. Here, we calculate the MCC based on their published statistics
(c) In their study, Gromiha and Suwa (2006) evaluated 11 different methods. Neural network was reported to be the best
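Footnote (b) of Table 4 states that the MCC values of the two published methods were calculated from their reported statistics. One way such a calculation can be carried out is sketched below. This is an assumption-laden illustration rather than the authors' procedure: the class sizes (208 OMPs, and 673 + 206 = 879 non-OMPs) are taken from the dataset description in Section 2.1, the function names are hypothetical, and fractional counts are kept rather than rounded. The second helper expresses the gap between two MCC values in units of the K-NN standard deviation listed in Table 4.

```python
from math import sqrt

# Assumed class sizes, from the dataset description (SetA vs. SetB + SetC).
N_OMP = 208
N_NON_OMP = 673 + 206


def mcc_from_rates(sensitivity, specificity, n_pos=N_OMP, n_neg=N_NON_OMP):
    """Rebuild (fractional) confusion counts from the rates, then compute MCC."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def margin_in_sd(knn_mcc, other_mcc, knn_sd):
    """Difference between two MCC values, in units of the K-NN standard deviation."""
    return (knn_mcc - other_mcc) / knn_sd


# Sensitivity and specificity as listed in Table 4.
nn_mcc = mcc_from_rates(0.793, 0.938)   # neural network
svm_mcc = mcc_from_rates(0.909, 0.947)  # support vector machine
print(round(nn_mcc, 3), round(svm_mcc, 3))

# Margins of the K-NN MCC (0.858 ± 0.011) over the tabulated values.
print(margin_in_sd(0.858, 0.716, 0.011), margin_in_sd(0.858, 0.816, 0.011))
```

Under these assumptions the recovered values agree with the tabulated 0.716 and 0.816 to within the rounding of the published percentages.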
use MCC as the primary measure in the comparison of different methods. At the same time, we also report accuracy, specificity and sensitivity.

Gromiha and Suwa (2006) tried a set of 11 machine learning methods for the discrimination of OMPs using residue composition as input. One of the 11 methods was a k-nearest neighbor method based on Euclidean distance. The neural network method was reported to achieve the best performance in their study. In another study, researchers from the same group (Park et al., 2005) developed a support vector machine (SVM) method to discriminate OMPs. Both studies used the same datasets that we use in this study to evaluate their methods based on five-fold cross-validations. So, we compared the results we obtained in the current study with what they reported in their publications. To assess the statistical significance of the comparison, we evaluated our K-NN method by repeating the five-fold cross-validations five times using different data splits. The average and the standard deviation are shown in Table 4 (row 2). The results (Table 4) show that our K-NN outperforms all of the 11 methods used in Gromiha and Suwa's study (2006) and the SVM method developed by Park et al. (2005). Note that not all of the 11 methods from Gromiha and Suwa's study (2006) are shown in Table 4. Since the neural network was reported to achieve the best performance among the 11 methods, we only show the results of the neural network method in the table. Table 4 shows that the MCC of K-NN is remarkably higher than those of the neural network and SVM. The differences (0.142 and 0.042) are larger than 3 times the standard deviation (0.011) in both cases. This confirms the statistical significance of the improvement. It is also worth pointing out that in the five-fold cross-validations performed in the current study, we made sure that the sequence similarity between any protein from the training set and any protein from the test sets is less than 25%. Meanwhile, in the studies of Gromiha and Suwa (2006) and Park et al. (2005), the sequence similarity between training and test sets can be as high as 40%. Although we use a stricter criterion to evaluate our K-NN, the performance of our method is still better than what was reported for the other two methods using a looser criterion.

Berven et al. (2004) developed a method, BOMP, for the prediction of OMPs. It is one of the best scoring methods in identifying OMPs from genomes. The BOMP method combines a pattern search, a β-barrel score based on amino acid distribution, and a filter that explores the abundance of asparagine and isoleucine in the protein. Here,
Table 5. Comparison of the proposed method (K-NN) and the BOMP method (Berven et al., 2004)

Datasets                                 Method   Accuracy (%)   MCC             Sensitivity (%)   Specificity (%)
Datasets used in this study (a)          K-NN     95.6 ± 0.2     0.774 ± 0.011   74.3 ± 1.5        98.4 ± 0.2
                                         BOMP     93.1           0.623           52.7              98.5
Datasets used in Berven et al. (2004)    K-NN     98.8           0.870           83.1              99.6
                                         BOMP     98.3           0.831           88.1              98.8

(a) The dataset used in this study was submitted to the BOMP server. The dataset submitted to the BOMP server is likely to overlap with the dataset that BOMP was trained on. In contrast, in the evaluation of our K-NN method, we make sure that the sequence similarity between training and test sets is less than 25%
we also compared our K-NN method with BOMP. First, we submitted the filtered datasets (in which similarity between any two proteins is less than 25%) used in this study to the BOMP server. The results (Table 5) show that our K-NN method outperforms BOMP. The MCC of K-NN exceeds that of BOMP by more than 10 times the standard deviation. This confirms the statistical significance of the improvement. It is worth pointing out that in this comparison, the dataset submitted to the BOMP server is likely to overlap with the dataset that BOMP was trained on. In contrast, in the evaluation of our K-NN method, we make sure that the sequence similarity between training and test sets is less than 25%. We also evaluated our K-NN method using the same datasets that Berven et al. (2004) used to evaluate their BOMP method. The results (Table 5) show that our K-NN method still outperforms BOMP on their datasets. When the BOMP datasets were used, leave-one-out cross-validations were used to evaluate both methods as described in Berven et al. (2004). We notice that when compared using the BOMP dataset, the improvement of the K-NN method over BOMP is not as large as when our dataset is used. A possible reason is that the BOMP dataset contains only a small number of positive examples (59 in total). Since leave-one-out cross-validations were performed, we could not calculate the standard deviation as we did in five-fold cross-validation (because there is only one possible way to split data for leave-one-out cross-validation). However, the improvement of the K-NN method in MCC is still clear.

3.8 Receiver operating characteristic (ROC) curve

In the K-NN method, a protein is classified as OMP or non-OMP based on the comparison of $D_{omp}$ (its distance to the OMP group), $D_{glo}$ (its distance to the globular protein group), and $D_{amp}$ (its distance to the AMP group). A protein is predicted to be OMP if $D_{omp}$

Fig. 1. ROC curve of the K-NN method

The advantage of introducing this parameter a to the K-NN method is that users can choose a threshold based on their need. When the parameter a is set to a lower value, the K-NN can achieve higher specificity. On the other hand, when a high value of the parameter a is chosen, the K-NN can achieve higher sensitivity.

3.9 Identification of OMPs in the proteome of E. coli

We applied the K-NN method to search for OMPs in the proteome of E. coli using a = 0.11, which corresponds to 99% specificity in the ROC curve. The E. coli proteome consists of 4319 proteins. 123 of them were predicted to be OMP proteins. That accounts for 2.8% of the whole proteome. This ratio is consistent with the previous estimation that 2-3% of the genes in Gram-negative bacteria encode OMPs (Wimley, 2003). Among these 123 hits, 61 proteins are annotated as OMP proteins in Swiss-Prot (Bairoch et al., 2004) or ePSORTdb (Rey et al., 2005), a database of protein subcellular locations that have been determined through laboratory experiments. In addition, 20 proteins are annotated with Membrane, Cell membrane and