Automated classification of prokaryotic proteins
ARC: Automated Resource Classifier for agglomerative
functional classification of prokaryotic proteins using
annotation texts
MUTHIAH GNANAMANI, NAVEEN KUMAR and SRINIVASAN RAMACHANDRAN*
G N Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology,
Mall Road, Delhi 110 007, India
*Corresponding author (Fax, 91-11-276*-****; Email, ********@*****.***)
Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable
integrative interpretation of analytical data is felt globally. The availability of a general, automated software with
built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), which
is an open source software meeting the user requirements of flexibility. The default classification scheme based on
keyword match is agglomerative and directs entries into any of the 7 basic non-overlapping functional classes: Cell
wall, Cell membrane and Transporters (C), Cell division (D), Information (I), Translocation (L), Metabolism (M),
Stress (R), Signal and communication (S) and 2 ancillary classes: Others (O) and Hypothetical (H). The keyword library
of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent
steps, this library was further enriched by collecting terms from archaeal representative Archaeoglobus fulgidus, Gene
Ontology, and Gene Symbols. ARC is 94.04% successful on 675,663 annotated proteins from 348 prokaryotes. Three
examples are provided to illuminate the current perspectives on mycobacterial physiology and costs of proteins in 333
prokaryotes. ARC is available at http://arc.igib.res.in.
[Gnanamani M, Kumar N and Ramachandran S 2007 ARC: Automated Resource Classifier for agglomerative functional classification of
prokaryotic proteins using annotation texts; J. Biosci. 32 937–945]
Keywords. Automated resource; functional classification; integrative biology
Additional material pertaining to this article is available with authors.
J. Biosci. 32(5), August 2007, 937–945, Indian Academy of Sciences
http://www.ias.ac.in/jbiosci

1. Introduction

The complete sequence determination of more than 300 micro-organisms has opened new opportunities for comparative analyses. An essential prerequisite step in this exercise is to annotate the functional roles of newly identified proteins and classify them into biological groups of common overall activity. The traditionally used and perhaps widely known system of classification was originally proposed by Riley, more than a decade ago (Riley 1993). Recently, some microbial genes have been annotated using the controlled vocabulary system of Gene Ontology (GO) consortium (Harris et al 2004). Compared with Riley's scheme, the GO could be considered as an agglomerative approach enabling comparative analysis with respect to the ontologies. The prime mover towards this trend is the realization by many biologists that we need to transit from the reductionist schematic to the integrative schematic as elucidated by René Descartes in his second and third precepts, 365 years ago (Auffray et al 2003; Van Regenmortel 2004).

The second, to divide each of the difficulties which I would examine into as many parts as it would be possible, and as might be required to resolve them best.

The third, to conduct my thoughts in order, beginning with the simplest and easiest objects to know, to rise little by
little, as it were by steps, up to the knowledge of the most complex; and assuming even order between those which do not precede each other naturally.

While the GO approach is systematic, the appearance of a given protein either in more than one node within an ontology or in more than one ontology can confound comparative analyses. These effects are particularly noticeable when the breadth of the functional class is narrow. Although narrowly sectioned functional classes offer specialized views of biological phenomena suited for specific investigations, they tend to limit straightforward interpretations from holistic perspectives (Van Regenmortel 2004). Early origins of agglomerative approaches can be traced back to Adams et al (1995) and Andrade et al (1999). Adams et al (1995) classified proteins and their encoding genes into functional classes such as energy metabolism, cell structure, homeostasis and cell division, RNA and protein synthesis and processing, cell signaling and communication. Andrade et al (1999) classified proteins into superclasses ENERGY, METABOLISM and INFORMATION. Our classification scheme is based on these early foundations laid for comparative analyses. Our goal is to use a functional classification scheme such that a given protein can be classified singularly into one functional class. Although this goal is ambitious considering the multiple functional roles exhibited by some proteins, our broad definition of functional class offers sufficient space to accommodate such candidates as well. In this work we develop a software using this approach involving classification of proteins into 7 basic and 2 ancillary agglomerative non-overlapping functional classes to enable integrative interpretation of analytical data.

2. Implementation

2.1 Rationale

Three attributes are associated with the function of a protein. These are the process it performs, the sub-cellular location where it performs the process, and the purpose for which the process is used. For example, the process of a DNA polymerase is to synthesize DNA. The cellular location in prokaryotes for this activity is intracellular. The purpose is to enable cell multiplication and division or DNA repair. Focusing on process will direct the entry of this protein into the class of replication and extending this further agglomeratively, will place the protein in the class of Information. On the other hand, if a protein is annotated only as 'GTP binding' or 'ATP binding', although the process is clear, it is required in many processes, purposes and locations. Therefore, we would have to render it as unclassified. If a protein is annotated as 'membrane protein', then, only the location is known. This protein will find its entry into the class of Cell Wall, Cell Membrane and Transporters. Among all three attributes, process and location are clear whereas purpose is highly subjective and poorly defined. Therefore, in this system of classification, we focus on process (molecular function) and location (cellular component). The GO equivalent of purpose is Biological process.

2.2 Functional class description

ARC classifies proteins into seven basic functional classes: C: Cell wall and Cell membrane and Transporters, D: Cell division and binary fission, I: Information (Replication, Transcription, Translation), L: Translocation and Secretion, M: Metabolism, R: Stress, S: Signalling and communication and two ancillary functional classes: O: Others, H: Hypothetical proteins.

2.3 Keyword collection

Keywords qualifying the entry patterns for proteins with annotated functions were first collected from two bacteria Bacillus subtilis (Kunst et al 1997) and Escherichia coli K-12 (Blattner et al 1997), which are well studied compared to others. This approach is similar to that followed in constructing the Gene and Protein Synonyms Database (GPSDB) (Pillet et al 2004). Subsequently, ARC was operated on an archaeal representative, Archaeoglobus fulgidus (Klenk et al 1997). Additional keywords were collected from proteins with known function that were not classified using the keyword library prepared from B. subtilis and E. coli K-12. Subsequently, keywords were also collected from GO terms belonging to Cellular Component (for cellular locations) and to Molecular Function (processes). The terms belonging to Biological Process serve to describe the biological goals, which are more subjective and overlapping and therefore were not considered in our scheme. The keyword library was organized into 26 files encompassing all the functional classes arranged alphabetically.

2.4 KEYWORD_FunctionalClassSymbol

Functional Class Symbol is a single character corresponding to the name of a functional class. This nomenclature can be modified by the user to extend the symbol character length up to 10 characters without any space. ARC does not support those keyword entries which are without the FunctionalClassSymbol. ARC is case insensitive; user can enter keywords and associated class symbols in either case or a mixture of both. The keywords along with their associated
FunctionalClassSymbol are separated by an underscore and stored as ASCII files. This simple format supports the user to edit the keyword library by either deleting or modifying existing keyword entries or including keywords of their own choice with corresponding FunctionalClassSymbols. Users can also update the keyword library as and when needed. Organization of keywords needs to follow a simple directional rule of decreasing complexity from the top. For example, the keyword 'histone lysine_M' is placed before 'histone_I' so that ARC first searches for the complex keyword followed by searching for simpler keywords, which are substrings of the complex keyword. Similarly, 'Signal peptide_L' keyword is placed before 'Signal_S'. This organization is founded on the time honoured principle that complex keywords attribute higher clarity and specificity than simpler keywords.

2.5 Gene symbol and synonym collection

Gene symbols along with their synonyms (associated annotation information) were retrieved from the websites of SubtiList (Bacillus subtilis 168) [SubtiList], Colibri (Escherichia coli K-12) [Colibri], TubercuList (Mycobacterium tuberculosis H37Rv) [TubercuList], Leproma (Mycobacterium leprae TN) [Leproma], BoviList (Mycobacterium bovis AF2122/97) [BoviList], ListiList (Listeria monocytogenes EGD-e, L. innocua CLIP 11262) [ListiList], LegioList (Legionella pneumophila Paris, L. pneumophila Lens) [LegioList], SagaList (Streptococcus agalactiae NEM316) [SagaList], PhotoList (Photorhabdus luminescens TT01) [PhotoList], and UniProt [UniProt]. The gene symbols with their synonyms were alphabetically sorted into 26 accessory files. These files contain 47,456 gene symbols. The gene symbols in these files are not nonredundant. For example, both gene symbols 'adh1' and 'adh-1' (alcohol dehydrogenase) are included in these files.

2.6 Algorithm

The algorithm is shown in figure 1. ARC follows the 'first hit and assign' approach. This strategy is based on the premise that the main process to which a given annotation refers is written first, in the subject line, followed by other related information, if available. In a majority of annotations, this principle is evidently adhered to and leads to a singular decision, but in a small minority of cases (1.44% to 3.6% of annotated proteins), the annotation information leads to pluralistic decision. The algorithm however, directs the entry of these proteins to the corresponding functional class based on the first keyword it spots, but these entries are potential confounders. Annotations similar to these cases in any species will likely confound analysis. Fortunately in a majority of species, these cases are very low in number.

At the start of the algorithm, a positive hit for the presence of word 'nonribosomal' in the annotation text will assign 'M' as functional class because these proteins (nonribosomal peptide synthetases and associated proteins) are responsible for the synthesis of secondary metabolites (Grunewald and Marahiel 2006). For a negative hit, ARC will start searching for the presence of words or substrings 'not' or 'non' in the annotation text. A positive hit will place the protein into Unclassified category. A negative hit will cause ARC to search for the common words or substrings 'synthesis' and 'synthetic'. If any of these general substrings is present in the annotation information, and there is a positive hit for a keyword for a functional class other than metabolism class, then ARC directs the protein to the respective functional class. A negative hit for all keywords for other functional classes will cause ARC to direct the protein to the metabolism class. It is to be noted that the substrings non, not, synthesis, synthetic and nonribosomal are not part of the keyword library.

In subsequent steps, ARC will assign the destination functional class to the protein for a positive hit to keywords corresponding to that class. After exhausting all the possibilities ARC checks whether a protein should be placed in the Hypothetical class. If no keyword of its library is found in the annotation text, ARC starts searching for the presence of Gene Symbols. ARC is equipped to classify proteins having official gene symbol as their annotation text. For a positive hit, ARC searches for the keywords of its library in the corresponding synonym of the gene symbol. For a negative hit, ARC places the respective protein in the Unclassified category.

2.7 ARC Web server

The ARC algorithm was coded in C and compiled using the GNU gcc compiler 3.4.3 in the Itanium 2 64 bit dual processor server running on RedHat Linux Enterprise version 4. The Web server was prepared in Apache version 2.0, Server side scripting in PHP version 5.0 using the graphics library JPgraph version 2.2. The client side scripting was done in HTML and AJAX. The C code of the algorithm can be compiled by other (UNIX based) C compilers as well.

File Formats: Three types of file formats can be accepted. An input file with (i) annotations only in FASTA format with or without sequence, (ii) annotations and expression fold change for microarray data, (iii) annotations and numeric value representing user specified data. Users also can upload MS Excel files. Graphical displays are
Figure 1. The ARC algorithm. Rules for classification of proteins that perform distinct functions but have common keywords in their annotation texts:
Signalling and communication: presence of 'kinase' prefixed by 'serine' OR 'threonine' OR 'tyrosine' OR 'polo' OR 'pros' OR 'tele' OR 'SAP' OR 'protein'; presence of 'phosphatase' prefixed by 'serine' OR 'threonine' OR 'aspartate' OR 'histidine' OR 'protein'; presence of PAS, which is not a substring of any other word.
Metabolism: presence of 'kinase' NOT prefixed by 'serine' OR 'threonine' OR 'tyrosine' OR 'polo' OR 'pros' OR 'tele' OR 'SAP' OR 'protein'; presence of 'phosphatase' NOT prefixed by 'serine' OR 'threonine' OR 'aspartate' OR 'histidine' OR 'protein'; presence of any of the words PII, PTS, P700, CoA, which is not a substring of any other word.
Information: presence of any of the words RNA, DNA, SOS, which is not a substring of any other word.
Cell wall and Cell membrane: presence of the word 'porin', which is not a substring of any other word.
Proteins with the following keywords or substrings are directed to the hypothetical class: 'hypothetic', 'unknown cds', 'conserved archael protein', 'conserved protein', 'conserved crenarchael protein', 'putative enzyme', 'uncharacterized protein', 'uncharacterized conserved protein', 'putative conserved protein', 'unknown protein', 'function unknown', 'unknown function'.
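The rules in the caption, together with the KEYWORD_FunctionalClassSymbol format and the 'first hit and assign' approach described in the text, can be illustrated with a minimal sketch. ARC itself is written in C; Python is used here only for brevity, and this is not the authors' code. The function names, the four-entry library, the symbol 'U' for the Unclassified category, the reading of 'prefixed' as 'immediately preceded by', and the order in which the caption rules are tried relative to the keyword library are all assumptions made for illustration. The gene symbol lookup step is omitted.

```python
import re

# Prefix lists and whole-word tokens are copied from the figure caption.
KINASE_PREFIXES = ["serine", "threonine", "tyrosine", "polo", "pros", "tele", "sap", "protein"]
PHOSPHATASE_PREFIXES = ["serine", "threonine", "aspartate", "histidine", "protein"]
WHOLE_WORDS = {  # token -> class symbol; matched only as a standalone word
    "pas": "S",
    "pii": "M", "pts": "M", "p700": "M", "coa": "M",
    "rna": "I", "dna": "I", "sos": "I",
    "porin": "C",
}
# Subset of the caption's list of hypothetical-class substrings.
HYPOTHETICAL = ["hypothetic", "unknown cds", "conserved protein", "putative enzyme",
                "uncharacterized protein", "unknown protein", "function unknown",
                "unknown function"]


def parse_library(lines):
    """Parse KEYWORD_FunctionalClassSymbol entries, keeping file order."""
    entries = []
    for line in lines:
        keyword, sep, symbol = line.strip().rpartition("_")
        # Entries without a symbol, or with a symbol over 10 characters, are skipped.
        if sep and keyword and symbol and len(symbol) <= 10 and " " not in symbol:
            entries.append((keyword.lower(), symbol.upper()))
    return entries


def enzyme_rule(text, enzyme, prefixes):
    """Return 'S' if `enzyme` directly follows a listed prefix, 'M' if it occurs otherwise."""
    if enzyme not in text:
        return None
    pattern = r"\b(?:" + "|".join(prefixes) + r")[\s/-]*" + enzyme
    return "S" if re.search(pattern, text) else "M"


def classify(annotation, library):
    """Assign one class symbol to an annotation text (case insensitive)."""
    text = annotation.lower()
    if "nonribosomal" in text:
        return "M"
    if "not" in text or "non" in text:
        return "U"
    for enzyme, prefixes in (("kinase", KINASE_PREFIXES),
                             ("phosphatase", PHOSPHATASE_PREFIXES)):
        symbol = enzyme_rule(text, enzyme, prefixes)
        if symbol:
            return symbol
    for word in re.findall(r"[a-z0-9]+", text):
        if word in WHOLE_WORDS:
            return WHOLE_WORDS[word]
    # Library order encodes decreasing keyword complexity, so the first hit wins.
    hits = [symbol for keyword, symbol in library if keyword in text]
    if "synthesis" in text or "synthetic" in text:
        non_metabolic = [s for s in hits if s != "M"]
        return non_metabolic[0] if non_metabolic else "M"
    if hits:
        return hits[0]
    if any(key in text for key in HYPOTHETICAL):
        return "H"
    return "U"


library = parse_library(["histone lysine_M", "histone_I", "Signal peptide_L", "Signal_S"])
for text in ("Histone lysine methyltransferase", "Signal peptide peptidase", "adenylate kinase"):
    print(text, "->", classify(text, library))
```

Because the library is scanned from the top, 'histone lysine' is tested before 'histone', which is why the ordering rule of decreasing complexity matters: reversing the two entries would send a histone lysine methyltransferase to Information instead of Metabolism.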
generated dynamically. The output files from ARC are tab delimited ASCII files containing the annotation text and the FunctionalClassSymbol.

2.8 Sequence files and annotations

The July 2006 release of genomic data of NCBI (NCBI) had 348 prokaryotes with a total number of proteins amounting to 1,056,660, containing 675,663 annotated proteins.

2.9 Orthologs and paralogs

Orthologs were examined by mixing the proteins of a given class from 3 species, Mycobacterium tuberculosis, M. bovis and M. leprae. Orthologs were noted if clusters of proteins had singular entries from all three species meeting the criteria S = 1.5, L = 0.95 in the program BLASTCLUST (Kondrashov et al 2002, NCBI). Paralogs in each species were identified by running BLASTCLUST on proteins of a given functional class within a species meeting the criteria S = 0.8, L = 0.95 (Subramanyam et al 2006). All other parameters were used at their default settings.

2.10 Statistical methods

Statistical significance of differences in proportions was assessed using the Binomial proportions test (Uitenbroek 1997). For data in table 1, expected proportions were computed assuming the same pattern of the reference organism to which a given species is compared. For data in table 2 observed proportions in each functional class were compared to global proportions of genes for the same functional class in each expression category assuming no preference for each category with any of the functional classes.

2.11 Availability and requirements

Project name: ARC; Project home page: http://arc.igib.res.in; Operating system(s): Linux, UNIX; Programming language: C; Other requirements: gcc, cc or other equivalent compilers; License: GNU GPL; Any restriction to use by non-academics: No

3. Results and discussion

3.1 Efficiency

ARC was tested on three model organisms B. subtilis (Kunst et al 1997), E. coli K-12 (Blattner et al 1997) and A. fulgidus (Klenk et al 1997), and found to classify 97.3%, 96.8% and 96.2% of the annotated proteins of these organisms. The proportions of confounding entries were 1.6%, 3.6% and 1.4% respectively. A run of ARC on 348 prokaryotic proteomes classified 635,390 proteins out of 675,663 annotated proteins representing an overall success rate of 94.04%. Given that the keyword library of ARC was initially built from 2 representative model organisms with subsequent serial enrichment, the high success rates of classification by ARC show that most annotation groups have been coherent and in phase with the earliest genome sequencing groups. It is now possible to formulate a standardized annotation scheme. These results also attest that a great majority of proteins could be classified with singular decisions directing entry into a single class. However, a minority of annotation cases pose persisting problems by presenting plural decisions. One probable cause for such confounders
Table 1. Comparative proteomics of mycobacteria (a)
Columns: Species; Total number of proteins; Number of proteins classified into seven functional classes: C, D (d), I, L, M, R (d), S; Number of proteins in the ancillary classes: O, U, H
Mycobacterium tuberculosis 398*-***-**-*** 43-113*-**-**-***-** 1343
H37Rv
Mycobacterium bovis 392*-***-**-*** 46-112*-**-**-***-** 1353
AF2122/97
157b,1,2 50c, 1,2 512c, 1,2 38b,1,2
Mycobacterium leprae TN 160*-**-***-*-** 31 573
(a) We observed that the genomes of M. tuberculosis CDC1551 and M. avium subsp. paratuberculosis had an alarmingly high number of hypothetical proteins and therefore, these species were dropped from this analysis.
(b) Significantly different. Lower than expected proportion (P
(c) Significantly different. Higher than expected proportion (P
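The 'lower than expected' and 'higher than expected' markers rest on the comparison of an observed class proportion with an expected one, as described in section 2.10. A minimal sketch of such a comparison follows. It assumes an exact two-sided one-sample binomial test; the calculation cited by the authors (Uitenbroek 1997) may differ in detail. The function names and the counts in the usage lines are hypothetical and are not taken from table 1.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials (requires 0 < p < 1).

    Computed in log space so that proteome-sized n does not overflow.
    """
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def two_sided_p_value(observed, total, expected_proportion):
    """Sum the probabilities of all counts no more likely than the observed one."""
    pmf = [binomial_pmf(k, total, expected_proportion) for k in range(total + 1)]
    cutoff = pmf[observed] * (1 + 1e-7)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= cutoff))


def direction(observed, total, expected_proportion):
    """Report whether the observed proportion lies above or below the expected one."""
    return "higher" if observed / total > expected_proportion else "lower"


# Hypothetical counts: 30 proteins of a class among 200, against an
# expected proportion of 0.10 derived from a reference organism.
p_value = two_sided_p_value(30, 200, 0.10)
print(direction(30, 200, 0.10), p_value)
```

The expected proportion plays the role of the reference organism's pattern: a class is flagged only when the p-value falls below the chosen threshold, and the direction helper then gives the 'lower' or 'higher' label.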