Automated classification of prokaryotic proteins

ARC: Automated Resource Classifier for agglomerative functional classification of prokaryotic proteins using annotation texts

MUTHIAH GNANAMANI, NAVEEN KUMAR and SRINIVASAN RAMACHANDRAN*

G N Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology,

Mall Road, Delhi 110 007, India

*Corresponding author (Fax, 91-11-276*-****; Email, ********@*****.***)

Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable integrative interpretation of analytical data is felt globally. The availability of general, automated software with built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), an open source software that meets the user requirement of flexibility. The default classification scheme based on keyword match is agglomerative and directs entries into any of the 7 basic non-overlapping functional classes: Cell wall, Cell membrane and Transporters (C), Cell division (D), Information (I), Translocation (L), Metabolism (M), Stress (R), Signal and communication (S) and 2 ancillary classes: Others (O) and Hypothetical (H). The keyword library of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent steps, this library was further enriched by collecting terms from the archaeal representative Archaeoglobus fulgidus, Gene Ontology, and Gene Symbols. ARC is 94.04% successful on 675,663 annotated proteins from 348 prokaryotes. Three examples are provided to illuminate the current perspectives on mycobacterial physiology and costs of proteins in 333 prokaryotes. ARC is available at http://arc.igib.res.in.

[Gnanamani M, Kumar N and Ramachandran S 2007 ARC: Automated Resource Classifier for agglomerative functional classification of prokaryotic proteins using annotation texts; J. Biosci. 32 937-945]

Keywords. Automated resource; functional classification; integrative biology

Additional material pertaining to this article is available with authors.

J. Biosci. 32(5), August 2007, 937-945, Indian Academy of Sciences

http://www.ias.ac.in/jbiosci

1. Introduction

The complete sequence determination of more than 300 micro-organisms has opened new opportunities for comparative analyses. An essential prerequisite step in this exercise is to annotate the functional roles of newly identified proteins and classify them into biological groups of common overall activity. The traditionally used and perhaps widely known system of classification was originally proposed by Riley, more than a decade ago (Riley 1993). Recently, some microbial genes have been annotated using the controlled vocabulary system of Gene Ontology (GO) consortium (Harris et al 2004). Compared with Riley's scheme, the GO could be considered as an agglomerative approach enabling comparative analysis with respect to the ontologies. The prime mover towards this trend is the realization by many biologists that we need to transit from the reductionist schematic to the integrative schematic as elucidated by René Descartes in his second and third precepts, 365 years ago (Auffray et al 2003; Van Regenmortel 2004).

The second, to divide each of the difficulties which I would examine into as many parts as it would be possible, and as might be required to resolve them best.

The third, to conduct my thoughts in order, beginning with the simplest and easiest objects to know, to rise little by

little, as it were by steps, up to the knowledge of the most complex; and assuming even order between those which do not precede each other naturally.

While the GO approach is systematic, the appearance of a given protein either in more than one node within an ontology or in more than one ontology can confound comparative analyses. These effects are particularly noticeable when the breadth of the functional class is narrow. Although narrowly sectioned functional classes offer specialized views of biological phenomena suited for specific investigations, they tend to limit straightforward interpretations from holistic perspectives (Van Regenmortel 2004). Early origins of agglomerative approaches can be traced back to Adams et al (1995) and Andrade et al (1999). Adams et al (1995) classified proteins and their encoding genes into functional classes such as energy metabolism, cell structure, homeostasis and cell division, RNA and protein synthesis and processing, cell signaling and communication. Andrade et al (1999) classified proteins into superclasses ENERGY, METABOLISM and INFORMATION. Our classification scheme is based on these early foundations laid for comparative analyses. Our goal is to use a functional classification scheme such that a given protein can be classified singularly into one functional class. Although this goal is ambitious considering the multiple functional roles exhibited by some proteins, our broad definition of functional class offers sufficient space to accommodate such candidates as well. In this work we develop a software using this approach involving classification of proteins into 7 basic and 2 ancillary agglomerative non-overlapping functional classes to enable integrative interpretation of analytical data.

2. Implementation

2.1 Rationale

Three attributes are associated with the function of a protein. These are the process it performs, the sub-cellular location where it performs the process, and the purpose for which the process is used. For example, the process of a DNA polymerase is to synthesize DNA. The cellular location in prokaryotes for this activity is intracellular. The purpose is to enable cell multiplication and division or DNA repair. Focusing on process will direct the entry of this protein into the class of replication, and extending this further agglomeratively will place the protein in the class of Information. On the other hand, if a protein is annotated only as 'GTP binding' or 'ATP binding', although the process is clear, it is required in many processes, purposes and locations. Therefore, we would have to render it as unclassified. If a protein is annotated as 'membrane protein', then only the location is known. This protein will find its entry into the class of Cell Wall, Cell Membrane and Transporters. Among all three attributes, process and location are clear whereas purpose is highly subjective and poorly defined. Therefore, in this system of classification, we focus on process (molecular function) and location (cellular component). The GO equivalent of purpose is Biological process.

2.2 Functional class description

ARC classifies proteins into seven basic functional classes: C: Cell wall and Cell membrane and Transporters, D: Cell division and binary fission, I: Information (Replication, Transcription, Translation), L: Translocation and Secretion, M: Metabolism, R: Stress, S: Signalling and communication and two ancillary functional classes: O: Others, H: Hypothetical proteins.

2.3 Keyword collection

Keywords qualifying the entry patterns for proteins with annotated functions were first collected from two bacteria, Bacillus subtilis (Kunst et al 1997) and Escherichia coli K-12 (Blattner et al 1997), which are well studied compared to others. This approach is similar to that followed in constructing the Gene and Protein Synonyms Database (GPSDB) (Pillet et al 2004). Subsequently, ARC was operated on an archaeal representative, Archaeoglobus fulgidus (Klenk et al 1997). Additional keywords were collected from proteins with known function that were not classified using the keyword library prepared from B. subtilis and E. coli K-12. Subsequently, keywords were also collected from GO terms belonging to Cellular Component (for cellular locations) and to Molecular Function (processes). The terms belonging to Biological Process serve to describe the biological goals, which are more subjective and overlapping and therefore were not considered in our scheme. The keyword library was organized into 26 files encompassing all the functional classes arranged alphabetically.

2.4 KEYWORD_FunctionalClassSymbol

Functional Class Symbol is a single character corresponding to the name of a functional class. This nomenclature can be modified by the user to extend the symbol character length up to 10 characters without any space. ARC does not support keyword entries that lack the FunctionalClassSymbol. ARC is case insensitive; the user can enter keywords and associated class symbols in either case or a mixture of both. The keywords along with their associated

FunctionalClassSymbol are separated by an underscore and stored as ASCII files. This simple format lets the user edit the keyword library by either deleting or modifying existing keyword entries or including keywords of their own choice with corresponding FunctionalClassSymbols. Users can also update the keyword library as and when needed. Organization of keywords needs to follow a simple directional rule of decreasing complexity from the top. For example, the keyword 'histone lysine_M' is placed before 'histone_I' so that ARC first searches for the complex keyword followed by searching for simpler keywords, which are substrings of the complex keyword. Similarly, the 'Signal peptide_L' keyword is placed before 'Signal_S'. This organization is founded on the time honoured principle that complex keywords attribute higher clarity and specificity than simpler keywords.

2.5 Gene symbol and synonym collection

Gene symbols along with their synonyms (associated annotation information) were retrieved from the websites of SubtiList (Bacillus subtilis 168) [SubtiList], Colibri (Escherichia coli K-12) [Colibri], TubercuList (Mycobacterium tuberculosis H37Rv) [TubercuList], Leproma (Mycobacterium leprae TN) [Leproma], BoviList (Mycobacterium bovis AF2122/97) [BoviList], ListiList (Listeria monocytogenes EGD-e, L. innocua CLIP 11262) [ListiList], LegioList (Legionella pneumophila Paris, L. pneumophila Lens) [LegioList], SagaList (Streptococcus agalactiae nem 316) [SagaList], PhotoList (Photorhabdus luminescens TT01) [PhotoList], and UniProt [UniProt]. The gene symbols with their synonyms were alphabetically sorted into 26 accessory files. These files contain 47,456 gene symbols. The gene symbols in these files are not non-redundant. For example, both gene symbols adh1 and adh-1 (alcohol dehydrogenase) are included in these files.

2.6 Algorithm

The algorithm is shown in figure 1. ARC follows a 'first hit and assign' approach. This strategy is based on the premise that the main process to which a given annotation refers is written first, in the subject line, followed by other related information, if available. In a majority of annotations, this principle is evidently adhered to and leads to a singular decision, but in a small minority of cases (1.44% to 3.6% of annotated proteins), the annotation information leads to a pluralistic decision. The algorithm, however, directs the entry of these proteins to the corresponding functional class based on the first keyword it spots, but these entries are potential confounders. Annotations similar to these cases in any species will likely confound analysis. Fortunately, in a majority of species, these cases are very low in number.

At the start of the algorithm, a positive hit for the presence of the word 'nonribosomal' in the annotation text will assign M as the functional class because these proteins (nonribosomal peptide synthetases and associated proteins) are responsible for the synthesis of secondary metabolites (Grunewald and Marahiel 2006). For a negative hit, ARC will start searching for the presence of the words or substrings 'not' or 'non' in the annotation text. A positive hit will place the protein into the Unclassified category. A negative hit will cause ARC to search for the common words or substrings 'synthesis' and 'synthetic'. If any of these general substrings is present in the annotation information, and there is a positive hit for a keyword for a functional class other than the metabolism class, then ARC directs the protein to the respective functional class. A negative hit for all keywords for other functional classes will cause ARC to direct the protein to the metabolism class. It is to be noted that the substrings 'non', 'not', 'synthesis', 'synthetic' and 'nonribosomal' are not part of the keyword library.

In subsequent steps, ARC will assign the destination functional class to the protein for a positive hit to keywords corresponding to that class. After exhausting all the possibilities, ARC checks whether a protein should be placed in the Hypothetical class. If no keyword of its library is found in the annotation text, ARC starts searching for the presence of Gene Symbols. ARC is equipped to classify proteins having an official gene symbol as their annotation text. For a positive hit, ARC searches for the keywords of its library in the corresponding synonym of the gene symbol. For a negative hit, ARC places the respective protein in the Unclassified category.

2.7 ARC Web server

The ARC algorithm was coded in C and compiled using the GNU gcc compiler 3.4.3 on an Itanium 2 64 bit dual processor server running RedHat Linux Enterprise version 4. The Web server was prepared in Apache version 2.0, with server side scripting in PHP version 5.0 using the graphics library JPgraph version 2.2. The client side scripting was done in HTML and AJAX. The C code of the algorithm can be compiled by other (UNIX based) C compilers as well.

File Formats: Three types of file formats can be accepted. An input file with (i) annotations only in FASTA format with or without sequence, (ii) annotations and expression fold change for microarray data, (iii) annotations and numeric value representing user specified data. Users also can upload MS Excel files. Graphical displays are

Figure 1. The ARC algorithm. Rules for classification of proteins that perform distinct functions but have common keywords in their annotation texts: Signalling and communication: Presence of 'kinase' prefixed by 'serine' OR 'threonine' OR 'tyrosine' OR 'polo' OR 'pros' OR 'tele' OR 'SAP' OR 'protein'; Presence of 'phosphatase' prefixed by 'serine' OR 'threonine' OR 'aspartate' OR 'histidine' OR 'protein'; Presence of PAS, which is not a substring of any other word; Metabolism: Presence of 'kinase' NOT prefixed by 'serine' OR 'threonine' OR 'tyrosine' OR 'polo' OR 'pros' OR 'tele' OR 'SAP' OR 'protein'; Presence of 'phosphatase' NOT prefixed by 'serine' OR 'threonine' OR 'aspartate' OR 'histidine' OR 'protein'; Presence of any of the words PII, PTS, P700, CoA, which is not a substring of any other word; Information: Presence of any of the words RNA, DNA, SOS, which is not a substring of any other word; Cell wall and Cell membrane: Presence of the word 'porin' which is not a substring of any other word. Proteins with the following keywords or substrings are directed to the hypothetical class: 'hypothetic', 'unknown cds', 'conserved archael protein', 'conserved protein', 'conserved crenarchael protein', 'putative enzyme', 'uncharacterized protein', 'uncharacterized conserved protein', 'putative conserved protein', 'unknown protein', 'function unknown', 'unknown function'.
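The flow described in section 2.6 and the rules in the caption of figure 1 can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' C implementation. The function names are hypothetical, the library holds only the four example entries quoted in section 2.4, and the hypothetical-class markers are a subset of the caption's list. Two further assumptions are made: 'prefixed by' is read as the word immediately before the enzyme name, and the caption rules are tried before the keyword library. The gene symbol lookup is omitted.

```python
import re

# Miniature library in KEYWORD_FunctionalClassSymbol format (section 2.4).
# Complex keywords are listed before the simpler keywords they contain.
LIBRARY_LINES = ["histone lysine_M", "histone_I", "Signal peptide_L", "Signal_S"]

# Caption rule: these prefixes send a kinase or phosphatase to S, otherwise M.
SIGNAL_PREFIXES = {
    "kinase": {"serine", "threonine", "tyrosine", "polo", "pros", "tele",
               "sap", "protein"},
    "phosphatase": {"serine", "threonine", "aspartate", "histidine", "protein"},
}

# Caption rule: words that count only when not part of a longer word.
WHOLE_WORDS = {"PAS": "S", "PII": "M", "PTS": "M", "P700": "M", "CoA": "M",
               "RNA": "I", "DNA": "I", "SOS": "I", "porin": "C"}

# Subset of the caption's markers for the Hypothetical class.
HYPOTHETICAL_MARKERS = ("hypothetic", "unknown cds", "conserved protein",
                        "uncharacterized protein", "unknown function")


def parse_library(lines):
    """Split each 'keyword_Symbol' line at its last underscore."""
    entries = []
    for line in lines:
        keyword, symbol = line.rsplit("_", 1)
        entries.append((keyword.lower(), symbol.upper()))
    return entries


def special_rule(annotation):
    """Apply the caption rules; return a class symbol or None."""
    lowered = annotation.lower()
    for enzyme, prefixes in SIGNAL_PREFIXES.items():
        if enzyme in lowered:
            # Assumption: the prefix is the word directly before the enzyme.
            match = re.search(r"([a-z0-9]+)[\s/-]+" + enzyme, lowered)
            return "S" if match and match.group(1) in prefixes else "M"
    for word, symbol in WHOLE_WORDS.items():
        if re.search(r"\b" + re.escape(word) + r"\b", annotation, re.IGNORECASE):
            return symbol
    return None


def classify(annotation, library):
    """Return a class symbol; 'U' stands for the Unclassified category."""
    text = annotation.lower()
    if "nonribosomal" in text:
        return "M"
    if "not" in text or "non" in text:
        return "U"
    if "synthesis" in text or "synthetic" in text:
        # Any keyword of a non-metabolism class wins; otherwise Metabolism.
        other = next((s for k, s in library if k in text and s != "M"), None)
        return other or "M"
    hit = special_rule(annotation)
    if hit is None:
        # First library entry (top to bottom) found in the text is assigned.
        hit = next((s for k, s in library if k in text), None)
    if hit is not None:
        return hit
    if any(marker in text for marker in HYPOTHETICAL_MARKERS):
        return "H"
    return "U"


if __name__ == "__main__":
    library = parse_library(LIBRARY_LINES)
    for text in ("histone lysine methyltransferase", "pyruvate kinase",
                 "ATP binding"):
        print(text, "->", classify(text, library))
```

Because the library is scanned from the top, 'histone lysine' is tested before 'histone', which is the ordering rule of section 2.4. An annotation such as 'ATP binding' matches nothing and falls through to the Unclassified category, in line with the rationale of section 2.1.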

generated dynamically. The output files from ARC are tab delimited ASCII files containing the annotation text and the FunctionalClassSymbol.

2.8 Sequence files and annotations

The July 2006 release of genomic data of NCBI (NCBI) had 348 prokaryotes with a total number of proteins amounting to 1,056,660, containing 675,663 annotated proteins.

2.9 Orthologs and paralogs

Orthologs were examined by mixing the proteins of a given class from 3 species, Mycobacterium tuberculosis, M. bovis and M. leprae. Orthologs were noted if clusters of proteins had singular entries from all three species meeting the criteria S = 1.5, L = 0.95 in the program BLASTCLUST (Kondrashov et al 2002, NCBI). Paralogs in each species were identified by running BLASTCLUST on proteins of a given functional class within a species meeting the criteria S = 0.8, L = 0.95 (Subramanyam et al 2006). All other parameters were used at their default settings.

2.10 Statistical methods

Statistical significance of differences in proportions was assessed using the Binomial proportions test (Uitenbroek 1997). For data in table 1, expected proportions were computed assuming the same pattern of the reference organism to which a given species is compared. For data in table 2, observed proportions in each functional class were compared to global proportions of genes for the same functional class in each expression category assuming no preference for each category with any of the functional classes.

2.11 Availability and requirements

Project name: ARC; Project home page: http://arc.igib.res.in; Operating system(s): Linux, UNIX; Programming language: C; Other requirements: gcc, cc or other equivalent compilers; License: GNU GPL; Any restriction to use by non-academics: No

3. Results and discussion

3.1 Efficiency

ARC was tested on three model organisms, B. subtilis (Kunst et al 1997), E. coli K-12 (Blattner et al 1997) and A. fulgidus (Klenk et al 1997), and found to classify 97.3%, 96.8% and 96.2% of the annotated proteins of these organisms. The proportions of confounding entries were 1.6%, 3.6% and 1.4% respectively. A run of ARC on 348 prokaryotic proteomes classified 635,390 proteins out of 675,663 annotated proteins, representing an overall success rate of 94.04%. Given that the keyword library of ARC was initially built from 2 representative model organisms with subsequent serial enrichment, the high success rates of classification by ARC show that most annotation groups have been coherent and in phase with the earliest genome sequencing groups. It is now possible to formulate a standardized annotation scheme. These results also attest that a great majority of proteins could be classified with singular decisions directing entry into a single class. However, a minority of annotation cases pose persisting problems by presenting plural decisions. One probable cause for such confounders

Table 1. Comparative proteomics of mycobacteria (a)

Columns: Species; Total number of proteins; Number of proteins classified into seven functional classes (C, D(d), I, L, M, R(d), S); Number of proteins in the ancillary classes (O, U, H)

Mycobacterium tuberculosis H37Rv: 398*-***-**-*** 43-113*-**-**-***-** 1343

Mycobacterium bovis AF2122/97: 392*-***-**-*** 46-112*-**-**-***-** 1353

Mycobacterium leprae TN: 160*-**-***-*-** 31 573 (values carrying footnote markers: 157b,1,2 50c,1,2 512c,1,2 38b,1,2)

(a) We observed that the genomes of M. tuberculosis CDC1551 and M. avium subsp. paratuberculosis had an alarmingly high number of hypothetical proteins and therefore, these species were dropped from this analysis.

Significantly different. Lower than expected proportion (P

Significantly different. Higher than expected proportion (P
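The 'lower than expected' and 'higher than expected' footnotes rest on the binomial proportions test of section 2.10. A minimal Python sketch of one way to obtain such one-sided probabilities is given here. It uses an exact binomial tail sum, which may differ from the procedure of Uitenbroek (1997). The function name and the counts in the usage example are hypothetical and are not taken from table 1.

```python
from math import exp, lgamma, log


def binomial_tails(observed, total, expected_proportion):
    """Return (P(X <= observed), P(X >= observed)) for X ~ Binomial(total, p).

    The expected proportion p would come from the reference organism; the
    log-gamma form keeps the sum stable for proteome-sized totals.
    """
    p = expected_proportion
    if not 0.0 < p < 1.0:
        raise ValueError("expected_proportion must lie strictly between 0 and 1")

    def pmf(k):
        log_choose = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return exp(log_choose + k * log(p) + (total - k) * log(1.0 - p))

    lower = sum(pmf(k) for k in range(0, observed + 1))
    upper = sum(pmf(k) for k in range(observed, total + 1))
    # Clamp tiny floating-point overshoot above 1.
    return min(lower, 1.0), min(upper, 1.0)


if __name__ == "__main__":
    # Hypothetical counts: 100 of 4000 proteins in a class for the reference
    # species, against 31 of 1600 in the species being compared.
    expected = 100 / 4000
    p_lower, p_upper = binomial_tails(31, 1600, expected)
    print("P(lower than expected):", p_lower)
    print("P(higher than expected):", p_upper)
```

A small lower-tail probability corresponds to a 'lower than expected' proportion and a small upper-tail probability to a 'higher than expected' one.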