Resume

Network Data

Location:

Australia

Posted:

November 23, 2012

Contact this candidate

Resume:

Evaluating an Evolutionary Approach for Reconstructing Gene Regulatory

Networks

Dion J. Whitehead1, Andre Skusa1 and Paul J. Kennedy2

1NRW Graduate School in Bioinformatics and Genome Research, Center of Biotechnology (CeBiTec),

University of Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, GERMANY

2 Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW 2007, AUSTRALIA

Both authors contributed equally

abpwv5@r.postjobfree.com

Abstract Microarray data is dif cult to use for modelling networks

of gene expression. There are challenges processing the

Reconstructing networks from (partial) incomplete data is a data itself, partly due to the dif culty of comparing data

general problem in biology. In this paper, we use an evolu- from different samples and partly because of the large num-

tionary approach as one of many possible methods in an arti -

ber of genes (tens of thousands) compared to the number

cial network creation and reconstruction framework to inves-

of repetitions of experiments (tens), which is often referred

tigate the limitations of inferring gene expression networks

to as the high dimensionality of microarray data. See (Hu-

from simulated microarray data. For this, the simulated dy-

namics of the evolved networks are optimized to t the target ber et al., 2003) for a recent model of the error associated

dynamics. Evolving networks with similar dynamics is not as with microarray experiments and (Yang et al., 2000) for de-

dif cult as comparing the resulting network topologies to the

tails on normalising data within and between experiments.

original network to be reconstructed.

However, the biggest problem using microarray data to build

models of gene expression networks is not the noise of the

sample, nor the intrinsic noise of the biological phenomena.

Introduction

It is the fact that microarrays only measure one part of the

A standard bioinformatics problem is the reconstruction of real network, that is, the transcription stage of gene reg-

biological networks from incomplete data. An important ex- ulation, ignoring RNA regulation, post translational mod-

ample of this problem is the inference of gene regulatory i cation, localization signals, phosphorylation, etc. Noise

networks (GRNs) from time series of gene expression data. starts to look unimportant against all the missing data.

GRNs may be modelled using a variety of formalisms A challenge when inferring GRNs is that it is dif cult

of differing complexities ranging from directed graphs, to assess the effectiveness of algorithms because the ac-

boolean networks, differential equations and Bayesian net- tual GRNs are mostly unknown. Some researchers approach

works (De Jong, 2002). this challenge by comparing the induced GRNs with known

Time series gene expression data from high throughput GRNs for the same data. See, for example, (Wahde and

assays (eg. cDNA microarray experiments (Baldi and Hat- Hertz, 2000).

eld, 2002)) may be used to recover the original GRNs that This dif culty has led some researchers, eg. (Repsilber

generated the data. This is called a reverse problem. The re- and Kim, 2003), (P. Mendes and Ye, 2003), to test proposed

verse problem has been approached in many different ways, methods with arti cially generated, but plausible GRNs.

depending on the formalism used to model the GRN. For ex- Microarray data is simulated for the arti cial GRN and the

ample, (Liang et al., 1998) infer GRNs modelled using ran- proposed solution to the reverse problem is compared with

dom boolean networks (RBN); (Kikuchi et al., 2003) nd the known arti cial GRN.

weights for systems of differential equations using genetic This contribution follows a similar approach but looks at

algorithms (GA) and (Friedman et al., 2000) infer Bayesian a slightly different question: what are the similarities and

network representations of a GRN. differences between arti cial GRNs that generate the same

Microarray experiments (Parmigiani et al., 2003) are a re- gene expression data time series.

cent cellular biology technology able to measure the level Our methodology consists of four stages (illustrated in

of expression of thousands of genes in cells at one instant. Fig. 1): (i) create an arti cial GRN (called the target

There are generally two kinds of experiments: cDNA mi- GRN ) at random including marked mRNA and gene nodes,

croarray experiments that compare the relative gene expres- (ii) simulate microarray data for the target GRN, (iii) evolve

sion levels of two samples on the same chip, and oligonu- arti cial GRNs which produce simulated microarray data

cleotide microarrays (Affymetrix) that use one chip per sam- matching the simulated microarray data for the target GRN,

ple. and (iv) compare the evolved GRNs with the target GRN.

Method Overview

Step 1 Step 2 Step 3 Step 4

Simulate & Evolve GRNs to

Compare target

Create GRN Generate match

The following sections describe our proposed methodology

& evolved GRNs

Microarray Data microarray data

in detail.

Target GRN GRN

Target GRN microarray functioning similarities &

Step 1: Create Random Gene Regulatory Networks

data like target differences

One possibility of creating arti cial regulatory networks

(ARNs) at random is to generate a bit string and

Figure 1: Methodology used in this paper mimic the process of gene regulation occurring from this

genome (Banzhaf, 2003). The network then emerges from

the interaction of the components. In contrast to this we

Network Model generate the network topology directly. Though it is triv-

ial to create networks at random it is not as straightforward

GRNs consist of many different types of reactions, but with

to create networks at random which should make sense,

current technologies only a subset of these reactions can be

ie. they should exhibit dynamic behavior that is arbitrary, but

directly measured. In the case of microarray experiments,

reasonable in the domain they are created for. In our case,

only mRNA levels can be directly measured. Crucial mea-

reasonable GRNs are those which do not stop expression at

surements that are not made are the actual concentration of

an early time step or where activity is con ned to only a

the protein for which the mRNA is translated, the result-

small subset of the whole network. Therefore, we applied

ing variant if different exon combinations are possible, post-

a method to prune and modify random networks to produce

translational modi cation, protein localization, its current

interesting dynamics:

state (for example bound by an inhibitor protein) and many

First, a directed random network is created by applying

others, as eg. the recent discovery of the impact of small

the Erd s-Renyi approach (Erd s and Renyi, 1960) to di-

o o

RNA strands (microRNA) on all levels of genomic regula-

rected graphs: for a number of nodes each possible edge

tion (Ambros, 2001). The result is a network with many

is tested and drawn if a random number is lower than a

unknown nodes and edges (Fig. 2). This is the basis for

given probability p. Each edge is assigned a uniform ran-

our inquiry: assuming it is possible for a network to be re-

dom weight in the range of [ 1, 1]. The node which reaches

constructed from experimental data, what affect do invisible

the most other nodes is determined and set to the starting

nodes have on the reconstructed networks?

node for the simulation (for the concept of a starting node

see also Step 2 in this section). All nodes unreachable by the

rst node are connected at random to the reachable nodes

? ?

(also chosen at random).

All nodes in the network are now reachable from the rst

Reality

Observed

node in respect to the edge directions, but not necessarily in

Figure 2: An example of incomplete measurement. The dark respect to the reaction rates, because negative reaction rates

nodes are mRNA levels measured by DNA microarray tech- might inhibit the expression of complete subnetworks be-

nologies, the unknown nodes in this network are the unmea- hind a node. Thus, a second pruning is performed consider-

sured entities. ing the assigned edge weights (which serve later as reaction

rates) and all nodes which are not reachable in respect to

negative weights are simply deleted from the network. At

In most models of GRNs the nodes represent equivalent

the end, the nodes are randomly assigned as visible or invis-

entities ie. gene products. The GRNs we use are different in

ible.

that nodes represent all of the entities that regulate protein

production (eg. genes, mRNA, modi ed protein states etc.)

Step 2: Simulate Microarray Data

rather than simply gene products. This generalisation is mo-

tivated by the fact that gene regulation also occurs at places In this step we simulate microarray data for the target GRN.

other than just promoter sites. Simulation of microarray data has been accomplished by

In our model, based on microarray experiments, nodes are other researchers in a variety of ways and with different mo-

assigned to one of two categories: visible ie. experimentally tivations. Repsilber and Kim (Repsilber and Kim, 2003)

measured or invisible ie. not measured. Visible nodes repre- start with an arti cial GRN and simulate the network dy-

sent entities that may be detected in the simulated microar- namics for a number of time steps to build a time series.

ray experiment (ie. mRNA molecules). Conversely, invisible They then change the initial conditions of the GRN and sim-

nodes represent molecules undetectable by the simulated mi- ulate another time series. A simple model combines the two

croarray. Both kinds of nodes, however, contribute towards sets of expression values to give a time series of log ratios.

the network dynamics. Cui et al (Cui et al., 2003) take a different approach. They

use a more complicated model of the error implicit in mi- This matrix completely de nes the behaviour of a particu-

croarray experiments but generate the raw expression val- lar GRN from given initial conditions. For this reason, we

ues randomly from statistically plausible probability density simply encode each GRN as its matrix W. The top leftmost

n n sub matrix (where n is the size of the target matrix)

functions.

Our approach is more similar to that of Kim and Rep- is constrained such that these nodes may not be lost by evo-

silber. We chose to implement a relatively simple and lution. These are the visible nodes. Wiring between these

widespread formalism: ordinary differential equations (De nodes, however, is adaptable by evolution. Also, the rst

Jong, 2002), based on the so-called Hill function, a sig- node (ie. at W11 ) has a hard wired feedback to itself to sim-

moidal function that agrees with experimental data. That plify the evolution task. All other weights, however, are set

the function is non-linear, is important. Even though lin- by evolution.

ear functions are simpler, most biochemical interactions are The arti cial evolution begins with a population of ran-

non-linear, which has important consequences in terms of domly created networks 10% larger than the target matrix

network dynamics. A degradation term is added to model with weights drawn uniformly randomly from the range [-1,

gradual decay. 1]. However, evolution may increase or decrease the number

dgi of genes (rows/columns) in the matrix. For the experiments

= h+ (gi ) i gi (1)

dt in this contribution, we typically used population sizes of

1000 individuals and run lengths up to 6000 generations.

where gi is the expression level of the ith gene and i ranges

from 1 to N for N total genes. i = 0.3 is the degradation Each member of the population is scored by a tness func-

term and h+ (gi ) is the Hill function given by tion that quanti es how close the output from the GRN cor-

responding to the genome is to the output of the target GRN.

x(gi )3 So, for each genome, the GRN is simulated using equations

h(gi ) = (2)

x(gi )3 + K (1), (2) and (3) with the same initial conditions and constant

parameters as for the target GRN. Only expression values

with

for the rst n genes (ie. the upper leftmost n n submatrix

matching the target W matrix) are recorded. The distance

x(gi ) = W ji g j (3)

of the resulting time series of expression values for each of

j=1

the n genes to the target values is computed using a simple

where W is the matrix of interactions and K = 1 the thresh-

squared error approach:

old for regulatory in uence on gi . To allow for repression

of genes, then if x(gi ) is negative then h+ (gi ) is replaced by n T

(git yit )2

h (gi ) = 1 h+ (gi ) and K changes sign. f= (4)

Using Euler s method and a xed step size, the arti cial i=1 t =1

GRN is simulated using numerical integration of the differ- where f is the tness of the genome. Evolution seeks to min-

ential equations (ie. equation 1) for a small number of time imise the value f . f sums over each of the n visible genes yit

steps. At the initial time step, the rst node (starting node) indexed by gene with i and by time with t for each of the T

is given a positive value (2 in our simulations) and the value time steps. The value git is the target gene expression value

of all other nodes are set to 0. The network is then sim- of gene i at time t .

ulated for s time steps with node values recorded for each An elitist strategy is used to manage the population. After

time step to form a time series of network activity (s is cho- calculating the tness of the population, the weakest 40%

sen such that the steady state starts right after that). In an ap- of the population is deleted and replaced with the ttest

proach inspired by oligonucleotide microarray experiments 40%. Genetic operators are then applied to the population

(eg. Affymetrix experiments), a single gene expression value with (small) xed probability. Genetic operators include:

is generated for each measured node. At this stage, sampling node duplication, node deletion, edge creation, edge dele-

error and other forms of noise are not added to the simulated tion, edge mutation (ie. changing the weight value) and edge

data. The result of this step is a matrix of values with a row duplication. Genetic operators do not modify the feedback

for each measured node and a column for each time step. edge to the initial node ie. W11 nor delete any of the hard

coded visible nodes in the top leftmost n n submatrix. Fit-

Step 3: Evolve Gene Regulatory Networks ness is computed for members of the new population and

Next, GRNs are evolved using an evolutionary computation evolution continues until networks with suitably small t-

algorithm to discover networks that produce similar output ness are designed or the maximum number of generations is

to the target GRN. reached.

To use an evolutionary approach to discover networks, it

Step 4: Compare Gene Regulatory Networks

must be possible to encode each candidate GRN on an ar-

ti cial genome. In Step 2 above, we describe how GRNs In this paper, no sophisticated comparison methods were ap-

are represented as a square matrix W of interaction weights. plied. As described in the following section (Experiments

and Results) the evolved networks were much larger and

more densely connected then expected. Thus, to apply and

develop meaningful comparison methods is the most impor-

tant next step. Here we compared the network only in parts

and manually .

Experiments and Results

The goal for the experiments was to evolve networks simi-

lar to a given target network regarding the behavior, ie. the

expression time series. We chose two target networks for

evolution from the randomly created networks, each with

20 nodes. One had a simple sequential connection topol-

Figure 3: Topology of the simple sequential network. The

ogy with resulting simple dynamics. The other had a more

starting node is given in black. The two parts of the network

complicated topology including negative feedback. Both

on the left and the right side are arranged according to the

networks contain only visible nodes (see section Network

two main activity patterns in the resulting dynamics (Fig. 4).

Model). However, the evolution protocol allowed the addi-

tion and deletion of invisible nodes. By manually comparing

the time series of networks of size 20 we concluded that a

distance smaller than or equal to approx. 1.0 indicated very

similar network dynamics.

The same mutation probabilities were used in both exper-

iments. Several parameter settings have been tested. Dif-

ferent simulation lengths (40 and 20 time steps respectively)

were used for the two networks to capture the signi cant Simple Sequential Network

dynamics before the steady state is reached. To simplify the 2.0 Target

plots some signi cant nodes were selected for drawing the

expression level time series.1 1.5

expression level

Simple Sequential Network

1.0

The rst target we chose can be seen in Fig. 3. It con-

tains mostly sequentially connected genes and therefore the

0.5

resulting expression time series consists mainly of sequen-

tially increasing expression levels for each gene reaching a

nal steady state (Fig. 4, top). The main characteristics of 0.0

this network is the division of the nodes into two groups: 2.0

Evolved

one in which the nodes are immediately expressed, and the

another where the nodes are expressed around 20 time steps 1.5

later at much lower expression levels. This re ects the struc-

expression level

ture of the network, which can be divided into mainly two 1.0

groups of genes activated separately by the starting gene but

then do not subsequently interact. 0.5

A resulting time series for one of the evolved networks

can be found in the bottom of gure 4. The distance to the 0.0

target network for this was approx. 0.74, and it contains 64 0 *-**-**-**-**-** 35 40

time

nodes and 826 edges. Comparing the two time series shows

that the main characteristics described previously are in the

Figure 4: Dynamics of the simple sequential network. Top:

evolved network. Due to the much higher number of nodes

Target network behavior. Bottom: Evolved network behav-

and edges in the resulting network, a comparison of the net-

ior. In each gure the upper and lower expression level

work topology is not trivial and is deferred to future work.

refers to the right and left side in the network topology in

The simplicity of the target network topology suggests that a

Fig. 3.

large number of possible topologies could exhibit the target

dynamics.

1 Detailed parameter descriptions and complete experimental

data are available on request.

Network with Negative Feedback serve only the overall dynamics, but also parts of the original

topology can be related to the evolved network. Further de-

The other network chosen for further analysis is shown in

tailed analysis and development of tools for detecting such

Fig. 5, the corresponding time series can be found in the top

similarities have yet to be done.

of Fig. 6. This network is more complex than the previous

due to the existence of negative feedback structures which

leads to a delayed inhibition of expression. The dynamics

Negative Feedback Network

are mainly driven by a node (marked in Fig. 5) which is ac-

tivated by the starting node and inhibited later by all other 2 Target

nodes it is connected to. In the time series this node ( in

Fig. 6) is increased rst by the starting node and then inhib-

ited nally to zero, with subsequent nodes decaying to zero

expression level

as well. 1

Evolved

expression level

0 5 10 15 20

time

Figure 5: Topology of the network with negative feedback.

The starting node is given in black. The single circled node

is crucial in the dynamics of the whole network and refers Figure 6: Dynamics of the negative feedback network. Top:

to the curves containing in Fig. 6. The three lower nodes Target network behavior. Bottom: Evolved network behav-

marked are the motif preserved in the evolved network in a ior. Crucial for the dynamics of the whole network is the

slightly different way (Fig. 7). delayed inhibition shown in the line, belonging to the sin-

gle marked node in Fig. 7. Following that other expression

levels are decreasing as well or are able to increase as soon

The time series from one of the resulting evolved net-

as the expression level of decreases (eg. or ).

works demonstrates that more complicated dynamics can

be reasonably approximated (Fig. 6, bottom). The distance

between networks is approx. 1.0 and the evolved network

Conclusions and Future Work

consists of 55 nodes and 298 edges. Thus, the tendency of

the networks to grow during evolution is still present, even Although the distances between target and evolved networks

though the growth is not as much as the previous network. are low, meaning the dynamics are quite similar, the corre-

This might have been caused by the more restricted space sponding comparison of network topologies is non trivial.

for possible solutions. Although a comparison of network Until a structural similarity can be measured in addition to

topologies is non-trivial, manual examination shows that one the dynamics similarity, the results of GRN reconstruction

of the target network s substructures has been preserved by by an evolutionary approach appears unreliable. However,

the evolved network. The three nodes of the target motif are the example of a preserved substructure in the network with

circled in Fig. 5 and the according subgraph of the evolved negative feedback shows a direction of future work.

network is shown in Fig. 7. In both cases the expression level A further problem to be considered is the increase of node

of the middle node increases after the central node has and edge numbers compared to the target network. Although

been decreased by other in uences. In the evolved network this seems to be resolvable by restricting node and edge

motif in Fig. 7 all edges which do not contribute to these numbers during the evolution such restrictions make the ob-

dynamics have been removed. The remaining structure only jective function harder to solve and lead often to local op-

differs in the position of the negative feedback (moved into tima as rst tests on this demonstrated. One reason could

a self-loop of node ). Thus, this network seems not to pre- be that adding additional nodes which turn the behavior into

Erd s, P. and Renyi, A. (1960). On the evolution of random

graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17 61.

Friedman, N., Linial, M., Nachman, I., and Pe er, D. (2000).

Using bayesian networks to analyze expression data. In

Proceedings of Conference on Research in Computa-

Figure 7: Topology of the preserved motif. All edges which

tional Molecular Biology, pages 127 135, New York,

do not contribute to the dynamics of these nodes have been

NY, USA. ACM Press.

removed. The only change is in the inhibiting edge which

appears now as a self-loop at the target node. Huber, W., von Heydebreck, A., Sueltmann, H., Poustka,

A., and Vingron, M. (2003). Parameter estimation for

the calibration and variance stabilization of microarray

the desired one is probably easier than correctly connecting data. Statistical Applications in Genetics and Molecu-

a prede ned number of nodes including the right reaction lar Biology, 2(1).

rates. This is in fact the way nature explores parameter space

Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., and

by gene duplication followed by modi cation.

Tomita, M. (2003). Dynamic modeling of genetic net-

In addition to that, the next future steps will complete the

works using genetic algorithm and S system. Bioinfor-

stages of our proposed methodology. For example, adjusting

matics, 19(5):643 650.

the ratio of visible to invisible nodes (ie. the extent of data

incompleteness) will allow us to tune the degree to which the

Liang, S., Fuhrman, S., and Somogyi, R. (1998). REVEAL,

microarray can detect the arti cial GRN and adding noise to

a general reverse engineering algorithm for inference of

the expression values would create more realistic arti cial

genetic network architectures. In Proceedings Paci c

microarray data. Also, other established GRN reconstruc-

Symposium on Biocomputing 3 1998, pages 18 29.

tion methods can be used.

P. Mendes, W. S. and Ye, K. (2003). Arti cial gene net-

Acknowledgements works for objective comparison of analysis algorithms.

We would like to thank Jacob K hler for helpful comments.

o Bioinformatics, 19 Suppl. 2:ii122 ii129.

Paul Kennedy would like to thank Klaus Prank and the NRW

Parmigiani, G., Garrett, E. S., et al. (2003). The analy-

Graduate School in Bioinformatics and Genome Research

sis of gene expression data: An overview of meth-

in Bielefeld for inviting and hosting him for his sabbatical in

ods and software. In Parmigiani, G., Garrett, E. S.,

2003 during which time ideas for this paper were developed.

Irizarry, R. A., and Zeger, S. L., editors, The analy-

Andre Skusa and Dion Whitehead would like to acknowl-

sis of gene expression data, pages 1 45, Heidelberg.

edge the same institution for general nancial support.

Springer Verlag.

References

Repsilber, D. and Kim, J. T. (2003). Developing and testing

Ambros, V. (2001). microRNAs: tiny regulators with great methods for microarray data analysis using an arti cial

potential. Cell, 107:823 826. life framework. In Banzhaf, W., Christaller, T., Dit-

trich, P., Kim, J. T., and Ziegler, J., editors, Advances in

Baldi, P. and Hat eld, G. W. (2002). DNA microarrays and

Arti cial Life, Proceedings of European Conference of

gene expression. Cambridge University Press, Cam-

Arti cial Life (ECAL) 2003, pages 686 695. Springer

bridge, UK.

Verlag.

Banzhaf, W. (2003). On the dynamics of an arti cial regula-

Wahde, M. and Hertz, J. (2000). Coarse grained reverse en-

tory network. In Banzhaf, W., Christaller, T., Dittrich,

gineering of genetic regulatory networks. Biosystems,

P., Kim, J. T., and Ziegler, J., editors, Advances in Arti-

55:129 136.

cial Life, Proceedings of European Conference of Ar-

ti cial Life (ECAL) 2003, LNAI 2801, pages 217 227. Yang, Y., Buckley, M. J., Dudoit, S., and Speed, T. P. (2000).

Springer Verlag. Comparison of methods for image analysis on cDNA

microarray data. Technical report 584, Department of

Cui, X., Kerr, M. K., and Churchill, G. A. (2003). Data Statistics, University of California, Berkeley.

transformations for cDNA microarray data. Statistical

Applications in Genetics and Molecular Biology, 2(1).

De Jong, H. (2002). Modeling and simulation of genetic reg-

ulatory systems: A literature review. Journal of Com-

putational Biology, 9(1):67 103.

Contact this candidate