Neurocomputing 25 (1999) 115–131
Training multilayer neural networks using fast global
learning algorithm - least-squares and penalized
optimization methods
Siu-yeung Cho*, Tommy W.S. Chow
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Received 1 May 1998; accepted 8 December 1998
Abstract
The major limitations of conventional learning algorithms are local minima and slow
convergence. This paper presents a novel heuristic global learning algorithm for
neural networks. The proposed algorithm is based upon the least-squares (LS) method,
to maintain a fast convergence speed, and a penalty (PEN) approach, to solve the problem of
local minima. The penalty term is superimposed onto the error surface and provides
a way of escaping from local minima when the convergence stalls. The choice and adjustment
of the penalty factor are also derived, to demonstrate the effect of the penalty term and to
ensure the convergence of the algorithm. The developed learning algorithm is applied to several
classification problems. In all the tested problems, the proposed algorithm
outperforms conventional algorithms in terms of convergence speed and the ability to
escape from local minima. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Multilayer neural networks; Global learning algorithm; Least-squares method;
Penalized optimization
1. Introduction
The most influential effort in artificial neural network learning algorithms is the
development of the backpropagation (BP) algorithm, which is a systematic method for
* Corresponding author. Fax: 852-********.
E-mail address: abqjuq@r.postjobfree.com (S. Cho)
0925-2312/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0925-2312(99)00055-7
learning multilayer perceptron (MLP) neural networks. One of the primary tasks in
exploiting the BP algorithm involves the development of supervised learning method-
ologies, which aim at determining an optimal function mapping for training patterns and
unseen test patterns. Many supervised learning algorithms have been proposed based
on the popular sum-squared-error function. Basically, there are
two major shortcomings of the BP learning algorithm: the training may get
stuck in local minima, and the convergence speed is slow. In 1992, Gori and Tesi
proposed a theoretical framework for the standard BP algorithm in order to identify
the problem of local minima. In their studies, they also used examples to illustrate that
the gradient-descent-type BP algorithm is prone to local minima [13,14]. From a
theoretical point of view, their results lead us to believe that the standard BP
algorithm is not a very reliable learning algorithm. On the other hand, the learning
process cannot be guaranteed to complete within a reasonable time for most compli-
cated problems, such as [9,23]. Although the rate of convergence can be sped up
simply by selecting larger learning parameters, this would probably introduce oscilla-
tion and may result in failing to find an optimal solution. Many fast learning
algorithms were therefore proposed in the literature. Jacobs [19], Tollenaere [25],
and Yam and Chow [28] proposed several acceleration techniques that adapt
the learning parameters during the computational process. Other learning algorithms
based on different optimization methods were also proposed in recent years. For
instance, the conjugate-gradient (CG) method [6,27] and Newton's or quasi-Newton's
method [4] appear to be much faster learning algorithms, but at the expense of
memory size and computational complexity. Meanwhile, linear least-squares-based
[30,31], recursive least-squares (RLS) [2,3,7] and extended Kalman filter (EKF) [29]
learning algorithms have become widely used for multilayer neural networks. The
results indicated that these types of algorithms can provide a significant reduction in
the total number of iterations by exploiting the fast convergence characteristics of the
RLS method, but they suffer from convergence stalling, getting stuck in local minima
and poor generalization.
Global optimization techniques [11,16,26] have recently been employed to provide
training methods for MLP networks that avoid trapping in undesired local
minima. For instance, simulated annealing (SA) [1,12] is a stochastic global
minimization method that allows both downhill and uphill movements for escaping from
local minima, so that the search converges to a global minimum in the end.
The evolutionary algorithm (EA) [21] and the genetic algorithm (GA) [8,17], based on
computational models of evolution, are other global optimization mechanisms used
to determine connectivity in feedforward networks. These methods have been applied
to complex, multimodal minimization problems with both discrete and continuous
variables. Battiti and Tecchiolli proposed a heuristic scheme, the reactive tabu
search, based on a "modified local search" component complemented with a meta-
strategy, for neural network training [5]. Most recently, a trajectory-based
nonlinear optimization method was developed by Shang and Wah for neural net-
works [24]. Their approach combines global search with local search to find
an optimal solution. It relies on an external force to pull a search out of a local
minimum. This global optimization method is capable of providing very promising
results even when it is applied to very complicated recognition and classification
problems. In this algorithm, the evaluation of a trace function, leading the trajectory
out of local minima, requires the solution of an ordinary differential equation
(ODE). The algorithm is extremely computationally complex, requiring about
10 h for the two-spiral benchmark problem even on powerful machines or parallel
computers.
In this paper, a robust and efficient global learning algorithm, based on a hybrid
of the least-squares and penalty search (LS-PEN) methods, is developed to reduce the
convergence time. The proposed algorithm has a very strong ability to escape from
local minima. For simplicity, we assume that the MLP network has a single hidden
layer. The weights (V), connecting the output layer to the hidden layer, are
first determined using the least-squares (LS)-based method. Afterwards, the weights
(W), connecting the hidden layer to the input layer, are determined by
gradient descent optimization. When the algorithm gets stuck, the learning dynamics
switch to minimizing a new penalty-approach cost function [22,32] in terms of the
weight matrix (W). The novel penalty approach is defined by superimposing a Gaus-
sian-type function over the weight space domain. Many different types of function
may be theoretically suitable; our preliminary studies included the Laplace
function, the Rayleigh function and the Gaussian function. In this paper, all the presented
work is based on the Gaussian function because of its outstanding performance
compared to the others. Despite the success of our proposed heuristic LS-PEN
algorithm in performing global learning, like all other learning algorithms, we are still
unable to determine the distribution of minima in error surfaces and unable to identify
whether a minimum is a local or a global one. In terms of its working mechanism,
the weights (V) are estimated by the LS-based method whilst the weights (W) are
estimated by the penalty-based global optimization. This enables the dimen-
sionality of the weight space for the global search algorithm to be significantly
reduced, which speeds up the rate of convergence. The proposed
LS-PEN algorithm has been validated on a modified XOR problem, given in Gori
and Tesi's study, which has a unique global minimum. The results illustrate that our
proposed algorithm is capable of escaping from local minima.
The LS-PEN algorithm is also validated on the parity 4-bit problem and the
two-spiral benchmark problem.
This paper is organized as follows: the background of our global optimization
approach is presented in Section 2. The design of the LS-PEN algorithm for MLP
network training is derived and summarized in Section 3; the choice and adaptation
of the penalty factor are also discussed in that section. In Section 4, the
simulation results for classification tasks are discussed, and finally the con-
clusion is drawn in Section 5.
2. Background of our global learning algorithm
Let us define the global optimization problem to be considered in this paper
as follows. Let $E(w): \mathbb{R}^{n} \to \mathbb{R}$ be a cost function that is twice
continuously differentiable, where w is a vector of n state variables or parameters. The
objective is to find the global minimum solution $w_{GM}$ which minimizes E(w),
$$E^{*} = E(w_{GM}) = \min\{E(w) \mid w \in I\}, \qquad (1)$$
where I is the domain of the state variables over which one seeks the global minimum
and I is assumed to be compact and connected. Further, we assume that every local
minimum $w_{LM}$ of $E(w)$ in $I$ satisfies the conditions

$$\frac{\partial E(w_{LM})}{\partial w} = 0, \qquad (2)$$

$$y^{T}\,\frac{\partial^{2} E(w_{LM})}{\partial w^{2}}\, y \ge 0 \quad \forall\, y \in \mathbb{R}^{n}. \qquad (3)$$
Based on the above conditions, we assume that the global minimum satisfies these
local-minimum conditions and that it does not occur beyond the boundary of I.
A well-known penalized optimization method, introduced in [22,32], states
that problem (1) can be approximated by the solutions of associated penalized uncon-
strained problems in finite-dimensional spaces, as shown below:

$$\min\{E(w) + \lambda E_{P}(w)\} = \min\{E(w) \mid w \in I\} = E^{*}, \qquad (4)$$

where $E_{P}(w)$ is a penalty function for the constraint set $I$ and $\lambda$ is a real number
greater than 0. The penalized optimization technique provides an uphill force whenever the
convergence is trapped in local minima. Thus, a discontinuous penalty function,
defined as in [32], should satisfy

$$E_{P}(w) = 0, \quad w \in I; \qquad E_{P}(w) = g(w), \quad w \notin I, \qquad (5)$$
where $g(w)$ is a penalty-like function. In our study, a Gaussian-type function is chosen.
3. Least squares and penalized (LS-PEN) global learning algorithm
3.1. Derivation of the LS-PEN-based global learning algorithm
For simplicity, we consider an MLP network with a single hidden layer.
Suppose the network has m input units, n hidden units and q output units, and we
assume that the activation function takes the form $\varphi(x) = 1/(1+\exp(-x))$. The
general form of the MLP network can be represented as follows:

$$o_{k} = \sum_{j=1}^{n} v_{kj}\,\varphi\!\left(\sum_{i=1}^{m} w_{ji} x_{i} + b_{j}\right) + \gamma_{k}, \qquad 1 \le k \le q, \qquad (6)$$

where $o_{k}$ and $x_{i}$ denote network output and input values, respectively. $w_{ji}$ and
$b_{j}$ denote the synaptic weights and bias terms, respectively, connecting the
hidden layer to the input layer, which form the elements of the weight matrix W.
Similarly, $v_{kj}$ and $\gamma_{k}$ denote the synapses and biases, respectively, connecting the
output layer to the hidden layer, which form the elements of the weight matrix V.
Assume there are P patterns in the training set. For pattern $p = 1, 2, \ldots, P$, let
$t_{p} = (t_{1p}, t_{2p}, \ldots, t_{qp})^{T}$ and $o_{p} = (o_{1p}, o_{2p}, \ldots, o_{qp})^{T}$ denote the desired output vector and
the true output vector of the network. $a_{p} = (a_{1p}, a_{2p}, \ldots, a_{np}, 1)^{T}$ denotes the vector of
outputs at the hidden layer, where the entries are $a_{jp} = \varphi\left(\sum_{i=1}^{m} w_{ji} x_{ip} + b_{j}\right)$. $x_{p} =
(x_{1p}, x_{2p}, \ldots, x_{mp}, 1)^{T}$ denotes the vector of inputs of the network.
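The mapping of Eq. (6) translates into a few lines of NumPy. The sketch below is our own illustration, not the authors' code; storing each bias term as the last column of its weight matrix (matching the appended 1 in $a_p$ and $x_p$) is our layout assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, V, X):
    """Forward pass of the single-hidden-layer MLP of Eq. (6).

    W : (n, m+1) hidden weights; last column holds the biases b_j
    V : (q, n+1) output weights; last column holds the output biases
    X : (P, m) input patterns, one pattern per row
    Returns (Ab, O): hidden vectors a_p with the appended 1, and the
    network outputs o_p, both stored one pattern per row.
    """
    P = X.shape[0]
    Xb = np.hstack([X, np.ones((P, 1))])   # x_p = (x_1p, ..., x_mp, 1)^T
    A = sigmoid(Xb @ W.T)                  # a_jp = phi(sum_i w_ji x_ip + b_j)
    Ab = np.hstack([A, np.ones((P, 1))])   # a_p = (a_1p, ..., a_np, 1)^T
    O = Ab @ V.T                           # o_kp = sum_j v_kj a_jp + bias
    return Ab, O
```

Keeping patterns as rows makes the batch pass over all P patterns a single pair of matrix products.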
The total sum-squared-error function

$$E_{T}(V, W) = \sum_{p=1}^{P} (t_{p} - o_{p})^{T}(t_{p} - o_{p}) \qquad (7)$$

is chosen as the cost function for the MLP network optimization. The goal of the
learning algorithm is to optimize the weights of the network by minimizing
this cost function. Setting all the derivatives of $E_{T}(V, W)$ with respect to V equal to
zero, the optimal weight matrix V can be computed exactly as

$$V = T A^{T} (A A^{T})^{+}, \qquad (8)$$

where $T = (t_{1}, t_{2}, \ldots, t_{P})$ and $A = (a_{1}, a_{2}, \ldots, a_{P})$, and "+" denotes the
pseudo-inverse of a matrix. Thus, the cost function $E_{T}(V, W)$ can be reformulated as

$$E_{T}(V, W) = E_{T}(V(W), W) = E_{N}(W), \qquad (9)$$
because the weight matrix V can be expressed as a matrix function of W. There
exists a new weight space of lower dimension that maps into the original weight
space. In other words, minimizing the new cost function $E_{N}(W)$ is equivalent
to minimizing the error function over the whole weight space. In this study, the
weight matrix W is iteratively estimated by minimizing this cost function via
gradient descent optimization.
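Eq. (8) is a one-line computation in NumPy; the helper below is our own sketch, with A stored column-wise as in the text:

```python
import numpy as np

def ls_output_weights(A, T):
    """Optimal output weights of Eq. (8): V = T A^T (A A^T)^+.

    A : (n+1, P) matrix whose columns are the hidden vectors a_p
    T : (q, P) matrix whose columns are the targets t_p
    """
    return T @ A.T @ np.linalg.pinv(A @ A.T)
```

In practice `np.linalg.lstsq(A.T, T.T, rcond=None)` yields the same least-squares minimizer with better numerical behaviour when $AA^T$ is ill-conditioned.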
As stated above, minimization by gradient descent suffers from the problem of
local minima. Therefore, it is advisable to modify the new cost function $E_{N}(W)$
by including a penalty function that provides a way out of local minima when the
convergence gets stuck. In this paper, the penalty function is a Gaussian function
defined over the weight space domain.
The proposed penalty function is given by

$$E_{P}(W) = \sum_{j=1}^{n} \sum_{i=1}^{m} \exp\!\left(-\frac{|w_{ji}(n-1) - w_{ji}^{*}|^{2}}{\lambda}\right), \qquad (10)$$

where $w_{ji}(n-1)$ is the weight in the specified weight space at the previous iteration,
and $w_{ji}^{*}$ is the sub-optimal weight value at which the search is assumed to be stuck
in a local minimum. $\lambda$ denotes the penalty factor, which determines the influence of
the penalty term relative to the original least-squares cost function. This weighting
factor is adjusted iteratively to control the breadth of the search and force the
trajectory out of local minima. The correct choice and adaptation procedure for
$\lambda$ are described in Section
3.2. In accordance with the penalty approach, the modified cost function for the
LS-PEN algorithm is defined as

$$E_{M}(W) = E_{N}(W) + E_{P}(W). \qquad (11)$$
Consequently, the update equation for the weights connecting the hidden
layer to the input layer is

$$w_{ji}^{**}(n) = w_{ji}(n-1) + \eta\, \Delta w_{ji}(n), \qquad (12a)$$

$$\Delta w_{ji}(n) = -\left(\frac{\partial E_{N}(W)}{\partial w_{ji}} + \frac{\partial E_{P}(W)}{\partial w_{ji}}\right), \qquad (12b)$$

where $w_{ji}^{**}(n)$ is an optimal weight in the specified weight space obtained by
minimizing the modified cost function, and $\eta$ denotes the learning rate of the
algorithm. $\partial E_{N}/\partial w_{ji}$ is the original gradient of $E_{N}$ with respect to $w_{ji}$, and
$\partial E_{P}/\partial w_{ji}$ is the derivative of the penalty function $E_{P}(W)$ with respect to $w_{ji}$;
from Eq. (10) we define

$$\frac{\partial E_{P}(W)}{\partial w_{ji}} = -\frac{2}{\lambda}\,\bigl(w_{ji}(n-1) - w_{ji}^{*}\bigr)\exp\!\left(-\frac{|w_{ji}(n-1) - w_{ji}^{*}|^{2}}{\lambda}\right) \qquad (13)$$

as the proposed penalty term.
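Eqs. (10) and (13) act elementwise on the weight matrix, so both fit in a couple of NumPy lines; this is an illustrative sketch of ours, not the authors' implementation:

```python
import numpy as np

def penalty(W_prev, W_star, lam):
    """Gaussian penalty E_P(W) of Eq. (10), summed over all weights w_ji."""
    d = W_prev - W_star
    return np.sum(np.exp(-d ** 2 / lam))

def penalty_grad(W_prev, W_star, lam):
    """Elementwise penalty gradient dE_P/dw_ji of Eq. (13)."""
    d = W_prev - W_star
    return -(2.0 / lam) * d * np.exp(-d ** 2 / lam)
```

A central finite-difference check confirms that Eq. (13) is the exact derivative of Eq. (10).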
3.2. Choice and adjustment for the penalty factor
In accordance with Eq. (13), the penalty factor $\lambda$ determines the effect of the
penalty term against the original negative gradient $-\partial E_{N}/\partial w_{ji}$. A correct choice
and adjustment of $\lambda$ is critical to the overall global optimization
performance. The selection of $\lambda$ is based on the following conditions:
1. If $\lambda$ is too small, large weight changes are possible because of the strong effect of the
penalty term. In this case the algorithm is unstable, which may result in an invalid
solution.
2. If $\lambda$ is too large, only very small weight changes are allowed for the trajectory to
escape from local minima, because the penalty term is virtually redundant.
The best choice of $\lambda$ lies between these two extremes, subject to the condition of
assuring the training convergence. The proposed global algorithm is said to
converge in the mean square if the mean-square value of the error vector
$e_{p}(n) = t_{p} - o_{p}(n)$ approaches a constant value as the number of iterations n ap-
proaches infinity; that is,

$$E\{e_{p}(n)^{T} e_{p}(n)\} \to \text{constant as } n \to \infty, \qquad (14)$$

which implies that

$$\Delta E_{M}(n) = E_{M}(n+1) - E_{M}(n) \le 0 \quad \text{as } n \to \infty. \qquad (15)$$
The modified cost function can be approximated by a first-order Taylor series:

$$E_{M}(n+1) = E_{M}(n) + \left[\frac{\partial E_{M}(n)}{\partial w_{j}}\right]^{T} \Delta w_{j}(n), \qquad (16)$$

where $[\partial E_{M}(n)/\partial w_{j}]^{T}$ represents the gradient of the modified cost function with
respect to the weight vector $w_{j}$, and $\Delta w_{j}(n)$ represents the change in an arbitrary
weight vector. Based on the above expressions, the guideline for the choice of $\lambda$ is

$$0 < \lambda \le 2\,\|w_{j}(n-1) - w_{j}^{*}\|\,\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\|^{-1}. \qquad (17)$$

The derivation of the upper bound of $\lambda$ is briefly given in Appendix A.
In addition, the penalty factor is adjusted to control the global search ability
according to the following conditions:
1. When the convergence is stuck in a local minimum, the penalized optimization is
introduced; $\lambda(t)$ decreases gradually by 0.3% of $\lambda(t-1)$ to provide an uphill
force for escaping from the local minimum.
2. After running the penalized optimization for a few iterations, $\lambda(t)$ is
increased by 1% of $\lambda(t-1)$ to diminish the effect of the penalty term once the
training error starts to decrease.
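The schedule above can be written as a small helper; the boolean arguments are our own framing of conditions 1 and 2, not notation from the paper:

```python
def adapt_lambda(lam, stalled, error_decreasing, lam0):
    """Penalty-factor schedule of Section 3.2 (illustrative sketch).

    stalled          : convergence is trapped and an uphill force is needed
    error_decreasing : the training error has started to fall again
    lam0             : initial value that lambda is restored to otherwise
    """
    if stalled:
        return 0.997 * lam   # condition 1: shrink by 0.3% to strengthen the penalty
    if error_decreasing:
        return 1.01 * lam    # condition 2: grow by 1% to fade the penalty out
    return lam0
```

Note the asymmetry: the decrease (0.3%) is gentler than the increase (1%), so the penalty is withdrawn faster than it is applied once normal convergence resumes.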
3.3. Summary of the LS-PEN based global learning algorithm
The LS-PEN algorithm can be summarized as follows:
Step 1: Algorithm initialization: set n = 0. Randomize all of the weights, V and W, in
the whole network with zero mean and unit variance. Set the learning rate η = 0.1,
the penalty factor λ = λ0 = 1000 and g_flag = 0.
Step 2: Propagate the given training patterns through the whole network and
calculate E_T(n) by Eq. (7).
IF the stopping criterion is met, THEN the training stops, ELSE set
E_T(n−1) = E_T(n), E_T* = E_T(n), V* = V(n) and W* = W(n).
Step 3: Set n = n + 1.
Step 4: Optimize the weights V by Eq. (8). Optimize the weights W by the gradient
descent method.
Step 5: Propagate the given training patterns through the whole network and
calculate E_T(n) by Eq. (7).
IF the stopping criterion is met, THEN the training stops; go to Step 8.
IF E_T(n) < E_T*, THEN set E_T* = E_T(n), V* = V(n) and W* = W(n); go to Step 3.
IF E_T(n) ≥ E_T(n−1), THEN set g_flag = 1 and λ = 0.997λ; ELSE-IF E_T(n) ≥ E_T*,
THEN set λ = 1.01λ; ELSE set g_flag = 0 and λ = λ0.
Step 6: IF g_flag = 1, THEN set n = n + 1, calculate ∂E_P(W)/∂w_ji by Eq. (13) and
optimize W(n) by Eqs. (12a) and (12b), ELSE optimize W(n) by the conventional
gradient descent method.
Step 7: Compute V(n) by Eq. (8) and repeat Step 5.
Step 8: When the training stops, V* and W* are the optimal solutions.
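A condensed, illustrative rendering of Steps 1–8 is given below. The full bookkeeping of E_T*, V*, W* is simplified, and the network layout, stall test and stopping behaviour are our assumptions; it is a sketch of the alternating LS/penalized-gradient scheme, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ls_pen_train(X, T, n_hidden=4, eta=0.1, lam0=1000.0, iters=300, seed=0):
    """Condensed LS-PEN loop: LS solve for V (Eq. (8)), gradient descent on W,
    with the Gaussian penalty gradient (Eq. (13)) switched in while the error stalls."""
    rng = np.random.default_rng(seed)
    P, m = X.shape
    W = rng.standard_normal((n_hidden, m + 1))      # hidden weights + bias column
    Xb = np.hstack([X, np.ones((P, 1))])
    lam, g_flag, W_star, E_prev = lam0, False, None, np.inf
    for _ in range(iters):
        A = sigmoid(Xb @ W.T)
        Ab = np.hstack([A, np.ones((P, 1))])
        V = T.T @ Ab @ np.linalg.pinv(Ab.T @ Ab)    # Eq. (8), patterns stored row-wise
        O = Ab @ V.T
        err = T - O
        E = 0.5 * np.sum(err ** 2)
        if E >= E_prev and not g_flag:              # stall detected: engage the penalty
            g_flag, W_star = True, W.copy()
        elif g_flag and E < E_prev:
            lam *= 1.01                             # error falling again: fade the penalty
            if lam >= lam0:
                g_flag, lam = False, lam0
        grad = ((-err @ V)[:, :n_hidden] * A * (1.0 - A)).T @ Xb   # dE_N/dW
        if g_flag:
            lam *= 0.997                            # strengthen the uphill force
            d = W - W_star
            grad += -(2.0 / lam) * d * np.exp(-d ** 2 / lam)       # Eq. (13)
        W -= eta * grad
        E_prev = E
    return W, V, E_prev
```

Because V is re-solved by least squares at every pass, the gradient search only has to explore the much smaller W-space, which is the source of the speed-up claimed in Section 1.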
4. Simulation results and discussion
The performance of the LS-PEN algorithm is investigated by applying it to the
modified XOR, parity 4-bit and two-spiral classification problems. In all cases,
a single-hidden-layer network was implemented. Initialization is performed by taking
the weights connecting the hidden and input layers randomly in the range
−10.0 to 10.0, and the weights connecting the output and hidden layers in
the range −1.0 to 1.0, with a uniform distribution. The root-mean-square error (RMSE),
defined by

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{p=1}^{P} (t_{p} - o_{p})^{T}(t_{p} - o_{p})}{Pq}}, \qquad (18)$$

is used as the performance metric.
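Eq. (18) translates directly, again with patterns stored one per row:

```python
import numpy as np

def rmse(T, O):
    """Root-mean-square error of Eq. (18): P patterns (rows), q outputs (columns)."""
    P, q = T.shape
    return np.sqrt(np.sum((T - O) ** 2) / (P * q))
```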
4.1. Modified XOR problem
A modified XOR problem was used by Gori and Tesi to illustrate the problem of local
minima during training with the BP algorithm [13,14]. It is the classical XOR problem
with one more pattern introduced such that a unique global minimum exists. The truth
table of this problem is shown in Table 1. We used a 2–2–1 network architecture, the
smallest number of connection weights (nine parameters only) that can accomplish the
modified XOR problem. Fig. 1 shows a 3-D error surface plot of this problem. In the
range shown, the problem has three minima, one of which is the global minimum.
Using the search range [−5, 5], the LS-PEN algorithm is started at one of the
local minima to search for the global minimum. The 2-D contour plot of this problem
and the search trajectory of the LS-PEN algorithm are shown in Fig. 2. The trajectory
is pulled out of the local minimum in the first hundred iterations and then follows the
surface contour to reach the global minimum in the end. Besides, to investigate
the convergence performance, the LS-PEN algorithm was compared with the
Table 1
The truth table of the modified XOR problem

Pattern   X1     X2     Target
C1        0.01   0.01   0.01
C2        0.01   0.99   0.99
C3        0.99   0.01   0.99
C4        0.99   0.99   0.01
C5        0.50   0.50   0.99
Fig. 1. 3-D error surface plot of the modified XOR problem.
Fig. 2. 2-D contour plot of the modified XOR problem with superimposed search trajectory of the
LS-PEN algorithm. (The black solid line represents the trajectory formed by the LS-PEN-based algorithm.)
network trained with other well-known algorithms: the conventional BP, fast
BP and Levenberg–Marquardt (LM) [15] algorithms. The stopping criterion is
chosen as RMSE = 0.01 or 10,000 iterations. Fig. 3a and b demon-
strate the convergence performance of the different training algorithms started from
different initial settings. Fig. 3a shows that all the tested algorithms are capable of
converging to the specified error level within a reasonable number of training epochs. Compared
Fig. 3. Convergence curves of the MLP networks trained by the conventional BP, fast BP, LM and the
LS-PEN-based algorithms on the modified XOR problem: (a) started at initial weights in the general case;
(b) started at initial weights in the special case.
to the conventional algorithms, the LS-PEN algorithm provides the fastest rate of
convergence. Fig. 3b illustrates that the proposed LS-PEN algorithm is capable of
escaping from a local minimum while the other tested algorithms remain stuck in local
minima. In this test, the training process started from a specific set of initial weights,
from which the other conventional algorithms are unable to converge to the required
Fig. 4. Comparison of the ability of the different algorithms to converge to the global minimum (best
solution) on the modified XOR problem.
error level. In addition, we also investigated the performance of the LS-PEN algorithm
by running 100 independent trials with different initializations. The histogram of
Fig. 4 shows that the LS-PEN algorithm consistently provides the best results.
Among these 100 runs, the LS-PEN algorithm achieved the highest rate (about 90%) of
successful convergence to the global minimum, whereas the BP algorithm became stuck
in a local minimum in over 20% of the runs.
4.2. Parity 4-bit problem

The parity 4-bit problem is essentially a generalization of the XOR problem to
K inputs and has been extensively discussed in the literature. The single output is
required to be one if an odd number of inputs are one; otherwise the output is zero. It
is usually used for evaluating algorithm performance and has been regarded as
a challenging problem that is well known to suffer from local minima. The
network topology has four input, four hidden and one output nodes, and the training set
has 16 patterns. The LS-PEN-based global learning algorithm is compared to two
types of search method: local search algorithms, namely the backpropa-
gation (BP) and Levenberg–Marquardt (LM) [15] algorithms, and global search
algorithms, namely simulated annealing (SA) as implemented in [12] and the genetic
algorithm (GA) as implemented in [17]. Fifty independent trials were run,
and the results are tabulated in Table 2. It is noted that the LS-PEN algorithm provides
both the merit of faster convergence (only 4.9 min) and a lower training error
(RMSE = 0.02237) than the other algorithms. Comparing the computational complex-
ity of the algorithms, the ratio of the number of floating-point operations (flops)
between the BP, LM, SA, GA and LS-PEN algorithms is 1 : 8.9 : 10.6 : 7.8 : 1.9. In
terms of generalization on this problem, our proposed LS-PEN algorithm produced
the best solution, with an accuracy of 93.7% after escaping from the local
minima.
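The 16 training patterns of the parity 4-bit task are easy to regenerate; a sketch of our own (the 0.01/0.99 target coding used elsewhere in the paper is omitted here):

```python
import numpy as np

def parity_patterns(n_bits=4):
    """All 2^n binary input patterns with odd-parity targets (Section 4.2)."""
    X = np.array([[(i >> b) & 1 for b in range(n_bits)]
                  for i in range(2 ** n_bits)], dtype=float)
    t = X.sum(axis=1) % 2        # output is 1 iff an odd number of inputs are 1
    return X, t.reshape(-1, 1)
```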
Table 2
The performance of different algorithms on the parity 4-bit problem

Algorithm                      Averaged number of epochs   CPU time taken at the       Final RMSE        Mflops
                               when RMSE <= 0.05           stopping criterion (min)    learning error    per epoch
Back-propagation (BP)          >10,000                     5.1                         0.49004           3.64
Levenberg–Marquardt (LM)       1628                        15.2                        0.36294           32.73
Simulated annealing (SA)       >10,000                     193.6                       0.18978           38.63
Genetic algorithm (GA)         8200                        156.3                       0.17654           28.28
LS-PEN-based global learning   517                         4.9                         0.02237           7.24

Mflops: number of floating-point operations (×10^6) per epoch.
4.3. Two-spiral classification problem
The two-spiral benchmark is chosen because it is reputed to be an extremely
demanding classification task for any algorithm. The training set consists of 194 X–Y
values, half of which are to produce a +1 value and the other half a 0 value. The
goal is to develop a neural network that correctly classifies all 194 training patterns.
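The paper does not restate the generating formula; the sketch below uses the common CMU-benchmark construction (97 points per spiral, radius shrinking with angle, the second spiral a point reflection of the first) as an assumption:

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Two-spiral training set in the style of the CMU benchmark (194 points)."""
    i = np.arange(n_per_class)
    phi = i * np.pi / 16.0                     # angle grows by pi/16 per point
    r = 6.5 * (104.0 - i) / 104.0              # radius shrinks toward the centre
    x, y = r * np.sin(phi), r * np.cos(phi)
    X = np.vstack([np.column_stack([x, y]),    # spiral 1 -> target +1
                   np.column_stack([-x, -y])]) # point-reflected spiral -> target 0
    t = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, t
```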
Hitherto, the published results for this problem include training MLP networks using
the conventional BP algorithm, the cascade-correlation algorithm [10] and the projection
pursuit learning algorithm [18]. Lang and Witbrock [20] solved the problem using
a 2–5–5–5–1 network (three hidden layers of five neurons each) that also provides
"shortcut" connections (i.e. each unit receives incoming connections from every unit
in every earlier layer). With this complicated topology, the conventional BP algorithm
was able to solve the problem in 20,000 epochs, the BP algorithm with a modified cost
function required 12,000 epochs, and Quickprop required approximately 8000 epochs.
Solving this classification problem is therefore very time consuming, and the
computational cost is high. Shang and Wah used a global optimization algorithm,
namely NOVEL, on a single-hidden-layer network with lateral connections between
the hidden neurons [24]. The obtained results are promising, but an extremely long
computation time of about 10 h was required for this benchmark problem, even though
it was executed on a multi-processing workstation. In order to validate the proposed
LS-PEN-based global algorithm, the two-spiral problem is run for 50 independent
trials with different initializations. A single-hidden-layer network without any lateral
connections was implemented to tackle this problem. The number of hidden units is
chosen as 30, 40 and 50 for different topologies. The computational complexity is
reduced substantially compared to other global optimization methods, and the rate of
convergence is speeded up by the fast convergence characteristics of the least-squares
method. The learning CPU times taken by the LS-PEN-based algorithm are about
4.75, 2.67 and 1.29 h for 30, 40 and 50 hidden units, respectively. Meanwhile,
Figs. 5 and 6 summarize the best solutions of the classification contour
Fig. 5. The output classification contour images for the two-spiral problem created by the LS-PEN algorithm
with different topologies: (a) 2-30-1 (30 hidden units); (b) 2-50-1 (50 hidden units).
Fig. 6. The output classification contour images for the two-spiral problem created by the simulated annealing
algorithm with different topologies: (a) 2-30-1 (30 hidden units); (b) 2-50-1 (50 hidden units).
images obtained by the MLP networks trained by the proposed LS-PEN algorithm
and by simulated annealing (implemented as in [12]), respectively. The overall classi-
fication accuracy obtained by the LS-PEN-based algorithm is over 95%, whereas the
simulated annealing algorithm yields below 70% accuracy. The results
show that our proposed LS-PEN-based algorithm produces excellent classification
results and finds an optimal solution for very complicated problems.
5. Conclusion
The problems of local minima and slow convergence in conventional learning
algorithms for multilayer neural networks are addressed in this paper. A fast and
robust global learning algorithm, based upon least-squares and penalized optimiza-
tion methods (LS-PEN), is proposed to tackle these problems. The LS method is
employed to determine the weights connecting the output layer to the hidden layer;
as a result, the rate of convergence benefits from the fast-converging characteristic
of the least-squares method. On the other hand, to tackle convergence stalling, the
weights between the hidden layer and the input layer are evaluated by the well-known
penalized optimization method. This heuristic mechanism provides a force that drives
the trajectory out of local minima until the network converges again. The simulation
results strongly demonstrate that the LS-PEN global learning algorithm outperforms
the other algorithms and is well suited to complicated applications.
Acknowledgements
The authors would like to thank the anonymous reviewers for their useful com-
ments and suggestions.
Appendix A
Assume the weight change $\Delta w_{j}(n)$ is the weight difference between $w_{j}(n)$ and
$w_{j}(n-1)$, so that, from Eqs. (12b) and (13),

$$\Delta w_{j}(n) = -\left(\frac{\partial E_{N}(n)}{\partial w_{j}} + \frac{\partial E_{P}(n)}{\partial w_{j}}\right)
= -\left(\frac{\partial E_{N}(n)}{\partial w_{j}} - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right)\right). \quad \mathrm{(A.1)}$$

Therefore, from Eq. (15),

$$\Delta E_{M}(n) = E_{M}(n+1) - E_{M}(n)
= -\left\|\frac{\partial E_{N}(n)}{\partial w_{j}} - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right)\right\|^{2}, \quad \mathrm{(A.2)}$$

where $\|\cdot\|$ is the norm of a vector. By the convergence theorem of Eq. (15),
$\Delta E_{M}(n) \le 0$; assuming the learning rate is a positive constant, the penalty term
must dominate the original gradient for the trajectory to be pushed out of the local
minimum:

$$\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\| - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right) \le 0. \quad \mathrm{(A.3)}$$

Rearranging (A.3) and taking logarithms,

$$-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}
\ge \ln \lambda + \ln \frac{\|\partial E_{N}(n)/\partial w_{j}\|}{2\,\|w_{j}(n-1) - w_{j}^{*}\|}. \quad \mathrm{(A.4)}$$

Initially, we assume that $\lambda = \lambda_{0} > 0$ is a quite large number, so that as
$\lambda \to \infty$ we have $1/\lambda \to 0$ and the exponential in (A.3) tends to 1; (A.3)
then reduces to

$$\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\| \le \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|,$$

therefore

$$\lambda \le 2\,\|w_{j}(n-1) - w_{j}^{*}\|\,\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\|^{-1}, \quad \mathrm{(A.5)}$$

from which Eq. (17) follows.
References
[1] E. Aarts, J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to
Combinatorial Optimization and Neural Computing, Wiley, New York, 1989.
[2] M.R. Azimi-Sadjadi, S. Citrin, Fast learning process of multilayer neural nets using recursive least
squares technique, Proceedings of the IEEE International Conference on Neural Networks, Washington,
DC, May 1989.
[3] M.R. Azimi-Sadjadi, R. Liou, Fast learning process of multilayer neural networks using recursive least
squares technique, IEEE Trans. Signal Process. 40 (2) (1992) 446–450.
[4] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's
method, Neural Comput. 4 (2) (1992) 141–166.
[5] R. Battiti, G. Tecchiolli, Training neural nets with the reactive tabu search, IEEE Trans. Neural
Networks 6 (5) (1995) 1185–1200.
[6] S.-Y. Cho, T.W.S. Chow, An efficient and stable recurrent neural networks training algorithm for time
series forecasting, Int. J. Knowledge-Based Intell. Eng. Systems 1 (3) (1997) 138–148.
[7] T.W.S. Chow, S.Y. Cho, An accelerated recurrent network training algorithm using IIR filter model
and recursive least squares method, IEEE Trans. Circuits Systems I: Fund. Theory Appl. 44 (11)
(1997) 1082–1086.
[8] N. Dodd, Optimization of network structure using genetic techniques, International Joint Conference
on Neural Networks, San Diego, 1990.
[9] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[10] S.E. Fahlman, C. Lebiere, The cascade-correlation learning architecture, Technical Report CMU-
CS-90-100, School of Computer Science, Carnegie Mellon University, 1990.
[11] C.A. Floudas, P.M. Pardalos, State of the Art in Global Optimization, Kluwer Academic Publishers,
Dordrecht, The Netherlands, 1996.
[12] W.L. Goffe, G.D. Ferrier, J. Rogers, Global optimization of statistical functions with simulated
annealing, J. Econometrics 60 (1994) 65–99.
[13] M. Gori, A. Tesi, Some examples of local minima during learning with back-propagation, Parallel
Architectures and Neural Networks, Vietri sul Mare (IT), May 1990.
[14] M. Gori, A. Tesi, On the problem of local minima in backpropagation, IEEE Trans. Pattern Anal.
Mach. Intell. 14 (1) (1992) 76–86.
[15] M.T. Hagan, M.B. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE
Trans. Neural Networks 5 (6) (1994) 989–993.
[16] R. Horst, P.M. Pardalos, Handbook of Global Optimization, Kluwer Academic Publishers, Dor-
drecht, The Netherlands, 1995.
[17] C.R. Houck, J.A. Joines, M.G. Kay, A genetic algorithm for function optimization: a Matlab
implementation, ACM Trans. Math. Software, submitted for publication.
[18] J. Hwang, S. Lay, M. Maechler, R.D. Martin, J. Schimert, Regression modeling in backpropagation
and projection pursuit learning, IEEE Trans. Neural Networks 5 (1994) 1–24.
[19] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks
1 (1988) 295–307.
[20] K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, Proceedings of the 1988 Connectionist
Models Summer School, Morgan Kaufmann, 1988.
[21] J.R. McDonnell, D. Waagen, Determining neural network connectivity using evolutionary program-
ming, Twenty-Sixth Asilomar Conference on Signals, Systems, and Computers, Monterey, CA, 1992.
[22] S.S. Rao, Optimization Theory and Applications, second ed., Wiley, New York, 1984.
[23] S. Renals, R. Rohwer, Phoneme classification experiments using radial basis functions, Proceedings of
the IEEE International Joint Conference on Neural Networks, Washington, DC, June 1989, pp. I:461–467.
[24] Y. Shang, B.W. Wah, Global optimization for neural network training, IEEE Comput. 29 (3) (1996)
45–54.
[25] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural
Networks 3 (1990) 561–573.
[26] A. Törn, A. Žilinskas, Global Optimization, Springer, Berlin, 1987.
[27] M. Towsey, D. Alpsan, L. Sztriha, Training a neural network with conjugate gradient methods,
Proceedings of ICNN'95, Perth, Western Australia, 1995, pp. 373–378.
[28] R.J. Williams, Training recurrent networks using the extended Kalman filter, International Joint
Conference on Neural Networks, Baltimore, 1992, vol. IV, pp. 241–246.
[29] Y.F. Yam, T.W.S. Chow, Extended backpropagation algorithm, Electron. Lett. 29 (19) (1993)
1701–1702.
[30] Y.F. Yam, T.W.S. Chow, Accelerated training algorithm for feedforward neural networks based on
linear least squares problems, Neural Process. Lett. 2 (4) (1995) 20–25.
[31] Y.F. Yam, T.W.S. Chow, Extended least squares based algorithm for training feedforward networks,
IEEE Trans. Neural Networks 8 (3) (1997) 806–810.
[32] Q. Zheng, D. Zhuang, Integral global optimization of constrained problems in functional spaces with
discontinuous penalty functions, in: C.A. Floudas, P.M. Pardalos (Eds.), Recent Advances in Global
Optimization, Princeton University Press, Princeton, NJ, 1992, pp. 298–320.
Siu-Yeung Cho received his B.Eng. (Hons) degree in electronic and computer
engineering from the University of Brighton, U.K., in 1994. He is now a Ph.D. student
in the Department of Electronic Engineering of the City University of Hong Kong.
His major fields of interest include neural network theory and its applications,
and computer vision.
Tommy W.S. Chow received the B.Sc. (first class honors) and Ph.D. degrees from the University of
Sunderland, U.K., in 1984 and 1988, respectively. He is currently an Associate Professor in the Department
of Electronic Engineering, City University of Hong Kong. His current research areas are neural networks,
system identification and fault diagnostic analysis.