Neurocomputing 25 (1999) 115–131
Training multilayer neural networks using fast global
learning algorithm - least-squares and penalized
optimization methods
Siu-yeung Cho*, Tommy W.S. Chow
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Received 1 May 1998; accepted 8 December 1998
Abstract
The major limitations of conventional learning algorithms are local minima and slow
convergence. This paper presents a novel heuristic global learning algorithm for
neural networks. The proposed algorithm is based upon the least-squares (LS) method,
to maintain a fast convergence speed, and a penalty (PEN) approach, to solve the problem of
local minima. The penalty term is superimposed onto the error surface and provides
a way of escaping from local minima when the convergence stalls. The choice and adjustment
of the penalty factor are also derived, to demonstrate the effect of the penalty term and to
ensure the convergence of the algorithm. The developed learning algorithm is applied to several
classification problems. In all the tested problems, the proposed algorithm
outperforms conventional algorithms in terms of convergence speed and the ability to
escape from local minima. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Multilayer neural networks; Global learning algorithm; Least-squares method;
Penalized optimization
1. Introduction
The most influential effort in artificial neural network learning algorithms is the
development of the backpropagation (BP) algorithm, which is a systematic method for
* Corresponding author. Fax: 852-********.
E-mail address: abqjuq@r.postjobfree.com (S. Cho)
0925-2312/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0925-2312(99)00055-7
learning multilayer perceptron (MLP) neural networks. One of the primary tasks in
exploiting the BP algorithm involves the development of supervised learning method-
ologies, which aim at determining an optimal function mapping for training patterns and
unseen test patterns. Many supervised learning algorithms have been proposed based
on the popular sum-squared-error function. Basically, there are
two major shortcomings of the BP learning algorithm: the training may get
stuck in local minima, and the convergence speed is slow. In 1992, Gori and Tesi
proposed a theoretical framework for the standard BP algorithm in order to identify
the problem of local minima. In their studies, they also used examples to illustrate that
the gradient-descent-type BP algorithm is prone to local minima [13,14]. From a
theoretical point of view, their results lead us to believe that the standard BP
algorithm is not a very reliable learning algorithm. On the other hand, the learning
process cannot be guaranteed to complete within a reasonable time for most compli-
cated problems, such as [9,23]. Although the rate of convergence can be sped up
simply by selecting larger learning parameters, this would probably introduce oscilla-
tion and may result in failing to find an optimal solution. Many fast learning
algorithms were therefore proposed in the literature. Jacobs [19], Tollenaere [25],
and Yam and Chow [28] proposed several acceleration techniques that adapt
the learning parameters during the computational process. Other learning algorithms
based on different optimization methods were also proposed in recent years. For
instance, the conjugate-gradient (CG) method [6,27] and Newton's or quasi-Newton's
method [4] appear to be much faster learning algorithms, but at the expense of
memory size and computational complexity. Meanwhile, linear least-squares-based
[30,31], recursive least-squares (RLS) [2,3,7] and extended Kalman filter (EKF) [29]
learning algorithms have become widely used for multilayer neural networks. The
results indicated that these types of algorithms can provide a significant reduction in
the total number of iterations by exploiting the fast convergence characteristics of the
RLS method, but they suffer from convergence stalling, getting stuck in local minima
and poor generalization.
Global optimization techniques [11,16,26] have recently been employed to provide
training methods for MLP networks that avoid trapping in undesired local
minima. For instance, simulated annealing (SA) [1,12] is a stochastic global
minimization method that allows both downhill and uphill movements for escaping from
local minima, so that the search converges to a global minimum in the end.
The evolutionary algorithm (EA) [21] and the genetic algorithm (GA) [8,17], based on
computational models of evolution, are other global optimization mechanisms used
to determine connectivity in feedforward networks. These methods have been applied
to complex, multimodal minimization problems with both discrete and continuous
variables. Battiti and Tecchiolli proposed a heuristic scheme, the reactive tabu
search, based on a "modified local search" component complemented with a meta-
strategy, for neural network training [5]. Most recently, a trajectory-based
nonlinear optimization method was developed by Shang and Wah for neural net-
works [24]. Their approach combines global search with local search to find
an optimal solution. It relies on an external force to pull a search out of a local
minimum. This global optimization method is capable of providing very promising
results even when it is applied to very complicated recognition and classification
problems. In this algorithm, the evaluation of a trace function, leading the trajectory
out of local minima, requires the solution of an ordinary differential equation
(ODE). The algorithm is extremely computationally complex, requiring about
10 h for the two-spiral benchmark problem even on powerful machines or parallel
computers.
In this paper, a robust and efficient global learning algorithm, based on a hybrid
of the least-squares and penalty search (LS-PEN) methods, is developed to reduce the
convergence time. The proposed algorithm has a very strong ability to escape from
local minima. For simplicity, we assume that the MLP network has a single hidden
layer. The weights (V), connecting the output layer to the hidden layer, are
first determined using the least-squares (LS)-based method. Afterwards, the weights
(W), connecting the hidden layer to the input layer, are determined by
gradient descent optimization. When the algorithm gets stuck, the learning dynamics
switch to minimizing a new penalty-approach cost function [22,32] in terms of the
weight matrix (W). The novel penalty approach is defined by superimposing a Gaus-
sian-type function over the weight space domain. Many different types of function
may be theoretically suitable; our preliminary studies included the Laplace
function, the Rayleigh function and the Gaussian function. In this paper, all the presented
work is based on the Gaussian function because of its outstanding performance
compared to the others. Despite the success of our proposed heuristic LS-PEN
algorithm in performing global learning, like all other learning algorithms, we are still
unable to determine the distribution of minima in error surfaces and unable to identify
whether a minimum is a local or a global one. In terms of its working mechanism,
the weights (V) are estimated by the LS-based method whilst the weights (W) are
estimated by the penalty-based global optimization. This enables the dimen-
sionality of the weight space for the global search algorithm to be significantly
reduced, which speeds up the rate of convergence. The proposed
LS-PEN algorithm has been validated on a modified XOR problem, given in Gori
and Tesi's study, which has a unique global minimum. The results illustrate that our
proposed algorithm is capable of escaping from local minima.
The LS-PEN algorithm is also validated on the parity 4-bit problem and the
two-spiral benchmark problem.
This paper is organized as follows: the background of our global optimization
approach is presented in Section 2. The design of the LS-PEN algorithm for MLP
network training is derived and summarized in Section 3; the choice and adaptation
of the penalty factor are also discussed in that section. In Section 4, the
simulation results for classification tasks are discussed, and finally the con-
clusion is drawn in Section 5.
2. Background of our global learning algorithm
Let us define the global optimization problem to be considered in this paper
as follows. Let $E(w): \mathbb{R}^{n} \to \mathbb{R}$ be a cost function that is twice
continuously differentiable, where w is a vector of n state variables or parameters. The
objective is to find the global minimum solution $w_{GM}$ which minimizes E(w),
$$E^{*} = E(w_{GM}) = \min\{E(w) \mid w \in I\}, \qquad (1)$$
where I is the domain of the state variables over which one seeks the global minimum
and I is assumed to be compact and connected. Further, we assume that every local
minimum $w_{LM}$ of $E(w)$ in $I$ satisfies the conditions

$$\frac{\partial E(w_{LM})}{\partial w} = 0, \qquad (2)$$

$$y^{T}\,\frac{\partial^{2} E(w_{LM})}{\partial w^{2}}\, y \ge 0 \quad \forall\, y \in \mathbb{R}^{n}. \qquad (3)$$
Based on the above conditions, we assume that the global minimum satisfies these
local-minimum conditions and that it does not occur beyond the boundary of I.
A well-known penalized optimization method, introduced in [22,32], states
that problem (1) can be approximated by the solutions of associated penalized uncon-
strained problems in finite-dimensional spaces, as shown below:

$$\min\{E(w) + \lambda E_{P}(w)\} = \min\{E(w) \mid w \in I\} = E^{*}, \qquad (4)$$

where $E_{P}(w)$ is a penalty function for the constraint set $I$ and $\lambda$ is a real number
greater than 0. The penalized optimization technique provides an uphill force whenever the
convergence is trapped in local minima. Thus, a discontinuous penalty function,
defined as in [32], should satisfy

$$E_{P}(w) = 0, \quad w \in I; \qquad E_{P}(w) = g(w), \quad w \notin I, \qquad (5)$$
where $g(w)$ is a penalty-like function. In our study, a Gaussian-type function is chosen.
3. Least squares and penalized (LS-PEN) global learning algorithm
3.1. Derivation of the LS-PEN-based global learning algorithm
For simplicity, we consider an MLP network with a single hidden layer.
Suppose the network has m input units, n hidden units and q output units, and we
assume that the activation function takes the form $\varphi(x) = 1/(1+\exp(-x))$. The
general form of the MLP network can be represented as follows:

$$o_{k} = \sum_{j=1}^{n} v_{kj}\,\varphi\!\left(\sum_{i=1}^{m} w_{ji} x_{i} + b_{j}\right) + \gamma_{k}, \qquad 1 \le k \le q, \qquad (6)$$

where $o_{k}$ and $x_{i}$ denote network output and input values, respectively. $w_{ji}$ and
$b_{j}$ denote the synaptic weights and bias terms, respectively, connecting the
hidden layer to the input layer, which form the elements of the weight matrix W.
Similarly, $v_{kj}$ and $\gamma_{k}$ denote the synapses and biases, respectively, connecting the
output layer to the hidden layer, which form the elements of the weight matrix V.
Assume there are P patterns in the training set. For pattern $p = 1, 2, \ldots, P$, let
$t_{p} = (t_{1p}, t_{2p}, \ldots, t_{qp})^{T}$ and $o_{p} = (o_{1p}, o_{2p}, \ldots, o_{qp})^{T}$ denote the desired output vector and
the true output vector of the network. $a_{p} = (a_{1p}, a_{2p}, \ldots, a_{np}, 1)^{T}$ denotes the vector of
outputs at the hidden layer, where the entries are $a_{jp} = \varphi\left(\sum_{i=1}^{m} w_{ji} x_{ip} + b_{j}\right)$. $x_{p} =
(x_{1p}, x_{2p}, \ldots, x_{mp}, 1)^{T}$ denotes the vector of inputs of the network.
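The mapping of Eq. (6) translates into a few lines of NumPy. The sketch below is our own illustration, not the authors' code; storing each bias term as the last column of its weight matrix (matching the appended 1 in $a_p$ and $x_p$) is our layout assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, V, X):
    """Forward pass of the single-hidden-layer MLP of Eq. (6).

    W : (n, m+1) hidden weights; last column holds the biases b_j
    V : (q, n+1) output weights; last column holds the output biases
    X : (P, m) input patterns, one pattern per row
    Returns (Ab, O): hidden vectors a_p with the appended 1, and the
    network outputs o_p, both stored one pattern per row.
    """
    P = X.shape[0]
    Xb = np.hstack([X, np.ones((P, 1))])   # x_p = (x_1p, ..., x_mp, 1)^T
    A = sigmoid(Xb @ W.T)                  # a_jp = phi(sum_i w_ji x_ip + b_j)
    Ab = np.hstack([A, np.ones((P, 1))])   # a_p = (a_1p, ..., a_np, 1)^T
    O = Ab @ V.T                           # o_kp = sum_j v_kj a_jp + bias
    return Ab, O
```

Keeping patterns as rows makes the batch pass over all P patterns a single pair of matrix products.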
The total sum-squared-error function

$$E_{T}(V, W) = \sum_{p=1}^{P} (t_{p} - o_{p})^{T}(t_{p} - o_{p}) \qquad (7)$$

is chosen as the cost function for the MLP network optimization. The goal of the
learning algorithm is to optimize the weights of the network by minimizing
this cost function. Setting all the derivatives of $E_{T}(V, W)$ with respect to V equal to
zero, the optimal weight matrix V can be computed exactly as

$$V = T A^{T} (A A^{T})^{+}, \qquad (8)$$

where $T = (t_{1}, t_{2}, \ldots, t_{P})$ and $A = (a_{1}, a_{2}, \ldots, a_{P})$, and "+" denotes the
pseudo-inverse of a matrix. Thus, the cost function $E_{T}(V, W)$ can be reformulated as

$$E_{T}(V, W) = E_{T}(V(W), W) = E_{N}(W), \qquad (9)$$
because the weight matrix V can be expressed as a matrix function of W. There
exists a new weight space of lower dimension that maps into the original weight
space. In other words, minimizing the new cost function $E_{N}(W)$ is equivalent
to minimizing the error function over the whole weight space. In this study, the
weight matrix W is iteratively estimated by minimizing this cost function via
gradient descent optimization.
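Eq. (8) is a one-line computation in NumPy; the helper below is our own sketch, with A stored column-wise as in the text:

```python
import numpy as np

def ls_output_weights(A, T):
    """Optimal output weights of Eq. (8): V = T A^T (A A^T)^+.

    A : (n+1, P) matrix whose columns are the hidden vectors a_p
    T : (q, P) matrix whose columns are the targets t_p
    """
    return T @ A.T @ np.linalg.pinv(A @ A.T)
```

In practice `np.linalg.lstsq(A.T, T.T, rcond=None)` yields the same least-squares minimizer with better numerical behaviour when $AA^T$ is ill-conditioned.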
As stated above, minimization by gradient descent suffers from the problem of
local minima. Therefore, it is advisable to modify the new cost function $E_{N}(W)$
by including a penalty function that provides a way out of local minima when the
convergence gets stuck. In this paper, the penalty function is a Gaussian function
defined over the weight space domain.
The proposed penalty function is given by

$$E_{P}(W) = \sum_{j=1}^{n} \sum_{i=1}^{m} \exp\!\left(-\frac{|w_{ji}(n-1) - w_{ji}^{*}|^{2}}{\lambda}\right), \qquad (10)$$

where $w_{ji}(n-1)$ is the weight in the specified weight space at the previous iteration,
and $w_{ji}^{*}$ is the sub-optimal weight value at which the search is assumed to be stuck
in a local minimum. $\lambda$ denotes the penalty factor, which determines the influence of
the penalty term relative to the original least-squares cost function. This weighting
factor is adjusted iteratively to control the breadth of the search and force the
trajectory out of local minima. The correct choice and adaptation procedure for
$\lambda$ are described in Section
3.2. In accordance with the penalty approach, the modified cost function for the
LS-PEN algorithm is defined as

$$E_{M}(W) = E_{N}(W) + E_{P}(W). \qquad (11)$$
Consequently, the update equation for the weights connecting the hidden
layer to the input layer is

$$w_{ji}^{**}(n) = w_{ji}(n-1) + \eta\, \Delta w_{ji}(n), \qquad (12a)$$

$$\Delta w_{ji}(n) = -\left(\frac{\partial E_{N}(W)}{\partial w_{ji}} + \frac{\partial E_{P}(W)}{\partial w_{ji}}\right), \qquad (12b)$$

where $w_{ji}^{**}(n)$ is an optimal weight in the specified weight space obtained by
minimizing the modified cost function, and $\eta$ denotes the learning rate of the
algorithm. $\partial E_{N}/\partial w_{ji}$ is the original gradient of $E_{N}$ with respect to $w_{ji}$, and
$\partial E_{P}/\partial w_{ji}$ is the derivative of the penalty function $E_{P}(W)$ with respect to $w_{ji}$;
from Eq. (10) we define

$$\frac{\partial E_{P}(W)}{\partial w_{ji}} = -\frac{2}{\lambda}\,\bigl(w_{ji}(n-1) - w_{ji}^{*}\bigr)\exp\!\left(-\frac{|w_{ji}(n-1) - w_{ji}^{*}|^{2}}{\lambda}\right) \qquad (13)$$

as the proposed penalty term.
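Eqs. (10) and (13) act elementwise on the weight matrix, so both fit in a couple of NumPy lines; this is an illustrative sketch of ours, not the authors' implementation:

```python
import numpy as np

def penalty(W_prev, W_star, lam):
    """Gaussian penalty E_P(W) of Eq. (10), summed over all weights w_ji."""
    d = W_prev - W_star
    return np.sum(np.exp(-d ** 2 / lam))

def penalty_grad(W_prev, W_star, lam):
    """Elementwise penalty gradient dE_P/dw_ji of Eq. (13)."""
    d = W_prev - W_star
    return -(2.0 / lam) * d * np.exp(-d ** 2 / lam)
```

A central finite-difference check confirms that Eq. (13) is the exact derivative of Eq. (10).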
3.2. Choice and adjustment for the penalty factor
In accordance with Eq. (13), the penalty factor $\lambda$ determines the effect of the
penalty term against the original negative gradient $-\partial E_{N}/\partial w_{ji}$. A correct choice
and adjustment of $\lambda$ is critical to the overall global optimization
performance. The selection of $\lambda$ is based on the following conditions:
1. If $\lambda$ is too small, large weight changes are possible because of the strong effect of the
penalty term. In this case the algorithm is unstable, which may result in an invalid
solution.
2. If $\lambda$ is too large, only very small weight changes are allowed for the trajectory to
escape from local minima, because the penalty term is virtually redundant.
The best choice of $\lambda$ lies between these two extremes, subject to the condition of
assuring the training convergence. The proposed global algorithm is said to
converge in the mean square if the mean-square value of the error vector
$e_{p}(n) = t_{p} - o_{p}(n)$ approaches a constant value as the number of iterations n ap-
proaches infinity; that is,

$$E\{e_{p}(n)^{T} e_{p}(n)\} \to \text{constant as } n \to \infty, \qquad (14)$$

which implies that

$$\Delta E_{M}(n) = E_{M}(n+1) - E_{M}(n) \le 0 \quad \text{as } n \to \infty. \qquad (15)$$
The modified cost function can be approximated by a first-order Taylor series:

$$E_{M}(n+1) = E_{M}(n) + \left[\frac{\partial E_{M}(n)}{\partial w_{j}}\right]^{T} \Delta w_{j}(n), \qquad (16)$$

where $[\partial E_{M}(n)/\partial w_{j}]^{T}$ represents the gradient of the modified cost function with
respect to the weight vector $w_{j}$, and $\Delta w_{j}(n)$ represents the change in an arbitrary
weight vector. Based on the above expressions, the guideline for the choice of $\lambda$ is

$$0 < \lambda \le 2\,\|w_{j}(n-1) - w_{j}^{*}\|\,\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\|^{-1}. \qquad (17)$$

The derivation of the upper bound of $\lambda$ is briefly given in Appendix A.
In addition, the penalty factor is adjusted to control the global search ability
according to the following conditions:
1. When the convergence is stuck in a local minimum, the penalized optimization is
introduced; $\lambda(t)$ decreases gradually by 0.3% of $\lambda(t-1)$ to provide an uphill
force for escaping from the local minimum.
2. After running the penalized optimization for a few iterations, $\lambda(t)$ is
increased by 1% of $\lambda(t-1)$ to diminish the effect of the penalty term once the
training error starts to decrease.
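The schedule above can be written as a small helper; the boolean arguments are our own framing of conditions 1 and 2, not notation from the paper:

```python
def adapt_lambda(lam, stalled, error_decreasing, lam0):
    """Penalty-factor schedule of Section 3.2 (illustrative sketch).

    stalled          : convergence is trapped and an uphill force is needed
    error_decreasing : the training error has started to fall again
    lam0             : initial value that lambda is restored to otherwise
    """
    if stalled:
        return 0.997 * lam   # condition 1: shrink by 0.3% to strengthen the penalty
    if error_decreasing:
        return 1.01 * lam    # condition 2: grow by 1% to fade the penalty out
    return lam0
```

Note the asymmetry: the decrease (0.3%) is gentler than the increase (1%), so the penalty is withdrawn faster than it is applied once normal convergence resumes.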
3.3. Summary of the LS-PEN based global learning algorithm
The LS-PEN algorithm can be summarized as follows:
Step 1: Algorithm initialization: set n = 0. Randomize all of the weights, V and W, in
the whole network with zero mean and unit variance. Set the learning rate η = 0.1,
the penalty factor λ = λ0 = 1000 and g_flag = 0.
Step 2: Propagate the given training patterns through the whole network and
calculate E_T(n) by Eq. (7).
IF the stopping criterion is met, THEN the training stops, ELSE set
E_T(n−1) = E_T(n), E_T* = E_T(n), V* = V(n) and W* = W(n).
Step 3: Set n = n + 1.
Step 4: Optimize the weights V by Eq. (8). Optimize the weights W by the gradient
descent method.
Step 5: Propagate the given training patterns through the whole network and
calculate E_T(n) by Eq. (7).
IF the stopping criterion is met, THEN the training stops; go to Step 8.
IF E_T(n) < E_T*, THEN set E_T* = E_T(n), V* = V(n) and W* = W(n); go to Step 3.
IF E_T(n) ≥ E_T(n−1), THEN set g_flag = 1 and λ = 0.997λ; ELSE-IF E_T(n) ≥ E_T*,
THEN set λ = 1.01λ; ELSE set g_flag = 0 and λ = λ0.
Step 6: IF g_flag = 1, THEN set n = n + 1, calculate ∂E_P(W)/∂w_ji by Eq. (13) and
optimize W(n) by Eqs. (12a) and (12b), ELSE optimize W(n) by the conventional
gradient descent method.
Step 7: Compute V(n) by Eq. (8) and repeat Step 5.
Step 8: When the training stops, V* and W* are the optimal solutions.
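A condensed, illustrative rendering of Steps 1–8 is given below. The full bookkeeping of E_T*, V*, W* is simplified, and the network layout, stall test and stopping behaviour are our assumptions; it is a sketch of the alternating LS/penalized-gradient scheme, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ls_pen_train(X, T, n_hidden=4, eta=0.1, lam0=1000.0, iters=300, seed=0):
    """Condensed LS-PEN loop: LS solve for V (Eq. (8)), gradient descent on W,
    with the Gaussian penalty gradient (Eq. (13)) switched in while the error stalls."""
    rng = np.random.default_rng(seed)
    P, m = X.shape
    W = rng.standard_normal((n_hidden, m + 1))      # hidden weights + bias column
    Xb = np.hstack([X, np.ones((P, 1))])
    lam, g_flag, W_star, E_prev = lam0, False, None, np.inf
    for _ in range(iters):
        A = sigmoid(Xb @ W.T)
        Ab = np.hstack([A, np.ones((P, 1))])
        V = T.T @ Ab @ np.linalg.pinv(Ab.T @ Ab)    # Eq. (8), patterns stored row-wise
        O = Ab @ V.T
        err = T - O
        E = 0.5 * np.sum(err ** 2)
        if E >= E_prev and not g_flag:              # stall detected: engage the penalty
            g_flag, W_star = True, W.copy()
        elif g_flag and E < E_prev:
            lam *= 1.01                             # error falling again: fade the penalty
            if lam >= lam0:
                g_flag, lam = False, lam0
        grad = ((-err @ V)[:, :n_hidden] * A * (1.0 - A)).T @ Xb   # dE_N/dW
        if g_flag:
            lam *= 0.997                            # strengthen the uphill force
            d = W - W_star
            grad += -(2.0 / lam) * d * np.exp(-d ** 2 / lam)       # Eq. (13)
        W -= eta * grad
        E_prev = E
    return W, V, E_prev
```

Because V is re-solved by least squares at every pass, the gradient search only has to explore the much smaller W-space, which is the source of the speed-up claimed in Section 1.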
4. Simulation results and discussion
The performance of the LS-PEN algorithm is investigated by applying it to the
modified XOR, parity 4-bit and two-spiral classification problems. In all cases,
a single-hidden-layer network was implemented. Initialization is performed by taking
the weights connecting the hidden and input layers randomly in the range
−10.0 to 10.0, and the weights connecting the output and hidden layers in
the range −1.0 to 1.0, with a uniform distribution. The root-mean-square error (RMSE),
defined by

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{p=1}^{P} (t_{p} - o_{p})^{T}(t_{p} - o_{p})}{Pq}}, \qquad (18)$$

is used as the performance metric.
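Eq. (18) translates directly, again with patterns stored one per row:

```python
import numpy as np

def rmse(T, O):
    """Root-mean-square error of Eq. (18): P patterns (rows), q outputs (columns)."""
    P, q = T.shape
    return np.sqrt(np.sum((T - O) ** 2) / (P * q))
```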
4.1. Modified XOR problem
A modified XOR problem was used by Gori and Tesi to illustrate the problem of local
minima during training with the BP algorithm [13,14]. It is the classical XOR problem
with one more pattern introduced such that a unique global minimum exists. The truth
table of this problem is shown in Table 1. We used a 2–2–1 network architecture, the
smallest number of connection weights (nine parameters only) that can accomplish the
modified XOR problem. Fig. 1 shows a 3-D error surface plot of this problem. In the
range shown, the problem has three minima, one of which is the global minimum.
Using the search range [−5, 5], the LS-PEN algorithm is started at one of the
local minima to search for the global minimum. The 2-D contour plot of this problem
and the search trajectory of the LS-PEN algorithm are shown in Fig. 2. The trajectory
is pulled out of the local minimum in the first hundred iterations and then follows the
surface contour to reach the global minimum in the end. Besides, to investigate
the convergence performance, the LS-PEN algorithm was compared with the
Table 1
The truth table of the modified XOR problem

Pattern   X1     X2     Target
C1        0.01   0.01   0.01
C2        0.01   0.99   0.99
C3        0.99   0.01   0.99
C4        0.99   0.99   0.01
C5        0.50   0.50   0.99
Fig. 1. 3-D error surface plot of the modified XOR problem.
Fig. 2. 2-D contour plot of the modified XOR problem with superimposed search trajectory of the
LS-PEN algorithm. (The black solid line represents the trajectory formed by the LS-PEN-based algorithm.)
network trained with other well-known algorithms: the conventional BP, fast
BP and Levenberg–Marquardt (LM) [15] algorithms. The stopping criterion is
chosen as RMSE = 0.01 or 10,000 iterations. Fig. 3a and b demon-
strate the convergence performance of the different training algorithms started from
different initial settings. Fig. 3a shows that all the tested algorithms are capable of
converging to the specified error level within a reasonable number of training epochs. Compared
Fig. 3. Convergence curves of the MLP networks trained by the conventional BP, fast BP, LM and the
LS-PEN-based algorithms on the modified XOR problem: (a) started at initial weights in the general case;
(b) started at initial weights in the special case.
to the conventional algorithms, the LS-PEN algorithm provides the fastest rate of
convergence. Fig. 3b illustrates that the proposed LS-PEN algorithm is capable of
escaping from a local minimum while the other tested algorithms remain stuck in local
minima. In this test, the training process started from a specific set of initial weights,
from which the other conventional algorithms are unable to converge to the required
Fig. 4. Comparison of the ability of the different algorithms to converge to the global minimum (best
solution) on the modified XOR problem.
error level. In addition, we also investigated the performance of the LS-PEN algorithm
by running 100 independent trials with different initializations. The histogram of
Fig. 4 shows that the LS-PEN algorithm consistently provides the best results.
Among these 100 runs, the LS-PEN algorithm achieved the highest rate (about 90%) of
successful convergence to the global minimum, whereas the BP algorithm became stuck
in a local minimum in over 20% of the runs.
4.2. Parity 4-bit problem

The parity 4-bit problem is essentially a generalization of the XOR problem to
K inputs and has been extensively discussed in the literature. The single output is
required to be one if an odd number of inputs are one; otherwise the output is zero. It
is usually used for evaluating algorithm performance and has been regarded as
a challenging problem that is well known to suffer from local minima. The
network topology has four input, four hidden and one output nodes, and the training set
has 16 patterns. The LS-PEN-based global learning algorithm is compared to two
types of search method: local search algorithms, namely the backpropa-
gation (BP) and Levenberg–Marquardt (LM) [15] algorithms, and global search
algorithms, namely simulated annealing (SA) as implemented in [12] and the genetic
algorithm (GA) as implemented in [17]. Fifty independent trials were run,
and the results are tabulated in Table 2. It is noted that the LS-PEN algorithm provides
both the merit of faster convergence (only 4.9 min) and a lower training error
(RMSE = 0.02237) than the other algorithms. Comparing the computational complex-
ity of the algorithms, the ratio of the number of floating-point operations (flops)
between the BP, LM, SA, GA and LS-PEN algorithms is 1 : 8.9 : 10.6 : 7.8 : 1.9. In
terms of generalization on this problem, our proposed LS-PEN algorithm produced
the best solution, with an accuracy of 93.7% after escaping from the local
minima.
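The 16 training patterns of the parity 4-bit task are easy to regenerate; a sketch of our own (the 0.01/0.99 target coding used elsewhere in the paper is omitted here):

```python
import numpy as np

def parity_patterns(n_bits=4):
    """All 2^n binary input patterns with odd-parity targets (Section 4.2)."""
    X = np.array([[(i >> b) & 1 for b in range(n_bits)]
                  for i in range(2 ** n_bits)], dtype=float)
    t = X.sum(axis=1) % 2        # output is 1 iff an odd number of inputs are 1
    return X, t.reshape(-1, 1)
```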
Table 2
The performance of different algorithms on the parity 4-bit problem

Algorithm                      Averaged number of epochs   CPU time taken at the       Final RMSE        Mflops
                               when RMSE <= 0.05           stopping criterion (min)    learning error    per epoch
Back-propagation (BP)          >10,000                     5.1                         0.49004           3.64
Levenberg–Marquardt (LM)       1628                        15.2                        0.36294           32.73
Simulated annealing (SA)       >10,000                     193.6                       0.18978           38.63
Genetic algorithm (GA)         8200                        156.3                       0.17654           28.28
LS-PEN-based global learning   517                         4.9                         0.02237           7.24

Mflops: number of floating-point operations (×10^6) per epoch.
4.3. Two-spiral classification problem
The two-spiral benchmark is chosen because it is reputed to be an extremely
demanding classification task for any algorithm. The training set consists of 194 X–Y
values, half of which are to produce a +1 value and the other half a 0 value. The
goal is to develop a neural network that correctly classifies all 194 training patterns.
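The paper does not restate the generating formula; the sketch below uses the common CMU-benchmark construction (97 points per spiral, radius shrinking with angle, the second spiral a point reflection of the first) as an assumption:

```python
import numpy as np

def two_spirals(n_per_class=97):
    """Two-spiral training set in the style of the CMU benchmark (194 points)."""
    i = np.arange(n_per_class)
    phi = i * np.pi / 16.0                     # angle grows by pi/16 per point
    r = 6.5 * (104.0 - i) / 104.0              # radius shrinks toward the centre
    x, y = r * np.sin(phi), r * np.cos(phi)
    X = np.vstack([np.column_stack([x, y]),    # spiral 1 -> target +1
                   np.column_stack([-x, -y])]) # point-reflected spiral -> target 0
    t = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, t
```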
Hitherto, the published results for this problem include training MLP networks using
the conventional BP algorithm, the cascade-correlation algorithm [10] and the projection
pursuit learning algorithm [18]. Lang and Witbrock [20] solved the problem using
a 2–5–5–5–1 network (three hidden layers of five neurons each) that also provides
"shortcut" connections (i.e. each unit receives incoming connections from every unit
in every earlier layer). With this complicated topology, the conventional BP algorithm
was able to solve the problem in 20,000 epochs, the BP algorithm with a modified cost
function required 12,000 epochs, and Quickprop required approximately 8000 epochs.
Solving this classification problem is therefore very time consuming, and the
computational cost is high. Shang and Wah used a global optimization algorithm,
namely NOVEL, on a single-hidden-layer network with lateral connections between
the hidden neurons [24]. The obtained results are promising, but an extremely long
computation time of about 10 h was required for this benchmark problem, even though
it was executed on a multi-processing workstation. In order to validate the proposed
LS-PEN-based global algorithm, the two-spiral problem is run for 50 independent
trials with different initializations. A single-hidden-layer network without any lateral
connections was implemented to tackle this problem. The number of hidden units is
chosen as 30, 40 and 50 for different topologies. The computational complexity is
reduced substantially compared to other global optimization methods, and the rate of
convergence is speeded up by the fast convergence characteristics of the least-squares
method. The learning CPU times taken by the LS-PEN-based algorithm are about
4.75, 2.67 and 1.29 h for 30, 40 and 50 hidden units, respectively. Meanwhile,
Figs. 5 and 6 summarize the best solutions of the classification contour
Fig. 5. The output classification contour images for the two-spiral problem created by the LS-PEN algorithm
with different topologies: (a) 2-30-1 (30 hidden units); (b) 2-50-1 (50 hidden units).
Fig. 6. The output classification contour images for the two-spiral problem created by the simulated annealing
algorithm with different topologies: (a) 2-30-1 (30 hidden units); (b) 2-50-1 (50 hidden units).
images obtained by the MLP networks trained by the proposed LS-PEN algorithm
and by simulated annealing (implemented as in [12]), respectively. The overall classi-
fication accuracy obtained by the LS-PEN-based algorithm is over 95%, whereas the
simulated annealing algorithm yields below 70% accuracy. The results
show that our proposed LS-PEN-based algorithm produces excellent classification
results and finds an optimal solution for very complicated problems.
5. Conclusion
The problems of local minima and slow convergence in conventional learning
algorithms for multilayer neural networks are addressed in this paper. A fast and
robust global learning algorithm, based upon least-squares and penalized optimiza-
tion methods (LS-PEN), is proposed to tackle these problems. The LS method is
employed to determine the weights connecting the output layer to the hidden layer;
as a result, the rate of convergence benefits from the fast-converging characteristic
of the least-squares method. On the other hand, to tackle convergence stalling, the
weights between the hidden layer and the input layer are evaluated by the well-known
penalized optimization method. This heuristic mechanism provides a force that drives
the trajectory out of local minima until the network converges again. The simulation
results strongly demonstrate that the LS-PEN global learning algorithm outperforms
the other algorithms and is well suited to complicated applications.
Acknowledgements
The authors would like to thank the anonymous reviewers for their useful com-
ments and suggestions.
Appendix A
Assume the weight change $\Delta w_{j}(n)$ is the weight difference between $w_{j}(n)$ and
$w_{j}(n-1)$, so that, from Eqs. (12b) and (13),

$$\Delta w_{j}(n) = -\left(\frac{\partial E_{N}(n)}{\partial w_{j}} + \frac{\partial E_{P}(n)}{\partial w_{j}}\right)
= -\left(\frac{\partial E_{N}(n)}{\partial w_{j}} - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right)\right). \quad \mathrm{(A.1)}$$

Therefore, from Eq. (15),

$$\Delta E_{M}(n) = E_{M}(n+1) - E_{M}(n)
= -\left\|\frac{\partial E_{N}(n)}{\partial w_{j}} - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right)\right\|^{2}, \quad \mathrm{(A.2)}$$

where $\|\cdot\|$ is the norm of a vector. By the convergence theorem of Eq. (15),
$\Delta E_{M}(n) \le 0$; assuming the learning rate is a positive constant, the penalty term
must dominate the original gradient for the trajectory to be pushed out of the local
minimum:

$$\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\| - \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|
\exp\!\left(-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}\right) \le 0. \quad \mathrm{(A.3)}$$

Rearranging (A.3) and taking logarithms,

$$-\frac{(w_{j}(n-1) - w_{j}^{*})^{T}(w_{j}(n-1) - w_{j}^{*})}{\lambda}
\ge \ln \lambda + \ln \frac{\|\partial E_{N}(n)/\partial w_{j}\|}{2\,\|w_{j}(n-1) - w_{j}^{*}\|}. \quad \mathrm{(A.4)}$$

Initially, we assume that $\lambda = \lambda_{0} > 0$ is a quite large number, so that as
$\lambda \to \infty$ we have $1/\lambda \to 0$ and the exponential in (A.3) tends to 1; (A.3)
then reduces to

$$\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\| \le \frac{2}{\lambda}\,\|w_{j}(n-1) - w_{j}^{*}\|,$$

therefore

$$\lambda \le 2\,\|w_{j}(n-1) - w_{j}^{*}\|\,\left\|\frac{\partial E_{N}(n)}{\partial w_{j}}\right\|^{-1}, \quad \mathrm{(A.5)}$$

from which Eq. (17) follows.
References
[1] E. Aarts, J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to
Combinatorial Optimization and Neural Computing, Wiley, New York, 1989.
[2] M.R. Azimi-Sadjadi, S. Citrin, Fast learning process of multilayer neural nets using recursive least
squares technique, Proceedings of the IEEE International Conference on Neural Networks, Washington,
DC, May 1989.
[3] M.R. Azimi-Sadjadi, R. Liou, Fast learning process of multilayer neural networks using recursive least
squares technique, IEEE Trans. Signal Process. 40 (2) (1992) 446–450.
[4] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's
method, Neural Comput. 4 (2) (1992) 141–166.
[5] R. Battiti, G. Tecchiolli, Training neural nets with the reactive tabu search, IEEE Trans. Neural
Networks 6 (5) (1995) 1185–1200.
[6] S.-Y. Cho, T.W.S. Chow, An efficient and stable recurrent neural networks training algorithm for time
series forecasting, Int. J. Knowledge-Based Intell. Eng. Systems 1 (3) (1997) 138–148.
[7] T.W.S. Chow, S.Y. Cho, An accelerated recurrent network training algorithm using IIR filter model
and recursive least squares method, IEEE Trans. Circuits Systems I: Fund. Theory Appl. 44 (11)
(1997) 1082–1086.
[8] N. Dodd, Optimization of network structure using genetic techniques, International Joint Conference
on Neural Networks, San Diego, 1990.
[9] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[10] S.E. Fahlman, C. Lebiere, The cascade-correlation learning architecture, Technical Report CMU-
CS-90-100, School of Computer Science, Carnegie Mellon University, 1990.
[11] C.A. Floudas, P.M. Pardalos, State of the Art in Global Optimization, Kluwer Academic Publishers,
Dordrecht, The Netherlands, 1996.
[12] W.L. Goffe, G.D. Ferrier, J. Rogers, Global optimization of statistical functions with simulated
annealing, J. Econometrics 60 (1994) 65–99.
[13] M. Gori, A. Tesi, Some examples of local minima during learning with back-propagation, Parallel
Architectures and Neural Networks, Vietri sul Mare (IT), May 1990.
[14] M. Gori, A. Tesi, On the problem of local minima in backpropagation, IEEE Trans. Pattern Anal.
Mach. Intell. 14 (1) (1992) 76–86.
[15] M.T. Hagan, M.B. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE
Trans. Neural Networks 5 (6) (1994) 989–993.
[16] R. Horst, P.M. Pardalos, Handbook of Global Optimization, Kluwer Academic Publishers, Dor-
drecht, The Netherlands, 1995.
[17] C.R. Houck, J.A. Joines, M.G. Kay, A genetic algorithm for function optimization: a Matlab
implementation, ACM Trans. Math. Software, submitted for publication.
[18] J. Hwang, S. Lay, M. Maechler, R.D. Martin, J. Schimert, Regression modeling in backpropagation
and projection pursuit learning, IEEE Trans. Neural Networks 5 (1994) 1–24.
[19] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks
1 (1988) 295–307.
[20] K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, Proceedings of the 1988 Connectionist
Models Summer School, Morgan Kaufmann, 1988.
[21] J.R. McDonnell, D. Waagen, Determining neural network connectivity using evolutionary program-
ming, Twenty-Sixth Asilomar Conference on Signals, Systems, and Computers, Monterey, CA, 1992.
[22] S.S. Rao, Optimization Theory and Applications, second ed., Wiley, New York, 1984.
[23] S. Renals, R. Rohwer, Phoneme classification experiments using radial basis functions, Proceedings of
the IEEE International Joint Conference on Neural Networks, Washington, DC, June 1989, pp. I:461–467.
[24] Y. Shang, B.W. Wah, Global optimization for neural network training, IEEE Comput. 29 (3) (1996)
45–54.
[25] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural
Networks 3 (1990) 561–573.
[26] A. Törn, A. Žilinskas, Global Optimization, Springer, Berlin, 1987.
[27] M. Towsey, D. Alpsan, L. Sztriha, Training a neural network with conjugate gradient methods,
Proceedings of ICNN'95, Perth, Western Australia, 1995, pp. 373–378.
[28] R.J. Williams, Training recurrent networks using the extended Kalman filter, International Joint
Conference on Neural Networks, Baltimore, 1992, vol. IV, pp. 241–246.
[29] Y.F. Yam, T.W.S. Chow, Extended backpropagation algorithm, Electron. Lett. 29 (19) (1993)
1701–1702.
[30] Y.F. Yam, T.W.S. Chow, Accelerated training algorithm for feedforward neural networks based on
linear least squares problems, Neural Process. Lett. 2 (4) (1995) 20–25.
[31] Y.F. Yam, T.W.S. Chow, Extended least squares based algorithm for training feedforward networks,
IEEE Trans. Neural Networks 8 (3) (1997) 806–810.
[32] Q. Zheng, D. Zhuang, Integral global optimization of constrained problems in functional spaces with
discontinuous penalty functions, in: C.A. Floudas, P.M. Pardalos (Eds.), Recent Advances in Global
Optimization, Princeton University Press, Princeton, NJ, 1992, pp. 298–320.
Siu-Yeung Cho received his B.Eng. (Hons) degree in electronic and computer
engineering from the University of Brighton, U.K., in 1994. He is now a Ph.D. student
in the Department of Electronic Engineering of the City University of Hong Kong.
His major fields of interest include neural network theory and its applications,
and computer vision.
Tommy W.S. Chow received the B.Sc. (first class honors) and Ph.D. degrees from the University of
Sunderland, U.K., in 1984 and 1988, respectively. He is currently an Associate Professor in the Department
of Electronic Engineering, City University of Hong Kong. His current research areas are neural networks,
system identification and fault diagnostic analysis.