tion neural networks. Experimental results on some benchmark regression applications also verify our conclusion: neural networks without the orthonormalization transformation may achieve faster training speed while attaining the same generalization performance.
2 SINGLE-HIDDEN-LAYER
FEEDFORWARD NETWORKS
Before presenting our main results, we introduce some notation for a standard SLFN. Consider $N$ arbitrary distinct samples $(x_i, t_i)$, $i = 1, \ldots, N$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$ is an input vector and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m$ is a target vector. A standard SLFN with $L$ hidden neurons and activation function $g(x)$ can be expressed as
$$\sum_{i=1}^{L} \beta_i \, g(a_i, b_i, x_j) = o_j, \qquad j = 1, \ldots, N,$$
where $o_j$ is the actual output of the SLFN. As mentioned in the introduction, $g(a_i, b_i, x_j)$ may be an additive model or an RBF model.
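To make the notation concrete, the following Python sketch (not from the paper; the sigmoid and Gaussian choices, as well as all names, are our own illustrative assumptions) evaluates the SLFN output $o_j$ for a single input, with either an additive node $g(a_i \cdot x + b_i)$ or an RBF node:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slfn_output(x, a, b, beta, node="additive"):
    """Evaluate o = sum_{i=1}^{L} beta_i * g(a_i, b_i, x) for one input x.

    x: (n,) input; a: (L, n) hidden weights or RBF centres;
    b: (L,) biases or RBF widths; beta: (L, m) output weights.
    """
    if node == "additive":
        h = sigmoid(a @ x + b)                          # g(a_i . x + b_i)
    else:
        h = np.exp(-b * np.sum((x - a) ** 2, axis=1))   # Gaussian RBF node
    return h @ beta                                     # (m,) output o_j
```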
Definition 2.1. A standard SLFN with $L$ hidden neurons is said to learn $N$ arbitrary distinct samples $(x_i, t_i)$, $i = 1, \ldots, N$, with zero error if there exist parameters $a_i$ and $b_i$, $i = 1, \ldots, L$, such that
$$\sum_{i=1}^{N} \| o_i - t_i \| = 0.$$
According to Definition 2.1, our ideal objective is to find proper parameters $a_i$ and $b_i$ such that
$$\sum_{i=1}^{L} \beta_i \, g(a_i, b_i, x_j) = t_j, \qquad j = 1, \ldots, N.$$
The above $N$ equations can be expressed compactly as
$$H\beta = T \qquad (1)$$
where $\beta = [\beta_1, \ldots, \beta_L]^T$, $T = [t_1, \ldots, t_N]^T$, and the matrix $H$ is called the hidden layer matrix of the SLFN.
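As a hedged sketch (again with illustrative names, assuming sigmoid additive nodes), the hidden layer matrix $H$ of equation (1) is the $N \times L$ matrix with entries $g(a_i, b_i, x_j)$ and can be assembled as follows:

```python
import numpy as np

def hidden_layer_matrix(X, a, b):
    """X: (N, n) inputs; a: (L, n) random input weights; b: (L,) random biases.
    Returns H with H[j, i] = g(a_i . x_j + b_i) for a sigmoid g (an assumption)."""
    return 1.0 / (1.0 + np.exp(-(X @ a.T + b)))   # shape (N, L)
```

With $H$ assembled, equation (1) is simply `H @ beta = T` for the $N \times m$ target matrix $T$ and the $L \times m$ output weight matrix $\beta$.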
Kaminski and Strumillo (Kaminski and Strumillo, 1997) showed that, by randomly choosing the centers of kernel neurons, the column vectors of the matrix $H$ are linearly independent. In order to extend this result to additive neurons, we need the following lemma:
Lemma 2.1 (p. 491 of (Huang et al., 2006)). Given a standard SLFN with $N$ hidden nodes and an activation function $g : \mathbb{R} \to \mathbb{R}$ that is infinitely differentiable in any interval, for $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, and for any $a_i$ and $b_i$ chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, the hidden layer output matrix $H$ of the SLFN is, with probability one, invertible and $\|H\beta - T\| = 0$.
Lemma 2.1 shows that when the number of neurons $L$ equals the number of samples $N$, the corresponding hidden layer matrix $H$ is nonsingular, so the SLFN can express those samples with zero error; the value of $\beta$ can be calculated as $H^{-1}T$. In other words, the column vectors of $H$ are linearly independent of each other for any infinitely differentiable activation function and almost all random parameters, which is consistent with the conclusion of Kaminski and Strumillo (Kaminski and Strumillo, 1997)(p. 1179). However, we should note that Kaminski and Strumillo's conclusion applies only to kernel functions.
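The following numerical sketch (our own illustration with an arbitrary smooth target and sigmoid nodes, not the paper's experiments) shows the $L = N$ case of Lemma 2.1: with randomly drawn $a_i$ and $b_i$, the square matrix $H$ is, with probability one, invertible, and $\beta = H^{-1}T$ reproduces the samples with zero error up to rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 10, 5                                  # L = N hidden nodes
X = rng.uniform(-1.0, 1.0, (N, n))
T = np.sin(X.sum(axis=1, keepdims=True))      # arbitrary smooth targets

a = 3.0 * rng.normal(size=(N, n))             # random input weights
b = rng.normal(size=N)                        # random biases
H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))      # square (N, N) hidden layer matrix

beta = np.linalg.solve(H, T)                  # beta = H^{-1} T
print(np.linalg.norm(H @ beta - T))           # ~0 up to rounding error
```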
According to Lemma 2.1, for SLFNs with any infinitely differentiable additive node $g(x)$, the hidden neuron parameters $a_i$ and $b_i$ may be assigned random values such that the SLFN learns the training samples with zero error. In fact, a full-rank square $H$, i.e., $L = N$, is not necessary; in most cases the number of neurons $L$ is far less than $N$. In this case ($L < N$), the linear independence property is still ensured, and the SLFN can achieve a small nonzero training error $\varepsilon$ by using the Moore-Penrose generalized inverse of $H$, i.e., $\beta = H^{\dagger}T$, where $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of $H$.
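A minimal sketch of the $L < N$ case (illustrative data and names, not the paper's experiments): the output weights are obtained as $\beta = H^{\dagger}T$, here via NumPy's pseudoinverse, and the residual is the small nonzero training error $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, n = 200, 20, 3                          # far fewer nodes than samples
X = rng.uniform(-1.0, 1.0, (N, n))
T = np.sin(X.sum(axis=1, keepdims=True))

a = 3.0 * rng.normal(size=(L, n))
b = rng.normal(size=L)
H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))      # (N, L) with L < N

beta = np.linalg.pinv(H) @ T                  # beta = H^dagger T
print(np.linalg.norm(H @ beta - T))           # small nonzero training error epsilon
```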
3 NO NEED FOR
ORTHONORMALIZATION
In this section, we demonstrate the equivalence between neural networks without the orthonormal transformation and orthonormal neural networks.
First, we briefly introduce Gram-Schmidt orthonormalization. For simplicity, we denote $g_j(x) = g(a_j, b_j, x)$. Our aim is to find proper parameters such that
$$\beta_1 g_1(x_i) + \cdots + \beta_L g_L(x_i) = t_i, \qquad i = 1, \ldots, N \qquad (2)$$
where $t_i = f(x_i)$.
Multiplying equation (2) by $g_j(x_i)$ and adding the corresponding $N$ equations for $i = 1, \ldots, N$, we have
$$\beta_1 \sum_{i=1}^{N} g_1(x_i) g_j(x_i) + \cdots + \beta_L \sum_{i=1}^{N} g_L(x_i) g_j(x_i) = \sum_{i=1}^{N} t_i \, g_j(x_i), \qquad j = 1, \ldots, L \qquad (3)$$
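Collecting equation (3) over $j = 1, \ldots, L$ gives the normal equations $(H^T H)\beta = H^T T$. As a hedged sketch (our own illustration), solving them directly and solving after orthonormalizing the columns of $H$ (via QR, which produces the same orthonormal basis as Gram-Schmidt up to column signs) yield the same least-squares weights, consistent with the equivalence argued in this section:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, n = 200, 10, 3
X = rng.uniform(-1.0, 1.0, (N, n))
T = np.sin(X.sum(axis=1, keepdims=True))
a, b = 3.0 * rng.normal(size=(L, n)), rng.normal(size=L)
H = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))        # (N, L) hidden layer matrix

# Equation (3) collected over j = 1,...,L: the normal equations (H^T H) beta = H^T T
beta_normal = np.linalg.solve(H.T @ H, H.T @ T)

# Orthonormalize the columns of H (QR; Gram-Schmidt up to column signs) and solve
Q, R = np.linalg.qr(H)
beta_orth = np.linalg.solve(R, Q.T @ T)         # map back to the original basis

print(np.linalg.norm(beta_normal - beta_orth))  # ~0: the same least-squares solution
```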
Similarly to (Kaminski and Strumillo, 1997)(p. 1179), the inner product of two functions is defined